Data Science: Machine Learning
The textbook for the Data Science course series is freely available online.
Learning Objectives
• The basics of machine learning
• How to perform cross-validation to avoid overtraining
• Several popular machine learning algorithms
• How to build a recommendation system
• What regularization is and why it is useful
Course Overview
There are six major sections in this course: introduction to machine learning; machine learning basics;
linear regression for prediction, smoothing, and working with matrices; distance, kNN, cross-validation, and
generative models; classification with more than two classes and the caret package; and model fitting and
recommendation systems.
Introduction to Machine Learning
In this section, you’ll be introduced to some of the terminology and concepts you’ll need going forward.
Machine Learning Basics
In this section, you’ll learn how to start building a machine learning algorithm using training and test data
sets and the importance of conditional probabilities for machine learning.
Linear Regression for Prediction, Smoothing, and Working with Matrices
In this section, you’ll learn why linear regression is a useful baseline approach but is often insufficiently
flexible for more complex analyses, how to smooth noisy data, and how to use matrices for machine learning.
Distance, kNN, Cross-validation, and Generative Models
In this section, you’ll learn different types of discriminative and generative approaches for machine learning
algorithms.
Classification with More than Two Classes and the Caret Package
In this section, you’ll learn how to overcome the curse of dimensionality using methods that adapt to higher
dimensions and how to use the caret package to implement many different machine learning algorithms.
Model Fitting and Recommendation Systems
In this section, you’ll learn how to apply the machine learning algorithms you have learned.
Notation
There is a link to the relevant section of the textbook: Notation
Key points
• $X_1, \dots, X_p$ denote the features, $Y$ denotes the outcomes, and $\hat{Y}$ denotes the predictions.
• Machine learning prediction tasks can be divided into categorical and continuous outcomes. We
refer to these as classification and prediction, respectively.
An Example
There is a link to the relevant section of the textbook: An Example
Key points
⊠ A. True
□ B. False
2. True or False: In machine learning, we build algorithms that take feature values (X) and train a model
using known outcomes (Y) that is then used to predict outcomes when presented with features without
known outcomes.
⊠ A. True
□ B. False
Section 2 - Machine Learning Basics Overview
In the Machine Learning Basics section, you will learn the basics of machine learning.
After completing this section, you will be able to:
This section has two parts: basics of evaluating machine learning algorithms and conditional probabilities.
Training and Test Sets and Overall Accuracy
There is a link to the relevant sections of the textbook: Training and test sets and Overall accuracy
Key points
• Note: the set.seed() function is used to obtain reproducible results. If you have R 3.6 or later, please
use the sample.kind = "Rounding" argument whenever you set the seed for this course.
• To mimic the ultimate evaluation process, we randomly split our data into two — a training set and a
test set — and act as if we don’t know the outcome of the test set. We develop algorithms using only
the training set; the test set is used only for evaluation.
• The createDataPartition() function from the caret package can be used to generate indexes for
randomly splitting data.
• Note: contrary to what the documentation says, this course will use the argument p as the percentage
of data that goes to testing. The indexes made from createDataPartition() should be used to create
the test set. Indexes should be created on the outcome and not a predictor.
• The simplest evaluation metric for categorical outcomes is overall accuracy: the proportion of cases
that were correctly predicted in the test set.
Code
if(!require(tidyverse)) install.packages("tidyverse")
if(!require(caret)) install.packages("caret")
if(!require(dslabs)) install.packages("dslabs")
library(tidyverse)
library(caret)
library(dslabs)
data(heights)
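# assumed reconstruction of the omitted setup: define the outcome, split the data, and guess
y <- heights$sex
x <- heights$height
# set.seed(2) # if using R 3.5 or earlier
set.seed(2, sample.kind = "Rounding") # if using R 3.6 or later
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
train_set <- heights[-test_index, ]
test_set <- heights[test_index, ]
# guess the outcome with equal probability, ignoring height
y_hat <- sample(c("Male", "Female"), length(test_index), replace = TRUE) %>%
  factor(levels = levels(test_set$sex))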
# compute accuracy
mean(y_hat == test_set$sex)
## [1] 0.5238095
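The summary below reports the mean and standard deviation of height by sex; a plausible reconstruction of the omitted command is:
heights %>% group_by(sex) %>% summarize(mean(height), sd(height))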
## # A tibble: 2 x 3
## sex `mean(height)` `sd(height)`
## <fct> <dbl> <dbl>
## 1 Female 64.9 3.76
## 2 Male 69.3 3.61
y_hat <- ifelse(x > 62, "Male", "Female") %>% factor(levels = levels(test_set$sex))
mean(y == y_hat)
## [1] 0.7933333
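The accuracy curve plotted below comes from examining ten different cutoffs on the training set; a sketch of that search (assuming the train_set/test_set split above):
cutoff <- seq(61, 70)
accuracy <- map_dbl(cutoff, function(x){
  y_hat <- ifelse(train_set$height > x, "Male", "Female") %>%
    factor(levels = levels(test_set$sex))
  mean(y_hat == train_set$sex)
})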
[Figure: overall accuracy versus height cutoff]
max(accuracy)
## [1] 0.8361905
best_cutoff <- cutoff[which.max(accuracy)]
best_cutoff
## [1] 64
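The final output below is the accuracy of this cutoff on the test set; an assumed reconstruction of the omitted command:
y_hat <- ifelse(test_set$height > best_cutoff, "Male", "Female") %>%
  factor(levels = levels(test_set$sex))
mean(y_hat == test_set$sex)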
## [1] 0.8171429
1. For each of the following, indicate whether the outcome is continuous or categorical.
2. How many features are available to us for prediction in the mnist digits dataset?
You can download the mnist dataset using the read_mnist() function from the dslabs package.
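A sketch of the answer, assuming the dslabs copy of MNIST:
mnist <- read_mnist()
ncol(mnist$train$images)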
## [1] 784
Confusion matrix
Code
# tabulate each combination of prediction and actual value
table(predicted = y_hat, actual = test_set$sex)
## actual
## predicted Female Male
## Female 50 27
## Male 69 379
test_set %>%
mutate(y_hat = y_hat) %>%
group_by(sex) %>%
summarize(accuracy = mean(y_hat == sex))
## # A tibble: 2 x 2
## sex accuracy
## <fct> <dbl>
## 1 Female 0.420
## 2 Male 0.933
Balanced accuracy and F1 score
There is a link to the relevant section of the textbook: Balanced accuracy and F1 Score
Key points
• For optimization purposes, sometimes it is more useful to have a one-number summary than to study
both specificity and sensitivity. One preferred metric is balanced accuracy. Because specificity and
sensitivity are rates, it is more appropriate to compute their harmonic average. In fact, the F1-score,
a widely used one-number summary, is the harmonic average of precision and recall.
• Depending on the context, some types of errors are more costly than others. The F1-score can be
adapted to weigh specificity and sensitivity differently.
• You can compute the F1-score using the F_meas() function in the caret package.
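For reference, the one-number summary described above can be written as
$$F_1 = \frac{1}{\frac{1}{2}\left(\frac{1}{\text{recall}} + \frac{1}{\text{precision}}\right)} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$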
Code
# maximize F-score
cutoff <- seq(61, 70)
F_1 <- map_dbl(cutoff, function(x){
y_hat <- ifelse(train_set$height > x, "Male", "Female") %>%
factor(levels = levels(test_set$sex))
F_meas(data = y_hat, reference = factor(train_set$sex))
})
[Figure: F_1 score versus height cutoff]
max(F_1)
## [1] 0.6142322
# assumed reconstruction of the omitted commands that produced the outputs below
best_cutoff <- cutoff[which.max(F_1)]
best_cutoff
## [1] 66
y_hat <- ifelse(test_set$height > best_cutoff, "Male", "Female") %>%
  factor(levels = levels(test_set$sex))
sensitivity(data = y_hat, reference = test_set$sex)
## [1] 0.6806723
specificity(data = y_hat, reference = test_set$sex)
## [1] 0.8349754
• A machine learning algorithm with very high sensitivity and specificity may not be useful in practice
when prevalence is close to either 0 or 1. For example, if you develop an algorithm for disease diagnosis
with very high sensitivity, but the prevalence of the disease is pretty low, then the precision of your
algorithm is probably very low based on Bayes’ theorem.
• A very common approach to evaluating accuracy and F1-score is to compare them graphically by
plotting both. A widely used plot that does this is the receiver operating characteristic (ROC)
curve. The ROC curve plots sensitivity (TPR) versus 1 - specificity or the false positive rate (FPR).
• However, ROC curves have one weakness: neither of the measures plotted depends on prevalence. In
cases in which prevalence matters, we may instead make a precision-recall plot, which follows a
similar idea to the ROC curve.
Code
Note: your results and plots may be slightly different.
p <- 0.9
n <- length(test_index)
y_hat <- sample(c("Male", "Female"), n, replace = TRUE, prob=c(p, 1-p)) %>%
factor(levels = levels(test_set$sex))
mean(y_hat == test_set$sex)
## [1] 0.7180952
# ROC curve
probs <- seq(0, 1, length.out = 10)
guessing <- map_df(probs, function(p){
y_hat <-
sample(c("Male", "Female"), n, replace = TRUE, prob=c(p, 1-p)) %>%
factor(levels = c("Female", "Male"))
list(method = "Guessing",
FPR = 1 - specificity(y_hat, test_set$sex),
TPR = sensitivity(y_hat, test_set$sex))
})
guessing %>% qplot(FPR, TPR, data =., xlab = "1 - Specificity", ylab = "Sensitivity")
[Figure: ROC curve (sensitivity versus 1 - specificity) for the guessing approach]
# assumed reconstruction of the omitted code that computes the height-cutoff ROC curve
cutoffs <- c(50, seq(60, 75), 80)
height_cutoff <- map_df(cutoffs, function(x){
y_hat <- ifelse(test_set$height > x, "Male", "Female") %>%
factor(levels = c("Female", "Male"))
list(method = "Height cutoff",
FPR = 1 - specificity(y_hat, test_set$sex),
TPR = sensitivity(y_hat, test_set$sex))
})
# plot both curves together
bind_rows(guessing, height_cutoff) %>%
ggplot(aes(FPR, TPR, color = method)) +
geom_line() +
geom_point() +
xlab("1 - Specificity") +
ylab("Sensitivity")
[Figure: ROC curves for the Guessing and Height cutoff methods]
if(!require(ggrepel)) install.packages("ggrepel")
library(ggrepel)
map_df(cutoffs, function(x){
y_hat <- ifelse(test_set$height > x, "Male", "Female") %>%
factor(levels = c("Female", "Male"))
list(method = "Height cutoff",
cutoff = x,
FPR = 1-specificity(y_hat, test_set$sex),
TPR = sensitivity(y_hat, test_set$sex))
}) %>%
ggplot(aes(FPR, TPR, label = cutoff)) +
geom_line() +
geom_point() +
geom_text_repel(nudge_x = 0.01, nudge_y = -0.01)
[Figure: ROC curve for the height cutoff method, with the cutoff values (50, 60-75, 80) labeled]
[Figure: precision versus recall for the Guess and Height cutoff methods]
[Figure: precision versus recall for the Guess and Height cutoff methods, with the positive class switched]
The following questions all ask you to work with the dataset described below.
The reported_heights and heights datasets were collected from three classes taught in the Departments
of Computer Science and Biostatistics, as well as remotely through the Extension School. The Biostatistics
class was taught in 2016 along with an online version offered by the Extension School. On 2016-01-25 at
8:15 AM, during one of the lectures, the instructors asked students to fill in the sex and height questionnaire
that populated the reported_heights dataset. The online students filled out the survey during the next
few days, after the lecture was posted online. We can use this insight to define a variable which we will call
type, to denote the type of student, inclass or online.
The code below sets up the dataset for you to analyze in the following exercises:
if(!require(dplyr)) install.packages("dplyr")
if(!require(lubridate)) install.packages("lubridate")
library(dplyr)
library(lubridate)
data(reported_heights)
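The construction of dat itself is missing above; a plausible reconstruction (matching the exercise as given in the course) is:
dat <- mutate(reported_heights, date_time = ymd_hms(time_stamp)) %>%
  filter(date_time >= make_date(2016, 01, 25) & date_time < make_date(2016, 02, 1)) %>%
  mutate(type = ifelse(day(date_time) == 25 & hour(date_time) == 8 & between(minute(date_time), 15, 30),
                       "inclass", "online")) %>%
  select(sex, type)
y <- factor(dat$sex, c("Female", "Male"))
x <- dat$type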
1. The type column of dat indicates whether students took classes in person (“inclass”) or online (“online”).
What proportion of the inclass group is female? What proportion of the online group is female?
Enter your answer as a percentage or decimal (eg “50%” or “0.50”) to at least the hundredths place.
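An assumed reconstruction of the command producing the summary below:
dat %>% group_by(type) %>% summarize(prop_female = mean(sex == "Female"))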
## # A tibble: 2 x 2
## type prop_female
## <chr> <dbl>
## 1 inclass 0.667
## 2 online 0.378
2. In the course videos, height cutoffs were used to predict sex. Instead of height, use the type variable
to predict sex. Assume that for each class type the students are either all male or all female, based on
the most prevalent sex in each class type you calculated in Q1. Report the accuracy of your prediction
of sex based on type. You do not need to split the data into training and test sets.
Enter your accuracy as a percentage or decimal (eg “50%” or “0.50”) to at least the hundredths place.
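An assumed reconstruction of the prediction and its accuracy:
y_hat <- ifelse(x == "inclass", "Female", "Male") %>% factor(levels = levels(y))
mean(y_hat == y)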
## [1] 0.6333333
3. Write a line of code using the table() function to show the confusion matrix between y_hat and y.
Use the exact format function(a, b) for your answer and do not name the columns and rows. Your
answer should have exactly one space.
table(y_hat, y)
## y
## y_hat Female Male
## Female 26 13
## Male 42 69
4. What is the sensitivity of this prediction? You can use the sensitivity() function from the caret
package. Enter your answer as a percentage or decimal (eg “50%” or “0.50”) to at least the hundredths
place.
sensitivity(y_hat, y)
## [1] 0.3823529
5. What is the specificity of this prediction? You can use the specificity() function from the caret
package. Enter your answer as a percentage or decimal (eg “50%” or “0.50”) to at least the hundredths
place.
specificity(y_hat, y)
## [1] 0.8414634
6. What is the prevalence (% of females) in the dat dataset defined above? Enter your answer as a
percentage or decimal (eg “50%” or “0.50”) to at least the hundredths place.
mean(y == "Female")
## [1] 0.4533333
data(iris)
iris <- iris[-which(iris$Species=='setosa'),]
y <- iris$Species
7. First let us create an even split of the data into train and test partitions using createDataPartition()
from the caret package. The code with a missing line is given below:
# set.seed(2) # if using R 3.5 or earlier
set.seed(2, sample.kind="Rounding") # if using R 3.6 or later
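For reference, the completed partition used in the questions that follow is assumed to be:
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
test <- iris[test_index, ]
train <- iris[-test_index, ]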
8. Next we will figure out the singular feature in the dataset that yields the greatest overall accuracy when
predicting species. You can use the code from the introduction and from Q7 to start your analysis.
Using only the train iris dataset, for each feature, perform a simple search to find the cutoff that produces
the highest accuracy, predicting virginica if greater than the cutoff and versicolor otherwise. Use the seq
function over the range of each feature by intervals of 0.1 for this search.
Which feature produces the highest accuracy?
□ A. Sepal.Length
□ B. Sepal.Width
⊠ C. Petal.Length
□ D. Petal.Width
9. For the feature selected in Q8, use the smart cutoff value from the training data to calculate overall
accuracy in the test data. What is the overall accuracy?
## [1] 0.9
10. Notice that we had an overall accuracy greater than 96% in the training data, but the overall accuracy
was lower in the test data. This can happen often if we overtrain. In fact, it could be the case that a
single feature is not the best choice. For example, a combination of features might be optimal. Using
a single feature and optimizing the cutoff as we did on our training data can lead to overfitting.
Given that we know the test data, we can treat it like we did our training data to see if the same feature
with a different cutoff will optimize our predictions.
Which feature best optimizes our overall accuracy?
□ A. Sepal.Length
□ B. Sepal.Width
□ C. Petal.Length
⊠ D. Petal.Width
11. Now we will perform some exploratory data analysis on the data.
Notice that Petal.Length and Petal.Width in combination could potentially be more informative than
either feature alone.
Optimize the cutoffs for Petal.Length and Petal.Width separately in the train dataset by using the
seq function with increments of 0.1. Then, report the overall accuracy when applied to the test dataset by
creating a rule that predicts virginica if Petal.Length is greater than the length cutoff OR Petal.Width is
greater than the width cutoff, and versicolor otherwise.
What is the overall accuracy for the test data now?
data(iris)
iris <- iris[-which(iris$Species=='setosa'),]
y <- iris$Species
plot(iris,pch=21,bg=iris$Species)
[Figure: pairs plot of the iris features, colored by Species]
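The length_cutoff and width_cutoff values used below come from a cutoff search omitted above; a sketch of that search (assuming the train/test split from Q7) is:
length_seq <- seq(min(train$Petal.Length), max(train$Petal.Length), by = 0.1)
width_seq  <- seq(min(train$Petal.Width),  max(train$Petal.Width),  by = 0.1)
length_acc <- sapply(length_seq, function(cut){
  mean(ifelse(train$Petal.Length > cut, "virginica", "versicolor") == train$Species)
})
width_acc <- sapply(width_seq, function(cut){
  mean(ifelse(train$Petal.Width > cut, "virginica", "versicolor") == train$Species)
})
length_cutoff <- length_seq[which.max(length_acc)]
width_cutoff  <- width_seq[which.max(width_acc)]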
y_hat <- ifelse(test$Petal.Length > length_cutoff | test$Petal.Width > width_cutoff, 'virginica', 'versicolor')
mean(y_hat==test$Species)
## [1] 0.88
Conditional probabilities
• In machine learning, this is referred to as Bayes’ Rule. This is a theoretical rule because in practice
we don’t know $p(x)$. Having a good estimate of $p(x)$ will suffice for us to build optimal prediction
models, since we can control the balance between specificity and sensitivity however we wish. In fact,
estimating these conditional probabilities can be thought of as the main challenge of machine learning.
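Written out for a binary outcome, with $p(\mathbf{x}) = \Pr(Y = 1 \mid \mathbf{X} = \mathbf{x})$, the rule referred to above is:
$$p(\mathbf{x}) = \frac{f_{\mathbf{X}\mid Y=1}(\mathbf{x})\,\Pr(Y=1)}{f_{\mathbf{X}\mid Y=0}(\mathbf{x})\,\Pr(Y=0) + f_{\mathbf{X}\mid Y=1}(\mathbf{x})\,\Pr(Y=1)}$$
where $f_{\mathbf{X}\mid Y=y}$ denotes the distribution of the conditional random variable $\mathbf{X} \mid Y = y$.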
There is a link to the relevant sections of the textbook: Conditional expectations and Loss functions
Key points
• For continuous outcomes, we define a loss function to evaluate the model. The most commonly used
one is MSE (Mean Squared Error). The reason why we care about the conditional expectation in
machine learning is that the expected value minimizes the MSE:
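The relationship referred to above is that, for each $\mathbf{x}$, the conditional expectation minimizes the expected squared loss:
$$\hat{Y} = \operatorname{E}(Y \mid \mathbf{X} = \mathbf{x}) \quad \text{minimizes} \quad \operatorname{E}\{(\hat{Y} - Y)^2 \mid \mathbf{X} = \mathbf{x}\}$$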
1. In a previous module, we covered Bayes’ theorem and the Bayesian paradigm. Conditional probabilities
are a fundamental part of this previously covered rule.
$$P(A \mid B) = P(B \mid A)\,\frac{P(A)}{P(B)}$$
• The test is positive 85% of the time when tested on a patient with the disease (high sensitivity): $P(\text{test}+ \mid \text{disease}) = 0.85$
• The test is negative 90% of the time when tested on a healthy patient (high specificity): $P(\text{test}- \mid \text{healthy}) = 0.90$
• The disease is prevalent in about 2% of the community: $P(\text{disease}) = 0.02$
Using Bayes’ theorem, calculate the probability that you have the disease if the test is positive.
$$P(\text{disease} \mid \text{test}+) = P(\text{test}+ \mid \text{disease}) \times \frac{P(\text{disease})}{P(\text{test}+)} = \frac{P(\text{test}+ \mid \text{disease})\,P(\text{disease})}{P(\text{test}+ \mid \text{disease})\,P(\text{disease}) + P(\text{test}+ \mid \text{healthy})\,P(\text{healthy})} = \frac{0.85 \times 0.02}{0.85 \times 0.02 + 0.1 \times 0.98} = 0.1478261$$
The following 4 questions (Q2-Q5) all relate to implementing this calculation using R.
We have a hypothetical population of 1 million individuals with the following conditional probabilities as
described below:
• The test is positive 85% of the time when tested on a patient with the disease (high sensitivity): $P(\text{test}+ \mid \text{disease}) = 0.85$
• The test is negative 90% of the time when tested on a healthy patient (high specificity): $P(\text{test}- \mid \text{healthy}) = 0.90$
• The disease is prevalent in about 2% of the community: $P(\text{disease}) = 0.02$
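An assumed reconstruction of the simulation code provided for these questions:
set.seed(1) # set.seed(1, sample.kind = "Rounding") if using R 3.6 or later
disease <- sample(c(0, 1), size = 1e6, replace = TRUE, prob = c(0.98, 0.02))
test <- rep(NA, 1e6)
test[disease == 0] <- sample(c(0, 1), size = sum(disease == 0), replace = TRUE, prob = c(0.90, 0.10))
test[disease == 1] <- sample(c(0, 1), size = sum(disease == 1), replace = TRUE, prob = c(0.15, 0.85))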
2. What is the probability that a test is positive?
mean(test)
## [1] 0.114509
3. What is the probability that an individual has the disease if the test is negative?
mean(disease[test==0])
## [1] 0.003461356
4. What is the probability that you have the disease if the test is positive? Remember: calculate the
conditional probability of having the disease given a positive test.
mean(disease[test==1]==1)
## [1] 0.1471762
5. Compare the prevalence of disease in people who test positive to the overall prevalence of disease.
If a patient’s test is positive, by how many times does that increase their risk of having the disease? First
calculate the probability of having the disease given a positive test, then divide by the probability of having
the disease.
mean(disease[test==1]==1)/mean(disease==1)
## [1] 7.389106
6. We are now going to write code to compute conditional probabilities for being male in the heights
dataset. Round the heights to the closest inch. Plot the estimated conditional probability
$P(x) = \Pr(\text{Male} \mid \text{height} = x)$.
data("heights")
# MISSING CODE
qplot(height, p, data =.)
Which of the following blocks of code can be used to replace # MISSING CODE to make the correct
plot?
□ A.
heights %>%
group_by(height) %>%
summarize(p = mean(sex == "Male")) %>%
□ B.
heights %>%
mutate(height = round(height)) %>%
group_by(height) %>%
summarize(p = mean(sex == "Female")) %>%
□ C.
heights %>%
mutate(height = round(height)) %>%
summarize(p = mean(sex == "Male")) %>%
⊠ D.
heights %>%
mutate(height = round(height)) %>%
group_by(height) %>%
summarize(p = mean(sex == "Male")) %>%
data("heights")
heights %>%
mutate(height = round(height)) %>%
group_by(height) %>%
summarize(p = mean(sex == "Male")) %>%
qplot(height, p, data =.)
[Figure: estimated conditional probability of Male versus rounded height]
7. In the plot we just made in Q6 we see high variability for low values of height. This is because we
have few data points. This time use the quantiles 0.1, 0.2, …, 0.9 and the cut() function to ensure each
group has the same number of points. Note that for any numeric vector x, you can create groups based
on quantiles like this: cut(x, quantile(x, seq(0, 1, 0.1)), include.lowest = TRUE).
# assumed reconstruction of the question stem (head of this code block)
ps <- seq(0, 1, 0.1)
heights %>%
# MISSING CODE
group_by(g) %>%
summarize(p = mean(sex == "Male"), height = mean(height)) %>%
qplot(height, p, data = .)
Which of the following lines of code can be used to replace # MISSING CODE to make the correct plot?
□ A.
⊠ B.
□ C.
□ D.
[Figure: conditional probability estimates computed on quantile-based groups of height]
8. You can generate data from a bivariate normal distribution using the MASS package using the
following code:
if(!require(MASS)) install.packages("MASS")
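# assumed reconstruction of the omitted data-generating code for this exercise
Sigma <- 9 * matrix(c(1.0, 0.5, 0.5, 1.0), 2, 2)
dat <- MASS::mvrnorm(n = 10000, c(69, 69), Sigma) %>%
  data.frame() %>% setNames(c("x", "y"))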
plot(dat)
[Figure: scatterplot of the bivariate normal data, y versus x]
Using an approach similar to that used in the previous exercise, let’s estimate the conditional expectations
and make a plot. Part of the code has again been provided for you:
Which of the following blocks of code can be used to replace # MISSING CODE to make the correct
plot?
⊠ A.
□ B.
□ C.
□ D.
mutate(g = cut(x, quantile(x, ps), include.lowest = TRUE)) %>%
group_by(g) %>%
summarize(y = mean(y), x = mean(x)) %>%
[Figure: estimated conditional expectation of y within quantile-based groups of x]
Section 3 - Linear Regression for Prediction, Smoothing, and Working with Matrices Overview
After completing this section, you will be able to:
• Use logistic regression for categorical data.
• Detect trends in noisy data using smoothing (also known as curve fitting or low pass filtering).
• Convert predictors to matrices and outcomes to vectors when all predictors are numeric (or can be
converted to numerics in a meaningful way).
• Perform basic matrix algebra calculations.
This section has three parts: linear regression for prediction, smoothing, and working with matrices.
There is a link to the relevant section of the textbook: Linear regression for prediction
Key points
• Linear regression can be considered a machine learning algorithm. Although it can be too rigid to be
useful, it works rather well for some challenges. It also serves as a baseline approach: if you can’t beat
it with a more complex approach, you probably want to stick to linear regression.
Code
Note: the seed was not set before createDataPartition so your results may be different.
if(!require(HistData)) install.packages("HistData")
library(HistData)
y <- galton_heights$son
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
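# galton_heights is assumed to be the father/son data built from HistData::GaltonFamilies,
# e.g. (textbook construction):
# galton_heights <- GaltonFamilies %>% filter(gender == "male") %>%
#   group_by(family) %>% sample_n(1) %>% ungroup() %>%
#   select(father, childHeight) %>% rename(son = childHeight)
train_set <- galton_heights %>% slice(-test_index)
test_set <- galton_heights %>% slice(test_index)
# baseline: predict the average son height for every observation
avg <- mean(train_set$son)
avg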
## [1] 70.50114
mean((avg - test_set$son)^2)
## [1] 6.034931
# assumed reconstruction of the omitted commands: fit the regression and compute the test-set MSE
fit <- lm(son ~ father, data = train_set)
fit$coef
## (Intercept) father
## 34.8934373 0.5170499
y_hat <- fit$coef[1] + fit$coef[2]*test_set$father
mean((y_hat - test_set$son)^2)
## [1] 4.632629
Predict Function
• The predict() function takes a fitted object from functions such as lm() or glm() and a data frame
with the new predictors for which to predict. We can use predict like this:
• predict() is a generic function in R that calls other functions depending on what kind of object it
receives. To learn about the specifics, you can read the help files using code like this:
?predict.lm # or ?predict.glm
Code
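An assumed reconstruction of the omitted call that produced the output below:
y_hat <- predict(fit, test_set)
mean((y_hat - test_set$son)^2)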
## [1] 4.632629
n <- 100
Sigma <- 9*matrix(c(1.0, 0.5, 0.5, 1.0), 2, 2)
dat <- MASS::mvrnorm(n = 100, c(69, 69), Sigma) %>%
data.frame() %>% setNames(c("x", "y"))
We will build 100 linear models using the data above and calculate the mean and standard deviation of the
combined models. First, set the seed to 1 again (make sure to use sample.kind="Rounding" if your R is
version 3.6 or later). Then, within a replicate() loop, (1) partition the dataset into test and training sets
with p = 0.5 and using dat$y to generate your indices, (2) train a linear model predicting y from x, (3)
generate predictions on the test set, and (4) calculate the RMSE of that model. Then, report the mean and
standard deviation (SD) of the RMSEs from all 100 models.
Report all answers to at least 3 significant digits.
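A sketch of the replicate() loop described above (assuming the dat object defined earlier):
set.seed(1, sample.kind = "Rounding") # if using R 3.6 or later
rmse <- replicate(100, {
  test_index <- createDataPartition(dat$y, times = 1, p = 0.5, list = FALSE)
  train_set <- dat %>% slice(-test_index)
  test_set <- dat %>% slice(test_index)
  fit <- lm(y ~ x, data = train_set)
  y_hat <- predict(fit, newdata = test_set)
  sqrt(mean((y_hat - test_set$y)^2))
})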
mean(rmse)
## [1] 2.488661
sd(rmse)
## [1] 0.1243952
2. Now we will repeat the exercise above but using larger datasets. Write a function that takes a size n,
then (1) builds a dataset using the code provided at the top of Q1 but with n observations instead of
100 and without the set.seed(1), (2) runs the replicate() loop that you wrote to answer Q1, which
builds 100 linear models and returns a vector of RMSEs, and (3) calculates the mean and standard
deviation of the 100 RMSEs.
Set the seed to 1 (if using R 3.6 or later, use the argument sample.kind="Rounding") and then use sapply()
or map() to apply your new function to n <- c(100, 500, 1000, 5000, 10000).
Hint: You only need to set the seed once before running your function; do not set a seed within your function.
Also be sure to use sapply() or map() as you will get different answers running the simulations individually
due to setting the seed.
n <- c(100, 500, 1000, 5000, 10000)
res <- sapply(n, function(n){
Sigma <- 9*matrix(c(1.0, 0.5, 0.5, 1.0), 2, 2)
dat <- MASS::mvrnorm(n, c(69, 69), Sigma) %>%
data.frame() %>% setNames(c("x", "y"))
rmse <- replicate(100, {
test_index <- createDataPartition(dat$y, times = 1, p = 0.5, list = FALSE)
train_set <- dat %>% slice(-test_index)
test_set <- dat %>% slice(test_index)
fit <- lm(y ~ x, data = train_set)
y_hat <- predict(fit, newdata = test_set)
sqrt(mean((y_hat-test_set$y)^2))
})
c(avg = mean(rmse), sd = sd(rmse))
})
res
3. What happens to the RMSE as the size of the dataset becomes larger?
⊠ A. On average, the RMSE does not change much as n gets larger, but the variability of the RMSE
decreases.
□ B. Because of the law of large numbers the RMSE decreases; more data means more precise estimates.
□ C. n = 10000 is not sufficiently large. To see a decrease in the RMSE we would need to make it larger.
□ D. The RMSE is not a random variable.
4. Now repeat the exercise from Q1, this time making the correlation between x and y larger, as in the
following code:
n <- 100
Sigma <- 9*matrix(c(1.0, 0.95, 0.95, 1.0), 2, 2)
dat <- MASS::mvrnorm(n = 100, c(69, 69), Sigma) %>%
data.frame() %>% setNames(c("x", "y"))
rmse <- replicate(100, {
test_index <- createDataPartition(dat$y, times = 1, p = 0.5, list = FALSE)
train_set <- dat %>% slice(-test_index)
test_set <- dat %>% slice(test_index)
fit <- lm(y ~ x, data = train_set)
y_hat <- predict(fit, newdata = test_set)
sqrt(mean((y_hat-test_set$y)^2))
})
mean(rmse)
## [1] 0.9099808
sd(rmse)
## [1] 0.06244347
5. Which of the following best explains why the RMSE in question 4 is so much lower than the RMSE in
question 1?
Sigma <- matrix(c(1.0, 0.75, 0.75, 0.75, 1.0, 0.25, 0.75, 0.25, 1.0), 3, 3)
dat <- MASS::mvrnorm(n = 100, c(0, 0, 0), Sigma) %>%
data.frame() %>% setNames(c("y", "x_1", "x_2"))
Note that y is correlated with both x_1 and x_2 but the two predictors are independent of each other, as
seen by cor(dat).
Set the seed to 1, then use the caret package to partition into test and training sets with p = 0.5. Compare
the RMSE when using just x_1, just x_2 and both x_1 and x_2. Train a single linear model for each (not
100 like in the previous questions).
Which of the three models performs the best (has the lowest RMSE)?
test_index <- createDataPartition(dat$y, times = 1, p = 0.5, list = FALSE)
train_set <- dat %>% slice(-test_index)
test_set <- dat %>% slice(test_index)
## [1] 0.600666
## [1] 0.630699
## [1] 0.3070962
□ A. x_1
□ B. x_2
⊠ C. x_1 and x_2
## [1] 0.3070962
8. Repeat the exercise from Q6 but now create an example in which x_1 and x_2 are highly correlated.
Sigma <- matrix(c(1.0, 0.75, 0.75, 0.75, 1.0, 0.95, 0.75, 0.95, 1.0), 3, 3)
dat <- MASS::mvrnorm(n = 100, c(0, 0, 0), Sigma) %>%
data.frame() %>% setNames(c("y", "x_1", "x_2"))
Set the seed to 1, then use the caret package to partition into a test and training set of equal size. Compare
the RMSE when using just x_1, just x_2, and both x_1 and x_2.
Compare the results from Q6 and Q8. What can you conclude?
# set.seed(1) # if using R 3.5 or earlier
set.seed(1, sample.kind="Rounding") # if using R 3.6 or later
## [1] 0.6592608
## [1] 0.640081
## [1] 0.6597865
There is a link to the relevant section of the textbook: Regression for a categorical outcome
Key points
• The regression approach can be extended to categorical data. For example, we can try regression to
estimate the conditional probability (written out after this list):
• Once we have estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, we can obtain an actual prediction $\hat{p}(x)$. Then we can define a
specific decision rule to form a prediction.
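The model referred to above, for the heights example with $Y = 1$ for Female and $Y = 0$ for Male, is:
$$p(x) = \Pr(Y = 1 \mid X = x) = \beta_0 + \beta_1 x$$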
Code
data("heights")
y <- heights$height
train_set %>%
filter(round(height)==66) %>%
summarize(y_hat = mean(sex=="Female"))
## y_hat
## 1 0.2424242
heights %>%
mutate(x = round(height)) %>%
group_by(x) %>%
filter(n() >= 10) %>%
summarize(prop = mean(sex == "Female")) %>%
ggplot(aes(x, prop)) +
geom_point()
[Figure: proportion of females for each rounded height value]
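An assumed reconstruction of the omitted fit and decision rule that produce the accuracy below:
lm_fit <- mutate(train_set, y = as.numeric(sex == "Female")) %>%
  lm(y ~ height, data = .)
p_hat <- predict(lm_fit, test_set)
y_hat <- ifelse(p_hat > 0.5, "Female", "Male") %>% factor()
confusionMatrix(y_hat, test_set$sex)$overall["Accuracy"]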
## Accuracy
## 0.7851711
Logistic Regression
There is a link to the relevant section of the textbook: Logistic regression
Key points
• Logistic regression is an extension of linear regression that assures that the estimate of the conditional
probability $\Pr(Y = 1 \mid X = x)$ is between 0 and 1. This approach makes use of the logistic transformation:
$$g(p) = \log \frac{p}{1-p}$$
• Note that with this model, we can no longer use least squares. Instead we compute the maximum
likelihood estimate (MLE).
• In R, we can fit the logistic regression model with the function glm() (generalized linear models). If
we want to compute the conditional probabilities, we want type="response" since the default is to
return the logistic transformed values.
Code
heights %>%
mutate(x = round(height)) %>%
group_by(x) %>%
filter(n() >= 10) %>%
summarize(prop = mean(sex == "Female")) %>%
ggplot(aes(x, prop)) +
geom_point() +
geom_abline(intercept = lm_fit$coef[1], slope = lm_fit$coef[2])
[Figure: proportion of females versus height with the regression line from lm_fit]
range(p_hat)
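An assumed reconstruction of the omitted logistic regression fit used below:
glm_fit <- train_set %>%
  mutate(y = as.numeric(sex == "Female")) %>%
  glm(y ~ height, data = ., family = "binomial")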
p_hat_logit <- predict(glm_fit, newdata = test_set, type = "response")
[Figure: logistic regression curve fit to the proportion of females versus height]
y_hat_logit <- ifelse(p_hat_logit > 0.5, "Female", "Male") %>% factor
confusionMatrix(y_hat_logit, test_set$sex)$overall[["Accuracy"]]
## [1] 0.7984791
Case Study: 2 or 7
There is a link to the relevant section of the textbook: Case study: 2 or 7
Key points
• In this case study we apply logistic regression to classify whether a digit is two or seven. We are
interested in estimating a conditional probability that depends on two variables:
• Through this case, we know that logistic regression forces our estimates to be a plane and our boundary
to be a line. This implies that a logistic regression approach has no chance of capturing the non-linear
nature of the true $p(x_1, x_2)$. Therefore, we need other, more flexible methods that permit other shapes.
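The conditional probability referred to above is
$$p(x_1, x_2) = \Pr(Y = 1 \mid X_1 = x_1, X_2 = x_2)$$
with $Y = 1$ for sevens and $Y = 0$ for twos.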
Code
[Figure: images of the digits with the largest and smallest values of x_1]
data("mnist_27")
mnist_27$train %>% ggplot(aes(x_1, x_2, color = y)) + geom_point()
[Figure: x_1 versus x_2 for the mnist_27 training data, colored by digit label (2 or 7)]
[Figure: images of the digits with the largest and smallest values of x_2]
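An assumed reconstruction of the logistic regression fit whose accuracy is reported below:
fit_glm <- glm(y ~ x_1 + x_2, data = mnist_27$train, family = "binomial")
p_hat_glm <- predict(fit_glm, mnist_27$test, type = "response")
y_hat_glm <- factor(ifelse(p_hat_glm > 0.5, 7, 2))
confusionMatrix(y_hat_glm, mnist_27$test$y)$overall["Accuracy"]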
## Accuracy
## 0.76
[Figure: true conditional probability p(x_1, x_2) for the mnist_27 data]
[Figure: true conditional probability p(x_1, x_2) with the decision boundary]
[Figure: logistic regression estimate p_hat(x_1, x_2)]
[Figure: mnist_27 data (2s and 7s) with the logistic regression decision boundary]
# assumed: these lines are the core of the make_data() function used in the next exercise;
# the full function (as provided in the course) is reconstructed here
make_data <- function(n = 1000, p = 0.5, mu_0 = 0, mu_1 = 2, sigma_0 = 1, sigma_1 = 1){
  y <- rbinom(n, 1, p)
  f_0 <- rnorm(n, mu_0, sigma_0)
  f_1 <- rnorm(n, mu_1, sigma_1)
  x <- ifelse(y == 1, f_1, f_0)
  test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
  list(train = data.frame(x = x, y = as.factor(y)) %>% slice(-test_index),
       test = data.frame(x = x, y = as.factor(y)) %>% slice(test_index))
}
dat <- make_data()
Note that we have defined a variable x that is predictive of a binary outcome y:
dat$train %>% ggplot(aes(x, color = y)) + geom_density().
Set the seed to 1, then use the make_data() function defined above to generate 25 different datasets with
mu_1 <- seq(0, 3, len=25). Perform logistic regression on each of the 25 different datasets (predict 1 if
p > 0.5) and plot accuracy (res in the figures) vs mu_1 (delta in the figures).
Which is the correct plot?
[Figure: accuracy (res) versus delta (mu_1 from 0 to 3), rising from about 0.5 to about 0.9]
⊠ A.
□ B.
□ C.
□ D.
Introduction to Smoothing
• Smoothing is a very powerful technique used all across data analysis. It is designed to detect trends
in the presence of noisy data in cases in which the shape of the trend is unknown.
• The concepts behind smoothing techniques are extremely useful in machine learning because condi-
tional expectations/probabilities can be thought of as trends of unknown shapes that we need to
estimate in the presence of uncertainty.
Code
data("polls_2008")
qplot(day, margin, data = polls_2008)
[Figure: poll margin versus day for the polls_2008 data]
• The general idea of smoothing is to group data points into strata in which the value of $f(x)$ can be
assumed to be constant. We can make this assumption because we think $f(x)$ changes slowly and, as
a result, $f(x)$ is almost constant in small windows of time.
• This assumption implies that a good estimate for $f(x)$ is the average of the $Y_i$ values in the window.
The estimate is:
$$\hat{f}(x_0) = \frac{1}{N_0} \sum_{i \in A_0} Y_i$$
• In smoothing, we call the size of the interval $|x - x_0|$ satisfying the particular condition the window
size, bandwidth, or span.
Code
# bin smoothers
span <- 7
fit <- with(polls_2008,ksmooth(day, margin, x.points = day, kernel="box", bandwidth =span))
polls_2008 %>% mutate(smooth = fit$y) %>%
ggplot(aes(day, margin)) +
geom_point(size = 3, alpha = .5, color = "grey") +
geom_line(aes(day, smooth), color="red")
[Figure: bin smoother (box kernel) fit to the 2008 poll margin data]
# kernel
span <- 7
fit <- with(polls_2008, ksmooth(day, margin, x.points = day, kernel="normal", bandwidth = span))
polls_2008 %>% mutate(smooth = fit$y) %>%
ggplot(aes(day, margin)) +
geom_point(size = 3, alpha = .5, color = "grey") +
geom_line(aes(day, smooth), color="red")
[Figure: normal kernel smoother fit to the 2008 poll margin data]
There is a link to the relevant section of the textbook: Local weighted regression
Key points
• A limitation of the bin smoothing approach is that we need small windows for the approximately constant
assumption to hold, which may lead to imprecise estimates of $f(x)$. Local weighted regression
(loess) permits us to consider larger window sizes.
• One important difference between loess and the bin smoother is that we assume the smooth function is
locally linear in a window instead of constant.
• The result of loess is a smoother fit than bin smoothing because we use larger sample sizes to estimate
our local parameters.
Code
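A sketch of the loess fit shown below, assuming a window of about three weeks as in the textbook:
total_days <- diff(range(polls_2008$day))
span <- 21/total_days
fit <- loess(margin ~ day, degree = 1, span = span, data = polls_2008)
polls_2008 %>% mutate(smooth = fit$fitted) %>%
  ggplot(aes(day, margin)) +
  geom_point(size = 3, alpha = .5, color = "grey") +
  geom_line(aes(day, smooth), color = "red")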
[Figure: loess fit to the 2008 poll margin data]
1. In the Wrangling course of this series, PH125.6x, we used the following code to obtain mortality counts
for Puerto Rico for 2015-2018:
if(!require(purrr)) install.packages("purrr")
if(!require(pdftools)) install.packages("pdftools")
library(tidyverse)
library(lubridate)
library(purrr)
library(pdftools)
# assumed reconstruction of the omitted beginning of this code block (from the PH125.6x exercise)
fn <- system.file("extdata", "RD-Mortality-Report_2015-18-180531.pdf", package = "dslabs")
dat <- map_df(str_split(pdf_text(fn), "\n"), function(s){
s <- str_trim(s)
header_index <- str_which(s, "2015")[1]
tmp <- str_split(s[header_index], "\\s+", simplify = TRUE)
month <- tmp[1]
header <- tmp[-1]
tail_index <- str_which(s, "Total")
n <- str_count(s, "\\d+")
out <- c(1:header_index, which(n==1), which(n>=28), tail_index:length(s))
s[-out] %>%
str_remove_all("[^\\d\\s]") %>%
str_trim() %>%
str_split_fixed("\\s+", n = 6) %>%
.[,1:5] %>%
as_data_frame() %>%
setNames(c("day", header)) %>%
mutate(month = month,
day = as.numeric(day)) %>%
gather(year, deaths, -c(day, month)) %>%
mutate(deaths = as.numeric(deaths))
}) %>%
mutate(month = recode(month, "JAN" = 1, "FEB" = 2, "MAR" = 3, "APR" = 4, "MAY" = 5, "JUN" = 6,
"JUL" = 7, "AGO" = 8, "SEP" = 9, "OCT" = 10, "NOV" = 11, "DEC" = 12)) %>%
mutate(date = make_date(year, month, day)) %>%
dplyr::filter(date <= "2018-05-01")
Use the loess() function to obtain a smooth estimate of the expected number of deaths as a function of
date. Plot this resulting smooth function. Make the span about two months long.
Which of the following plots is correct?
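A sketch of the loess fit underlying the correct plot (assuming a span of about two months relative to the full date range):
span <- 60 / as.numeric(diff(range(dat$date)))
fit <- dat %>% mutate(x = as.numeric(date)) %>% loess(deaths ~ x, data = ., span = span, degree = 1)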
[Figure: smooth estimate of deaths versus date (answer choice A)]
⊠ A.
□ B.
□ C.
□ D.
2. Work with the same data as in Q1 to plot smooth estimates against day of the year, all on the same
plot, but with different colors for each year.
dat %>%
mutate(smooth = predict(fit, as.numeric(date)), day = yday(date), year = as.character(year(date))) %>%
ggplot(aes(day, smooth, col = year)) +
geom_line(lwd = 2)
[Figure: smooth estimates of deaths versus day of the year, colored by year (2015-2018)]
□ A.
dat %>%
mutate(smooth = predict(fit), day = yday(date), year = as.character(year(date))) %>%
ggplot(aes(day, smooth, col = year)) +
geom_line(lwd = 2)
□ B.
dat %>%
mutate(smooth = predict(fit, as.numeric(date)), day = mday(date), year = as.character(year(date))) %>%
ggplot(aes(day, smooth, col = year)) +
geom_line(lwd = 2)
□ C.
dat %>%
mutate(smooth = predict(fit, as.numeric(date)), day = yday(date), year = as.character(year(date))) %>%
ggplot(aes(day, smooth)) +
geom_line(lwd = 2)
⊠ D.
dat %>%
mutate(smooth = predict(fit, as.numeric(date)), day = yday(date), year = as.character(year(date))) %>%
ggplot(aes(day, smooth, col = year)) +
geom_line(lwd = 2)
3. Suppose we want to predict 2s and 7s in the mnist_27 dataset with just the second covariate. Can we
do this? On first inspection it appears the data does not have much predictive power.
In fact, if we fit a regular logistic regression the coefficient for x_2 is not significant!
This can be seen using this code:
if(!require(broom)) install.packages("broom")
library(broom)
mnist_27$train %>% glm(y ~ x_2, family = "binomial", data = .) %>% tidy()
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.0907 0.247 -0.368 0.713
## 2 x_2 0.685 0.827 0.829 0.407
[Figure: y (2 or 7) versus x_2 for the mnist_27 training data]
mnist_27$train %>%
mutate(y = ifelse(y=="7", 1, 0)) %>%
ggplot(aes(x_2, y)) +
geom_smooth(method = "loess")
[Figure: loess-smoothed probability of a 7 as a function of x_2]
Matrices
• The main reason for using matrices is that certain mathematical operations needed to develop efficient
code can be performed using techniques from a branch of mathematics called linear algebra.
• Linear algebra and matrix notation are key elements of the language used in academic papers
describing machine learning techniques.
Code
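# assumed: load the MNIST digits first
mnist <- read_mnist()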
class(mnist$train$images)
x <- mnist$train$images[1:1000,]
y <- mnist$train$labels[1:1000]
Matrix Notation
• In matrix algebra, we have three main types of objects: scalars, vectors, and matrices.
– Scalar: $\alpha = 1$
– Vector: $X_1 = \begin{pmatrix} x_{1,1} \\ \vdots \\ x_{N,1} \end{pmatrix}$
– Matrix: $X = [X_1\ X_2] = \begin{pmatrix} x_{1,1} & x_{1,2} \\ \vdots & \vdots \\ x_{N,1} & x_{N,2} \end{pmatrix}$
• In R, we can extract the dimension of a matrix with the function dim(). We can convert a vector into
a matrix using the function as.matrix().
Code
length(x[,1])
## [1] 1000
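# assumed reconstruction of the omitted example that produced the output below
x_1 <- 1:5
x_2 <- 6:10
cbind(x_1, x_2)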
## x_1 x_2
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
dim(x)
dim(x_1)
## NULL
dim(as.matrix(x_1))
## [1] 5 1
dim(x)
• In R, we can convert a vector into a matrix with the matrix() function. The matrix is filled in by
column, but we can fill by row by using the byrow argument. The function t() can be used to directly
transpose a matrix.
• Note that the matrix function recycles values in the vector without warning if the product of
columns and rows does not match the length of the vector.
Code
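# assumed reconstruction of the omitted setup
my_vector <- 1:15
# fill the matrix by column (the default)
mat <- matrix(my_vector, 5, 3)
mat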
# fill by row
mat_t <- matrix(my_vector, 3, 5, byrow = TRUE)
mat_t
identical(t(mat), mat_t)
## [1] TRUE
matrix(my_vector, 5, 5)
grid <- matrix(x[3,], 28, 28)
image(1:28, 1:28, grid)
[Figure: image of the third observation's pixels (appears flipped)]
# flip the image back
image(1:28, 1:28, grid[, 28:1])
[Figure: the same image flipped back to the correct orientation]
There is a link to the relevant section of the textbook: Row and column summaries
Key points
Code
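An assumed reconstruction of the omitted row-summary code that produces the boxplot below:
sums <- rowSums(x)
avg <- rowMeans(x)
tibble(labels = as.factor(y), row_averages = avg) %>%
  qplot(labels, row_averages, data = ., geom = "boxplot")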
[Figure: boxplot of row averages (average pixel intensity) by digit label]
Filtering Columns Based on Summaries
There is a link to the relevant section of the textbook: Filtering columns based on summaries
Key points
Code
if(!require(matrixStats)) install.packages("matrixStats")
library(matrixStats)
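An assumed reconstruction of the omitted code that produces the two figures below:
sds <- colSds(x)
qplot(sds, bins = 30, color = I("black"))
image(1:28, 1:28, matrix(sds, 28, 28)[, 28:1])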
[Figure: histogram of the column standard deviations (sds)]
[Figure: image of the per-pixel standard deviations]
#extract columns and rows
x[ ,c(351,352)]
## [,1] [,2]
## [1,] 70 0
## [2,] 0 0
## [3,] 0 0
## [4,] 205 253
## [5,] 8 78
## ... (remaining 995 rows of output omitted)
x[c(2,3),]
## (output truncated: x[c(2,3),] is a 2 x 784 matrix of pixel intensities; most entries are 0,
##  with non-zero values where the pen strokes of the two digits fall)
class(x[,1])
## [1] "integer"
dim(x[1,])
## NULL
dim(x[, 1, drop=FALSE])
## [1] 1000 1
There is a link to the relevant sections of the textbook: Indexing with matrices and Binarizing the data
Key points
bin_x <- x
bin_x[bin_x < 255/2] <- 0
bin_x[bin_x > 255/2] <- 1
Code
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
[Plot: histogram of as.vector(x), the distribution of pixel intensities]
new_x <- x
new_x[new_x < 50] <- 0
96
#binarize the data
bin_x <- x
bin_x[bin_x < 255/2] <- 0
bin_x[bin_x > 255/2] <- 1
bin_X <- (x > 255/2)*1
There is a link to the relevant sections of the textbook: Vectorization for matrices and Matrix algebra
operations
Key points
(x - rowMeans(x)) / rowSds(x)
t(t(X) - colMeans(X))
• We can also use a function called sweep() that works similarly to apply(). It takes each entry of a
vector and subtracts it from the corresponding row or column:
Code
#take each entry of a vector and subtracts it from the corresponding row or column
x_mean_0 <- sweep(x, 2, colMeans(x))
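A minimal sketch of full column-wise scaling with sweep(), assuming the matrixStats package is available for colSds():

library(matrixStats)
x_mean_0 <- sweep(x, 2, colMeans(x))                          # subtract the column means
x_standardized <- sweep(x_mean_0, 2, colSds(x), FUN = "/")    # then divide by the column SDs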
97
Comprehension Check - Working with Matrices
1. Which line of code correctly creates a 100 by 10 matrix of randomly generated normal numbers and
assigns it to x?
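One correct option, as a sketch:

x <- matrix(rnorm(100 * 10), 100, 10)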
2. Write the line of code that would give you the specified information about the matrix x that you
generated in q1. Do not include any spaces in your line of code.
Dimension of x: dim(x)
Number of rows of x: nrow(x) or dim(x)[1] or length(x[,1])
Number of columns of x: ncol(x) or dim(x)[2] or length(x[1,])
3. Which of the following lines of code would add the scalar 1 to row 1, the scalar 2 to row 2, and so on,
for the matrix x? Select ALL that apply.
⊠ A. x <- x + seq(nrow(x))
□ B. x <- 1:nrow(x)
□ C. x <- sweep(x, 2, 1:nrow(x),"+")
⊠ D. x <- sweep(x, 1, 1:nrow(x),"+")
4. Which of the following lines of code would add the scalar 1 to column 1, the scalar 2 to column 2, and
so on, for the matrix x? Select ALL that apply.
□ A. x <- 1:ncol(x)
□ B. x <- 1:col(x)
⊠ C. x <- sweep(x, 2, 1:ncol(x), FUN = "+")
□ D. x <- -x
5. Which of the following lines of code would give you the average of each row of the matrix x?
□ A. mean(x)
□ B. rowMedians(x)
□ C. sapply(x,mean)
□ D. rowSums(x)
⊠ E. rowMeans(x)
Which of the following lines of code would give you the average of each column of the matrix x?
□ A. mean(x)
□ B. sapply(x,mean)
⊠ C. colMeans(x)
□ D. colMedians(x)
□ E. colSums(x)
6. For each observation in the mnist training data, compute the proportion of pixels that are in the grey
area, defined as values between 50 and 205 (but not including 50 and 205). (To visualize this, you can
make a boxplot by digit class.)
What proportion of the 60000*784 pixels in the mnist training data are in the grey area overall, defined as
values between 50 and 205? Report your answer to at least 3 significant digits.
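A sketch of one way to compute this, assuming mnist was loaded earlier with dslabs::read_mnist():

y <- rowMeans(mnist$train$images > 50 & mnist$train$images < 205)  # grey-area proportion per image
qplot(as.factor(mnist$train$labels), y, geom = "boxplot")
mean(y)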
[Plot: boxplot of the proportion of grey-area pixels (y) by digit class, as.factor(mnist$train$labels)]
## [1] 0.06183703
Section 4 - Distance, Knn, Cross Validation, and Generative Models
In the Distance, kNN, Cross Validation, and Generative Models section, you will learn about
different types of discriminative and generative approaches for machine learning algorithms.
After completing this section, you will be able to:
This section has three parts: nearest neighbors, cross-validation, and generative models.
Distance
• Most clustering and machine learning techniques rely on being able to define distance between obser-
vations, using features or predictors.
• With high dimensional data, a quick way to compute all the distances at once is to use the function
dist(), which computes the distance between each row and produces an object of class dist():
d <- dist(x)
• We can also compute distances between predictors. If 𝑁 is the number of observations, the distance
between two predictors, say 1 and 2, is:

$$\mbox{dist}(1,2) = \sqrt{\sum_{i=1}^{N} (x_{i,1} - x_{i,2})^2}$$
• To compute the distance between all pairs of the 784 predictors, we can transpose the matrix first and
then use dist():
d <- dist(t(x))
Code
ind <- which(mnist$train$labels %in% c(2,7)) %>% sample(500)
#the predictors are in x and the labels in y
x <- mnist$train$images[ind,]
y <- mnist$train$labels[ind]
y[1:3]
## [1] 7 7 2
x_1 <- x[1,]
x_2 <- x[2,]
x_3 <- x[3,]

#distance between the first two observations (both 7s)
sqrt(sum((x_1 - x_2)^2))
## [1] 2079.753
sqrt(sum((x_1 - x_3)^2))
## [1] 2252.129
sqrt(sum((x_2 - x_3)^2))
## [1] 2642.906
#the same distances using matrix algebra
sqrt(crossprod(x_1 - x_2))
## [,1]
## [1,] 2079.753
sqrt(crossprod(x_1 - x_3))
## [,1]
## [1,] 2252.129
sqrt(crossprod(x_2 - x_3))
## [,1]
## [1,] 2642.906
## [1] "dist"
101
as.matrix(d)[1:3,1:3]
## 1 2 3
## 1 0.000 2079.753 2252.129
## 2 2079.753 0.000 2642.906
## 3 2252.129 2642.906 0.000
#compute distance between predictors
d <- dist(t(x))
dim(as.matrix(d))
[Plot: image() of the distances from one pixel to all other pixels, arranged on the 28 x 28 grid]
data(tissue_gene_expression)
dim(tissue_gene_expression$x)
This matrix has the gene expression levels of 500 genes from 189 biological samples representing seven
different tissues. The tissue type is stored in y:
table(tissue_gene_expression$y)
##
## cerebellum colon endometrium hippocampus kidney liver
## 38 34 15 31 39 26
## placenta
## 6
Which of the following lines of code computes the Euclidean distance between each observation and stores
it in the object d?
d <- dist(tissue_gene_expression$x)
2. Using the dataset from Q1, compare the distances between observations 1 and 2 (both cerebellum),
observations 39 and 40 (both colon), and observations 73 and 74 (both endometrium).
Distance-wise, are samples from tissues of the same type closer to each other than tissues of different type?
□ A. No, the samples from the same tissue type are not necessarily closer.
□ B. The two colon samples are close to each other, but the samples from the other two tissues are not.
□ C. The two cerebellum samples are close to each other, but the samples from the other two tissues are
not.
⊠ D. Yes, the samples from the same tissue type are closer to each other.
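A sketch of the comparison behind this answer, assuming d from Q1:

ind <- c(1, 2, 39, 40, 73, 74)
as.matrix(d)[ind, ind]   # within-tissue pairs have smaller distances than between-tissue pairs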
3. Make a plot of all the distances using the image() function to see if the pattern you observed in Q2 is
general.
image(as.matrix(d))
[Plot: image of the full distance matrix; samples from the same tissue type appear as blocks of smaller distances]
□ A. image(d)
⊠ B. image(as.matrix(d))
□ C. d
□ D. image()
Knn
• K-nearest neighbors (kNN) estimates the conditional probabilities in a similar way to bin smooth-
ing. However, kNN is easier to adapt to multiple dimensions.
• Using kNN, for any point (𝑥1 , 𝑥2 ) for which we want an estimate of 𝑝(𝑥1 , 𝑥2 ), we look for the k nearest
points to (𝑥1 , 𝑥2 ) and take an average of the 0s and 1s associated with these points. We refer to the
set of points used to compute the average as the neighborhood. Larger values of k result in smoother
estimates, while smaller values of k result in more flexible and more wiggly estimates.
• To implement the algorithm, we can use the knn3() function from the caret package. There are two
ways to call this function:
1. We need to specify a formula and a data frame. The formula looks like this: outcome ∼ predictor1 +
predictor2 + predictor3 . The predict() function for knn3 produces a probability for each class.
2. We can also call the function with the first argument being the matrix predictors and the second a
vector of outcomes, like this:
x <- as.matrix(mnist_27$train[,2:3])
y <- mnist_27$train$y
knn_fit <- knn3(x,y)
Code
data("mnist_27")
mnist_27$test %>% ggplot(aes(x_1, x_2, color = y)) + geom_point()
[Plot: x_1 versus x_2 for the mnist_27 test set, colored by the label y (2 or 7)]
#logistic regression
library(caret)
fit_glm <- glm(y~x_1+x_2, data=mnist_27$train, family="binomial")
p_hat_logistic <- predict(fit_glm, mnist_27$test)
y_hat_logistic <- factor(ifelse(p_hat_logistic > 0.5, 7, 2))
confusionMatrix(data = y_hat_logistic, reference = mnist_27$test$y)$overall[1]
## Accuracy
## 0.76
x <- as.matrix(mnist_27$train[,2:3])
y <- mnist_27$train$y
knn_fit <- knn3(x, y)
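The 0.815 accuracy below presumably comes from predicting the test set with this fit; a minimal sketch using the formula interface (k = 5, the knn3() default, is an assumption):

knn_fit <- knn3(y ~ ., data = mnist_27$train, k = 5)
y_hat_knn <- predict(knn_fit, mnist_27$test, type = "class")
confusionMatrix(data = y_hat_knn, reference = mnist_27$test$y)$overall["Accuracy"]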
## Accuracy
## 0.815
There is a link to the relevant sections of the textbook: Over-training and Over-smoothing
Key points
• Over-training is the reason that we have higher accuracy in the train set compared to the test set.
Over-training is at its worst when we set 𝑘 = 1. With 𝑘 = 1, the estimate for each (𝑥1 , 𝑥2 ) in the
training set is obtained with just the 𝑦 corresponding to that point.
• When we try a larger 𝑘, the 𝑘 might be so large that it does not permit enough flexibility. We call this
over-smoothing.
• Note that if we use the test set to pick this 𝑘, we should not expect the accompanying accuracy estimate
to extrapolate to the real world. This is because even here we broke a golden rule of machine learning:
we selected the 𝑘 using the test set. Cross validation also provides an estimate that takes this
into account.
Code
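The accuracies below compare performance on the training set with performance on the test set; a minimal sketch of the comparison, assuming the k = 5 fit from above and then k = 1 (the exact calls are an assumption):

#train-set versus test-set accuracy for the k = 5 fit
y_hat_train <- predict(knn_fit, mnist_27$train, type = "class")
confusionMatrix(y_hat_train, mnist_27$train$y)$overall["Accuracy"]
y_hat_test <- predict(knn_fit, mnist_27$test, type = "class")
confusionMatrix(y_hat_test, mnist_27$test$y)$overall["Accuracy"]

#the same comparison with k = 1, where over-training is at its worst
knn_fit_1 <- knn3(y ~ ., data = mnist_27$train, k = 1)
mean(predict(knn_fit_1, mnist_27$train, type = "class") == mnist_27$train$y)
mean(predict(knn_fit_1, mnist_27$test, type = "class") == mnist_27$test$y)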
## Accuracy
## 0.8825
## Accuracy
## 0.815
## [1] 0.995
## [1] 0.74
## Accuracy
## 0.79
#pick the k that maximizes accuracy using the estimates built on the test data
ks[which.max(accuracy$test)]
## [1] 41
max(accuracy$test)
## [1] 0.86
1. Previously, we used logistic regression to predict sex based on height. Now we are going to use knn
to do the same. Set the seed to 1, then use the caret package to partition the dslabs heights data
into a training and test set of equal size. Use the sapply() function to perform knn with k values
of seq(1, 101, 3) and calculate F1 scores with the F_meas() function using the default value of the
relevant argument.
data("heights")
test_index <- createDataPartition(heights$sex, times = 1, p = 0.5, list = FALSE)
test_set <- heights[test_index, ]
train_set <- heights[-test_index, ]
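A sketch of the kNN loop behind the plot and results below, assuming the partition above and the caret package:

ks <- seq(1, 101, 3)
F_1 <- sapply(ks, function(k){
  fit <- knn3(sex ~ height, data = train_set, k = k)
  y_hat <- predict(fit, test_set, type = "class") %>%
    factor(levels = levels(test_set$sex))
  F_meas(data = y_hat, reference = test_set$sex)
})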
[Plot: F_1 score versus k (ks)]
max(F_1)
## [1] 0.6019417
ks[which.max(F_1)]
## [1] 46
2. Next we will use the same gene expression example used in the Comprehension Check: Distance
exercises. You can load it like this:
library(dslabs)
library(caret)
data("tissue_gene_expression")
First, set the seed to 1 and split the data into training and test sets with p = 0.5. Then, report the accuracy
you obtain from predicting tissue type using KNN with k = seq(1, 11, 2) using sapply() or map_df().
Note: use the createDataPartition() function outside of sapply() or map_df().
# set.seed(1) # if using R 3.5 or earlier
set.seed(1, sample.kind = "Rounding") # if using R 3.6 or later
y <- tissue_gene_expression$y
x <- tissue_gene_expression$x
test_index <- createDataPartition(y, list = FALSE)
sapply(seq(1, 11, 2), function(k){
fit <- knn3(x[-test_index,], y[-test_index], k = k)
y_hat <- predict(fit, newdata = data.frame(x=x[test_index,]),
type = "class")
mean(y_hat == y[test_index])
})
• For 𝑘-fold cross validation, we divide the dataset into a training set and a test set. We train our
algorithm exclusively on the training set and use the test set only for evaluation purposes.
• For each set of algorithm parameters being considered, we want an estimate of the MSE and then
we will choose the parameters with the smallest MSE. In 𝑘-fold cross validation, we randomly
split the observations into 𝑘 non-overlapping sets, and repeat the calculation for MSE for each of these
sets. Then, we compute the average MSE and obtain an estimate of our loss. Finally, we can select
the optimal parameter that minimized the MSE.
• In terms of how to select 𝑘 for cross validation, larger values of 𝑘 are preferable but they will
also take much more computational time. For this reason, the choices of 𝑘 = 5 and 𝑘 = 10 are
common.
n <- 1000
p <- 10000
x <- matrix(rnorm(n*p), n, p)
colnames(x) <- paste("x", 1:ncol(x), sep = "_")
y <- rbinom(n, 1, 0.5) %>% factor()
Because x and y are completely independent, you should not be able to predict y using x with accuracy
greater than 0.5. Confirm this by running cross-validation using logistic regression to fit the model. Because
we have so many predictors, we selected a random sample x_subset. Use the subset when training the
model.
Which code correctly performs this cross-validation?
□ A.
⊠ B.
□ C.
□ D.
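Option B, the correct answer, presumably looks like the sketch below; here x_subset is assumed to be a random sample of 100 columns of x:

x_subset <- x[, sample(p, 100)]
fit <- train(x_subset, y, method = "glm")
fit$results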
2. Now, instead of using a random selection of predictors, we are going to search for those that are most
predictive of the outcome. We can do this by comparing the values for the 𝑦 = 1 group to those in the
𝑦 = 0 group, for each predictor, using a t-test. You can perform this step like this:
if(!require(BiocManager)) install.packages("BiocManager")
BiocManager::install("genefilter")
## Installing package(s) 'genefilter'
##
## The downloaded binary packages are in
## /var/folders/6m/nz2p76pn679b692c99t644bm0000gn/T//Rtmp2prVhC/downloaded_packages
library(genefilter)
##
## Attaching package: 'genefilter'
tt <- colttests(x, y)
Which of the following lines of code correctly creates a vector of the p-values called pvals?
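The p-values live in the p.value component of the colttests() output, so a sketch of the correct line is:

pvals <- tt$p.value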
3. Create an index ind with the column numbers of the predictors that were “statistically significantly”
associated with y. Use a p-value cutoff of 0.01 to define “statistically significantly.”
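A sketch of one way to do this:

ind <- which(pvals <= 0.01)
length(ind)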
## [1] 108
4. Now re-run the cross-validation after redefining x_subset to be the subset of x defined by the columns
showing “statistically significant” association with y.
x_subset <- x[,ind]
fit <- train(x_subset, y, method = "glm")
fit$results
5. Re-run the cross-validation again, but this time using kNN. Try out the following grid k = seq(101,
301, 25) of tuning parameters. Make a plot of the resulting accuracies.
fit <- train(x_subset, y, method = "knn", tuneGrid = data.frame(k = seq(101, 301, 25)))
ggplot(fit)
[Plot: bootstrap accuracy versus k for k = seq(101, 301, 25)]
⊠ A.
fit <- train(x_subset, y, method = "knn", tuneGrid = data.frame(k = seq(101, 301, 25)))
ggplot(fit)
□ B.
fit <- train(x_subset, y, method = "knn")
ggplot(fit)
□ C.
fit <- train(x_subset, y, method = "knn", tuneGrid = data.frame(k = seq(103, 301, 25)))
ggplot(fit)
□ D.
fit <- train(x_subset, y, method = "knn", tuneGrid = data.frame(k = seq(101, 301, 5)))
ggplot(fit)
6. In the previous exercises, we saw that despite the fact that x and y are completely independent, we
were able to predict y with accuracy higher than 70%. We must be doing something wrong then.
What is it?
□ A. The function train() estimates accuracy on the same data it uses to train the algorithm.
□ B. We are overfitting the model by including 100 predictors.
⊠ C. We used the entire dataset to select the columns used in the model.
□ D. The high accuracy is just due to random variability.
7. Use the train() function with kNN to select the best k for predicting tissue from gene expression on
the tissue_gene_expression dataset from dslabs. Try k = seq(1,7,2) for tuning parameters. For
this question, do not split the data into test and train sets (understand this can lead to overfitting,
but ignore this for now).
data("tissue_gene_expression")
fit <- with(tissue_gene_expression, train(x, y, method = "knn", tuneGrid = data.frame( k = seq(1, 7, 2))
ggplot(fit)
[Plot: bootstrap accuracy versus number of neighbors for k = seq(1, 7, 2)]
fit$results
Bootstrap
• When we don’t have access to the entire population, we can use the bootstrap to estimate the population
median 𝑚.
• The bootstrap permits us to approximate a Monte Carlo simulation without access to the entire
distribution. The general idea is relatively simple. We act as if the observed sample is the population.
We then sample datasets (with replacement) of the same sample size as the original dataset. Then we
compute the summary statistic, in this case the median, on this bootstrap sample.
• Note that we can use ideas similar to those used in the bootstrap in cross validation: instead of
dividing the data into equal partitions, we simply bootstrap many times.
Code
n <- 10^6
income <- 10^(rnorm(n, log10(45000), log10(3)))
qplot(log10(income), bins = 30, color = I("black"))
[Plot: histogram of log10(income)]
m <- median(income)
m
## [1] 44986.86
set.seed(1)
#use set.seed(1, sample.kind="Rounding") instead if using R 3.6 or later
N <- 250
X <- sample(income, N)
M<- median(X)
M
## [1] 47024.18
library(gridExtra)
##
## Attaching package: 'gridExtra'
B <- 10^5
M <- replicate(B, {
X <- sample(income, N)
median(X)
})
p1 <- qplot(M, bins = 30, color = I("black"))
p2 <- qplot(sample = scale(M)) + geom_abline()
grid.arrange(p1, p2, ncol = 2)
[Plots: histogram of the Monte Carlo medians M and the corresponding normal QQ-plot]
mean(M)
## [1] 45132.14
sd(M)
## [1] 3912.368
B <- 10^5
M_star <- replicate(B, {
X_star <- sample(X, N, replace = TRUE)
median(X_star)
})
[Plot: QQ-plot comparing the bootstrap medians M_star to the Monte Carlo medians M]
quantile(M, c(0.05, 0.95))
## 5% 95%
## 38996.50 51811.42
quantile(M_star, c(0.05, 0.95))
## 5% 95%
## 37112.39 51462.43
Comprehension Check - Bootstrap
1. The createResample() function can be used to create bootstrap samples. For example, we can create
the indexes for 10 bootstrap samples for the mnist_27 dataset like this:
data(mnist_27)
# set.seed(1995) # if R 3.5 or earlier
set.seed(1995, sample.kind="Rounding") # if R 3.6 or later
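The indexes object used below is presumably created with createResample() on the training outcomes; a sketch:

indexes <- createResample(mnist_27$train$y, 10)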
sum(indexes[[1]] == 3)
## [1] 1
sum(indexes[[1]] == 4)
## [1] 4
sum(indexes[[1]] == 7)
## [1] 0
2. We see that some numbers appear more than once and others appear no times. This has to be this
way for each dataset to be independent. Repeat the exercise for all the resampled indexes.
What is the total number of times that 3 appears in all of the resampled indexes?
x=sapply(indexes, function(ind){
sum(ind == 3)
})
sum(x)
## [1] 11
3. Generate a random dataset using the following code:
y <- rnorm(100, 0, 1)
Estimate the 75th quantile, which we know is qnorm(0.75), with the sample quantile: quantile(y, 0.75).
Now, set the seed to 1 and perform a Monte Carlo simulation with 10,000 repetitions, generating the random
dataset and estimating the 75th quantile each time. What is the expected value and standard error of the
75th quantile?
Report all answers to at least 3 decimal digits.
# set.seed(1) # # if R 3.5 or earlier
set.seed(1, sample.kind = "Rounding") # if R 3.6 or later
B <- 10000
q_75 <- replicate(B, {
y <- rnorm(100, 0, 1)
quantile(y, 0.75)
})
mean(q_75)
## [1] 0.6656107
sd(q_75)
## [1] 0.1353809
y <- rnorm(100, 0, 1)
Set the seed to 1 again after generating y and use 10 bootstrap samples to estimate the expected value and
standard error of the 75th quantile.
# set.seed(1) # if R 3.5 or earlier
set.seed(1, sample.kind = "Rounding") # if R 3.6 or later
y <- rnorm(100, 0, 1)
set.seed(1, sample.kind = "Rounding")
indexes <- createResample(y, 10)
q_75_star <- sapply(indexes, function(ind){
y_star <- y[ind]
quantile(y_star, 0.75)
})
mean(q_75_star)
## [1] 0.7312648
sd(q_75_star)
## [1] 0.07419278
5. Repeat the exercise from Q4 but with 10,000 bootstrap samples instead of 10. Set the seed to 1 first.
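A sketch of the repeat, mirroring the Q4 code but with 10,000 bootstrap samples:

# set.seed(1) # if R 3.5 or earlier
set.seed(1, sample.kind = "Rounding") # if R 3.6 or later
indexes <- createResample(y, 10000)
q_75_star <- sapply(indexes, function(ind){
  quantile(y[ind], 0.75)
})
mean(q_75_star)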
## [1] 0.6737512
sd(q_75_star)
## [1] 0.0930575
6. When doing bootstrap sampling, the simulated samples are drawn from the empirical distribution of
the original data.
True or False: The bootstrap is particularly useful in situations when we do not have access to the distribution
or it is unknown.
⊠ A. True
□ B. False
Generative Models
There is a link to the relevant section of the textbook: Generative models
Key points
• Discriminative approaches estimate the conditional probability directly and do not consider the
distribution of the predictors.
• Generative models are methods that model the joint distribution of 𝑌 and 𝑋 (we model how the entire
data, 𝑋 and 𝑌 , are generated).
Naive Bayes
• Bayes’ rule:

$$p(x) = \mbox{Pr}(Y = 1 \mid X = x) = \frac{f_{X|Y=1}(x)\,\mbox{Pr}(Y=1)}{f_{X|Y=0}(x)\,\mbox{Pr}(Y=0) + f_{X|Y=1}(x)\,\mbox{Pr}(Y=1)}$$

with 𝑓𝑋|𝑌 =1 and 𝑓𝑋|𝑌 =0 representing the distribution functions of the predictor 𝑋 for the two classes 𝑌 = 1
and 𝑌 = 0.
• The Naive Bayes approach is similar to the logistic regression prediction mathematically. However,
we leave the demonstration to a more advanced text, such as The Elements of Statistical Learning by
Hastie, Tibshirani, and Friedman.
Code
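The params table below presumably comes from splitting the heights data in half and estimating the class-conditional averages and standard deviations; a sketch (the seed value and 50/50 split are assumptions):

library(tidyverse)
library(caret)
library(dslabs)
data("heights")
y <- heights$height
set.seed(1995, sample.kind = "Rounding")
test_index <- createDataPartition(y, times = 1, p = 0.5, list = FALSE)
train_set <- heights %>% slice(-test_index)
test_set <- heights %>% slice(test_index)

#estimate the average and standard deviation of height for each sex
params <- train_set %>%
  group_by(sex) %>%
  summarize(avg = mean(height), sd = sd(height))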
params
## # A tibble: 2 x 3
## sex avg sd
## <fct> <dbl> <dbl>
## 1 Female 64.5 4.02
## 2 Male 69.3 3.52
## [1] 0.2290076
Controlling Prevalence
• The Naive Bayes approach includes a parameter to account for differences in prevalence 𝜋 = 𝑃𝑟(𝑌 = 1).
If we use hats to denote the estimates, we can write the estimated conditional probability 𝑝(𝑥)̂ as:

$$\hat{p}(x) = \frac{\hat{f}_{X|Y=1}(x)\,\hat{\pi}}{\hat{f}_{X|Y=0}(x)\,(1-\hat{\pi}) + \hat{f}_{X|Y=1}(x)\,\hat{\pi}}$$
• The Naive Bayes approach gives us a direct way to correct the imbalance between sensitivity and
specificity by simply forcing 𝜋̂ to be whatever value we want it to be in order to better balance
specificity and sensitivity.
Code
# Computing sensitivity
y_hat_bayes <- ifelse(p_hat_bayes > 0.5, "Female", "Male")
sensitivity(data = factor(y_hat_bayes), reference = factor(test_set$sex))
## [1] 0.2627119
# Computing specificity
specificity(data = factor(y_hat_bayes), reference = factor(test_set$sex))
## [1] 0.9534314
## [1] 0.7118644
## [1] 0.8210784
# Draw plot
qplot(x, p_hat_bayes_unbiased, geom = "line") +
geom_hline(yintercept = 0.5, lty = 2) +
geom_vline(xintercept = 67, lty = 2)
[Plot: p_hat_bayes_unbiased versus x (height), with a horizontal line at the 0.5 cutoff and a vertical line at 67 inches]
There is a link to the relevant sections of the textbook: Quadratic discriminant analysis and Linear discrim-
inant analysis
Key points
• Quadratic discriminant analysis (QDA) is a version of Naive Bayes in which we assume that the
distributions 𝑝𝑋|𝑌 =1 (𝑥) and 𝑝𝑋|𝑌 =0 (𝑥) are multivariate normal.
• QDA can work well with a few predictors, but it becomes harder to use as the number of predic-
tors increases. Once the number of parameters approaches the size of our data, the method becomes
impractical due to overfitting.
• Forcing the assumption that all predictors share the same standard deviations and correlations, the
boundary will be a line, just as with logistic regression. For this reason, we call the method linear
discriminant analysis (LDA).
• In the case of LDA, the lack of flexibility does not permit us to capture the non-linearity in the
true conditional probability function.
Code
QDA
# Load data
data("mnist_27")
params <- mnist_27$train %>%
group_by(y) %>%
summarize(avg_1 = mean(x_1), avg_2 = mean(x_2),
sd_1 = sd(x_1), sd_2 = sd(x_2),
r = cor(x_1, x_2))
# Contour plots
mnist_27$train %>% mutate(y = factor(y)) %>%
ggplot(aes(x_1, x_2, fill = y, color = y)) +
geom_point(show.legend = FALSE) +
stat_ellipse(type="norm", lwd = 1.5)
[Plot: x_1 versus x_2 for the training data with normal-contour ellipses for the 2s and 7s]
# Fit model
library(caret)
train_qda <- train(y ~., method = "qda", data = mnist_27$train)
# Obtain predictors and accuracy
y_hat <- predict(train_qda, mnist_27$test)
confusionMatrix(data = y_hat, reference = mnist_27$test$y)$overall["Accuracy"]
## Accuracy
## 0.82
# Draw separate plots for 2s and 7s
mnist_27$train %>% mutate(y = factor(y)) %>%
ggplot(aes(x_1, x_2, fill = y, color = y)) +
geom_point(show.legend = FALSE) +
stat_ellipse(type="norm") +
facet_wrap(~y)
[Plot: separate x_1 versus x_2 panels with normal-contour ellipses for the 2s and the 7s]
LDA
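The 0.75 accuracy below presumably comes from fitting LDA with caret and predicting on the test set; a minimal sketch:

train_lda <- train(y ~ ., method = "lda", data = mnist_27$train)
y_hat <- predict(train_lda, mnist_27$test)
confusionMatrix(data = y_hat, reference = mnist_27$test$y)$overall["Accuracy"]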
## Accuracy
## 0.75
Case Study - More than Three Classes
There is a link to the relevant section of the textbook: Case study: more than three classes
Key points
• In this case study, we will briefly give a slightly more complex example: one with 3 classes instead
of 2. Then we will fit QDA, LDA, and KNN models for prediction.
• Generative models can be very powerful, but only when we are able to successfully approximate
the joint distribution of predictors conditioned on each class.
Code
# cbind proportion of pixels in upper left quadrant and proportion of pixels in lower right quadrant
x <- cbind(rowSums(x[ ,upper_left_ind])/rowSums(x),
rowSums(x[ ,lower_right_ind])/rowSums(x))
[Plot: x_1 versus x_2 for the three-class (1, 2, 7) training data]
## 1 2 7
## 1 0.22232613 0.6596410 0.11803290
## 2 0.19256640 0.4535212 0.35391242
## 3 0.62749331 0.3220448 0.05046191
## 4 0.04623381 0.1008304 0.85293583
## 5 0.21671529 0.6229295 0.16035523
## 6 0.12669776 0.3349700 0.53833219
## [1] 2 2 1 7 2 7
## Levels: 1 2 7
## Reference
## Prediction 1 2 7
## 1 111 17 7
## 2 14 80 17
## 7 19 25 109
confusionMatrix(predict(train_qda, test_set), test_set$y)$overall["Accuracy"]
## Accuracy
## 0.7518797
## Accuracy
## 0.6641604
train_knn <- train(y ~ ., method = "knn", tuneGrid = data.frame(k = seq(15, 51, 2)),
data = train_set)
confusionMatrix(predict(train_knn, test_set), test_set$y)$overall["Accuracy"]
## Accuracy
## 0.7719298
train_set %>% mutate(y = factor(y)) %>% ggplot(aes(x_1, x_2, fill = y, color=y)) + geom_point(show.legend = FALSE) + stat_ellipse(type="norm")
[Plot: x_1 versus x_2 for the three-class training data with normal-contour ellipses for each class]
1. Create a dataset of samples from just cerebellum and hippocampus, two parts of the brain, and a
predictor matrix with 10 randomly selected columns using the following code:
data("tissue_gene_expression")
Use the train() function to estimate the accuracy of LDA. For this question, use the version of x and y
created with the code above: do not split them or tissue_gene_expression into training and test sets
(understand this can lead to overfitting). Report the accuracy from the train() results (do not make
predictions).
What is the accuracy? Enter your answer as a percentage or decimal (eg “50%” or “0.50”) to at least the
thousandths place.
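A sketch of the setup and fit described above (the seed value, 1993, is an assumption based on the course setup):

set.seed(1993, sample.kind = "Rounding")
ind <- which(tissue_gene_expression$y %in% c("cerebellum", "hippocampus"))
y <- droplevels(tissue_gene_expression$y[ind])
x <- tissue_gene_expression$x[ind, ]
x <- x[, sample(ncol(x), 10)]

fit_lda <- train(x, y, method = "lda")
fit_lda$results["Accuracy"]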
## Accuracy
## 1 0.8707879
2. In this case, LDA fits two 10-dimensional normal distributions. Look at the fitted model by looking
at the finalModel component of the result of train(). Notice there is a component called means
that includes the estimated means of both distributions. Plot the mean vectors against each other and
determine which predictors (genes) appear to be driving the algorithm.
Which TWO genes appear to be driving the algorithm (i.e. the two genes with the highest means)?
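A sketch of one way to make the means plot referred to above:

t(fit_lda$finalModel$means) %>% data.frame() %>%
  mutate(predictor_name = rownames(.)) %>%
  ggplot(aes(cerebellum, hippocampus, label = predictor_name)) +
  geom_point() +
  geom_text() +
  geom_abline()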
[Plot: estimated cerebellum means versus hippocampus means for the 10 genes; RAB1B and OAZ2 have the highest means]
□ A. PLCB1
⊠ B. RAB1B
□ C. MSH4
⊠ D. OAZ2
□ E. SPI1
□ F. SAPCD1
□ G. HEMK1
Create a dataset of samples from just cerebellum and hippocampus, two parts of the brain, and a predictor
matrix with 10 randomly selected columns using the following code:
data("tissue_gene_expression")
Use the train() function to estimate the accuracy of QDA. For this question, use the version of x and y
created above instead of the default from tissue_gene_expression. Do not split them into training and
test sets (understand this can lead to overfitting).
What is the accuracy?
fit_qda <- train(x, y, method = "qda")
fit_qda$results["Accuracy"]
## Accuracy
## 1 0.8147954
4. Which TWO genes drive the algorithm when using QDA instead of LDA (i.e. the two genes with the
highest means)?
[Plot: QDA estimated cerebellum means versus hippocampus means for the 10 genes; RAB1B and OAZ2 again have the highest means]
□ A. PLCB1
⊠ B. RAB1B
□ C. MSH4
⊠ D. OAZ2
□ E. SPI1
□ F. SAPCD1
□ G. HEMK1
5. One thing we saw in the previous plots is that the values of the predictors correlate in both groups:
some predictors are low in both groups and others high in both groups. The mean value of each
predictor found in colMeans(x) is not informative or useful for prediction, and often, for purposes of
interpretation, it is useful to center or scale each column. This can be achieved with the preProcess
argument in train(). Re-run LDA with preProcess = "center". Note that accuracy does not
change, but it is now easier to identify the predictors that differ more between groups than based on
the plot made in Q2.
Which TWO genes drive the algorithm after performing the scaling?
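A sketch of the centered re-run:

fit_lda <- train(x, y, method = "lda", preProcess = "center")
fit_lda$results["Accuracy"]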
## Accuracy
## 1 0.8595389
[Plot: centered group means for each predictor (predictor_name); OAZ2 and SPI1 show the largest differences between cerebellum and hippocampus]
□ A. C21orf62
□ B. PLCB1
□ C. RAB1B
□ D. MSH4
⊠ E. OAZ2
⊠ F. SPI1
□ G. SAPCD1
□ H. IL18R1
You can see that it is different genes driving the algorithm now. This is because the predictor means change.
In the previous exercises we saw that both LDA and QDA approaches worked well. For further exploration
of the data, you can plot the predictor values for the two genes with the largest differences between the two
groups in a scatter plot to see how they appear to follow a bivariate distribution as assumed by the LDA
and QDA approaches, coloring the points by the outcome, using the following code:
[Plot: expression values of the two genes with the largest between-group differences (SPI1 on one axis), colored by tissue type]
6. Now we are going to increase the complexity of the challenge slightly. Repeat the LDA analysis from
Q5 but using all tissue types. Use the following code to create your dataset:
data("tissue_gene_expression")
y <- tissue_gene_expression$y
x <- tissue_gene_expression$x
x <- x[, sample(ncol(x), 10)]
fit_lda <- train(x, y, method = "lda", preProcess = c("center"))
fit_lda$results["Accuracy"]
## Accuracy
## 1 0.8194837
Section 5 - Classification with More than Two Classes and the Caret
Package
In the Classification with More than Two Classes and the Caret Package section, you will learn
how to overcome the curse of dimensionality using methods that adapt to higher dimensions and how to use
the caret package to implement many different machine learning algorithms.
After completing this section, you will be able to:
This section has three parts: classification with more than two classes, caret package, and a set of
exercises on the Titanic.
Trees Motivation
There is a link to the relevant section of the textbook: The curse of dimensionality
Key points
• LDA and QDA are not meant to be used with many predictors 𝑝 because the number of param-
eters needed to be estimated becomes too large.
• Curse of dimensionality: For kernel methods such as kNN or local regression, when they have
multiple predictors used, the span/neighborhood/window made to include a given percentage of the
data become large. With larger neighborhoods, our methods lose flexibility. The dimension here refers
to the fact that when we have 𝑝 predictors, the distance between two observations is computed in
𝑝-dimensional space.
There is a link to the relevant sections of the textbook: CART motivation and Regression trees
Key points
• A tree is basically a flow chart of yes or no questions. The general idea of the methods we are
describing is to define an algorithm that uses data to create these trees with predictions at the ends,
referred to as nodes.
• When the outcome is continuous, we call the decision tree method a regression tree.
• Regression and decision trees operate by predicting an outcome variable 𝑌 by partitioning the
predictors.
• The general idea here is to build a decision tree and, at the end of each node, obtain a prediction ŷ.
Mathematically, we are partitioning the predictor space into 𝐽 non-overlapping regions, 𝑅1, 𝑅2,
…, 𝑅𝐽, and then for any predictor 𝑥 that falls within region 𝑅𝑗, estimate 𝑓(𝑥) with the average of the
training observations 𝑦𝑖 for which the associated predictor 𝑥𝑖 is also in 𝑅𝑗.
• To pick 𝑗 and its value 𝑠, we find the pair that minimizes the residual sum of squares (RSS):

$$\sum_{i:\,x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\,x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2$$
• To fit the regression tree model, we can use the rpart() function in the rpart package.
• Two common parameters used for partition decision are the complexity parameter (cp) and the min-
imum number of observations required in a partition before partitioning it further (minsplit
in the rpart package).
• If we already have a tree and want to apply a higher cp value, we can use the prune() function. We
call this pruning a tree because we are snipping off partitions that do not meet a cp criterion.
Code
# Load data
data("olive")
olive %>% as_tibble()
## # A tibble: 572 x 10
## region area palmitic palmitoleic stearic oleic linoleic linolenic arachidic
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 South~ Nort~ 10.8 0.75 2.26 78.2 6.72 0.36 0.6
## 2 South~ Nort~ 10.9 0.73 2.24 77.1 7.81 0.31 0.61
## 3 South~ Nort~ 9.11 0.54 2.46 81.1 5.49 0.31 0.63
## 4 South~ Nort~ 9.66 0.570 2.4 79.5 6.19 0.5 0.78
## 5 South~ Nort~ 10.5 0.67 2.59 77.7 6.72 0.5 0.8
## 6 South~ Nort~ 9.11 0.49 2.68 79.2 6.78 0.51 0.7
## 7 South~ Nort~ 9.22 0.66 2.64 79.9 6.18 0.49 0.56
## 8 South~ Nort~ 11 0.61 2.35 77.3 7.34 0.39 0.64
## 9 South~ Nort~ 10.8 0.6 2.39 77.4 7.09 0.46 0.83
## 10 South~ Nort~ 10.4 0.55 2.13 79.4 6.33 0.26 0.52
## # ... with 562 more rows, and 1 more variable: eicosenoic <dbl>
table(olive$region)
##
## Northern Italy Sardinia Southern Italy
## 151 98 323
[Plot: bootstrap accuracy versus number of neighbors for kNN on the olive data]
[Plot: distribution of each fatty acid (arachidic, eicosenoic, linoleic, linolenic, oleic, palmitic, palmitoleic, stearic) by region: Northern Italy, Sardinia, Southern Italy]
[Plot: linoleic versus eicosenoic, colored by region; the three regions separate well with just these two predictors]
[Plot: 2008 poll margin versus day]
if(!require(rpart)) install.packages("rpart")
library(rpart)
fit <- rpart(margin ~ ., data = polls_2008)
[Plot: the fitted regression tree; the first split is at day < -39.5, with further splits at day >= -86.5, day < -49.5, day >= -117.5, day >= -62, day < -109.5, and day < -71.5]
polls_2008 %>%
mutate(y_hat = predict(fit)) %>%
ggplot() +
geom_point(aes(day, margin)) +
geom_step(aes(day, y_hat), col="red")
[Plot: poll margin versus day with the regression tree fit shown as a red step function]
# change parameters
fit <- rpart(margin ~ ., data = polls_2008, control = rpart.control(cp = 0, minsplit = 2))
polls_2008 %>%
mutate(y_hat = predict(fit)) %>%
ggplot() +
geom_point(aes(day, margin)) +
geom_step(aes(day, y_hat), col="red")
[Plot: with cp = 0 and minsplit = 2, the red step function essentially interpolates the training data]
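The cross-validated fit summarized below presumably comes from caret's rpart method tuned over a grid of cp values; a sketch (the exact grid is an assumption):

library(caret)
train_rpart <- train(margin ~ .,
                     method = "rpart",
                     tuneGrid = data.frame(cp = seq(0, 0.05, len = 25)),
                     data = polls_2008)
ggplot(train_rpart)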
[Plot: cross-validated RMSE versus the complexity parameter cp, and the resulting pruned tree with its first split at day < -39.5]
polls_2008 %>%
mutate(y_hat = predict(train_rpart)) %>%
ggplot() +
geom_point(aes(day, margin)) +
geom_step(aes(day, y_hat), col="red")
[Plot: poll margin versus day with the cross-validated regression tree fit shown as a red step function]
There is a link to the relevant section of the textbook: Classification (decision) trees
Key points
• Classification trees, or decision trees, are used in prediction problems where the outcome is cate-
gorical.
• Decision trees form predictions by calculating which class is the most common among the training
set observations within the partition, rather than taking the average in each partition.
• Two of the more popular metrics to choose the partitions are the Gini index and entropy.
$$\mbox{Gini}(j) = \sum_{k=1}^{K} \hat{p}_{j,k}\,(1 - \hat{p}_{j,k})$$

$$\mbox{entropy}(j) = -\sum_{k=1}^{K} \hat{p}_{j,k}\,\log(\hat{p}_{j,k}), \mbox{ with } 0 \times \log(0) \mbox{ defined as } 0$$
• Pros: Classification trees are highly interpretable and easy to visualize. They can model human decision
processes and don’t require the use of dummy predictors for categorical variables.
• Cons: The approach via recursive partitioning can easily over-train and is therefore a bit harder to
train than, for example, linear regression or kNN. Furthermore, in terms of accuracy, it is rarely the best
performing method since it is not very flexible and is highly unstable to changes in training data.
Code
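The accuracy-versus-cp plot and the 0.82 test accuracy below presumably come from a caret rpart fit on mnist_27; a sketch (the cp grid is an assumption):

train_rpart <- train(y ~ .,
                     method = "rpart",
                     tuneGrid = data.frame(cp = seq(0.0, 0.1, len = 25)),
                     data = mnist_27$train)
plot(train_rpart)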
[Plot: bootstrap accuracy versus the complexity parameter for the classification tree on mnist_27]
# compute accuracy
confusionMatrix(predict(train_rpart, mnist_27$test), mnist_27$test$y)$overall["Accuracy"]
## Accuracy
## 0.82
Random Forests
• Random forests are a very popular machine learning approach that addresses the shortcomings of
decision trees. The goal is to improve prediction performance and reduce instability by averaging
multiple decision trees (a forest of trees constructed with randomness).
• The general idea of random forests is to generate many predictors, each using regression or classification
trees, and then forming a final prediction based on the average prediction of all these trees.
To assure that the individual trees are not the same, we use the bootstrap to induce randomness.
• A disadvantage of random forests is that we lose interpretability.
• An approach that helps with interpretability is to examine variable importance. To define variable
importance we count how often a predictor is used in the individual trees. The caret package
includes the function varImp that extracts variable importance from any model in which the calculation
is implemented.
Code
if(!require(randomForest)) install.packages("randomForest")
## randomForest 4.6-14
##
## Attaching package: 'randomForest'
if(!require(Rborist)) install.packages("Rborist")
## Rborist 0.2-3
library(randomForest)
fit <- randomForest(margin~., data = polls_2008)
plot(fit)
[Plot: random forest error versus number of trees for the polls_2008 fit]
polls_2008 %>%
mutate(y_hat = predict(fit, newdata = polls_2008)) %>%
ggplot() +
geom_point(aes(day, margin)) +
geom_line(aes(day, y_hat), col="red")
[Plot: poll margin versus day with the random forest fit shown as a red line]
## Accuracy
## 0.785
## Accuracy
## 0.8
n <- 1000
sigma <- 0.25
# set.seed(1) # if using R 3.5 or ealier
set.seed(1, sample.kind = "Rounding") # if using R 3.6 or later
## Warning in set.seed(1, sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
x <- rnorm(n, 0, 1)
y <- 0.75 * x + rnorm(n, 0, sigma)
dat <- data.frame(x = x, y = y)
Which code correctly uses rpart() to fit a regression tree and saves the result to fit?
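The correct call, as a sketch:

fit <- rpart(y ~ ., data = dat)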
2. Which of the following plots has the same tree shape obtained in Q1?
[Plot: the fitted tree; it has a single split at x < 0.04356]
□ A.
□ B.
□ C.
⊠ D.
3. Below is most of the code to make a scatter plot of y versus x along with the predicted values based
on the fit.
dat %>%
mutate(y_hat = predict(fit)) %>%
ggplot() +
geom_point(aes(x, y)) +
#BLANK
Which line of code should be used to replace #BLANK in the code above?
dat %>%
mutate(y_hat = predict(fit)) %>%
ggplot() +
geom_point(aes(x, y)) +
geom_step(aes(x, y_hat), col=2)
[Plot: y versus x with the regression tree predictions shown as a red step function]
4. Now run Random Forests instead of a regression tree using randomForest() from the randomForest
package, and remake the scatterplot with the prediction line. Part of the code is provided for you
below.
library(randomForest)
fit <- #BLANK
dat %>%
mutate(y_hat = predict(fit)) %>%
ggplot() +
geom_point(aes(x, y)) +
geom_step(aes(x, y_hat), col = "red")
What code should replace #BLANK in the provided code?
library(randomForest)
fit <- randomForest(y ~ x, data = dat)
dat %>%
mutate(y_hat = predict(fit)) %>%
ggplot() +
geom_point(aes(x, y)) +
geom_step(aes(x, y_hat), col = "red")
[Plot: y versus x with the random forest predictions shown as a red step function]
5. Use the plot() function to see if the Random Forest from Q4 has converged or if we need more trees.
Which of these graphs is most similar to the one produced by plotting the random forest? Note that there
may be slight differences due to the seed not being set.
plot(fit)
[Plot: random forest error versus number of trees]
□ A.
□ B.
⊠ C.
□ D.
6. It seems that the default values for the Random Forest result in an estimate that is too flexible
(unsmooth). Re-run the Random Forest but this time with a node size of 50 and a maximum of 25
nodes. Remake the plot.
library(randomForest)
fit <- #BLANK
dat %>%
mutate(y_hat = predict(fit)) %>%
ggplot() +
geom_point(aes(x, y)) +
geom_step(aes(x, y_hat), col = "red")
library(randomForest)
fit <- randomForest(y ~ x, data = dat, nodesize = 50, maxnodes = 25)
dat %>%
mutate(y_hat = predict(fit)) %>%
ggplot() +
geom_point(aes(x, y)) +
geom_step(aes(x, y_hat), col = "red")
[Plot: y versus x with the smoother random forest fit (nodesize = 50, maxnodes = 25) shown as a red step function]
Caret Package
There is a link to the relevant section of the textbook: The caret package
Caret package links
http://topepo.github.io/caret/available-models.html
http://topepo.github.io/caret/train-models-by-tag.html
Key points
• The caret package provides a uniform interface and standardized syntax for the many different
machine learning packages in R. Note that caret does not automatically install the packages needed.
Code
data("mnist_27")
confusionMatrix(y_hat_glm, mnist_27$test$y)$overall[["Accuracy"]]
## [1] 0.75
confusionMatrix(y_hat_knn, mnist_27$test$y)$overall[["Accuracy"]]
## [1] 0.84
• The train() function automatically uses cross-validation to decide among a few default values of a
tuning parameter.
• The getModelInfo() and modelLookup() functions can be used to learn more about a model and the
parameters that can be optimized.
• We can use the tuneGrid parameter in the train() function to select a grid of values to be compared.
• The trControl parameter and trainControl() function can be used to change the way cross-
validation is performed.
• Note that not all parameters in machine learning algorithms are tuned. We use the train()
function to only optimize parameters that are tunable.
Code
getModelInfo("knn")
## $kknn
## $kknn$label
## [1] "k-Nearest Neighbors"
##
## $kknn$parameters
##   parameter     class           label
## 1      kmax   numeric Max. #Neighbors
## 2  distance   numeric        Distance
## 3    kernel character          Kernel
##
## $knn
## $knn$label
## [1] "k-Nearest Neighbors"
##
## $knn$parameters
##   parameter   class      label
## 1         k numeric #Neighbors
##
## (output truncated: getModelInfo("knn") also returns the grid, fit, predict, prob, levels,
##  sort, and tags components for both the "kknn" and "knn" implementations)
modelLookup("knn")
[Plot: bootstrap accuracy versus number of neighbors for the default knn tuning grid (k = 5, 7, 9)]
[Plot: bootstrap accuracy versus number of neighbors over the expanded grid of k values]
train_knn$bestTune
## k
## 18 43
train_knn$finalModel
## Accuracy
## 0.855
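The expanded grid and the cross-validated fit behind the plots above and below presumably look like the sketch here (the exact grid values are an assumption):

train_knn <- train(y ~ ., method = "knn",
                   data = mnist_27$train,
                   tuneGrid = data.frame(k = seq(9, 71, 2)))
control <- trainControl(method = "cv", number = 10, p = .9)
train_knn_cv <- train(y ~ ., method = "knn",
                      data = mnist_27$train,
                      tuneGrid = data.frame(k = seq(9, 71, 2)),
                      trControl = control)
ggplot(train_knn_cv, highlight = TRUE)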
[Plot: cross-validated accuracy versus number of neighbors]
train_knn$results %>%
ggplot(aes(x = k, y = Accuracy)) +
geom_line() +
geom_point() +
geom_errorbar(aes(x = k,
ymin = Accuracy - AccuracySD,
ymax = Accuracy + AccuracySD))
[Plot: accuracy versus k with error bars of one standard deviation]
[Plot: the estimated conditional probability over x_1 and x_2 for the best-performing kNN model]
if(!require(gam)) install.packages("gam")
##
## Attaching package: 'foreach'
modelLookup("gamLoess")
grid <- expand.grid(span = seq(0.15, 0.65, len = 10), degree = 1)
1. Load the rpart package and then use the caret::train() function with method = "rpart" to fit a
classification tree to the tissue_gene_expression dataset. Try out cp values of seq(0, 0.1, 0.01).
Plot the accuracies to report the results of the best model. Set the seed to 1991.
data("tissue_gene_expression")
ggplot(fit)
[Plot: bootstrap accuracy versus the complexity parameter cp]
2. Note that there are only 6 placentas in the dataset. By default, rpart requires 20 observations before
splitting a node. That means that it is difficult to have a node in which placentas are the majority.
Rerun the analysis you did in Q1 with caret::train(), but this time with method = "rpart" and
allow it to split any node by using the argument control = rpart.control(minsplit = 0). Look at
the confusion matrix again to determine whether the accuracy increases. Again, set the seed to 1991.
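A sketch of the re-run that allows any node to split:

set.seed(1991, sample.kind = "Rounding")
fit_rpart <- with(tissue_gene_expression,
                  train(x, y, method = "rpart",
                        tuneGrid = data.frame(cp = seq(0, 0.1, 0.01)),
                        control = rpart.control(minsplit = 0)))
ggplot(fit_rpart)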
[Plot: bootstrap accuracy versus cp when any node is allowed to split (minsplit = 0)]
confusionMatrix(fit_rpart)
3. Plot the tree from the best fitting model of the analysis you ran in Q2.
plot(fit_rpart$finalModel)
text(fit_rpart$finalModel)
[Plot: the fitted classification tree; the first split is on GPA33 >= 8.794, with further splits on CLIP3, CES2, HRH1, and B3GNT4 leading to leaves for colon, liver, hippocampus, cerebellum, kidney, endometrium, and placenta]
□ A. B3GNT4
□ B. CAPN3
□ C. CES2
□ D. CFHR4
□ E. CLIP3
⊠ F. GPA33
□ G. HRH1
4. We can see that with just seven genes, we are able to predict the tissue type. Now let’s see if we can
predict the tissue type with even fewer genes using a Random Forest. Use the train() function and
the rf method to train a Random Forest model and save it to an object called fit. Try out values of
mtry ranging from seq(50, 200, 25) (you can also explore other values on your own). What mtry
value maximizes accuracy? To permit small nodesize to grow as we did with the classification trees,
use the following argument: nodesize = 1.
Note: This exercise will take some time to run. If you want to test out your code first, try using smaller
values with ntree. Set the seed to 1991 again.
What value of mtry maximizes accuracy? 100
library(randomForest)
fit <- with(tissue_gene_expression,
train(x, y, method = "rf",
nodesize = 1,
tuneGrid = data.frame(mtry = seq(50, 200, 25))))
ggplot(fit)
[Plot: bootstrap accuracy versus mtry; accuracy is maximized at mtry = 100]
5. Use the function varImp() on the output of train() and save it to an object called imp:
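A sketch of the call that produces the output below:

imp <- varImp(fit)
imp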
## rf variable importance
##
## only 20 most important variables shown (out of 500)
##
## Overall
## GPA33 100.00
## BIN1 64.65
## GPM6B 62.35
## KIF2C 62.15
## CLIP3 52.09
## COLGALT2 46.48
## CFHR4 35.03
## SHANK2 34.90
## TFR2 33.61
## GALNT11 30.70
## CEP55 30.49
## TCN2 27.96
## CAPN3 27.52
## CYP4F11 25.74
## GTF2IRD1 24.89
## KCTD2 24.34
## FCN3 22.68
## SUSD6 22.24
## DOCK4 22.02
## RARRES2 21.53
6. The rpart() model we ran above in Q2 produced a tree that used just seven predictors. Extracting
the predictor names is not straightforward, but can be done. If the output of the call to train was
fit_rpart, we can extract the names like this:
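tree_terms <- as.character(unique(fit_rpart$finalModel$frame$var[!(fit_rpart$finalModel$frame$var == "<leaf>")]))
tree_terms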
Calculate the variable importance in the Random Forest call from Q4 for these seven predictors and examine
where they rank.
What is the importance of the CFHR4 gene in the Random Forest call? 35.0
What is the rank of the CFHR4 gene in the Random Forest call? 7
data_frame(term = rownames(imp$importance),
importance = imp$importance$Overall) %>%
mutate(rank = rank(-importance)) %>% arrange(desc(importance)) %>%
filter(term %in% tree_terms)
## # A tibble: 7 x 3
## term importance rank
## <chr> <dbl> <dbl>
## 1 GPA33 100 1
## 2 CLIP3 52.1 5
## 3 CFHR4 35.0 7
## 4 CAPN3 27.5 13
## 5 CES2 20.0 22
## 6 HRH1 2.35 97
## 7 B3GNT4 0.136 343
These exercises cover everything you have learned in this course so far. You will use the background
information provided to train a number of different types of models on this dataset.
Background
The Titanic was a British ocean liner that struck an iceberg and sank on its maiden voyage in 1912 from
the United Kingdom to New York. More than 1,500 of the estimated 2,224 passengers and crew died in
the accident, making this one of the largest maritime disasters ever outside of war. The ship carried a wide
range of passengers of all ages and both genders, from luxury travelers in first-class to immigrants in the
lower classes. However, not all passengers were equally likely to survive the accident. You will use real data
about a selection of 891 passengers to predict which passengers survived.
if(!require(titanic)) install.packages("titanic")
# 3 significant digits
options(digits = 3)
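The titanic_clean object referred to below is presumably built with a setup step like this sketch (the exact cleaning choices are assumptions based on the course setup: median imputation for Age, a FamilySize count, and nine retained variables):

library(titanic)
library(tidyverse)
library(caret)

titanic_clean <- titanic_train %>%
  mutate(Survived = factor(Survived),
         Embarked = factor(Embarked),
         Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age),  # impute median age for NAs
         FamilySize = SibSp + Parch + 1) %>%                        # count family members
  select(Survived, Sex, Pclass, Age, Fare, SibSp, Parch, FamilySize, Embarked)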
Split titanic_clean into test and training sets - after running the setup code, it should have 891 rows and
9 variables.
Set the seed to 42, then use the caret package to create a 20% data partition based on the Survived column.
Assign the 20% partition to test_set and the remaining 80% partition to train_set.
How many observations are in the training set?
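A sketch of the partition described above:

set.seed(42, sample.kind = "Rounding")
test_index <- createDataPartition(titanic_clean$Survived, times = 1, p = 0.2, list = FALSE)
test_set <- titanic_clean[test_index, ]
train_set <- titanic_clean[-test_index, ]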
nrow(train_set)
## [1] 712
nrow(test_set)
## [1] 179
mean(train_set$Survived == 1)
## [1] 0.383
The simplest prediction method is randomly guessing the outcome without using additional predictors. These
methods will help us determine whether our machine learning algorithm performs better than chance. How
accurate are two methods of guessing Titanic passenger survival?
Set the seed to 3. For each individual in the test set, randomly guess whether that person survived or not by
sampling from the vector c(0,1) (Note: use the default argument setting of prob from the sample function).
What is the accuracy of this guessing method?
#set.seed(3)
set.seed(3, sample.kind = "Rounding")
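A sketch of the random guess:

guess <- sample(c(0, 1), nrow(test_set), replace = TRUE)
mean(guess == test_set$Survived)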
## [1] 0.475
train_set %>%
group_by(Sex) %>%
summarize(Survived = mean(Survived == 1)) %>%
filter(Sex == "female") %>%
pull(Survived)
## [1] 0.731
train_set %>%
group_by(Sex) %>%
summarize(Survived = mean(Survived == 1)) %>%
filter(Sex == "male") %>%
pull(Survived)
## `summarise()` ungrouping output (override with `.groups` argument)
## [1] 0.197
## [1] 0.821
train_set %>%
group_by(Pclass) %>%
summarize(Survived = mean(Survived == 1))
## # A tibble: 3 x 2
## Pclass Survived
## <int> <dbl>
## 1 1 0.619
## 2 2 0.5
## 3 3 0.242
⊠ A. 1
□ B. 2
□ C. 3
## [1] 0.704
train_set %>%
group_by(Sex, Pclass) %>%
summarize(Survived = mean(Survived == 1)) %>%
filter(Survived > 0.5)
## # A tibble: 2 x 3
## # Groups: Sex [1]
## Sex Pclass Survived
## <chr> <int> <dbl>
## 1 female 1 0.957
## 2 female 2 0.919
## [1] 0.821
## No Information Rate : 0.615
## P-Value [Acc > NIR] : 1.72e-09
##
## Kappa : 0.619
##
## Mcnemar's Test P-Value : 0.596
##
## Sensitivity : 0.873
## Specificity : 0.739
## Pos Pred Value : 0.842
## Neg Pred Value : 0.785
## Prevalence : 0.615
## Detection Rate : 0.536
## Detection Prevalence : 0.637
## Balanced Accuracy : 0.806
##
## 'Positive' Class : 0
##
## Reference
## Prediction 0 1
## 0 109 31
## 1 1 38
##
## Accuracy : 0.821
## 95% CI : (0.757, 0.874)
## No Information Rate : 0.615
## P-Value [Acc > NIR] : 1.72e-09
##
## Kappa : 0.589
##
## Mcnemar's Test P-Value : 2.95e-07
##
## Sensitivity : 0.991
## Specificity : 0.551
## Pos Pred Value : 0.779
## Neg Pred Value : 0.974
## Prevalence : 0.615
## Detection Rate : 0.609
## Detection Prevalence : 0.782
## Balanced Accuracy : 0.771
##
## 'Positive' Class : 0
##
⊠ A. 0
□ B. 1
□ A. sex only
□ B. class only
⊠ C. sex and class combined
⊠ A. sex only
□ B. class only
□ C. sex and class combined
⊠ A. sex only
□ B. class only
□ C. sex and class combined
6. F1 scores
Use the F_meas() function to calculate 𝐹1 scores for the sex model, class model, and combined sex and class
model. You will need to convert predictions to factors to use this function.
Which model has the highest 𝐹1 score?
## [1] 0.857
## [1] 0.78
## [1] 0.872
□ A. sex only
□ B. class only
⊠ C. sex and class combined
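A sketch of how these scores can be computed with F_meas(); the prediction objects (sex_preds, class_preds, combo_preds) are illustrative names for the factor predictions of the three models above:
sex_preds <- factor(ifelse(test_set$Sex == "female", 1, 0), levels = levels(test_set$Survived))
class_preds <- factor(ifelse(test_set$Pclass == 1, 1, 0), levels = levels(test_set$Survived))
combo_preds <- factor(ifelse(test_set$Sex == "female" & test_set$Pclass != 3, 1, 0),
  levels = levels(test_set$Survived))
F_meas(data = sex_preds, reference = test_set$Survived)   # sex only
F_meas(data = class_preds, reference = test_set$Survived) # class only
F_meas(data = combo_preds, reference = test_set$Survived) # sex and class combined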
Set the seed to 1. Train a model using linear discriminant analysis (LDA) with the caret lda method using
fare as the only predictor.
What is the accuracy on the test set for the LDA model?
## [1] 0.693
Set the seed to 1. Train a model using quadratic discriminant analysis (QDA) with the caret qda method
using fare as the only predictor.
What is the accuracy on the test set for the QDA model?
#set.seed(1) # if using R 3.5 or earlier
set.seed(1, sample.kind = "Rounding") # if using R 3.6 or later
## [1] 0.693
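A sketch of the two fits (each preceded by the corresponding set.seed() call shown above; object names are illustrative):
train_lda <- train(Survived ~ Fare, method = "lda", data = train_set)
mean(predict(train_lda, test_set) == test_set$Survived)
train_qda <- train(Survived ~ Fare, method = "qda", data = train_set)
mean(predict(train_qda, test_set) == test_set$Survived)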
Note: when training models for Titanic Exercises Part 2, please use the S3 method for class formula rather
than the default S3 method of caret train() (see ?caret::train for details).
Set the seed to 1. Train a logistic regression model with the caret glm method using age as the only
predictor.
What is the accuracy of your model (using age as the only predictor) on the test set?
## [1] 0.615
Set the seed to 1. Train a logistic regression model with the caret glm method using four predictors: sex,
class, fare, and age.
What is the accuracy of your model (using these four predictors) on the test set?
train_glm <- train(Survived ~ Sex + Pclass + Fare + Age, method = "glm", data = train_set)
glm_preds <- predict(train_glm, test_set)
mean(glm_preds == test_set$Survived)
## [1] 0.849
Set the seed to 1. Train a logistic regression model with the caret glm method using all predictors. Ignore
warnings about rank-deficient fit.
What is the accuracy of your model (using all predictors) on the test set?
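The fit that produces glm_all_preds below was not captured; a sketch (run after set.seed(1, sample.kind = "Rounding")):
train_glm_all <- train(Survived ~ ., method = "glm", data = train_set)
glm_all_preds <- predict(train_glm_all, test_set)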
## prediction from a rank-deficient fit may be misleading
mean(glm_all_preds == test_set$Survived)
## [1] 0.849
9. kNN model
Set the seed to 6. Train a kNN model with the caret knn method. Try tuning with k = seq(3, 51, 2).
#set.seed(6)
set.seed(6, sample.kind = "Rounding") # if using R 3.6 or later
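The training call itself was not captured; a sketch consistent with the tuning grid described above:
train_knn <- train(Survived ~ ., method = "knn",
  data = train_set,
  tuneGrid = data.frame(k = seq(3, 51, 2)))
train_knn$bestTune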
## k
## 5 11
ggplot(train_knn)
[Plot: bootstrap accuracy (about 0.669-0.678) versus the number of neighbors for the kNN model]
What is the optimal value of the number of neighbors k?
□ A. 7
⊠ B. 11
□ C. 17
□ D. 21
What is the accuracy of the kNN model on the test set?
## [1] 0.709
10. Cross-validation
Set the seed to 8 and train a new kNN model. Instead of the default training control, use 10-fold cross-
validation where each partition consists of 10% of the total. Try tuning with k = seq(3, 51, 2).
What is the optimal value of k using cross-validation?
#set.seed(8)
set.seed(8, sample.kind = "Rounding") # simulate R 3.5
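A sketch of the cross-validated fit (the object name train_knn_cv is illustrative):
train_knn_cv <- train(Survived ~ ., method = "knn",
  data = train_set,
  tuneGrid = data.frame(k = seq(3, 51, 2)),
  trControl = trainControl(method = "cv", number = 10, p = 0.9))
train_knn_cv$bestTune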
## k
## 2 5
What is the accuracy on the test set using the cross-validated kNN model?
## [1] 0.648
11. Classification tree model
Set the seed to 10. Use caret train() with the rpart method to train a decision tree. Tune the complexity
parameter with cp = seq(0, 0.05, 0.002).
#set.seed(10)
set.seed(10, sample.kind = "Rounding") # simulate R 3.5
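A sketch of the fit (the object name train_rpart is illustrative):
train_rpart <- train(Survived ~ ., method = "rpart",
  data = train_set,
  tuneGrid = data.frame(cp = seq(0, 0.05, 0.002)))
train_rpart$bestTune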
## cp
## 9 0.016
What is the accuracy of the decision tree model on the test set?
## [1] 0.838
## n= 712
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 712 273 0 (0.6166 0.3834)
## 2) Sexmale>=0.5 463 91 0 (0.8035 0.1965)
## 4) Age>=3.5 449 80 0 (0.8218 0.1782) *
## 5) Age< 3.5 14 3 1 (0.2143 0.7857) *
## 3) Sexmale< 0.5 249 67 1 (0.2691 0.7309)
## 6) Pclass>=2.5 118 59 0 (0.5000 0.5000)
## 12) Fare>=23.4 24 3 0 (0.8750 0.1250) *
## 13) Fare< 23.4 94 38 1 (0.4043 0.5957) *
## 7) Pclass< 2.5 131 8 1 (0.0611 0.9389) *
[Plot: the fitted decision tree, with splits on Sexmale >= 0.5, Age >= 3.5, Pclass >= 2.5, and Fare >= 23.35]
Which variables are used in the decision tree?
Select ALL that apply.
□ A. Survived
⊠ B. Sex
⊠ C. Pclass
⊠ D. Age
⊠ E. Fare
□ F. Parch
□ G. Embarked
Set the seed to 14. Use the caret train() function with the rf method to train a random forest. Test
values of mtry = seq(1:7). Set ntree to 100.
What mtry value maximizes accuracy?
#set.seed(14)
set.seed(14, sample.kind = "Rounding") # simulate R 3.5
train_rf <- train(Survived ~ .,
data = train_set,
method = "rf",
ntree = 100,
tuneGrid = data.frame(mtry = seq(1:7)))
train_rf$bestTune
## mtry
## 2 2
What is the accuracy of the random forest model on the test set?
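A sketch of the accuracy computation (rf_preds is an illustrative name):
rf_preds <- predict(train_rf, test_set)
mean(rf_preds == test_set$Survived)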
## [1] 0.844
Use varImp() on the random forest model object to determine the importance of various predictors to the
random forest model.
What is the most important variable?
## rf variable importance
##
## Overall
## Sexmale 100.000
## Fare 65.091
## Age 45.533
## Pclass 32.529
## FamilySize 18.275
## SibSp 7.881
## Parch 7.150
## EmbarkedS 2.839
## EmbarkedQ 0.122
## EmbarkedC 0.000
Section 6 - Model Fitting and Recommendation Systems Overview
This section has three parts: case study: MNIST, recommendation systems, and regularization.
Case Study: MNIST
There is a link to the relevant section of the textbook: Machine learning in practice
Key points
• We will apply what we have learned in the course on the Modified National Institute of Standards and
Technology database (MNIST) digits, a popular dataset used in machine learning competitions.
Code
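The code that loads the data was not captured; the dataset is available through the dslabs package:
library(dslabs)
mnist <- read_mnist()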
names(mnist)
dim(mnist$train$images)
class(mnist$train$labels)
## [1] "integer"
table(mnist$train$labels)
##
## 0 1 2 3 4 5 6 7 8 9
## 5923 6742 5958 6131 5842 5421 5918 6265 5851 5949
# sample 10k rows from training set, 1k rows from test set
set.seed(123)
index <- sample(nrow(mnist$train$images), 10000)
x <- mnist$train$images[index,]
y <- factor(mnist$train$labels[index])
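A matching subset of the test set is used later (x_test, y_test); a sketch following the same approach:
index <- sample(nrow(mnist$test$images), 1000)
x_test <- mnist$test$images[index, ]
y_test <- factor(mnist$test$labels[index])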
Preprocessing
Key points
• Common preprocessing steps include:
1. standardizing or transforming predictors and
2. removing predictors that are not useful, are highly correlated with others, have very few non-unique
values, or have close to zero variation.
Code
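The preprocessing code was not captured; a sketch that produces the plots and the count below (it assumes the caret and matrixStats packages are available):
library(matrixStats)
sds <- colSds(x)
qplot(sds, bins = 256)
nzv <- nearZeroVar(x)
image(matrix(1:784 %in% nzv, 28, 28))
col_index <- setdiff(1:ncol(x), nzv)
length(col_index)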
[Plot: histogram of the standard deviations (sds) of the 784 pixel predictors]
[Plot: image showing the near-zero-variance pixels that are removed]
## [1] 252
There is a link to the relevant section of the textbook: k-nearest neighbor and random forest
Key points
• The caret package requires that we add column names to the feature matrices.
• In general, it is a good idea to test out a small subset of the data first to get an idea of how long
your code will take to run.
Code
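The caret calls below assume the feature matrices have column names; a sketch of that step:
colnames(x) <- 1:ncol(mnist$train$images)
colnames(x_test) <- colnames(x)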
[Plot: cross-validation accuracy (about 0.945-0.947) versus the number of neighbors for the kNN model]
n <- 1000
b <- 2
index <- sample(nrow(x), n)
control <- trainControl(method = "cv", number = b, p = .9)
train_knn <- train(x[index ,col_index], y[index],
method = "knn",
tuneGrid = data.frame(k = c(3,5,7)),
trControl = control)
fit_knn <- knn3(x[ ,col_index], y, k = 3)
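A sketch of the prediction and accuracy computation that produces the output below (y_hat_knn is an illustrative name; cm is used again afterwards):
y_hat_knn <- predict(fit_knn, x_test[, col_index], type = "class")
cm <- confusionMatrix(y_hat_knn, factor(y_test))
cm$overall["Accuracy"]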
## Accuracy
## 0.955
cm$byClass[,1:2]
## Sensitivity Specificity
## Class: 0 1.000 0.998
## Class: 1 1.000 0.992
## Class: 2 0.953 0.999
## Class: 3 0.917 0.993
## Class: 4 0.936 0.996
## Class: 5 0.971 0.991
## Class: 6 0.990 0.998
## Class: 7 0.945 0.994
## Class: 8 0.846 0.998
## Class: 9 0.971 0.991
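The random forest tuning run that produced the plot and best tune below was not captured; a sketch following the textbook's approach with the Rborist method (the grid values are suggestions, not requirements):
library(Rborist)
control <- trainControl(method = "cv", number = 5, p = 0.8)
grid <- expand.grid(minNode = c(1, 5), predFixed = c(10, 15, 25, 35, 50))
train_rf <- train(x[, col_index], y,
  method = "Rborist",
  nTree = 50,
  trControl = control,
  tuneGrid = grid,
  nSamp = 5000)
ggplot(train_rf)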
[Plot: cross-validation accuracy (about 0.926-0.932) versus the number of randomly selected predictors for the random forest model]
train_rf$bestTune
## predFixed minNode
## 1 10 1
fit_rf <- Rborist(x[, col_index], y,
  nTree = 1000,
  minNode = train_rf$bestTune$minNode,
  predFixed = train_rf$bestTune$predFixed)
## Accuracy
## 0.959
rafalib::mypar(3,4)
for(i in 1:12){
image(matrix(x_test[i,], 28, 28)[, 28:1],
main = paste("Our prediction:", y_hat_rf[i]),
xaxt="n", yaxt="n")
}
Variable Importance
There is a link to the relevant sections of the textbook: Variable importance and Visual assessments
Key points
• The Rborist package does not currently support variable importance calculations, but the random-
Forest package does.
• An important part of data science is visualizing results to determine why we are failing.
Code
x <- mnist$train$images[index,]
y <- factor(mnist$train$labels[index])
library(randomForest)
rf <- randomForest(x, y, ntree = 50)
imp <- importance(rf)
imp
## MeanDecreaseGini
## 1 0.0000
## 2 0.0000
## 3 0.0000
## 4 0.0000
## 5 0.0000
## 6 0.0000
## 7 0.0000
## 8 0.0000
## 9 0.0000
## 10 0.0000
## 11 0.0000
## 12 0.0000
## 13 0.0000
## 14 0.0000
## 15 0.0000
## 16 0.0000
## 17 0.0000
## 18 0.0000
## 19 0.0000
## 20 0.0000
## 21 0.0000
## 22 0.0000
## 23 0.0000
## 24 0.0000
## 25 0.0000
## 26 0.0000
## 27 0.0000
## 28 0.0000
## 29 0.0000
## 30 0.0000
## 31 0.0000
## 32 0.0000
## 33 0.0000
## 34 0.0000
## 35 0.0000
## 36 0.0000
## 37 0.0000
## 38 0.0000
## 39 0.0000
## 40 0.0000
## 41 0.0000
## 42 0.0000
## 43 0.0000
## 44 0.0000
## 45 0.0000
## 46 0.0000
## 47 0.0000
## 48 0.0000
## 49 0.0000
## 50 0.0000
## 51 0.0000
## 52 0.0000
## 53 0.0000
## 54 0.0000
## 55 0.0000
## 56 0.0000
## 57 0.0000
## 58 0.0000
## 59 0.0000
## 60 0.0000
## 61 0.0000
## 62 0.0000
## 63 0.0000
## 64 0.0000
## 65 0.0000
## 66 0.0000
## 67 0.0000
## 68 0.0000
## 69 0.0000
## 70 0.0200
## 71 0.0386
## 72 0.3364
## 73 0.4292
## 74 0.1083
## 75 0.1228
## 76 0.0000
## 77 0.0000
## 78 0.0359
## 79 0.0000
## 80 0.0000
## 81 0.0000
## 82 0.0000
## 83 0.0000
## 84 0.0000
## 85 0.0000
## 86 0.0000
## 87 0.0000
## 88 0.0000
## 89 0.0000
## 90 0.0000
## 91 0.0000
## 92 0.0000
## 93 0.0267
## 94 0.0702
## 95 0.0267
## 96 0.1533
## 97 0.5302
## 98 0.1691
## 99 0.1951
## 100 4.3825
## 101 3.7575
## 102 4.0716
## 103 1.4450
## 104 0.5788
## 105 0.0756
## 106 0.0300
## 107 0.0916
## 108 0.0000
## 109 0.0000
## 110 0.0000
## 111 0.0000
## 112 0.0000
## 113 0.0000
## 114 0.0000
## 115 0.0000
## 116 0.0000
## 117 0.0000
## 118 0.0000
## 119 0.0368
## 120 0.0958
## 121 0.0368
## 122 0.4054
## 123 0.1888
## 124 1.6623
## 125 1.0255
## 126 0.9706
## 127 0.9350
## 128 1.8896
## 129 2.3448
## 130 0.9726
## 131 0.7841
## 132 0.3058
## 133 0.2913
## 134 0.0611
## 135 0.4770
## 136 0.0000
## 137 0.0000
## 138 0.0000
## 139 0.0000
## 140 0.0000
## 141 0.0000
## 142 0.0000
## 143 0.0000
## 144 0.0000
## 145 0.0450
## 146 0.4217
## 147 0.1030
## 148 0.4381
## 149 0.2826
## 150 0.6646
## 151 1.4041
## 152 2.1603
## 153 3.1023
## 154 1.7377
## 155 2.9828
## 156 4.4697
## 157 4.6632
## 158 1.9789
## 159 1.1770
## 160 1.2593
## 161 1.1914
## 162 0.4314
## 163 0.9320
## 164 0.5088
## 165 0.0583
## 166 0.0000
## 167 0.0000
## 168 0.0000
## 169 0.0000
## 170 0.0000
## 171 0.0000
## 172 0.0337
## 173 0.0000
## 174 0.0467
## 175 0.0971
## 176 0.2638
## 177 0.8443
## 178 1.3889
## 179 2.3951
## 180 1.8932
## 181 3.7141
## 182 3.1491
## 183 2.5722
## 184 3.5550
## 185 3.7543
## 186 4.1136
## 187 1.2190
## 188 2.7119
## 189 1.3368
## 190 0.7848
## 191 0.5944
## 192 0.6998
## 193 0.0367
## 194 0.0000
## 195 0.0560
## 196 0.0000
## 197 0.0000
## 198 0.0000
## 199 0.0000
## 200 0.0000
## 201 0.0653
## 202 0.1618
## 203 0.2514
## 204 0.1467
## 205 0.7132
## 206 1.0696
## 207 1.8813
## 208 1.5488
## 209 1.6265
## 210 2.3821
## 211 4.1416
## 212 6.0898
## 213 2.8040
## 214 1.9544
## 215 2.9735
## 216 1.1595
## 217 1.2301
## 218 0.7179
## 219 0.8997
## 220 1.4020
## 221 0.8376
## 222 0.0376
## 223 0.0000
## 224 0.0000
## 225 0.0000
## 226 0.0000
## 227 0.0000
## 228 0.0000
## 229 0.1500
## 230 0.1951
## 231 0.6163
## 232 1.3442
## 233 0.8332
## 234 1.1122
## 235 3.0582
## 236 4.9129
## 237 3.2573
## 238 2.7814
## 239 2.9401
## 240 5.4603
## 241 3.9843
## 242 3.9568
## 243 1.1594
## 244 1.9290
## 245 1.5714
## 246 1.1573
## 247 0.9894
## 248 0.7398
## 249 0.2346
## 250 0.5157
## 251 0.0000
## 252 0.0000
## 253 0.0000
## 254 0.0000
## 255 0.0000
## 256 0.0000
## 257 0.0000
## 258 0.0722
## 259 0.6696
## 260 0.3971
## 261 1.1764
## 262 2.2870
## 263 2.6467
## 264 3.0094
## 265 5.8341
## 266 2.1984
## 267 3.1962
## 268 3.5770
## 269 2.7636
## 270 5.0814
## 271 4.8756
## 272 2.4102
## 273 2.2899
## 274 1.2372
## 275 0.3960
## 276 0.7806
## 277 0.2840
## 278 0.0000
## 279 0.0000
## 280 0.0000
## 281 0.0000
## 282 0.0000
## 283 0.0000
## 284 0.0000
## 285 0.1978
## 286 0.0691
## 287 0.8360
## 288 0.8459
## 289 0.9408
## 290 2.0882
## 291 4.3131
## 292 3.5580
## 293 3.2671
## 294 1.9374
## 295 1.9242
## 296 2.6329
## 297 3.0550
## 298 2.8851
## 299 3.3400
## 300 2.2500
## 301 2.8778
## 302 1.3096
## 303 0.5058
## 304 0.1055
## 305 0.1202
## 306 0.0000
## 307 0.0000
## 308 0.0000
## 309 0.0000
## 310 0.0000
## 311 0.0000
## 312 0.0000
## 313 0.0267
## 314 0.1652
## 315 1.0535
## 316 0.9770
## 317 1.1757
## 318 3.9662
## 319 7.4847
## 320 5.0866
## 321 3.2152
## 322 2.9141
## 323 3.5169
## 324 4.8595
## 325 3.6001
## 326 3.6972
## 327 2.4491
## 328 3.2116
## 329 1.3368
## 330 2.0959
## 331 0.6248
## 332 0.1734
## 333 0.1204
## 334 0.0000
## 335 0.0000
## 336 0.0000
## 337 0.0000
## 338 0.0000
## 339 0.0000
## 340 0.0669
## 341 0.0589
## 342 0.0710
## 343 0.7515
## 344 1.5224
## 345 2.9044
## 346 3.4698
## 347 2.9629
## 348 6.6917
## 349 2.8665
## 350 2.5272
## 351 5.2107
## 352 5.2579
## 353 2.5862
## 354 4.0516
## 355 3.9797
## 356 1.2102
## 357 1.9677
## 358 2.8926
## 359 2.4807
## 360 0.2659
## 361 0.0710
## 362 0.0000
## 363 0.0000
## 364 0.0000
## 365 0.0000
## 366 0.0000
## 367 0.0000
## 368 0.0000
## 369 0.0267
## 370 0.1961
## 371 0.6116
## 372 0.9917
## 373 2.6019
## 374 4.5573
## 375 5.0599
## 376 6.0905
## 377 5.3284
## 378 5.1077
## 379 9.6768
## 380 3.0461
## 381 4.7315
## 382 4.3859
## 383 4.5496
## 384 1.2225
## 385 2.1867
## 386 1.7976
## 387 1.3636
## 388 0.2294
## 389 0.0000
## 390 0.0000
## 391 0.0000
## 392 0.0000
## 393 0.0000
## 394 0.0000
## 395 0.0000
## 396 0.0000
## 397 0.2786
## 398 0.3010
## 399 1.2454
## 400 3.1789
## 401 4.4449
## 402 5.5182
## 403 4.3270
## 404 4.0243
## 405 4.0694
## 406 5.5033
## 407 6.6132
## 408 3.8076
## 409 5.1868
## 410 5.2291
## 411 4.3761
## 412 1.2487
## 413 1.6620
## 414 1.7047
## 415 3.3018
## 416 0.3135
## 417 0.0667
## 418 0.0000
## 419 0.0000
## 420 0.0000
## 421 0.0000
## 422 0.0000
## 423 0.0000
## 424 0.0200
## 425 0.1010
## 426 0.3706
## 427 0.8750
## 428 5.2063
## 429 3.6503
## 430 5.5588
## 431 6.5687
## 432 6.3710
## 433 3.7244
## 434 6.4584
## 435 3.8925
## 436 3.1450
## 437 4.6127
## 438 5.8932
## 439 3.6514
## 440 1.8678
## 441 0.7452
## 442 2.3169
## 443 1.7684
## 444 0.3237
## 445 0.0000
## 446 0.0000
## 447 0.0000
## 448 0.0000
## 449 0.0000
## 450 0.0000
## 451 0.0000
## 452 0.0384
## 453 0.0814
## 454 0.5199
## 455 0.5373
## 456 5.9110
## 457 2.8719
## 458 4.4087
## 459 2.8772
## 460 2.8043
## 461 4.5564
## 462 9.2761
## 463 3.5203
## 464 3.9495
## 465 3.0245
## 466 3.5809
## 467 2.6407
## 468 2.9175
## 469 1.9749
## 470 2.2785
## 471 0.5547
## 472 0.2392
## 473 0.1860
## 474 0.0200
## 475 0.0000
## 476 0.0000
## 477 0.0000
## 478 0.0000
## 479 0.0000
## 480 0.0383
## 481 0.0387
## 482 0.4292
## 483 1.6728
## 484 2.5022
## 485 0.4138
## 486 2.9169
## 487 3.0419
## 488 4.1365
## 489 7.1352
## 490 4.9019
## 491 2.8327
## 492 2.5211
## 493 1.7125
## 494 2.7378
## 495 2.8248
## 496 2.0614
## 497 2.3113
## 498 0.9727
## 499 1.6279
## 500 0.5343
## 501 0.3333
## 502 0.0000
## 503 0.0000
## 504 0.0000
## 505 0.0000
## 506 0.0000
## 507 0.0000
## 508 0.0676
## 509 0.2275
## 510 0.2708
## 511 2.4200
## 512 2.5823
## 513 3.0054
## 514 3.4622
## 515 4.5320
## 516 6.1263
## 517 2.3824
## 518 3.3455
## 519 1.9886
## 520 2.9348
## 521 1.1133
## 522 1.4845
## 523 3.0486
## 524 1.7594
## 525 2.0075
## 526 1.0956
## 527 0.7642
## 528 0.5527
## 529 0.0702
## 530 0.0000
## 531 0.0000
## 532 0.0000
## 533 0.0000
## 534 0.0000
## 535 0.0000
## 536 0.0000
## 537 0.1836
## 538 0.8058
## 539 3.7220
## 540 5.5971
## 541 1.8936
## 542 2.1503
## 543 5.3189
## 544 3.1706
## 545 2.5217
## 546 2.2154
## 547 1.6559
## 548 2.3495
## 549 0.9677
## 550 2.5048
## 551 2.7026
## 552 1.4848
## 553 1.0656
## 554 0.5196
## 555 0.4745
## 556 0.5605
## 557 0.1946
## 558 0.0000
## 559 0.0000
## 560 0.0000
## 561 0.0000
## 562 0.0000
## 563 0.0000
## 564 0.0000
## 565 0.0360
## 566 0.7484
## 567 2.0237
## 568 4.3082
## 569 3.1404
## 570 4.0156
## 571 3.2594
## 572 3.2163
## 573 3.2371
## 574 2.6207
## 575 1.3211
## 576 1.4396
## 577 1.4215
## 578 2.6131
## 579 2.1551
## 580 1.6976
## 581 0.4295
## 582 0.7656
## 583 0.1415
## 584 0.1012
## 585 0.0653
## 586 0.1405
## 587 0.0000
## 588 0.0000
## 589 0.0000
## 590 0.0000
## 591 0.0000
## 592 0.0000
## 593 0.3101
## 594 0.8712
## 595 1.2101
## 596 1.5286
## 597 3.0302
## 598 3.8308
## 599 3.8574
## 600 1.4988
## 601 1.4851
## 602 2.2346
## 603 1.6009
## 604 1.5888
## 605 1.7945
## 606 1.9097
## 607 1.8448
## 608 0.7688
## 609 1.4031
## 610 0.4461
## 611 0.1067
## 612 0.2739
## 613 0.0000
## 614 0.0000
## 615 0.0000
## 616 0.0000
## 617 0.0000
## 618 0.0000
## 619 0.0000
## 620 0.0390
## 621 0.1751
## 622 0.1036
## 623 1.4516
## 624 2.0503
## 625 1.8557
## 626 4.5113
## 627 2.0373
## 628 1.6867
## 629 2.8683
## 630 2.0734
## 631 1.8517
## 632 2.4817
## 633 1.4786
## 634 1.3862
## 635 1.1019
## 636 1.0241
## 637 0.4047
## 638 0.3250
## 639 0.0655
## 640 0.0000
## 641 0.0400
## 642 0.0000
## 643 0.0000
## 644 0.0000
## 645 0.0000
## 646 0.0000
## 647 0.0000
## 648 0.0000
## 649 0.0000
## 650 0.0360
## 651 0.5241
## 652 0.7703
## 653 1.3069
## 654 2.9215
## 655 1.3210
## 656 4.7766
## 657 3.5148
## 658 3.5579
## 659 2.7827
## 660 2.0031
## 661 1.1806
## 662 0.6780
## 663 0.4173
## 664 0.5286
## 665 0.0000
## 666 0.0840
## 667 0.1122
## 668 0.1322
## 669 0.0644
## 670 0.0000
## 671 0.0000
## 672 0.0000
## 673 0.0000
## 674 0.0000
## 675 0.0000
## 676 0.0000
## 677 0.0923
## 678 0.1728
## 679 0.2596
## 680 0.2985
## 681 0.2241
## 682 0.5979
## 683 1.1140
## 684 1.2162
## 685 1.9263
## 686 0.9836
## 687 1.6218
## 688 0.6831
## 689 0.4048
## 690 0.4089
## 691 0.4024
## 692 0.0845
## 693 0.1489
## 694 0.0533
## 695 0.0000
## 696 0.0394
## 697 0.0000
## 698 0.0000
## 699 0.0000
## 700 0.0000
## 701 0.0000
## 702 0.0000
## 703 0.0000
## 704 0.0000
## 705 0.0000
## 706 0.0378
## 707 0.0745
## 708 0.0460
## 709 0.0400
## 710 0.8688
## 711 0.5995
## 712 1.3124
## 713 0.3276
## 714 2.1420
## 715 0.5888
## 716 0.1989
## 717 0.6024
## 718 0.1311
## 719 0.1512
## 720 0.0356
## 721 0.0000
## 722 0.0000
## 723 0.1434
## 724 0.0000
## 725 0.0000
## 726 0.0000
## 727 0.0000
## 728 0.0000
## 729 0.0000
## 730 0.0000
## 731 0.0000
## 732 0.0000
## 733 0.0000
## 734 0.0000
## 735 0.0000
## 736 0.0367
## 737 0.0000
## 738 0.2851
## 739 0.5083
## 740 0.2420
## 741 0.0676
## 742 0.0320
## 743 0.0709
## 744 0.2129
## 745 0.0382
## 746 0.0350
## 747 0.0326
## 748 0.0000
## 749 0.0000
## 750 0.0393
## 751 0.0000
## 752 0.0000
## 753 0.0000
## 754 0.0000
## 755 0.0000
## 756 0.0000
## 757 0.0000
## 758 0.0000
## 759 0.0000
## 760 0.0000
## 761 0.0000
## 762 0.0000
## 763 0.0000
## 764 0.0000
## 765 0.0000
## 766 0.0000
## 767 0.0000
## 768 0.0000
## 769 0.0000
## 770 0.0000
## 771 0.0371
## 772 0.0000
## 773 0.0000
## 774 0.0000
## 775 0.0000
## 776 0.0000
## 777 0.0000
## 778 0.0000
## 779 0.0000
## 780 0.0000
## 781 0.0000
## 782 0.0000
## 783 0.0000
## 784 0.0000
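To see where the influential pixels fall on the image grid, the importance vector can be plotted as a 28 x 28 image (a sketch):
rafalib::mypar()
image(matrix(imp, 28, 28))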
[Plot: image of the importance of each pixel, arranged in the 28 x 28 grid]
[Plots: examples of misclassified test digits, with panel titles showing the predicted probability and the true label: Pr(3)=1 but is a 8, Pr(0)=1 but is a 2, Pr(1)=1 but is a 6, Pr(1)=1 but is a 7, Pr(2)=0.69 but is a 7, Pr(7)=0.68 but is a 2, Pr(9)=0.65 but is a 4, Pr(0)=0.62 but is a 9]
Ensembles
• Ensembles combine multiple machine learning algorithms into one model to improve predictions.
Code
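The ensemble code was not captured above; a minimal sketch of the idea, assuming p_rf and p_knn hold the predicted class probabilities (one column per digit, ordered 0-9) for the test set from the random forest and kNN fits:
p <- (p_rf + p_knn) / 2                      # average the class probabilities of the two models
y_pred <- factor(apply(p, 1, which.max) - 1) # columns are digits 0-9
confusionMatrix(y_pred, y_test)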
## 6 0 0 1 0 1 0 101 0 0 0
## 7 0 0 1 2 0 0 0 102 0 0
## 8 0 0 0 1 0 0 0 0 81 0
## 9 0 0 0 0 5 0 0 2 1 102
##
## Overall Statistics
##
## Accuracy : 0.959
## 95% CI : (0.945, 0.97)
## No Information Rate : 0.121
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.954
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 1.000 1.000 0.962 0.929 0.936 0.971
## Specificity 0.998 0.993 0.996 0.993 0.998 0.992
## Pos Pred Value 0.981 0.953 0.962 0.929 0.981 0.907
## Neg Pred Value 1.000 1.000 0.996 0.993 0.992 0.998
## Prevalence 0.102 0.121 0.106 0.084 0.109 0.070
## Detection Rate 0.102 0.121 0.102 0.078 0.102 0.068
## Detection Prevalence 0.104 0.127 0.106 0.084 0.104 0.075
## Balanced Accuracy 0.999 0.997 0.979 0.961 0.967 0.982
## Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity 0.990 0.927 0.890 0.971
## Specificity 0.998 0.997 0.999 0.991
## Pos Pred Value 0.981 0.971 0.988 0.927
## Neg Pred Value 0.999 0.991 0.989 0.997
## Prevalence 0.102 0.110 0.091 0.105
## Detection Rate 0.101 0.102 0.081 0.102
## Detection Prevalence 0.103 0.105 0.082 0.110
## Balanced Accuracy 0.994 0.962 0.945 0.981
1. Use the training set to build a model with several of the models available from the caret package. We
will test out 10 of the most common machine learning models in this exercise:
models <- c("glm", "lda", "naive_bayes", "svmLinear", "knn", "gamLoess", "multinom", "qda", "rf", "adaboost")
Apply all of these models using train() with all the default parameters. You may need to install some
packages. Keep in mind that you will probably get some warnings. Also, it will probably take a while to
train all of the models - be patient!
Run the following code to train the various models:
## Warning in set.seed(1, sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
data("mnist_27")
## [1] "glm"
## [1] "lda"
## [1] "naive_bayes"
## [1] "svmLinear"
## [1] "knn"
## [1] "gamLoess"
## Warning in gam.lo(data[["lo(x_2, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x_2, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## upperlimit 0.54068
## Warning in gam.lo(data[["lo(x_1, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## eval 0.46667
## Warning in gam.lo(data[["lo(x_2, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x_2, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## lowerlimit 0.10703
## Warning in gam.lo(data[["lo(x_1, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## eval 0.40426
## Warning in gam.lo(data[["lo(x_2, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## extrapolation not allowed with blending
## Warning in gam.lo(data[["lo(x_2, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## lowerlimit 0.10877
## Warning in gam.lo(data[["lo(x_2, span = 0.5, degree = 1)"]], z, w, span = 0.5, :
## eval 0.57895
## [1] "multinom"
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 384.794809
## final value 384.794775
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 421.251454
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 384.848555
## final value 384.848522
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 358.466023
## final value 358.466014
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 400.257332
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 358.528966
## final value 358.528958
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 345.361326
## final value 345.361319
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 389.162400
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 345.427631
## final value 345.427624
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 370.819967
## iter 10 value 370.819967
## iter 10 value 370.819967
## final value 370.819967
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 411.520894
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 370.881269
## iter 10 value 370.881269
## iter 10 value 370.881269
## final value 370.881269
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 338.339240
## final value 337.642174
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 389.552735
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 337.725860
## final value 337.725851
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 362.651997
## iter 10 value 362.651996
## iter 10 value 362.651996
## final value 362.651996
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 404.947235
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 362.716896
## iter 10 value 362.716895
## iter 10 value 362.716894
## final value 362.716894
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 353.360649
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 396.615883
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 353.427369
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 331.505876
## final value 331.505837
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 382.233327
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 331.587049
## final value 331.587010
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 364.158073
## iter 10 value 364.158073
## iter 10 value 364.158073
## final value 364.158073
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 400.438283
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 364.210111
## iter 10 value 364.210111
## iter 10 value 364.210111
## final value 364.210111
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 343.760429
## final value 343.760410
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 387.083157
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 343.826126
## final value 343.826108
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 377.277862
## iter 10 value 377.277862
## iter 10 value 377.277861
## final value 377.277861
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 413.479657
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 377.330740
## iter 10 value 377.330739
## iter 10 value 377.330738
## final value 377.330738
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 363.527477
## final value 363.527449
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 405.904614
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 363.591426
## final value 363.591399
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 346.706756
## iter 10 value 346.706754
## iter 10 value 346.706754
## final value 346.706754
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 393.064300
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 346.778579
## iter 10 value 346.778577
## iter 10 value 346.778577
## final value 346.778577
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 350.308158
## final value 350.308124
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 394.686750
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 350.376208
## final value 350.376174
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 365.423988
## final value 365.423967
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 407.046095
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 365.486830
## final value 365.486809
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 375.942875
## final value 375.942868
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 412.738783
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 375.996860
## final value 375.996853
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 369.004020
## final value 369.003531
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 407.374841
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 369.060934
## final value 369.060455
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 360.551961
## iter 10 value 360.551959
## iter 10 value 360.551959
## final value 360.551959
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 400.866217
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 360.611945
## iter 10 value 360.611943
## iter 10 value 360.611943
## final value 360.611943
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 370.467778
## final value 370.414135
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 406.680836
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 370.519928
## final value 370.466715
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 355.236387
## final value 355.236347
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 401.370189
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 355.308279
## final value 355.308240
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 364.714111
## final value 364.714051
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 407.312950
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 364.779508
## final value 364.779448
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 347.812292
## final value 347.812150
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 389.764148
## iter 10 value 389.764145
## iter 10 value 389.764145
## final value 389.764145
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 347.875247
## final value 347.875105
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 319.870357
## final value 319.870338
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 372.994080
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 319.955663
## final value 319.955644
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 312.576095
## final value 312.576064
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 367.284329
## iter 10 value 367.284329
## iter 10 value 367.284329
## final value 367.284329
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 312.666550
## final value 312.666520
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 363.313712
## iter 10 value 363.313712
## iter 10 value 363.313712
## final value 363.313712
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## final value 403.175943
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 363.373575
## iter 10 value 363.373575
## iter 10 value 363.373575
## final value 363.373575
## converged
## # weights: 4 (3 variable)
## initial value 554.517744
## iter 10 value 358.900453
## iter 10 value 358.900452
## iter 10 value 358.900452
## final value 358.900452
## converged
## [1] "qda"
## [1] "rf"
## note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .
##
## [1] "adaboost"
Did you train all of the models?
⊠ A. Yes
□ B. No
2. Now that you have all the trained models in a list, use sapply() or map() to create a matrix of
predictions for the test set. You should end up with a matrix with length(mnist_27$test$y) rows
and length(models) columns.
pred <- sapply(fits, function(object)
predict(object, newdata = mnist_27$test))
dim(pred)
## [1] 200 10
3. Now compute the accuracy for each model on the test set and report the mean accuracy across all models.
acc <- colMeans(pred == mnist_27$test$y)
mean(acc)
## [1] 0.789
4. Next, build an ensemble prediction by majority vote and compute the accuracy of the ensemble. Vote
7 if more than 50% of the models are predicting a 7, and 2 otherwise.
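A sketch of the majority vote (y_hat is an illustrative name):
votes <- rowMeans(pred == "7")
y_hat <- ifelse(votes > 0.5, "7", "2")
mean(y_hat == mnist_27$test$y)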
## [1] 0.815
5. In Q3, we computed the accuracy of each method on the test set and noticed that the individual
accuracies varied. How many of the individual methods do better than the ensemble?
ind <- acc > mean(y_hat == mnist_27$test$y)
sum(ind)
## [1] 3
Which individual methods perform better than the ensemble?
models[ind]
□ A. glm
□ B. lda
□ C. naive_bayes
□ D. svmLinear
⊠ E. knn
⊠ F. gamLoess
□ G. multinom
⊠ H. qda
□ I. rf
□ J. adaboost
6. It is tempting to remove the methods that do not perform well and re-do the ensemble. The problem
with this approach is that we are using the test data to make a decision. However, we could use the
minimum accuracy estimates obtained from cross validation with the training data for each model
from fit$results$Accuracy. Obtain these estimates and save them in an object. Report the mean
of these training set accuracy estimates.
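A sketch of that computation (acc_hat is an illustrative name):
acc_hat <- sapply(fits, function(fit) min(fit$results$Accuracy))
mean(acc_hat)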
## [1] 0.809
7. Now let’s only consider the methods with an estimated accuracy of greater than or equal to 0.8 when
constructing the ensemble. Vote 7 if 50% or more of the models are predicting a 7, and 2 otherwise.
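A sketch of the restricted ensemble, reusing acc_hat from Q6:
ind <- acc_hat >= 0.8
votes <- rowMeans(pred[, ind] == "7")
y_hat <- ifelse(votes >= 0.5, "7", "2")
mean(y_hat == mnist_27$test$y)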
## [1] 0.825
Recommendation Systems
• https://bits.blogs.nytimes.com/2009/09/21/netflix-awards-1-million-prize-and-starts-a-new-contest/
• http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
• https://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
Key points
• Recommendation systems are more complicated machine learning challenges because each outcome
has a different set of predictors. For example, different users rate a different number of movies and
rate different movies.
• To compare different models or to see how well we’re doing compared to a baseline, we will use root
mean squared error (RMSE) as our loss function. We can interpret RMSE similar to standard
deviation.
• If $N$ is the number of user-movie combinations, $y_{u,i}$ is the rating for movie $i$ by user $u$, and
$\hat{y}_{u,i}$ is our prediction, then RMSE is defined as follows:
$$\mbox{RMSE} = \sqrt{\frac{1}{N} \sum_{u,i} \left(\hat{y}_{u,i} - y_{u,i}\right)^2}$$
Code
data("movielens")
head(movielens)
movielens %>%
summarize(n_users = n_distinct(userId),
n_movies = n_distinct(movieId))
## n_users n_movies
## 1 671 9066
## Selecting by n
userId  Forrest Gump  Pulp Fiction  Shawshank Redemption, The  Silence of the Lambs, The  Star Wars: Episode IV - A New Hope
13      5.0           3.5           4.5                        NA                         NA
15      1.0           5.0           2.0                        5.0                        5.0
16      NA            NA            4.0                        NA                         NA
17      2.5           5.0           5.0                        4.5                        3.5
18      NA            NA            NA                         NA                         3.0
19      5.0           5.0           4.0                        3.0                        4.0
20      2.0           0.5           4.5                        0.5                        1.5
[Plot: image of a sample of the user x movie rating matrix, showing how sparsely it is populated]
movielens %>%
dplyr::count(movieId) %>%
ggplot(aes(n)) +
geom_histogram(bins = 30, color = "black") +
scale_x_log10() +
ggtitle("Movies")
[Plot: histogram of the number of ratings per movie, on a log10 scale]
movielens %>%
dplyr::count(userId) %>%
ggplot(aes(n)) +
geom_histogram(bins = 30, color = "black") +
scale_x_log10() +
ggtitle("Users")
[Plot: histogram of the number of ratings per user, on a log10 scale]
library(caret)
set.seed(755)
test_index <- createDataPartition(y = movielens$rating, times = 1,
p = 0.2, list = FALSE)
train_set <- movielens[-test_index,]
test_set <- movielens[test_index,]
There is a link to the relevant sections of the textbook: A first model,Modeling movie effects and User effects
Key points
• We start with a model that assumes the same rating for all movies and all users, with all the
differences explained by random variation: If 𝜇 represents the true rating for all movies and users and
𝜖 represents independent errors sampled from the same distribution centered at zero, then:
𝑌𝑢,𝑖 = 𝜇 + 𝜖𝑢,𝑖
• In this case, the least squares estimate of 𝜇 — the estimate that minimizes the root mean squared
error — is the average rating of all movies across all users.
• We can improve our model by adding a term, 𝑏𝑖 , that represents the average rating for movie 𝑖:
𝑌𝑢,𝑖 = 𝜇 + 𝑏𝑖 + 𝜖𝑢,𝑖
𝑏𝑖 is the average of 𝑌𝑢,𝑖 minus the overall mean for each movie 𝑖.
We can further improve our model by adding 𝑏𝑢 , the user-specific effect:
𝑌𝑢,𝑖 = 𝜇 + 𝑏𝑖 + 𝑏𝑢 + 𝜖𝑢,𝑖
• Note that because there are thousands of 𝑏’s, the lm() function will be very slow or cause R to crash,
so we don’t recommend using linear regression to calculate these effects.
Code
## [1] 3.54
## [1] 1.05
## [1] 1.49
movie_avgs %>% qplot(b_i, geom ="histogram", bins = 10, data = ., color = I("black"))
[Plot: histogram of the estimated movie effects b_i]
method RMSE
Just the average 1.048
Movie Effect Model 0.986
train_set %>%
  group_by(userId) %>%
  filter(n() >= 100) %>%
  summarize(b_u = mean(rating)) %>%
  ggplot(aes(b_u)) +
  geom_histogram(bins = 30, color = "black")
[Plot: histogram of the average rating b_u for users who have rated 100 or more movies]
method RMSE
Just the average 1.048
Movie Effect Model 0.986
Movie + User Effects Model 0.885
Comprehension Check - Recommendation Systems
The following exercises all work with the movielens data, which can be loaded using the following code:
data("movielens")
1. Compute the number of ratings for each movie and then plot it against the year the movie came out
using a boxplot for each year. Use the square root transformation on the y-axis (number of ratings)
when creating your plot.
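One way to produce the plot summarized below (a sketch):
movielens %>% group_by(movieId) %>%
  summarize(n = n(), year = as.character(first(year))) %>%
  qplot(year, n, data = ., geom = "boxplot") +
  coord_trans(y = "sqrt") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))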
[Plot: boxplots of the number of ratings per movie (square-root scale on the y-axis) by the year the movie came out, 1902-2016]
2. We see that, on average, movies that came out after 1993 get more ratings. We also see that with
newer movies, starting in 1993, the number of ratings decreases with year: the more recent a movie is,
the less time users have had to rate it.
Among movies that came out in 1993 or later, select the top 25 movies with the highest average number of
ratings per year (n/year), and calculate the average rating of each of them. To calculate the number of
ratings per year, use 2018 as the end year.
What is the average rating for the movie The Shawshank Redemption?
What is the average number of ratings per year for the movie Forrest Gump?
movielens %>%
filter(year >= 1993) %>%
group_by(movieId) %>%
summarize(n = n(), years = 2018 - first(year),
title = title[1],
rating = mean(rating)) %>%
mutate(rate = n/years) %>%
top_n(25, rate) %>%
arrange(desc(rate))
## # A tibble: 25 x 6
## movieId n years title rating rate
## <int> <int> <dbl> <chr> <dbl> <dbl>
## 1 356 341 24 Forrest Gump 4.05 14.2
## 2 79132 111 8 Inception 4.05 13.9
## 3 2571 259 19 Matrix, The 4.18 13.6
## 4 296 324 24 Pulp Fiction 4.26 13.5
## 5 318 311 24 Shawshank Redemption, The 4.49 13.0
## 6 58559 121 10 Dark Knight, The 4.24 12.1
## 7 4993 200 17 Lord of the Rings: The Fellowship of the Ri~ 4.18 11.8
## 8 5952 188 16 Lord of the Rings: The Two Towers, The 4.06 11.8
## 9 7153 176 15 Lord of the Rings: The Return of the King, ~ 4.13 11.7
## 10 2858 220 19 American Beauty 4.24 11.6
## # ... with 15 more rows
3. From the table constructed in Q2, we can see that the most frequently rated movies tend to have above
average ratings. This is not surprising: more people watch popular movies. To confirm this, stratify
the post-1993 movies by ratings per year and compute their average ratings. To calculate number of
ratings per year, use 2018 as the end year. Make a plot of average rating versus ratings per year and
show an estimate of the trend.
movielens %>%
filter(year >= 1993) %>%
group_by(movieId) %>%
summarize(n = n(), years = 2018 - first(year),
title = title[1],
rating = mean(rating)) %>%
mutate(rate = n/years) %>%
ggplot(aes(rate, rating)) +
geom_point() +
geom_smooth()
## `summarise()` ungrouping output (override with `.groups` argument)
[Plot: average rating versus ratings per year (rate), with a smooth trend estimate]
□ A. There is no relationship between how often a movie is rated and its average rating.
□ B. Movies with very few and very many ratings have the highest average ratings.
⊠ C. The more often a movie is rated, the higher its average rating.
□ D. The more often a movie is rated, the lower its average rating.
4. Suppose you are doing a predictive analysis in which you need to fill in the missing ratings with some
value.
Given your observations in the exercise in Q3, which of the following strategies would be most appropriate?
□ A. Fill in the missing values with the average rating across all movies.
□ B. Fill in the missing values with 0.
⊠ C. Fill in the missing values with a lower value than the average rating across all movies.
□ D. Fill in the missing values with a higher value than the average rating across all movies.
□ E. None of the above.
5. The movielens dataset also includes a time stamp. This variable represents the time and date on which
the rating was provided. The units are seconds since January 1, 1970. Create a new column date with the date.
library(lubridate)
movielens <- mutate(movielens, date = as_datetime(timestamp))
6. Compute the average rating for each week and plot this average against date. Hint: use the
round_date() function before you group_by().
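A sketch of this computation (it assumes the lubridate package is loaded for round_date()):
movielens %>%
  mutate(date = round_date(date, unit = "week")) %>%
  group_by(date) %>%
  summarize(rating = mean(rating)) %>%
  ggplot(aes(date, rating)) +
  geom_point() +
  geom_smooth()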
[Plot: average rating per week plotted against date, with a smooth trend]
□ A. There is very strong evidence of a time effect on average rating.
⊠ B. There is some evidence of a time effect on average rating.
□ C. There is no evidence of a time effect on average rating (straight horizontal line).
7. If we define 𝑑𝑢,𝑖 as the day for user 𝑢's rating of movie 𝑖, which of the following models is most appropriate?
8. The movielens data also has a genres column. This column includes every genre that applies to the
movie. Some movies fall under several genres. Define a category as whatever combination appears
in this column. Keep only categories with more than 1,000 ratings. Then compute the average and
standard error for each category. Plot these as error bar plots.
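A sketch of the computation behind the plot summarized below:
movielens %>%
  group_by(genres) %>%
  summarize(n = n(), avg = mean(rating), se = sd(rating)/sqrt(n())) %>%
  filter(n >= 1000) %>%
  mutate(genres = reorder(genres, avg)) %>%
  ggplot(aes(x = genres, y = avg, ymin = avg - 2*se, ymax = avg + 2*se)) +
  geom_point() +
  geom_errorbar() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))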
[Plot: average rating with standard error bars for each genre combination with more than 1,000 ratings, ordered from the lowest average (Comedy) to the highest (Drama|War)]
9. The plot you generated in Q8 shows strong evidence of a genre effect. Consider this plot as you answer
the following question.
If we define 𝑔𝑢,𝑖 as the genre for user 𝑢’s rating of movie 𝑖, which of the following models is most appropriate?
Regularization
• To improve our results, we will use regularization. Regularization constrains the total variability of
the effect sizes by penalizing large estimates that come from small sample sizes.
• To estimate the 𝑏’s, we will now minimize this equation, which contains a penalty term:
$$\frac{1}{N} \sum_{u,i} \left(y_{u,i} - \mu - b_i\right)^2 + \lambda \sum_i b_i^2$$
The first term is the mean squared error and the second is a penalty term that gets larger when many 𝑏’s
are large.
The values of $b_i$ that minimize this equation are given by:
$$\hat{b}_i(\lambda) = \frac{1}{\lambda + n_i} \sum_{u=1}^{n_i} \left(Y_{u,i} - \hat{\mu}\right),$$
where $n_i$ is the number of ratings for movie $i$.
• The larger 𝜆 is, the more we shrink. 𝜆 is a tuning parameter, so we can use cross-validation to
choose it. We should be using full cross-validation on just the training set, without using the test set
until the final assessment.
• We can also use regularization to estimate the user effect. We will now minimize this equation:
$$\frac{1}{N} \sum_{u,i} \left(y_{u,i} - \mu - b_i - b_u\right)^2 + \lambda \left(\sum_i b_i^2 + \sum_u b_u^2\right)$$
Code
data("movielens")
set.seed(755)
test_index <- createDataPartition(y = movielens$rating, times = 1,
p = 0.2, list = FALSE)
train_set <- movielens[-test_index,]
test_set <- movielens[test_index,]
test_set <- test_set %>%
semi_join(train_set, by = "movieId") %>%
semi_join(train_set, by = "userId")
RMSE <- function(true_ratings, predicted_ratings){
sqrt(mean((true_ratings - predicted_ratings)^2))
}
mu_hat <- mean(train_set$rating)
naive_rmse <- RMSE(test_set$rating, mu_hat)
rmse_results <- data_frame(method = "Just the average", RMSE = naive_rmse)
mu <- mean(train_set$rating)
movie_avgs <- train_set %>%
group_by(movieId) %>%
summarize(b_i = mean(rating - mu))
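The user effects used in the prediction below are estimated from the training set; a sketch following the same approach as movie_avgs:
user_avgs <- train_set %>%
  left_join(movie_avgs, by = "movieId") %>%
  group_by(userId) %>%
  summarize(b_u = mean(rating - mu - b_i))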
predicted_ratings <- test_set %>%
  left_join(movie_avgs, by = "movieId") %>%
  left_join(user_avgs, by = "userId") %>%
  mutate(pred = mu + b_i + b_u) %>%
  .$pred
model_2_rmse <- RMSE(predicted_ratings, test_set$rating)
rmse_results <- bind_rows(rmse_results,
data_frame(method="Movie + User Effects Model",
RMSE = model_2_rmse ))
test_set %>%
left_join(movie_avgs, by='movieId') %>%
mutate(residual = rating - (mu + b_i)) %>%
arrange(desc(abs(residual))) %>%
dplyr::select(title, residual) %>% slice(1:10) %>% knitr::kable()
title residual
Day of the Beast, The (Día de la Bestia, El) 4.50
Horror Express -4.00
No Holds Barred 4.00
Dear Zachary: A Letter to a Son About His Father -4.00
Faust -4.00
Hear My Song -4.00
Confessions of a Shopaholic -4.00
Twilight Saga: Breaking Dawn - Part 1, The -4.00
Taxi Driver -3.81
Taxi Driver -3.81
title b_i
Lamerica 1.46
Love & Human Remains 1.46
Enfer, L’ 1.46
Picture Bride (Bijo photo) 1.46
Red Firecracker, Green Firecracker (Pao Da Shuang Deng) 1.46
Faces 1.46
Maya Lin: A Strong Clear Vision 1.46
Heavy 1.46
Gate of Heavenly Peace, The 1.46
Death in the Garden (Mort en ce jardin, La) 1.46
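These tables join the estimated movie effects to titles; a sketch of the lookup table they rely on:
movie_titles <- movielens %>%
  dplyr::select(movieId, title) %>%
  distinct()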
movie_avgs %>% left_join(movie_titles, by = "movieId") %>%
  arrange(b_i) %>%
  dplyr::select(title, b_i) %>%
  slice(1:10) %>%
  knitr::kable()
title b_i
Santa with Muscles -3.04
BAP*S -3.04
3 Ninjas: High Noon On Mega Mountain -3.04
Barney’s Great Adventure -3.04
Merry War, A -3.04
Day of the Beast, The (Día de la Bestia, El) -3.04
Children of the Corn III -3.04
Whiteboyz -3.04
Catfish in Black Bean Sauce -3.04
Watcher, The -3.04
## Joining, by = "movieId"
title b_i n
Lamerica 1.46 1
Love & Human Remains 1.46 3
Enfer, L’ 1.46 1
Picture Bride (Bijo photo) 1.46 1
Red Firecracker, Green Firecracker (Pao Da Shuang Deng) 1.46 3
Faces 1.46 1
Maya Lin: A Strong Clear Vision 1.46 2
Heavy 1.46 1
Gate of Heavenly Peace, The 1.46 1
Death in the Garden (Mort en ce jardin, La) 1.46 1
## Joining, by = "movieId"
title b_i n
Santa with Muscles -3.04 1
BAP*S -3.04 1
3 Ninjas: High Noon On Mega Mountain -3.04 1
Barney’s Great Adventure -3.04 1
Merry War, A -3.04 1
Day of the Beast, The (Día de la Bestia, El) -3.04 1
Children of the Corn III -3.04 1
Whiteboyz -3.04 1
Catfish in Black Bean Sauce -3.04 1
Watcher, The -3.04 1
lambda <- 3
mu <- mean(train_set$rating)
movie_reg_avgs <- train_set %>%
group_by(movieId) %>%
summarize(b_i = sum(rating - mu)/(n()+lambda), n_i = n())
data_frame(original = movie_avgs$b_i,
  regularized = movie_reg_avgs$b_i,
  n = movie_reg_avgs$n_i) %>%
  ggplot(aes(original, regularized, size = sqrt(n))) +
  geom_point(shape = 1, alpha = 0.5)
[Plot: regularized movie effect estimates versus the original least squares estimates, with point size proportional to sqrt(n)]
train_set %>%
dplyr::count(movieId) %>%
left_join(movie_reg_avgs) %>%
left_join(movie_titles, by="movieId") %>%
arrange(desc(b_i)) %>%
dplyr::select(title, b_i, n) %>%
slice(1:10) %>%
knitr::kable()
## Joining, by = "movieId"
title b_i n
All About Eve 0.927 26
Shawshank Redemption, The 0.921 240
Godfather, The 0.897 153
Godfather: Part II, The 0.871 100
Maltese Falcon, The 0.860 47
Best Years of Our Lives, The 0.859 11
On the Waterfront 0.847 23
Face in the Crowd, A 0.833 4
African Queen, The 0.832 36
All Quiet on the Western Front 0.824 11
train_set %>%
dplyr::count(movieId) %>%
left_join(movie_reg_avgs) %>%
left_join(movie_titles, by="movieId") %>%
arrange(b_i) %>%
dplyr::select(title, b_i, n) %>%
slice(1:10) %>%
knitr::kable()
## Joining, by = "movieId"
title b_i n
Battlefield Earth -2.06 14
Joe’s Apartment -1.78 7
Speed 2: Cruise Control -1.69 20
Super Mario Bros. -1.60 13
Police Academy 6: City Under Siege -1.57 10
After Earth -1.52 4
Disaster Movie -1.52 3
Little Nicky -1.51 17
Cats & Dogs -1.47 6
Blade: Trinity -1.46 11
predicted_ratings <- test_set %>%
  left_join(movie_reg_avgs, by = "movieId") %>%
  mutate(pred = mu + b_i) %>%
  .$pred
method RMSE
Just the average 1.048
Movie Effect Model 0.986
Movie + User Effects Model 0.885
Regularized Movie Effect Model 0.965
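The lambda search that produces the plot and result below was not captured; a sketch following the textbook's approach (choosing lambda on the test set here only for illustration):
lambdas <- seq(0, 10, 0.25)
mu <- mean(train_set$rating)
just_the_sum <- train_set %>%
  group_by(movieId) %>%
  summarize(s = sum(rating - mu), n_i = n())
rmses <- sapply(lambdas, function(l){
  predicted_ratings <- test_set %>%
    left_join(just_the_sum, by = "movieId") %>%
    mutate(b_i = s/(n_i + l)) %>%
    mutate(pred = mu + b_i) %>%
    .$pred
  return(RMSE(predicted_ratings, test_set$rating))
})
qplot(lambdas, rmses)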
[Plot: RMSE (about 0.965-0.985) versus lambda for the regularized movie effect model]
lambdas[which.min(rmses)]
## [1] 3
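The corresponding search for the movie + user model, which generates the `summarise()` messages and the plot below, was not captured; a sketch following the textbook's approach:
lambdas <- seq(0, 10, 0.25)
rmses <- sapply(lambdas, function(l){
  mu <- mean(train_set$rating)
  b_i <- train_set %>%
    group_by(movieId) %>%
    summarize(b_i = sum(rating - mu)/(n() + l))
  b_u <- train_set %>%
    left_join(b_i, by = "movieId") %>%
    group_by(userId) %>%
    summarize(b_u = sum(rating - b_i - mu)/(n() + l))
  predicted_ratings <- test_set %>%
    left_join(b_i, by = "movieId") %>%
    left_join(b_u, by = "userId") %>%
    mutate(pred = mu + b_i + b_u) %>%
    .$pred
  return(RMSE(predicted_ratings, test_set$rating))
})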
## `summarise()` ungrouping output (override with `.groups` argument)
qplot(lambdas, rmses)
[Plot: RMSE versus lambda for the regularized movie + user effects model; the minimum is near lambda = 3.75]
lambda <- lambdas[which.min(rmses)]
lambda
## [1] 3.75
method RMSE
Just the average 1.048
Movie Effect Model 0.986
Movie + User Effects Model 0.885
Regularized Movie Effect Model 0.965
Regularized Movie + User Effect Model 0.881
The exercises in Q1-Q8 work with a simulated dataset for 1000 schools. This pre-exercise setup walks you
through the code needed to simulate the dataset.
If you have not done so already since the Titanic Exercises, please restart R or reset the number of digits
that are printed with options(digits=7).
An education expert is advocating for smaller schools. The expert bases this recommendation on the fact
that among the best performing schools, many are small schools. Let’s simulate a dataset for 1000 schools.
First, let’s simulate the number of students in each school, using the following code:
Now let’s assign a true quality for each school that is completely independent from size. This is the parameter
we want to estimate in our analysis. The true quality can be assigned using the following code:
## [1] 67 94
Now let’s have the students in the school take a test. There is random variability in test taking, so we will
simulate the test scores as normally distributed with the average determined by the school quality with a
standard deviation of 30 percentage points. This code will simulate the test scores:
1. What are the top schools based on the average score? Show just the ID, size, and the average score.
Report the ID of the top school and average score of the 10th school.
What is the ID of the top school?
What is the average score of the 10th school (after sorting from highest to lowest average score)?
schools %>% top_n(10, score) %>% arrange(desc(score)) %>% dplyr::select(id, size, score)
## id size score
## 1 PS 567 121 95.8
## 2 PS 191 1036 93.5
## 3 PS 330 162 91.0
## 4 PS 701 83 90.5
## 5 PS 591 213 89.7
## 6 PS 205 172 89.3
## 7 PS 574 199 89.2
## 8 PS 963 208 89.0
## 9 PS 430 61 88.7
## 10 PS 756 245 88.0
2. Compare the median school size to the median school size of the top 10 schools based on the score.
median(schools$size)
## [1] 261
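A sketch of the comparison for the top 10 schools:
schools %>% top_n(10, score) %>% .$size %>% median()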
## [1] 186
3. According to this analysis, it appears that small schools produce better test scores than large schools.
Four out of the top 10 schools have 100 or fewer students. But how can this be? We constructed the
simulation so that quality and size were independent. Repeat the exercise for the worst 10 schools.
What is the median school size of the bottom 10 schools based on the score?
median(schools$size)
## [1] 261
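A sketch for the bottom 10 schools:
schools %>% top_n(-10, score) %>% .$size %>% median()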
## [1] 219
4. From this analysis, we see that the worst schools are also small. Plot the average score versus school
size to see what’s going on. Highlight the top 10 schools based on the true quality.
schools %>% ggplot(aes(size, score)) +
geom_point(alpha = 0.5) +
geom_point(data = filter(schools, rank<=10), col = 2)
[Plot: average score versus school size, with the top 10 schools by true quality highlighted]
□ A. There is no difference in the standard error of the score based on school size; there must be an error
in how we generated our data.
⊠ B. The standard error of the score has larger variability when the school is smaller, which is why both
the best and the worst schools are more likely to be small.
□ C. The standard error of the score has smaller variability when the school is smaller, which is why
both the best and the worst schools are more likely to be small.
□ D. The standard error of the score has larger variability when the school is very small or very large,
which is why both the best and the worst schools are more likely to be small.
□ E. The standard error of the score has smaller variability when the school is very small or very large,
which is why both the best and the worst schools are more likely to be small.
5. Let’s use regularization to pick the best schools. Remember regularization shrinks deviations from
the average towards 0. To apply regularization here, we first need to define the overall average for all
schools, using the following code:
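A sketch of the overall average:

overall <- mean(sapply(scores, mean))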
Then, we need to define, for each school, how it deviates from that average.
Write code that estimates the score above the average for each school, but divide by 𝑛 + 𝛼 instead of 𝑛,
with 𝑛 the school size and 𝛼 a regularization parameter. Try 𝛼 = 25.
What is the ID of the top school with regularization?
What is the regularized score of the 10th school?
alpha <- 25
score_reg <- sapply(scores, function(x) overall + sum(x-overall)/(length(x)+alpha))
schools %>% mutate(score_reg = score_reg) %>%
top_n(10, score_reg) %>% arrange(desc(score_reg))
6. Notice that this improves things a bit. The number of small schools that are not highly ranked is now
lower. Is there a better 𝛼? Using values of 𝛼 from 10 to 250, find the 𝛼 that minimizes the RMSE.
$$\text{RMSE} = \sqrt{\frac{1}{1000}\sum_{i=1}^{1000}\left(\text{quality} - \text{estimate}\right)^2}$$
What value of 𝛼 gives the minimum RMSE?
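A sketch of the search over 𝛼 (the RMSE is computed against the true quality):

alphas <- seq(10, 250)
rmse <- sapply(alphas, function(alpha){
  score_reg <- sapply(scores, function(x) overall + sum(x - overall)/(length(x) + alpha))
  sqrt(mean((score_reg - schools$quality)^2))
})
plot(alphas, rmse)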
(Plot: RMSE, roughly 1.6 to 1.9, against alphas.)
alphas[which.min(rmse)]
## [1] 135
7. Rank the schools based on the average obtained with the best 𝛼 from Q6. Note that no small school
is incorrectly included.
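A sketch of the re-ranking with the best 𝛼:

alpha <- alphas[which.min(rmse)]
score_reg <- sapply(scores, function(x) overall + sum(x - overall)/(length(x) + alpha))
schools %>% mutate(score_reg = score_reg) %>%
  top_n(10, score_reg) %>% arrange(desc(score_reg))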
8. A common mistake made when using regularization is shrinking values towards 0 that are not centered
around 0. For example, if we don’t subtract the overall average before shrinking, we actually obtain a
very similar result. Confirm this by re-running the code from the exercise in Q6 but without removing
the overall mean.
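A sketch of the same search without centering (the scores are now shrunk toward 0 rather than toward the overall mean):

alphas <- seq(10, 250)
rmse <- sapply(alphas, function(alpha){
  score_reg <- sapply(scores, function(x) sum(x)/(length(x) + alpha))
  sqrt(mean((score_reg - schools$quality)^2))
})
plot(alphas, rmse)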
(Plot: RMSE against alphas when the overall mean is not removed.)
alphas[which.min(rmse)]
## [1] 10
Matrix Factorization
• Our earlier models fail to account for an important source of variation related to the fact that groups
of movies and groups of users have similar rating patterns. We can observe these patterns by studying
the residuals and converting our data into a matrix where each user gets a row and each
movie gets a column:
• We can factorize the matrix of residuals 𝑟 into a vector 𝑝 and vector 𝑞, 𝑟𝑢,𝑖 ≈ 𝑝𝑢 𝑞𝑖 , allowing us to
explain more of the variance using a model like this:
𝑌𝑢,𝑖 = 𝜇 + 𝑏𝑖 + 𝑏𝑢 + 𝑝𝑢 𝑞𝑖 + 𝜖𝑢,𝑖
• Because our example is more complicated, we can use two factors to explain the structure and
two sets of coefficients to describe users:
• To estimate factors using our data instead of constructing them ourselves, we can use principal
component analysis (PCA) or singular value decomposition (SVD).
Code
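The matrix y used below is built from the MovieLens training data; a sketch following the textbook (the filtering thresholds and the movie_titles lookup are assumptions):

train_small <- movielens %>%
  group_by(movieId) %>%
  filter(n() >= 50 | movieId == 3252) %>% ungroup() %>%
  group_by(userId) %>%
  filter(n() >= 50) %>% ungroup()
y <- train_small %>%
  select(userId, movieId, rating) %>%
  spread(movieId, rating) %>%
  as.matrix()
movie_titles <- movielens %>%
  select(movieId, title) %>%
  distinct()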
rownames(y) <- y[,1]
y <- y[,-1]
colnames(y) <- with(movie_titles, title[match(colnames(y), movieId)])
(Scatterplot: residuals for "Godfather: Part II, The" against "Godfather, The".)
(Scatterplot: residuals for "Goodfellas" against "Godfather, The".)
(Scatterplot: residuals for "Sleepless in Seattle" against "You've Got Mail".)
set.seed(1)
options(digits = 2)
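# the movie-title variables used below come from an earlier chunk that is not shown;
# redefining them here (names taken from the table that follows) so the code runs:
m_1 <- "Godfather, The"
m_2 <- "Godfather: Part II, The"
m_3 <- "Goodfellas"
m_4 <- "You've Got Mail"
m_5 <- "Sleepless in Seattle"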
Q <- matrix(c(1 , 1, 1, -1, -1), ncol=1)
rownames(Q) <- c(m_1, m_2, m_3, m_4, m_5)
P <- matrix(rep(c(2,0,-2), c(3,5,4)), ncol=1)
rownames(P) <- 1:nrow(P)
X <- jitter(P%*%t(Q))
X %>% knitr::kable(align = "c")
Godfather, The Godfather: Part II, The Goodfellas You’ve Got Mail Sleepless in Seattle
1.81 2.15 1.81 -1.76 -1.81
1.90 1.91 1.91 -2.31 -1.85
2.06 2.22 1.61 -1.82 -2.02
0.33 0.00 -0.09 -0.07 0.29
-0.24 0.17 0.30 0.26 -0.05
0.32 0.39 -0.13 0.12 -0.20
0.36 -0.10 -0.01 0.23 -0.34
0.13 0.22 0.08 0.04 -0.32
-1.90 -1.65 -2.01 2.02 1.85
-2.35 -2.23 -2.25 2.23 2.01
-2.24 -1.88 -1.74 1.62 2.13
-2.26 -2.30 -1.87 1.98 1.93
cor(X)
Godfather, The Godfather: Part II, The Goodfellas You’ve Got Mail Sleepless in Seattle
1 1 1 -1 -1
## [,1]
## 1 2
## 2 2
## 3 2
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 -2
## 10 -2
## 11 -2
## 12 -2
set.seed(1)
options(digits = 2)
m_6 <- "Scent of a Woman"
Q <- cbind(c(1 , 1, 1, -1, -1, -1),
c(1 , 1, -1, -1, -1, 1))
rownames(Q) <- c(m_1, m_2, m_3, m_4, m_5, m_6)
P <- cbind(rep(c(2,0,-2), c(3,5,4)),
c(-1,1,1,0,0,1,1,1,0,-1,-1,-1))/2
rownames(P) <- 1:nrow(X)
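# regenerate the ratings matrix with the two-factor structure before checking the
# correlations (this recomputation is assumed; it is not shown above)
X <- jitter(P %*% t(Q))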
cor(X)
Godfather, The   Godfather: Part II, The   Goodfellas   You’ve Got Mail   Sleepless in Seattle   Scent of a Woman
1 1 1 -1 -1 -1
1 1 -1 -1 -1 1
## [,1] [,2]
## 1 1 -0.5
## 2 1 0.5
## 3 1 0.5
## 4 0 0.0
## 5 0 0.0
## 6 0 0.5
## 7 0 0.5
## 8 0 0.5
## 9 -1 0.0
## 10 -1 -0.5
## 11 -1 -0.5
## 12 -1 -0.5
• You can think of singular value decomposition (SVD) as an algorithm that finds the vectors 𝑝
and 𝑞 that permit us to write the matrix of residuals 𝑟 with 𝑚 rows and 𝑛 columns in the following
way:
$$r_{u,i} = p_{u,1} q_{1,i} + p_{u,2} q_{2,i} + \dots + p_{u,m} q_{m,i}$$
• SVD also computes the variabilities so that we can know how much of the matrix’s total variability
is explained as we add new terms.
• The vectors 𝑞 are called the principal components and the vectors 𝑝 are the user effects.
By using principal components analysis (PCA), matrix factorization can capture structure in the data
determined by user opinions about movies.
Code
y[is.na(y)] <- 0
y <- sweep(y, 1, rowMeans(y))
pca <- prcomp(y)
dim(pca$rotation)
dim(pca$x)
plot(pca$sdev)
(Plot: pca$sdev against index, showing the standard deviations of the principal components decreasing.)
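The cumulative variance explained plotted below is presumably computed along these lines (a sketch):

var_explained <- cumsum(pca$sdev^2 / sum(pca$sdev^2))
plot(var_explained)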
(Plot: var_explained, the cumulative proportion of variance explained, against index.)
pcs <- data.frame(pca$rotation, name = colnames(y))
pcs %>% ggplot(aes(PC1, PC2)) + geom_point() +
geom_text_repel(aes(PC1, PC2, label=name),
data = filter(pcs,
PC1 < -0.1 | PC1 > 0.1 | PC2 < -0.075 | PC2 > 0.1))
(Plot: movies placed by PC1 and PC2, with titles at the extremes labeled, e.g. "Pulp Fiction", "Fargo", "Taxi Driver", "Godfather, The", "Independence Day (a.k.a. ID4)", "Titanic", "Armageddon", "Lord of the Rings: The Return of the King, The", "Matrix, The".)
pcs %>% dplyr::select(name, PC1) %>% arrange(PC1) %>% slice(1:10)
## name PC1
## Pulp Fiction Pulp Fiction -0.16
## Seven (a.k.a. Se7en) Seven (a.k.a. Se7en) -0.14
## Fargo Fargo -0.14
## Taxi Driver Taxi Driver -0.13
## 2001: A Space Odyssey 2001: A Space Odyssey -0.13
## Silence of the Lambs, The Silence of the Lambs, The -0.13
## Clockwork Orange, A Clockwork Orange, A -0.12
## Being John Malkovich Being John Malkovich -0.11
## Fight Club Fight Club -0.10
## Godfather, The Godfather, The -0.10
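The three remaining lists shown below (largest PC1, most negative PC2, largest PC2) presumably come from the analogous commands (a sketch):

pcs %>% dplyr::select(name, PC1) %>% arrange(desc(PC1)) %>% slice(1:10)
pcs %>% dplyr::select(name, PC2) %>% arrange(PC2) %>% slice(1:10)
pcs %>% dplyr::select(name, PC2) %>% arrange(desc(PC2)) %>% slice(1:10)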
##
## Independence Day (a.k.a. ID4)
## Shrek
## Twister
## Titanic
## Armageddon
## Spider-Man
## Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) Harry Potter
## Batman Forever
## Forrest Gump
## Enemy of the State
## PC1
## Independence Day (a.k.a. ID4) 0.161
## Shrek 0.128
## Twister 0.119
## Titanic 0.118
## Armageddon 0.111
## Spider-Man 0.107
## Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) 0.102
## Batman Forever 0.101
## Forrest Gump 0.100
## Enemy of the State 0.092
## name
## Little Miss Sunshine Little Miss Sunshine
## Truman Show, The Truman Show, The
## Slumdog Millionaire Slumdog Millionaire
## Mars Attacks! Mars Attacks!
## American Beauty American Beauty
## Amelie (Fabuleux destin d'Amélie Poulain, Le) Amelie (Fabuleux destin d'Amélie Poulain, Le)
## City of God (Cidade de Deus) City of God (Cidade de Deus)
## Monty Python's Life of Brian Monty Python's Life of Brian
## Shawshank Redemption, The Shawshank Redemption, The
## Beautiful Mind, A Beautiful Mind, A
## PC2
## Little Miss Sunshine -0.081
## Truman Show, The -0.079
## Slumdog Millionaire -0.076
## Mars Attacks! -0.073
## American Beauty -0.069
## Amelie (Fabuleux destin d'Amélie Poulain, Le) -0.068
## City of God (Cidade de Deus) -0.068
## Monty Python's Life of Brian -0.068
## Shawshank Redemption, The -0.066
## Beautiful Mind, A -0.064
## name
## Lord of the Rings: The Two Towers, The Lord of the Rings: The Two Towers, The
## Lord of the Rings: The Fellowship of the Ring, The Lord of the Rings: The Fellowship of the Ring, The
## Lord of the Rings: The Return of the King, The Lord of the Rings: The Return of the King, The
## Matrix, The Matrix, The
## Star Wars: Episode IV - A New Hope Star Wars: Episode IV - A New Hope
## Star Wars: Episode VI - Return of the Jedi Star Wars: Episode VI - Return of the Jedi
## Star Wars: Episode V - The Empire Strikes Back Star Wars: Episode V - The Empire Strikes Back
## Spider-Man 2 Spider-Man 2
## Dark Knight, The Dark Knight, The
## X2: X-Men United X2: X-Men United
## PC2
## Lord of the Rings: The Two Towers, The 0.336
## Lord of the Rings: The Fellowship of the Ring, The 0.332
## Lord of the Rings: The Return of the King, The 0.237
## Matrix, The 0.231
## Star Wars: Episode IV - A New Hope 0.217
## Star Wars: Episode VI - Return of the Jedi 0.192
## Star Wars: Episode V - The Empire Strikes Back 0.168
## Spider-Man 2 0.114
## Dark Knight, The 0.103
## X2: X-Men United 0.094
In this exercise set, we will be covering a topic useful for understanding matrix factorization: the singular
value decomposition (SVD). SVD is a mathematical result that is widely used in machine learning, both in
practice and to understand the mathematical properties of some algorithms. This is a rather advanced topic
and to complete this exercise set you will have to be familiar with linear algebra concepts such as matrix
multiplication, orthogonal matrices, and diagonal matrices.
The SVD tells us that we can decompose an 𝑁 × 𝑝 matrix 𝑌 with 𝑝 < 𝑁 as
𝑌 = 𝑈 𝐷𝑉 ⊤
with 𝑈 and 𝑉 orthogonal of dimensions 𝑁 × 𝑝 and 𝑝 × 𝑝 respectively and 𝐷 a 𝑝 × 𝑝 diagonal matrix with
the values of the diagonal decreasing:
𝑑1,1 ≥ 𝑑2,2 ≥ … ≥ 𝑑𝑝,𝑝
In this exercise, we will see one of the ways that this decomposition can be useful. To do this, we will
construct a dataset that represents grade scores for 100 students in 24 different subjects. The overall average
has been removed so this data represents the percentage point each student received above or below the
average test score. So a 0 represents an average grade (C), a 25 is a high grade (A+), and a -25 represents
a low grade (F). You can simulate the data like this:
set.seed(1987)
#if using R 3.6 or later, use `set.seed(1987, sample.kind="Rounding")` instead
n <- 100
k <- 8
Sigma <- 64 * matrix(c(1, .75, .5, .75, 1, .5, .5, .5, 1), 3, 3)
m <- MASS::mvrnorm(n, rep(0, 3), Sigma)
m <- m[order(rowMeans(m), decreasing = TRUE),]
y <- m %x% matrix(rep(1, k), nrow = 1) + matrix(rnorm(matrix(n*k*3)), n, k*3)
colnames(y) <- c(paste(rep("Math",k), 1:k, sep="_"),
paste(rep("Science",k), 1:k, sep="_"),
paste(rep("Arts",k), 1:k, sep="_"))
Our goal is to describe the student performances as succinctly as possible. For example, we want to know
whether these test results are all just random, independent numbers. Are all students about equally good? Does
being good in one subject imply you will be good in another? How does the SVD help with all this? We
will go step by step to show that with just three relatively small pairs of vectors we can explain much of the
variability in this 100 × 24 dataset.
1. You can visualize the 24 test scores for the 100 students by plotting an image:
my_image(y)
(Image plot of y: 100 students by 24 tests, with columns Math_1-Math_8, Science_1-Science_8, and Arts_1-Arts_8.)
How would you describe the data based on this figure?
2. You can examine the correlation between the test scores directly like this:
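A sketch of the correlation image, using the my_image() helper already used above:

my_image(cor(y), zlim = c(-1, 1))
axis(side = 2, 1:ncol(y), rev(colnames(y)), las = 2)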
(Correlation image of the 24 test scores, with Math_1-Arts_8 on both axes.)
Which of the following best describes what you see?
3. Remember that orthogonality means that 𝑈 ⊤ 𝑈 and 𝑉 ⊤ 𝑉 are equal to the identity matrix. This implies
that we can also rewrite the decomposition as
𝑌 𝑉 = 𝑈 𝐷 or 𝑈 ⊤ 𝑌 = 𝐷𝑉 ⊤
We can think of 𝑌 𝑉 and 𝑈 ⊤ 𝑌 as two transformations of 𝑌 that preserve the total variability of 𝑌 since 𝑈
and 𝑉 are orthogonal.
Use the function svd() to compute the SVD of y. This function will return 𝑈 , 𝑉 , and the diagonal entries
of 𝐷.
s <- svd(y)
names(s)
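# the output below is presumably a reconstruction check along these lines (a sketch):
y_svd <- s$u %*% diag(s$d) %*% t(s$v)
max(abs(y - y_svd))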
## [1] 5.3e-14
Compute the sum of squares of the columns of 𝑌 and store them in ss_y. Then compute the sum of squares
of columns of the transformed 𝑌 𝑉 and store them in ss_yv. Confirm that sum(ss_y) is equal to sum(ss_yv).
What is the value of sum(ss_y) (and also the value of sum(ss_yv))?
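A sketch of the sums of squares:

ss_y <- apply(y^2, 2, sum)
ss_yv <- apply((y %*% s$v)^2, 2, sum)
sum(ss_y)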
## [1] 175435
sum(ss_yv)
## [1] 175435
4. We see that the total sum of squares is preserved. This is because 𝑉 is orthogonal. Now to start
understanding how 𝑌 𝑉 is useful, plot ss_y against the column number and then do the same for
ss_yv.
plot(ss_y)
(Plot: ss_y against column number 1-24; all values are near 7000-7600.)
plot(ss_yv)
(Plot: ss_yv against column number 1-24; the values drop from over 100000 to near 0 after the first few columns.)
□ A. ss_y and ss_yv are decreasing and close to 0 for the 4th column and beyond.
⊠ B. ss_yv is decreasing and close to 0 for the 4th column and beyond.
□ C. ss_y is decreasing and close to 0 for the 4th column and beyond.
□ D. There is no discernible pattern to either ss_y or ss_yv.
5. Now notice that we didn’t have to compute ss_yv because we already have the answer. How? Re-
member that 𝑌 𝑉 = 𝑈 𝐷 and because 𝑈 is orthogonal, we know that the sum of squares of the columns
of 𝑈 𝐷 are the diagonal entries of 𝐷 squared. Confirm this by plotting the square root of ss_yv versus
the diagonal entries of 𝐷.
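A sketch of the comparison plot:

plot(sqrt(ss_yv), s$d)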
(Plot: sqrt(ss_yv) against s$d, from 0 to about 300; the points fall on the identity line.)
Which of these plots is correct?
⊠ A.
□ B.
□ C.
□ D.
6. So from the above we know that the sum of squares of the columns of 𝑌 (the total sum of squares)
adds up to the sum of s$d^2 and that the transformation 𝑌 𝑉 gives us columns with sums of squares
equal to s$d^2. Now compute the percent of the total variability that is explained by just the first
three columns of 𝑌 𝑉 .
What proportion of the total variability is explained by the first three columns of 𝑌 𝑉 ?
sum(s$d[1:3]^2) / sum(s$d^2)
## [1] 0.99
7. Before we continue, let’s show a useful computational trick to avoid creating the matrix diag(s$d).
To motivate this, we note that if we write 𝑈 out in its columns [𝑈1 , 𝑈2 , … , 𝑈𝑝 ] then 𝑈 𝐷 is equal to
$$UD = [U_1 d_{1,1},\; U_2 d_{2,2},\; \dots,\; U_p d_{p,p}]$$
⊠ B. identical(s$u %*% diag(s$d), sweep(s$u, 2, s$d, FUN = "*"))
□ C. identical(s$u %*% t(diag(s$d)), sweep(s$u, 2, s$d, FUN = "*"))
□ D. identical(s$u %*% diag(s$d), sweep(s$u, 2, s, FUN = "*"))
8. We know that 𝑈1 𝑑1,1 , the first column of 𝑈 𝐷, has the most variability of all the columns of 𝑈 𝐷.
Earlier we looked at an image of 𝑌 using my_image(y), in which we saw that the student to student
variability is quite large and that students that are good in one subject tend to be good in all. This
implies that the average (across all subjects) for each student should explain a lot of the variability.
Compute the average score for each student, plot it against 𝑈1 𝑑1,1 , and describe what you find.
plot(s$u[,1]*s$d[1], rowMeans(y))
10
rowMeans(y)
0
−10
−20
−50 0 50 100
s$u[, 1] * s$d[1]
□ A. There is no relationship between the average score for each student and 𝑈1 𝑑1,1 .
□ B. There is an exponential relationship between the average score for each student and 𝑈1 𝑑1,1 .
⊠ C. There is a linear relationship between the average score for each student and 𝑈1 𝑑1,1 .
𝑈 𝐷𝑉 ⊤ = (−𝑈 )𝐷(−𝑉 )⊤
With this in mind we see that the first column of 𝑈 𝐷 is almost identical to the average score for each student
except for the sign.
This implies that multiplying 𝑌 by the first column of 𝑉 must be performing a similar operation to taking the
average. Make an image plot of 𝑉 and describe the first column relative to others and how this relates to
taking an average.
How does the first column relate to the others, and how does this relate to taking an average?
my_image(s$v)
(Image plot of s$v; the 24 columns are numbered 1-24 on the axis.)
□ A. The first column is very variable, which implies that the first column of YV is the sum of the rows
of Y multiplied by some non-constant function, and is thus not proportional to an average.
□ B. The first column is very variable, which implies that the first column of YV is the sum of the rows
of Y multiplied by some non-constant function, and is thus proportional to an average.
⊠ C. The first column is very close to being a constant, which implies that the first column of YV is the
sum of the rows of Y multiplied by some constant, and is thus proportional to an average.
□ D. The first three columns are all very close to being a constant, which implies that these columns are
the sum of the rows of Y multiplied by some constant, and are thus proportional to an average.
(Plots: s$u[, 1] against index 1-100 and s$v[, 1] against index 1-24, on a common y-axis range.)
(Image plot of the rank-1 approximation d_{1,1} U_1 V_1^T; columns numbered 1-24.)
my_image(y)
(Image plot of y for comparison; columns Math_1-Arts_8.)
11. We see that with just a vector of length 100, a scalar, and a vector of length 24, we can actually come
close to reconstructing the 100 × 24 matrix. This is our first matrix factorization:
𝑌 ≈ 𝑑1,1 𝑈1 𝑉1⊤
In the exercise in Q6, we saw how to calculate the percent of total variability explained. However, our
approximation only explains the observation that good students tend to be good in all subjects. Another
aspect of the original data that our approximation does not explain was the higher similarity we observed
within subjects. We can see this by computing the difference between our approximation and original data
and then computing the correlations. You can see this by running this code:
resid <- y - with(s,(u[, 1, drop=FALSE]*d[1]) %*% t(v[, 1, drop=FALSE]))
my_image(cor(resid), zlim = c(-1,1))
axis(side = 2, 1:ncol(y), rev(colnames(y)), las = 2)
(Correlation image of resid, with Math_1-Arts_8 on both axes.)
Now that we have removed the overall student effect, the correlation plot reveals that we have not yet
explained the within subject correlation nor the fact that math and science are closer to each other than to
the arts. So let’s explore the second column of the SVD.
Repeat the previous exercise (Q10) but for the second column: Plot 𝑈2 , then plot 𝑉2⊤ using the same range
for the y-axis limits, then make an image of 𝑈2 𝑑2,2 𝑉2⊤ and compare it to the image of resid.
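A sketch of the corresponding commands:

plot(s$u[, 2], ylim = c(-0.5, 0.5))
plot(s$v[, 2], ylim = c(-0.5, 0.5))
with(s, my_image((u[, 2, drop = FALSE] * d[2]) %*% t(v[, 2, drop = FALSE])))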
(Plots: s$u[, 2] against index 1-100 and s$v[, 2] against index 1-24, on a common y-axis range.)
(Image plot of d_{2,2} U_2 V_2^T; columns numbered 1-24.)
my_image(resid)
(Image plot of resid for comparison; columns Math_1-Arts_8.)
12. The second column clearly relates to a student’s difference in ability in math/science versus the arts.
We can see this most clearly from the plot of s$v[,2]. Adding the matrix we obtain with these two
columns will help with our approximation:
resid <- y - with(s,sweep(u[, 1:2], 2, d[1:2], FUN="*") %*% t(v[, 1:2]))
my_image(cor(resid), zlim = c(-1,1))
axis(side = 2, 1:ncol(y), rev(colnames(y)), las = 2)
(Correlation image of the updated resid, with Math_1-Arts_8 on both axes.)
and see that the structure that is left is driven by the differences between math and science. Confirm this
by first plotting 𝑈3 , then plotting 𝑉3⊤ using the same range for the y-axis limits, then making an image of
𝑈3 𝑑3,3 𝑉3⊤ and comparing it to the image of resid.
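A sketch of the first of these plots (the remaining commands appear below):

plot(s$u[, 3], ylim = c(-0.5, 0.5))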
(Plot: s$u[, 3] against index 1-100.)
plot(s$v[,3], ylim = c(-0.5, 0.5))
(Plot: s$v[, 3] against index 1-24.)
my_image(resid)
(Image plot of resid; columns Math_1-Arts_8.)
13. The third column clearly relates to a student’s difference in ability in math and science. We can see
this most clearly from the plot of s$v[,3]. Adding the matrix we obtain with this third column will further
help with our approximation:
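A sketch of the update, mirroring the two-column version above:

resid <- y - with(s, sweep(u[, 1:3], 2, d[1:3], FUN = "*") %*% t(v[, 1:3]))
my_image(cor(resid), zlim = c(-1, 1))
axis(side = 2, 1:ncol(y), rev(colnames(y)), las = 2)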
(Correlation image of the residuals after removing the first three components, with Math_1-Arts_8 on both axes; essentially no structure remains.)
We no longer see structure in the residuals: they seem to be independent of each other. This implies that
we can describe the data with the following model:
𝑌 = 𝑑1,1 𝑈1 𝑉1⊤ + 𝑑2,2 𝑈2 𝑉2⊤ + 𝑑3,3 𝑈3 𝑉3⊤ + 𝜀
with 𝜀 a matrix of independent, identically distributed errors. This model is useful because we summarize the
100 × 24 observations with just 3 × (100 + 24 + 1) = 375 numbers.
Furthermore, the three components of the model have useful interpretations:
1 - the overall ability of a student
2 - the difference in ability between the math/sciences and arts
3 - the remaining differences between the three subjects.
The sizes 𝑑1,1 , 𝑑2,2 and 𝑑3,3 tell us the variability explained by each component. Finally, note that the
components 𝑑𝑗,𝑗 𝑈𝑗 𝑉𝑗⊤ are equivalent to the jth principal component.
Finish the exercise by plotting an image of 𝑌 , an image of 𝑑1,1 𝑈1 𝑉1⊤ + 𝑑2,2 𝑈2 𝑉2⊤ + 𝑑3,3 𝑈3 𝑉3⊤ and an image
of the residuals, all with the same zlim.
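A sketch of these final plots:

y_hat <- with(s, sweep(u[, 1:3], 2, d[1:3], FUN = "*") %*% t(v[, 1:3]))
my_image(y, zlim = range(y))
my_image(y_hat, zlim = range(y))
my_image(y - y_hat, zlim = range(y))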
(Image plots of y, the rank-3 approximation d_{1,1} U_1 V_1^T + d_{2,2} U_2 V_2^T + d_{3,3} U_3 V_3^T, and the residuals, all with the same zlim; columns Math_1-Arts_8.)
Comprehension Check - Dimension Reduction
data("tissue_gene_expression")
dim(tissue_gene_expression$x)
We want to get an idea of which observations are close to each other, but, as you can see from the dimensions,
the predictors are 500-dimensional, making plotting difficult. Plot the first two principal components with
color representing tissue type.
Which tissue is in a cluster by itself?
pc <- prcomp(tissue_gene_expression$x)
data.frame(pc_1 = pc$x[,1], pc_2 = pc$x[,2],
tissue = tissue_gene_expression$y) %>%
ggplot(aes(pc_1, pc_2, color = tissue)) +
geom_point()
(Plot: pc_1 against pc_2, colored by tissue type; the legend lists cerebellum, colon, endometrium, hippocampus, kidney, liver, and placenta.)
□ A. cerebellum
□ B. colon
□ C. endometrium
□ D. hippocampus
□ E. kidney
⊠ F. liver
□ G. placenta
2. The predictors for each observation are measured using the same device and experimental procedure.
This introduces biases that can affect all the predictors from one observation. For each observation,
compute the average across all predictors, and then plot this against the first PC with color representing
tissue. Report the correlation.
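A sketch of this step (avgs holds the observation averages used in the correlation below):

avgs <- rowMeans(tissue_gene_expression$x)
data.frame(pc_1 = pc$x[, 1], avg = avgs,
           tissue = tissue_gene_expression$y) %>%
  ggplot(aes(avg, pc_1, color = tissue)) +
  geom_point()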
(Plot: pc_1 against avgs, the observation averages, colored by tissue type.)
cor(avgs, pc$x[,1])
## [1] 0.6
3. We see an association with the first PC and the observation averages. Redo the PCA but only after
removing the center. Part of the code is provided for you.
#BLANK
pc <- prcomp(x)
data.frame(pc_1 = pc$x[,1], pc_2 = pc$x[,2],
tissue = tissue_gene_expression$y) %>%
ggplot(aes(pc_1, pc_2, color = tissue)) +
geom_point()
Which line of code should be used to replace #BLANK in the code block above?
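The selected replacement for #BLANK is presumably the line that removes each observation's mean before re-running prcomp(), along these lines:

x <- with(tissue_gene_expression, sweep(x, 1, rowMeans(x)))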
(Plot: pc_1 against pc_2 after removing the observation means, colored by tissue type.)
4. For the first 10 PCs, make a boxplot showing the values for each tissue.
For the 7th PC, which two tissues have the greatest median difference?
for(i in 1:10){
boxplot(pc$x[,i] ~ tissue_gene_expression$y, main = paste("PC", i))
}
(Boxplots of pc$x[, i] by tissue type for PC 1 through PC 10.)
Select the TWO tissues that have the greatest difference between their medians.
□ A. cerebellum
⊠ B. colon
□ C. endometrium
□ D. hippocampus
□ E. kidney
□ F. liver
⊠ G. placenta
5. Plot the percent variance explained by PC number. Hint: use the summary function.
How many PCs are required to reach a cumulative percent variance explained greater than 50%? 3
plot(summary(pc)$importance[3,])
(Plot: summary(pc)$importance[3, ], the cumulative proportion of variance explained, against index.)
These exercises will work with the tissue_gene_expression dataset, which is part of the dslabs package.
1. Load the tissue_gene_expression dataset. Remove the row means and compute the distance between
each observation. Store the result in d.
□ A. d <- dist(tissue_gene_expression$x)
□ B. d <- dist(rowMeans(tissue_gene_expression$x))
□ C. d <- dist(rowMeans(tissue_gene_expression$y))
⊠ D. d <- dist(tissue_gene_expression$x - rowMeans(tissue_gene_expression$x))
2. Make a hierarchical clustering plot and add the tissue types as labels.
#BLANK
plot(h)

h <- hclust(d)

(Dendrogram: hierarchical clustering of the tissue samples, Height roughly 0 to 25. The leaves are individual samples (liver_1-liver_26, cerebellum_..., hippocampus_..., kidney_..., endometrium_..., placenta_..., colon_...) and group largely by tissue type, with the liver samples appearing at the far left.)

You will observe multiple branches. Which tissue type is in the branch farthest to the left?

□ A. cerebellum
□ B. colon
□ C. endometrium
□ D. hippocampus
□ E. kidney
⊠ F. liver
□ G. placenta

3. Select the 50 most variable genes. Make sure the observations show up in the columns, that the predictors
are centered, and add a color bar to show the different tissue types. Hint: use the ColSideColors argument
to assign colors. Also, use col = RColorBrewer::brewer.pal(11, "RdBu") for a better use of colors.

Which line of code should replace #BLANK in the code above?
if(!require(RColorBrewer)) install.packages("RColorBrewer")
library(RColorBrewer)
sds <- matrixStats::colSds(tissue_gene_expression$x)
ind <- order(sds, decreasing = TRUE)[1:50]
colors <- brewer.pal(7, "Dark2")[as.numeric(tissue_gene_expression$y)]
heatmap(t(tissue_gene_expression$x[,ind]), col = brewer.pal(11, "RdBu"), scale = "row", ColSideColors = colors)
(Heatmap of the 50 most variable genes (rows such as CD14, DDT, SPP1, HAMP) across the samples (columns), with a ColSideColors bar marking tissue type.)
Section 7 - Final Assessment
The brca dataset from the dslabs package contains information about breast cancer diagnosis biopsy samples
for tumors that were determined to be either benign (not cancer) or malignant (cancer). The brca object
is a list consisting of:
• brca$y: a vector of sample classifications ("B" = benign, "M" = malignant)
• brca$x: a matrix of numeric features describing properties of the cell nuclei in the biopsy images
For these exercises, load the data by setting your options and loading the libraries and data as shown in the
code here:
options(digits = 3)
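# the library() calls that accompany this setup are not reproduced above; presumably:
library(matrixStats)
library(tidyverse)
library(caret)
library(dslabs)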
data(brca)
The exercises in this assessment are available to Verified Learners only and are split into four parts, all of
which use the data described here.
dim(brca$x)[1]
## [1] 569
dim(brca$x)[2]
## [1] 30
mean(brca$y == "M")
## [1] 0.373
which.max(colMeans(brca$x))
## area_worst
## 24
which.min(colSds(brca$x))
## [1] 20
Use sweep() two times to scale each column: subtract the column means of brca$x, then divide by the
column standard deviations of brca$x.
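A sketch of the scaling step described above:

x_centered <- sweep(brca$x, 2, colMeans(brca$x))
x_scaled <- sweep(x_centered, 2, colSds(brca$x), FUN = "/")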
After scaling, what is the standard deviation of the first column?
sd(x_scaled[,1])
## [1] 1
median(x_scaled[,1])
## [1] -0.215
3. Distance
Calculate the distance between all samples using the scaled matrix.
What is the average distance between the first sample, which is benign, and other benign samples?
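A sketch of this computation (the first sample is excluded from its own average):

d_samples <- dist(x_scaled)
dist_BtoB <- as.matrix(d_samples)[1, brca$y == "B"]
mean(dist_BtoB[2:length(dist_BtoB)])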
## [1] 4.41
What is the average distance between the first sample and malignant samples?
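And for the malignant samples:

dist_BtoM <- as.matrix(d_samples)[1, brca$y == "M"]
mean(dist_BtoM)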
## [1] 7.12
4. Heatmap of features
Make a heatmap of the relationship between features using the scaled matrix.
Which of these heatmaps is correct? To remove column and row labels like the images below, use labRow =
NA and labCol = NA.
d_features <- dist(t(x_scaled))
heatmap(as.matrix(d_features), labRow = NA, labCol = NA)
⊠ A.
□ B.
□ C.
□ D.
□ E.
5. Hierarchical clustering
Perform hierarchical clustering on the 30 features. Cut the tree into 5 groups.
All but one of the answer options are in the same group.
Which is in a different group?
h <- hclust(d_features)
groups <- cutree(h, k = 5)
split(names(groups), groups)
## $`1`
## [1] "radius_mean" "perimeter_mean" "area_mean"
## [4] "concavity_mean" "concave_pts_mean" "radius_se"
## [7] "perimeter_se" "area_se" "radius_worst"
## [10] "perimeter_worst" "area_worst" "concave_pts_worst"
##
## $`2`
## [1] "texture_mean" "texture_worst"
##
## $`3`
## [1] "smoothness_mean" "compactness_mean" "symmetry_mean"
## [4] "fractal_dim_mean" "smoothness_worst" "compactness_worst"
## [7] "concavity_worst" "symmetry_worst" "fractal_dim_worst"
##
## $`4`
## [1] "texture_se" "smoothness_se" "symmetry_se"
##
## $`5`
## [1] "compactness_se" "concavity_se" "concave_pts_se" "fractal_dim_se"
□ A. smoothness_mean
□ B. smoothness_worst
□ C. compactness_mean
□ D. compactness_worst
⊠ E. concavity_mean
□ F. concavity_worst
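The principal component analysis summarized below is presumably:

pca <- prcomp(x_scaled)
summary(pca)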
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 3.644 2.386 1.6787 1.407 1.284 1.0988 0.8217 0.6904
## Proportion of Variance 0.443 0.190 0.0939 0.066 0.055 0.0403 0.0225 0.0159
## Cumulative Proportion 0.443 0.632 0.7264 0.792 0.847 0.8876 0.9101 0.9260
## PC9 PC10 PC11 PC12 PC13 PC14 PC15
## Standard deviation 0.6457 0.5922 0.5421 0.51104 0.49128 0.39624 0.30681
## Proportion of Variance 0.0139 0.0117 0.0098 0.00871 0.00805 0.00523 0.00314
## Cumulative Proportion 0.9399 0.9516 0.9614 0.97007 0.97812 0.98335 0.98649
## PC16 PC17 PC18 PC19 PC20 PC21 PC22
## Standard deviation 0.28260 0.24372 0.22939 0.22244 0.17652 0.173 0.16565
## Proportion of Variance 0.00266 0.00198 0.00175 0.00165 0.00104 0.001 0.00091
## Cumulative Proportion 0.98915 0.99113 0.99288 0.99453 0.99557 0.997 0.99749
## PC23 PC24 PC25 PC26 PC27 PC28 PC29
## Standard deviation 0.15602 0.1344 0.12442 0.09043 0.08307 0.03987 0.02736
## Proportion of Variance 0.00081 0.0006 0.00052 0.00027 0.00023 0.00005 0.00002
## Cumulative Proportion 0.99830 0.9989 0.99942 0.99969 0.99992 0.99997 1.00000
## PC30
## Standard deviation 0.0115
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
Plot the first two principal components with color representing tumor type (benign/malignant).
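A sketch of the plotting code:

data.frame(pca$x[, 1:2], type = brca$y) %>%
  ggplot(aes(PC1, PC2, color = type)) +
  geom_point()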
(Plot: PC1 against PC2, colored by tumor type, B versus M.)
Which of the following is true?
□ A. Malignant tumors tend to have smaller values of PC1 than benign tumors.
⊠ B. Malignant tumors tend to have larger values of PC1 than benign tumors.
□ C. Malignant tumors tend to have smaller values of PC2 than benign tumors.
□ D. Malignant tumors tend to have larger values of PC2 than benign tumors.
□ E. There is no relationship between the first two principal components and tumor type.
8. PCA: PC boxplot
data.frame(type = brca$y, pca$x[,1:10]) %>%
gather(key = "PC", value = "value", -type) %>%
ggplot(aes(PC, value, fill = type)) +
geom_boxplot()
(Boxplots of the values of PC1 through PC10 by tumor type, B versus M.)
Which PCs are significantly different enough by tumor type that there is no overlap in the interquartile
ranges (IQRs) for benign and malignant samples?
Select ALL that apply.
⊠ A. PC1
□ B. PC2
□ C. PC3
□ D. PC4
□ E. PC5
□ F. PC6
□ G. PC7
□ H. PC8
□ I. PC9
□ J. PC10
Set the seed to 1, then create a data partition splitting brca$y and the scaled version of the brca$x matrix
into a 20% test set and 80% train using the following code:
# set.seed(1) if using R 3.5 or earlier
set.seed(1, sample.kind = "Rounding") # if using R 3.6 or later
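A sketch of the partitioning code referred to above:

test_index <- createDataPartition(brca$y, times = 1, p = 0.2, list = FALSE)
test_x <- x_scaled[test_index, ]
test_y <- brca$y[test_index]
train_x <- x_scaled[-test_index, ]
train_y <- brca$y[-test_index]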
You will be using these training and test sets throughout the exercises in Parts 3 and 4. Save your models
as you go, because at the end, you’ll be asked to make an ensemble prediction and to compare the accuracy
of the various models!
Check that the training and test sets have similar proportions of benign and malignant tumors.
What proportion of the training set is benign?
mean(train_y == "B")
## [1] 0.628
mean(test_y == "B")
## [1] 0.626
Set the seed to 3. Perform k-means clustering on the training set with 2 centers and assign the output to k.
Then use the predict_kmeans() function to make predictions on the test set.
What is the overall accuracy?
# set.seed(3) if using R 3.5 or earlier
set.seed(3, sample.kind = "Rounding") # if using R 3.6 or later
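A sketch of this step. predict_kmeans() is the helper supplied with the exercise (restated here as an assumption); it assigns each test observation to the nearest cluster center:

predict_kmeans <- function(x, k) {
  centers <- k$centers                          # extract cluster centers
  distances <- sapply(1:nrow(x), function(i){   # distance from each row to each center
    apply(centers, 1, function(y) dist(rbind(x[i,], y)))
  })
  max.col(-t(distances))                        # index of the nearest center
}
k <- kmeans(train_x, centers = 2)
kmeans_preds <- ifelse(predict_kmeans(test_x, k) == 1, "B", "M")
mean(kmeans_preds == test_y)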
## [1] 0.922
## [1] 0.986
## [1] 0.814
Fit a logistic regression model on the training set with caret::train() using all predictors. Ignore warnings
about the algorithm not converging. Make predictions on the test set.
What is the accuracy of the logistic regression model on the test set?
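A sketch of the model fit:

train_glm <- train(train_x, train_y, method = "glm")
glm_preds <- predict(train_glm, test_x)
mean(glm_preds == test_y)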
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## [1] 0.957
Train an LDA model and a QDA model on the training set. Make predictions on the test set using each
model.
What is the accuracy of the LDA model on the test set?
train_lda <- train(train_x, train_y, method = "lda")
lda_preds <- predict(train_lda, test_x)
mean(lda_preds == test_y)
## [1] 0.991
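The corresponding QDA fit (a sketch):

train_qda <- train(train_x, train_y, method = "qda")
qda_preds <- predict(train_qda, test_x)
mean(qda_preds == test_y)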
## [1] 0.957
Set the seed to 5, then fit a loess model on the training set with the caret package. You will need to install
the gam package if you have not yet done so. Use the default tuning grid. This may take several minutes;
ignore warnings. Generate predictions on the test set.
What is the accuracy of the loess model on the test set?
# set.seed(5)
set.seed(5, sample.kind = "Rounding") # simulate R 3.5
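A sketch of the loess fit (method "gamLoess" requires the gam package):

train_loess <- train(train_x, train_y, method = "gamLoess")
loess_preds <- predict(train_loess, test_x)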
## (repeated warnings from lo.wam() and gam.lo() about convergence, extrapolation, and evaluation limits omitted; the prompt above says to ignore them)
mean(loess_preds == test_y)
## [1] 0.983
Set the seed to 7, then train a k-nearest neighbors model on the training set using the caret package. Try
odd values of 𝑘 from 3 to 21. Use the final model to generate predictions on the test set.
What is the final value of 𝑘 used in the model?
# set.seed(7)
set.seed(7, sample.kind = "Rounding") # simulate R 3.5
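A sketch of the kNN fit over the stated grid:

train_knn <- train(train_x, train_y, method = "knn",
                   tuneGrid = data.frame(k = seq(3, 21, 2)))
train_knn$bestTune
knn_preds <- predict(train_knn, test_x)
mean(knn_preds == test_y)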
## k
## 10 21
## [1] 0.948
# set.seed(9)
set.seed(9, sample.kind = "Rounding") # simulate R 3.5
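The random forest fit whose results follow (a sketch; the mtry grid and importance = TRUE are assumptions based on the tuning output and the variable-importance table below):

train_rf <- train(train_x, train_y, method = "rf",
                  tuneGrid = data.frame(mtry = c(3, 5, 7, 9)),
                  importance = TRUE)
train_rf$bestTune
rf_preds <- predict(train_rf, test_x)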
## mtry
## 1 3
What is the accuracy of the random forest model on the test set?
## [1] 0.974
What is the most important variable in the random forest model? Be sure to enter the variable name exactly
as it appears in the dataset.
varImp(train_rf)
## rf variable importance
##
## only 20 most important variables shown (out of 30)
##
## Importance
## area_worst 100.0
## radius_worst 87.7
## concave_pts_worst 85.7
## perimeter_worst 85.5
## concave_pts_mean 72.1
## area_se 67.3
## concavity_worst 63.5
## area_mean 61.4
## texture_worst 59.9
## perimeter_mean 55.2
## concavity_mean 55.2
## texture_mean 55.0
## radius_se 49.8
## smoothness_worst 49.1
## radius_mean 49.0
## perimeter_se 45.0
## compactness_worst 39.3
## symmetry_worst 35.3
## smoothness_mean 30.6
## fractal_dim_worst 27.8
□ A. mean values
□ B. standard errors
⊠ C. worst values
ensemble <- cbind(glm = glm_preds == "B", lda = lda_preds == "B", qda = qda_preds == "B", loess = loess_preds == "B", rf = rf_preds == "B", knn = knn_preds == "B", kmeans = kmeans_preds == "B")
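# majority vote across the seven models (a sketch of the lines that produce the accuracy below)
ensemble_preds <- ifelse(rowMeans(ensemble) > 0.5, "B", "M")
mean(ensemble_preds == test_y)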
## [1] 0.983
models <- c("K means", "Logistic regression", "LDA", "QDA", "Loess", "K nearest neighbors", "Random forest", "Ensemble")
accuracy <- c(mean(kmeans_preds == test_y),
mean(glm_preds == test_y),
mean(lda_preds == test_y),
mean(qda_preds == test_y),
mean(loess_preds == test_y),
mean(knn_preds == test_y),
mean(rf_preds == test_y),
mean(ensemble_preds == test_y))
data.frame(Model = models, Accuracy = accuracy)
## Model Accuracy
## 1 K means 0.922
## 2 Logistic regression 0.957
## 3 LDA 0.991
## 4 QDA 0.957
## 5 Loess 0.983
## 6 K nearest neighbors 0.948
## 7 Random forest 0.974
## 8 Ensemble 0.983
□ A. Logistic regression
⊠ B. LDA
□ C. Loess
□ D. Random forest
□ E. Ensemble