Logistic Regression Notes

# To check which directory you are working in:

getwd()

# To import the data set

# you need to change the "file" location to where you've stored the data set

g <- read.csv(file = "Q:/MPHMOOC/R/cancer data for MOOC 1.csv",
              header = TRUE, sep = ',')

# To have a look at the first few rows of our data set:

head(g)

# To inspect the `age` variable:

g$age

# To display a summary of the ages of our patients:

summary(g$age)

# To display a summary of the genders of our patients:

table(g$gender)

# To display a summary of the BMI of our patients:

summary(g$bmi)

# To display a summary of the smoking status of our patients:

table(g$smoking)
# To display a summary of the exercise status of our patients:

table(g$exercise)

# To display a summary of the daily fruit consumption of our patients:

table(g$fruit)

# To display a summary of the daily vegetable consumption of our patients:

table(g$veg)

# To display a summary of the cancer status of our patients:

table(g$cancer)

# To create a new variable `fruitveg`, which sums the daily consumption of fruit and veg of each patient:

g$fruitveg <- g$fruit + g$veg

# To display a summary of the combined fruit and veg consumption of our patients:

table(g$fruitveg)

# To display a histogram of the ages of our patients:

hist(g$age)

# To create a new binary variable `five_a_day`, whether the patient eats at least 5 fruit or veg a day:

g$five_a_day <- ifelse(g$fruitveg >= 5, 1, 0)


# To summarise the `five_a_day` variable:

table(g$five_a_day)

# To display a histogram of the daily fruit and veg consumption of our patients, including a title and proper axes:

hist(g$fruitveg, xlab = "Portions of fruit and vegetables",
     main = "Daily consumption of fruit and vegetables combined", axes = F)

axis(side = 1, at = seq(0, 11, 1))

axis(side = 2, at = seq(0, 16, 2))

# To create a new binary variable `healthy_BMI`, whether the patient has a healthy BMI or not:

g$healthy_BMI <- ifelse(g$bmi > 18.5 & g$bmi < 25, 1, 0)

# To summarise `healthy_BMI`:

table(g$healthy_BMI)

# To run a chi-squared test to look for an association between eating five or more fruit and veg a day and cancer:

chisq.test(x = g$five_a_day, y = g$cancer)

# To run a (two-tailed) t-test to see whether the mean BMI of those with cancer is different from the mean BMI of those without cancer:

t.test(g$bmi ~ g$cancer)

# To run a (two-tailed) t-test to see whether the mean BMI of those with cancer is different from the mean BMI of those without cancer, where the variances are equal:

t.test(g$bmi ~ g$cancer, var.equal = T)

# To run a t-test to see whether the mean BMI of all patients is different from 25:

t.test(g$bmi, mu = 25)

# To run a chi-squared test to see whether there is an association between eating five or more fruit a day and having cancer:

chisq.test(x = g$five_a_day, y = g$cancer)

# To create a new binary variable, whether overweight or not according to their BMI:

g$overweight <- ifelse(g$bmi >= 25, 1, 0)

# To summarise the `overweight` variable:

table(g$overweight)

# To run a chi-squared test to see whether there is an association between being overweight and cancer:

chisq.test(x = g$overweight, y = g$cancer)

Logistic regression in R
Have you ever tried washing up dirty dishes with a hammer, or tried to carve a roast chicken with a
toothbrush? Of course you haven't, it's the wrong tool for the job. Similarly don't be tempted to use
linear regression when your outcome variable, the thing you want to predict, only has two values. It's
the wrong tool for the job and it will lead to disaster. Instead you want to use logistic regression. Logistic
regression is what's used for so-called binary outcomes, which have only two values. So in medicine the classic example is death, as you're either dead or alive. Another is whether you have a given disease at a particular point in time, so you either have an infection or you don't. You've either had a heart attack or
you haven't. Note that logistic regression can also be adapted for a categorical outcome variable with
more than two possible values but that is beyond our scope in this course. We'll stick strictly to the
binary case. In this course, we're going to use the diagnosis of diabetes as our worked example. So
either someone has been diagnosed with diabetes or they haven't. Diabetes mellitus, to give it its full name, is of two principal types, though there are others: type 1, which as it happens I have, and type 2, which some of my friends have. Around 90% of cases are type 2, which often has lifestyle as a major contributory cause. It's a huge and growing problem for populations and healthcare services worldwide, not just in rich countries. It's estimated that over 400 million people have it globally, with many cases
undiagnosed. So that's the medical example that we'll be using in this course. There are also various
statistical learning points that we'll cover. For example, assessing how well the regression model fits the
data is done differently from how it's done with linear regression. We'll also look at the huge issue of
deciding which variables to include in our model and which to leave out. In the more technically minded
literature, logistic regression is one method that is applied to so-called classification problems. This is
because we classify patients according to the binary variable and then look at the rest of the data set to
see which patient characteristics are associated more with being a one than with being a zero. For
instance, how do a patient's age and gender affect their chance of having diabetes? Now, you may have
come across what's known as machine learning methods many of which are also used in such
classification problems. The underlying maths behind these methods is much more complicated than the
maths behind logistic regression but sometimes it's worth the extra effort because those methods work
better than regression in practice. It very much depends on the field and the data. Logistic regression
has two big advantages - it's easy to run in standard software and you can examine and describe the
relationships between the variables much more easily. Now, I find it a bit confusing that some researchers include logistic regression in the set of machine learning methods - it's probably done because you need a machine, that is a computer, to do it. I think it's because the alternative to getting your computer to tell you which patients are at the highest risk of the outcome is using your clinical judgment and experience if you are a healthcare professional. Actually, there's a growing and very interesting literature on whether, and in what circumstances, a machine is better than a doctor at predicting an outcome for a given patient. In some cases doctors have been using algorithms for years to help them decide on treatment. Screening programs use a really simple one, based on factors such as age and gender, to decide which patients to invite for breast cancer screening. In the future, we're going to see many more
instances of doctors and algorithms working together and logistic regression is likely to remain a key
part of our algorithm toolkit. Logistic regression has been around for decades and for good reason it
remains the tool of choice when investigating the relation between a set of variables and a binary
outcome such as death or the presence of disease such as diabetes.

Why does linear regression not work with binary outcomes?
Binary outcomes only have two values. The example we are using throughout this course is
diabetes, where individuals either have diabetes or they don’t. For our regression model, we could
code this outcome so that individuals with diabetes = 1 and those without diabetes = 0. If we just ran
a linear regression model with this binary outcome and one continuous predictor variable, then the
model will plot a straight line through these points just as we have seen with simple linear regression
in the course on Linear Regression for Public Health.

The graph on the left shows the relation between the continuous predictor variable (“cpred” on the X
axis) and some continuous outcome variable (“cont_outcome”) on the Y axis. It shows that the
predicted values from the linear regression model (red line) are reasonable for the continuous
dependent variable, even if the model does not explain the relationship very well because lots of the
points are far from the red line. However, the graph on the right clearly shows that the linear model
does not fit the data well at all when the outcome is binary (“outcome” on the Y axis). The predicted
values often correspond to impossible values of the outcome, i.e. values other than 0 or 1. It just
doesn’t make sense.

For a binary outcome, we are generally most interested in modelling the proportion of individuals
with the outcome of interest, i.e. the proportion of individuals with diabetes in our example. This is
equivalent to the probability of an individual having diabetes. Although probabilities are continuous
variables, they can only take values from 0 to 1. However, as we have seen in the graph above
(right), linear models will predict values below 0 and above 1. Luckily, we can transform our variable
of interest into one that can be modelled in a regression equation using something called a link
function.
As its name suggests, a link function describes the relationship that links our variable of
interest to the variable that we use in our regression equation. It's a mathematical trick. The link
function that’s most often used for logistic regression is called the logit. Instead of directly modelling
the probability, we model the logit of the probability. The logit of a probability p is equivalent to the
natural logarithm (log) of the odds (equation below).
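logit(p) = log(odds) = log( p / (1 - p) )

(and, in general, if y = log(x), then x = e^y)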

As the equation above shows, to get back to x from its natural log (y), you raise e to the power of y.
This ‘anti-log’ transformation is known as exponentiating.

The reason we model the log(odds) rather than just the odds as the outcome variable is because it
can take any value from minus infinity (when p = 0) to positive infinity (when p = 1). Odds, on the
other hand, can only take positive values. Using the log(odds) as the outcome variable means that
we can run a regression model in a similar way to normal linear regression with a continuous
variable, and still ensure that the predicted values for probabilities are between 0 and 1 (graph
below).
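In R, this back-transformation from the log odds scale to a probability between 0 and 1 can be sketched with a small helper function (inv_logit is just an illustrative name, not a built-in):

inv_logit <- function(log_odds) exp(log_odds) / (1 + exp(log_odds))
inv_logit(c(-5, 0, 5))   # 0.0067 0.5000 0.9933 - always between 0 and 1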
Odds and odds ratio
Logistic regression is all about odds rather than the probability. But why? What's the difference? Many
laymen mix them up, but as statistical thinkers, you and I can't afford to. Odds describes the expected
number of successes per failure. In gambling, odds are used because it's convenient to make it clear how
much the bookmaker will pay compared with what the gambler has bet. For instance, if a horse has odds
of seven to one, sometimes called seven to one against or just sevens, it means that it has one chance of winning but seven of losing. Over the long term, for every race it wins, it will lose seven. It's therefore much more likely to lose than it is to win. However, if it does win, then a bookmaker would give you seven times what you bet. What's the probability that a horse will win? Well, with seven chances of losing and one of winning, there are 7 + 1 = 8 chances in total. So, the probability of it
winning is therefore one in eight or 12.5 percent, whereas the probability of it losing is seven in eight, or
87.5 percent. But what if the horse is so-called even money to win? So, this means that it's just as likely
to lose as is to win, which is often described as being 50-50. So, 50 chances of winning and 50 chances of
losing. The probability that it will win is therefore 50 divided by 50 plus 50, which is 50 percent, like
tossing a coin. Logistic regression involves modelling the odds rather than a probability. In fact, as we've
seen, it models the natural logarithm of the odds, or log odds for short. The log odds can take negative
values unlike either the odds or the probability, and it can take values above one unlike probability. Both
of those things are necessary in order to make the underlying maths work. More commonly, though, we don't just want to know the log odds of something happening, like having diabetes; we want to know how the log odds varies by some patient factor, such as age. So, are older people more or less likely to have diabetes than younger people? In logistic regression, this is assessed by comparing the log odds of having diabetes in older people with the log odds of having diabetes in younger people. Subtracting the latter from the former gives the log odds ratio. Happily, we can take the antilogarithm of the log odds ratio, a procedure called exponentiating, to get the odds ratio, which is much easier to interpret. This is just one odds divided by another odds. For example, if we divide the odds for older people by the odds for younger people, and the resulting odds ratio is greater than one, it means that older people are more likely to have diabetes than younger people are. So, we've seen that probability and odds are not the same thing, and that the maths of logistic regression works on the log odds scale. Happily, the easy trick of exponentiating the output from a logistic regression model gives us odds ratios, which can
be interpreted. This course will give you lots of practice at doing just that.
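As a quick illustration of those conversions in R (a small sketch using the horse example above, not part of the original notes):

p <- 1/8                  # probability the horse wins: one chance in eight
odds <- p / (1 - p)       # 0.143, i.e. one win for every seven losses ("sevens")
log_odds <- log(odds)     # the scale that logistic regression works on
exp(log_odds) / (1 + exp(log_odds))   # back to the probability: 0.125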

Odds Ratios and Examples from the Literature

As I explained briefly in the video, to get the underlying maths to work when our outcome is binary
and not continuous, the standard approach with logistic regression is to convert the binary outcome
into log odds. Let’s look at that in a bit more detail.

While probabilities are the most intuitive way to model binary variables, if we model the probability as
the outcome variable, we will end up with impossible predictions from the model (i.e. predicted
values below 0 or above 1). The reason we model the log(odds) rather than the odds as the
outcome variable is because odds can only take positive values, and therefore this model would also
give impossible predictions. The log(odds) can take any positive or negative values, and therefore all
predictions from the model are possible.

Logistic regression is a very common tool for public health studies, as many healthcare-
related outcomes are binary measures. Therefore, you are very likely to come across odds ratios
in the public health literature. For example, an excerpt of a table from a study looking at risk factors
for type 2 diabetes is below. You can see that the odds of having diabetes are greater in those who
are overweight or obese than in those who are not overweight, and in those with hypertension
compared with those without hypertension.
Kumari, M., Head, J. and Marmot, M., 2004. Prospective study of social and other risk factors for
incidence of type 2 diabetes in the Whitehall II study. Archives of internal medicine, 164(17),
pp.1873-1880.

Available from: https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/217380

As with all types of statistical analyses, you need to be wary of spurious results. If a public
health study presents huge odds ratios, you need to interpret these results with caution. As you can
see from the table below, as the probability increases, the odds increase in a non-linear fashion.
This means that a relatively small increase in probability can lead to a large increase in odds.

p      2p     Odds   2*odds
0.01   0.02   0.01   0.02
0.02   0.04   0.02   0.04
0.03   0.06   0.03   0.06
0.04   0.08   0.04   0.08
0.05   0.10   0.05   0.10
0.10   0.20   0.11   0.22
0.20   0.40   0.25   0.50
0.30   0.60   0.43   0.86
0.40   0.80   0.67   1.33
0.50   1.00   1.00   2.00
0.60   1.20   1.50   3.00
0.70   1.40   2.33   4.67

Notice also another reason for dealing with odds rather than probabilities: when you double a
probability you can get values above 1, which are impossible, but odds above 1 are completely
sensible.
PREPARING DATA FOR LOGISTIC REGRESSION
Now, you've seen what odds and odds ratios are in theory, it's time to generate
some in practice. But before you dive straight in and fit the model on some data,
there are various preparations you have to make. You don't make a cake by throwing all the ingredients into the oven without cleaning your equipment, or checking that the ingredients haven't in fact gone off because they've been in the cupboard for five years. Or at least you shouldn't make a cake like that. So it is with logistic regression modelling. For this course, as with the others in the series, we're using the software R and the interface RStudio. Instructions for downloading and installing both of these totally free products are given elsewhere. So, having opened R, you
will need to import the dataset that we're using in this course. Then, you'll need to
get a sense of how big the data set is. What types of variables it contains. What sets
of values each variable has and whether there's anything odd about them. For
categorical variables, how many different values do they have? How common is
each one? As uncommon categories can cause problems. For continuous variables,
what distribution do they follow? How might you summarise that distribution? If it's
normal, you would report the mean and standard deviation. But if it's skewed or
weird in some way, it's more usual to report the median and interquartile range.
That's standard advice for preparing to run any kind of regression or indeed any kind of data analysis. However, you're doing logistic regression because you have a binary outcome variable. So the next thing to do is some cross-tabulations. You'll
need to see how each variable cross-tabulates with the outcome variable. For
example, in your case, your outcome is having diabetes. So, is it more common in
males or in females? Does it vary by whether the patient is obese? But how should
you investigate the relation between age and diabetes? You may be tempted just to
do a cross-tabulation as with gender or the other categorical variables. But this will give you a huge table, unless the clinic only serves patients within a narrow age range, for instance in pediatrics. There are two better ways. The first is to plot age in single years against the log odds of having diabetes to see if the relation is linear, or curved, or something weirder. Before you do this, though, it's worth just
plotting the distribution of age with a histogram that gives the number of patients of
each age rather than the proportion. This tells you how many ages have few
patients with them. If some ages only have a handful of people with them, then you
get a lot of jagged lines or noise due to small numbers. This makes it harder for you
to judge the relation. That would mean you have to do it the second way. Here, you
combine the uncommon ages and make age groups. You can then use those groups
in the plots or if you don't have very many groups in a cross tabulation as you did
with gender. It may take some trial and error to get the groups right. But a word of
caution here, if you make the groups too wide, the extreme is to have just two -
young and old, then you lose a lot of information. So before leaping into logistic
regression, you need to get to know each variable in your data set, and how it relates to your outcome variable in the set of patients that you have, because every data set is different, which is one reason why different research studies get different results for the same question. There's no substitute for getting to know your data. To import it into R, I then typed and ran:

g <- read.csv(file = "C:/Users/rab97/Documents/Alex work/diabetes data for R (csv).csv",
              header = TRUE, sep = ',')

The “header=TRUE” option tells R that the first row in the file contains the column names. If you
downloaded the data set to another location, you’ll just have to amend the above “file=” option. Now
run this:

dim(g)

[1] 403  23

So you have 403 rows, in this case patients, and 23 columns. To find out what R thinks the columns are called, you can use the function colnames():

colnames(g)

[1] "id"        "chol"      "stab.glu"  "hdl"       "ratio"     "glyhb"
[7] "location"  "age"       "gender"    "height"    "weight"    "frame"
[13] "bp.1s"     "bp.1d"     "bp.2s"     "bp.2d"     "waist"     "hip"
[19] "time.ppn"  "insurance" "fh"        "smoking"   "dm"
An alternative that you might see online is:
dimnames(g)[[2]]
There are two parts to the dimnames object: names of the rows, which is generally not useful, and
names of the columns, which definitely is. You just want the second of those, hence the “[[2]]” bit.
This gives the same output.

With this data set, there is some documentation online at http://biostat.mc.vanderbilt.edu/wiki/Main/DataSets under “Diabetes data”. http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/Cdiabetes.html gives the list of fields and what the categories mean. It doesn’t tell the complete story, though, especially around the units of some of the fields. It’s also missing a few fields. That’s not surprising, as I’ve derived the outcome variable “dm” based on the usual HbA1c threshold of 7.5 and have simulated a few extra:

insurance: 0=none, 1=government, 2=private


fh = family history of diabetes (yes/no, where 1=yes, 0=no)
smoking: 1=current, 2=never and 3=ex
To do any analysis, my preference is to make one variable per column rather than refer directly to
the column within the data set every time. Not all of the columns are very interesting, though, so
don’t bother with “id” for example. We’re not going to use it. When doing this, you need to tell R
which variables are categorical, as it will assume they’re all continuous by default:
chol <- g[,"chol"] # cholesterol is continuous, so it’s easy
gender <- as.factor(g[,"gender"]) # but gender isn’t
dm <- as.factor(g[,"dm"]) # neither is dm
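If you also want R to show the category labels for the coded variables listed above (insurance, fh and smoking) rather than the raw numbers, one way is to supply the labels when creating the factors. This is a sketch, not part of the original notes:

# attach the value labels given in the online documentation
insurance <- factor(g[,"insurance"], levels = c(0, 1, 2),
                    labels = c("none", "government", "private"))
fh <- factor(g[,"fh"], levels = c(0, 1), labels = c("no", "yes"))
smoking <- factor(g[,"smoking"], levels = c(1, 2, 3),
                  labels = c("current", "never", "ex"))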

Note the use of the hash (#) to add comments – I use a lot of comments in all my programs, and it’s
good practice. It’s much quicker and safer to write out in words what your code is trying to do than
try to work it out weeks or months later when you’ve forgotten all about what you were doing at that
time.

To see how many males and females we have, you can use the “table” command. It’s worth also
getting the total:

t <- table(gender) # store the tabulation for further manipulation

addmargins(t) # this will sum up the gender totals to give an overall total and print the results
Annoyingly, it doesn’t give the percentages of any of the categories, a really basic analysis task.
With the tabulation stored as an R object, though, we can do this quite simply using the “prop.table”
command:

round(prop.table(t),digits=3) # get proportions rounded to 3dp

gender
female male
0.581 0.419
Or, even easier on the brain:

> round(100*prop.table(t),digits=1) # get %s rounded to 1dp


gender
female male
58.1 41.9
So 58.1% of our population are female. One decimal place for percentages is enough, and often
we’ll round to the nearest integer, particularly for large percentages.

Irritatingly, however, “table” excludes missing values by default. To see these – and we ALWAYS
want to see these – use an “exclude=NULL” option when making the variable:

> dm2 <- factor(dm, exclude=NULL) # make new factor from the old one
> table(dm2) # display the counts including the missings (NAs)
dm2
no yes <NA>
330 60 13
For continuous variables, though, use “summary”:

summary(chol)
> summary(chol)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
78.0 179.0 204.0 207.8 230.0 443.0 1
The “summary” command does give us missing values (“NAs”) by default, which is perfect. We can
see that the median cholesterol is similar to the mean, so the distribution is likely roughly
symmetrical. There’s a large range.

Let’s continue with the other variables. For instance, for weight and height:

height <- g[,'height']


weight <- g[,'weight']
summary(height)
summary(weight)
> summary(height)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
52.00 63.00 66.00 66.02 69.00 76.00 5

How to calculate body mass index (BMI) from height and weight

As this is a US data set, height is in inches and weight is in pounds. Neither height nor weight are
particularly useful by themselves, however, so it’s common to combine them into the body mass
index (BMI), which is weight divided by the square of height; both measures need to be in SI units,
i.e. kilograms and metres, so we need to convert:

height.si <- height*0.0254


weight.si <- weight*0.453592
bmi <- weight.si/height.si^2

Now we can summarise BMI:

> summary(bmi)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
15.20 24.13 27.80 28.79 32.24 55.79 6
How to make a categorical variable from a continuous one

For display purposes and also because thresholds are used in clinical decision-making, it’s useful to
categorise continuous variables even though this loses information. For instance, it’s very common
to categorise BMI. One reason is so that public health agencies can track the numbers of people and
proportion of the population who are obese, for example. Tracking the mean or median BMI for the
population won’t tell you this. There are several ways to categorise BMI in R. Here’s one way. Let’s say we want groups for underweight [<18.5], normal [18.5-25], overweight [>25-30] and obese [>30]; a morbidly obese category also exists but let’s ignore that here.

bmi_categorised <- ifelse(bmi < 18.5, "underweight",
                   ifelse(bmi >= 18.5 & bmi <= 25, "normal",
                   ifelse(bmi > 25 & bmi <= 30, "overweight",
                   ifelse(bmi > 30, "obese", NA))))

# check that the bmi_categorised variable has worked


table(bmi_categorised, exclude = NULL)

## bmi_categorised
## normal obese overweight underweight <NA>
## 113 152 123 9 6
This makes it easy to see whether obese people are overrepresented in those with diabetes via a
cross-tabulation:

# frequencies of diabetes by BMI category


dm_by_bmi_category <- table(bmi_categorised, dm2, exclude = NULL)

# check
dm_by_bmi_category

## dm2
## bmi_categorised no yes <NA>
## normal 100 9 4
## obese 118 29 5
## overweight 99 20 4
## underweight 9 0 0
## <NA> 4 2 0

# with the row percentages


round(100 * prop.table(dm_by_bmi_category, margin = 1), digits = 1)

## dm2
## bmi_categorised no yes <NA>
## normal 88.5 8.0 3.5
## obese 77.6 19.1 3.3
## overweight 80.5 16.3 3.3
## underweight 100.0 0.0 0.0
## <NA> 66.7 33.3 0.0
##### Here is the R code to do the cross-tabulations and the resulting output

# creating "age" variable


age <- g[,"age"]

# creating a categorical variable "age_grouped"


age_grouped <- ifelse(age < 45, "under 45",
ifelse(age >= 45 & age < 65, "45 - 64",
ifelse(age >= 65 & age < 75, "65 - 74",
ifelse(age >= 75, "75 or over", NA)))
)

# displaying new variable in a table


table(age_grouped, exclude = NULL)

## age_grouped
## 45 - 64 65 - 74 75 or over under 45
## 139 41 23 200

# cross tabulating with gender


age_group_by_gender <- table(age_grouped, gender, exclude = NULL)

# display the cross tabulation


age_group_by_gender

## gender
## age_grouped female male
## 45 - 64 75 64
## 65 - 74 21 20
## 75 or over 12 11
## under 45 126 74
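If you also want the row percentages for this cross-tabulation, the same prop.table trick used earlier applies (not shown in the original output):

round(100 * prop.table(age_group_by_gender, margin = 1), digits = 1)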

Logistic regression in R
You've heard about logistic regression in theory so how did you do it in practice?
I'm going to talk you through the things to check and the basics of how to do it
using our package of choice, R. The first thing sounds obvious, but like many
obvious things, it needs to be said. You first need to check whether your outcome
is in fact binary. Just tabulate it, for instance, by using the table command. If it has more than two values, then you've got a problem and a decision to make. Why does it have more than two? Maybe it's just a handful of patients with values that have been entered wrongly. If so, you can safely just exclude the
affected patients. If there are a lot of values, then you can consider combining
two of the values but only if it makes sense to do so. If it doesn't make sense to
combine groups or if you don't want to lose information, which always happens
when you combine categories, then consider something like ordinal regression,
which is beyond the scope of this course. In the example in this course, however,
diabetes is a yes-no variable. Based on a threshold HbA1c value, it will be binary unless HbA1c is missing. In what's called simple logistic regression, which I'm
going to explain now, you have just one predictor. An example of when this is
useful is if you want to look at time trends and test whether there is a significant
linear trend in your outcome. I'll come back to that keyword, linear, in a minute.
For instance, has the rate of diabetes recently been getting bigger or smaller over
time? To run logistic regression in R, you need to use the GLM command. As a
minimum, you need to tell R what your outcome variable is, what your predictor
or predictors are, what distribution you want to assume for the outcome variable and which link function you want. With GLM, you can run other kinds of regression too, which is why you have to tell it that the distribution is the binomial, achieved by the family equals binomial option. The link function says how you want to transform the outcome variable in order to make the maths work, so that you get an equation whose right-hand side is just the sum of one or more
predictors. The link function that's generally used in logistic regression is the logit.
This means you take the probability of the outcome happening and turn it into
the log odds, which you came across earlier in the course. There are other choices
of link function that are more appropriate if your outcome variable really
represents a continuous one or counts that you've just forced to be either 0 or 1,
but I won't go into them in this course. So I will now discuss the predictors. Your predictors can be categorical or continuous. If categorical, you do not need an equal number of observations in each category, but categories with very small
numbers can cause problems, as we'll see later. If continuous, they do not need to
be normally distributed. That's something a lot of students get wrong. There is
something you need to assume though. For a continuous variable or one that you
are essentially treating as continuous, for example a year, you are assuming that
the relation between the variable and the outcome is linear. For example,
suppose you want to know how diabetes risk varies with age, and you have age in
whole years rather than its categories. If you plot the rate of diabetes by age, you
are assuming, and R is assuming, that the diabetes rate changes by the same
amount for every one unit increase in age, whether it goes up with age, goes
down with age or is flat. No curve is allowed. The relation is linear. It's important
to test that your data fit this assumption by plotting the data first and then fitting a model. Don't just hope for the best. Assumptions matter in statistics as well as in life. If the relation on your plot looks rather more curved than straight, then
maybe a line isn't a good approximation. So in that case, you will need to try some
other shapes, for instance, by adding a squared term to the model. If your single
predictor is age, then this would mean including not just a term for age but also a
term for age squared and testing whether that square term is statistically
significant via its p-value. There are fancier things that you can do, but those are
the basics. So those are the key elements to fitting a simple logistic regression
model, for instance, with a binary outcome variable such as diabetes and a single
predictor such as age in R, using the GLM commands. Why don't you have a go?

Practice in R: Simple Logistic Regression

Simple logistic regression: how to run a model with only one predictor

The simplest model we can fit is one with no predictors whatsoever in it. This is called the empty or
null model and is rarely useful. It has the assumption that everyone has the same odds of having the
outcome (diabetes in our example). The R command to do this would be:

glm(dm ~ 1, family=binomial (link=logit))


The “1” is just R’s way of saying that there’s only an intercept term in the model. To get the output,
though, you need the “summary” command as well. To do this, you need to make an R object out of
the model. I’ve called this object “m”. We can then summarise it:

m <- glm(dm ~ 1, family=binomial (link=logit))


summary(m)
By the way, it’s worth checking how R has interpreted the binary outcome “dm”. If you run this...

table(m$y)
...you see that there are 60 1s and 330 0s, which is good because there were 60 yesses and 330
noes in the “dm” vector. It’s important to know that R is modelling the log odds of dm=1 and not the
log odds of dm=0!

Much more useful than the null model above is to see how the chance of having diabetes depends
on one or more predictors. The next simplest model is one with one predictor. Let’s first look at
gender, which in our data set is binary: male or female.

We’ve already told R that gender is a factor (in its language), that is to say, a categorical variable (in
our language). That was the point of the “as.factor” commands earlier. To include it in the model,
type:
m <- glm(dm ~ gender, family=binomial (link=logit))
summary(m)
This means we are saying that the log odds of having diabetes differs by gender alone. To include a
continuous variable in the model instead, such as age (note that we are treating age as continuous,
just rounded down to the nearest birthday), then the code is similar:

m <- glm(dm ~ age, family=binomial (link=logit))


summary(m)
R will know to fit age as a continuous variable because you haven’t told it that it’s
categorical. More on creating categorical variables from continuous ones later.

It’s straightforward to include age as a single term in the model, but remember what I said in the
video about assuming a linear relation with the outcome? More precisely, this assumes that the
relation between age and the log odds of having diabetes is linear (more on this in detail in the next
section). Is that reasonable? The easiest way is just to plot one against the other.

# create a cross tabulation of age and diabetes status


dm_by_age <- table(age, dm)

# output the frequencies of diabetes status by age


freq_table <- prop.table(dm_by_age, margin = 1)

# calculate the odds of having diabetes


odds <- freq_table[, "yes"]/freq_table[, "no"]

# calculate the log odds


logodds <- log(odds)

# plot the ages found in the sample against the log odds of having diabetes
plot(rownames(freq_table), logodds)
Feedback - Output and Interpretation from Simple Logistic Regression

How did you get on with that?

First, I’ll go through what the output means when you had just age in the model and then will do the
same for gender. Actually, before even that, I’ll show you what you would have got if you had run the
empty model. If you had typed and run this…

m <- glm(dm ~ 1, family=binomial (link=logit))


summary(m)
…you would have got this…

> m <- glm(dm ~ 1, family=binomial(link=logit))


> summary(m)

Call:
glm(formula = dm ~ 1, family = binomial(link = logit))

Deviance Residuals:
Min 1Q Median 3Q Max
-0.578 -0.578 -0.578 -0.578 1.935

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.7047 0.1403 -12.15 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 334.87 on 389 degrees of freedom


Residual deviance: 334.87 on 389 degrees of freedom
(13 observations deleted due to missingness)
AIC: 336.87

Number of Fisher Scoring iterations: 3

If you’ve never run any kind of regression model in R, this will likely be a blur to you. For now at
least, much of it can be ignored. I couldn’t care less how many Fisher scoring iterations it took. The
bits about deviance residuals and deviance are important, though, because they tell you how well
your model fits the data, but I want to leave the concept of data fit for later in this course.

The main thing of interest is the coefficients, but first there’s another bit of info that R has sneaked in
that is worth noting. It’s in brackets but it’s important. 13 observations were deleted due to
missingness. For 13 of our patients, we don’t know whether they had diabetes, so they’ve been
excluded. Luckily, that’s not a large number and it’s not a large proportion of our sample, so we can
just note that down and move on to look at the coefficients.

Here, there’s just one coefficient: the intercept. R prints out all the coefficients on the scale on
which the algorithm did its magic, i.e. the log scale in the case of logistic regression as we are
modelling the log odds of having diabetes. With this rather unexciting null model, we are saying that
the log odds of having diabetes is -1.7047 and that it’s the same for every patient. What does that
mean? To interpret this, we first need to exponentiate it to get the odds of having diabetes. To do
this, type:

exp(-1.7047)
…and you’ll get 0.182 to three decimal places. If you prefer to work in probabilities rather than odds,
you can use the relation between odds and probability that we established earlier to convert. So, just
divide the odds by 1 plus the odds, to give 0.182/1.182 = 0.15, or 15%. How do these compare with
the raw data? If you type…

table(dm)
> table(dm)
dm
no yes
330 60
Using these numbers, the odds of having diabetes is 60/330 = 0.182 and the probability is
60/(330+60) = 0.15, both exactly the same as from the model, which is entirely as we had expected
(and hoped!).
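As a quick sanity check, here is that arithmetic in R (a small sketch using the numbers above):

odds <- exp(-1.7047)   # 0.182
odds / (1 + odds)      # 0.154, i.e. about 15%
60 / 330               # crude odds from the raw counts: 0.182
60 / (330 + 60)        # crude probability from the raw counts: 0.154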

Now let’s add age. When you typed and ran this…

m <- glm(dm ~ age, family=binomial (link=logit))


summary(m)
…you should have got this…

> m <- glm(dm ~ age, family=binomial (link=logit))


> summary(m)

Call:
glm(formula = dm ~ age, family = binomial(link = logit))

Deviance Residuals:
Min 1Q Median 3Q Max
-1.3612 -0.5963 -0.4199 -0.3056 2.4848

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.404530 0.542828 -8.114 4.90e-16 ***
age 0.052465 0.009388 5.589 2.29e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 334.87 on 389 degrees of freedom


Residual deviance: 299.41 on 388 degrees of freedom
(13 observations deleted due to missingness)
AIC: 303.41

Number of Fisher Scoring iterations: 5

We still have 13 observations that were deleted due to missingness. This shouldn’t be a surprise,
because our descriptive statistics earlier in this course told us that 13 patients had no HbA1c
readings – and hence no diabetes information – and no patients had their age missing in the data
set. As you add more variables to the model, you’ll need to keep an eye on this.

Let’s go on and look at the coefficients. This time there are two: the intercept and one for age. Now,
with a predictor (age in this case) in the model, the intercept is no longer the overall crude log odds
but is instead the log odds of diabetes when age is zero. This follows from the equation for the
model:

Log odds of having diabetes = intercept + (coefficient for age) * age in years
                            = -4.4045 + 0.0525 * age in years

If age in years is zero, then we only have the intercept left. At birth, the model is saying that the log
odds of having diabetes is -4.4045, which is 0.012 when exponentiated to give us the odds. If you
prefer to think in probabilities, then we can convert this as before to give us 0.012/1.012 = 0.012 (to
three decimal places) or 1.2%, which is pretty much the same as the odds. As you saw earlier, when
odds are small, they’re really similar to probabilities.

So how do we interpret the coefficient for age? It’s the increase in log odds of having diabetes for
a one-year increase in age. A linear relation is assumed between age and the log odds. It’s
assumed, therefore, that the log odds if you’re 25 is 0.0525 higher than if you’re 24 and that the log
odds if you’re 75 is 0.0525 higher than if you’re 74. One of the nice things about working on the log scale is that a difference between two log odds is mathematically the same as the log of the ratio of the two odds:

 the log odds if you’re 25 minus the log odds if you’re 24 is 0.0525
 the log of (the odds if you’re 25 divided by the odds if you’re 24) is also 0.0525

When we exponentiate 0.0525 we get 1.05 (to two decimal places). This is an odds ratio, which
is generally what’s reported when running logistic regression models (and is generally reported to
two decimal places). It’s the ratio of the odds of having diabetes if you’re 25 divided by the odds of
having diabetes if you’re 24. Or that if you’re 75 divided by that if you’re 74 etc. It’s the amount by
which your odds increases when you get a year older. So getting older is bad news, at least in terms
of diabetes, which is what we expected.

But wait – we haven’t yet checked to see whether this is statistically significant or merely compatible
with a chance result. The p value for age is given in the “Pr(>|z|)” column and is really tiny. R also uses an asterisk system to flag the size of the p values. Age has three asterisks, meaning its p value is close to zero, so the result is not compatible with chance alone. Age is a statistically significant predictor.
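If you also want a confidence interval around that odds ratio, one quick option (a sketch using base R's Wald intervals; the course may use a different approach) is:

exp(coef(m))             # intercept and age coefficient as odds (ratios)
exp(confint.default(m))  # Wald 95% confidence intervals, exponentiated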

For gender, you would type this:

m <- glm(dm ~ gender, family=binomial (link=logit))


summary(m)
m <- glm(dm ~ gender, family=binomial (link=logit))
> summary(m)

Call:
glm(formula = dm ~ gender, family = binomial(link = logit))

Deviance Residuals:
Min 1Q Median 3Q Max
-0.5915 -0.5915 -0.5683 -0.5683 1.9509
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.74150 0.18592 -9.367 <2e-16 ***
gendermale 0.08694 0.28352 0.307 0.759
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 334.87 on 389 degrees of freedom


Residual deviance: 334.78 on 388 degrees of freedom
(13 observations deleted due to missingness)
AIC: 338.78

Number of Fisher Scoring iterations: 4

In this data set, gender comes labelled as male and female, so the coefficient for gender here is
printed as “gendermale”. If your data set has gender coded as 1 and 2, for example, then you’ll need
to refer to the documentation for that data set to see what 1 and 2 mean. Here, the log odds for
having diabetes for males is 0.0869 higher than that for females. This is also the log odds ratio for
males compared with females. Again, if we exponentiate 0.0869, we get the odds ratio for males
compared with females, which is 1.09 (to two decimal places). That implies higher odds for males,
but we need to inspect the p value, which is 0.759. That’s pretty high and well above the
conventional threshold of 0.05, so chance is a likely explanation for the result and we can conclude
that we don’t have any good evidence of a gender difference in diabetes odds in this sample.

While in this data set, gender is nicely labelled, it’s a good idea in general to check how R has
entered gender into the model. Do this by typing:

contrasts(gender)
…and you’ll get:

> contrasts(gender)
male
female 0
male 1
This confirms that the coefficient given in the output refers to males because males have a 1 next to
them in the above output and females have a zero. The log odds for females are incorporated into
the intercept.

Suppose you didn’t want to compare males relative to females but instead the reverse? How can
you get R to give you the odds ratio so that it’s the odds for females divided by the odds for
males?
R will by default organise values (called levels) of categorical variables alphabetically. You can
check the order like so:

levels(gender)
## [1] "female" "male"
So by default, “female” is the first level. R will automatically set this as the reference group in
statistical analyses, such that the odds ratios of other groups will be displayed relative to this one.
Remember the table in section 1.07? The odds of having diabetes with hypertension is compared
with the reference group (not having hypertension). This makes sense, as we would hope that not
having hypertension would be the “default” state. The default gender is, however, more arbitrary, so
it may make sense to redefine the reference group:

We can use the function relevel() to do this:

gender <- relevel(gender, ref = "male")


levels(gender)
## [1] "male" "female"
m <- glm(dm ~ gender, family=binomial (link=logit))
summary(m)
##
## Call:
## glm(formula = dm ~ gender, family = binomial(link = logit))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5915 -0.5915 -0.5683 -0.5683 1.9509
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.65456 0.21404 -7.730 1.08e-14 ***
## genderfemale -0.08694 0.28352 -0.307 0.759
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 334.87 on 389 degrees of freedom
## Residual deviance: 334.78 on 388 degrees of freedom
Now “genderfemale” is the coefficient, which represents the log odds ratio of diabetes for females compared with males. Notice that the estimate is the same, except negative. This makes sense because of the rule:

log(A/B) = −log(B/A)
The log odds ratio of having diabetes if you’re male compared with female (A/B) is -1 * the log odds ratio of having diabetes if you’re female compared with male (B/A).
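A quick numerical check of that rule using the estimates above:

exp(0.08694)    # odds ratio for males relative to females: 1.09
exp(-0.08694)   # odds ratio for females relative to males: 0.92 = 1/1.09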

Lastly, it’s useful to know that the glm command produces an R object with various useful
elements, identified using the dollar symbol $, that can be manipulated and exported. That’s
the reason why we store the glm output as an object, m. For instance, to see the model’s
coefficients, type:

m$coefficients
exp(m$coefficients)

Describing your Data and Preparing to Run Multiple Logistic Regression

I’d now like to extend the simple logistic regression case, in which we just have one predictor, to the multiple regression case, in which we have multiple predictors – that’s why it’s called “multiple regression”.

You will unfortunately come across many medical papers that use the phrases “univariate analysis”
or “univariate regression” for simple regression and “multivariate analysis” or “multivariate
regression” for multiple regression (whether it be linear, logistic or other types). This is wrong.
Multivariate regression does actually exist but refers to the case of having multiple outcome
variables that you are modelling at the same time in the same model, and that’s rarely the case.
Most public health analysts or even statisticians will never run a multivariate regression in their entire
career. The last time I did one was during my Master’s degree. There are other kinds of multivariate
analysis such as principal component analysis, which are quite different, so it’s best to be clear
about terminology.

To prepare to run a multiple logistic regression model, you need to get to know each of the predictor
variables that you are planning to put into the model. The important question of how to decide which
variables to enter will come later in this specialisation.

You should now be able to summarise categorical and continuous variables by themselves using
“table” and “summary” respectively. For the continuous variables, it’s also useful to see the whole
distribution. For age, which is actually discrete when rounded to the number of birthdays as is usual
but can be considered continuous, you can plot a histogram. The simplest way to do this is with the
“hist” function. Try typing:

hist(age)
You’ll get this if you copy the plot to the clipboard as a bitmap (found under the Export menu in the
plot window in RStudio):

This is the default. It counts records – as we have one record per patient, it’s effectively counting
patients – and uses default “bins” (groups) for age, so it’s not very elegant, but it does the job in
terms of showing you the distribution. If you don’t like the default bins, i.e. the way it has grouped
age, then you can change it using the “breaks” argument in the “hist” command:
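The notes show the resulting plots rather than the code, so here is a minimal sketch of what the “breaks” call might look like (the exact break points are an assumption and must span the range of the data):

# 5-year age bins
hist(age, breaks = seq(15, 95, by = 5))

# unequal bins: hist() then plots probability density rather than frequency
hist(age, breaks = c(15, 25, 45, 65, 75, 95))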
So what’s a probability density? When the bins are of equal width, then the height of each column in
the plot reflects the frequency, i.e. it counts the number of patients. When the bins aren’t of equal
width, then the area of the column is proportional to the frequency. I think I prefer the frequency one,
but it’s a question of personal choice. I think the plots with bins of either 5 or even 10 years are good enough to show the distribution, though of course the one with bins of 5 gives more information.
Histograms are affected by the choice of bins, so some people prefer to use fancier plots instead to
describe the distribution, such as kernel density plots (also known simply as density plots):

d <- density(age)
plot(d, main = "") # gives warnings but the "main" argument suppresses the ugly default title
Rather than a set of blocky columns, you see a curve. This curve smooths out the noise
using a method called kernel smoothing (hence the name kernel density plot), which uses a
weighted average of neighbouring data - i.e. of frequencies for ages just above and just below each
age. The “bandwidth” mentioned in the above plot reflects the amount of data (i.e. ages above and
below each age) used during the averaging. The details aren’t important for using the method. It’s
simple enough to do in R that you could argue that you would be better off using it for continuous
variables than histograms. For age in our example, I don’t think it matters at all, as long as you
remember that the density plot involves smoothing – a kind of modelling – and so gives values on
the graph, e.g. ages under 19 and over 92, that don’t actually have values in the real data set,
whereas the histogram displays only what’s actually in the data.

Share and Reflect: Describing Variables and R Analyses

Questions for you to discuss among yourselves:
 What was the shape of the distributions of each of the five variables age, gender, cholesterol, BMI
and HDL?
 Were the continuous ones normally distributed? Does that matter?
 What was the relation between each of the five variables and the log odds of diabetes? How did
you do these plots? What kind of grouping did you do in order to plot the relations?

 Feedback

 Before looking at the relations between each variable and the outcome, you first need to
describe each potential predictor. This is done by plotting distributions and running frequency
tabulations. In this feedback, I'll first show you the distribution plots and then the relations
between each predictor and the outcome. Lastly, I'll go through the correlations between the
predictors.
 Shape of the distributions of the five variables
 Here’s what I got with cholesterol. There was a missing value, which I had to exclude before
calling the “density” function, or it gives an error. The easiest is gender – it has only two
values so you could call it bimodal, but it isn’t really worth commenting on. We saw before
that age was like the normal but with a skew to the right.
summary(chol)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
78.0 179.0 204.0 207.8 230.0 443.0 1

chol.no.na <- chol[is.na(chol) == 0]
d <- density(chol.no.na)
plot(d, main = "")
This looks reasonably normal. Here’s the same plot for HDL:

Another that’s a bit skewed to the right. Here’s the one for BMI:
Assessing crude relations between predictors and the outcome

The first one I'll go through below is gender. This one is straightforward as you can just do a cross-
tabulation. I'll first give you the plots, mostly to show how useless they are for variables like this.
Then I'll do the plot for age. As you're reading through these, think about which are the most
appropriate for presenting in a report or paper.

# define the gender variable


gender <- as.factor(g[,"gender"])

# cross tabulation
dm_by_gender <- table(gender, dm) # not including NA values because there aren't that many

# proportion of diabetes status by gender


dm_by_gender_prop <- prop.table(dm_by_gender, margin = 1)

# calculate the odds of having diabetes by gender


odds_gender <- dm_by_gender_prop[, "yes"]/dm_by_gender_prop[, "no"]

# calculate the log odds


logodds_gender <- log(odds_gender)

# plot the log odds of having diabetes by gender


dotchart(logodds_gender)
Here is the dot chart for gender. It’s not very useful.
This next chart draws lines instead of dots and it’s also not very useful.

plot(as.factor(names(logodds_gender)), logodds_gender)
Now, plot the relation between age and the outcome. The first one plots age by the individual year
and the second one puts it into four groups.

# define the age variable (continuous)

age <- g[,"age"]

# create a cross tabulation of age and diabetes status

dm_by_age <- table(age, dm) # not including NA values because there aren't that many

# output the frequencies of diabetes status by age


dm_by_age_prop <- prop.table(dm_by_age, margin = 1)

# calculate the odds of having diabetes


odds_age <- dm_by_age_prop[, "yes"]/dm_by_age_prop[, "no"]

# calculate the log odds


logodds_age <- log(odds_age)

# plot the ages found in the sample against the log odds of having diabetes
plot(rownames(dm_by_age_prop), logodds_age)
# age grouping: converting a continuous variable to a categorical (ordinal) one
age_grouped <- ifelse(age < 45, "under 45",
ifelse(age >= 45 & age < 65, "45 - 64",
ifelse(age >= 65 & age < 75, "65 - 74",
ifelse(age >= 75, "75 or over", NA)))
)

age_grouped <- factor(age_grouped, levels = c("under 45", "45 - 64", "65 - 74", "75 or over"))

# create a cross tabulation of age and diabetes status


dm_by_age_grouped <- table(age_grouped, dm)

# output the frequencies of diabetes status by age


age_grouped_prop <- prop.table(dm_by_age_grouped, margin = 1)

# calculate the odds of having diabetes


odds_age_grouped <- age_grouped_prop[, "yes"]/age_grouped_prop[, "no"]

# calculate the log odds


logodds_age_grouped <- log(odds_age_grouped)

# plot the age groups found in the sample against the log odds of having diabetes
dotchart(logodds_age_grouped)
Now let’s plot the relation between cholesterol and the outcome. The first one plots cholesterol by
individual value and the second one puts it into three groups.

# define chol as a continuous variable


chol <- g[,"chol"]

# create a cross tabulation of cholesterol and diabetes status


dm_by_chol <- table(chol, dm) # not including NA values because there aren't that many

# output the frequencies of diabetes status by cholesterol


dm_by_chol_prop <- prop.table(dm_by_chol, margin = 1)

# calculate the odds of having diabetes


odds_chol <- dm_by_chol_prop[, "yes"]/dm_by_chol_prop[, "no"]
# calculate the log odds
logodds_chol <- log(odds_chol)

# plot the cholesterol found in the sample against the log odds of having diabetes
plot(rownames(dm_by_chol_prop), logodds_chol, xlim = c(150, 300))

# categorising chol into an ordinal variable

# https://www.medicalnewstoday.com/articles/315900.php
chol_categorised <- ifelse(chol < 200, "healthy",
ifelse(chol < 240, "borderline high",
ifelse(chol >= 240, "high", NA)))

# make sure that it is treated as a factor/categorical variable, ordering the levels within the factor for the table
chol_categorised <- factor(chol_categorised, levels = c("healthy", "borderline high", "high"))

# create a cross tabulation of cholesterol and diabetes status

dm_by_chol_categorised <- table(chol_categorised, dm) # not including NA values because there aren't that many

# output the frequencies of diabetes status by cholesterol

dm_by_chol_categorised_prop <- prop.table(dm_by_chol_categorised, margin = 1)

# calculate the odds of having diabetes


odds_chol_categorised <- dm_by_chol_categorised_prop[, "yes"]/
dm_by_chol_categorised_prop[, "no"]

# calculate the log odds


logodds_chol_categorised <- log(odds_chol_categorised)

# plot the cholesterol categories found in the sample against the log odds of having diabetes
dotchart(logodds_chol_categorised)
You can do the same thing for HDL as we have just done for cholesterol.
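Here is a sketch of that for HDL, following exactly the same pattern as cholesterol (using the "hdl" column listed earlier; this code is not part of the original notes):

# define HDL as a continuous variable
hdl <- g[,"hdl"]

# cross tabulate, convert to log odds and plot, as before
dm_by_hdl <- table(hdl, dm)
dm_by_hdl_prop <- prop.table(dm_by_hdl, margin = 1)
logodds_hdl <- log(dm_by_hdl_prop[, "yes"] / dm_by_hdl_prop[, "no"])
plot(rownames(dm_by_hdl_prop), logodds_hdl)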

Here is the code to show the relation between BMI and diabetes. This puts BMI into four categories.

#bmi
height <- g[,"height"]
weight <- g[,"weight"]
height.si <- height*0.0254
weight.si <- weight*0.453592
bmi <- weight.si/height.si^2

# categorising BMI

bmi_categorised <- ifelse(bmi < 18.5, "underweight",
                     ifelse(bmi >= 18.5 & bmi <= 25, "normal",
                       ifelse(bmi > 25 & bmi <= 30, "overweight",
                         ifelse(bmi > 30, "obese", NA))))
# make sure that it is treated as a factor/categorical variable and ordering the levels within the factor for the table
bmi_categorised <- factor(bmi_categorised, levels = c("underweight", "normal", "overweight", "obese"))

# create a cross tabulation of BMI and diabetes status


dm_by_bmi_categorised <- table(bmi_categorised, dm) # not including NA values because there aren't that many

# output the frequencies of diabetes status by BMI


dm_by_bmi_categorised_prop <- prop.table(dm_by_bmi_categorised, margin = 1)

# calculate the odds of having diabetes


odds_bmi_categorised <- dm_by_bmi_categorised_prop[, "yes"]/dm_by_bmi_categorised_prop[, "no"]

# calculate the log odds


logodds_bmi_categorised <- log(odds_bmi_categorised)

# plot the BMI categories found in the sample against the log odds of having diabetes
dotchart(logodds_bmi_categorised)
Note that in the above graph, there is no dot for underweight: none of the underweight patients had diabetes, so the odds are zero and the log odds are minus infinity, which cannot be plotted.

As you can see, R can give you a lot of plots, only some of which are helpful. I hope you can spot
which are the ones worth doing. Software will only do what you tell it to do - you need to help it do
the right thing.

Correlation between predictors

While you can and have already cross-tabulated age and gender, albeit using very broad age bands
for speed and simplicity, it’s worth thinking whether something similar would be useful for other pairs
of variables. Why might it be useful?

To answer that, think about what might happen if we try to put two highly correlated variables into
the regression model. If they are highly correlated, then they are essentially providing the same
information, even if they don’t mean the same thing.

An example could be systolic BP (the upper number when giving a BP reading) and diastolic BP
(the lower number). If someone has a higher than average systolic pressure then they’ll often have a
higher than average diastolic too, etc. Both of them change in us as we age but also during the day
but they remain correlated, e.g. r=0.74 in a study of 24-hour ambulatory monitoring by Gavish et al (J
Hypertens 2008).

Likewise, we might expect cholesterol and HDL to go together in some way. Assessing correlation
was covered in detail in “Linear regression for public health”. To calculate the Pearson correlation
coefficient between two continuous, (roughly) normally distributed variables in R, we can type:

cor.test(x = chol, y = hdl, method = "pearson")
Pearson's product-moment correlation
data: chol and hdl

t = 3.7983, df = 400, p-value = 0.0001683

alternative hypothesis: true correlation is not equal to 0

95 percent confidence interval:

0.09042379 0.27929136

sample estimates:

cor

0.1865809

This code excludes patients with missing data and tells us that cholesterol and HDL are indeed
correlated (p=0.00017) but only weakly (r=0.19 to two decimal places). You can happily try both of
those in the model. For the two blood pressure values, however, the Pearson correlation coefficient
is 0.60, which is a bit too high for comfort. You’d probably be best to try only one of the two at a time
rather than trying to include both.
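If you want to check that for yourself, here is a quick sketch (it assumes you define systolic and diastolic from the bp.1s and bp.1d columns, as we do later in this module):

# define the two blood pressure readings

systolic <- g[,"bp.1s"]
diastolic <- g[,"bp.1d"]

# Pearson correlation between systolic and diastolic BP

cor.test(x = systolic, y = diastolic, method = "pearson")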

All of this leads nicely on to the core part of this course: running multiple logistic regression
models.

Feedback: Multiple Regression Model

To model the effects of age, gender and BMI


To run a model with age, gender and BMI in which age and BMI are assumed to have a roughly
linear relation with the log odds of having diabetes, type this:

m <- glm(dm ~ age + gender + bmi, family=binomial (link=logit))


summary(m)
This will give the following output:

Call:
glm(formula = dm ~ age + gender + bmi, family = binomial(link = logit))

Deviance Residuals:
Min 1Q Median 3Q Max
-1.6843 -0.5763 -0.3885 -0.2575 2.6991

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.892669   1.027388  -6.709 1.96e-11 ***
age          0.055454   0.009884   5.611 2.02e-08 ***
gendermale   0.244852   0.322817   0.758  0.44816
bmi          0.073879   0.023310   3.169  0.00153 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 326.02 on 383 degrees of freedom


Residual deviance: 281.62 on 380 degrees of freedom
(19 observations deleted due to missingness)
AIC: 289.62

Number of Fisher Scoring iterations: 5


First look at how many patients were excluded due to missing values. This is noted near the bottom of
the output: nineteen observations were deleted due to missingness.

Normally, the next thing you’d look at is the model fit information, but I’ll cover that later as we need
more than just the default output to assess that properly. At first glance, the fit is OK. There’s no
point in interpreting the coefficients unless the model fit is reasonable.

Next look at the coefficients and their standard errors, to see if there’s been some problem
getting the algorithm to work or whether you've got things like categories with very few patients in
them. This is an issue I'll return to in the course on survival analysis. The standard errors are all
pretty small, so that’s fine.
Now to the interesting part. The log odds ratio for age is 0.055 and its p value is tiny, so you can
conclude that age is significantly – and positively – associated with the risk of getting diabetes. If you
exponentiate its coefficient we get 1.057, or 1.06 to two decimal places. This means that a one-year
increase in age is associated with six percent higher odds of being diagnosed with diabetes. You ran
a simple logistic regression model earlier and got a log odds ratio of 0.052, so pretty similar to this
one. The important thing about your new estimate for the effect of age is that this one is adjusted for
the effects of gender and BMI. This means you don’t have to worry about whether the apparent
effect of age in your earlier simple model is in fact due to gender or indeed to BMI. However, it might
still be due at least in part to things that you haven’t yet put into the model.
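If you'd rather let R do the exponentiating, a quick way with the model object m fitted above is:

# odds ratio for age (the coefficient names match the variable names in the formula)

exp(coef(m)["age"])

# or all the odds ratios at once

exp(m$coefficients)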

You’ll remember if you took the previous two courses in this series that you also need to recognise
that there’s some uncertainty about this estimate of 6% higher odds because it’s based on a
sample of patients. If you got hold of data from another sample of patients, you might not get 6%
again. To express this uncertainty, you need to calculate the 95% confidence interval using the
standard error. The easiest way to do this in R is with the “confint” command, which gives this
output:

> exp(confint(m))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.0001226733 0.006986056
age 1.0373532463 1.078493131
gendermale 0.6762611041 2.409679250
bmi 1.0287121260 1.127738696
There are a lot of unnecessary decimal places, but it does the job. For age, the odds ratio is 1.06
with 95% CI 1.04 to 1.08, all to two decimal places as is usual reporting practice. If you like
technical details, the confidence limits 1.04 and 1.08 for age (and those for the other variables and
the intercept) are what’s called profile-likelihood limits. These are thought to be superior with small
sample sizes, but the difference between them and alternatives such as Wald limits that R provides
using the “confint.default” command is generally modest. You can try that out here for yourself.
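For example, to compare the two types of limits for yourself:

# profile-likelihood limits (as above)

exp(confint(m))

# Wald limits, for comparison

exp(confint.default(m))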

The next variable is gender. You actually ran a model with just gender in it earlier, just as you did for
age, and that time it wasn’t statistically significant. Here too the p value is large at 0.448, well
above the standard 0.05 threshold. Its 95% CI is wide, from 0.68 to 2.41, so this data set doesn’t
tell you a whole lot about the relation between gender and diabetes risk. All you can say is that
there's no strong evidence for a gender difference in risk.

Lastly, there’s BMI. The odds ratio for a unit increase in BMI is exp(0.073879) = 1.08, with 95%
CI 1.03 to 1.13, p=0.00153 (or 0.002 to three decimal places, as is usual reporting practice). That’s
a pretty low p value, so you can conclude that people with higher BMIs are more at risk of diabetes.
That’s entirely what we'd expect given that 90% of diabetes cases are type 2, which is associated
with lifestyle, which affects one’s BMI.
Now - it's time for you to run a similar, but new model and take an assessment based on this.

Feedback on the Assessment

Output from the model with age, cholesterol and insurance

This is the code I used to run the model and get the 95% CIs:

insurance <- as.factor(g[,"insurance"])


m <- glm(dm ~ age + chol + insurance, family=binomial (link=logit))
summary(m)
exp(m$coefficients)
exp(confint(m))

> m <- glm(dm ~ age + chol + insurance, family=binomial (link=logit))


> summary(m)

Call:
glm(formula = dm ~ age + chol + insurance, family = binomial(link = logit
))

Deviance Residuals:
Min 1Q Median 3Q Max
-1.5714 -0.5945 -0.3992 -0.2619 2.4399

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.794252 0.874555 -6.625 3.46e-11 ***
age 0.049753 0.009770 5.092 3.54e-07 ***
chol 0.008402 0.003153 2.665 0.0077 **
insurance1 -0.271955 0.359445 -0.757 0.4493
insurance2 -0.589803 0.377434 -1.563 0.1181
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 334.54 on 388 degrees of freedom


Residual deviance: 289.28 on 384 degrees of freedom
(14 observations deleted due to missingness)
AIC: 299.28
Number of Fisher Scoring iterations: 5

> exp(m$coefficients) # exponentiate the coefs


(Intercept) age chol insurance1 insurance2
0.003045009 1.051011544 1.008437550 0.761888740 0.554436387
> exp(confint(m))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.0005079001 0.0159153
age 1.0315787639 1.0720130
chol 1.0022390155 1.0148096
insurance1 0.3749368651 1.5449046
insurance2 0.2611364934 1.1558103
So the odds ratio for:

 age is 1.05
 cholesterol is 1.01
 insurance 1 (government) is 0.76
 insurance 2 (private) is 0.55

The four p values you should have obtained are (each to three d.p.):

 age is 0.001 (in fact it’s even lower than this for age, so it’s enough to write “p<0.001”)
 cholesterol is 0.008
 insurance 1 is 0.449
 insurance 2 is 0.118

Age and cholesterol are statistically significant predictors. While patients with either
government or private insurance had lower odds than those with no insurance, the differences were
not statistically significant.

Model Fit in Logistic Regression

Model fit in logistic regression

If a regression model – or any other kind of model not seen on the catwalk – is to be useful, it needs
to fit the data well. What does that mean?
There are essentially two very different ways of approaching this question: predictive power
and goodness of fit. Ideally, you want your model to do well on both. Some of what I'll cover is also
relevant to other types of regression, and the course on linear regression in this series introduced
the important concept of the residual.

With the first, the aim is to get a statistic that measures how well you can predict the dependent
variable (the outcome, so getting diagnosed with diabetes in our case) based on the independent
variables (the predictors, such as age and BMI). This will tell you about the “predictive power” or
“explanatory power” of the model. Generally they range between 0 (the model explains none of the
data and the variables don’t predict the outcome at all) and 1 (the variables predict the outcome
completely). These include measures such as the R-square and the area under the ROC curve,
which we’ll look at shortly.

The other approach to evaluating model fit is to compute what’s called a goodness-of-fit
statistic. These kinds of measures include the deviance and the popular Hosmer-Lemeshow statistic.
There are formal tests for these that yield a p value, so if you’re happy to use the usual cut-off of
p=0.05, you can use them to decide whether your model fits the data acceptably. Again, more on
those shortly. Goodness of fit tells you nothing about predictive power – and vice versa. You can get
good prediction with poor fit or a model with good fit but poor prediction.

Let’s look at some of these in brief, beginning with what they are and then how to calculate them in
R.

R-squared measures

The previous course on linear regression models covered how to test how well your linear
regression model fits the data. The most common way to do so is with the R-squared value, which
measures the proportion of the variance in the outcome variable (Y) that can be explained by your
predictor variables (X1, X2… etc). An R-squared value close to 1 indicates strong predictive power,
while one close to 0 indicates poor predictive power. As you now know, logistic regression is used
when the outcome variable is binary. We can't do correlation tests if your Y can only take 2 values –
so what can we do?

It turns out that there are many ways to approximate an R-squared for logistic regression. One of the
best ways is with the McFadden (pseudo) R-squared. This measure depends on the “likelihood” of
your model, which is a cryptic way of describing how compatible your model parameters are with the
observed data. You don’t need to know the details of the calculation of the likelihood or indeed of
McFadden’s R-squared measure. The McFadden R-squared can be interpreted in a similar way to
the “classic” R-squared from a linear regression: high values are best. In practice, though, R-
squared values – whether the McFadden version or any other – tend to be pretty low and certainly
lower than people who are used to linear models expect. This does not mean the model is bad – it's
more a reflection of the limitation of the R-squared measure than of the model.
Discrimination: c statistic or area under the ROC curve

In prognostic modelling – estimating the risk of an outcome (e.g. disease) based on a person’s
characteristics (e.g. age, gender, etc) – we want to be able to assess a model’s discrimination.
Discrimination is a “measure of how well the model can separate those who do and do not have the
[outcome] of interest” [Nancy Cook in Circulation 2007]. In our case, we’re interested in
distinguishing between people with and without a disease (diabetes). When looking at a sample of
patients, a model with good discrimination will declare those with a disease to have had a higher risk
than all of those without the disease. Therefore, in the modelling world, discrimination is a good
thing.

You can see why you would want to test this for your logistic models. In the example you’ve done,
you've built a model to test the potential relationship between several traits (age, gender, cholesterol
level, etc) with a disease outcome: diabetes. In your sample of patients, would your model
predict a higher risk score for those who we have observed to have diabetes than those who
don’t?

One of the most popular ways to do this is called the “area under the receiver operating
characteristic (ROC) curve”, or “c-statistic” for short. The ROC is a plot of sensitivity (probability of a
positive result among the cases) against 1 - specificity (probability of a negative result among the
non-cases). A “case” here is someone with the disease or outcome of interest. This can be reworded
as the “true positive rate” vs. the “false positive rate”. These terms are covered in more detail in the
epidemiology course as part of the online MPH. This is what the plot looks like for 3 models on
dummy data:
The area under a curve, which is calculated by a technique called integration, is the c-
statistic. A c-statistic of 0.5 indicates that the model is only as good at predicting the outcome as
random chance (i.e. no discrimination). A curve at or close to the black line (y=x) in the diagram
would be an example of this. As the curve pulls away and above from the black line, the area under
it increases, so therefore the discrimination increases. In the diagram, the model represented by the
red ROC curve has the best discrimination. A c-statistic of 1 would be perfect, but of course this
never happens in real life and in fact, as Cook’s article shows, the theoretical maximum for a given
model is often lower than this. A c-statistic below 0.5 would predict the outcome worse than random
chance, which would mean a very poor model indeed.

With large datasets, these plots will be smooth, but for smaller ones they are often somewhat jagged
like this:
The c-statistic appears a lot in the machine learning literature too when the aim is prediction, as it is
here.
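If you want to see how such a curve is built, here is a minimal sketch in base R. It assumes a fitted logistic model called m (such as the age, cholesterol and insurance model from earlier) is in your workspace; later in this reading we'll use a package to get the c-statistic directly.

# predicted probabilities and observed outcomes from the fitted model

pred <- fitted(m)   # predicted probability of diabetes for each patient
obs <- m$y          # observed outcome: 1 = diabetes, 0 = no diabetes

# sensitivity and 1 - specificity at a grid of probability cutoffs

cutoffs <- seq(0, 1, by = 0.01)
sens <- sapply(cutoffs, function(thr) mean(pred[obs == 1] >= thr)) # true positive rate
spec <- sapply(cutoffs, function(thr) mean(pred[obs == 0] < thr))  # true negative rate

# draw the ROC curve, with the y = x "no discrimination" line for reference

plot(1 - spec, sens, type = "l",
     xlab = "1 - specificity (false positive rate)",
     ylab = "Sensitivity (true positive rate)")
abline(0, 1)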

Deviance

This word has certain connotations in non-statistical spheres, but in regression it concerns how well
– or rather how badly – the model fits the data. It’s a measure of the “goodness of fit”. In a linear
model, where the outcome can take on any value, the predicted value can match the actual outcome
exactly or differ from it by a measurable amount. This leads straightforwardly to the concept of
deviance – a measure of how the prediction differs from the observed outcome. In logistic
regression, however, the observed outcome can only take on two values, zero and one, whereas the
predicted value, a log odds, can take on any value and can be mapped to a probability, which can
take on any value between zero and one. Therefore, we can’t just take the deviance measure
from the linear regression case. Some adjustment is necessary.
One very common approach can be taken when the data can be aggregated or grouped into unique
“profiles”: groups of cases that have exactly the same values on the predictors, e.g. patients with the
same age, gender and insurance. After fitting the model, we can get an observed number of events
and an expected number of events for each profile. The two well-known statistics for comparing
the observed number with the expected number are the deviance and Pearson’s chi-square.
Both produce statistics that can be compared against tables of a chi-squared distribution in order to
see how unusual the value of the statistic is for that model, which yields a p value. High p values
(e.g. above the usual threshold of 0.05) mean that the model’s deviance statistic is nothing unusual,
which is a good thing as it means that the model fits the data well. The deviance compares our
model with a few variables in against one that fits the data perfectly – the “saturated model” – to see
whether we’re missing anything important, such as interactions between variables or non-linearities.

This profile approach is fine when we have only categorical predictors. It will likely be fine with age
(in years) if there are a number of patients with each different value for age. If the data are spread so
that there’s only one case per profile, then the deviance and the Pearson’s chi-square statistic won’t
fit the chi-squared distribution very well at all, so the test breaks down. What can we do? This leads
us to the next measure: the Hosmer-Lemeshow statistic, proposed in 1980 and still very much in use
today.

Calibration: Hosmer-Lemeshow statistic and test

Here, patients are grouped together according to their predicted values from the model, which are
ordered from lowest to highest and then separated into typically ten groups of roughly equal size.
For each group, we calculate the observed number of events – here that’s the number of patients
with diabetes – and the expected number of events, which is just the sum of the predicted
probabilities for all the patients in the group. Pearson’s chi-square test is then applied to compare
observed counts with expected counts. A large p value (e.g. above the conventional 0.05)
indicates that the model’s predicted values are a good match for the real (observed) values,
i.e. the model is a good fit.

The authors’ own work has revealed some limitations with the test. With small data sets, the
test has limited ability (limited power) to detect important differences between the observed and
expected counts i.e. to detect poor fit that’s poor enough to worry about. At the other end, with large
data sets, you can get a low p value when the difference between the observed and expected isn’t
that important.

It has some other issues too. One is that although the standard number of groups to use is ten, the p
value can be altered just by choosing fewer or more than ten groups, and there’s no good way of
deciding on the number. Another is that some people, including me, have found that adding
interaction terms between variables, even non-significant ones, can alter the statistic.

Despite all these problems, the test is much used and reported in literature. These days, I only use
the plot of the observed against the expected rather than the p value. Some other tests are reviewed
at https://support.sas.com/resources/papers/proceedings14/1485-2014.pdf though the article uses
SAS as the exemplar software package.

How to get these statistics in R

The deviance is given by default – see the next item on the course – but the R-squared, c statistic
and Hosmer-Lemeshow statistics and test have to be requested in R.

McFadden’s r-squared:

Here’s the formula in case you wanted it:

# design your logistic regression


full_model <- glm(dm ~ age + chol + insurance, family=binomial (link=logit))

# check your model


summary(full_model)
##
## Call:
## glm(formula = dm ~ age + chol + insurance, family = binomial(link = logit))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5714 -0.5945 -0.3992 -0.2619 2.4399
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.794252 0.874555 -6.625 3.46e-11 ***
## age 0.049753 0.009770 5.092 3.54e-07 ***
## chol 0.008402 0.003153 2.665 0.0077 **
## insurance1 -0.271955 0.359445 -0.757 0.4493
## insurance2 -0.589803 0.377434 -1.563 0.1181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 334.54 on 388 degrees of freedom
## Residual deviance: 289.28 on 384 degrees of freedom
## (14 observations deleted due to missingness)
## AIC: 299.28
##
## Number of Fisher Scoring iterations: 5
# run a null model
null_model <- glm(dm ~ 1, family=binomial (link=logit))

# check
summary(null_model)
##
## Call:
## glm(formula = dm ~ 1, family = binomial(link = logit))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.578 -0.578 -0.578 -0.578 1.935
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.7047 0.1403 -12.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 334.87 on 389 degrees of freedom
## Residual deviance: 334.87 on 389 degrees of freedom
## (13 observations deleted due to missingness)
## AIC: 336.87
##
## Number of Fisher Scoring iterations: 3
# calculate McFadden's R-square
R2 <- 1-logLik(full_model)/logLik(null_model)

# print it
R2
## 'log Lik.' 0.1361385 (df=5)

This R-squared of about 14% is typical of logistic regression models and is actually not too bad (but
not great).

c-statistic:

The easiest way to generate the c-statistic in R is to download the package “DescTools” and use the
function Cstat().
# install a package
install.packages("DescTools")
## Installing package into ''
## (as 'lib' is unspecified)
## package 'DescTools' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
##
# load package
require(DescTools)
## Loading required package: DescTools
## Warning: package 'DescTools' was built under R version 3.5.1
# design your logistic regression
full_model <- glm(dm ~ age + chol + insurance, family=binomial (link=logit))

# check your model


summary(full_model)
##
## Call:
## glm(formula = dm ~ age + chol + insurance, family = binomial(link = logit))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5714 -0.5945 -0.3992 -0.2619 2.4399
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.794252 0.874555 -6.625 3.46e-11 ***
## age 0.049753 0.009770 5.092 3.54e-07 ***
## chol 0.008402 0.003153 2.665 0.0077 **
## insurance1 -0.271955 0.359445 -0.757 0.4493
## insurance2 -0.589803 0.377434 -1.563 0.1181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 334.54 on 388 degrees of freedom
## Residual deviance: 289.28 on 384 degrees of freedom
## (14 observations deleted due to missingness)
## AIC: 299.28
##
## Number of Fisher Scoring iterations: 5
# generate the c-statistic
Cstat(full_model)
## [1] 0.764387

Hosmer-Lemeshow statistic and test:

# H-L test

# install package "ResourceSelection"


install.packages("ResourceSelection")
## Installing package into ''
## (as 'lib' is unspecified)
## package 'ResourceSelection' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
##
# load package
require(ResourceSelection)
## Loading required package: ResourceSelection
## Warning: package 'ResourceSelection' was built under R version 3.5.1
## ResourceSelection 0.3-2 2017-02-28
# design your logistic regression
full_model <- glm(dm ~ age + chol + insurance, family = binomial(link = logit))

full_model$y

full_model$y is the outcome variable we specified (dm); fitted(full_model) generates fitted values
from the model.

# run Hosmer-Lemeshow test


HL <- hoslem.test(x = full_model$y, y = fitted(full_model), g = 10)
HL
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: full_model$y, fitted(full_model)
## X-squared = 11.25, df = 8, p-value = 0.1879
# plot the observed vs expected number of cases for each of the 10 groups
plot(HL$observed[,"y1"], HL$expected[,"yhat1"])
# plot the observed vs expected number of noncases for each of the 10 groups
plot(HL$observed[,"y0"], HL$expected[,"yhat0"])
# plot observed vs. expected prevalence for each of the 10 groups
plot(x = HL$observed[,"y1"]/(HL$observed[,"y1"]+HL$observed[,"y0"]),
y = HL$expected[,"yhat1"]/(HL$expected[,"yhat1"]
+HL$expected[,"yhat0"]))
As you can see, there are different ways of plotting the information from a Hosmer-Lemeshow test.
Another way is to plot the ten ratios of observed:predicted cases, where a well-calibrated model
would show ten points very near 1.
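For instance, a minimal sketch of that ratio plot, using the HL object created above:

# plot the ratio of observed to expected cases in each of the 10 groups;
# a well-calibrated model gives points close to the horizontal line at 1

plot(HL$observed[,"y1"] / HL$expected[,"yhat1"],
     xlab = "Group", ylab = "Observed/expected cases")
abline(h = 1)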

# verify result with another package?

# install package("generalhoslem")
install.packages("generalhoslem")
## Installing package into ''
## (as 'lib' is unspecified)
## package 'generalhoslem' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
##
# load package
require(generalhoslem)
## Loading required package: generalhoslem
## Warning: package 'generalhoslem' was built under R version 3.5.1
## Loading required package: reshape
## Warning: package 'reshape' was built under R version 3.5.1
## Loading required package: MASS
# run Hosmer-Lemeshow test
logitgof(obs = full_model$y, exp = fitted(full_model), g = 10)
##
## Hosmer and Lemeshow test (binary model)
##
## data: full_model$y, fitted(full_model)
## X-squared = 11.25, df = 8, p-value = 0.1879

How to Interpret Model Fit and Performance Information in R

Let’s consider the last model we ran, which had age, cholesterol and insurance. How well does this
model fit the data? We’ll start with what R gives us by default and decide what’s useful and what’s
missing. R’s default output for the “glm” command includes the following:

 The call to the algorithm itself, i.e. what the model is


 Deviance residuals
 Coefficients (we’ve covered these already)
 Dispersion parameter
 Null deviance
 Residual deviance
 AIC
 Number of Fisher scoring iterations

The last of these isn’t terribly interesting except to note that it means that the model has converged,
i.e. the algorithm has worked and found the best solution. In the others, there’s quite a bit of mention
here about “deviance”, which was described in the earlier reading. The null deviance tells us about
model fit with just the intercept term in. What’s more important is the deviance when we’ve added
our predictors and how much the deviance falls when we do so compared with the null value. For
our model:
 The null model had a deviance of 334.54 on 388 degrees of freedom
 Our model had a residual deviance of 289.28 on 384 degrees of freedom

That’s a difference of 334.54-289.28 = 45.26 at a “cost” of 388-384 = 4 degrees of freedom. If you


recall the concept of degrees of freedom from the previous course, then you’ll see that the four d.f.
represent four added parameters, which came from one for age, one for cholesterol and two for
insurance. Our model has improved by 45.26 for our “investment” of 4 d.f., but is that a good return
on our investment?

To answer this, we need to define “deviance” in this context. This can get very technical, as there
are different ways to compute this. First, the bigger the deviance, the worse the model fits the data,
so you want to be able to test this. Second, we want our model to be an improvement on the null
model – if you have at least one variable with a low p value, then you’ll have an improvement.

Null Deviance and Residual Deviance

To understand residual deviance, we must first think about 3 models: the null model, the
proposed model and the saturated model. The null model, as discussed earlier, is one where we
only include the intercept: this model therefore only has one parameter. The proposed model is the
model with the variables we included in our logistic regression. The number of parameters in the
proposed model is the number of variables plus one (the intercept): this is because each of the
variables we have included only needs one parameter, but remember that a categorical variable with
three categories, for example, will need two parameters. The saturated model is a model which fits
the data perfectly, because it has as many parameters as there are data points.

The null deviance is a measure of how well the null model explains the data compared with
the saturated model. Having just one parameter, the null model does not usually explain the data
very well, and this is indicated by a large null deviance. The point of doing a regression model is that
we reckon we can do better at explaining the data with a few variables (age, sex, etc). This brings us
on to the residual deviance: how well the proposed model explains the data compared with the
saturated model.

The difference between the null deviance and the residual deviance gives us an idea of how
well our model has performed (at the cost of degrees of freedom). It’s like a return on
investment: how much benefit (explanation of the variation in the outcome) do we get for our
investment (the variables we’ve added or, more accurately, the degrees of freedom taken up by
those variables). If the model is “good”, then the difference between the null deviance and the
residual deviance will be large. There are formal ways to see whether the difference is large enough.

Let’s demonstrate this in R:

# design your logistic regression


full_model <- glm(dm ~ age + chol + insurance, family = binomial(link = logit))

# analyse table of deviance


anova(full_model, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: dm
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 388 334.54
## age 1 35.413 387 299.12 2.667e-09 ***
## chol 1 7.363 386 291.76 0.006658 **
## insurance 2 2.478 384 289.28 0.289613
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The fourth column shows the deviances of the models compared with the saturated model. The first
(334.54) is the null deviance and each subsequent number is the deviance of the model with each
new variable. The final value (289.28) is the deviance of the proposed model (with all three of our
variables). This is the residual deviance. As expected, adding each new variable to our model
explains the data better, thus reducing the deviance.

To test whether each added parameter reduces the deviance by a significant amount, we
asked R to compare the change with a chi-square value for the number of degrees of freedom lost. If
the p-value is low, it indicates that the corresponding added variable causes a significant change in
deviance, and thus gives a better-fitting model. It’s not at all essential that you understand why we use
the chi-square distribution for this comparison – just that you know how to interpret the resulting p-
value.

In our case, adding the variables age and cholesterol significantly reduces the deviance and improves
the model fit, as indicated by their low p-values, but including the insurance variable does not
improve the model fit enough to justify the loss in degrees of freedom, as indicated by its high p-
value of 0.2896.
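The same chi-squared logic can be applied to the model as a whole, comparing the total drop in deviance (45.26) with the four degrees of freedom it cost. Here is a minimal sketch using the components stored in the fitted model object:

# overall test of the model against the null model using the drop in deviance

dev_drop <- full_model$null.deviance - full_model$deviance    # 334.54 - 289.28 = 45.26
df_drop <- full_model$df.null - full_model$df.residual        # 388 - 384 = 4
pchisq(dev_drop, df = df_drop, lower.tail = FALSE)            # a tiny p value: a big improvement on the null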

AIC

Lastly, I’ll mention the AIC. This is short for Akaike Information Criterion and measures the
quality of a model in terms of the amount of information lost by that model. It therefore
recognises that all models lose information compared with “reality” but some models lose less than
others. It’s of no use by itself but is used for comparing two or more models. Small AIC values are
best.
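For example, to compare the AICs of the full and null models fitted earlier (a quick sketch; note that, strictly speaking, models being compared should be fitted to the same observations):

# lower AIC indicates the preferred model

AIC(full_model, null_model)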

Further Reading on Model Fit

Further reading on model fit measures

As noted above, the Hosmer-Lemeshow test has some limitations with both small and large sample
sizes, which the authors themselves acknowledge. With a small sample, the test lacks power to spot
important differences between the observed and expected counts. With a large sample, it can spot
such differences but it also picks up small and unimportant ones. I often analyse national data with
hundreds of thousands of patients in a model and therefore regularly get low p values from this test.
Because of these issues, I’ve even seen someone on a forum call the test “obsolete”, but you can
still see it everywhere in the medical literature.

For those interested and happy to go through a bit of algebra (but not too much), there are some
useful discussions on the many different R-squared measures in existence at the following link:

https://www.researchgate.net/publication/228507556_One_more_time_about_R2_measures_of_fit_in_logistic_regression Although the article
covers a different software package, SAS, it’s still a clear explanation of the subject.

This review article by Mittlbock and Schemper (Mittlböck M, Schemper M. Explained variation for
logistic regression. Statistics in Medicine 1996; 15(19): 1987-97)

https://onlinelibrary.wiley.com/doi/abs/10.1002/%28SICI%291097-0258%2819961015%2915%3A19%3C1987%3A%3AAID-SIM318%3E3.0.CO%3B2-9 is over 20
years old but even then it covered a dozen R-squared measures for logistic regression. Since then
there have been further developments in this area, for instance:

Tjur T. Coefficients of determination in logistic regression models—A new proposal: The coefficient
of discrimination. The American Statistician 2009; 63: 366-372.

The definition of this one is very simple, so it’s easy to calculate in R. For each of the two categories
of the dependent variable, so having diabetes yes / no, calculate the mean of the predicted
probabilities of having diabetes. Then take the difference between those two means. Tjur called this
the coefficient of discrimination, and you can interpret it like any other R-squared as the proportion of
variation in the outcome that’s explained by the model. High values are best.
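Here is a minimal sketch of Tjur's coefficient, assuming the fitted full_model (age, cholesterol and insurance) from earlier is in your workspace:

# Tjur's coefficient of discrimination

pred <- fitted(full_model)                  # predicted probabilities of diabetes
obs <- full_model$y                         # observed outcome: 1 = diabetes, 0 = not
mean(pred[obs == 1]) - mean(pred[obs == 0]) # difference in mean predicted probability
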
Too much choice can
be a bad thing. You have a dataset, you need to fit a regression
model to predict something, but you have possible
predictor variables coming out of your ears. How are you going to
decide which predictors to leave in and which
to leave out of it? That's a really
important question. Let's look at
some guiding principles to steer your course. So a good start is to read
existing relevant literature. Studies in high profile
peer review journals are more likely to have
been done well than those in little-known journals at unscrupulous
publishing houses that will accept anything
if you pay them. You can also ask experts
if you know any. These sources will give you
a few suggestions of what to include but they probably won't
do the whole job for you. So before going any further, let's be clear about what
your model is trying to do. You want to predict
a patient outcome with enough accuracy to be
useful and realistic, but you don't want
a model that's so complicated that you can't
interpret its coefficients, which for logistic
regression are odds ratios, once you've done the explanation. You also need your
model
to be robust, that means that it should
also work well when you apply it to another dataset
with different patients. So let's first
consider the pros and the cons of the model
with only two predictors, and then for one with 100. I always like to exaggerate
to illustrate a point, it's fun. First, let's say your model
has only one predictor which you've selected based on your reading of
the relevant literature, to make it really easy, let's say it's gender defined as either
male or female. This model will have
a grand total of two parameters. One for one gender say female, and one for the
intercept
which will include the odds for the other
gender, so males. This model has
some obvious advantages, it's quick to run even
on a slow computer, and simple to interpret and
explain to other people. The parameters of the model, that's the intercept
and the odds ratio for the effects of being female
compared with being male, will have nice narrow
confidence intervals because they're each based
on a lot of patients. If your dataset contains 1000 patients and
the gender split is 50-50, you've 500 patients to estimate the odds for each
gender,
that's a lot. This model will be robust but the outcome is hardly likely to be only
due to
the patient's gender, so the model's predictive
power would be poor. To get better prediction, you'll need to use
more predictors, so let's consider a model
with 100 of them. Say you've just thrown
them all in together - the discrimination of the model, as measured by the
C-statistic, may well be high. Let's say it's now 0.85, whereas with just gender in it, it
was just 0.53, so a huge improvement. But this model would have
taken much longer to run, and you've got a lot of interpreting and
explaining to do. Some predictors will have
low p-values, but many won't. Some predictors will have large standard errors
and
wide confidence intervals, meaning that the
estimated odds ratios for these predictors have a lot of uncertainty about
their real values, that is they are unstable. This model is not robust, its output
probably
can't be trusted. If you fitted the same model to a different set of patients, you'd
probably get
some very different odds ratios, this is called
overfitting which I'll explain in more detail
separately. So what should you do? You need to prune the model
and clear out the junk. To do this, there are some exotically named
technical tricks that can be used in regression, but are also considered
machine learning methods - these are beyond
our scope in this course. If prior knowledge
isn't enough to help, there are some other
commonly used approaches, commonly used but really smelly, so smelly that I
can barely bring myself
to describe them, but I must because they
are so widespread. The first is forward selection. Here, you start off
with no predictors in the model and then you
try them one at a time, starting with one with
the lowest p-value, then you keep adding variables until none of the remaining
ones
are significant. Often defined by p less than 0.1, so you don't miss important
ones, this is horrible. You might think you're keeping only those where p
is less than 0.1, but actually all this testing of this and testing of that, means
you've no idea what
the real p-values are, and the confidence intervals
don't make sense in the situation either,
it's not robust. A variant on this is
step-wise selection, which allows you to
drop variables if they become nonsignificant
when you add a new one, this is also horrible
for the same reasons. Thirdly, there's
backwards elimination. Here, you put all the
possible predictors in at once and you drop
the nonsignificant ones, beginning with
the least significant one, that with the highest p-value. This is the least
bad of the three, I use it sometimes,
though with caution. Choosing which predictors
to keep in your model is a vital task and an art, but can be fraught with danger.
Using prior knowledge is good, and backwards elimination is
useful when used carefully, but forward selection and step-wise selection
are too smelly, even to be considered.
Overfitting. Nothing to do with clothes. This is a major hazard of model building
that can affect all types of regression and particularly machine learning methods.
It happens when you try to squeeze so many variables, actually so many
parameters, which I'll explain in a minute, into your model that it can't cope and it
explodes. Let's use an analogy. You have an empty train. Many people are waiting
on the platform to get on to take them to the airport. If a normal size adult or
child with a small bag gets on first, there'll still be loads of room on the train. But if
a Sumo wrestler gets on, there'll still be lots of room, though less than with the
child. As more people board the train, the amount of free space falls until
people really have to push their way on until no one has enough room to move
and breathe, and that's my daily commute in London. It's also over-fitting. It's the
same with modelling. In the analogy, the train is your model and the people are
your predictors. Too many predictors in the model and you're in trouble. If you
just picked one predictor, then you'll have no problem if your predictor is the
equivalent of an average person or child with a small bag. There can be a problem,
however, if that predictor is the equivalent of a sumo wrestler with enough luggage
for a year's stay at the North Pole. So, which types of variable are like children and
which are like sumo wrestlers with luggage? Continuous variables are like
children. Why? Because they only need one parameter in the model to describe
their relation with the outcome. Remember, that this assumes that this relation is
linear. To get a curved relation, you need more parameters. Also like children are
categorical variables with only two categories for instance, male or female. They
also need only one parameter to describe their relation which would be the odds
for females relative to the odds for males, for instance. Remember that the
reference category, which is males in this example, forms parts of the model
intercept. Variables that take up the most room in a model, and are like a Sumo
wrestler, are categorical variables with lots of categories. For instance, suppose
you've got age with 20 categories, each of which is a five-year age band. This
would need 19 parameters plus the intercept. The patients in your dataset have
to be spread amongst 20 categories. So, you might not get very many patients in
each one. That can cause the software algorithm problems. You may need to
combine some of the categories that's the equivalent of stuffing a bag inside a
hard suitcase. So, how can you spot overfitting? Well, in the most extreme case,
the software would just give up and tell you that the model has not converged.
This means that the underlying algorithm that's trying to estimate all your odds
ratios is unable to find the best solution. R will give warning messages and tell you
that the algorithm did not converge. But if it did converge, there might still be
problems. So, to check, I always inspect the standard errors for the odds ratios
and the size of the odds ratios themselves. Large values of either make me
uneasy, but especially for the standard errors, as they are also an indication of
how many patients and outcomes we use to estimate the associated odds ratio.
This is more often a problem for logistic regression than for linear regression, but
it can happen in any type of regression. It can also happen when two or more of
your predictors are highly correlated with each other. But how large is large?
Well, it's no agreed cut off for standard errors, but anything over 10, I'd say is
definitely too big, and I personally rarely accept standard errors over one. So,
what can you do about overfitting? Happily, there are some simple remedies that
often work. So, if one or more of your categories in the categorical variable has
large standard errors, then try combining them with another category that makes
sense. Also, check that your reference category isn't tiny. For instance, if you
have age in 20 categories and only four people are in the under fives, then don't
have the under fives as the reference. If those things don't work, then you'll need
to drop the whole variable. You might need to drop several variables with big
standard errors. Overfitting is a major pitfall of predictive modelling and happens
when you try to squeeze too many predictors or too many categories into your
model. Happily, simple tricks often get around it, but it's vital to try your model
out on a separate set of patients whenever possible to check that your model is
robust.

Summary of Different Ways to Run Multiple Regression

Ways to choose your model

As I explained in the videos, the problem is that having too few predictors leads to poor prediction,
but having too many can cause overfitting, non-convergence, difficulty in interpretation and
explaining results to other people. So how do you choose the predictors?
It’s always a good idea to begin by reviewing the relevant literature and expert knowledge, but that
will only get you so far. It may tell you that you should include age, gender and perhaps a few other
things, but it’s very likely that you’ll have other variables in your data set that are worth trying. One
option is to enter and keep them all in your model, whatever the p values. This is a good idea if you
don’t have too many and/or you can use a priori knowledge for them all (but beware when this
prior knowledge comes from studies using poor methods).

A variant of this is backwards elimination, where you then drop the non-significant variables.
This works OK in some circumstances, but you need to check for correlation between variables. The
best way to do that is by inspecting the odds ratios for the predictors that you are keeping – first with
all variables in the model and second when you drop some. If the odds ratios change noticeably
when you drop some, then you’ll need to add back at least one of the dropped ones.

Forward selection involves starting with an empty model and trying variables one at a time. Stepwise
selection involves a mixture of forwards and backwards. Both of these should be avoided.
Similarly, “all-possible-regressions", which literally tests all possible subsets of the set of potential
independent variables, is to be avoided. It might sound like it, but it’s not in fact guaranteed to give
you the best model for your data.

Machine learning approaches are increasingly popular, but are complex and out of our scope.

Training and testing data sets

It’s good practice to split your data set if possible into a training and a testing data set and apply the
training-set model to the test-set data. If you get very different answers, then rethink the complexity
of training-set model and repeat. This is standard practice with machine learning because of the risk
of overfitting and "overtraining", in which the algorithm fits the data too closely so it's unable to
distinguish between signal and noise in the data set and so performs badly on a new data set.
However, it's also strongly recommended with statistical models where possible. Another related
technique is called k-fold cross-validation, which is often used when you have limited data.
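Here is a minimal sketch of the mechanics of such a split in base R (the two-thirds/one-third proportion and the predictors are purely illustrative, and see the caveat about small data sets below):

# reproducible random split of the data set g into training and testing sets

set.seed(42)
train_rows <- sample(seq_len(nrow(g)), size = round(2/3 * nrow(g)))
train <- g[train_rows, ]
test <- g[-train_rows, ]

# fit the model on the training set only

m_train <- glm(dm == "yes" ~ age + chol, data = train, family = binomial(link = logit))

# apply the training-set model to the test set to get predicted probabilities

test_pred <- predict(m_train, newdata = test, type = "response")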

With small data sets, splitting into a training and a testing set is not advisable. The data sets we are
using in these statistics for public health courses are considered small, which is why I haven’t
suggested splitting them. If you're keen to go beyond this introductory course, then I suggest
learning about cross-validation. If you're taking our Global Master's degree in Public Health, then it
will be covered in the Advanced Statistics specialisation, which will also cover things like LASSO and
elastic nets that are relevant to model selection.

Feedback: Backwards Elimination


Your task was to fit a model with those six predictors and apply backwards elimination to remove
any that are not statistically significant.

Here is the code to fit the model and the output that it gives.

##### Make the variables and run the models #####

dm <- as.factor(g[,"dm"])
insurance <- as.factor(g[,"insurance"]) # let's say 0=none, 1=gov, 2=private
fh <- as.factor(g[,"fh"]) # 1=FH, 0=no FH
smoking <- as.factor(g[,"smoking"]) # 1,2,3
chol <- g[,'chol']
hdl <- g[,'hdl']
ratio <- g[,'ratio']
location <- as.factor(g[,'location'])
age <- g[,'age']
gender <- as.factor(g[,'gender'])
frame <- as.factor(g[,'frame'])
systolic <- g[,'bp.1s']
diastolic <- g[,'bp.1d']

model <- glm(dm ~ age + bmi + chol + hdl + systolic + diastolic, family = binomial(link = logit))

summary(model)
##
## Call:
## glm(formula = dm ~ age + bmi + chol + hdl + systolic + diastolic,
## family = binomial(link = logit))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4658 -0.5453 -0.3625 -0.1989 2.8155
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.4479066 1.5287185 -4.872 1.10e-06 ***
## age 0.0505742 0.0121218 4.172 3.02e-05 ***
## bmi 0.0496011 0.0242735 2.043 0.04101 *
## chol 0.0106330 0.0035028 3.036 0.00240 **
## hdl -0.0290599 0.0103633 -2.804 0.00505 **
## systolic 0.0053365 0.0089493 0.596 0.55097
## diastolic -0.0002951 0.0158481 -0.019 0.98515
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 324.38 on 378 degrees of freedom
## Residual deviance: 263.80 on 372 degrees of freedom
## (24 observations deleted due to missingness)
## AIC: 277.8
##
## Number of Fisher Scoring iterations: 5

anova(model, test = "Chisq")

## Analysis of Deviance Table


##
## Model: binomial, link: logit
##
## Response: dm
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 378 324.38
## age 1 33.737 377 290.64 6.309e-09 ***
## bmi 1 9.295 376 281.34 0.002298 **
## chol 1 7.949 375 273.40 0.004812 **
## hdl 1 9.043 374 264.35 0.002638 **
## systolic 1 0.555 373 263.80 0.456200
## diastolic 1 0.000 372 263.80 0.985146
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It’s clear that neither of the BP variables is significantly associated with the odds of being diagnosed
with diabetes in this data set, but the other four variables are. If you drop the BP variables, you
get this:

Call:
glm(formula = dm ~ age + bmi + chol + hdl, family = binomial(link = logit))

Deviance Residuals:
Min 1Q Median 3Q Max
-1.4243 -0.5554 -0.3585 -0.1969 2.8492

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.112566 1.279586 -5.558 2.72e-08 ***
age 0.054510 0.010577 5.153 2.56e-07 ***
bmi 0.052218 0.024013 2.175 0.02966 *
chol 0.011010 0.003452 3.190 0.00142 **
hdl -0.028668 0.010310 -2.781 0.00543 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 325.70 on 382 degrees of freedom


Residual deviance: 265.11 on 378 degrees of freedom
(20 observations deleted due to missingness)
AIC: 275.11

Number of Fisher Scoring iterations: 5

Have any of the coefficients for the four remaining variables changed? Not much, which is good. But
why is blood pressure not significant here despite what the literature says? One way to find out is to
see if it correlates with other variables. Here's the code to do that and the output.

# strange that systolic and diastolic are not significant...

cor.test(systolic, hdl) # not significant

##
## Pearson's product-moment correlation
##
## data: systolic and hdl
## t = 0.39368, df = 395, p-value = 0.694
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07877132 0.11799603
## sample estimates:
## cor
## 0.01980412

cor.test(systolic, bmi) # significant

##
## Pearson's product-moment correlation
##
## data: systolic and bmi
## t = 2.2863, df = 391, p-value = 0.02277
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.01611718 0.21137656
## sample estimates:
## cor
## 0.1148561

cor.test(systolic, chol) # very significant

##
## Pearson's product-moment correlation
##
## data: systolic and chol
## t = 4.1276, df = 395, p-value = 4.474e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1070653 0.2958454
## sample estimates:
## cor
## 0.2033444
cor.test(systolic, age) # extremely significant
##
## Pearson's product-moment correlation
##
## data: systolic and age
## t = 9.8342, df = 396, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3604404 0.5187477
## sample estimates:
## cor
## 0.4430412
So systolic BP correlates weakly (but statistically significantly) with cholesterol and moderately (and
also statistically significantly) with age. Both of these results are entirely expected from what we
know about physiology.

As an exercise, you can try leaving age out of the model: is systolic BP significant now?
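A sketch of that exercise, assuming the variables from the model above are still in your workspace:

# refit without age to see whether systolic BP becomes significant

model_no_age <- glm(dm ~ bmi + chol + hdl + systolic + diastolic, family = binomial(link = logit))
summary(model_no_age)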

Further Reading on Model Selection Methods

Optional further reading on model selection methods


There’s more detail on how stepwise selection is done and why it’s a bad idea at
https://people.duke.edu/~rnau/regstep.htm You can’t trust the coefficients and you can’t trust the p
values. That’s true for all types of regression, including Cox that we’re covering in the next course in
this specialisation.

Rather than use p values for each variable to decide whether to include it, a (better) alternative is to
be guided by a quantity that I mentioned earlier called the AIC, Akaike’s Information Criterion, or a
similar one called the Bayes Information Criterion (BIC). Without going into the maths behind the
“information” (which has a technical meaning here), you just need to know that the AIC aims to
describe how well the model fits the data while penalising models with lots of coefficients and that
lower values of the AIC are desirable. It's useful for comparing models, but if you only run one model
it's of no value.

Some people use the R-squared statistic, which you read about earlier in this course, to pick the
“best” model. The R-squared estimates the predictive power of the model and tells you nothing
about how well the model fits the data (goodness of fit). It’s not ideal but it does give you useful
information. With logistic regression, the maths behind the R-squared is trickier than for linear
regression, leading to many different versions of it being proposed. I'd advise against using only the
R-squared to pick your best model.

An alternative to choosing a single model is to use model averaging. This has a different philosophy
from the model selection methods by considering that there are several “good” candidate models
and that we don’t actually have to choose between them – we can just take an average. This idea
has been developed and applied to all kinds of regression. This has been quite a large area of
methodological research for some time, and model averaging can be done in R with functions written
by users. It’s now often used to average regression coefficients across multiple models with the
ultimate goal of capturing a variable's overall effect. This use of model averaging implicitly assumes
the same parameter exists across models so that taking an average is a sensible thing to do. At first
glance, this assumption seems reasonable, but regression coefficients associated with particular
variables might not have the same interpretations across all of the models in which they appear, and
that makes interpreting the averaged value tricky. Despite the issues – after all, there are issues with
every method in existence – model averaging is widely used, but the size of the literature and the
technical details of Bayesian model averaging in particular mean that I’ll go no further. A readable
summary of the area is given by https://warwick.ac.uk/fac/sci/statistics/crism/research/17-06/17-
06w.pdf The focus is on Bayesian model averaging – in Bayesian statistics, one combines the data
with one’s prior beliefs to produce the model output, whereas in classical or “frequentist” statistics,
one is driven only by the data. The article also covers the frequentist approach. There are formulae,
which you can skip over if you like, and the author is an economist, but it’s still a very readable
account of the subject.
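
If you do want to experiment with model averaging, one user-written option is the MuMIn package (not covered further in this course). A minimal sketch, assuming MuMIn is installed and you have a fitted full model called full_model with no missing values in its data:

# a sketch only: fit every subset of the full model and average them
library(MuMIn)
options(na.action = "na.fail") # dredge() requires this setting
all_models <- dredge(full_model) # fit all subsets of the predictors
averaged <- model.avg(all_models) # average their coefficients
summary(averaged)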

R Code for the Whole Module


# import data
g <- read.csv(file = paste0(getwd(), "/diabetes.csv"), header = TRUE, sep = ",")

# define your variables


###############

chol <- g[,"chol"]


gender <- as.factor(g[,"gender"])
height <- g[,"height"]
weight <- g[,"weight"]
age <- g[,"age"]
dm <- as.factor(g[,"dm"])
insurance <- as.factor(g[,"insurance"]) # let's say 0=none, 1=gov, 2=private
fh <- as.factor(g[,"fh"]) # 1=FH, 0=no FH
smoking <- as.factor(g[,"smoking"]) # 1,2,3
hdl <- g[,"hdl"]
ratio <- g[,"ratio"]
location <- as.factor(g[,"location"])
frame <- as.factor(g[,"frame"])
systolic <- g[,"bp.1s"]
diastolic <- g[,"bp.1d"]

# calculate BMI from height and weight:


###############

# 1. convert height and weight to metric units


height.si <- height*0.0254
weight.si <- weight*0.453592

# 2. BMI = weight over height squared


bmi <- weight.si/height.si^2
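
# optional quick check of the derived BMI values
summary(bmi)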

###############

# create a table for the gender variable


table_gender <- table(gender)

# display % in each gender


round(100 * prop.table(table_gender), digits = 1)
## gender
## female male
## 58.1 41.9
# categorise BMI
bmi_categorised <- ifelse(bmi < 18.5, "underweight",
                   ifelse(bmi >= 18.5 & bmi <= 25, "normal",
                   ifelse(bmi > 25 & bmi <= 30, "overweight",
                   ifelse(bmi > 30, "obese", NA))))
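
# an equivalent sketch using cut() instead of nested ifelse calls; note
# that a BMI of exactly 18.5 falls into "underweight" here, whereas the
# ifelse version above classes it as "normal"
bmi_categorised_alt <- cut(bmi,
                           breaks = c(-Inf, 18.5, 25, 30, Inf),
                           labels = c("underweight", "normal",
                                      "overweight", "obese"))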

# cross tabulate diabetes status and BMI category


dm_by_bmi_category <- table(bmi_categorised, dm, exclude = NULL)
dm_by_bmi_category
## dm
## bmi_categorised no yes <NA>
## normal 100 9 4
## obese 118 29 5
## overweight 99 20 4
## underweight 9 0 0
## <NA> 4 2 0
# produce the table as % in each BMI category with or without diabetes
round(100 * prop.table(dm_by_bmi_category, margin = 1), digits = 1)
## dm
## bmi_categorised no yes <NA>
## normal 88.5 8.0 3.5
## obese 77.6 19.1 3.3
## overweight 80.5 16.3 3.3
## underweight 100.0 0.0 0.0
## <NA> 66.7 33.3 0.0
# categorise age by group
age_grouped <- ifelse(age < 45, "under 45",
ifelse(age >= 45 & age < 65, "45 - 64",
ifelse(age >= 65 & age < 75, "65 - 74",
ifelse(age >= 75, "75 or over", NA)))
)

# cross tabulate age by gender


age_group_by_gender <- table(age_grouped, gender, exclude = NULL)

# print % in each age group by gender


round(100 * prop.table(age_group_by_gender, margin = 2), digits = 1)
## gender
## age_grouped female male
## 45 - 64 32.1 37.9
## 65 - 74 9.0 11.8
## 75 or over 5.1 6.5
## under 45 53.8 43.8
# create a null logistic model for diabetes
m <- glm(dm ~ 1, family = binomial(link = logit))
summary(m)
##
## Call:
## glm(formula = dm ~ 1, family = binomial(link = logit))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.578 -0.578 -0.578 -0.578 1.935
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.7047 0.1403 -12.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 334.87 on 389 degrees of freedom
## Residual deviance: 334.87 on 389 degrees of freedom
## (13 observations deleted due to missingness)
## AIC: 336.87
##
## Number of Fisher Scoring iterations: 3
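# optional check (a sketch): back-transform the intercept from the null
# model to a probability; it should match the crude proportion of
# patients with diabetes among those with non-missing data (about 0.15)
exp(coef(m)["(Intercept)"]) / (1 + exp(coef(m)["(Intercept)"]))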
# perform logistic regression with gender as predictor variable
m <- glm(dm ~ gender, family = binomial(link = logit))
summary(m)
##
## Call:
## glm(formula = dm ~ gender, family = binomial(link = logit))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5915 -0.5915 -0.5683 -0.5683 1.9509
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.74150 0.18592 -9.367 <2e-16 ***
## gendermale 0.08694 0.28352 0.307 0.759
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 334.87 on 389 degrees of freedom
## Residual deviance: 334.78 on 388 degrees of freedom
## (13 observations deleted due to missingness)
## AIC: 338.78
##
## Number of Fisher Scoring iterations: 4
# perform logistic regression with gender as predictor variable, with male as reference group
# generate odds of having diabetes if female compared to male
###############

# 1. check order of the levels in the gender variable


levels(gender)
## [1] "female" "male"
# 2. make "male" the reference group
gender <- relevel(gender, ref = "male")

# 3. run logistic regression


m <- glm(dm ~ gender, family = binomial(link = logit))
summary(m)
##
## Call:
## glm(formula = dm ~ gender, family = binomial(link = logit))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5915 -0.5915 -0.5683 -0.5683 1.9509
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.65456 0.21404 -7.730 1.08e-14 ***
## genderfemale -0.08694 0.28352 -0.307 0.759
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 334.87 on 389 degrees of freedom
## Residual deviance: 334.78 on 388 degrees of freedom
## (13 observations deleted due to missingness)
## AIC: 338.78
##
## Number of Fisher Scoring iterations: 4
# 4. exponentiate the coefficient for female to obtain the odds ratio of diabetes for females compared with males
exp(m$coefficients["genderfemale"])
## genderfemale
## 0.9167328
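# optional (a sketch): exponentiate the Wald confidence limits too; the
# genderfemale row then gives a 95% confidence interval for this odds ratio
exp(confint.default(m))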
###############

# run logistic model with age as predictor variable


m <- glm(dm ~ age, family = binomial(link = logit))
summary(m)
##
## Call:
## glm(formula = dm ~ age, family = binomial(link = logit))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3612 -0.5963 -0.4199 -0.3056 2.4848
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.404530 0.542828 -8.114 4.90e-16 ***
## age 0.052465 0.009388 5.589 2.29e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 334.87 on 389 degrees of freedom
## Residual deviance: 299.41 on 388 degrees of freedom
## (13 observations deleted due to missingness)
## AIC: 303.41
##
## Number of Fisher Scoring iterations: 5
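# optional (a sketch): exponentiate the age coefficient to express it as
# an odds ratio per one-year increase in age (roughly 1.05 here, i.e.
# about a 5% increase in the odds of diabetes per extra year of age)
exp(coef(m)["age"])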
# to check whether the relationship between age and the log odds of having diabetes is linear (an assumption of logistic regression)
###############

# 1. create a cross tabulation of age and diabetes status


dm_by_age <- table(age, dm)

# 2. output the frequencies of diabetes status by age


freq_table <- prop.table(dm_by_age, margin = 1)

# 3. calculate the odds of having diabetes


odds <- freq_table[, "yes"]/freq_table[, "no"]

# 4. calculate the log odds


logodds <- log(odds)
# 5. plot the ages found in the sample against the log odds of having diabetes
plot(rownames(freq_table), logodds)
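
# optional sketch: the same plot with age treated as numeric, plus a
# straight line to help eyeball linearity; ages at which everyone or no
# one has diabetes give infinite log odds, so keep only finite values
ages <- as.numeric(rownames(freq_table))
keep <- is.finite(logodds)
plot(ages[keep], logodds[keep])
abline(lm(logodds[keep] ~ ages[keep]))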
