Geographical Data Science and Spatial Data Analysis: An Introduction in R (Spatial Analytics and GIS), 1st Edition
Los Angeles
London
New Delhi
Singapore
Washington DC
Melbourne
SAGE Publications Ltd
1 Oliver’s Yard
55 City Road
London EC1Y 1SP
SAGE Publications Inc.
2455 Teller Road
Thousand Oaks, California 91320
SAGE Publications India Pvt Ltd
B 1/I 1 Mohan Cooperative Industrial Area
Mathura Road
New Delhi 110 044
SAGE Publications Asia-Pacific Pte Ltd
3 Church Street
#10-04 Samsung Hub
Singapore 049483
© Lex Comber and Chris Brunsdon 2021
First published 2021
Apart from any fair dealing for the purposes of research or private study,
or criticism or review, as permitted under the Copyright, Designs and
Patents Act, 1988, this publication may be reproduced, stored or
transmitted in any form, or by any means, only with the prior permission
in writing of the publishers, or in the case of reprographic reproduction, in
accordance with the terms of licences issued by the Copyright Licensing
Agency. Enquiries concerning reproduction outside those terms should
be sent to the publishers.
Library of Congress Control Number: 2020938055
British Library Cataloguing in Publication data
A catalogue record for this book is available from the British Library
ISBN 978-1-5264-4935-1
ISBN 978-1-5264-4936-8 (pbk)
Editor: Jai Seaman
Assistant editor: Charlotte Bush
Assistant editor, digital: Sunita Patel
Production editor: Katherine Haw
Copyeditor: Richard Leigh
Proofreader: Neville Hankins
Indexer: Martin Hargreaves
Marketing manager: Susheel Gokarakonda
Cover design: Francis Kenney
Typeset by: C&M Digitals (P) Ltd, Chennai, India
Printed in the UK
At SAGE we take sustainability seriously. Most of our products are
printed in the UK using responsibly sourced papers and boards. When
we print overseas we ensure sustainable papers are used as measured
by the PREPS grading system. We undertake an annual audit to monitor
our sustainability.
Lex: To my children, Carmen, Fergus and Madeleine: you are all adults
now. May you continue to express your ever growing independence in
body, as well as thought.
Chris: To all of my family – living and no longer living.
CONTENTS
About the authors
Preface
Online resources
1 Introduction to Geographical Data Science and Spatial Data
Analytics
1.1 Overview
1.2 About this book
1.2.1 Why Geographical Data Science and Spatial Data
Analytics?
1.2.2 Why R?
1.2.3 Chapter contents
1.2.4 Learning and arcs
1.3 Getting started in R
1.3.1 Installing R and RStudio
1.3.2 The RStudio interface
1.3.3 Working in R
1.3.4 Principles
1.4 Assignment, operations and object types in R
1.4.1 Your first R script
1.4.2 Basic data types in R
1.4.3 Basic data selection operations
1.4.4 Logical operations in R
1.4.5 Functions in R
1.4.6 Packages
1.5 Summary
References
2 Data and Spatial Data in R
2.1 Overview
2.2 Data and spatial data
2.2.1 Long vs. wide data
2.2.2 Changes to data formats
2.2.3 Data formats: tibble vs. data.frame
2.2.4 Spatial data formats: sf vs. sp
2.3 The tidyverse and tidy data
2.4 dplyr for manipulating data (without pipes)
2.4.1 Introduction to dplyr
2.4.2 Single-table manipulations: dplyr verbs
2.4.3 Joining data tables in dplyr
2.5 Mapping and visualising spatial properties with tmap
2.6 Summary
References
3 A Framework for Processing Data: the Piping Syntax and dplyr
3.1 Overview
3.2 Introduction to pipelines of tidy data
3.3 The dplyr pipelining filters
3.3.1 Using select for column subsets
3.3.2 Using mutate to derive new variables and transform
existing ones
3.3.3 group_by and summarise: changing the unit of
observation
3.3.4 group_by with other data frame operations
3.3.5 Order-dependent window functions
3.4 The tidy data chaining process
3.4.1 Obtaining data
3.4.2 Making the data tidy
3.5 Pipelines, dplyr and spatial data
3.5.1 dplyr and sf format spatial objects
3.5.2 A practical example of spatial data analysis
3.5.3 A further map-based example
3.5.4 Other spatial manipulations
3.6 Summary
References
4 Creating Databases and Queries in R
4.1 Overview
4.2 Introduction to databases
4.2.1 Why use a database?
4.2.2 Databases in R
4.2.3 Prescribing data
4.3 Creating relational databases in R
4.3.1 Creating a local in-memory database
4.3.2 Creating a local on-file database
4.3.3 Summary
4.4 Database queries
4.4.1 Extracting from a database
4.4.2 Joining (linking) database tables
4.4.3 Mutating, grouping and summarising
4.4.4 Final observations
4.5 Worked example: bringing it all together
4.6 Summary
References
5 EDA and Finding Structure in Data
5.1 Overview
5.2 Exploratory data analysis
5.3 EDA with ggplot2
5.3.1 ggplot basics
5.3.2 Groups with ggplot
5.4 EDA of single continuous variables
5.5 EDA of multiple continuous variables
5.6 EDA of categorical variables
5.6.1 EDA of single categorical variables
5.6.2 EDA of multiple categorical variables
5.7 Temporal trends: summarising data over time
5.8 Spatial EDA
5.9 Summary
References
6 Modelling and Exploration of Data
6.1 Overview
6.2 Questions, questions
6.2.1 Is this a fake coin?
6.2.2 What is the probability of getting a head in a coin flip?
6.2.3 How many heads next time I flip the coin?
6.3 More conceptually demanding questions
6.3.1 House price problem
6.3.2 The underlying method
6.3.3 Practical computation in R
6.4 More technically demanding questions
6.4.1 An example: fitting generalised linear models
6.4.2 Practical considerations
6.4.3 A random subset for regressions
6.4.4 Speeding up the GLM estimation
6.5 Questioning the answering process and questioning the
questioning process
6.6 Summary
References
7 Applications of Machine Learning to Spatial Data
7.1 Overview
7.2 Data
7.3 Prediction versus inference
7.4 The mechanics of machine learning
7.4.1 Data rescaling and normalisation
7.4.2 Training data
7.4.3 Measures of fit
7.4.4 Model tuning
7.4.5 Validation
7.4.6 Summary of key points
7.5 Machine learning in caret
7.5.1 Data
7.5.2 Model overviews
7.5.3 Prediction
7.5.4 Inference
7.5.5 Summary of key points
7.6 Classification
7.6.1 Supervised classification
7.6.2 Unsupervised classification
7.6.3 Other considerations
7.6.4 Pulling it all together
7.6.5 Summary
References
8 Alternative Spatial Summaries and Visualisations
8.1 Overview
8.2 The invisibility problem
8.3 Cartograms
8.4 Hexagonal binning and tile maps
8.5 Spatial binning data: a small worked example
8.6 Binning large spatial datasets: the geography of misery
8.6.1 Background context
8.6.2 Extracting from and wrangling with large datasets
8.6.3 Mapping
8.6.4 Considerations
8.7 Summary
References
9 Epilogue on the Principles of Spatial Data Analytics
9.1 What we have done
9.1.1 Use the tidyverse
9.1.2 Link analytical software to databases
9.1.3 Look through a spatial lens
9.1.4 Consider visual aspects
9.1.5 Consider inferential aspects
9.2 What we have failed to do
9.2.1 Look at spatio-temporal processes
9.2.2 Look at textual data
9.2.3 Look at raster data
9.2.4 Be uncritical
9.3 A series of consummations devoutly to be wished
9.3.1 A more integrated spatial database to work with R
9.3.2 Cloud-based R computing
9.3.3 Greater critical evaluation of data science projects
References
Index
ABOUT THE AUTHORS
Alexis Comber
(Lex) is Professor of Spatial Data Analytics at Leeds Institute for Data
Analytics (LIDA), University of Leeds. He worked previously at the
University of Leicester where he held a chair in Geographical Information
Sciences. His first degree was in Plant and Crop Science at the
University of Nottingham and he completed a PhD in Computer Science
at the Macaulay Institute, Aberdeen (now the James Hutton Institute),
and the University of Aberdeen. This developed expert systems for land
cover monitoring from satellite imagery and brought him into the world of
spatial data, spatial analysis and mapping. Lex’s research interests span
many different application areas including environment, land cover/land
use, demographics, public health, agriculture, bio-energy and
accessibility, all of which require multi-disciplinary approaches. His
research draws from methods in geocomputation, mathematics, statistics
and computer science, and he has extended techniques in operations
research/location allocation (what to put where), graph theory (cluster
detection in networks), heuristic searches (how to move intelligently
through highly dimensional big data), remote sensing (novel approaches
for classification), handling divergent data semantics (uncertainty
handling, ontologies, text mining) and spatial statistics (quantifying spatial
and temporal process heterogeneity). He has co-authored (with Chris
Brunsdon) the first ‘how to’ book for spatial analysis and mapping in R,
the open source statistical software, now in its second edition
(https://uk.sagepub.com/en-gb/eur/an-introduction-to-r-for-spatial-analysis-and-mapping/book258267). Outside of academic work and in no
particular order, Lex enjoys his vegetable garden, walking the dog and
playing pinball (he is the proud owner of a 1981 Bally Eight Ball Deluxe).
Chris Brunsdon
is Professor of Geocomputation and Director of the National Centre for
Geocomputation at the National University of Ireland, Maynooth, having
worked previously in the Universities of Newcastle, Glamorgan, Leicester
and Liverpool, variously in departments focusing on both geography and
computing. He has interests that span both of these disciplines, including
spatial statistics, geographical information science, and exploratory
spatial data analysis, and in particular the application of these ideas to
crime pattern analysis, the modelling of house prices, medical and health
geography and the analysis of land use data. He was one of the
originators of the technique of geographically weighted regression
(GWR). He has extensive experience of programming in R, going back to
the late 1990s, and has developed a number of R packages which are
currently available on CRAN, the Comprehensive R Archive Network. He
is an advocate of free and open source software, and in particular the
use of reproducible research methods, and has contributed to a large
number of workshops on the use of R and of GWR in a number of
countries, including the UK, Ireland, Japan, Canada, the USA, the Czech
Republic and Australia. When not involved in academic work he enjoys
running, collecting clocks and watches, and cooking – the last of these
probably cancelling out the benefits of the first.
PREFACE
Data and data science are emerging (or have emerged) as a dominant
activity in many disciplines that now recognise the need for empirical
evidence to support decision-making (although at the time of writing in
the UK at the end of April 2020, this is not obvious). All data are spatial
data – they are collected somewhere – and location cannot be treated as
just another variable in most statistical models. And because of the ever
growing volumes of (spatial) data, from increasingly diverse sources,
describing all kinds of phenomena and processes, being able to develop
approaches and models grounded in spatial data analytics is increasingly
important. This book pulls together and links lessons from general data
science to those from quantitative geography, which have been
developed and applied over many years.
In fact, the practices and methods of data science, if framed as being a
more recent term for statistical analysis, and spatial data science, viewed
as being grounded in geographical information systems and science, are
far from new. A review of the developments in these fields would suggest
that the ideas of data analytics have arisen as a gradual evolution. One
interesting facet of this domain is the importance of spatial
considerations, particularly in marketing, where handling locational data
has been a long-standing core activity. The result is that geographical
information scientists and quantitative geographers are now leading
many data science activities – consider the background of key players at
the Alan Turing Institute, for example. Leadership is needed from this
group in order to ensure that lessons learned and experiences gained are
shared and disseminated. A typical example of this is the modifiable areal
unit problem, which in brief posits that statistical distributions,
relationships and trends exhibit very different properties when the same
data are aggregated or combined over different areal units, and at
different spatial scales. It describes the process of distortion in
calculations and differences in outcomes caused by changes in zoning and
scales. This applies to all analyses of spatial data – and has universal
consequences, but is typically unacknowledged by research in non-
geographical domains using spatial data.
This possibility of distortion also underpins another motivation for this
book at this time: one of reproducibility. The background to R, the open
source statistical package, is well documented and a number of
resources have been published that cover recent developments in the
context of spatial data and spatial analysis in R (including our other
offering in this arena: Brunsdon and Comber, 2018). This has promoted
the notion of the need for open coding environments within which
analysis takes place, thereby allowing (spatial) data science cultures to
flourish. And in turn this has resulted in a de facto way of working that
embraces open thinking, open working, sharing, open collaboration, and,
ultimately, reproducibility and transparency in research and analysis. This
has been massively supported by the RStudio integrated development
environment for working in R, particularly the inclusion of RMarkdown
which allows users to embed code, analysis and data within a single
document, as well as the author’s interpretation of the results. This is
truly the ‘holy grail’ of scientific publishing!
A further driver for writing this book is to promote notions of critical data
science. Through the various examples and illustrations in the book, we
have sought to show how different answers/results (and therefore
understandings and predictions) can be generated by very small and
subtle changes to models, either through the selection of the machine
learning algorithm, the scale of the data used or the choice of the input
variables. Thus we reject plug and play data science, we reject the idea
of theory-free analyses, we reject data mining, all of which abrogate
inferential responsibility through philosophies grounded in letting the data
speak. Many of the new forms of data that are increasingly available to
the analyst are not objective (this is especially the case for what has
sometimes been called ‘big data’). They are often collected without any
experimental design, have many inherent biases and omissions, and
without careful consideration can result in erroneous inference and poor
decision-making. Thus being critical means considering the
technological, social and economic origins of data, including their
creation and deployment, as well as the properties of the data relative to
the intended analysis, or the consequences of any analysis. Criticality
involves thinking about the common good, social contexts, using data
responsibly, and even considering how your work could be used in the
wrong way or the results misinterpreted. There is no excuse for number
crunchers who fail to be critical in their data analysis.
In summary, we believe that the practice of data analytics (actually
spatial data analytics) should be done in an open and reproducible way, it
should include a critical approach to the broader issues surrounding the
data, their analysis and consideration of how they will be used, and it
should be done wearing geography goggles to highlight the impacts of
scale and zonation on the results of analyses of spatial data. This may
involve some detective work to understand the impacts of data and
analysis choices on the findings – this too is a part of data science. We
believe that the contents of this book, and the various coded examples,
provide the reader with an implicit grounding in these issues.
REFERENCE
Brunsdon, C. and Comber, L. (2018) An Introduction to R for Spatial
Analysis and Mapping (2nd edn). London: Sage.
ONLINE RESOURCES
Here the result is 11. The [1] that precedes it formally indicates that the
first requested element will follow. In this case there is just one element. The
> indicates that R is ready for another command and all code outputs that
are reproduced in this book are prefixed by ‘##’ to help the reader
differentiate these from code inputs.
So for the first bit of R code type the code below into your new R script:
y = c(4.3,7.1,6.3,5.2,3.2,2.1)
Notice the use of the c in the above code. This is used to combine or
concatenate individual objects of the same type (numeric, character or
logical – see below).
To run the code, a quick way is to highlight it (with the mouse or using the
keyboard controls) and press Ctrl-Enter (or Cmd-Enter on a Mac); the result
appears in the console pane. There is also a Run button at the top right of
your script panel in RStudio: place your mouse cursor on the line and click on
Run to run the code.
What you have done with the code snippet is assign the values to y, an R
object. Assignment is the basic process of passing values to R objects,
which are held in R’s memory – if you look at the Environment pane in
RStudio you will see that it now has an object called y. Assignment will
generally be done using an = (equals sign), but you should note that it
can also be done with a <- (less than, dash).
y <- c(4.3,7.1,6.3,5.2,3.2,2.1)
For the time being you can assume that assignment with = and <- are
the same (although they are not, as will be illustrated with piping syntax
in later chapters).
Now you can undertake operations on the objects held in the R working
environment. You should write the following code into your R script and
run each line individually or as a block by highlighting multiple lines and
then running the code as before. Recall that highlighted code can be run
by clicking on the Run icon at the top right of the script pane, or by
pressing Ctrl-Enter (PC) or Cmd-Enter (Mac).
y*2
max(y)
There are two kinds of things here. The first is a mathematical operation
(y*2). Operations do something to an R object directly, using
mathematical notation like * for multiply. The second is the application of
the function max to find the maximum value in an R object.
Recall that your understanding of what is being done by code snippets
will grow if you explore the code, play around with it and examine the
help for individual functions. All functions have a help file that can be
accessed using help(<function_name>). Enter the code below into
your script and run it to examine the help for max:
help(max)
You should type the functions below into your R script and examine the
help for these:
sum(y)
mean(y)
A key thing to note here is that functions are always followed by round
brackets or parentheses ( ). These are different from square brackets
(or brackets) [ ] and curly brackets (or braces) { }, as will be illustrated
later. And functions (nearly!) always return something to the console and
have the form:
result <- function_name(<input>)
You should remember to write your code and add comments as these will
help you understand your code and what you did when you come back to
it at a later date.
As hinted at above with the first code snippet adding 2 to 9, R can be
used a bit like a calculator. It evaluates and prints out the result of any
expression that is entered at the command line in the console. Recall that
anything after a # prefix on a line is not evaluated. Type the code
snippets below into your R script and run them:
2/9
sqrt(1000) # square root
2*3*4*5
pi # pi
2*pi*6378 # earth circumference
sin(c(1, 3, 6)) #sine of angles in radians
Key Points
Code can (should) be run from a script.
You should include comments in your scripts.
Mathematical operations can be applied to R objects.
Functions have round brackets or parentheses in the form
function_name(<input>).
Functions take inputs, do something to them and return the result.
Each function has a help page that can be accessed using
help(function_name) or ?function_name.
1.4.2 Basic data types in R
The first bit of coding above was to get you used to the R environment.
You should have a script with a few code snippets. Now we will step back
and examine in a bit more detail some of the structures and operations in
R.
The preceding sections created two R objects, x and y – you should see
them in the Environment pane in RStudio or by entering ls() at the
console (this lists the objects in the R environment). There are a number
of fundamental data types in R that provide the building blocks for data
analysis. The sections below explore some of these data types and
illustrate further operations on them.
1.4.2.1 Vectors
A vector is a group of values of the same type. The individual values are
combined using c(); you have created one already. Examples of
vectors are:
c(2,3,5,2,7,1)
3:10 # the sequence numbers 3, 4, …, 10
c(TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,FALSE)
c("London","Leeds","New York","Montevideo", NA)
x <- c(2,3,5,2,7,1)
x
## [1] 2 3 5 2 7 1
y <- c(10,15,12)
y
## [1] 10 15 12
z <- c(x, y)
z
## [1] 2 3 5 2 7 1 10 15 12
Vectors can be subsetted – that is, elements from them can be extracted.
There are two common ways to extract subsets of vectors. Note in both
cases the use of the square brackets [ ].
1. Specify the positions of the elements that are to be extracted, for
example:
x[c(2,4)] # Extract elements 2 and 4
## [1] 3 2
2. Specify a logical condition so that only the elements for which the
condition is TRUE are extracted, for example:
x[x > 4]
## [1] 5 7
Further details on logical operations for extracting subsets are given later
in this chapter.
1.4.2.2 Matrices and data frames
Matrices and data frames are like data tables with a row and column
structure. The fundamental difference between a matrix and a
data.frame is that matrices can only contain a single data type
(numeric, logical, character, etc.) whereas a data frame can have
different types of data in each column. All elements of any column must
have the same type (e.g. all numeric).
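A minimal sketch of this difference (the values and column names here are arbitrary and used only for illustration):
# a data frame can hold numeric, character and logical columns together
data.frame(id = 1:3, city = c("Leeds", "Maynooth", "Leicester"), capital = c(FALSE, FALSE, FALSE))
# coercing the same columns to a matrix forces a single (character) type
cbind(id = 1:3, city = c("Leeds", "Maynooth", "Leicester"))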
Matrices are easy to define, but notice how the sequence 1 to 10 below is
ordered differently with the byrow parameter:
matrix(1:10, ncol = 2)
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
matrix(1:10, ncol = 2, byrow = T)
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
## [3,] 5 6
## [4,] 7 8
## [5,] 9 10
matrix(letters[1:10], ncol = 2)
## [,1] [,2]
## [1,] "a" "f"
## [2,] "b" "g"
## [3,] "c" "h"
## [4,] "d" "i"
## [5,] "e" "j"
This is a data.frame:
class(iris)
## [1] "data.frame"
The code below uses the head() function to print out the first six rows
and the dim() function to tell us the dimensions of iris, in this case
150 rows and 5 columns:
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
dim(iris)
## [1] 150 5
The str function can be used to indicate the formats of the attributes
(columns, fields) in iris:
str(iris)
## 'data.frame': 150 obs. of 5 variables:
##  $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 …
##  $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 …
##  $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 …
##  $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 …
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 …
Here we can see that four of the attributes are numeric and the other is
a factor – a kind of ordered categorical variable.
The summary function is also very useful and shows different summaries
of the individual attributes in iris:
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
##        Species
##  setosa    :50
##  versicolor:50
##  virginica :50
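The plot call discussed in the next paragraphs is not reproduced here; a minimal sketch of it, with the pch and cex values taken from the description below, is:
# pairwise scatterplots of the four numeric iris fields
plot(iris[, 1:4], pch = 1, cex = 1.5)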
Notice also how the code above used plot(iris[,1:4]). This tells R
to only plot fields (columns) 1 to 4 in the iris data table (the numeric
fields). Section 1.4.3 describes how to subset and extract elements from
R objects in greater detail. The data types for individual fields can also be
investigated using the sapply function:
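A minimal sketch of such a call applies class() over each column of iris:
sapply(iris, class)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
##    "numeric"    "numeric"    "numeric"    "numeric"     "factor"
The Species column can then be examined directly: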
iris$Species
attributes(iris$Species)
## $levels
## [1] "setosa" "versicolor" "virginica"
##
## $class
## [1] "factor"
The main advantages of factors are that values not listed in the levels
cannot be assigned to the variable (try entering the code below), and that
they allow easy grouping by other R functions as demonstrated in later
chapters.
iris$Species[10] = "bananas"
Notice how 1:4 extracted all the columns between 1 and 4. Effectively
this code says: plot columns 1, 2, 3 and 4 of the iris dataset, using plot
character (pch) 1 and a character expansion factor (cex) of 1.5. This is
very useful and hints at how elements of vectors, flat data tables such as
matrices and data frames can be extracted and/or ordered by using
square brackets.
Starting with vectors, the code below extracts different elements from z
that you defined earlier:
# z
z
# 1st element
z[1]
# 5th element
z[5]
# elements 1 to 6
z[1:6]
# elements 1, 3, 6 and 2 … in that order
z[c(1,3,6,2)]
Columns in a data table such as iris can also be selected by name – the
code below returns the first 10 rows and 2 named columns:
iris[1:10, c("Sepal.Length", "Petal.Length")]
They can also be selected logically – the code below returns the first 10
rows and 2 logically selected columns:
iris[1:10, c(TRUE, FALSE, TRUE, FALSE, FALSE)]
Thus there are multiple ways in which the ith rows or jth columns in a
data table can be accessed. Also note that compound logical statements
can be used to create an index as in the code below:
n <- iris$Sepal.Length > 6 & iris$Petal.Length > 5
iris[n,]
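The code snippets discussed below are logical subsets of the following form (a minimal sketch, assuming the x and z vectors defined earlier):
x[x > 4]
z[z > 10]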
What these do in each case is print out (extract) the elements from x and
z that are greater than 4 and 10, respectively. We can examine what is
being done by the code in the square brackets and see that they return a
series of logical statements. Considering x, we can see that it has six
elements and two of these satisfy the logical condition of ‘being greater
than 4’. The code x > 4 also returns a vector with six elements but it is a
logical one containing TRUE or FALSE values:
x
## [1] 2 3 5 2 7 1
x > 4
## [1] FALSE FALSE TRUE FALSE TRUE FALSE
Additionally, we can see that these are elements 3 and 5 of x. The code
below combines x and the logical statement into a data frame to show
these two vectors together:
data.frame(x = x, logic = x>4)
## x logic
## 1 2 FALSE
## 2 3 FALSE
## 3 5 TRUE
## 4 2 FALSE
## 5 7 TRUE
## 6 1 FALSE
These two elements could also be extracted directly by position:
x[c(3,5)]
## [1] 5 7
And this is reinforced by using the function which with the logical
statement. This returns the vector (element) positions for which the logic
statement is TRUE:
which(x > 4)
## [1] 3 5
And logical statements can be inverted using the not or negation syntax
(!) placed in front of the statement:
x > 4
## [1] FALSE FALSE TRUE FALSE TRUE FALSE
!(x > 4)
## [1] TRUE TRUE FALSE TRUE FALSE TRUE
So, returning to the original code snippets, what these do is use the
logical statement applied to the whole vector, to return just the elements
from that vector that match the condition. This could be done in long
hand by creating an intermediate variable:
my.index = x > 4
x[my.index]
Key Points
Logical statements return a vector of TRUE and FALSE elements.
These can be converted to binary [0, 1] format (see the short example
after this list).
Logical statements can be used to extract elements from one-
dimensional and two-dimensional data either directly or by assigning
the results of the statement to an R object that is later used to index
the data.
Compound logical statements can be constructed to specify a series
of conditions that have to be met.
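For example (a small illustration, assuming the x vector defined earlier), logical values convert to 0/1 with as.numeric() or simple arithmetic:
as.numeric(x > 4)  # TRUE/FALSE converted to 1/0
## [1] 0 0 1 0 1 0
sum(x > 4)         # counts the number of TRUE values
## [1] 2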
1.4.5 Functions in R
There are a number of useful functions in R as listed below. You should
explore the help for these.
length(z)
mean(z)
median(z)
range(z)
unique(z)
sort(z)
order(z)
sum(z)
cumsum(z)
cumprod(z)
rev(z)
sapply(iris, is.factor)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
##        FALSE        FALSE        FALSE        FALSE         TRUE
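Here index can be taken to be the logical vector returned by the sapply call above (an assumption, as the assignment itself is not shown):
index <- sapply(iris, is.factor)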
Note the use of ! to negate index – you could examine its effect!
index
!index
Two simple examples of R functions are described below. The first gives
an approximate conversion from miles to kilometres:
miles.to.km <- function(miles) miles * 8/5
# Distance from Maynooth to Nottingham is ~260 miles
miles.to.km(260)
The function will do the conversion for several distances all at once. To
convert a vector of the three distances 100, 200 and 300 miles to
distances in kilometres, specify:
miles.to.km(c(100,200,300))
Here is a function that makes it possible to plot the figures for any pair of
variables in the data.
plot_iris <- function(x, y){
  # extract the two specified columns (by index or by name)
  x <- iris[,x]
  y <- iris[,y]
  # scatterplot of the selected pair
  plot(x, y)
}
Note that the function body is enclosed in curly brackets or braces ({ }).
plot_iris(1,2)
plot_iris("Sepal.Length", 3)
plot_iris("Sepal.Length", "Petal.Width")
install.packages("tidyverse", dependencies =
TRUE)
Whichever way you do this, via the menu system or via the console, you
may have to respond to the request from R/RStudio to set a mirror – a
site from which to download the package – pick the nearest one! Again a
mirror only needs to be set once as RStudio will remember your choice.
You should install the tidyverse package now using one of the
methods above, making sure that this is done with the package
dependencies.
Dependencies are needed in most package installations. This is because
most packages build on the functions contained in other packages. If you
have installed tidyverse you will see that it loads a number of other
packages. This is because the tidyverse package uses functions as
building blocks for its own functions, and provides a wrapper for these
(see https://www.tidyverse.org). These dependencies are loaded if you
check the Install dependencies tickbox in the system menu
approach or by including the dependencies = TRUE argument in the
code above. In both cases these tell R to install any other packages that
are required by the package being installed.
Once installed, the packages do not need to be installed again for
subsequent use. They can simply be called using the library function
as below, and this is usually done at the start of the R script:
library(tidyverse)
Now examine the help for the readShapePoly function using the help
function:
help(readShapePoly)
Here you can see that the help pages for deprecated functions contain a
warning and suggest other functions that should be used instead. The
code on the book’s website (https://study.sagepub.com/comber) will
always contain up-to-date code snippets for each chapter to overcome
any problems caused by function deprecation.
Such changes are only a minor inconvenience and are part of the nature
of a dynamic development environment provided by R in which to do
research: such changes are inevitable as packages finesse, improve and
standardise. Further descriptions of packages, their installation and their
data structures are given in later chapters.
Key Points
R/RStudio comes with a large number of default tools.
These can be expanded by installing user-contributed packages of
tools and data.
Packages only have to be installed once (via the menu or the console),
and then can be called subsequently using the
library(package_name) syntax.
Packages should be installed with their dependencies (these are
other packages that the target package builds on).
1.5 SUMMARY
The aim of Section 1.4 was to introduce you to R if you have not used it
before, to familiarise you with the R/RStudio environment and to provide
a basic grounding in R coding. You should have a script with all your R
code and comments (comments are really important) describing how to
assign values to R objects and some basic operations on those objects’
data. You should understand some of the different basic data structures
and how to handle and manipulate them, principally one-dimensional
vectors and two-dimensional matrices and data frames. Logical
operations were introduced. These allow data elements to be extracted
or subsetted. Functions were introduced, and you should have installed
the tidyverse package. Some additional resources were suggested for
those needing to spend a bit more time becoming more familiar with R. A
number of excellent online get started in R guides were listed and the
book An Introduction to R for Spatial Analysis and Mapping (Brunsdon
and Comber, 2018) was recommended if a deeper introduction to data
formats in R was needed.
REFERENCES
Brunsdon, C. and Comber, A. (2020a) Big issues for big data. Preprint
arXiv:2007.11281.
Brunsdon, C. and Comber, A. (2020b) Opening practice: Supporting
reproducibility and critical spatial data science. Journal of Geographical
Systems, https://doi.org/10.1007/s10109-020-00334-2.
Brunsdon, C. and Comber, L. (2018) An Introduction to R for Spatial
Analysis and Mapping (2nd edn). London: Sage.
Comber, A., Brunsdon, C., Charlton, M. and Harris, R. (2016) A moan, a
discursion into the visualisation of very large spatial data and some
rubrics for identifying big questions. In International Conference on
GIScience Short Paper Proceedings. Vol. 1, Issue 1.
Kitchin, R. (2013) Big data and human geography: Opportunities,
challenges and risks. Dialogues in Human Geography, 3(3), 262–267.
Kitchin, R. and Lauriault, T. (2014) Towards critical data studies: Charting
and unpacking data assemblages and their work. The Programmable
City Working Paper 2, Preprint. https://ssrn.com/abstract=2474112.
Kitchin, R. and McArdle, G. (2016) What makes big data, big data?
Exploring the ontological characteristics of 26 datasets. Big Data &
Society, 3(1).
Laney, D. (2001) 3D data management: Controlling data volume, velocity
and variety. META Group Research Note, 6 February.
Marr, B. (2014) Big data: The 5 Vs everyone must know. LinkedIn Pulse,
6 March.
McNulty, E. (2014) Understanding big data: The seven V’s. Dataconomy,
22 May.
Myers, J. L., Well, A. D. and Lorch, R. F. Jr. (2013) Research Design and
Statistical Analysis. New York: Routledge.
O’Neil, C. and Schutt, R. (2014) Doing Data Science: Straight Talk from
the Front Line. Sebastopol, CA: O’Reilly.
Openshaw, S. (1984a) Ecological fallacies and the analysis of areal
census data. Environment and Planning A, 16(1), 17–31.
Openshaw, S. (1984b) The Modifiable Areal Unit Problem, Catmog 38.
Norwich: Geo Abstracts.
Pebesma, E., Bivand, R., Racine, E., Sumner, M., Cook, I., Keitt, T. et al.
(2019) Simple features for R. https://cran.r-project.org/web/packages/sf/vignettes/sf1.html.
Robinson, W. (1950) Ecological correlations and the behavior of
individuals. American Sociological Review, 15, 351–357.
Tobler, W. R. (1970) A computer movie simulating urban growth in the
Detroit region. Economic Geography, 46 (sup1), 234–240.
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François,
R., Grolemund, G. et al. (2019) Welcome to the Tidyverse. Journal of
Open Source Software, 4(43), 1686.
2 DATA AND SPATIAL DATA IN R
2.1 OVERVIEW
This chapter introduces data and spatial data. It covers recent
developments in R that have resulted in new formats for data (tibble
replacing data.frame) and for spatial data (the sf format replacing the
sp format). The chapter describes a number of issues related to data
generally, including the structures used to store data and tidy data
manipulations with the dplyr package, and introduces the tmap
package for mapping and visualising spatial data properties, before they
are given a more comprehensive treatment in later chapters (Chapters 3
and 5, respectively).
The chapter introduces some data-table-friendly functions and some of
the critical issues associated with spatial data science, such as how to
move between different spatial reference systems so that data can be
spatially linked. Why spatial data? Nearly all data are spatial: they are
collected somewhere, at some place. In fact this argument can probably
be extended to the spatio-temporal domain: all data are spatio-temporal –
they are collected somewhere and at some time (Comber and Wulder,
2019).
This chapter covers the following topics:
Data and spatial data
The tidyverse package and tidy data
The sf package and spatial data
The dplyr package for manipulating data.
It is expected that the reader has a basic understanding of data formats
in R, and has at least worked their way through the introductory materials
in Chapter 1. They should be familiar with assigning values to different
types of R object (e.g. character, logical and numeric) and
different classes of R object (e.g. vector, matrix, list,
data.frame and factor). If you have worked through our other
offering (Brunsdon and Comber, 2018) then some of this chapter will be
revision.
The following packages will be required for this chapter:
library(tidyverse)
library(sf)
library(tmap)
library(sp)
library(datasets)
If you have not already installed these packages, this can be done using
the install.packages() function as described in Chapter 1. For
example:
install.packages("tidyverse", dep = T)
Note that if you encounter problems installing packages then you should
try to install them in R (outside of RStudio) and then call the now installed
libraries in RStudio.
Remember that you should write your code into a script, and set the
working directory for your R/RStudio session. To open a new R script
select File > New File > R Script and save it to your working directory
(File > Save As) with an appropriate name (e.g. chap2.R). Having
saved your script, you should set the working directory. This can be done
via Session > Set Working Directory > Choose Directory. Or, if you
already have a saved script in a directory, select Session > Set Working
Directory > To Source File Location. This should be done each time
you start working on a new chapter.
2.2 DATA AND SPATIAL DATA
2.2.1 Long vs. wide data
This section starts by examining data formats, comparing non-spatial or
aspatial data with spatial data. However, it is instructive to consider what
is meant by data in this context. R has many different ways of storing and
holding data, from individual data elements to lists of different data types
and classes.
The commonest form of data is a flat data table, similar in form to a
spreadsheet. In this wide format, each of the rows (or records) relates to
some kind of observation or real-world feature (e.g. a person, a
transaction, a date, a shop) and the columns (or fields) represent some
attribute or property associated with that observation. Each cell in wide
data tables contains a value of the property of a single record.
Much spatial data has a similar structure. Under the object or vector form
of spatial data, records still refer to single objects, but these relate to
some real-world geographical feature (e.g. a place, a route, a region) and
the fields describe variables, measurements or attributes associated with
that feature (e.g. population, length, area).
Data can also be held in long format. In this format, the observation or
feature is retained but the multiple variable fields are typically collapsed
into three columns: one containing observation IDs or references, another
describing the variable name, domain or type, and the third containing
the value for that observation in that domain. A long data table therefore
grows longer as the number of variables being collapsed increases.
Depending on the analysis you are undertaking, you may require wide
format data or long format data. Data are usually supplied in wide format,
but some data science activities require long format data, particularly
visualisation with the ggplot2 package. Thus it is important to be able to
move between wide and long data formats.
The tidyr package provides functions for doing this. The
pivot_longer() function transforms from wide to long format, and the
pivot_wider() function transforms from long to wide format.
The code below generates wide and long format data tables from a
subset of the first five records and of the first three fields of the mtcars
dataset, one of the datasets automatically included with R. The results
are shown in Tables 2.1 and 2.2.
# load the data
data(mtcars)
# create an identifier variable, ID, from the data frame row names
mtcars$ID = rownames(mtcars)
# sequentially number the data frame rows
rownames(mtcars) = 1:nrow(mtcars)
# extract a subset of the data
mtcars_subset = mtcars[1:5,c(12, 1:3)]
# pivot to long format
mtcars_long = pivot_longer(mtcars_subset, -ID)
# pivot to wide format
mtcars_wide = pivot_wider(mtcars_long)
mtcars_long
mtcars_wide
Table 2.1
Table 2.2
Key Points
Data tables are composed of rows or records, and columns or fields.
Flat data tables have a row for each observation (a place, a person,
etc.) and a column for each variable (field).
Long data format contains a record for each unique observation–field
combination.
2.2.2 Changes to data formats
The standard formats for tabular data and vector-based spatial data have
been the data.frame and sp formats. However, R is a dynamic coding
and research environment: things do not stay the same. New tools,
packages and formats are constantly being created and updated to
improve, extend and increase the functionality and consistency of
operations in R. Occasionally a completely new paradigm is introduced,
and this is the case with the recent launch of the tibble and sf data
formats for data and spatial data, respectively.
The tibble format was developed to replace the data.frame. It is a
reimagining of the data frame that overcomes the limitations of the
data.frame format. The emergence of the tibble as the default
format for data tables has been part of a coherent and holistic vision for
data science in R supported by the tidyverse package
(https://www.tidyverse.org). This has been transformatory: tidyverse
provides a collection of integrated R packages supporting data science
that have redefined data manipulation tools, vocabularies and functions
as well as data formats (Wickham, 2014). The tibble package is
loaded with dplyr as part of tidyverse. Without the tidyverse
package, this book could not have been written in the way that it has
been. The tidyverse and tibble format have also underpinned a
parallel development in spatial data. At the time of writing the sp spatial
data format is in the process of being replaced by sf or the simple
feature data format for vector spatial data. These developments are
described in the next sections.
2.2.3 Data formats: tibble vs. data.frame
The tibble class in R is composed of a series of vectors of equal
length, which together form two-dimensional data structures. Each vector
records values for a particular variable, theme or attribute and has a
name (or header) and is ordered such that the nth element in the vector
is a value for the nth record (row) representing the nth feature. These
characteristics also apply to the data.frame class, which at the time of
writing is probably the most common data format in R.
The tibble is a reworking of the data.frame that retains its
advantages (e.g. multiple data types) and eliminates less effective
aspects (some of which are illustrated below). However, the tibble
reflects a tidy (Wickham, 2014) data philosophy:
It allows multiple types of variable or attribute to be stored in the
same table (unlike, for example, the matrix format which can only
hold one data type, such as integer, logical or character).
It seeks to be lazy and does not do any work trying, for example, to
link partially matched variable names (unlike the data.frame
format – see the example below).
It is surly and complains if calls to the data are not exactly specified,
identifying problems earlier in the data analysis cycle and thereby
forcing cleaner coding.
By contrast, the data.frame format is not tidy: it is not lazy or surly.
This is illustrated in the code snippets below. These use the mtcars
dataset loaded above (a data.frame) to create a new data.frame
and then creates a tibble from this to highlight the differences between
the two formats.
First, create a data.frame called df from mtcars:
# create 4 variables
type = as.character(mtcars$ID)
weight = mtcars$wt
horse_power = mtcars$hp
q_mile <- mtcars$qsec
# create a new data.frame
df <- data.frame(type, weight, horse_power, q_mile)
str(df)
## 'data.frame': 32 obs. of 4 variables:
##  $ type        : Factor w/ 32 levels "AMC Javelin",..: 18 19 5 13 14 31 7 21 20 22 …
##  $ weight      : num 2.62 2.88 2.32 3.21 3.44 …
##  $ horse_power : num 110 110 93 110 175 105 245 62 95 123 …
##  $ q_mile      : num 16.5 17 18.6 19.4 17 …
unique(df$type)
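A minimal sketch of creating the tibble tb used in the comparisons below, assuming the same four vectors and the tidyverse loaded:
# create a tibble from the same four variables
tb <- tibble(type, weight, horse_power, q_mile)
tb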
Here we can see that the character values in type have not been
converted to a factor. However, probably the biggest criticism of
data.frame is the partial matching behaviour. Enter the following code:
head(df$ty)
head(tb$ty)
Some final considerations are that the print method for tibble returns
the first 10 records by default, whereas for a data.frame all records are
returned. Additionally, the tibble class includes a description of the
class of each field (column) when it is printed. Examine the differences
between these data table formats:
tb
df
You should examine the tibble vignettes and explore their creation,
coercion, subsetting and so on:
vignette("tibble")
The CSV file can be read in with the stringsAsFactors parameter set
to FALSE to avoid the factor issue with the data.frame format
described above:
df2 = read.csv("df.csv", stringsAsFactors = F)
str(df2)
The write.table() function can be used to write TXT files with
different field separations:
# write a tab-delimited text file
write.table(df, file = "df.txt", sep = "\t", row.names = F, qmethod = "double")
df2 = read.table(file = "df.txt", header = T, sep = "\t", stringsAsFactors = F)
head(df2)
str(df2)
Data tables in tibble format can be treated in a similar way but using
the read_csv() and write_csv() functions:
tb2 = read_csv("df.csv")
write_csv(tb, "tb.csv")
# write a tab-delimited text file
write_delim(tb, path = "tb.txt", delim = "\t", quote_escape = "double")
The advantages and hints for effective reading of tidy data are described
more fully in Section 3.4 of Chapter 3.
R binary files can also be written out and loaded into an R session.
These have the advantage of being very efficient at storing data and
quicker to load than, for example, .csv or .txt files. The code below
saves the df and tb R objects – check your working directory when you have
run this to see the differences in file size on your computer:
save(list = c("df", "tb"), file = "data.RData")
The load() function loads the R objects in an .RData file back into the R
session, with the same names. To test this, run the code below (assuming you
have run the code snippet above). This deletes the two R objects and then
loads them back in:
ls()
rm(list = c("df", "tb"))
ls()
load(file = "data.RData")
ls()
There are also tools for loading data in other formats such as the
foreign package. An internet search can quickly provide solutions for
loading data in different formats.
In summary, the first observable difference between tibbles and standard
data frames is that when a tibble is printed out a truncated version is
seen, with no need to use head to stop every row from being printed.
Also, they state what kind of variable each column is when they are
printed. Thus, in tb the weight, horse_power and q_mile variables
are all indicated as numeric double precision numbers (<dbl>) but type
is a character (<chr>) variable. Information about the unprinted rows and
columns is also given. Other differences are in the way tibbles are read in
from files – for example, they will always reproduce the column names in
.csv files as they are given – so that a column with the header (name) of
‘Life Exp’ would be stored as Life Exp in a tibble rather than
Life.Exp in a standard data frame. Also, character variables are stored
as such, rather than converted to factors. If a factor variable is genuinely
required, an explicit conversion should be used. To read a .csv file into
a tibble, read_csv is used in place of read.csv. Perhaps the most
important difference is that the tibble is surly and complains if operations
on it are not correctly specified, whereas the data frame is too
accommodating – it tries to help – for example in partial matching, which
is dangerous in data science.
Key Points
The tibble format is a reworking of the data.frame format that
retains its advantages and eliminates less effective aspects.
It seeks to be lazy and will not link to partially matched variable
names (unlike the data.frame format).
It is surly and complains if calls to the data are not correct.
A number of different ways of reading and writing data from and to
external files were illustrated.
Package vignettes were introduced. These provide short overviews
of particular aspects of package functionality and operations.
2.2.4 Spatial data formats: sf vs. sp
Data describing spatial features in R are similar in structure to tabular
data, but they also include information about the spatial properties of
each observation: the coordinates of the point, line or area. These allow
the geography of the observation to be interrogated spatially. The
structural similarity of spatial data tables to ordinary data tables allows
them to be manipulated in much the same way.
For many years the tools for handling spatial data in R were built around
the spatial data structures defined in the sp package. The sp format for
spatial data in R is a powerful structure that supports a great deal of
spatial data manipulation. In this data model, spatial objects can be
thought of as being divided into two parts – a data frame (essentially the
same as those considered until now) and the geometric information for
one of several kinds of spatial objects (e.g.
SpatialPointsDataFrame, SpatialPolygonsDataFrame), where
each row in the data frame is associated with an individual component of
the geometric information. The sp class of objects is broadly analogous
to shapefile formats (lines, points, areas) and raster or grid formats. The
sp format defines spatial objects both with a data.frame (holding
attributes) and without (purely spatial objects) as shown in Table 2.3.
However, recently a new class of spatial object has been defined called
sf (which stands for simple features). The sf format seeks to encode
spatial data in a way that conforms to the formal standards defined in
ISO 19125-1:2004.
This mode of storage is more ‘in tune’ with the tidy framework. Essentially
spatial objects in sf format appear as a data frame with an extra column
named geometry that contains the geometrical information for the
spatial part of the object. A column containing a geometric feature is
called an sfc (simple feature column), and this can be used in various
overlay and other geometric operations, but of itself is not a tibble or data
frame. A data frame containing a geometry column of sfc type is called
an sf object. It is relatively easy to create both kinds of object via the sf
package. Since a great deal of geographical data exists either through
existing ‘classical’ R data objects (from sp) or as shapefiles, a common
approach is to convert existing sp objects from a Spatial*DataFrame
(using st_as_sf) or to read from a spatial data file such as shapefile or
geopackage, using st_read() – both of which are functions provided in
the sf package. Thus, sf emphasises the spatial geometry of objects,
their hierarchical ordering and the way that objects are stored in
databases. The team developing sf (many of whom also developed the
sp package and format) aim to provide a new, consistent and standard
format for spatial data in R, and the sf format implements a tidy data
philosophy in the same way as tibble.
Data are tidy (Wickham, 2014; Wickham and Grolemund, 2016) when:
Each variable forms a column
Each observation forms a row
Each type of observational unit forms a table.
Further tidy aspects of the sf package are:
All functions/methods start with st_ (enter this in the console and
press tab to search), use _ and are in lower case
All functions have the input data as the first argument and are thus
piping syntax friendly (see Chapter 3)
By default stringsAsFactors = FALSE (see above describing
tibble and data.frame formats)
dplyr verbs can be directly applied to sf objects, meaning that they
can be manipulated in the same way as ordinary data tables such as
tibble format (Chapters 3, 4 and 7).
The idea behind sf is that a feature (an object in a spatial data layer or a
spatial database representing part of the real world) is often composed of
other objects, with a set of objects forming a single feature. A forest stand
can be a feature, a forest can be a feature, a city can be a feature. Each
of these has other objects (features) within it. Similarly, a satellite image
pixel can be a feature, but so can a complete image as well. Features in
sf have a geometry describing where they are located, and they have
attributes which describe their properties. An overview of the evolution of
spatial data formats in R can be found at
https://edzer.github.io/UseR2017/.
The sf package puts features in tables derived from data.frame or
tibble formats. These tables have simple feature class (sfc)
geometries in a column, where each element is the geometry of a single
feature of class sfg. Feature geometries are represented in R by:
a numeric vector for a single point (POINT)
a numeric matrix (each row a point) for a set of points (MULTIPOINT
or LINESTRING)
a list of matrices for a set of sets of points (MULTILINESTRING,
POLYGON)
a list of lists of matrices (MULTIPOLYGON)
a list of anything mentioned above (GEOMETRYCOLLECTION) – all
other classes also fall in one of these categories.
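A minimal sketch of building geometries from these structures (the coordinates here are arbitrary, and the sf package is assumed to be loaded as above):
# a single point (sfg) from a numeric vector
pt <- st_point(c(-78.6, 35.8))
# a linestring from a numeric matrix, each row a point
ln <- st_linestring(matrix(c(0, 0, 1, 1, 2, 0), ncol = 2, byrow = TRUE))
# combine the geometries into a simple feature geometry column (sfc)
st_sfc(pt, ln)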
To explore this, the code below loads a spatial dataset of polygons
representing the counties in North Carolina, USA that is included in the
sf package (see https://cran.r-project.org/web/packages/spdep/vignettes/sids.pdf):
# read the data
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = T)
The nc object can be mapped using the qtm() function in the tmap
package, a bespoke mapping package (Tennekes, 2018), as in Figure
2.1. Note that tmap can work with both sp and sf data formats.
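The exact plotting call is not reproduced here; a minimal version of the kind used to produce Figure 2.1 would be:
# quick thematic map of the nc polygons
qtm(nc)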
Figure 2.1
However, at the time of writing (and for some years yet) sp formats are
still required by many packages for spatial analysis which have not yet
been updated to work with sf either indirectly or directly. Many packages
still have dependencies on sp. An example is the GWmodel package (Lu
et al., 2014) for geographically weighted regression (Brunsdon et al.,
1996) which at the time of writing still has dependencies on sp,
maptools, spData and other packages. If you install and load GWmodel
you will see these packages being loaded. However, ultimately sf
formats will completely replace sp and packages that use sp will
eventually be updated to use sf, but that may be a few years away.
It is easy to convert between sp and sf formats:
# sf to sp
nc_sp <- as(nc, "Spatial")
# sp to sf
nc_sf = st_as_sf(nc_sp)
You can see that nc has 100 rows and 15 columns and that it is of sf
class as well as data.frame. For sf-related formats, R prints out just
the first 10 records and all columns of the spatial data table, as with
tibble. Try entering:
nc
You can generate summaries of the nc data table and the attributes it
contains:
summary(nc)
The geometry and attributes of sf objects can be plotted using the plot
method defined in the sf and sp packages:
# sf format
plot(nc)
plot(st_geometry(nc), col = "red" )
# add specific counties to the plot
plot(st_geometry(nc[c(5,6,7,8),]), col = "blue", add = T)
# sp format
plot(nc_sp, col = "indianred")
# add specific counties to the plot
plot(nc_sp[c(5,6,7,8),], col = "dodgerblue",
add = T)
Spatial data attributes can be mapped using the qtm() function. The
code below calculates two rates (actually non-white live births) and then
generates Figure 2.2:
# calculate rates
nc$NWBIR74_rate = nc$NWBIR74/nc$BIR74
nc$NWBIR79_rate = nc$NWBIR79/nc$BIR79
qtm(nc, fill = c("NWBIR74_rate",
"NWBIR79_rate"))
The code above starts to hint at the similarities between data tables and
spatial data. Recall that the difference between spatial data and ordinary
wide data tables is that each observation in a spatial dataset is
associated with a location – a point, line or area. The result is that the
data tables behind spatial data formats can be interrogated, selected,
and so on in much the same way as ordinary data tables, using the same
kinds of operations, but the results have a spatial dimension. This is the
case for spatial data in both sp and sf formats.
Select records and data fields (to return a data table, not spatial
data):
names(nc)
# select the first 50 elements from a single field using $
nc$AREA[1:50]
# randomly select 50 elements from a single field using $
nc$AREA[sample(nrow(nc), 50)]
# select the first 10 records from multiple fields
st_drop_geometry(nc[1:10, c("AREA", "BIR79", "SID79")])
st_drop_geometry(nc[1:10, c(1,12,13)])
The vignettes supplied with the sf package describe these and other operations
in more detail; they can be listed with:
library(sf)
vignette(package = "sf")
You may have noticed that the North Carolina dataset was read using the
st_read function. Spatial data in sf format can also be written using the
st_write() function. For example, to write nc (or any other simple
features object) to a spatial data format, at least two arguments are
needed: the object and a filename.
The code below writes the nc object to a shapefile format. Note that this
will fail if the nc.shp file already exists in the working directory unless the
delete_layer = T parameter is specified:
st_write(nc, "nc.shp", delete_layer = T)
The filename is taken as the data source name. The default for the layer
name is the basename (filename without path) of the data source name.
For this, st_write guesses the driver (for the format of the external
spatial data file), which in this case is an ESRI shapefile. The above
command is, for instance, equivalent to:
st_write(nc, dsn = "nc.shp", layer = "nc.shp",
         driver = "ESRI Shapefile", delete_layer = T)
Typical users will use a filename that includes a path, or first set R’s
working directory with setwd() and use a filename without a path.
Spatial data in sf format can also be written out to a range of different
spatial data formats:
# as GeoPackage
st_write(nc, "nc.gpkg", delete_layer = T)
These files can be read back into R using the st_read function:
new_nc1 = st_read("nc.shp")
new_nc1
new_nc2 = st_read("nc.gpkg")
new_nc2
Finally, the data table can be extracted from sf and sp objects (i.e.
without spatial attributes). For sf objects the geometry can be set to
NULL or the st_drop_geometry() function can be used. For sp
objects, the data table can be accessed using the data.frame function or
the @data syntax. In all cases a data.frame is created and can be
examined:
# 1. sf format
# create a copy of the nc data
nc2 <- nc
# 1a. remove the geometry by setting to NULL
st_geometry(nc2) <- NULL
class(nc2)
head(nc2)
# 1b. remove geometry using st_drop_geometry
nc2 = st_drop_geometry(nc)
class(nc2)
head(nc2)
# 2. sp format
# 2a. using data.frame
class(data.frame(nc_sp))
head(data.frame(nc_sp))
# 2b. using @data
class(nc_sp@data)
head(nc_sp@data)
Key Points
The simple features format for spatial data is defined in the sf
package.
The sf format is gradually replacing the sp format, but some
interchange is necessary as some packages still only take sp format
as inputs.
Spatial data tables (in both sf or sp formats) can be interrogated in
much the same way as ordinary data tables.
A number of different ways of reading and writing spatial data from
and to external files were illustrated.
Some basic mapping of spatial data and spatial data attributes was
introduced.
2.3 THE TIDYVERSE AND TIDY DATA
A collection of R libraries often referred to as the tidyverse was created
by Hadley Wickham and associates working at RStudio to provide a more
streamlined approach to manipulating and analysing data in R. Two key
characteristics of the tidyverse are tidy data and pipelining.
Recall that the tidy data idea essentially requires (1) a single variable in
its own column, (2) each observation in its own row, and (3) each value
in its own cell. An example of a dataset not complying with this can be
derived from the AirPassengers dataset supplied in the package
datasets:
AirPassengers
##      Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1949 112 118 132 129 121 135 148 148 136 119 104 118
## 1950 115 126 141 135 125 149 170 170 158 133 114 140
## 1951 145 150 178 163 172 178 199 199 184 162 146 166
## 1952 171 180 193 181 183 218 230 242 209 191 172 194
## 1953 196 196 236 235 229 243 264 272 237 211 180 201
## 1954 204 188 235 227 234 264 302 293 259 229 203 229
## 1955 242 233 267 269 270 315 364 347 312 274 237 278
## 1956 284 277 317 313 318 374 413 405 355 306 271 306
## 1957 315 301 356 348 355 422 465 467 404 347 305 336
## 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 1960 417 391 419 461 472 535 622 606 508 461 390 432
We will consider the details of the above code in Chapter 3, but for now
note that the data frame AirPassengers_tidy is tidy. Each observation is
a row: a single row contains one passenger count, together with the month
and the year it refers to. Similarly, the three variables Count, Month and
Year each have their own column.
Note also that in this format, the order of the rows does not affect the
content of the data, in contrast to the original AirPassengers data,
which implicitly assumed that counts were recorded month by month
starting from January 1949.
A final point here is that the year was stored in the rownames attribute of
the data frame (again, the tidy principle suggests this should be a
standard variable), here called Year. Again, all of the variables have a
column – but it should be an ordinary column to avoid confusion, rather
than be ‘hidden’ in a different kind of attribute called rowname.
Of course, it cannot be guaranteed that all the data that a data scientist
receives will be tidy. However, a key aspect of the tidyverse in terms
of data manipulation and ‘wrangling’ is that it provides a suite of functions
that take tidy data as input, and return transformed but still tidy data as
output. To complement this, there are also tools to take non-tidy data and
restructure them in a tidy way. Some of these come with base R (e.g.
AirPassengers_df$Year <- rownames(AirPassengers_df)
ensures the row names are stored as a standard variable), and others
are supplied as part of the tidyverse (e.g. gather()). A standard
workflow here is as follows:
1. Make data tidy if they are not so already.
2. Manipulate the data through a series of ‘tidy-to-tidy’ transformations.
Key Points
Tidy data are defined as data in which a single variable has its own
column, each observation is in its own row, and each value has its
own cell.
Untidy data can be converted into tidy format using the gather()
function.
2.4 DPLYR FOR MANIPULATING DATA
(WITHOUT PIPES)
2.4.1 Introduction to dplyr
The dplyr package is part of tidyverse and provides a suite of tools
for manipulating and summarising data tables and spatial data. It
contains a number of functions for data transformations that take a tidy
data frame and return a transformed tidy data frame. A very basic
example of this is filter(): this simply selects a subset of a data
frame on the basis of its rows. Given the requirement that each row
corresponds to a unique observation, this effectively subsets
observations.
We can use the air passenger data, for example, in order to find the
subset of the data for the passenger counts in March:
head(filter(AirPassengers_tidy, Month == "Mar"))
## Year Month Count
## 1 1949 Mar 132
## 2 1950 Mar 141
## 3 1951 Mar 178
## 4 1952 Mar 193
## 5 1953 Mar 236
## 6 1954 Mar 235
A few things to note here:
1. The first argument of filter is the tidy data frame.
2. The second argument is a logical expression specifying the subset.
3. Column names in the data frame can be used in the logical
expression more or less as though they were ordinary variables.
4. The output, as stated, is a tidy data frame as well – in the same
format as the input, but with fewer observations (only those whose
month is March).
The logical expression can be any valid logical expression in R – thus >,
<, <=, >=, == and != may all be used, as can & (and), | (or) and ! (not).
The & operator is not always necessary, as it is possible to apply multiple
conditions in the same filter operation by adding further logical
arguments:
filter(AirPassengers_tidy, Month == "Mar", Year > 1955)
## Year Month Count
## 1 1956 Mar 317
## 2 1957 Mar 356
## 3 1958 Mar 362
## 4 1959 Mar 406
## 5 1960 Mar 419
head(arrange(AirPassengers_tidy, desc(Year)))
The arrange() function reorders the rows of a tidy data frame, here by
descending Year. The same verbs can be applied to the state_tidy data
table constructed from R's built-in state.x77 dataset. Because the life
expectancy variable name Life Exp contains a space, we need to enclose it
in backward quotes when referring to it. Again this
transformation has the tidy-to-tidy pattern so that an additional filter
command could be applied to the output. Having arranged the state data
as above, we may want to filter out the places where average life
expectancy falls below 70 years:
filter(arrange(state_tidy, Region, Division, `Life Exp`), `Life Exp` < 70)
Key Points
The dplyr package was introduced.
Some basic dplyr operations were illustrated, including filter to
select rows and arrange to order rows.
The syntax for dplyr functions is standardised with the first
argument always the input data. This is important for piping (see
Chapter 3).
2.4.2 Single-table manipulations: dplyr verbs
The filter() and arrange() functions are just two of the dplyr
verbs (the package developers call them ‘verbs’ because they do
something!). These can be used to manipulate data singly or in a nested
sequence of operations with intermediate outputs. Chapter 3 shows that
these dplyr tools can be used in a piping syntax that chains sequences
of data operations together in a single data manipulation workflow. The
important dplyr verbs are summarised in Table 2.4. The code snippets
below illustrate how some of the single-table dplyr verbs can be applied
and combined in a non-piped way.
Table 2.4 The dplyr verbs
The dplyr verbs can be used to undertake a number of different
operations on a dataset to subset and fold it in different ways. We have
already used filter() to select records that fulfil some kind of logical
criterion and arrange() to reorder them. The code below uses
mutate() to create a population density attribute (pop_dens) and
assigns the result to a temporary object (tmp) which can be examined:
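A minimal sketch of that operation (assuming the state_tidy data frame with Population and Area columns, whose construction is shown later in the text) might be:
tmp = mutate(state_tidy, pop_dens = Population/Area)
head(tmp)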
Such manipulations can be combined with other dplyr verbs. The code
below calculates population density, creating a variable called
pop_dens. It then selects this and the Name variables and assigns the
result to tmp:
tmp = select(
  mutate(state_tidy, pop_dens = Population/Area),
  Name, pop_dens)
head(tmp)
# group by Division and summarise the mean income to illiteracy ratio
tmp = arrange(
  summarise(
    group_by(
      mutate(state_tidy, Inc_Illit = Income/Illiteracy),
      Division),
    mean_II = mean(Inc_Illit)),
  mean_II)
You can inspect the results in tmp. But notice how complex this is getting
– keeping track of the nested statement and the parentheses!
It would be preferable not to have to create intermediary variables like
tmp for longer chains of analysis. This is especially the case if we want to
do something like pass the results to ggplot so that they can be
visualised, such as in an ordered lollipop plot as in Figure 2.3:
ggplot(data = tmp, aes(x = mean_II, y = fct_reorder(Division, mean_II))) +
  geom_point(stat = 'identity', fill = "black", size = 3) +
  geom_segment(aes(y = Division, x = 0, yend = Division, xend = mean_II),
               color = "black") +
  labs(y = "Ordered Division",
       x = "Mean Income to Illiteracy ratio") +
  theme(axis.text = element_text(size = 7))
The above code may appear confusing, but each line before the line with
ggplot() can be run to see what the code is doing with the data. The
aim here is to stress the importance of understanding the operation of
these verbs as they underpin piping and data analytics with databases
described in Chapter 4. So, for example, try examining the following
sequence of code snippets:
state_tidy %>%
mutate(Inc_Illit = Income/Illiteracy)
Then…
state_tidy %>%
mutate(Inc_Illit = Income/Illiteracy) %>%
group_by(Division)
And finally…
state_tidy %>%
mutate(Inc_Illit = Income/Illiteracy) %>%
group_by(Division) %>%
summarise(mean_II = mean(Inc_Illit))
Key Points
The dplyr package has a number of verbs for data table
manipulations.
These can be used to select variables or fields and to filter records
or observations using logical statements.
They can be used to create new variables using the mutate()
function.
Group summaries can be easily generated using a combination of
the group_by() and summarise() functions.
Sequences of nested dplyr verbs can be applied in a code block.
Many dplyr operations can be undertaken on sf format spatial
data.
2.4.3 Joining data tables in dplyr
Tables can be joined through an attribute that they have in common. To
illustrate this, the code below loads some census data for Liverpool held
in an R binary file (ch2.Rdata) and saves it to your current working
directory – you may want to clear your workspace to check this:
rm(list = ls())
getwd()
download.file("http://archive.researchdata.leeds.ac.uk/731/1",
              "./ch2.RData", mode = "wb")
load("ch2.RData")
ls()
## [1] "lsoa" "lsoa_data"
You will notice that two objects lsoa and lsoa_data are loaded into
your R session (enter ls()). The lsoa object is a spatial layer of Lower
Super Output Areas for Liverpool in sf format with 298 areas, and
lsoa_data is a tibble with 298 rows and 11 columns.
The design of the UK Census reporting areas aimed to provide a
consistent spatial unit for reporting population data at different scales.
Lower Super Output Areas (Martin, 1997, 1998, 2000, 2002) contain
around 1500 people. This means that their size is a function of population
density. UK Census data can be downloaded from the Office for National
Statistics (https://www.nomisweb.co.uk) and spatial layers of Census
areas from the UK Data Service (https://borders.ukdataservice.ac.uk).
You could examine the tibble and sf objects that have been loaded:
lsoa_data
lsoa
plot(st_geometry(lsoa), col = "thistle")
Both have a field called code and this can be used to link the two
datasets using one of the dplyr join functions:
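For example, an inner join on the shared code field might look like this (the object name lsoa_join is reused in the mapping code later in the chapter):
lsoa_join = inner_join(lsoa, lsoa_data, by = "code")
lsoa_join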
It is also possible to specify join fields in situations where the names are
not the same. To illustrate this the code below changes the name of
code in lsoa_data, specifies the join and then renames the field back
to code:
names(lsoa_data)[1] = "ID"
lsoa_join = inner_join(lsoa, lsoa_data, by = c("code" = "ID"))
names(lsoa_data)[1] = "code"
Table 2.5 The dplyr join functions
It is instructive to see how they work with lsoa_data and lsoa, using
some mismatched data. The code below randomly samples 250 records
from lsoa_data and the inner join means that only records that match
are returned, and these are passed to dim to get the object dimensions:
set.seed(1984)
lsoa_mis = sample_n(lsoa_data, 250)
# nest the join inside dim
dim(inner_join(lsoa_mis, lsoa))
## [1] 250 12
The different join types can return different results. The code snippets
below illustrate this using dim():
dim(inner_join(lsoa, lsoa_mis))
dim(left_join(lsoa, lsoa_mis))
dim(right_join(lsoa, lsoa_mis))
dim(semi_join(lsoa, lsoa_mis))
dim(anti_join(lsoa, lsoa_mis))
You should compare the results when x and y are changed, especially
for left and right joins. The links between two tables can be defined in
different ways and cardinality enforced between them through the choice
of join method. In a perfect world all joins would result in a seamless,
single match between each common object in each table, with no
unmatched records and no ambiguous matches. Most cases are not like
this and the different join types essentially enforce rules for how
unmatched records are treated. This is how dplyr enforces cardinality,
or relations based on the uniqueness of the values in the data columns being
joined. In this context, the mutating joins combine variables from both x
and y inputs, filtering the results to include all of the records from the
input that is given prominence in the join. So only matching records in x
and y are returned with an inner join, all of the elements of y are returned
for a right join, and so on.
The help section for dplyr joins has some other worked examples that
you can copy and paste into your console. To see these, run:
?inner_join
You should also work through the vignette for dplyr joins:
vignette("two-table", package = "dplyr")
As with the dplyr single-table verbs, various joins are used extensively
throughout this book.
Putting it all together, single-table manipulations and joins can be
combined (in fact, they usually are combined). The code below mixes
some single-table verbs and joining operations to create an
unemployment rate, selects the areas within the top 25% and passes this
to qtm:
qtm(
  filter(
    mutate(
      inner_join(lsoa, lsoa_data),
      unemp_pc = (econ_active - employed)/emp_pop),
    unemp_pc > quantile(unemp_pc, 0.75)),
  "tomato")
Key Points
Data tables can be joined through an attribute that they have in
common.
This can be done in different directions, preserving the inputs in
different ways (compare left_join with right_join and
inner_join).
Spatial data tables in sf format can be joined in the same way,
preserving the spatial properties.
2.5 MAPPING AND VISUALISING SPATIAL
PROPERTIES WITH TMAP
Spatial data have been introduced and mapped in this chapter using the
qtm() function (which generates a quick thematic map); much greater
control over the mapping of spatial data can be exercised using the full
tmap syntax. This section briefly outlines the tmap package for creating
maps from spatial data, providing information on the structure and syntax
of calls to tmap. A more comprehensive and detailed treatment of tmap
can be found in Brunsdon and Comber (2018).
The tmap package supports the thematic visualisation of spatial data
(Tennekes, 2018). It has a grammatical style that handles each element
of the map separately in a series of layers (it is similar to ggplot in this
respect – see Chapter 5). In so doing it seeks to exercise control over
each element in the map. This is different from the basic R plot functions.
The basic syntax of tmap is:
tm_shape(data = <data>) +
tm_<function>()
Do not run the code above; it simply shows the syntax of tmap. However,
you have probably noticed that it uses a similar syntactical style to
ggplot to join together layers of commands, using the + sign. The
tm_shape() function initialises the map and then layer types, the
variables that get mapped, and so on are specified using different
flavours of tm_<function>. The main types are listed in Table 2.6.
Table 2.6
Let us start with a simple choropleth map. These maps show the
distribution of a continuous variable in different elements of the spatial
data (typically polygons/areas or points). The code below creates new
variables for percentage unemployed, percentage under 25 and
percentage over 65:
lsoa_join$UnempPC = (lsoa_join$unemployed/lsoa_join$econ_active)*100
lsoa_join$u25PC = (lsoa_join$u25/lsoa_join$age_pop)*100
lsoa_join$o65PC = (lsoa_join$o65/lsoa_join$age_pop)*100
p1 = tm_shape(lsoa_join) +
  tm_polygons("UnempPC", palette = "GnBu", border.col = "salmon",
              breaks = seq(0, 35, 5), title = "% Unemployed") +
  tm_layout(legend.position = c("right", "top"), legend.outside = T)
p1
And of course many other elements can be included either by running the
code snippet defining p1 above with additional lines or by simply adding
them as in the code below:
p1 + tm_scale_bar(position = c("left", "bottom")) +
  tm_compass(position = c(0.1, 0.1))
boundary = st_union(lsoa)
p1 + tm_shape(boundary) + tm_borders(col = "black", lwd = 2)
The tmap package can be used to plot multiple attributes in the same
plot, in this case unemployment and over 65 percentages (Figure 2.5):
tm_shape(lsoa_join) +
  tm_fill(c("UnempPC", "o65PC"), palette = "YlGnBu",
          breaks = seq(0, 40, 5), title = c("% Unemp", "% Over 65")) +
  tm_layout(legend.position = c("left", "bottom"))
Figure 2.5 Choropleth maps of UnempPC and o65PC
The code below applies further refinements to the choropleth map to
generate Figure 2.6. Notice the use of title and legend.hist, and
then the subsequent parameters passed to tm_layout to control the
legend:
tm_shape(lsoa_join) +
  tm_polygons("u25PC", title = "% Under 25", palette = "Reds",
              style = "kmeans", legend.hist = T) +
  tm_layout(title = "Under 25s in \nLiverpool",
            frame = F, legend.outside = T,
            legend.hist.width = 1,
            legend.format = list(digits = 1),
            legend.outside.position = c("left", "top"),
            legend.text.size = 0.7,
            legend.title.size = 1) +
  tm_compass(position = c(0.1, 0.1)) +
  tm_scale_bar(position = c("left", "bottom")) +
  tm_shape(boundary) + tm_borders(col = "black", lwd = 2)
Figure 2.6 A refined choropleth map of u25PC
This is quite a lot of code to unpick. The first call to tm_shape()
determines which layer is to be mapped, then a specific mapping function
is applied to that layer (in this case tm_polygons()) and a variable is
passed to it. Other parameters specify the palette, whether a
histogram is to be displayed and the type of breaks to be imposed (here
a k-means classification was applied). The help for tm_polygons() describes a
number of different parameters that can be passed to tmap elements.
Next, some adjustments are made to the defaults through the
tm_layout function, which allows you to override the defaults that tmap
automatically assigns (shading scheme, legend location, etc.). Finally,
the boundary layer is added.
There are literally thousands of options with tmap and many of them are
controlled in the tm_layout() function. You should inspect this:
?tm_layout
The code below creates and maps point data values in different ways.
There are two basic options: change the size or change the shading of
the points according to the attribute value. The code snippets below
illustrate these approaches.
Create the point layer:
lsoa_pts = st_centroid(lsoa_join)
tmap_mode("plot")
If the working directory needs to be changed, this can be done via Session >
Set Working Directory > Choose Directory. The IBDSDA package can then be
installed using the code below, or manually from the package archive
(.tar.gz) file using the RStudio menu: Tools > Install Packages >
Install from (dropdown) > Package Archive File.
# download the package zip file
download.file("http://archive.researchdata.leeds.ac.uk/733/1",
              "./IBDSDA_0.1.2.tar.gz", mode = "wb")
# install the package
install.packages("IBDSDA_0.1.2.tar.gz", type = "source", repos = NULL)
# state_tidy
state_tidy <- as.data.frame(state.x77)
state_tidy$Division <- as.character(state.division)
state_tidy$Region <- as.character(state.region)
state_tidy$Name <- rownames(state.x77)
# AirPassengers_tidy (gather() is from the tidyr package, part of the tidyverse)
AirPassengers_df <- matrix(datasets::AirPassengers, 12, 12, byrow = TRUE)
rownames(AirPassengers_df) <- 1949:1960
colnames(AirPassengers_df) <- c("Jan","Feb","Mar","Apr","May","Jun",
                                "Jul","Aug","Sep","Oct","Nov","Dec")
AirPassengers_df <- data.frame(AirPassengers_df)
AirPassengers_df$Year <- rownames(AirPassengers_df)
AirPassengers_tidy <- gather(AirPassengers_df, key = Month, value = Count, -Year)
as_tibble(AirPassengers_tidy)
## # A tibble: 144 x 3
## Year Month Count
## <chr> <chr> <dbl>
## 1 1949 Jan 112
## 2 1950 Jan 115
## 3 1951 Jan 145
## 4 1952 Jan 171
## 5 1953 Jan 196
## 6 1954 Jan 204
## 7 1955 Jan 242
## 8 1956 Jan 284
## 9 1957 Jan 315
## 10 1958 Jan 340
## # … with 134 more rows
Key Points
Methods for installing packages saved as local archives (.tar and
.zip files) were introduced.
This chapter introduces the use of piping syntax with the dplyr
package, providing a core framework for data and spatial data
analyses.
3.2 INTRODUCTION TO PIPELINES OF TIDY
DATA
In Chapter 2 the code snippet below was used to filter out records
(places) from the state_tidy data table where average life expectancy
falls below 70 years.
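That snippet, repeated here for reference, was:
filter(arrange(state_tidy, Region, Division, `Life Exp`), `Life Exp` < 70)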
Other code snippets nested a sequence of dplyr verbs to summarise
mean income to illiteracy ratios for the nine divisions in the USA:
arrange(
summarise(
group_by(
mutate(state_tidy, Inc_Illit =
Income/Illiteracy),
Division),
mean_II = mean(Inc_Illit)),
mean_II)
Note that the expression above is quite hard to read, as there are several
layers of parentheses, and it is hard to see immediately which
parameters relate to the individual dplyr verbs for sorting, filtering,
summarising, grouping and arranging. The dplyr package presents an
alternative syntax that overcomes this, using the pipeline operator %>%.
Effectively this allows a variable (usually a tidy data frame) to be placed
in front of a function so that f(x) is replaced by x %>% f. So, rather
than the function f being applied to x, x is piped into the function. If there
are multiple arguments for f, we have f(x,y,z,…) replaced by x %>%
f(y,z,…) – thus typically the first argument is placed to the left of the
function.
The reason why this can be helpful is that it makes it possible to specify
chained operations of functions on tidy data as a left-to-right narrative.
The pipelined versions of the above operations are:
state_tidy %>% arrange(Region, Division, `Life Exp`) %>%
  filter(`Life Exp` < 70)
and:
state_tidy %>%
mutate(Inc_Illit = Income/Illiteracy) %>%
group_by(Division) %>%
summarise(mean_II = mean(Inc_Illit))
From the piped code snippets above, it is clear that the input data frame
is state_tidy in both cases, and in the first that an arrange operator
is applied, and then a filter. Since this is not stored anywhere else,
the result is printed out. Similarly, the second code snippet first applies a
mutate operation and then groups the result using group_by, before
calculating a group summary.
Inspecting piped operations
One key thing to note in the construction of pipelines of operations is that
the action of each operation in the pipeline can be inspected.
If the input is a data.frame then the function head() can be inserted
(and then removed) after each %>% in the pipeline. So, for example, the
above code could be inspected as follows:
state_tidy %>%
mutate(Inc_Illit = Income/Illiteracy) %>%
head()
If the input is a tibble then the code could be run to just before the pipe
operator as the print function for tibble formats prints out the first 10
lines:
# make a tibble
state_tidy %>% as_tibble() %>%
mutate(Inc_Illit = Income/Illiteracy)
The result of a pipeline can be assigned to an object in the usual way, with
<- and the object name to the left of the first piped operation. However, for
some this sits uneasily with the left-to-right flow of the piping operators,
as the final operation (assignment) then appears at the far left-hand side.
An alternative R assignment operator exists which maintains the left-to-right
flow. The operator is ->:
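For example, the grouped summary above can be assigned at the end of the pipeline (the object name division_means is arbitrary):
state_tidy %>%
  mutate(Inc_Illit = Income/Illiteracy) %>%
  group_by(Division) %>%
  summarise(mean_II = mean(Inc_Illit)) -> division_means
division_means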
The tibble format was introduced in Chapter 2. Recall that tibbles are
basically data frames with a few modifications to help them work better in
a tidyverse framework and that a standard data frame can be converted
to a tibble with the function as_tibble:
state_tbl <- as_tibble(state_tidy)
state_tbl
## # A tibble: 50 x 11
##    Population Income Illiteracy `Life Exp` Murder `HS Grad`
##         <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl>
##  1       3615   3624        2.1       69.0   15.1      41.3
##  2        365   6315        1.5       69.3   11.3      66.7
##  3       2212   4530        1.8       70.6    7.8      58.1
##  4       2110   3378        1.9       70.7   10.1      39.9
##  5      21198   5114        1.1       71.7   10.3      62.6
##  6       2541   4884        0.7       72.1    6.8      63.9
##  7       3100   5348        1.1       72.5    3.1      56
##  8        579   4809        0.9       70.1    6.2      54.6
##  9       8277   4815        1.3       70.7   10.7      52.6
## 10       4931   4091        2         68.5   13.9      40.6
## # … with 40 more rows, and 5 more variables: Frost <dbl>,
## #   Area <dbl>, Division <chr>, Region <chr>, Name <chr>
Key Points
The dplyr package supports a piping syntax.
This allows sequences of dplyr operations to be chained together
with the pipeline operator %>%.
In these left-to-right sequences, the output of one operation
automatically becomes the input to the next.
The final operation can be assignment to an R object, and this can
be done using alternative R assignment operators:
– using -> at the end of the piped operations, which maintains
the left-to-right flow;
– using <- before the first piped operation, in a more ‘classic’ R
way.
The final operation can also pipe the results to other actions such as
ggplot() (see the end of Chapter 2).
3.3 THE DPLYR PIPELINING FILTERS
As well as filter and arrange, a number of other tidy-to-tidy
transformations exist. In this section some key ones will be outlined.
3.3.1 Using select for column subsets
While filter selects subsets of rows from a data frame, select
selects subsets of columns. Returning to state_tbl, a basic selection
operation works as follows:
state_tbl %>% select(Region, Division, Name,
`Life Exp`)
This selects the columns Region, Division, Name and Life Exp from
state_tbl. Note also that ordering of columns matters:
state_tbl %>% select(Name, Division, Region,
`Life Exp`)
You can also use the minus sign on ranges – although parentheses
should be used:
state_tbl %>% select(-(Population:Murder))
There are also operations allowing selections based on the text in the
column name. These are starts_with, ends_with, contains and
matches. For example:
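A simple illustration with the state data (the column choices here are just examples):
# columns whose names start with "I" (Income, Illiteracy)
state_tbl %>% select(starts_with("I"))
# columns whose names contain "Exp" (Life Exp)
state_tbl %>% select(contains("Exp"))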
These are useful when a group of column names have some common
pattern – for example, Age 0 to 5, Age 6 to 10 and so on could
be selected using starts_with("Age"). matches is more complex,
as it uses regular expressions to specify matching column names. For
example, suppose the data also had a column called Agenda – this
would also be selected using starts_with("Age") – but using
matches("Age [0-9]") would correctly select only the first set of
columns. There is also a function called num_range which selects
variables with a common stem and a numeric ending – for example,
num_range("Type_",1:4) selects Type_1, Type_2, Type_3 and
Type_4.
The helper function everything() selects all columns. This is
sometimes useful for changing the order in which columns appear. In a
tidy data frame column order does not matter in terms of information
stored – a consequence of the requirement that every variable has a
column – but in presentational terms reordering is useful, for example so
that some variables of current interest appear leftmost (i.e. first) when the
table is listed. The following ensures the Name and then the Life Exp
columns come first:
state_tbl %>% select(Name, `Life Exp`, everything())
The reason why this works is that Name and then Life Exp are selected
in order followed by all of the columns in current order. This will include
the previous columns, but if a column name is duplicated, the column
only appears once, in the first position it is listed. So effectively
everything() here means the rest of the columns in order.
A quick note on select. Lots of packages have a function named
select, including for example raster, dplyr and MASS. This is the
most common cause of a select operation not working! If you have
multiple packages containing a function named select, then you need
to tell R which one to use. The syntax for doing this uses a double colon
as follows:
state_tbl %>% dplyr::select(Name, `Life Exp`, everything())
Often it is a good idea to avoid overwriting unless the results are easy to
undo – otherwise it becomes very difficult to roll back if you make an
error. Here, no long-term damage was done since although the mutate
operation was carried out and a tibble object was created with a
redefined Income column, the new tibble was not stored anywhere –
and, most importantly, it was not overwritten onto state_tbl.
A mutate expression can involve any of the columns in a tibble. Here, a
new column containing income per head of population is created:
state_tbl %>% mutate(`Per Capita` = Income/Population) %>%
  select(Name, Income, Population, `Per Capita`)
The pipelining idea can be utilised to advantage here. To create the per
capita income column and then sort on this column, reorder the columns
to put the state name and per capita income columns to the left, storing
the final result into a new tibble called state_per_cap, enter:
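A sketch of such a pipeline, under the assumption that the sort is in descending order of per capita income:
state_tbl %>%
  mutate(`Per Capita` = Income/Population) %>%
  arrange(desc(`Per Capita`)) %>%
  select(Name, `Per Capita`, everything()) -> state_per_cap
state_per_cap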
This is of course a matter of opinion, but the authors feel that the dplyr
version is more easily understood.
Expressions can also contain variables external to the tibble. For
example, we added the Division column to the original data frame
state_tidy based on the external variable state.division using
state_tidy$Division <- as.character(state.division)
in standard R. A dplyr version of this would be:
state_tidy %>%
  mutate(Division = as.character(state.division)) -> state_tidy
state_tidy
A related operation is transmute. This is similar to mutate, but while
mutate keeps the existing columns in the result alongside the derived
ones, transmute only returns the derived results. For example:
state_tbl %>%
transmute(LogIncome=log10(Income),
LogPop=log10(Population))
To keep some of the columns from the original, but without transforming
them, just state the names of these columns:
state_tbl %>%
transmute(Name,Region,LogIncome=log10(Income),
LogPop=log10(Population))
## # A tibble: 50 x 4
## Name Region LogIncome LogPop
## <chr> <chr> <dbl> <dbl>
## 1 Alabama South 3.56 3.56
## 2 Alaska West 3.80 2.56
## 3 Arizona West 3.66 3.34
## 4 Arkansas South 3.53 3.32
## 5 California West 3.71 4.33
## 6 Colorado West 3.69 3.41
## 7 Connecticut Northeast 3.73 3.49
## 8 Delaware South 3.68 2.76
## 9 Florida South 3.68 3.92
## 10 Georgia South 3.61 3.69
## # … with 40 more rows
The summarise function collapses columns into single summary values, for example column totals:
state_tbl %>%
  summarise(TotPop = sum(Population), TotInc = sum(Income))
## # A tibble: 1 x 2
##   TotPop TotInc
##    <dbl>  <dbl>
## 1 212321 221790
In this case the output is a tibble, but because sum maps a column onto a
single number there is only one row, with elements corresponding to the
two summarised columns. Essentially, because the sums are for the
whole of the USA, the unit of observation is now the USA as a country.
However, we may wish to obtain the sums of these quantities by region.
This is where group_by is useful. Applying this function on its own has
little noticeable effect:
state_tbl %>% group_by(Region)
## # A tibble: 50 x 11
## # Groups:   Region [4]
##    Population Income Illiteracy `Life Exp` Murder `HS Grad`
##         <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl>
##  1       3615   3624        2.1       69.0   15.1      41.3
##  2        365   6315        1.5       69.3   11.3      66.7
##  3       2212   4530        1.8       70.6    7.8      58.1
##  4       2110   3378        1.9       70.7   10.1      39.9
##  5      21198   5114        1.1       71.7   10.3      62.6
##  6       2541   4884        0.7       72.1    6.8      63.9
##  7       3100   5348        1.1       72.5    3.1      56
##  8        579   4809        0.9       70.1    6.2      54.6
##  9       8277   4815        1.3       70.7   10.7      52.6
## 10       4931   4091        2         68.5   13.9      40.6
## # … with 40 more rows, and 5 more variables: Frost <dbl>,
## #   Area <dbl>, Division <chr>, Region <chr>, Name <chr>
The only visible effect is that when the tibble is printed it records that
there are groups by region. Internally the variable specified for grouping
is stored as an attribute of the tibble. However, if this information is then
passed on to summarise then a new effect occurs. Rather than applying
the summarising expressions to a whole column, the column is split into
groups on the basis of the grouping variable, and the function applied to
each group in turn.
Again, pipelining is a useful tool here – group_by is applied to a tibble,
and the grouped tibble is piped into a summarise operation:
state_tbl %>% group_by(Region) %>%
summarise(TotPop=sum(Population),TotInc=sum(Income))
## # A tibble: 4 x 3
## Region TotPop TotInc
## <chr> <dbl> <dbl>
## 1 North Central 57636 55333
## 2 Northeast 49456 41132
## 3 South 67330 64191
## 4 West 37899 61134
The output is still a tibble but now the rows are the regions. As well as the
summary variables TotPop and TotInc the grouping variable (Region)
is also included as a column. Similarly, we could look at the division as a
unit of analysis with a similar operation:
state_tbl %>% group_by(Division) %>%
summarise(TotPop=sum(Population),TotInc=sum(Income))
Using pipelines, a new tibble with divisional income per capita, sorted on
this quantity, can be created. Again, procedures like this benefit most
from dplyr. Here we store the result in a new tibble called
division_per_capita:
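A hedged sketch of the pipeline described above (the variable names and sort direction in the original may differ):
state_tbl %>%
  group_by(Division) %>%
  summarise(TotPop = sum(Population), TotInc = sum(Income)) %>%
  mutate(`Inc Per Cap` = TotInc/TotPop) %>%
  arrange(desc(`Inc Per Cap`)) -> division_per_capita
division_per_capita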
Another thing to see here is that there are seven rows of data, but
actually eight possible Region – Dense combinations. However, a row
is only created if a particular combination of the grouping variables
occurs in the input data. In this case, there is no dense state in the West
region, while all the other combinations do occur, hence the discrepancy
of one row.
3.3.4 group_by with other data frame operations
Although perhaps the most usual use of group_by is with summarise, it
is also possible to use it in conjunction with mutate (and transmute)
and filter. For mutate this is essentially because it is possible to use
column-to-single-value functions (such as mean, sum and max) as part of
an expression. Suppose we wish to standardise the life expectancy
values against a US national mean figure:
state_tbl %>% transmute(Name, sle = 100*`Life Exp`/mean(`Life Exp`))
## # A tibble: 50 x 2
## Name sle
## <chr> <dbl>
## 1 Alabama 97.4
## 2 Alaska 97.8
## 3 Arizona 99.5
## 4 Arkansas 99.7
## 5 California 101.
## 6 Colorado 102.
## 7 Connecticut 102.
## 8 Delaware 98.8
## 9 Florida 99.7
## 10 Georgia 96.7
## # … with 40 more rows
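The grouped version referred to below (its output is not reproduced here) simply adds a group_by() step before the transmute:
state_tbl %>% group_by(Region) %>%
  transmute(Name, sle = 100*`Life Exp`/mean(`Life Exp`))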
Now the values have changed – for example, although Alabama still has
a standardised life expectancy below 100, this is less marked when
considered at the regional level. Also, as with summarise the grouping
variable is implicitly carried through to the output tibble, so that even if the
transmute command did not specify this column, it still appears in the
output tibble.
As another example, arguably the mean value used in the calculation
above should be population weighted. The weighted.mean function
could be used to achieve this:
state_tbl %>% group_by(Region) %>%
  transmute(Name,
            region_sle = 100*`Life Exp`/weighted.mean(`Life Exp`, Population))
Another group of useful tools are window functions such as rank(), which
operate on a whole column but return a value for each row. Tied values result
in fractional ranks, and a rank of 1 is the lowest value. A simple trick to
reverse rank orders (i.e. to give the maximum value a rank of 1) is to change
the sign of the ranking column:
state_tbl %>% transmute(Name, `Rank LE` = rank(-`Life Exp`))
Window functions are also affected by grouping – in this case they are
applied to each group separately – so if the data were grouped by region
then the rank of Alabama would be in relation to the states in the South
region, and so on. Assuming no ties, we would expect four rank 1
observations in the entire tibble since it is grouped into the four regions.
Here is an example:
state_tbl %>% group_by(Region) %>%
transmute(Name,`Rank LE`=rank(-`Life Exp`))
Other useful window functions are scale (which returns z-scores) and
ntile. The latter assigns an integer value to each item in the column
based on ordering the values into n evenly sized buckets. Thus, if there
were 100 observations in column x, ntile(x,4) assigns 1 to the lowest
25 values, 2 to the next 25 and so on. Again, these will all be applied on
a groupwise basis if group_by has been used. Here ntile is used
regionally:
state_tbl %>% group_by(Region) %>%
  transmute(Name, `Illiteracy group` = ntile(Illiteracy, 3)) %>%
  arrange(desc(`Illiteracy group`))
The filter function also respects grouping in this way. So, for example,
to select all of the states whose illiteracy rate is above the mean for their
region, use:
state_tbl %>% group_by(Region) %>%
filter(Illiteracy > mean(Illiteracy)) %>%
select(Name,Illiteracy,everything())
## # A tibble: 23 x 11
## # Groups:   Region [4]
##    Name  Illiteracy Population Income `Life Exp` Murder
##    <chr>      <dbl>      <dbl>  <dbl>      <dbl>  <dbl>
##  1 Alab…        2.1       3615   3624       69.0   15.1
##  2 Alas…        1.5        365   6315       69.3   11.3
##  3 Ariz…        1.8       2212   4530       70.6    7.8
##  4 Arka…        1.9       2110   3378       70.7   10.1
##  5 Cali…        1.1      21198   5114       71.7   10.3
##  6 Conn…        1.1       3100   5348       72.5    3.1
##  7 Geor…        2         4931   4091       68.5   13.9
##  8 Hawa…        1.9        868   4963       73.6    6.2
##  9 Illi…        0.9      11197   5107       70.1   10.3
## 10 Loui…        2.8       3806   3545       68.8   13.2
## # … with 13 more rows, and 5 more variables: `HS
## #   Grad` <dbl>, Frost <dbl>, Area <dbl>, Division <chr>,
## #   Region <chr>
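The transmute operation discussed in the next paragraph is not reproduced in this extract; a minimal sketch of that kind of calculation, assuming the AirPassengers_tidy data in chronological order and a 12-month lag, might be:
library(dplyr)
AirPassengers_tidy %>%
  arrange(Year, match(Month, month.abb)) %>%           # chronological order
  transmute(Month, Change = Count - lag(Count, 12)) %>% # year-on-year change
  group_by(Month) %>%
  summarise(mean_change = mean(Change, na.rm = TRUE))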
Note that in the transmute operation, the Month is added as well as the
lagged changes so that it is carried through for the grouping operation.
Here it was necessary to perform the grouping after this step as it was
required to perform the lagging on the ungrouped data. Also,
na.rm=TRUE had to be added to the mean function, as the first 12 values
of Change were NA.
Key Points
The dplyr package has a number of operations for pipelining and
extracting data from data tables.
The output of piped operations is in the same format as the input
(tibble, data.frame, etc.).
The select function subsets columns by name, by named ranges and by
pattern matching such as starts_with; a minus sign in front of a
named variable causes it to be omitted (e.g. select(-Population)).
Many packages have a function named select, and the dplyr
version can be enforced using the syntax dplyr::select().
The mutate and transmute functions can be used to both create
new (column) variables and transform existing ones.
While mutate keeps the existing columns in the result alongside the
derived ones, transmute only returns the derived results.
The summarise function summarises numerical variables in a data
table.
The group_by function in conjunction with summarise allows
single or multiple group summaries to be calculated.
3.4 THE TIDY DATA CHAINING PROCESS
The previous sections outlined the main functions used to transform and
arrange data in the tidy format. There is a further set of operations to
allow some more analytical approaches, but for now these provide a
useful set of tools – and certainly provide a means for exploring and
selecting data prior to a more formal analysis. The advantage of this set
of methods is that they are designed to chain together so that the output
from one method can feed directly as input to the next. This is due to
some common characteristics – all of the pipelining operators take the
form
output <- method(input,method modifiers…)
where input and output are data frames or tibbles, and modifiers
provide further information for the method. Another characteristic is that
the modifiers can refer to the column names in a data frame directly –
that is, col1 can be used rather than df$col1. Both of these are
conventions rather than necessities, but make coding easier as users can
depend on the functions behaving in a certain way. Although the method
template above was in ‘classic R’ (or base R) format, in pipeline form it
would be
input %>% method(method modifiers…) -> output
However, this assumes that data are always supplied in tidy form, and
that the final output required is a tidy tibble or data frame. Often this is not
the case – data are provided in many forms, and although tidy data
frames are easy to handle, the final output may well be more
understandable and easily viewed in some other format. This might, for
example, be a rectangular table with columns for months and rows for
years (not tidy, but fits more neatly on a printed page or web browser
window), or the output may require graphical representation via a graph
or a map.
A more complete workflow than that proposed in Chapter 2 might be as
follows:
1. Read in untidy data.
2. Make it tidy.
3. Perform a chain of tidy-to-tidy transforms with mutate, filter, etc.
4. Take tidy data and transform to an output to communicate results –
typically a table or a graphic.
Good practice here is to separate the process into these distinct sets of
operations in the given order. The data selection, transformation and
possibly analysis should take place in step 3 (as a series of chained
operations), whereas step 1 should be regarded as an extended ‘reading-
in’ process. Step 4 should be regarded as transforming the final outcome
of data processing in order to better communicate.
The whole process can then be thought of as an extended pipeline with
an optional output as in Figure 3.1.
Each box or step in the pipeline may consist of multiple instructions, and
certainly the middle box may be expressed explicitly in dplyr pipelined
operations. A further characteristic is that the output of step 1, the input
and output of step 2 and the input of step 3 should all be in tidy data
format. Section 3.3 outlined a set of operators that may be used in step 2;
in this section we focus on some tools for use in steps 1 and 3.
Figure 3.1 The recommended pipeline form for tidy data manipulation
3.4.1 Obtaining data
On many occasions, obtaining and reading in data is relatively
straightforward – common file formats are .RData or .Rda, .csv, and
.xls or .xlsx (MS Excel).
The first of these, .RData or .Rda, can be dealt with in base R via the
load function. For the .csv case, base R provides the read.csv
function. There is also a tidyverse version called read_csv, which in
general is faster with large .csv files, automatically reads data into tibble
format, leaves column names unchanged, and does not automatically
convert character variables into factors by default. These were introduced
in Chapter 2 but, in a nutshell, read_csv works on the principles of
‘don’t mess with the data any more than necessary’ and ‘read the data
quickly’. Note that by default, read_csv guesses the data type of each
column and reports this when it reads in the data. The spec() function
returns this column specification and can be used to check it. The code
below reloads the Sacramento real estate data introduced in Chapter 2:
real_estate_url="https://raw.githubusercontent.com/lexcomber
real_estate = read_csv(real_estate_url)
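The code that creates real_estate2 is not reproduced in this extract; one plausible form, assuming the zip column is simply re-typed with mutate() (the object name real_estate2 comes from the text below), is:
real_estate2 = mutate(real_estate, zip = as.character(zip))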
This creates a new tibble with zip as a character column. One issue here is
that the spec attribute of the tibble is now lost:
spec(real_estate2)
Once the data have been read in, a brief inspection using
View(real_estate) raises some further issues. In particular, there are
some properties with zero square feet floor area, zero bathrooms or zero
bedrooms. These seem anomalous and so might be regarded as missing
values. Similarly, there are a small number of properties with implausibly
low prices (in particular, $1551, $2000 and several at $4897 all sold on
the same date). This could be dealt with using filter or could be
regarded as data tidying prior to processing. In the latter case, it could be
handled by specifying that zero (or one of the rogue price values) is a
missing value on numerical variables. There is another argument to
read_csv called na which specifies which values are used for NA in the
.csv file. Here, we could use:
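For example (a sketch; the exact set of values treated as missing is a judgement call):
real_estate = read_csv(real_estate_url,
                       na = c("", "NA", "0"))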
Another kind of file that is quite commonly encountered is the Excel file.
MS Excel has been referred to as a tool for converting .xls and .xlsx
files to .csv format. However, this approach is not reproducible as the
procedures to do this are generally not scripted. A common issue is that
spreadsheets are often formatted in more complex ways than standard
.csv files – several datasets on the same sheet, or a single .xlsx file
with several worksheets each with a different dataset, are common. Also
there is a tendency to merge data and metadata, so that notes on the
definition of variables and so on appear on the same sheet as the data
themselves.
The general consequence of this is that getting data out of Excel
formatted files generally involves selecting a range of cells from a sheet,
and writing these out as a .csv file. However, some reproducible
alternatives exist, via R packages that can read .xls and .xlsx files.
Here we introduce one of these – the package readxl. This offers a
function called read_excel which in its basic form takes the arguments
path, sheet and range: path is simply the path to the Excel spreadsheet.
As an example, the Police Service of Northern Ireland (PSNI) releases
crime data in spreadsheet form. This may be downloaded from the site
https://www.psni.police.uk/inside-psni/Statistics (in particular, the data for
November 2017 can be found in the file with URL
https://www.psni.police.uk//globalassets/inside-the-psni/our-
statistics/police-recorded-crime-statistics/2017/november/monthly-crime-
summary-tables-period-ending-nov-17.xls). read_excel does not read
directly from a URL, although it may be helpful to download the file and
view it before extracting data in any case. This can be done via the
download.file() operation. Note that the code below splits the URL
into stem1 and stem2 for the purposes of the book!
stem1 <- "https://www.psni.police.uk//globalassets/inside-the-psni/"
stem2 <- "our-statistics/police-recorded-crime-statistics"
filepath <- "/2017/november/monthly-crime-summary-tables-period-ending-nov-17.xls"
download.file(paste0(stem1, stem2, filepath), "psni.xls")
This downloads the file, making a local copy in your current folder called
psni.xls. Opening this and looking at the sheet called Bulletin
Table 2 shows a typical Excel table (Figure 3.2). As suggested earlier,
this sheet contains footnotes and other metadata as well as the data
themselves. In addition, it contains merged columns, sub-headings
embedded in rows, and column names spread over multiple rows – a
fairly typical mixture of presentation and data storage found in Excel files.
Figure 3.2 Part of an Excel spreadsheet
Here, the data seem to be in the range A5:E31 – although things will still
be messy when the raw information is extracted. In any case, the first
step is to read this in using read_excel():
library(readxl)
crime_tab <- read_excel(path = "psni.xls",
                        sheet = "Bulletin Table 2",
                        range = "A5:E31")
print(crime_tab, n = 26)
print(crime_tab, n = 26 )
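The discussion below refers back to the gather() call used to tidy the AirPassengers data; that call takes broadly this form (see also the full pipeline shown shortly):
gather(AirPassengers_df, key = "Month", value = "Count", -Year)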
Here, the key argument provides the name given to the variable derived
from the multiple column names when the column stacking occurs
(essentially it just indicates which of the columns the newly stacked data
were taken from). The value argument provides the name for the
column of stacked values – here Count, as it is a passenger count.
Finally, the last argument specifies the column range over which the
gathering takes place. Here we want to gather everything except Year.
Columns can be specified in the same way as select so -Year
specifies gathering over all of the columns except Year. Any non-
gathering columns will simply have their values stacked in
correspondence to their row in the stacking range – so the row 1 value
(1949) will appear next to every gathered value from that row, and so on.
The data are now tidy, although before processing them, we may wish to
transform some of the data types. This was done in base R initially, but
now it may be done using mutate with the readr parsing functions,
including parse_character, parse_integer and
parse_factor. Suppose here we would like Year and Count to be
integers (at the moment Year is character and Count is double
precision), and Month to be a factor (so that it tabulates in month order,
not alphabetically). It might also be useful to convert the base R data
frame into a tibble. Using pipelines, the gathering and transformation may
be carried out as follows:
AirPassengers_df %>%
  gather(key = "Month", value = "Count", -Year) %>%
  as_tibble %>%
  mutate(Year = parse_integer(Year),
         Month = parse_factor(Month,
                              levels = c("Jan","Feb","Mar","Apr","May","Jun",
                                         "Jul","Aug","Sep","Oct","Nov","Dec")),
         Count = as.integer(Count))
In both cases the convention that step 2 (in Figure 3.1) outputs tidy data,
and step 3 expects it as input, means that the whole process can be
chained as a single pipeline.
Revisiting the Northern Ireland crime data, the situation is more complex,
as the data are messier, with a number of issues. The first two of these
concerned the column names. Here, transmute can be used as a tool
to rename the columns – simply by defining new columns (with better
names) to equate to the existing columns. First, check out the existing
names:
colnames(crime_tab)
## [1] "…1"
## [2] "12 months to November 2016"
## [3] "12 months to \nNovember 20171,2"
## [4] "Change between years"
## [5] "% change between years3"
The …1 name was provided because the column name for this column was
in a different row of the spreadsheet from the others. Using transmute,
we have:
crime_tab <- crime_tab %>%
  transmute(`Crime Type` = `…1`,
            `Year to Nov 2016` = `12 months to November 2016`,
            `Year to Nov 2017` = `12 months to \nNovember 20171,2`,
            `Change` = `Change between years`,
            `Change (Pct)` = `% change between years3`)
crime_tab
To complete this part of the tidying, the next step is to remove the rows
corresponding to the broad classes. Noting that these are the only ones
to contain NA for any cell, it is possible to pick any column, and filter on
that, not taking NA as a value. Adding this filter to the pipeline, we have:
crime_tab %>%
  mutate(`Broad class` = ifelse(
    cumany(`Crime Type` == "OTHER CRIMES AGAINST SOCIETY"),
    "Other",
    "Victim based")) %>%
  filter(!is.na(Change))
## # A tibble: 24 x 6
##    `Crime Type` `Year to Nov 20… `Year to Nov 20… Change
##    <chr>                   <dbl>            <dbl>  <dbl>
##  1 VIOLENCE AG…            34318            33562   -756
##  2 Homicide                   20               21      1
##  3 Violence wi…            14487            13696   -791
##  4 Violence wi…            19811            19845     34
##  5 SEXUAL OFFE…             3160             3243     83
##  6 Rape                      814              912     98
##  7 Other sexua…             2346             2331    -15
##  8 ROBBERY                   688              554   -134
##  9 Robbery of …              541              444    -97
## 10 Robbery of …              147              110    -37
## # … with 14 more rows, and 2 more variables: `Change
## #   (Pct)` <dbl>, `Broad class` <chr>
Next, the subgroups need addressing. One thing that distinguishes these
in the CRIME TYPE column is that they are all in upper case. A test for
this tells us that the row corresponds to a subgroup, not an individual
crime type. The function toupper() converts a character string to all
upper case:
toupper("Violence with injury")
## [1] "VIOLENCE WITH INJURY"
Thus, a comparison such as `Crime Type` == toupper(`Crime Type`) used
in a mutate operation adds a logical column that has the value TRUE if the
row corresponds to a subclass and FALSE otherwise. A neat trick here is to
apply the cumsum() window function. This treats logical values as either
0 (for FALSE) or 1 (for TRUE). Thus it increases by 1 every time a row is a
subclass and each row in the column has an integer from 1 to 9
indicating the subclass. The actual names (slightly rephrased and not in
upper case) are stored in a vector called subclass_name which is
indexed by the above calculation, providing a column of subclass names.
However, a further complication is that there are subclasses that have
only one component – themselves. An example is PUBLIC ORDER
OFFENCES. One way of dealing with these is to identify them using the
add_count() operator. This is a one-off function that, given a column
name, provides a new column n that counts the number of times each
value in the column appears in the column as a whole. Thus if we have
created a Subclass column, add_count(Subclass) (in pipeline
mode) adds a new column n that counts the number of times the
subclass value in the row appears in the column as a whole. If there are
crime types in a given subclass, then every row with that subclass value
will have an n value of k + 1 for the k members of the group, plus the row
allocated to the subclass. The exceptions to this are the one-component
subclasses, which only appear as a subclass row. In this case k = 1.
Finally, the rows referring to subclasses with more than one crime type
are filtered out using the toupper() comparison method used before,
and checking whether the n column is equal to 1. Having done this, n is
no longer a useful column, and so it is removed via select:
At this point it is perhaps worth reinforcing that getting the data into a
format we wish to work with requires quite a lot of data ‘munging’ –
reformatting and providing some information not supplied perfectly in the
first place. This perhaps is a characteristic that should be highlighted in
data science and that has been underplayed in traditional statistics:
dealing with imperfect data can account for a great deal of time in
practice, and although there are notable exceptions in the literature such
as Chatfield (1995), much discussion of new analytical techniques is
traditionally presented with data that magically arrive in an ideal form.
Key Points
The pipeline chaining process and workflows require data to be tidy,
but data as supplied may often not have tidy characteristics (see
also Chapter 2).
A number of issues related to different data formats were illustrated,
including an untidy Excel example.
A formal set of requirements for tidy data was outlined and applied:
each variable must have its own column, each observation must
have its own row, and each value must have its own cell.
The use of the gather() function to do this was illustrated. This
‘pulls in’ multi-column variables, stacking the values in each input
column and returning a single output column.
Excel data were used to illustrate the need for extensive
reformatting, reinforcing the key point that is often overlooked in data
science that dealing with imperfect data can take a lot of time and
effort.
3.5 PIPELINES, DPLYR AND SPATIAL DATA
The sections above outline how dplyr and the concept of tidy data
provide a framework in general for data manipulation. Chapter 2 provided
some cursory examples of how dplyr functions could be applied to
spatial data, and these are expanded here. Recall that the sp format for
spatial data does not fit directly into the tidy framework, but the sf
(simple features) framework does, as described in Chapter 2.
Here we will make use of the Newhaven burglary data supplied with the
IBDSDA package. This contains several sf objects, including:
blocks – US Census block areas
tracts – US Census tract areas
callouts – reported police callouts: family disputes, breaches of the peace and residential burglary (forced/unforced entry)
roads – roads in the area.
roads – roads in the area.
These relate to a number of crimes recorded in New Haven, CT, and to
census data for the same area.
Each of these objects is actually a named item in a list object, called
newhaven. First, these are read in:
library(IBDSDA)
data(newhaven)
Rather than having to refer to each of these items by their full name (e.g.
newhaven$blocks) the list items are first copied out to separate
individual objects:
tracts <- newhaven$tracts
roads <- newhaven$roads
callouts <- newhaven$callouts
blocks <- newhaven$blocks
You should examine these in the usual way. The tracts and blocks
objects are census areas at different scales with attributes from the US
Census. The roads layer contains different classes of road linear
features, and the callouts data contain information on the locations of
police callouts to different types of incidents.
These objects can be mapped using tmap, and the code below
generates Figure 3.3:
library(tmap)
tmap_mode('plot')
tm_shape(blocks) + tm_borders() +
tm_shape(roads) + tm_lines(col='darkred') +
tm_shape(callouts) +
tm_dots(col="Callout",size=0.5,alpha=0.3) +
tm_scale_bar(position=c('left','bottom')) +
tm_compass() +
tm_layout(legend.position = c('left',0.1 ))
The different dplyr functions for manipulating data tables can be used
to manipulate spatial data tables such as are held in sf objects. The
code below calculates a count of rented but occupied properties, groups
the data by tracts (blocks are nested within tracts in the US Census),
summarises vacant and rented but occupied properties over the groups
and creates a ratio, before passing the result to tmap:
blocks %>%
  mutate(RentOcc = HSE_UNITS * P_RENTROCC / 100) %>%
  group_by(TRACTBNA) %>%
  summarise(RentOcc = sum(RentOcc), TotVac = sum(VACANT)) %>%
  mutate(Vac2Rent = TotVac / RentOcc) %>%
  tm_shape() +
  tm_fill(col = c("RentOcc", "TotVac", "Vac2Rent"),
          style = "kmeans",
          title = c("Rent Occ.", "Vacant Props.", "Ratio"),
          palette = "YlGnBu", format = "Europe_wide") +
  tm_layout(legend.position = c("left", "bottom"))
Note here that the original callouts object’s columns are merged using
the tract column (TRACTBNA). This provides a ‘block area number’ – a
way of identifying tracts and blocks (now not used officially). Here it
serves the purpose of identifying which tract each incident occurred in.
Suppose now we wish to focus on burglaries. This can be achieved by
filtering via the Callout column. The four classes of callout include two
relating to burglary, and their levels contain the word ‘burglary’. The
str_detect() function in the stringr package (which is automatically
loaded when tidyverse is loaded) detects the presence of a character
pattern in a string. If it is found, the function value is TRUE, otherwise
FALSE. Thus it may be used as a filtering function to pick out the callouts
relating to burglary. Next, fct_drop in the forcats package is used to
redefine the factor Callout so it only has the levels that actually appear
(i.e. only the levels relating to burglary). This is important to stop future
tabulation and counting functions considering other levels, which have
now been systematically excluded – at least for this particular analysis.
The following code applies the filtering process outlined above to create
an object called burgres:
callouts %>%
filter(str_detect(Callout,"Burglary")) %>%
mutate(Callout = fct_drop(Callout)) ->
burgres
A complication is that the geometry column of an sf object is 'sticky' and
cannot simply be dropped with select. To overcome this and get rid of this
column, the sf object must first be converted to an ordinary tibble via the
as_tibble() function. Thus, to create a table with just the Callout and
TRACTBNA columns, use:
burgres %>%
  st_join(tracts) %>%
  as_tibble %>%
  select(-geometry) -> burgres_tab
burgres_tab
## # A tibble: 220 x 2
## Callout TRACTBNA
## <fct> <fct>
## 1 Forced Entry Burglary 1403
## 2 Forced Entry Burglary 1427
## 3 Forced Entry Burglary 1407
## 4 Forced Entry Burglary 1403
## 5 Forced Entry Burglary 1421
## 6 Forced Entry Burglary 1424
## 7 Forced Entry Burglary <NA>
## 8 Forced Entry Burglary 1409
## 9 Forced Entry Burglary 1412
## 10 Forced Entry Burglary 1409
## # … with 210 more rows
burgres_tab %>%
group_by(Callout,TRACTBNA) %>%
summarise(n=n()) %>%
ungroup
burgres %>%
st_join(tracts) %>%
st_drop_geometry() %>%
count(TRACTBNA,Callout) %>%
complete(TRACTBNA,Callout,fill=list(n=0L)) ->
burgres_counts
burgres_counts
## # A tibble: 62 x 3
## TRACTBNA Callout n
## <fct> <fct> <int>
## 1 1401 Forced Entry Burglary 0
## 2 1401 Unforced Entry Burglary 2
## 3 1402 Forced Entry Burglary 2
## 4 1402 Unforced Entry Burglary 1
## 5 1403 Forced Entry Burglary 3
## 6 1403 Unforced Entry Burglary 0
## 7 1404 Forced Entry Burglary 6
## 8 1404 Unforced Entry Burglary 2
## 9 1405 Forced Entry Burglary 7
## 10 1405 Unforced Entry Burglary 4
## # … with 52 more rows
The fill value in complete() is given as 0L rather than just 0 – the L suffix
tells R to store the number as an integer rather than the default
double-precision real number; since the values here are counts, an integer
format is more appropriate.
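A quick way to see the distinction (a minimal illustration, not from the original text):
typeof(0L)   # "integer"
typeof(0)    # "double"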
It is now possible to use the output from this as data for a Poisson
regression model. The two factors for predicting the count are Callout
and TRACTBNA. The use of statistical modelling is covered in detail later
in this book, but for now we will just state that the model is
$$\log E(n_{ij}) = \alpha + \beta_i + \gamma_j + \delta_{ij} \qquad (3.1)$$
where E(nij) is the expected value of the count nij, α is a constant term,
βi is the effect due to crime type i, γj is the effect due to tract j, and
δij is the effect due to the interaction between crime type i and tract j.
If there is no interaction, this suggests
that there is no difference (on average) between the rate of each type of
burglary across the tracts, and so no evidence for geographical
variability. In this case the simpler model
$$\log E(n_{ij}) = \alpha + \beta_i + \gamma_j \qquad (3.2)$$
is more appropriate. Here an analysis of deviance for the two models in
equations (3.1) and (3.2) is carried out:
full_mdl <- glm(n ~ TRACTBNA * Callout, data = burgres_counts,
                family = poisson())
main_mdl <- glm(n ~ TRACTBNA + Callout, data = burgres_counts,
                family = poisson())
anova(main_mdl,full_mdl)
## Analysis of Deviance Table
##
## Model 1: n ~ TRACTBNA + Callout
## Model 2: n ~ TRACTBNA * Callout
## Resid. Df Resid. Dev Df Deviance
## 1 29 37.586
## 2 0 0.000 29 37.586
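The table above reports the change in deviance but no p-value. One way to attach one (a small addition to the call shown above, not part of the original text) is to request a chi-squared test, which is appropriate when comparing nested Poisson models:
anova(main_mdl, full_mdl, test = "Chisq")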
callouts %>%
filter(!str_detect(Callout,"Burglary")) %>%
mutate(Callout = fct_drop(Callout)) %>%
st_join(tracts) %>%
st_drop_geometry() %>%
count(TRACTBNA,Callout) %>%
complete(TRACTBNA,Callout,fill=list(n=0L)) ->
disturb_counts
disturb_counts %>%
  spread(Callout, n) %>%
  transmute(TRACTBNA,
            PC_DISPUTE = 100 * `Family Dispute` /
              (`Family Dispute` + `Breach of Peace`)) %>%
  left_join(tracts, .) -> tracts2
The last line may look unfamiliar – the intention here is to pipeline the
result of the transmute to a left join, but here we wish to make it the
second argument (the first being the tracts sf table) because we want
to join the outcome onto tracts to obtain an updated sf object – still a
geographical table with polygons for the tracts and so on. The dot
notation here causes that to happen – and generally it is a useful tool
when non-standard pipelining is required.
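As a generic illustration of the dot placeholder (a minimal sketch using toy tables, not tied to the New Haven data):
library(dplyr)
x <- tibble(id = 1:3)
y <- tibble(id = 2:4)
x %>% left_join(y)       # default: x becomes the first argument, left_join(x, y)
x %>% left_join(y, .)    # the dot makes x the second argument, left_join(y, x)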
As a result of this, tracts2 has a new column PC_DISPUTE giving the
percentages of disturbance callouts that were family disputes (Figure
3.4):
tm_shape(tracts2) +
  tm_polygons(col = 'PC_DISPUTE',
              title = 'Family Disputes (%)') +
  tm_layout(legend.position = c('left', 'bottom'))
Figure 3.4 Family disputes as a percentage of disturbance callouts
Finally, an alternative view might be to consider each of the types of
disturbance in separate maps showing the percentage of the total of each
particular kind of callout associated with each tract. This is more in
concordance with Tufte’s notion of ‘small multiples’, as advocated in
Tufte (1983, 1990).
This is relatively simple, requiring a slightly different calculation to
compile the sf (here called tracts3):
disturb_counts %>%
  spread(Callout, n) %>%
  transmute(TRACTBNA,
            PC_FAM = 100 * `Family Dispute` / sum(`Family Dispute`),
            PC_BOP = 100 * `Breach of Peace` / sum(`Breach of Peace`)) %>%
  left_join(tracts, .) -> tracts3
Buffering is also possible via the st_buffer function. Here the roads
subset is buffered by 50 m:
roads1408 %>% st_buffer(50)
This can then be clipped (to find the parts of the buffer inside tract 1408).
This is done by re-intersecting with tracts and selecting those whose
TRACTBNA is still 1408. Noting that roads1408 also has a TRACTBNA
column, this is dropped before carrying out the intersection (see above):
roads1408 %>%
  st_buffer(50) %>%
  select(-TRACTBNA) %>%
  st_intersection(tracts) %>%
  filter(TRACTBNA == 1408) -> bufroads1408
The result may be mapped, giving an idea of which areas are more
mixed in terms of owner occupation of residential properties:
tm_shape(tract_sd) +
  tm_polygons(col = 'OO_SD',
              title = "Owner Occupation SD") +
  tm_layout(legend.position = c("left", "bottom"))
A sometimes useful trick is to use this approach to unify all of the entities
into a single sf. This is done by adding a new column with a constant
value, grouping by this column, and summarising without any terms.
Finally, if a boundary is required (i.e. a line-based object marking the
edge of the study area, rather than a polygon), this can be achieved via
st_boundary. A boundary object for the study area is created below from the tracts:
tracts %>%
  mutate(gp = 1) %>%
  group_by(gp) %>%
  summarise %>%
  st_boundary() -> bounds
The same result can be obtained more compactly by creating the grouping variable inside the group_by() call:
tracts %>%
  group_by(gp = 1) %>%
  summarise %>%
  st_boundary() -> bounds
This may now be used alongside the tracts and blocks simple
features to show the different hierarchies in the spatial data in Figure 3.7:
tm_shape(bounds) +
tm_lines(lwd=4,col='dodgerblue') +
tm_shape(tracts) +
tm_borders(lwd=2,col='darkred') +
tm_shape(blocks) + tm_borders()
Key Points:
Pipeline chaining and dplyr manipulation can be used with spatial
data in sf format in the same way as a data.frame or tibble, for
example to select variables (fields), filter rows (records) or mutate
variables.
You will also need to undertake some data downloads, one of which is 11
GB and will take some time.
4.2 INTRODUCTION TO DATABASES
4.2.1 Why use a database?
Up until now in this book all of the code snippets have been applied to in-
memory data: the data were loaded into R in some way, read from either
an external file or one of R’s internal datasets. These data have been of a
small enough size for R to manage them in its internal memory. For
example, the lsoa_data in Chapter 2 was 3.2 MB and the newhaven
dataset in Chapter 3 was 3.0 MB. The main reason for using a database is
that the data are too big and cumbersome for easy manipulation in the
working memory of whatever analysis environment you are working in (R,
Python, QGIS, etc.). Essentially it is for data storage and access
considerations.
Consider a flat 10 GB data table. Any analysis will typically want to select
certain rows or columns based on specific criteria. Such operations
require a sequential read across the entire dataset. Reading (and writing)
operations take time. For example, if reading 1 MB sequentially from
memory takes around 0.005 seconds, then reading an in-memory 10 GB file
would take about 50 seconds, and reading the same file from disk would be
roughly 10 times slower. If the file is
modified and has to be written, then this takes longer. If the file is very
large (e.g. 100 GB) then such rates would be optimistic as they assume
that all the data can be read in a single continuous read operation.
Typically, disk storage can be expanded relatively cheaply, but memory
remains a limiting factor as not everything can fit into memory. Additionally, any kind of
selection or filtering of data rows or columns requires linear searches of
the data, and although these can be quicker if the data are structured in
some way (e.g. by sorting some or all of the columns), there is a trade-off
between storage and efficiency. The problem is that sorting columns is
expensive as sorted columns (or indexes) require additional storage.
Thus, because of the slowness of reading and writing, data are often held
in indexed databases that are stored remotely (i.e. neither on your computer
nor in your computer's memory) and accessed through queries.
The basic idea behind databases (as opposed to data frames, tibbles and
other in-session data table formats) is that you connect to a database
(either a local one, one in working memory or a remote one), compile
queries that are passed to the database, and only the query result is
returned.
Databases are a collection of data tables that are accessed through
some kind of database management system (DBMS) providing a
structure that supports database queries. Databases frequently hold
multiple data tables which have some field (attribute) in common
supporting relational queries. Records (observations) in different data
tables are related to each other using the field they have in common (e.g.
postcode, national insurance number). This allows data in different tables
to be combined, extracted and/or summarised in some way through
queries. Queries in this context are specific combinations of instructions
to the DBMS in order to retrieve data from the server.
In contrast to spatial queries (see Chapter 7 and other examples
throughout this book) which use some kind of spatial test based on a
topological query (within a certain distance, within a census area, etc.) to
link data, relational queries link data in different tables based on common
attributes or fields.
4.2.2 Databases in R
Relational databases and DBMSs classically use Structured Query
Language (SQL) to retrieve data through queries. SQL has a relatively
simple syntax (supported by ISO standards) but complex queries can be
difficult to code correctly. For these reasons the team behind dplyr
constructed tools and functions that translate into SQL when applied to
databases. Workflows combine the various dplyr verbs used for single-
table and two-table manipulations (Chapters 2 and 3), translate them into
SQL and pass them on to the DBMS.
To query databases in this way requires a database interface (DBI).
Fortunately, a DBI in the form of the DBI package is installed with
dplyr. The DBI separates the connectivity to the DBMS into a front-end
and a back-end. The front-end is you or your R session and the analyses
and manipulations you wish to undertake, and the back-end is the
database and the queries that are passed to it. Applications such as
dplyr use only the exposed front-end of the DBMS and allow you to
connect to and disconnect from the DBMS, create and execute queries,
extract the query results, and so on. The dbplyr package contains
back-end tools that translate the dplyr functions, calls and queries into
SQL in order for them to work with remote database tables, with the DBI
defining an interface for communication between R and relational
DBMSs. So you have used dbplyr before without knowing it in Chapter
3. If you look at the help you will see that the dbplyr package contains
only a small number of wrapper functions – but there are many dplyr
functions that use dbplyr functionality in the background. Thus dbplyr
is used when, for example, you use the select function in dplyr on a database table.
This book uses SQLite because it is embedded inside an R package
(RSQLite) which is automatically loaded with dplyr. This provides a
convenient tool for understanding how to manage large datasets because
it is completely embedded inside an R package and you do not need to
connect to a separate database server. A good introduction to this topic
can be found in Horton et al. (2015). The DBI package provides an
interface to many different database packages. This allows you to use
the same R code, for example with dplyr verbs, to access and connect
to a number of back-end database formats including MySQL with the
RMySQL package and Postgres with RPostgreSQL.
4.2.3 Prescribing data
A remote database called prescribing.sqlite, containing around 121 million
medical prescriptions issued in England in 2018 by some 10,600 GP
practices, will be used to undertake some spatial analyses
in Section 4.5. However, as this is 11.2 GB in size, a smaller version with
a sample of 100,000 prescriptions has been created to illustrate how to
set up databases and to query them. This will be used in the code below
to create prescribing_lite.sqlite.
The sampled data tables are contained in an .RData file called
ch4_db.RData which can be downloaded and saved to a local folder.
The code below loads the file from the internet to your current working
directory:
# check your current working directory
getwd()
download.file("http://archive.researchdata.leeds.ac.uk/732/1",
              "./ch4_db.RData", mode = "wb")
You will notice that five data tables and one sf spatial data object are
loaded to your R session. The data tables are prescriptions,
practices, postcodes, patients and social. You should examine
these – as a reminder the functions str, summary and head are useful
for this, as is the casting of the data.frame to a tibble format using
as_tibble:
str(prescriptions)
head(practices)
as_tibble(postcodes)
summary(social)
head(patients)
lsoa_sf
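The code that creates the database connection db and copies the prescriptions data into it as a table called prescripts_db is not reproduced in this excerpt. A minimal sketch of what that step would involve, assuming a temporary in-memory SQLite database, is:
library(dplyr)
# connect to a temporary in-memory SQLite database
db <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
# copy the in-session prescriptions data table into the database
copy_to(db, prescriptions, name = "prescripts_db", temporary = FALSE)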
tbl(db, "prescripts_db")
The above code snippet copied all of the data to db, but it is also
possible to specify particular fields. The code below overwrites
prescriptions in db using the dbWriteTable() function in the DBI
package:
dbWriteTable(conn = db, name = "prescripts_db",
value = prescriptions[,
c("sha","bnf_code","act_cost","month")],
row.names = FALSE, header = TRUE, overwrite =
T)
tbl(db, "prescripts_db")
## # Source: table<prescripts_db> [?? x 4]
## # Database: sqlite 3.30.1 []
## sha bnf_code act_cost month
## <chr> <chr> <dbl> <chr>
## 1 Q48 131002030AAADAD 5.3 04
## 2 Q56 0204000H0BFADAA 16.4 05
## 3 Q66 1001010AJAAADAD 104. 07
## 4 Q67 0407020B0BHACAJ 240. 11
## 5 Q55 1001010AHBBAAAA 20.0 03
## 6 Q63 1001010J0AAAEAE 44.4 11
## 7 Q53 0202030X0AAAAAA 18.7 12
## 8 Q68 0906040G0AAEKEK 2.56 08
## 9 Q57 130201100BBBAAN 6.61 08
## 10 Q63 0205051R0AAALAL 58.6 01
## # … with more rows
A similar syntax can be used to work out average prescription costs for
different SHAs:
tbl(db, "prescripts_db") %>%
group_by(sha) %>%
summarise(mean_cost = mean(act_cost, na.rm =
T))
In the code below, the first line checks for the existence of
prescribing_lite.sqlite and removes it if it exists. The second
then creates a connection to the database:
if (file.exists("prescribing_lite.sqlite") ==
TRUE)
file.remove("prescribing_lite.sqlite")
db = DBI::dbConnect(RSQLite::SQLite(),
dbname=
"prescribing_lite.sqlite" )
If you look in your working folder using Windows Explorer (PC), your file
manager in Linux or Finder (Mac), you will see that an empty file
called prescribing_lite.sqlite has been created. It has no size
because it has not yet been populated with data.
The following code populates the db object with the tables in Figure 4.1
using the dbWriteTable function, and then closes the connection to the
database:
dbWriteTable(conn = db, name = "prescriptions",
value = prescriptions,
row.names = FALSE, header = TRUE )
dbWriteTable(conn = db, name = "practices",
value = practices,
row.names = FALSE, header = TRUE )
dbWriteTable(conn = db, name = "patients",
value = patients,
row.names = FALSE, header = TRUE )
dbWriteTable(conn = db, name = "postcodes" ,
value = postcodes,
row.names = FALSE, header = TRUE )
dbWriteTable(conn = db, name = "social", value
= social,
row.names = FALSE, header = TRUE )
dbDisconnect(db)
Again, note the use of the dbDisconnect function in the last line. If you
enter db at the console you will see that it is disconnected.
If you check again in your working folder you will see that
prescribing_lite.sqlite now has been populated and has a size
of around 35 MB.
To access the data, we need simply to connect to the database, having
created it:
db <- DBI::dbConnect(RSQLite::SQLite(),
                     dbname = "prescribing_lite.sqlite")
You can check what the database contains using different commands
from the DBI package:
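The specific commands used in the original text are not reproduced here; two standard DBI helpers that do this are dbListTables() and dbListFields():
dbListTables(db)                    # the data tables in the database
dbListFields(db, "prescriptions")   # the fields in a given table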
It is possible to query the data tables in the database. For example, the
code below summarises the mean cost of prescriptions in each of the
SHAs, arranges them in descending mean cost order, filters them for at
least 100 items and prints the top 10 to the console:
tbl(db, "prescriptions") %>%
group_by(sha) %>%
summarise(
mean_cost = mean(act_cost, na.rm = T),
n = n()
) %>%
ungroup() %>%
arrange(desc(mean_cost)) %>%
filter(n > 100) %>% print(n=10)
However, this is working with only one of the data tables in the db
database. It is not taking advantage of the layered analysis that is
possible by linking data tables, as shown in Figure 4.1.
To link the data tables in the db database we need to specify the links
between them, just as we did using the join functions in Chapter 2.
However, to join in the right way, we need to think about how we wish to
construct our queries and what any queries to database tables will return
(and therefore the analysis and inference they will support). The DBI
package has a number of functions that list data tables and their fields as
above. Similarly, tbl in the dplyr package can be used to do these
summaries:
tbl(db, "social")
tbl(db, "practices")
tbl(db, "patients")
colnames(tbl(db, "prescriptions"))
The results are similar to the str, head and summary functions used
earlier. However, it is important to restate what the tbl function is doing:
it is sending a tbl query to db about the table (e.g. the prescriptions
table) in db. As tbl returns the first 10 records this is what is returned
from db – the data returned to the R session are only the result of this
query and not the whole of the data table. To emphasise this, the code
below compares the size of the query and the size of the
prescriptions data table that is loaded into the R session (note that
there are small differences in the way that file sizes are calculated
between Windows and Mac operating systems):
format(object.size(tbl(db, "prescriptions")),
unit="auto")
## [1] "5.3 Kb"
format(object.size(prescriptions), unit=
"auto")
## [1] "11.2 Mb"
4.3.3 Summary
The basic process of creating a database is to first define a connection to
a database, then populate it with data, and, when it is populated, to close
the connection.
Two kinds of database were illustrated: an in-memory one, good for
development; and a local on-file database. The procedures for
constructing these were essentially the same:
1. Define a connection to a database.
2. Populate the database with data.
3. Close the database.
This sequence can be used to construct much larger databases.
Key Points
Different types of databases can be created including local in-
memory databases for prototyping and testing, local permanent on-
file ones, and remote databases held elsewhere on another
computer.
The basic process of creating a database is first to define a
connection to a database, then to populate it with data, and, when it
is populated, to close the connection.
Once instantiated or connected to, the copy_to function in dplyr
or the dbWriteTable function in the DBI package can be used to
populate the database tables with data.
The main principle in working with databases is that the data reside
elsewhere (i.e. not in working memory) and only the results of
queries are pulled from the database, not the data themselves.
The tbl function in dplyr creates a reference to the data table in
the database without pulling any data from the database, allowing
subsequent queries to be constructed.
4.4 DATABASE QUERIES
In this section we will explore a number of core database operations
using dplyr to construct queries and apply them to the
prescribing_lite.sqlite database. There are three main groups
of operations that are commonly used in database queries, either singly
or in combination. Queries specify operations that:
1. Extract (specifying criteria, logical pattern matching, etc.)
2. Join (linking different tables)
3. Summarise (grouping, using summary functions, maintaining fields,
creating new fields).
There are some overlaps with Sections 2.4 and 3.3 in terms of the
operations using dplyr tools, but here these are applied in a database
context. In many cases the syntax is exactly the same as those applied to
data.frame and tibble formats in the previous chapters. On
occasion, however, they need to be adapted for working with databases.
4.4.1 Extracting from a database
It is possible to extract whole records (rows) and fields (columns),
individual rows and columns that match some criteria, and individual
elements (cells in a data table). The two most commonly used
approaches for extracting data are:
1. By specifying some kind of logical test or conditions that have to be
satisfied for the data to be extracted
2. By specifying the location of the data you wish to extract, for
example by using the ith row or jth column, or variable names.
Logical queries have TRUE or FALSE answers and use logical operators
(e.g. greater than, less than, equal to, not equal to). These have been
covered in previous chapters, but the full set of logical operators can be
found in the R help (enter ?base::Logic). The other main way of
selecting is through some kind of text pattern matching. Both may be
used to subset database fields (columns) by filtering and/or database
records (rows) by selecting.
You should be familiar with the dplyr verbs introduced in Chapter 2 and
illustrated in Chapter 3. These can be applied to databases in the same
way.
A connection has to be made for a database to be queried. Connect to
the prescribing_lite.sqlite database you created earlier as
before:
library(RSQLite)
db <- dbConnect(SQLite(),
dbname="prescribing_lite.sqlite")
The code below uses filter() to extract the prescriptions for a specific
flu vaccine via the BNF code (see, for example,
https://openprescribing.net/bnf/) and orders the extracted records by
volume (items):
# one pattern
tbl(db, "prescriptions") %>%
filter(bnf_code %like% '%1404000H0%')
# multiple patterns
tbl(db, "prescriptions") %>%
filter(bnf_name %like% 'Dermol%' |
bnf_name %like% 'Fluad%')
Fields can also be selected using pattern matching, but here a set of tidy
matching functions can be applied:
tbl(db, "prescriptions") %>%
select(starts_with("bnf_"))
tbl(db, "prescriptions") %>%
select(contains("bnf_"))
Thus far, all of the results of the code snippets applied to the database
have simply been printed to the console. No data have been returned to the
R session, meaning that all of the analysis has taken place away from the
working memory of your computer – a key advantage of working with
databases. The collect() function pulls the results of a query back into
the R session as an in-memory object. To show this, run the code below,
which uses the object.size function to evaluate the memory cost to R of
running the code snippets with and without collect().
# size of the call
tbl(db, "prescriptions") %>% object.size()
# size of a longer call
tbl(db, "prescriptions") %>%
filter(bnf_code %like% '%1404%') %>%
arrange(desc(act_cost)) %>% object.size()
# size of what is returned with collect
tbl(db, "prescriptions") %>%
filter(bnf_code %like% '%1404%') %>%
arrange(desc(act_cost)) %>% collect() %>%
object.size()
Note again that without the collect() function in the code above, the
joined data remain in the database (i.e. are not returned to the console).
Also note that not all dplyr joins currently support database queries,
although they may in the future:
tbl(db, "prescriptions") %>%
full_join(tbl(db, "practices")) %>% collect()
%>% dim()
tbl(db, "prescriptions") %>%
right_join(tbl(db, "practices")) %>%
collect() %>% dim()
This means that queries with nested calls to different database tables
have to be carefully designed.
Of course, the dplyr verbs can also be applied, for example to filter the
data as was done in the first join example here. The code below selects
for antidepressants, undertakes two inner joins and specifies a set of
attributes to be returned, in this case the SHA, month, the geographic
coordinates of the practice, the drug code and the item cost. You should
note the way that the code forces R to take the dplyr version of the
select function, using the package_name::function_name format.
Sometimes you may have similarly named functions from different
packages loaded in your session – the raster package, for example,
also has a function called select.
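As a brief illustration of this namespacing (a sketch only, with an arbitrary choice of fields from the prescriptions table):
# force the dplyr version of select, whatever other packages are loaded
tbl(db, "prescriptions") %>%
  dplyr::select(sha, month, bnf_code, act_cost)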
# entire dataset
tbl(db, "prescriptions") %>%
  summarise(total = sum(act_cost, na.rm = T))
# grouped by sha
tbl(db, "prescriptions") %>%
  group_by(sha) %>%
  summarise(total = sum(act_cost, na.rm = T)) %>%
  arrange(desc(total))
# grouped by sha and month
tbl(db, "prescriptions") %>%
  group_by(sha, month) %>%
  summarise(total = sum(act_cost, na.rm = T)) %>%
  arrange(desc(total))
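The SQL that dplyr generates for such a query can be inspected with show_query() (see also the Key Points below); a minimal example based on the grouped summary above:
tbl(db, "prescriptions") %>%
  group_by(sha) %>%
  summarise(total = sum(act_cost, na.rm = T)) %>%
  show_query()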
It is also possible to pass SQL code directly to the database using the
dbGetQuery function in the DBI package:
dbGetQuery(db,
           "SELECT `practice_id`, AVG(`act_cost`) AS `mean_cost`
            FROM `prescriptions`
            GROUP BY `practice_id`
            ORDER BY `mean_cost` DESC") %>% as_tibble()
Key Points
Database queries are used to select, filter, join and summarise data
from databases.
Selection and filtering are undertaken using logical statements or by
specifying the named or numbered location (records, fields) of the
data to be extracted.
Queries can be constructed using dplyr syntax for data
manipulations, selection, filtering, mutations, joins, grouping and so
on, which generally use the same syntax as dplyr operations on in-
memory data tables described in Chapter 3.
There are some differences, especially with pattern matching (e.g.
for multiple pattern matches, the %in% function can be used).
dplyr translates queries into SQL which is passed to the database
and the SQL code generated by the dplyr query can be printed out
with the show_query() function.
SQL code can be passed directly to the database using the
dbGetQuery function in the DBI package.
The collect() function at the end of a dplyr query returns the
query result to the current R session.
4.5 WORKED EXAMPLE: BRINGING IT ALL
TOGETHER
This section develops a worked example using the data tables held in the
prescribing.sqlite database. This contains the full
prescriptions data table, with some 120 million prescription records
for 2018, from which the prescribing_lite.sqlite was sampled
(all of the other data tables in the database are the same). This will take
some time to download (it is 11.2 GB in size and may take a few hours), so
you might wish to undertake another task while you wait.
download.file("http://archive.researchdata.leeds.ac.uk/734/1",
              "./prescribing.sqlite", mode = "wb")
Again this can be added to the code block using a join with the step 4
code nested inside the join. A further mutate operation is then included
to determine the costs per person and the results ordered again for
illustration:
tbl(db, "prescriptions") %>%
filter(bnf_code %like% '040702%') %>%
group_by(practice_id) %>%
summarise(cost = sum(act_cost, na.rm = T))
%>%
ungroup() %>%
left_join(
tbl(db, "patients") %>%
group_by(practice_id) %>%
summarise(prac_tot = sum(count, na.rm =
T)) %>%
ungroup() %>%
left_join(tbl(db, "patients")) %>%
mutate(prac_prop = as.numeric(count) /
as.numeric(prac_tot))
) %>%
mutate(lsoa_cost = cost*prac_prop) %>%
group_by(lsoa_id) %>%
summarise(tot_cost = sum(lsoa_cost, na.rm =
T)) %>%
ungroup() %>%
filter(!is.na(tot_cost)) %>%
left_join(
tbl(db, "social") %>%
mutate(population =
as.numeric(population)) %>%
select(lsoa_id, population)
) %>%
mutate(cost_pp = tot_cost/population) %>%
arrange(desc(cost_pp))
Then the code block above can be run, replacing the last line
(arrange(desc(cost_pp))) with collect() -> lsoa_result to
pull the results of the query from the server and to assign these to a local
in-memory R object. This will take a bit longer to run.
Finally, close the connection to the database:
dbDisconnect(db)
The results can be inspected and the summary indicates the presence of
a record with NA values.
summary(lsoa_result)
##    lsoa_id             tot_cost          population
##  Length:32935       Min.   :     2.74   Min.   : 983
##  Class :character   1st Qu.:  4464.48   1st Qu.:1436
##  Mode  :character   Median :  6238.25   Median :1564
##                     Mean   :  6655.18   Mean   :1614
##                     3rd Qu.:  8263.61   3rd Qu.:1733
##                     Max.   :130651.50   Max.   :8300
##                                         NA's   :   1
##     cost_pp
##  Min.   : 0.001494
##  1st Qu.: 2.829891
##  Median : 4.001882
##  Mean   : 4.161286
##  3rd Qu.: 5.228562
##  Max.   :30.272204
##  NA's   :1
lsoa_result %>% arrange(desc(tot_cost))
## # A tibble: 32,935 x 4
## lsoa_id tot_cost population cost_pp
## <chr> <dbl> <dbl> <dbl>
## 1 NO2011 130651. NA NA
## 2 E01018076 63208. 2088 30.3
## 3 E01021360 34481. 2001 17.2
## 4 E01030468 32875. 3499 9.40
## 5 E01026121 32683. 2469 13.2
## 6 E01021770 31760. 2013 15.8
## 7 E01020797 30391. 2501 12.2
## 8 E01023155 29682. 2768 10.7
## 9 E01020864 28340. 2068 13.7
## 10 E01016076 27921. 2406 11.6
## # … with 32,925 more rows
A quick examination and Google search of the LSOAs with the highest
prescribing costs and rates per patient in England in 2018 (E01018076,
Fenland in Cambridgeshire; E01021360, Braintree in Essex; and E01025575,
Wyre near Blackpool) indicates that all are in the urban fringe:
lsoa_result %>% arrange(desc(cost_pp))
Figure 4.2 The LSOA with the highest opioid prescribing rates per
person in 2018 with an OSM backdrop (© OpenStreetMap contributors)
A more informative national map is shown in Figure 4.3 (this will take
some time to render!). This clearly shows higher opioid prescribing rates
in the rural fringe, coastal areas and in the post-industrial areas of the
North:
lsoa_sf %>% left_join(lsoa_result) %>%
  tm_shape() +
  tm_fill("cost_pp", style = "quantile",
          palette = "GnBu",
          title = "Rate per person", format = "Europe_wide") +
  tm_layout(legend.position = c("left", "top"))
4.6 SUMMARY
This chapter has illustrated how database queries can be constructed
using the dplyr verbs and joins to wrangle data held in different
database tables. These were used to construct a complex query that
integrated and pulled data from different tables in the on-file database,
but which required some of the dplyr functions to be replaced with
workarounds, particularly for pattern matching. A key point is that dplyr
tries to be as lazy as possible by never pulling data into R unless
explicitly requested. In effect all of the dplyr commands are compiled
and translated into SQL before they are passed to the database in one
step. In this schema, the results of queries (e.g. tbl) create references
to the data in the database and the results that are returned and printed
to the console are just summaries of the query – the data remain on the
database. The dplyr query results can be returned using collect,
which retrieves data to a local tibble. The SQL created by dplyr can
be examined with the show_query() function.
Figure 4.3 The spatial variation in opioid prescribing per person in 2018
The examples here used in-memory and local databases. Of course they
can also be connected to remotely and the dplyr team provide a
hypothetical example of the syntax for doing that (see
https://db.rstudio.com/dplyr/):
con <- DBI::dbConnect(RMySQL::MySQL(),
                      host = "database.rstudio.com",
                      user = "hadley",
                      password = rstudioapi::askForPassword("Database password")
)
REFERENCES
Horton, N. J., Baumer, B. S. and Wickham, H. (2015) Setting the stage
for data science: Integration of data management skills in introductory
and second courses in statistics. Preprint, arXiv: 1502.00318.
Rowlingson, B., Lawson, E., Taylor, B. and Diggle, P. J. (2013) Mapping
English GP prescribing data: A tool for monitoring health-service
inequalities. BMJ Open, 3(1), e001363.
5 EDA AND FINDING STRUCTURE IN DATA
5.1 OVERVIEW
The preceding chapters have introduced data and spatial data (Chapter
2), the tools in the dplyr package and their use in manipulating and
linking data and, critically, within chains of operations in a piping syntax
(Chapter 3), and how databases for very large datasets can be created
and held on file locally but outside of R’s internal memory and then
queried using piped dplyr functions (Chapter 4). As yet, little formal
consideration has been given to exploratory data analysis (EDA),
although some has been implicitly undertaken in the form of scatter plots
and data summaries (as well as lollipop plots in Chapter 2).
The aim of EDA is to generate understanding of the data by revealing
patterns, trends and relationships within and among the data variables.
EDA generates summaries of data properties (data distribution, central
tendencies, spread, etc.) and correlations with other variables, and
reports these using tables or graphics. The first part of this chapter
illustrates the core techniques in EDA.
However, standard EDA in this way provides only limited information
about the structure of the data and the multivariate interactions and
relationships between sets of variables the data contain. More nuanced
and multivariate data understanding is needed to support hypothesis
development and testing and inference (such as are undertaken in
Chapter 6). Key to this are approaches for examining data and spatial
data structure.
The following packages will be required for this chapter:
library(tidyverse)
library(RColorBrewer)
library(GGally)
library(data.table)
library(sf)
library(ggspatial)
library(tmap)
library(grid)
library(gridExtra)
The data used in this chapter combine the results of the opioid
prescribing analysis that was written to lsoa_result.RData and other
data tables in ch4_db.RData from Chapter 4. You should make sure
these are in your current working directory and load them:
load("lsoa_result.RData")
load("ch4_db.RData")
ls()
This shows that the correlation is highly significant (very low p-value) and
also gives a 95% confidence interval of the correlation. An alternative
format is as follows:
cor.test(social$unemployed, social$noqual)
Key Points
EDA aims to provide understandings of data properties (distributions,
data spread, central tendencies, etc.) and dataset structure (how
different variables interact with each other).
EDA supports hypothesis development and the choice and
development of methods.
It is critical for understanding and communicating the results of
analysis.
Data visualisations build on standard numeric approaches for
exploring data properties and structure.
5.3 EDA WITH GGPLOT2
Visual approaches are very useful for examining single variables and for
examining the interactions of variables together. Common visualisations
include histograms, frequency curves and bar charts, and scatter plots,
some of which have already been introduced in this and earlier sessions.
The ggplot2 package (Wickham, 2016) supports many types of visual
summary. It allows multiple, simultaneous visualisations, and supports
grouping by colour, shape and facets. The precise choice of these, out of
the many possible options, requires some careful thinking. Do the data
need to be sorted or faceted? Should transparency be used? Do I need
to group the data? And so on.
The ggplot2 package is installed with the tidyverse package. R has
several systems for making visual outputs such as graphs, but ggplot2
is one of the most elegant and most versatile. It implements the grammar
of graphics (Wilkinson, 2012), a coherent system for describing and
constructing graphs using a layered approach. If you would like to learn
more about the theoretical underpinnings of ggplot2, Wickham (2010)
is recommended.
The remainder of this chapter provides a rounded introduction to
ggplot2, but note that it is impossible to describe all of the different
parameters and visualisation options that are available. For this reason,
the aims here are to establish core skills in visualising data with the
ggplot2 package.
5.3.1 ggplot basics
The ggplot2 package provides a coherent system for describing and
building graphs, implementing the grammar of graphics. The basic idea is
that graphs are composed of different layers each of which can be
controlled. The basic syntax is:
# specify ggplot with some mapping aesthetics
ggplot(data = <data>, mapping = <aes>) +
geom_<function>()
ggplot(social) +
geom_point(aes(x = unemployed, y = noqual),
alpha = 0.1) +
geom_smooth(aes(x = unemployed, y = noqual),
method = "lm")
This hints at how control can be exercised over different plot layers,
specifying different data, different aesthetics/parameters, and so on.
Layer-specific parameters can be set such as colour, shape,
transparency, size, thickness, and so on, depending on the graphic, to
change the default style. Additionally, the ggplot2 package includes a
number of predefined styles, called by theme_X(), that can be added as
a layer. The code below applies theme_minimal() as in Figure 5.1:
ggplot(data = social, aes(x = unemployed, y =
noqual)) +
# specify point characteristics
geom_point(alpha = 0.1, size = 0.7, colour =
"#FB6A4A", shape = 1) +
# specify a trend line and a theme/style
geom_smooth(method = "lm", colour =
"#DE2D26") +
theme_minimal()
The second option is to generate individual plots for each group using
some kind of faceting. The code below generates separate scatter plots
for each OAC type as in Figure 5.2:
ggplot(data = social, aes(x = unemployed, y =
noqual)) +
# specify point characteristics
geom_point(alpha = 0.1,size = 0.7,colour =
"#FB6A4A",shape = 1) +
# add a trend line
geom_smooth(method = "lm", colour =
"#DE2D26") +
# specify the faceting and a theme/style
facet_wrap("oac", nrow = 2) +
theme_minimal()
# a density plot
ggplot(social, aes(x = llti)) +
geom_density()
# a histogram
ggplot(social, aes(x = llti)) +
geom_histogram(bins = 30, col = "red", fill =
"salmon")
# a boxplot
ggplot(social, aes(x = "", y = llti)) +
geom_boxplot(fill = "dodgerblue", width =
0.2) +
xlab("LLTI") + ylab("Value")
Density plots can be further extended to compare different groups, showing
how different levels of llti are associated with different OAC classes:
ggplot(social, aes(llti, stat(count), fill = oac)) +
  geom_density(position = "fill") +
  scale_fill_brewer(palette = "Set1")
library(RColorBrewer)
display.brewer.all()
brewer.pal(11, "Spectral")
brewer.pal(9, "Reds")
social %>%
  ggplot(aes(y = llti, fill = oac)) +
  # specify the facets
  facet_wrap(~oac, ncol = 2) +
  # flip the coordinates and specify the boxplot
  coord_flip() + geom_boxplot() +
  # specify no legend and the colour palette
  theme(legend.position = "none") +
  scale_fill_manual(name = "OAC class",
                    values = brewer.pal(8, "Spectral"))
Notice how the piping syntax was used with ggplot in the above,
effectively piping the first argument (data).
These examples illustrate how additional variables, particularly
categorical variables, can be used to split the plot into facets, or subplots
that each display one subset of the data.
This can be further refined by ordering the OAC classes by their median
llti values and plotting them, using a number of other plot parameters
as in Figure 5.5:
social %>%
  ggplot(aes(x = reorder(oac, llti, FUN = median),
             y = llti, fill = oac)) +
  # specify the boxplot
  geom_boxplot(aes(group = oac),
               outlier.alpha = 0.4,
               outlier.colour = "grey25") +
  # specify the colour palette
  scale_fill_manual(name = "OAC class",
                    values = brewer.pal(8, "Spectral")) +
  # flip the coordinates and specify the axis labels
  coord_flip() + xlab("") + ylab("LLTI") +
  # specify some styling
  theme_minimal() + theme(legend.position = "none")
Figure 5.5 Boxplots of llti against OAC classes, ordered by the class
median value
There is quite a lot going on in the code above. It is instructive to unpick
some of it. You should note (and play around with!) the specification of
the fill aesthetics that tells ggplot which variables are to be shaded,
the treatment of outliers, specifying their transparency and shading and
the overriding of the default ggplot2 palette with
scale_fill_manual(). The coord_flip() function transposes the
x- and y-axes and their legends. Finally, one of the ggplot themes was
applied. This has a layout that includes a legend, meaning that a line of
code to remove the legend has to be specified after the theme call if that
is what is wanted.
Key Points
Single, continuous variables can be visualised using density plots,
histograms or boxplots.
Density histograms indicate the relative probability of each bin, and
because the bin areas sum to 1, they can be used to compare the
distributions of different continuous variables.
Faceting and grouping can be used to compare across groups.
Grouped boxplots can be ordered by some function (mean, median,
etc.).
5.5 EDA OF MULTIPLE CONTINUOUS
VARIABLES
Earlier scatter plots of individual pairs of variables were used to show
correlations. The scatter plot provides the simplest way of examining how
two variables interact. These are essentially visualisations of the
correlations generated in Section 5.2. The code below plots the ft49
and llti variables as an example of a simple scatter plot.
geom_smooth() by default fits a trend line using a generalised additive
model when there are more than 1000 observations and a loess model
otherwise (in the code snippets above a linear trend line was fitted):
social %>% ggplot(mapping = aes(x = ft49, y =
llti)) +
geom_point(size = 0.2, alpha = 0.2) +
geom_smooth()
This will take some time to run because of the size of the social dataset.
Another useful tool here is to show the upper triangle of the scatter plot
matrix with smoothed trend lines. These are achieved with loess curve
fits (Cleveland, 1979). In a similar way to the default geom_smooth
methods, these are smooth bivariate trend lines and provide a good way
of judging by eye whether there are useful correlations in the data,
including collinearity in variables. Essentially a straight-line-shaped trend
with not too much scattering of the individual points suggests collinearity
might be an issue. When two predictors are correlated it can be difficult to
identify whether it is one or the other (or both) that influence the quantity
(response) to be predicted. The code below does this. Adding
upper.panel=panel.smooth to the above code causes the loess
curves to be added:
plot(social[, c(4, 5:11)], cex = 0.2, col =
grey(0.145, alpha=0.2),
upper.panel=panel.smooth)
social %>%
ggplot(mapping = aes(x = llti, y =
unemployed)) +
geom_hex() + labs(fill='Count') +
scale_fill_gradient(low =
"lightgoldenrod1", high = "black")
In the above code the data were rescaled to show the variables on the
same polar axis. Rescaling can be done in a number of ways, and there
are many functions to do it. Here the scale() function applies a classic
standardised approach around a mean of 0 and a standard deviation of 1
(i.e. a z-score). Others use variable minimum and maximum to linearly
scale between 0 and 1 (such as the rescale function in the scales
package). The polar plots in the figure show that there are some large
differences between classes in most of the variables, although how
evident this is will depend on the method of rescaling and aggregation.
Try changing scale to scales::rescale in the code above and
changing mean to median to examine these.
Figure 5.11 Radar plots of the variable mean values for each OAC class
Writing figures to file
Figures, plots, graphs, maps, and so on can be written out to a file for
use in a document or report. There are a whole host of formats to which
graphics can be written out, but usually we want PDFs, JPEGs, TIFFs or
PNGs. The code below creates and writes a figure in PNG format. You
will have to experiment with the different ways that the size and density
(e.g. dots per inch, dpi) can be adjusted, which of course affect the plot
display (text size etc.).
PDF, PNG and TIFF files can be written using the following functions:
pdf()
png()
tiff()
You should examine the help for these. The key thing you need to know
is that these functions all open a file. The open file needs to have a map
or the figure written to it and then needs to be closed using dev.off().
So the basic syntax is (do not run this code – it is just showing syntax):
pdf(file = "MyPlot.pdf", other settings)
<code to produce figure or map>
dev.off()
You should try to write out a .png file of the plot using the code below.
This writes a .png file to the current working directory, which can always
be changed:
# open the file
png(filename = "Figure1.png", w = 7, h = 5,
units = "in", res = 300)
# make the figure
social %>% ggplot(mapping = aes(x = llti, y =
unemployed)) +
geom_hex() + labs(fill='Count') +
scale_fill_gradient(low = "lightgoldenrod1",
high = "black")
# close the file
dev.off()
Key Points
Scatter plots of individual pairs of variables are the core technique
for visualising their correlations.
Adding trend lines of different forms can help to show these
relationships.
The GGally package provides a number of very powerful tools for
visualisations that are built on ggplot2.
Multiple correlations (as in a correlation matrix) can be visualised
using geom_tile().
Pairwise correlations for very large numbers of observations can be
visualised using different summary bins and contours.
Multivariate data properties can be visualised using radar plots, and
these can be used to compare the multivariate properties of different
groups.
5.6 EDA OF CATEGORICAL VARIABLES
Categorical data typically describe the class of an observation. This
section describes some basic approaches for examining categorical
variables and describes techniques for visualising them and how they
interact.
5.6.1 EDA of single categorical variables
Individual categorical variables can be compared by examining their
frequencies. Frequencies of a single categorical variable can be
examined numerically with tables and visually with some kind of bar or
column plot. The code below calculates the frequencies numerically for
the counts of OAC classes in the social data table in different ways
using base R and dplyr functionality:
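The original code is not reproduced here; two equivalent ways of obtaining the frequencies (one base R, one dplyr) would be:
# base R
table(social$oac)
# dplyr
social %>% count(oac)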
The class frequencies can also be plotted using the geom_bar function
with ggplot. The code snippets below start with a basic plot and build to
more sophisticated visualisations:
# standard plot
ggplot(social, aes(x = factor(oac_code))) +
  geom_bar(stat = "count") + xlab("OAC")
# ordered using fct_infreq() from the forcats package
# (loaded with tidyverse)
ggplot(social, aes(fct_infreq(factor(oac_code)))) +
  geom_bar(stat = "count") + xlab("OAC")
# orientated a different way
ggplot(social, aes(y = factor(oac))) +
  geom_bar(stat = "count") + ylab("OAC")
# with colours - note the use of the fill in the mapping aesthetics
ggplot(social, aes(x = factor(oac_code), fill = oac)) +
  geom_bar(stat = "count")
# with bespoke colours with scale_fill_manual and a brewer palette
ggplot(social, aes(x = factor(oac_code), fill = oac)) +
  geom_bar(stat = "count") +
  scale_fill_manual("OAC class label",
                    values = brewer.pal(8, "Set1")) +
  xlab("OAC")
We can extend the bar plot to include the counts of specific components.
The code below uses a piped operation to calculate grouped summaries
of the total number of people with no qualifications (NoQual) and with
level 4 qualifications (L4Qual) in each OAC class, which are then
passed to different plot functions to create Figure 5.12. (Note that these
are similar to the plots created by the code above but not exactly the
same due to the different populations in each LSOA.) The code snippet
below uses different approaches to show the results of the grouping, with
different parameters used in the ggplot aesthetics and in the geom_bar
parameters. Each plot is assigned to an R object and then these are
combined using the grid.arrange function in the gridExtra package
to generate Figure 5.12.
# using a y aesthetic and stat = "identity" to
represent the values
g1 = social %>% group_by(oac_code) %>%
# summarise
summarise(NoQual = sum(population * noqual),
L4Qual = sum (population * l4qual))
%>%
# make the result long
pivot_longer(-oac_code) %>%
# plot
ggplot(aes(x=factor(oac_code), y=value,
fill=name)) +
geom_bar(stat="identity",
position=position_dodge()) +
scale_fill_manual("Qualifications",
values =
c("L4Qual"="red","NoQual"="orange")) +
xlab("OAC") + theme_minimal()
# using just an x aesthetic and stat = "count"
with fill and weight
g2 = social %>% group_by(oac_code) %>%
# summarise
summarise(NoQual = sum(population * noqual),
L4Qual = sum (population * l4qual))
%>%
# make the result long
pivot_longer(-oac_code) %>%
# plot
ggplot(aes(factor(oac_code))) +
geom_bar(stat="count", aes(fill=name,
weight=value)) +
scale_fill_manual("Qualifications",
values =
c("L4Qual"="red","NoQual"="pink")) +
xlab("OAC") + theme_minimal()
# plot both plots
gridExtra::grid.arrange(g1, g2, ncol=2)
Figure 5.12 Bar plots of OAC class with the proportions of people with
different levels of qualifications, created and displayed in different ways
A different approach would be to use geom_col() instead of
geom_bar() in a hybrid of the plots creating g1 and g2 above. The
geom_col() function requires mapping aesthetics for both x and y to be
specified (as in g1) and then can be specified with or without
position=position_dodge() to create either g1 (with) or g2
(without), to have the bars stacked side by side or on top of each other,
respectively. Try running the code below first as shown (Option 1 active),
and then with the Option 1 line commented out and the Option 2 line
uncommented:
social %>% group_by(oac_code) %>%
  summarise(NoQual = sum(population * noqual),
            L4Qual = sum(population * l4qual)) %>%
  pivot_longer(-oac_code) %>%
  ggplot(aes(x = factor(oac_code), y = value, fill = name)) +
  # Option 1: stacked on top
  geom_col() +
  # Option 2: stacked side by side (uncomment the line below)
  # geom_col(position=position_dodge()) +
  scale_fill_manual("Qualifications",
                    values = c("L4Qual" = "red", "NoQual" = "pink")) +
  xlab("OAC") + theme_minimal()
5.6.2 EDA of multiple categorical variables
Previous sections contained many examples of visualisations of different
categorical variables in terms of their continuous variable properties. This
was done using grouped or faceted plots of single variables (boxplots in
Figure 5.6), individual pairwise correlations (scatter plots in Figure 5.2),
multiple pairwise scatter plots (Figures 5.6, 5.7 and 5.9) or radar plots of
multiple properties (Figure 5.11). All of these sought to show the
differences in numeric variables between groups, in this case OAC
classes. Implicitly these have shown the relationships between
continuous variables by categorical groups.
However, it is also possible to examine how different categorical
variables interact using a correspondence table. This is commonly done
to assess accuracy in categorical mapping such as land cover and land
use. The table is used to compare observed classes (e.g. collected by
field survey) with those predicted by a remote sensing classification. Here
it can be used to compare the correspondences between different
categories in the social dataset which contains a Rural–Urban
Classification (RUC) as well as the OAC. The RUC contains eight
classes for the LSOA areas for England and Wales. Classes A1, B1, C1
and C2 are urban classes and D1, D2, E1 and E2 are rural classes.
The relationship between the RUC classes and the OAC classes can be
simply summarised using different functions for tabulating data counts in
a correspondence table:
social %>%
  select(oac, ruc11_code) %>%
  xtabs( ~ ., data = .)
##                                      ruc11_code
## oac                                     A1  B1   C1 C2   D1  D2   E1  E2
##   Cosmopolitan student neighbourhoods  271  73  784  6    1   1    2   0
##   Countryside living                    15   8  536 15 1082  80 2319 326
##   Ethnically diverse professionals    2158 123 2760  1  129   3   32   0
##   Hard-pressed communities            1706 281 2727 19   88   7    0   0
##   Industrious communities             1142 329 4252 50 1228 105   61   2
##   Inner city cosmopolitan             2121   3   46  0    0   0    0   0
##   Multicultural living                2724 131  983  0    0   0    0   0
##   Suburban living                     1386 260 3636  3  661   1   76   0
Notice the parameters that are passed to the xtabs function. xtabs is
not a piping-friendly function, and requires formula and data parameters
as inputs. The dot (.) indicates whatever comes out of the pipe, hence the
selection of two variables in the line above and the specification of the
data using the dots.
The RUC is also hierarchical in that it can be recast into two high-level
classes of Urban and Rural:
social %>%
  select(oac, ruc11_code) %>%
  # recode the classes
  mutate(UR = ifelse(str_detect(ruc11_code, "A1|B1|C1|C2"),
                     "Urban", "Rural")) %>%
  # select the UR and OAC variables and pipe to xtabs
  select(oac, UR) %>% xtabs( ~ ., data = .)
##                                      UR
## oac                                   Rural Urban
##   Cosmopolitan student neighbourhoods     4  1134
##   Countryside living                   3807   574
##   Ethnically diverse professionals      164  5042
##   Hard-pressed communities               95  4733
##   Industrious communities              1396  5773
##   Inner city cosmopolitan                 0  2170
##   Multicultural living                    0  3838
##   Suburban living                       738  5285
Tables can also be visualised using a heatmap (as was done in Figure
5.8) to show the relative frequency of interaction between the two
classes. The code for generating the graduated correspondence table is
below:
social %>%
  select(oac, ruc11_code) %>%
  xtabs( ~ ., data = .) %>%
  data.frame() %>%
  ggplot(aes(ruc11_code, oac, fill = Freq)) +
  geom_tile(col = "lightgray") +
  xlab("") + ylab("") + coord_fixed() +
  scale_fill_gradient(low = "white", high = "#CB181D") +
  theme_bw()
Key Points
Categorical data describe the class of an observation.
Categories and classes can be examined by comparing their
frequencies in tabular format and visually using column or bar charts.
Correspondence tables provide a convenient way to summarise how
two categorical variables interact.
These can be cast into heatmaps in ggplot using the geom_tile
function.
Categories can be used to group or facet plots in order to explore
how different group numeric properties and attributes vary (as was
done for scatter plots in Figure 5.2).
More complex summaries can be visualised using faceted radar
plots.
A word on colours
This chapter has used different colours in the visualisations. Figure 5.1
generated a scatter plot using colours in hexadecimal format
("#FB6A4A" and "#DE2D26") and the code for the heatmaps defined a
colour ramp from a named colour ("white") to a hexadecimal one ("#CB181D"). This points to the different ways that colours can be specified in R. There are three basic approaches: named colours, hexadecimal codes and RGB values.
First, named colours can be used. These are listed as follows using both
British and American spellings:
colours()
colors()
Second, hexadecimal codes can be used. You will have noticed that colour palette functions typically return colours in hexadecimal format, such as the calls to brewer.pal used earlier:
brewer.pal(11, "Spectral")
brewer.pal(9, "Reds")
Third, colours can be specified as RGB values. The named and hexadecimal formats can be used directly with ggplot, while the RGB format needs to be converted, as demonstrated in the code below:
col_name = "orangered1"
col_rgb = t(col2rgb(col_name) / 255)
col_hex = rgb(col_rgb)
# named colour
ggplot(data = social, aes(x = unemployed, y = noqual)) +
  geom_point(alpha = 0.1, size = 0.7, colour = col_name, shape = 1) +
  theme_minimal()
# hexadecimal colour
ggplot(data = social, aes(x = unemployed, y = noqual)) +
  geom_point(alpha = 0.1, size = 0.7, colour = col_hex, shape = 1) +
  theme_minimal()
# RGB colour with conversion
ggplot(data = social, aes(x = unemployed, y = noqual)) +
  geom_point(alpha = 0.1, size = 0.7, colour = rgb(col_rgb), shape = 1) +
  theme_minimal()
The next step is to think about dates. The as.Date and related functions are key for manipulating dates in R. The code below attaches '2018' as the year and '01' as a nominal day to the month of each record in the filtered data, creating a formal date attribute and building on the code above:
# add date
prescriptions %>%
  filter(str_starts(bnf_code, "04030")) %>%
  mutate(date = as.Date(paste0("2018-", month, "-01"))) %>%
  head()
The same pipeline can then be extended to group by date and plot the monthly counts as a time series with a trend line:
prescriptions %>%
  # extract data
  filter(str_starts(bnf_code, "04030")) %>%
  # add date
  mutate(Date = as.Date(paste0("2018-", month, "-01"))) %>%
  # group by date and summarise
  group_by(Date) %>% summarise(Count = n()) %>%
  # and plot the line with a trend line
  ggplot(aes(x = Date, y = Count)) + geom_line() +
  geom_smooth(method = "loess")
Note the use of the group_by and summarise within the piped
commands above.
This can be extended to compare different trends for different groups of
drugs. The code below compares respiratory prescriptions associated
with asthma (BNF code 03010). Notice the use of the nested ifelse
within the call to mutate followed by the use of filter to extract the
two sets of prescriptions and the grouping by two variables as in Figure
5.13:
prescriptions %>%
  # extract data
  mutate(Condition = ifelse(str_starts(bnf_code, "04030"), "SAD",
                     ifelse(str_starts(bnf_code, "03010"), "Asthma", "Others"))) %>%
  filter(Condition != "Others") %>%
  # add date
  mutate(Date = as.Date(paste0("2018-", month, "-01"))) %>%
  # group by date and summarise
  group_by(Date, Condition) %>% summarise(Count = n()) %>%
  ungroup() %>%
  # and plot the line with a trend line
  ggplot(aes(x = Date, y = Count, colour = Condition)) + geom_line() +
  geom_smooth(method = "loess") + theme_bw() +
  scale_color_manual(values = c("#00AFBB", "#E7B800"))
Figure 5.13 Time series plots for prescriptions associated with seasonal
affective disorder (SAD) and asthma
Database version
The database version of the above is slightly different due to nuances in
the way that dplyr interfaces with SQL using the DBI package. The
code below connects to the full prescriptions data table in
prescribing.sqlite that was introduced in Chapter 4. You should
connect to this database. Note that you may have to change your
working directory to the folder location of the 11.2 GB file you
downloaded in Chapter 4.
library(RSQLite)
db <- dbConnect(SQLite(), dbname = "prescribing.sqlite")
One of the issues with the dplyr database interface is that it does not translate the paste0 function as above, so this is moved to the ggplot part of the code, and a different pattern-matching approach from str_detect has to be used (see Chapter 4). Also note that in the extended plot, the selection and labelling have a different syntax. The code below takes some time to run because it summarises data from 120 million records:
tbl(db, "prescriptions") %>%
  # extract data
  filter(bnf_code %like% '04030%') %>%
  # group by month and summarise
  group_by(month) %>%
  summarise(Count = n()) %>%
  ungroup() %>%
  # plot
  ggplot(aes(x = as.Date(paste0("2018-", month, "-15")), y = Count)) +
  geom_line() + xlab("Date") +
  geom_smooth(method = "loess")
Key Points
Trend analysis in ggplot requires temporal data to be in date
format, created using the as.Date function.
5.8 SPATIAL EDA
The final section of this examination of EDA with ggplot describes how
it can be used with spatial data. The tmap package (Tennekes, 2018)
was formally introduced in Chapter 2, with further examples in Chapter 3,
and it was used to generate maps of the database query outputs in
Chapter 4. However, the ggplot2 package can also be used to visualise
spatial data. This section provides some outline examples of how to do
this with point and area data.
The code below subsets the lsoa_sf layer for Nottingham and links the
result to the LSOA prescribing attributes in lsoa_result:
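A minimal sketch of this step is given below; the LSOA name field (lsoa_name) and the join key (lsoa_id) are assumptions and may differ from the actual attribute names in lsoa_sf and lsoa_result:
# subset the Nottingham LSOAs and join the prescribing results
# (lsoa_name is an assumed field; lsoa_id is assumed to be the join key)
nottingham <- lsoa_sf %>%
  filter(str_detect(lsoa_name, "Nottingham")) %>%
  inner_join(lsoa_result, by = "lsoa_id")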
Figure 5.14 The map of the LSOAs in Nottingham created using ggplot
The geom_sf function adds a geometry stored in an sf object to the
plot. It looks for an attribute called geom or geometry in the sf object.
On occasion you may need to specify this as in the code snippet below to
reproduce Figure 5.14:
ggplot(nottingham) + geom_sf(aes(geometry = geom))
It is also possible to add scale bars and north arrows using some of the
functions in the ggspatial package, and to control the appearance of
the map using both predefined themes (such as theme_bw) and specific
theme elements. Examples of some of these are shown in the code
below to create Figure 5.15:
ggplot(nottingham) +
  geom_sf(aes(geometry = geom, fill = cost_pp), size = 0.1) +
  scale_fill_viridis_c(name = "Opioid costs\nper person") +
  annotation_scale(location = "tl") +
  annotation_north_arrow(location = "tl", which_north = "true",
                         pad_x = unit(0.2, "in"),
                         pad_y = unit(0.25, "in"),
                         style = north_arrow_fancy_orienteering) +
  theme_bw() +
  theme(panel.grid.major = element_line(color = gray(.5),
                                        linetype = "dashed", size = 0.3),
        panel.background = element_rect(fill = "white"))
Figure 5.15 Map of opioid prescribing costs per person in Nottingham created using ggplot with cartographic enhancements
Finally, as with tmap, multiple data layers can be included in the map.
The code below creates and then adds the locations and names of some
randomly selected LSOAs to the map:
# select locations
set.seed(13)
labels.data = nottingham[sample(1:nrow(nottingham), 10), ]
labels.data = cbind(labels.data,
                    st_coordinates(st_centroid(labels.data)))
# create ggplot
ggplot(nottingham) +
  geom_sf(aes(geometry = geom), fill = "white", size = 0.1) +
  geom_point(data = labels.data, aes(x = X, y = Y), colour = "red") +
  geom_text(data = labels.data, aes(x = X, y = Y, label = lsoa_id),
            fontface = "bold", check_overlap = TRUE) +
  theme_bw() +
  theme(axis.title = element_blank())
It is possible to use ggplot and tmap to visualise analyses of the spatial structure of data variables. The key thing of interest, which cannot be directly determined using the aspatial bivariate and multivariate approaches (correlations, correspondence tables, etc.), is the locations where, for example, correlations are high or low. The remainder of this section outlines some approaches for determining similar properties, but with a spatial flavour.
The code below does a similar thing to what was done in some of the
ggpairs plots above that partitioned the social data around median
unemployment. This time it is done using the nottingham spatial object
with the opioid prescribing results as presented in Figure 5.16. This
shows that there are clear spatial trends, with a classic doughnut effect
around the city centre. Both tmap and ggplot approaches are shown:
# tmap
p1 = nottingham %>%
  mutate(cost_high = ifelse(cost_pp > median(cost_pp),
                            "High", "Low")) %>%
  tm_shape() +
  tm_graticules(ticks = FALSE, col = "grey") +
  tm_fill("cost_high", title = "Cost pp.",
          legend.is.portrait = FALSE) +
  tm_layout(legend.outside = FALSE,
            legend.position = c("left", "bottom"))
# ggplot
p2 = nottingham %>%
  mutate(cost_high = ifelse(cost_pp > median(cost_pp),
                            "High", "Low")) %>%
  ggplot() +
  geom_sf(aes(geometry = geom, fill = cost_high), size = 0.0) +
  scale_fill_discrete(name = "Cost pp.") +
  theme_bw() +
  theme(panel.grid.major = element_line(color = gray(.5),
                                        linetype = "dashed", size = 0.3),
        panel.background = element_rect(fill = "white"),
        legend.position = "bottom",
        axis.text = element_text(size = 6))
# plot together using grid
library(grid)
# clear the plot window
dev.off()
pushViewport(viewport(layout = grid.layout(2, 1)))
print(p1, vp = viewport(layout.pos.row = 1))
print(p2, vp = viewport(layout.pos.row = 2))
Figure 5.16 LSOAs with high and low areas of opioid prescribing using
(top) tmap and (bottom) ggplot approaches
Rather than a binary approach to folding the data (e.g. around the median as above), another way is to use the median as an inflection point and to use colour to indicate the degree to which individual LSOAs are above or below it. Divergent palettes can help here, as in Figure 5.17. Notice the placement of the legend:
ggplot(nottingham) +
  geom_sf(aes(geometry = geom, fill = cost_pp), size = 0.0) +
  scale_fill_gradient2(low = "#CB181D", mid = "white", high = "#2171B5",
                       midpoint = median(nottingham$cost_pp),
                       name = "Cost pp\naround median") +
  theme_bw() +
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        legend.position = "bottom",
        legend.direction = "horizontal")
Figure 5.17 LSOA opioid prescribing costs, shaded above (blue) and
below (red) the median rate
The choice between ggplot and tmap approaches may relate to the
user’s norms about maps, which many people believe should always be
accompanied by a scale bar and a north arrow. Classic map production
with common cartographic conventions such as legends, scale bars and
north arrows is readily supported by tmap, and although possible with
ggplot2 is less directly implemented in the latter. A comprehensive
treatment of the use of the tmap and ggplot2 packages to map spatial
data is given in Brunsdon and Comber (2018).
Chapter 8 describes a number of alternative visualisations of spatial data
using hexbins and different kinds of cartograms.
Key Points
Spatial EDA (mapping!) can be undertaken with ggplot as well as
tmap using the geom_sf function.
This is supported by the ggspatial package which contains
functions for map embellishments (scale bars, north arrows, etc.).
Continuous spatial data properties can be visualised by partitions
(e.g. around the mean or median) and supported by divergent
palettes.
5.9 SUMMARY
R/RStudio provides a fertile environment for the development of tools for
visualising data, data attributes and their statistical and distributional
relationships.
It is impossible to capture the full dynamics of this information
environment: new webpages, blogs, tutorials and so on appear
constantly. In this context the aims of this chapter are not to provide a
comprehensive treatment of ggplot2 or of all the different mapping
options for spatial data. However, we have illustrated many generic
ggplot visualisation techniques with some implicit design principles,
highlighting why the ggplot2 package has become one of the default
packages for data visualisation.
A range of different types of visualisation have been used to illustrate
specific points in previous chapters, including lollipop charts, scatter
plots, boxplots, histograms, density plots and pairwise correlation plots.
This chapter has sought to bring these together, identifying some
common themes, for example the use of layered approaches to graphics
and maps, and highlighting particular techniques. We have, of course, not covered everything – bar charts, parallel plots and calendar heatmaps, for example. The potential for visualisation refinement with
ggplot2 and its layered approach is endless. The code snippets in this
chapter illustrate the syntax of ggplot and provide the reader with
baseline skills to allow them to understand how other plot types could be
implemented.
A further point is that ggplot2 provides the engine for much of the
development of visualisation tools in R. It has been adapted and
extended in many other packages – have a look at the packages
beginning with gg on the CRAN website (https://cran.r-
project.org/web/packages/available_packages_by_name.html), some of
which have been used in this and other chapters:
GGally was used to produce the ggpairs plots but has other very
useful functions such as ggcoef() which plots regression model
coefficients. Consider the aim to model opioid costs (cost_pp) from
the other socio-economic variables in the social dataset using
regression. The code below creates a model of cost_pp, prints out
the model coefficients and plots them with 95% confidence intervals
using the ggcoef() function:
df = nottingham %>% st_drop_geometry() %>%
  inner_join(social) %>%
  select(-c(lsoa_id, population, employed, tot_cost,
            ruc11_code, ruc11, oac, oac_code))
regression.model <- lm(cost_pp ~ ., data = df)
round(summary(regression.model)$coefficients, 4)
ggcoef(regression.model, errorbar_height = .2,
       color = "red", sort = "ascending") +
  theme_bw()
You will also need to download some house price and socio-economic
data for Liverpool. Details of this are in Section 6.3.
In the physical sciences, there is quite often a very rigorous derivation of
the relationship between a set of quantities, expressed as an equation.
However, measurements may be necessary to calibrate some physical
constants in the equation. An example of this is van der Waals’ equation
of state for fluids (van der Waals, 1873):
\[ \left(P + \frac{an^2}{V^2}\right)(V - nb) = nRT \tag{6.1} \]
This is a modification of the ideal gas law:
\[ PV = nRT \tag{6.2} \]
where P is pressure, V is the volume occupied by the fluid, R is the universal gas constant,1 T is temperature based on an
absolute zero scale (e.g. kelvin), and n is the amount in moles (molecular
weight in grams) of the fluid. The two remaining quantities, a and b, are
constants for a given fluid – and are universally zero in the ideal gas law.
Since it is possible to measure or calculate V, P, n and T with good
accuracy, and R is known, it is possible to take observations over a
variety of values for a given fluid, and then use non-linear regression
techniques to obtain estimates for a and b.
1 Approximately 8.314 J mol⁻¹ K⁻¹ – the product of Boltzmann's constant and Avogadro's number.
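To illustrate this kind of calibration, the sketch below simulates noisy (P, V, T) measurements for one mole of a fluid, using illustrative values of a and b in SI units, and then recovers the two constants by non-linear least squares with nls():
# simulate measurements for n = 1 mol with 'true' a and b (illustrative values)
set.seed(1)
R_const <- 8.314                       # universal gas constant, J mol^-1 K^-1
a_true <- 0.1382; b_true <- 3.19e-5    # illustrative van der Waals constants
vdw <- data.frame(V = runif(50, 2e-4, 1e-3),   # volume (m^3)
                  Temp = runif(50, 250, 400))  # temperature (K)
vdw$P <- R_const * vdw$Temp / (vdw$V - b_true) - a_true / vdw$V^2 +
  rnorm(50, sd = 500)                  # pressure (Pa) with measurement noise
# estimate a and b by non-linear least squares (bounds keep V - b positive)
vdw_fit <- nls(P ~ R_const * Temp / (V - b) - a / V^2, data = vdw,
               start = list(a = 0.1, b = 1e-5),
               algorithm = "port", lower = c(a = 0, b = 0),
               upper = c(a = 1, b = 1e-4))
coef(vdw_fit)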
The point here is not to provide a tutorial on physical chemistry, but to
spotlight a certain approach to modelling. In the basic (so-called ‘ideal’)
gas law, molecules are presumed to be zero-dimensional points
(occupying no volume) and intermolecular attraction is also ignored. Van
der Waals considers the implication of dropping these two assumptions,
and in each case linking these with an explicit and rigorously thought-
through mathematical modification. Essentially the coefficient a allows for
intermolecular attraction, and the coefficient b allows for the volume
occupied by each molecule – these will vary for different fluids. It is
unlikely that anyone would pick that equation out of thin air – the model is
the outcome of quite precise mathematical (and physical) reasoning.
However, this is not the only way that models are produced. Quite often
the objective is to understand processes for which data can be collected,
and it is expected that certain variables will be connected, but there is no
clear pathway (as with the van der Waals example) to deriving a
mathematical model such as equation (6.1). Thus, it may be expected
that house prices in an area are related to, say, site properties (i.e.
properties of the house itself), neighbourhood properties and
environmental properties. These are essentially the assumptions made
by Freeman (1979). Suppose we had a number of measurements
relating to each of these, respectively named $S_1, \dots, S_{k_1}$, $N_1, \dots, N_{k_2}$ and $E_1, \dots, E_{k_3}$. If P is a house price then we might expect P to depend on the other characteristics. Generally we could say
\[ P = f(S_1, \dots, S_{k_1}, N_1, \dots, N_{k_2}, E_1, \dots, E_{k_3}) \tag{6.3} \]
Here, it is stated that some kind of function links the quantities, but it is
just referred to as f – a general function. A further step towards reality is
to state
\[ P = f(S_1, \dots, S_{k_1}, N_1, \dots, N_{k_2}, E_1, \dots, E_{k_3}) + \epsilon \tag{6.4} \]
where ϵ is a stochastic (random) element. This is generally recognised
now as a hedonic price equation. Here, a great deal of thought has still
gone into the derivation of this model, building on Rosen (1974),
Freeman (1974) and Lancaster (1966); however, there is no clear
guidance as to the form that f should take. For example, Halvorsen and
Pollakowski (1981) argue that ‘[the] appropriate functional form for a
hedonic price equation cannot in general be specified on theoretical
grounds’ and that it should be determined on a case-by-case basis. They
go on to produce a quite flexible model (involving Box–Cox power
transformations of the predictors of house price choice, and linear and
quadratic terms of these), where statistical tests may be used to decide
whether some simplifications can be made. However, this model is
justified essentially by its flexibility, rather than direct derivation in the
style of van der Waals. In a hundred years a progression can be seen in
which the set of applications for quantitative modelling has widened (from
the physical to the economic), but the certainty in the form of the model is
reduced.
This trend has perhaps accelerated in more recent decades, to a situation where machine learning is proposed as a means of finding a functional form (without necessarily stating the function itself), with very little consideration given to the underlying process generating the data. Indeed, this is regarded by some (e.g. Anderson, 2008) as a major advantage of the approach.
However, while the situation has accelerated, opinion has diversified, and
others would argue strongly against a theory-free approach – see Kitchin
(2014), Brunsdon (2015) or Comber and Wulder (2019) for examples. In
fact there is a spectrum of opinion on this, perhaps leading to a diverse
set of approaches being adopted. As well as work on ‘black box’ models
of machine learning and model fitting, a number of new approaches have
emerged from the statistical community – for example, generalised
additive models (Hastie and Tibshirani, 1987) where a multivariate
regression of the form
\[ y = f_1(x_1) + f_2(x_2) + \dots + f_k(x_k) + \epsilon \tag{6.5} \]
is proposed, in which $f_1, \dots, f_k$ are arbitrary functions and ϵ is a Gaussian error term with zero mean. The framework can be adapted to other
distributions in the exponential family, such as Poisson or binomial, with
appropriate transformations of y. Although this form is not entirely
general, it does have the property discussed earlier that it is quite flexible
– providing flexibility in the situation where there is no obviously
functional form for the model. There are statistical tests proposed to
evaluate whether certain terms should be included in the model, or
whether some of the arbitrary fs could be replaced by a linear term.
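To make this form concrete, a GAM can be fitted in R with, for example, the mgcv package; the choice of response and predictors below (from the social data used in Chapter 5) is purely illustrative:
library(mgcv)
# each s() term plays the role of one of the arbitrary functions in (6.5)
gam_example <- gam(unemployed ~ s(noqual) + s(l4qual), data = social)
summary(gam_example)
The summary reports effective degrees of freedom and approximate significance for each smooth term, which speaks directly to the question of whether a term could be dropped or reduced to a linear form.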
Another reason why attention is paid to models with a range of statistical
sophistication is the advent of ‘big data’. More flexible models tend to
require a greater degree of computational intensity, and sometimes they
do not scale well when the size of datasets becomes large. It may be
more practical to use a simple linear regression model – if it is a
reasonable approximation of reality in the situation where it is applied –
rather than a more complicated model.
Also, more flexible models have more parameters to estimate, and for
equivalent-sized data samples they will be more prone to sampling
variability. This may be seen as a price worth paying if there is no reason
to adopt a particular model, but if there is a strong theoretical argument
that a specific mathematical form should apply then there is no need to
use a more general – and more variable – approach. Sometimes a
balance has to be struck between an inaccurate model and an unreliably
calibrated model.
Finally, the term ‘inaccurate’ is deliberately used here, rather than
‘wrong’. There are degrees of inaccuracy. Newton’s laws predict
planetary motion to an extremely high degree of accuracy, despite being
proven ‘wrong’ by Einsteinian relativity. It took a very long time and a
major theoretical re-evaluation to identify a need for revision.
Key Points
Some models are directly derived from considered and precise
mathematical and physical reasoning about the process being
considered (as in the van der Waals example).
Others are constructed through an understanding of the process
being considered, and this may be subjective (as in the house price
example).
The applications for quantitative modelling have widened from the
physical sciences to the economic and others, but the certainty in the
form of the model is less.
This has been exacerbated by the increase in machine learning
applications which are good at finding an appropriate functional form
(they are flexible), but with very little consideration of the underlying
process.
6.2 QUESTIONS, QUESTIONS
Much of the purpose of modelling is to answer questions – for example,
‘By how much would the average house price in an area increase if it was
within 1 km of a woodland area?’, ‘What is the pressure of 1 mole of
hydrogen gas contained in a volume of 500 cm³ at a temperature of 283
K?’, or ‘Is there a link between exposure to high-volume traffic and
respiratory disease?’. Questions may be answered with varying degrees
of accuracy, and varying degrees of certainty. The gas pressure question
can be answered accurately and with a very high degree of confidence
(49.702 atm). The others less so – for the house price case even the
question asked only for the average house price, given a number of
predictors. In reality it might also be useful to specify some kind of range
– or at least an indicator of variability. The last question implies a yes/no
response. One could make such a response, but realistically this would
have some probability of being wrong. It would be helpful if that
probability could be estimated.
Perhaps something to be deduced from these comments is that there is a
need – in many cases – not only to answer a question, but also to obtain
some idea of the reliability of the answer. In some cases this might be
relatively informal, while in others a more formal quantified response can
be considered – for example, a number range, a probability distribution
for an unknown quantity, or a probabilistic statement on the truth of an
answer. The latter will also require data to be collected and provided.
Part of this problem can be addressed by statistical theory and probability
theory, but a number of other issues need to be addressed as well – for
example, the reliability of the data, or the appropriateness of the
technique used to provide numerical answers. Good data science should
take all of these into account. In the following subsections, a number of
these issues will be addressed, with examples in R. These examples do
not present the only way to answer questions, but are intended to give
examples of thinking that could be of help. If readers can come up with
effective critiques of the approaches, one intended aim of this book will
be achieved!
6.2.1 Is this a fake coin?
Probability-based questions are possibly some of the simplest to answer,
particularly with the use of R. Here, there is a relatively small amount of
data, but similar approaches using larger datasets are possible. The key
idea here is to think about how the question can be answered in
probability terms.
Suppose you are offered a choice of two coins, one known to be genuine,
and one a known fake. The coin you take is flipped five times, and four
heads are obtained. We also know that for this kind of coin, genuine
versions are ‘fair’ (in that the probability of obtaining a head is 0.5). For
the fake coin, however, the probability is 0.6. Based on the result of coin
flipping above, is the coin a fake?
One problem here is that of equifinality (Von Bertalanffy, 1968) – coins
with head probability of either 0.5 or 0.6 could give four heads out of five.
Different processes (i.e. flipping a genuine coin versus flipping a fake
coin) could lead to the same outcome. However, it may be that the
outcome is less likely in one case than another – and in turn this could
suggest how the likelihoods for the two possible coin types differ. Here it
is equally likely that we have a fair or fake coin – due to the initial choice
of coins offered. The probability that the coin is fair, given the outcome,
could be calculated analytically. However, since part of the idea here is to
consider the possibility of more complicated outcomes, an experimental
approach will be used, as set out here:
1. Generate a large number (n) of simulated sets of five coin flips. To
represent the belief that the coin is equally likely to be fake or
genuine, select the chance of getting heads to be either 0.5 or 0.6
with equal probability.
2. Store the outcomes (number of heads), and whether the probability
was 0.5 or 0.6.
3. Select out those sets of coin flips with four heads.
4. Compute the experimental probability of the coin being genuine
(given the result) by finding the proportion of 0.5-based results.
This is done below. The seed for the random number generator is fixed
here, for reproducibility. probs contains the probability for each set of
coin flips. The results are obtained using the rbinom function, which
simulates random, binomially distributed results – effectively numbers of
heads when the coin is flipped five times. n_tests here is 1,000,000,
allowing reasonable confidence in the results. The filtering and proportion
of genuine/fake coin outcomes are computed via dplyr functions:
library(tidyverse)
set.seed(299792458)
n_tests <- 1000000
probs <- sample(c(0.5, 0.6), n_tests, replace = TRUE)
results <- tibble(Pr_head = probs, Heads = rbinom(n_tests, 5, probs))
results %>% filter(Heads == 4) %>%
  count(Pr_head) %>%
  mutate(prop = n / sum(n)) %>%
  select(-n)
## # A tibble: 2 x 2
## Pr_head prop
## <dbl> <dbl>
## 1 0.5 0.377
## 2 0.6 0.623
So, on the basis of this experiment it looks as though the odds of the
chosen coin being a fake (i.e. a coin with a heads probability of 0.6) are
about 2 : 1.
Now, suppose instead of being offered a choice of two coins, this was
just a coin taken from your pocket. There are fake coins as described
above in circulation (say, 1%) so the chance of the probability of getting a
head being 0.6 is now just 0.01, not 0.5. The code can be rerun (below), adapting the call to sample with its prob argument:
set.seed(299792458)
n_tests <- 1000000
probs <- sample(c(0.5, 0.6), n_tests, replace = TRUE, prob = c(0.99, 0.01))
results2 <- tibble(Pr_head = probs, Heads = rbinom(n_tests, 5, probs))
results2 %>% filter(Heads == 4) %>%
  count(Pr_head) %>%
  mutate(prop = n / sum(n)) %>%
  select(-n)
## # A tibble: 2 x 2
## Pr_head prop
## <dbl> <dbl>
## 1 0.5 0.983
## 2 0.6 0.0172
There is still some chance of the coin being fake, but now it is much less
than before (about 2%).
In fact these calculations could have been dealt with analytically. The coin flips follow a binomial distribution, so that if the probability of a head is θ, then the probability of m heads out of n flips is
\[ P(m \mid \theta) = \binom{n}{m}\,\theta^{m}(1-\theta)^{\,n-m} \tag{6.6} \]
When m = 4, n = 5 and θ = 0.5, for the true coin we have P(4 | 0.5) = 5/32. The binomial coefficient is defined using factorials, denoted by n! for the factorial of n:
\[ \binom{n}{m} = \frac{n!}{m!\,(n-m)!} \tag{6.7} \]
In R the probability of four heads out of five flips for the true coin can be calculated as follows:
factorial(5)/(factorial(4)*factorial(5-4)) * (0.5^4)*(1-0.5)^(5-4)
## [1] 0.15625
For the fake coin, with θ = 0.6, the probability of four heads out of five flips is 162/625:
MASS::fractions(factorial(5)/(factorial(4)*factorial(5-4)) * (0.6^4)*(1-0.6)^(5-4))
We can then use Bayes' theorem to work out the posterior probability that a coin is fake. The basic formulation gives the probability of one event given another as the likelihood multiplied by the prior, divided by the marginal probability of the observed event (the sum, over all possibilities, of likelihood × prior):
\[ P(\mathrm{Event}_1 \mid \mathrm{Event}_2) = \frac{P(\mathrm{Event}_2 \mid \mathrm{Event}_1)\,P(\mathrm{Event}_1)}{P(\mathrm{Event}_2)} \tag{6.8} \]
where
prior probability of Event1 = P(Event1),
marginal probability of Event2 = P(Event2),
posterior probability = P(Event1|Event2),
likelihood = P(Event2|Event1).
Here the prior probability (of the coin being fake) is 0.5, the likelihood (of four heads given a fake coin) is 162/625, and the evidence is the sum of likelihood × prior over both coin types: in this case there is an equal chance of having either a fake or a real coin.
Thus without doing any experimentation (i.e. when the prior probability of a coin being fake is 50%) we can determine the posterior probability of the coin being fake:
\[ P(\text{fake} \mid 4\ \text{heads}) = \frac{0.5 \times 162/625}{0.5 \times 162/625 + 0.5 \times 5/32} \approx 0.624 \tag{6.9} \]
which agrees with the experimental result to three decimal places. The
calculation can be carried out in R as follows, this time using the dbinom
function to generate the probabilities of m heads out of n flips for the two
coins rather than the use of the long-hand factorial approach above:
# prior probabilities
p_fake <- 0.5
p_gen <- 0.5
# probability of 4 heads
p_4_fake <- dbinom(4,5,0.6)
p_4_gen <- dbinom(4,5,0.5)
# posterior probability using Bayes
p_fake_after <- p_fake * p_4_fake /
  (p_fake * p_4_fake + p_gen * p_4_gen)
p_fake_after
## [1] 0.6239018
# show fraction/ratio
MASS::fractions(p_fake_after)
## [1] 355/569
Now suppose there is no prior reason to favour any particular value of θ: the coin is flipped 20 times and 13 heads are obtained. The experimental approach can be applied again, this time drawing θ from a uniform distribution on [0, 1] and keeping only those simulations that produce 13 heads out of 20:
library(ggplot2)
set.seed(602214086)
n_tests <- 1000000
probs <- runif(n_tests)
results <- tibble(Pr_head = probs, Heads = rbinom(n_tests, 20, probs))
ggplot(results %>% filter(Heads == 13), aes(x = Pr_head)) +
  geom_histogram(bins = 40, fill = 'darkred')
This gives an idea of the likely value of θ on the basis of the 20 flips. As before, Bayes' theorem, now with a more formal notation, can be used to obtain a theoretical distribution for θ given the outcome. If the experimental result (the 13 out of 20 heads) is referred to as D then
\[ P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{\int_0^1 P(D \mid \theta)\,P(\theta)\,\mathrm{d}\theta} \tag{6.10} \]
Here, it can be seen that the experimental approach has worked well. In
particular, this is hopefully a good omen for those situations where an
experimental approach is possible, but no theoretical analysis has been
proposed.
Figure 6.2 Comparison of theoretical and experimental distribution of θ
given D
The question here is of the kind ‘what is the value of some parameter?’,
rather than ‘is some statement true?’. This kind of analysis could be
turned into the latter kind, given that a distribution for θ has been
provided. For example, suppose there was concern as to whether the
coin was fair. Although with a continuous distribution it does not really
make much sense to ask what the probability is that θ = 0.5 exactly, one
could decide that if it were between 0.49 and 0.51 this would be
reasonable, and estimate this probability:
results %>%
  filter(Heads == 13) %>%
  mutate(in_tol = Pr_head > 0.49 & Pr_head < 0.51) %>%
  count(in_tol) %>%
  transmute(in_tol, prob = n / sum(n))
## # A tibble: 2 x 2
## in_tol prob
## <lgl> <dbl>
## 1 FALSE 0.969
## 2 TRUE 0.0307
Thus, the probability that θ lies in the interval (0.49, 0.51) is just over
0.03, given the coin-flipping result.
We can also modify this analysis in the light of further information.
Suppose, as before, you were offered two coins prior to flipping, a known
genuine one and a known fake one, and you choose one of the two, not
knowing which it was. This time you know genuine coins have θ = 0.5 but
do not know the value of θ for a fake. This prior knowledge of θ could be
simulated by randomly generating θ as follows:
1. Simulate ‘fake’ or ‘genuine’ with respective probabilities 0.5 for each
(to represent your knowledge of whether coin is fake).
2. If genuine, set θ to 0.5.
3. If fake, set θ to a uniform random number between 0 and 1.
Thus, θ has a distinct chance of being exactly 0.5 and an equal chance of
being anywhere between 0 and 1, with uniform likelihood. The code to
generate θ on this basis and simulate the coin flips is as follows:
set.seed(6371)
n_sims <- 1000000
theta <- ifelse(runif(n_sims) < 0.5, runif(n_sims), 0.5)
results3 <- tibble(Theta = theta, Heads = rbinom(n_sims, 20, theta))
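The proportion of simulations in which the coin is genuine, given 13 heads, can then be computed in the same way as before (a minimal sketch, mirroring the code used below for the 2000-flip case):
results3 %>%
  filter(Heads == 13) %>%
  mutate(genuine = Theta == 0.5) %>%
  count(genuine) %>%
  transmute(genuine, prob = n / sum(n))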
From this, it seems that in this case the likelihood of the coin being
genuine is around 0.6. This may seem strange given the earlier results,
but two things need to be considered:
The prior probability of the coin actually having θ = 0.5 is higher here – in the previous example the uniform prior placed no particular weight on any single value.
In the case of a fake coin, the prior expectation of θ exceeding 0.5
was the same as not exceeding this value.
As before, the evidence could be viewed graphically (see Figure 6.3):
results3 %>%
filter(Heads==13) %>%
ggplot(aes(x=Theta)) +
geom_histogram(bins=40,fill='navyblue')
So the outcome is still quite uncertain. This is something we will term the
paradox of replicability. Even if a very large number of observations have
been made of a random process, this makes the process no less
random. All that a large number of observations provides – provided they are appropriate observations – is a very reliable picture of the nature of the randomness. However, some seem to believe that more data will
give less error in predictions.
Here, this can be illustrated by imagining that instead of 20 coin flips (and
13 heads) you had 2000 coin flips with 1300 heads. This would give a
much more certain estimate of θ with almost no evidence supporting the
fact that the genuine coin was chosen:
set.seed(6371)
# Because there are many more possible outcomes we need to increase the
# number of simulations to get a reasonable sample with exactly 1300 heads
n_sims <- 100000000
# Then things are done pretty much as before
theta <- ifelse(runif(n_sims) < 0.5, runif(n_sims), 0.5)
results4 <- tibble(Theta = theta, Heads = rbinom(n_sims, 2000, theta))
results4 %>%
  filter(Heads == 1300) %>%
  mutate(Hp5 = Theta == 0.5) %>%
  count(Hp5) %>%
  transmute(Hp5, prob = n / sum(n))
## # A tibble: 1 x 2
## Hp5 prob
## <lgl> <dbl>
## 1 FALSE 1
This shows that none of the simulations resulting in exactly 1300 heads out of 2000 had θ = 0.500. However, these results can then be used to create a predictive distribution for the number of heads in 20 new flips:
set.seed(5780) # Reproducibility!
# Simulate the future coin flipping
results4 %>%
  filter(Heads == 1300) %>%
  mutate(`Heads predicted` =
           factor(rbinom(nrow(.), 20, Theta), levels = 0:20)) ->
  predictions2
# Show predicted outcome as a histogram
predictions2 %>%
  ggplot(aes(x = `Heads predicted`)) +
  geom_bar(fill = 'firebrick') +
  scale_x_discrete(drop = FALSE)
Figure 6.5 Predictive distribution based on the 2000 coin flip experiment,
for 20 new flips
The results in Figure 6.5 suggest there is now much stronger evidence
that the probability of flipping a head is 13/20 or 0.65, so that 13 heads is
a much more likely outcome than was suggested on the basis of just 20
flips – but there is still plenty of uncertainty!
This last experiment also suggests another characteristic of the kind of
experimental prediction used here. It is possible to predict the outcome of
processes that were not identical to the one used to estimate the
unknown parameter. In that last example the difference was trivial –
using a 2000 coin flip experiment to predict a 20 coin flip process. But
providing appropriate characteristics carry through, it is possible to
predict outcomes of relatively different processes. Thus if the same coin
were used in a different experiment, outcomes could be predicted using
the experimental distribution for θ already obtained.
A more subtle example will be illustrated next. Suppose you have the
evidence based on the 20 coin flips (i.e. 13 heads, simulated possible θ
values stored in results3) but in a new experiment the coin will be
flipped until six heads are seen. Here you wish to predict how many coin
flips this is likely to take. This can be modelled using the negative binomial distribution. If $n_t$ is the number of tails thrown in the sequence of flips carried out until $n_h$ heads are thrown, then
\[ P(n_t \mid n_h, \theta) = \binom{n_t + n_h - 1}{n_t}\,\theta^{\,n_h}\,(1-\theta)^{\,n_t} \tag{6.12} \]
Thus, the total number of flips is just $n_t + n_h$, where the distribution for $n_t$ is given by equation (6.12) and $n_h$ is fixed (here it takes the value 6). In R,
the function rnbinom generates random numbers from this distribution.
Thus an experimental prediction for the total number of coin tosses
(based on the 13 out of 20 outcome) can be created as follows:
set.seed(12011)
# define an integer to factor conversion function
int_to_factor <- function(x)
  factor(x, levels = min(x):max(x))
results3 %>%
  filter(Heads == 13) %>%
  mutate(`Heads predicted` =
           int_to_factor(rnbinom(nrow(.), size = 6, prob = Theta))) ->
  predictions
predictions %>%
  ggplot(aes(x = `Heads predicted`)) +
  geom_bar(fill = 'firebrick') +
  scale_x_discrete(drop = FALSE)
Figure 6.6 Predicted distribution of the number of flips needed to get six heads
Again, the outcome is not completely predictable, but it seems more likely
that around 8–10 flips will be needed to obtain the six heads (Figure 6.6).
Key Points
Models are used to answer questions with some indication of the
reliability of the answer.
One approach to reliability/uncertainty is through probability, which
indicates the likelihoods of different outcomes to the question.
Simulated models of these theoretical processes are relatively
simple to encode in R.
If the outcomes are unknown then these can be simulated with
suitable parameterisation in order to compute experimental
probabilities of different answers (such as the coin being true or
fake).
It is easy to generate likelihoods for different theoretical questions
and answers (e.g. if the probability of a fake changes to 0.01, or if
the question changes to the probability of getting a head).
Often there is no definitive answer to a true/false question; rather, there are probabilities of a value lying within a range (e.g. the difficulty of determining the probability that θ = 0.5 exactly versus θ lying between 0.49 and 0.51).
This is done by examining the distribution of the parameter values (θ) that lead to the observed outcome (e.g. 13 heads out of 20).
6.3 MORE CONCEPTUALLY DEMANDING
QUESTIONS
The above section could be said to make use of simplistic examples, but
the motivation was to introduce some of the typical validation and
assessment questions asked when modelling and to set out one possible
experimental methodology for answering them. The process being
modelled was that of flipping a coin, and although this was modelling a
random process, in one sense it resembles van der Waals’ approach as
set out in Section 6.1: for each of the processes described there is a
clearly defined (and derivable) model for the outcome based on a set of
assumptions about the coin flips, the way the coin was chosen, and the
characteristics of fake and genuine coins. However, as described in
Section 6.1, this kind of rigorous derivation may not be possible. For
example, it may be of interest to model house price as a function of local
unemployment rates. Whereas it seems plausible that the two variables
may be connected, there is no immediately obvious functional
relationship between the two quantities. This section provides a worked
example in a more ‘real-world’ context of assessing the validity of two
different potential house price models.
You will need to clear your workspace, load some packages and then
download and load the ch6.RData file to your R/RStudio session:
# load packages
library(tidyverse)
library(dbplyr)
library(sf)
library(tmap)
library(RColorBrewer)
library(spdep)
The code below downloads data for the worked example to your current
working directory. You may want to check this:
# get the current working directory
getwd()
# clear the workspace
rm(list = ls())
# download data
download.file("http://archive.researchdata.leeds.ac.uk/740/1",
              "./ch6.RData", mode = "wb")
# load the data
load("ch6.RData")
# examine
ls()
## [1] "lsoa" "oa" "properties"
This loads three spatial datasets in sf format as follows:
properties, a multipoint object of houses for sale in the Liverpool
area;
oa, a multipolygon object of Liverpool census Output Areas (OAs)
with some census attributes;
lsoa, a multipolygon object of Liverpool census Lower Super Output
Areas (LSOAs) with an LSOA code attribute.
The OAs nest into the LSOAs (typically each LSOA contains around five
OAs). The LSOAs will be used in this section to construct and evaluate
the models to be compared. The datasets, their spatial frameworks and
their attributes are fully described in Chapter 7, where they are explored
and used in much greater depth.
6.3.1 House price problem
Consider the task of constructing different house price models using
socio-economic data and deciding which one to use. This decision should
be underpinned by some form of model validation and assessment.
Two house price models are assessed in terms of their ability to
reproduce characteristics of the input dataset. Two ideas are useful, and
after they are explored, the model evaluation ideas from the coin-tossing
experiment can be applied to assess the two models. The two ideas are
as follows:
A model is of practical use if it can simulate datasets that are similar
in character to the actual data.
However, different models can be useful in different ways and in
different situations we may define ‘similar in character’ differently.
The two house price models both consider house prices to be a function
of geographical, social and physical factors, but with the factors provided
in different ways. The physical factors are:
whether the house is terraced;
whether any bedrooms have en suite facilities;
whether it has a garage;
the number of bedrooms it has.
The geographical factor is simply the northing value of the property. The data used in the analysis (compiled below) are projected in the OSGB projection, which records location in metres as an easting (the x coordinate, corresponding to longitude) and a northing (the y coordinate, corresponding to latitude). The northing is used because in the Liverpool area (Merseyside) there is a notable north–south trend in house price.
The two models model the social factors differently. For model 1, a
number of variables representing social and demographic conditions of
the local OA are included:
unemployment rate;
proportion of population under 65;
proportion of green space in the area (a crude proxy for quality of
life).
This is a relatively limited number of variables, but crudely encapsulates
socio-economic conditions, age structure and quality of life. However,
these are, at best, blunt instruments, although all of these factors
correlate with house prices.
Model 2 takes a different approach to representing the socio-economic
conditions of a house’s locality. In this case, the model uses a geo-
demographic classification of each census OA, as defined by the OAC as
introduced in earlier chapters. Each OA is classified into one of eight
groups, seven of which are present in the Liverpool data:
constrained city dwellers;
cosmopolitans;
countryside (not present in Liverpool!);
ethnicity central;
hard-pressed living;
multicultural metropolitans;
suburbanites;
urbanites.
The OAC, a categorical variable, is used as a predictor. Unlike the
approach in model 1, the OAC groups are the distillation of a large
number of variables (see Vickers and Rees, 2007; Gale et al., 2016).
However, by reducing this information into just eight distinct classes,
arguably some information is lost. Here, the interest is in which of these
two approaches – both with strengths and weaknesses – best models the
property price data.
6.3.2 The underlying method
In more formal terms, both models will model log house price as a linear
function of the predictor variables. Thus, model 1 in R modelling notation
is defined as follows:
log(Price)~unmplyd+Terraced+gs_area+Ensuite+Garage+as.numeri
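The corresponding fitting step might look like the sketch below, in which the bedrooms, northing and OAC variable names (Beds, Northing, OAC) and the exact form of the model 2 formula are assumptions:
# a sketch of the two fits (Beds, Northing and OAC are assumed names;
# data_anal is the analysis dataset used in the code that follows)
model1 <- lm(log(Price) ~ unmplyd + Terraced + gs_area + Ensuite + Garage +
               as.numeric(Beds) + Northing, data = data_anal)
model2 <- lm(log(Price) ~ OAC + Terraced + Ensuite + Garage +
               as.numeric(Beds) + Northing, data = data_anal)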
In both cases, coefficients seem to take the sign expected – for example,
the presence of a garage increases value.
We may now create simulated datasets for both of the models. The function get_s2 estimates the residual standard error (the square root of the residual variance) and the function simulate creates a simulation of the data, in terms of simulated house prices for each property. Here 1000 simulations of each dataset are created and stored in the matrices m1_sims and m2_sims. Each column in these matrices corresponds to one simulated set of prices for the properties.
# Helper functions - here
# 'm' refers to a model
# 'p' a prediction
# 's' a simulated set of prices
get_s2 <- function(m)
  sqrt(sum(m$residuals^2) / m$df.residual)
simulate <- function(m) {
  # get the model predictions of (log) price
  p <- predict(m)
  # random noise with sd equal to the residual standard error
  r <- rnorm(length(p), mean = 0, sd = get_s2(m))
  # create a simulated set of prices and return the result
  s <- exp(p + r)
  return(s)
}
# set up results matrices
m1_sims <- matrix(0, length(data_anal$Price), 1000)
m2_sims <- matrix(0, length(data_anal$Price), 1000)
# set a random seed and run the simulations
set.seed(19800518)
for (i in 1:1000) {
  m1_sims[,i] <- simulate(model1)
  m2_sims[,i] <- simulate(model2)
}
Next, the computation of Moran’s I of the mean house price for properties
in each LSOA is undertaken using tools in spdep and aggregation via
group_by. Note that as not all LSOAs have a property in them, the
simple features object lsoa_mp has a small number of ‘holes’ in it:
# determine LSOA mean house price
lsoa_mp <- data_anal %>%
  group_by(code) %>%
  summarise(mp = mean(Price)) %>%
  left_join(lsoa, .) %>%
  filter(!is.na(mp))
# create a weighted neighbour list of the LSOA areas
lw <- nb2listw(poly2nb(lsoa_mp))
# determine the spatial autocorrelation of LSOA mean house price
target <- moran.test(lsoa_mp$mp, lw)$estimate[1]
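The simulated Moran's I values for the two models (test1 and test2, used below) can be computed by aggregating each column of simulated prices in the same way. A sketch is given below; it assumes data_anal contains the LSOA code column used above and that the aggregated values are re-aligned with lsoa_mp before being passed to moran.test:
# a sketch: Moran's I for each simulated set of prices
sim_moran_i <- function(sim_prices) {
  mp_sim <- data_anal %>%
    mutate(sim_price = sim_prices) %>%
    group_by(code) %>%
    summarise(mp_sim = mean(sim_price))
  # align with lsoa_mp so the values match the neighbour list lw
  mp_sim <- lsoa_mp %>% st_drop_geometry() %>%
    select(code) %>%
    left_join(mp_sim, by = "code")
  moran.test(mp_sim$mp_sim, lw)$estimate[1]
}
test1 <- apply(m1_sims, 2, sim_moran_i)
test2 <- apply(m2_sims, 2, sim_moran_i)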
Now, test1 and test2 contain lists of the simulated values of Moran’s I
for each model. The assessment of the two models is carried out by
seeing how many of each sample are sufficiently close to the observed
value. An exact match is more or less impossible, but we could, for
example, say that a Moran’s I within 0.05 is acceptable. Thus, the count
of such acceptable values for each group is computed for model 1 (t1)
and model 2 (t2):
t1 <- sum(abs(test1 - target) < 0.05)
t2 <- sum(abs(test2 - target) < 0.05)
c(t1,t2)
## [1] 554 253
Next, a connection is set up. Note that you may have to change your
working directory to the folder location of the 11.2 GB file you
downloaded in Chapter 4:
db <- dbConnect(SQLite(), dbname = "prescribing.sqlite")
Recall that, working with dbplyr, the tbl command here does not
immediately create a table in R; it simply associates a database query to
extract a table from a database connection and stores this in an object.
All of the various dplyr-like operations essentially modify this query, so
that cost_tab associates the resultant (quite complex) SQL query to
obtain the table created at the end of this process.
So although it looks like a table, it is in fact an encoded query to a
database, with some tibble-like properties. You can check this by
examining the class of cost_tab:
class(cost_tab)
If the above query were sent to the database connection db this would
return the dataset used for the planned regression. However, there are
around 3.7 million records here (the large-n problem described above).
By using dbSendQuery and dbFetch it is possible to retrieve the results
of the query in chunks. Here chunks of 400,000 observations are used. In
the code below, first a query is sent to the database connection (using
dbSendQuery). The result of the query is not a data frame itself, but
another kind of connection (rather like a file handle) that can be used to
fetch the data – either as a whole or in chunks of n observations. Here,
the chunking approach is used. The handle (here stored in cost_res)
bookmarks its location in the table generated by the query, so that
successive calls lead to successive chunks of the resultant data frame
being returned.
In turn, biglm works in a similar way. The first call fits a linear regression model to the first chunk of data. Subsequent calls to update add an extra chunk of data to the model, giving the result of a linear regression fitted to all of the data seen so far. This works by accumulating $X^{T}X$ and $X^{T}y$ chunk by chunk – and also keeping a running estimate of β. Thus, the approach to fitting a large regression model is to read a chunk and add it to the model, until all of the dataset has been added. When the entire query has been fetched, an empty data frame is returned, so the approach described may be set up as a loop, with the exit condition being an empty data frame returned from dbFetch. Finally, dbClearResult closes off the handle (similar to closing a file), freeing up the database connection for any future queries (the database itself is disconnected later):
# Obtain the query used to obtain the required data
cost_query <- sql_render(cost_tab) %>% as.character()
# Set up the query to the database connection db
cost_res <- dbSendQuery(db, cost_query)
# Fetch the first chunk
chunk <- dbFetch(cost_res, n = 400000)
# fit an initial model to the first chunk (biglm provides the
# chunk-wise regression tools used here)
library(biglm)
cost_lm <- biglm(act_cost ~ unemployed + noqual, data = chunk)
# Keep fetching chunks and updating the model until none are left
counter = 1 # a counter to help with progress
repeat{
  chunk <- dbFetch(cost_res, n = 400000)
  if (nrow(chunk) == 0) break
  cost_lm <- update(cost_lm, chunk)
  cat(counter, "\t")
  counter = counter + 1
}
# Close down the query
dbClearResult(cost_res)
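The reason the chunk-by-chunk updating works is that the quantities needed for the least-squares estimate are simply sums of per-chunk contributions (a standard identity; internally biglm uses an incremental QR decomposition rather than forming these matrices explicitly):
\[ X^{T}X = \sum_{c} X_c^{T}X_c, \qquad X^{T}y = \sum_{c} X_c^{T}y_c, \qquad \hat{\beta} = \left(X^{T}X\right)^{-1}X^{T}y \]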
Once this step has been completed, the model may be inspected, to see
the fitted coefficient values, or predict new values:
# Print the results
summary(cost_lm)
Here the fitted coefficients (and their standard errors) show that the coefficient for low levels of education (noqual) has a positive association with prescription expenditure on antidepressants, while, perhaps surprisingly, unemployment is negatively associated.
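As a quick sketch of prediction, the fitted coefficient vector can be applied directly to new predictor values; the two profiles below are invented and their scales are assumed to match those of the data:
# predict expected costs for two hypothetical deprivation profiles
# (the values 0.05/0.15 and 0.10/0.35 are invented for illustration)
cf <- coef(cost_lm)
new_x <- cbind(1, unemployed = c(0.05, 0.15), noqual = c(0.10, 0.35))
as.vector(new_x %*% cf)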
This demonstrates how OLS regression can be applied to large datasets.
However, earlier generalised linear models were mentioned and
illustrated by equation (6.15). Recall that they are calibrated using an
iteratively refitted regression model. Thus, a process such as that above
has to be applied iteratively. This would require some coding to be set
up, including a ‘first principles’ implementation of the GLM calibration
procedure with the ‘chunking’ OLS method embedded. Fortunately, this is
simplified with the big GLM function bigglm in the biglm library.
This works in a similar way to biglm but is capable of fitting models
where the y-variable has distributions other than Gaussian. For example,
it could be assumed that y has a gamma distribution, and that g is the log
function. The distribution of the actual prescription cost is then modelled
as
\[ f(y) = \frac{1}{\Gamma(\nu)}\left(\frac{\nu}{\mu}\right)^{\!\nu} y^{\,\nu-1}\exp\!\left(-\frac{\nu y}{\mu}\right) \tag{6.18} \]
where µ is the expected value of y and ν is a shape parameter: when ν = 1, y has an exponential distribution, and when ν > 1 it has an asymmetric distribution with a long tail (see Figure 6.7). Here, E(y) = µ and Var(y) = µ²/ν.
This model seems sensible. Although it cannot be justified in the rigorous
way that van der Waals’ model can be, it provides a reasonable schema
that may justify exploration. Here we are considering the situation where
prescription costs depend on some aspects of socio-economic
deprivation, and where they also have a distribution with a relatively
heavy ‘upper tail’ where a small number of prescriptions are very
expensive. Also, the logarithmic link suggests that regression effects are
multiplicative, so that for more expensive drugs, a unit change in a
deprivation-related attribute (qualifications, unemployment, etc.) is
associated with a change in prescription that is proportional to the cost of
the drug.
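A minimal sketch of such a make_fetcher helper is given below; it follows the documented bigglm convention that the data argument can be a function of a single reset argument, returning successive chunks of data and NULL when no rows are left:
# a sketch of a chunk-fetching helper for bigglm
make_fetcher <- function(db, query, chunksize) {
  res <- NULL
  function(reset = FALSE) {
    if (reset) {
      # (re)start the query, clearing any previous result set
      if (!is.null(res)) dbClearResult(res)
      res <<- dbSendQuery(db, query)
      return(NULL)
    }
    chunk <- dbFetch(res, n = chunksize)
    if (nrow(chunk) == 0) return(NULL)
    chunk
  }
}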
Once this function – essentially a tool to set up the bigglm call – has
been defined, the model may be fitted. The query used to extract the data
needed is still stored in cost_query, so the fetching function may be
made based on this. Similarly, the database connection was set up
earlier, and stored in db. Thus, the model may be set up as below, and
this will take time to run – times will vary depending on your computer!
library(lubridate) # To time the process
# Call the 'fetcher' function
cost_df <- make_fetcher(db, cost_query, 500000)
# Run the glm
starting_point <- now()
cost_glm <- bigglm(act_cost ~ unemployed + noqual,
                   data = cost_df,
                   family = Gamma(log),
                   start = c(4.2, -0.004, 1.007),
                   maxit = 35)
duration <- now() - starting_point
duration
## Time difference of 4.153473 mins
Note that the p-values associated with the coefficients are very low,
suggesting that there is strong evidence that they are not zero. However,
it should also be noted that the value of unemployed is very low, which
suggests that although not zero, the effect attributable to it is very small –
a unit change in the unemployment score results in a change in the log of the expected prescription cost of –2.5240, or a multiplicative factor of around 0.0801, suggesting minimal practical implications. However, the
noqual factor, at 1.1147 for the logged expectancy, translates to a
multiplicative change of 3.0488, a more notable effect, with a
multiplicative increase in costs of around 3 for each antidepressant
prescription (suggesting larger doses?) for each additional percentage of
the population with no qualifications.
6.4.3 A random subset for regressions
Above it was noted that a good starting guess for the regression
coefficients was helpful, and that this could be obtained by initially
calibrating the model with a subset of the full data, chosen at random.
One issue here is that selecting a random subset from an external
database also requires consideration. Such a subset can be selected via
an SQL query, but sometimes there are some issues. First, although
dbplyr can be used on an external database, operations like mutate
are restricted to the operations permitted by the SQL interface of the
external database. An issue here is that random number generation in
SQLite is far less flexible than that provided in R, essentially consisting
of a single function random that generates random 64-bit signed
integers. These are integers between –9,223,372,036,854,775,808 and
+9,223,372,036,854,775,807 with equal probabilities of occurrence.
There are two ways that random could be used to select a random
subset, of size M, say:
1. Assign a random number to each record, sort the data according to
this number, then select the first M of them.
2. By using modulo and division, alter the range of the random number
to [0, 1] and then select those records where the new number is
below M/N. This can be achieved approximately by rescaling to [0,
1,000,000] and making the threshold 1,000,000M/N.
Option 1 has the advantage of guaranteeing exactly M cases in the
subset, whereas option 2 provides this on average. However, option 1
requires sorting a large external database (which can take some time),
whereas option 2 does not. Since in this situation the exact size of the
subset is not particularly important, option 2 is used. This can be done
using dbplyr but extending the original query used to extract the
required prescription data. Suppose here a subsample of size 4443 is
required.
This can be achieved with the code below. First the random subset
selection is applied to the cost_tab external table option. This is then
fed to the standard glm function (not bigglm, since the dataset is
smaller). The collect function here takes the remote connection
database query and converts it to a standard tibble:
# Augment the data query
set.seed(290162)
cost_glm_sample <- cost_tab %>%
  mutate(r = abs(random() %% 1000000)) %>%
  filter(r < 1000000 * 4443.0 / 3678372) %>%
  collect()
ss_cost_glm <- glm(act_cost ~ unemployed + noqual,
                   data = cost_glm_sample,
                   family = Gamma(log))
coef(ss_cost_glm)
## (Intercept)  unemployed      noqual
##    3.814116   -2.251489    1.088021
dbWriteTable(db, "glm_prescription_sample", cost_glm_sample)
As can be seen, although the results are slightly less accurate, the model fitting has been sped up by a factor of around 7.9.
Finally, close the database:
dbDisconnect(db)
Key Points
Generalised linear models are powerful statistical tools for working
with data whose distribution is a member of the exponential family of
distributions.
They use a link function to map the expected value of the target variable (which may follow a Gaussian, binomial, Poisson, etc., distribution) onto the linear predictor.
However, there can be problems when standard regression approaches are applied to datasets with large m (fields) and n (observations).
Chunking approaches to OLS and generalised regression were
illustrated using the biglm and bigglm functions applied to the
prescribing data introduced in Chapter 4.
A starting guess at the coefficients is helpful, and these can be
identified by taking a random subset of the data. However, care
needs to be taken if using the internal database functions for
generating these.
Some approaches for speeding up GLM parameter estimation for
large datasets are suggested.
6.5 QUESTIONING THE ANSWERING
PROCESS AND QUESTIONING THE
QUESTIONING PROCESS
To conclude this chapter we wish to provide some final advice regarding
modelling and testing model assumptions. In particular, we have outlined
two key types of inference – classical and Bayesian – and have
suggested how these may be used to assess accuracy of model
calibration, test a hypothesis or provide probabilistic predictions (although
probabilities mean subtly different things for Bayesian and classical
inference). The general idea of this chapter has been to provide some
guidelines for thinking about models, and how they relate to data. The
following chapter will concentrate more on the use of fitting and prediction
techniques in a more machine learning framework. Since both
approaches appear in practice, it is important to be aware of the ideas
underlying each of them.
However, perhaps it is equally important not just to look at the interplay
between models and data, but to question the model itself, or the veracity
of the data used to calibrate it. In particular, one thing frequently
assumed is that the model chosen is actually correct – both Bayesian
and classical approaches hinge on this. Some hypothesis tests allow for
nested alternative models, where one kind of model is a subset of
another (e.g. model B may be a subset of regression model A where
some of the coefficients are zero). A broader approach may consider
several competing models that are not nested – for example, gamma
versus Gaussian errors in the example given earlier. There are some
techniques that allow this (e.g. the use of Bayes factors), and these may
be necessary in some circumstances. A guide is given in Bernardo and
Smith (2009).
Similarly, the reliability of the data must also be considered. There are a
number of checks (often visual) that may be done to identify outliers.
These can be genuine (e.g. a particular house in a database was sold at
an incredibly high price) but may also be attributable to recording error
(when someone typed in the house price they accidentally hit the trailing
zero too many times). In either case it is important to identify them, and
scrutinise them to identify, if possible, which kind of outlier they are.
Arguably a dataset with a number of genuine outliers needs an
appropriate probability distribution to model it, perhaps one with heavier
tails than a Gaussian distribution.
Another possibility is that the entire dataset is suspicious. One way of
checking this – often used in forensic accounting – is to call on Benford’s
law (Benford, 1938). This makes use of the observation that in many real-
life datasets the first digit in numerical variables is more likely to take a
low value. More specifically, if d1 ∈ {1, … , 9} is the first digit, then its
probability of occurrence in a large dataset is
$P(d_1) = \log_{10}\left(1 + \frac{1}{d_1}\right)$ (6.19)
Although not conclusive proof, datasets deviating from this may require
further scrutiny. The R package benford.analysis allows these kinds
of tests to take place.
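A minimal sketch of such a check, applied for illustration to the prescription cost sample extracted earlier (the choice of column and number of digits is an assumption), might be:
library(benford.analysis)
# compare the first-digit distribution of actual costs with Benford's law
bfd <- benford(cost_glm_sample$act_cost, number.of.digits = 1)
plot(bfd)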
In general we advise a sceptical approach, in the sense that nothing can
be beyond question. Indeed we would further encourage people to
consider the scenarios under which certain questions have arisen – and
for which answers are demanded – as well as the consequences of
reporting the results of the analysis. Question everything, including the
questions and those who ask them.
6.6 SUMMARY
This chapter has aimed to provide an in-depth examination of the
process involved in model development. First, it explored models built by
utilising ideas from some underlying theory, and then others not built with
such definite guidance. It demonstrated that some models can be derived
directly via theoretical reasoning about the process being considered,
with theory often evolving through empirical observation (as in the van
der Waals example), and others can be constructed in a more
exploratory way, with the intention of increasing the (possibly subjective)
understanding of the process being considered. The key issue in the
latter is that the certainty in the form that the model should take is
generally less (as in the house price models). This is the situation with
much modelling now: the ability to construct models (with less certainty of
form) has increased with the many machine learning applications and
non-parametric approaches. However, in some cases this has led to a decrease in consideration of the underlying processes. Second, this
chapter highlighted approaches for approximating the degree of
certainty/reliability of model results, using coin-tossing examples. These
demonstrated how knowing something about the inputs allowed
probability to be used to indicate the likelihood of different outcomes,
both theoretically and through simulation-based examples. This showed
how the likelihoods of different answers could be updated if the
underlying process changed, the difficulty in determining the probability of
precise answers, and the relative ease of being able to calculate
probabilities of answers being within a specific range. Third, this chapter
illustrated approaches for situations where the precise derivation of
probabilities and likelihoods is not possible, using a worked example of
competing house price models. Here simulated house price data were
constructed from the parameters of simple regression models. The
Moran’s I (in this case) of the simulated house price data was compared
with that of the actual data, thereby generating approximate model
probabilities, and allowing one model to be chosen over another. Next,
approaches for handling very large datasets in standard linear regression
models were illustrated using the biglm package and a number of
approaches for speeding up GLM parameter estimation for large datasets
were suggested. Finally, the importance of taking a critical approach to
modelling and the preparedness to question everything was emphasised.
REFERENCES
Anderson, C. (2008) The end of theory: The data deluge makes the
scientific method obsolete. Wired, 23 June.
www.wired.com/science/discoveries/magazine/16-07/pb_theory.
Atkinson, K. A. (1989) An Introduction to Numerical Analysis. (2nd edn).
New York: John Wiley & Sons.
Benford, F. (1938) The law of anomalous numbers. Proceedings of the
American Philosophical Society, 78(4), 551–572.
Bernardo, J. M. and Smith, A. F. M. (2009) Bayesian Theory. Hoboken,
NJ: John Wiley & Sons.
Brunsdon, C. (2015) Quantitative methods I: Reproducible research and
human geography. Progress in Human Geography, 40(5), 687–696.
Cliff, A. D. and Ord, J. K. (1981) Spatial Processes: Models &
Applications. London: Pion.
Comber, A. and Wulder, M. (2019) Considering spatiotemporal
processes in big data analysis: Insights from remote sensing of land
cover and land use. Transactions in GIS, 23(5), 879–891.
Freeman, A. M. (1974) On estimating air pollution control benefits from
land value studies. Journal of Environmental Economics and
Management, 1(1), 74–83.
Freeman, A. M. (1979) Hedonic prices, property values and measuring
environmental benefits: A survey of the issues. Scandinavian Journal
of Economics, 81(2), 154–173.
Gale, C. G., Singleton, A., Bates, A. G. and Longley, P. A. (2016)
Creating the 2011 Area Classification for Output Areas (2011 OAC).
Journal of Spatial Information Science, 12, 1–27.
Halvorsen, R. and Pollakowski, H. O. (1981) Choice of functional form for
hedonic price equations. Journal of Urban Economics, 10(1), 37–49.
Hastie, T. and Tibshirani, R. (1987) Generalized additive models: Some
applications. Journal of the American Statistical Association, 82(398),
371–386.
Kitchin, R. (2014) Big data and human geography: Opportunities,
challenges and risks. Dialogues in Human Geography, 3(3), 262–267.
Lancaster, K. J. (1966) A new approach to consumer theory. Journal of
Political Economy, 74, 132–157.
Lumley, T. (2018) Fast generalised linear models by database sampling
and one-step polishing. Preprint, arXiv:1803.05165v1.
McCullagh, P. and Nelder, J. A. (1989) Generalized Linear Models (2nd
edn). London: Chapman & Hall.
Rosen, S. (1974) Hedonic prices and implicit markets: Product
differentiation in pure competition. Journal of Political Economy, 82(1),
34–55.
van der Waals, J. D. (1873) On the continuity of the gaseous and liquid
states. PhD thesis, Universiteit Leiden.
Vickers, D. and Rees, P. (2007) Creating the UK National Statistics 2001
output area classification. Journal of the Royal Statistical Society,
Series A, 170(2), 379–403.
Von Bertalanffy, L. (1968) General System Theory: Foundations,
Development, Applications. New York: George Braziller.
7 APPLICATIONS OF MACHINE LEARNING
TO SPATIAL DATA
7.1 OVERVIEW
This chapter introduces the application of machine learning to spatial
data. It describes and illustrates a number of important considerations, in
particular the mechanics of machine learning (data pre-processing,
training and validation splits, algorithm tuning) and the distinction
between inference and prediction. Some house price and socio-economic
data for Liverpool in the UK are introduced and used to illustrate key
considerations in the mechanics of machine learning. The classification
and regression models are constructed using implementations within the
caret package, a wrapper for hundreds of machine learning algorithms.
Six algorithms are applied to the data to illustrate predictive and
inferential modelling:
Standard linear regression (SLR)/linear discriminant analysis (LDA)
Bagged regression trees (BRTs)
Random forests (RF)
Gradient boosting machines (GBMs)
Support vector machine (SVM)
k-nearest neighbour (kNN).
You will need to load the following packages, some of which may need to
be installed with their dependencies:
library(sf)
library(tmap)
library(caret)
library(gbm)
library(rpart)
library(tidyverse)
library(gstat)
library(GGally)
library(visNetwork)
library(rgl)
library(cluster)
library(RColorBrewer)
7.2 DATA
Socio-economic data for the Liverpool area are used to illustrate specific
considerations in machine learning applications. The ch7.RData R
binary file includes three sf objects with population census data at two
different scales and data of residential properties for sale for the
Liverpool area.
The code below loads the file from the internet to your current working
directory:
# check your current working directory
getwd()
download.file("http://archive.researchdata.leeds.ac.uk/741/1",
              "./ch7.RData", mode = "wb")
Now load the ch7.RData file to your R/RStudio session and examine
the result:
load("ch7.RData")
ls()
## [1] "lsoa" "oa" "properties"
Some EDA can provide an initial understanding of the data, using some
of the methods introduced in Chapter 5. For example, the code below
creates a ggpairs plot of the oa data, and you could modify this to
create a similar plot of the lsoa data. Here again you can widen the plot
display pane in RStudio if the ggpairs plot does not display:
# OA pairs plot
oa %>% st_drop_geometry() %>%
select_if(is.numeric) %>%
ggpairs(lower = list(continuous =
wrap("points", alpha =
0.5, size= 0.1))) +
theme(axis.line=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank())
If you examine the oa and lsoa correlations you should note a number
of things:
1. the broad similarity of the distribution of attribute values across the
two scales;
2. the extremely skewed distributions of the greenspace attribute
(gs_area) at both scales, but with less skew at the coarser LSOA
scale;
3. the relatively large correlations between employment (employd) and
age group percentages;
4. stronger correlations at OA level.
These suggest that the modifiable areal unit problem (Openshaw, 1984)
might cause some differences in model outcome in this case.
Alternative visualisations are possible with the properties data:
ggplot(properties,aes(x=Beds, y=Price,
group=Beds)) + geom_boxplot()
We can also use tmap to see the geographical distribution of houses with
different numbers of bedrooms, which has a geographical pattern:
tmap_mode('view')
tm_shape(properties) +
tm_dots(col='Beds', size = 0.03, palette =
"PuBu")
tmap_mode('plot')
However, given the complexity of the variables and their binary nature
(try running summary(properties)), perhaps a more informative
approach to understanding structure in the properties data is to use a
regression tree. These are like decision trees with a hierarchical
structure. They use a series of binary rules to try to partition data in order
to determine or predict an outcome. If you have played the games 20
Questions or Guess Who (https://en.wikipedia.org/wiki/Guess_Who%3F)
then you will have constructed your own decision tree to get to the
answer. In Guess Who, an optimal strategy is to try to split the potential
solutions by asking questions that best split the remain potential answers,
and regression trees try to do the same thing. They have a root node at
the top which performs the first split, and branches connect to other
nodes, with all branches relating to a yes or no answer connecting to any
subsequent intermediary nodes (and associated division of potential
outcomes) before connecting to terminal nodes with predicted outcomes.
Regression trees use recursive partitioning to generate multiple subsets
such that the members of each of the final subsets are as homogeneous
as possible, and the results describe a series of rules that were used to
create the subsets/partitions. This process can be undertaken for both
continuous variables using regression trees and categorical variables
using classification trees. There are many methodologies for constructing
these, but one of the oldest is known as the classification and regression
tree approach developed by Breiman (1984).
A regression tree for house price can be constructed using the rpart
package. Note the use of st_drop_geometry and mutate_if to
convert the logical TRUE and FALSE keyword values to character format:
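The construction code is not reproduced here; a minimal sketch consistent with the description above (the formula and the default rpart settings are assumptions) is:
library(rpart)
set.seed(123)
tree_model <- rpart(Price ~ .,
                    data = properties %>%
                      st_drop_geometry() %>%
                      mutate_if(is_logical, as.character))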
You can examine the regression tree, noting that indented information
under each split represents a subtree:
tree_model
This can also be visualised using the visNetwork package as in Figure
7.1, which shows the variables that are used to split the data, the split
values for the variables, and (through the branch thickness) the number of records in each partition. The main branches in determining price are related
to the number of bedrooms, whether en suite facilities are available, and
whether it is Victorian. Interestingly, it is evident that some geography
comes into play for larger semi-detached houses. However, the tree
gives an indication of the data structure in relation to the Price variable:
library(visNetwork)
visTree(tree_model, legend = FALSE, collapse = TRUE, direction = 'LR')
Figure 7.1 The regression tree of house price in the properties data
Key Points
Socio-economic data at two different scales plus data on residential
properties for sale for the Liverpool area were loaded into the R
session.
Some EDA and spatial EDA were used to examine the data, using
methods from Chapter 5.
These were enhanced with a regression tree approach to examine
the structure of the properties data relative to house price.
7.3 PREDICTION VERSUS INFERENCE
Statistical models are constructed for two primary purposes: to predict or
to infer some process. The distinction between the two activities has
sometimes been blurred as the volume of activities in data analytics (or
statistics – see the preface to this book) has increased, and the
background training of those involved in these activities has diversified.
Sometimes the terminology is confusing, especially the use of the term
inference or infer: in the previous chapter we discussed how one could
infer from classical and Bayesian models, for example.
In truth, both prediction and inference seek to infer y from x in some way
through some function f, where x represents the input (predictor,
independent, etc.) variables, y is the target (or dependent) variable and ϵ
is some random error term:
$y = f(x) + \epsilon$ (7.1)
It is always useful to return to core knowledge to (re-)establish
fundamental paradigms, and in this case to consider how others regard
the distinction between prediction and other kinds of inference:
However, the problem in this case (and in most others) is that the
variables are in different units of measurement: Lat is in degrees, Beds
is a count of beds and Price is in thousands of pounds. Floor area, if
present, would be in square metres. The danger is that models
constructed on the raw data can be dominated by variables with large
numerical values. For example, if Price were in pounds rather than
thousands of pounds then this domination would be greatly enhanced.
Similarly, the relative influence of Lat would change if it were expressed
as an easting in metres, as in the OSGB projections of oa and lsoa.
However, this seems to be a somewhat arbitrary way to proceed – the
houses have the same physical size, and the same location regardless of
units of measurement, and it seems nonsensical for the notion of
closeness to change with any change in these units.
For this reason, data are often rescaled (or scaled) prior to applying
maximum likelihood algorithms. Usually this is done either by computing
z-scores, or rescaling by minimum and maximum values. For any
variable x, rescaling by z-score is as follows:
$z = \dfrac{x - \bar{x}}{\sigma_x}$ (7.2)
where x̄ is the mean of x and σx is the standard deviation of x. The result is
that z has a mean of 0 and a standard deviation of 1, regardless of the
units of x. Additionally, the distributions of the z-scores are the same as
those of x.
An alternative approach is scaling via minimum and maximum values to
generate a rescaled variable u:
$u = \dfrac{x - \min(x)}{\max(x) - \min(x)}$ (7.3)
This maps the largest value of x to 1, and the smallest to 0. Again the
distributions are unchanged and the values are independent of the
measurement units.
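A minimal sketch of this min-max rescaling in base R (the function name and example values are illustrative) is:
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
minmax(c(10, 20, 50, 100))
## 0.0000000 0.1111111 0.4444444 1.0000000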
R has a number of in-built functions for rescaling data, including the
scale function in the base R installation which rescales via z-scores.
This changes the data values but not their distribution:
library(dplyr)
df <- properties %>% st_drop_geometry() %>%
select(Price, Beds, Lat) %>%
mutate(Beds = as.numeric(as.character(Beds)))
X <- df %>% scale()
You can also check that this has been applied individually to each variable, and not in a 'group-wise' way using a single pooled mean and standard deviation across all of the numeric variables:
apply(X, 2, mean)
apply(X, 2, sd)
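The code that follows assumes a train/test partition index, train.index, has already been created; a minimal sketch using caret's createDataPartition (the 70/30 split is an assumption and may differ from the split used in the original text) is:
library(caret)
set.seed(123)
# indices of a random 70% sample, stratified on the outcome
train.index <- createDataPartition(properties$Price, p = 0.7, list = FALSE)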
The distributions of the target variable Price are similar across the two
splits:
summary(properties$Price[train.index])
summary(properties$Price[-train.index])
The split or partition index can be used to create the training and test
data as in the code snippets below:
train_data =
properties[train.index,] %>%
st_drop_geometry() %>%
mutate_if(is_logical,as.character)
test_data =
properties[-train.index,] %>%
st_drop_geometry() %>%
mutate_if(is_logical,as.character)
7.4.3 Measures of fit
The identification of the function f in equation (7.1) depends not only on
the training data, but also on a number of tuning parameters. Varying the
tuning parameters will alter how close f(X) is to y, with the aim of
identifying which combination of parameters gets the closest. A key consideration here is the choice of measure of fit.
The general method of evaluating model performance using this split is
called cross-validation (CV). This is an out-of-sample test to validate the
model using some fitness measure to determine how well the model will
generalise (i.e. predict using an independent dataset).
So the general method of evaluating model performance is to split the
data into a training set and a test set, denoted by S = X, y and S′ = X′, y′
respectively, and calibrate f using S, with a given set of tuning
parameters. Then X′ is used to compute f(X′) and the results are
compared to y′. This CV procedure returns a measure of how close the
predictions of y′ are to the observed values of y′ when f(X′) is applied.
There are a number of ways of measuring this. Two commonly used
methods (for regression) are the root mean square error (RMSE), defined
by squaring the errors (the residual differences between y′ and f(X′)),
finding their mean and taking the square root of the resulting mean:
$\mathrm{RMSE} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}\left(y'_i - f(x'_i)\right)^2}$ (7.4)
An alternative is the mean absolute error (MAE), defined by
$\mathrm{MAE} = \dfrac{1}{n}\sum_{i=1}^{n}\left|y'_i - f(x'_i)\right|$ (7.5)
And perhaps the most common accuracy measure for regression model
fit is R2, the coefficient of determination, calculated from the residual sum
of squares over the total sum of squares:
$R^2 = 1 - \dfrac{\sum_{i=1}^{n}\left(y'_i - f(x'_i)\right)^2}{\sum_{i=1}^{n}\left(y'_i - \bar{y}'\right)^2}$ (7.6)
All of these essentially measure the degree to which the predicted
responses differ from the actual ones in the test dataset. For the MAE
and RMSE measures, smaller values imply better prediction
performance, and for R2 higher ones do. Similarly, in categorical
prediction, the proportion of correctly predicted categories is a useful
score, and again larger values imply better performance.
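As a simple numerical illustration of these three measures (the observed and predicted vectors here are invented for the example):
y    <- c(100, 150, 200, 250)   # observed
yhat <- c(110, 140, 190, 270)   # predicted
rmse <- sqrt(mean((y - yhat)^2))
mae  <- mean(abs(y - yhat))
r2   <- 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
c(RMSE = rmse, MAE = mae, R2 = r2)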
Of course the descriptions above relate to models applied to test
(validation) data subsets, typically in the context of prediction. Models are
also evaluated during their construction using leave-one-out cross-
validation and k-fold cross-validation. These procedures include a
resampling process that splits the data into k groups or ‘folds’. The
approach, in outline, is to randomly split the data into k folds, such that
each observation is assigned to exactly one fold. Then, for each fold, keep
one as the hold-out or test data, train the model on the rest and evaluate
it on the test, retaining the evaluation score. The retained evaluation
scores are summarised to give an overall measure of fit. In this process
each individual observation is uniquely assigned to a single fold, used in
the hold-out set once and used to train the model k − 1 times. It takes
longer than fitting a single model as it effectively undertakes k + 1 model
fits, but is important for generating reliable models, whether for prediction
or inference.
The caret package includes a large number of options for evaluating
model fit that can be specified using the trainControl() function. You
should examine the method parameters that can be passed to
trainControl in the help. The trainControl() function essentially
specifies the type of ‘in-model’ sampling and evaluation undertaken to
iteratively refine the model. It generates a list of parameters that are
passed to the train function that creates the model:
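The definitions of ctrl1, ctrl2 and the initial knnFit are not reproduced in this excerpt; a minimal sketch consistent with the output shown below (the particular resampling methods chosen are assumptions) is:
ctrl1 <- trainControl(method = "boot", number = 25)
ctrl2 <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
set.seed(123)
knnFit <- train(Price ~ ., data = train_data,
                method = "knn", trControl = ctrl1)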
The outputs of these could be examined one by one but in this case only
the first three values are different:
# there are many settings
names(ctrl1)
# compare the ones that have been specified
c(ctrl1[1], ctrl1[2], ctrl1[3])
c(ctrl2[1], ctrl2[2], ctrl2[3])
We can inspect the outputs, which tell us that the model optimised on nine nearest neighbours: remember that these are neighbours in a multidimensional feature space, which here also includes geographic space because two of the inputs are latitude and longitude. The resampling errors (RMSE, R2 and MAE) for each value of k are shown below:
knnFit
## k-Nearest Neighbors
##
## 2963 samples
## 41 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 2963, 2963, 2963, 2963, 2963, 2963, …
## Resampling results across tuning parameters:
##
##   k  RMSE      Rsquared   MAE
##   5  121.4402  0.4558293  65.39469
##   7  117.4175  0.4798227  63.10847
##   9  116.0912  0.4874645  62.31709
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
Note also that much more detail is included in the result: try entering
names(knnFit) and examine the help for the caret train function for
descriptions of these:
names(knnFit)
help(train, package = "caret")
It is clear that the error rates have not bottomed out and that we might be
able to improve the model fit by letting it tune for a bit longer. This can be
done by passing additional parameters to the train() function such as
tuneLength. What this does is make the algorithm evaluate a greater
number of parameter settings:
set.seed(123)
knnFit <- train(Price ~ ., data = train_data,
method = "knn",
tuneLength = 20)
The kNN model has only one parameter to tune. Typically many more
need to be evaluated. This is illustrated with a gradient boosting machine
(GBM) model which is described later in this chapter. The tunable
parameters for the GBM can again be listed with the modelLookup
function. Note that caret does not load all the functions and packages
on installation, but instead assumes that they are present and loads them
as needed: if a package is not present, caret provides a prompt to
install it.
modelLookup("gbm")
##   model         parameter                   label forReg forClass probModel
## 1   gbm           n.trees   # Boosting Iterations   TRUE     TRUE      TRUE
## 2   gbm interaction.depth          Max Tree Depth   TRUE     TRUE      TRUE
## 3   gbm         shrinkage               Shrinkage   TRUE     TRUE      TRUE
## 4   gbm    n.minobsinnode Min. Terminal Node Size   TRUE     TRUE      TRUE
This tells us that there are four tuning parameters for a GBM:
n.trees, the number of trees;
interaction.depth, a tree complexity parameter;
shrinkage, the learning/adaptation rate;
n.minobsinnode, the minimum number of samples in a node for
splitting.
Again an initial model can be generated and the results evaluated, noting
that you may have to install the gbm package and that the model may
take some time to train:
set.seed(123)
ctrl <- trainControl(method="repeatedcv",
repeats=10)
gbmFit <- train(Price ~ ., data = train_data,
method = "gbm",
trControl = ctrl, verbose =
FALSE)
gbmFit
The results show the different accuracy measures (RMSE, R2 and MAE), averaged over the cross-validation iterations specified in ctrl. However, they also indicate that the default GBM implementation in caret evaluates different values for only two tuning parameters (interaction.depth and n.trees) and uses only a single value for shrinkage (0.1) and n.minobsinnode (10). In this case the best model was found with n.trees = 50, interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10. Here it is more evident than in the kNN case that the train function automatically evaluates only a limited set of tuning parameter values: by default, three values per varied parameter (i.e. 3^p combinations for p varied parameters), giving 9 combinations for the GBM (compared with 3 for kNN).
However, as well as increasing the tuneLength passed to train as above, it is possible for user-defined sets of parameters to be evaluated. These are passed to the tuneGrid argument in train, which creates and evaluates a model for each combination. This requires a data frame with a column for each tuning parameter, named in the same way as the arguments required by the function. The examples in the information box below define a grid that contains 270 combinations of parameters; these will take an hour or two to evaluate.
Returning to the gbmFit object defined above, the results can be
examined by printing out the whole model or just the best tuning
combinations of those evaluated under the default settings:
gbmFit
gbmFit$bestTune
If you are unsure whether the full range of tuning parameters has been
evaluated you could define a grid containing even more combinations of
parameters, perhaps leaving this to run overnight:
params <- expand.grid(n.trees = seq(50, 400, by
= 50),
interaction.depth =
seq(1, 5, by = 1),
shrinkage = seq(0.1, 0.5,
by = 0.1),
n.minobsinnode = seq(10,
50, 10))
dim(params)
#head(params)
gbmFit <- train(Price ~ ., data = train_data,
method = "gbm",
trControl = ctrl,
tuneGrid = params, verbose =
FALSE)
7.4.5 Validation
Finally, having created and trained a well-tuned model, it can be applied to
the hold-out data for model evaluation using the predict function. This
gives an indication of the generalisability of the model – the degree to
which it is specific to the training/validation split. Theoretically the split
was optimised such that the properties of the response variable (Price)
were the same in both subsets. The code below goes back to the last
kNN model, applies this to the test data and compares the predicted
house price values with the actual observed house price values. The
scatter plot of these with some trend lines is shown in Figure 7.3.
pred = predict(knnFit, newdata = test_data)
data.frame(Predicted = pred, Observed =
test_data$Price) %>%
ggplot(aes(x = Observed, y = Predicted)) +
geom_point(size = 1, alpha = 0.5) +
geom_smooth(method = "loess", col = "red") +
geom_smooth(method = "lm")
Figure 7.3 Predicted against observed house price values from the final
kNN model, with the model fit indicated by the blue linear trend line and
the loess trend in red giving an indication of the variation in prediction
from this
Both loess and linear regression trend lines are included in Figure 7.3. We
have not fitted a non-linear regression, but this trend line reflects the
variation in the predicted and observed data overall. The loess
inflections suggest that the model does reasonably well at predicting
house prices in the lower middle of the range of values, where both the
loess and lm (linear model) trend lines are parallel, but it does less well
at predicting higher observed values.
A measure of overall model fit can be obtained, which essentially
summarises the comparison of the observed values of Price against the
predicted ones:
postResample(pred = pred, obs =
test_data$Price)
## RMSE Rsquared MAE
## 115.8387874 0.4636135 65.3384096
The structure of the new dataset can be examined using regression trees
as before:
set.seed(123)
tree_model_oa <- rpart(Price ~ ., data=
data_anal %>%
mutate_if(is_logical,as.character))
Note that it is important to rescale the data with z-scores, after splitting
the data into training and validation subsets. The danger of
normalising/rescaling the data before the split is that information about
the future is being introduced into the training explanatory variables.
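A sketch of this split-then-scale idea, using the training data alone to define the centring and scaling (the object names and column choices are illustrative and may differ from the train_z and test_z objects used later), is:
# numeric predictors only; the target is typically left unscaled
num_cols <- setdiff(names(train_data)[sapply(train_data, is.numeric)], "Price")
train_scaled <- scale(train_data[, num_cols])
test_scaled  <- scale(test_data[, num_cols],
                      center = attr(train_scaled, "scaled:center"),
                      scale  = attr(train_scaled, "scaled:scale"))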
These data will be used in the caret machine learning prediction models
below. Inference models will use the full data and the classification
models will explore the OAC variable, and so new data will be created for
these.
7.5.2 Model overviews
A sample of the many possible machine learning algorithms and models
for classification and regression are illustrated. These have been
selected to represent a spectrum of increasing non-linear complexity
through ensemble approaches such as bagged regression trees, random
forests and learning approaches (support vector machines), as well as
classic approaches such as k-nearest neighbour and discriminant
analysis. This subsection provides a (very) brief overview of these. In
subsequent sections and subsections their strengths of prediction,
inference and classification are compared with standard linear
regression/linear discriminant analysis. Recall that caret will evaluate a
small range of tuning parameters, but these can also be specified in a
tuning grid as described above.
Linear regression and discriminant analysis
Linear regression seeks to identify (fit) the hyperplane in multivariate
space that minimises the difference between the observed target variable
and that predicted by the model. The linear nature of the model relates to
this and can be applied for prediction and inference. Discriminant function
analysis (Fisher, 1936) can be used to predict membership or class for a
discrete group of classes. It extracts discriminant functions from
independent variables which are used to generate class membership
probabilities for each observation. If there are k groups, the aim is to
extract k discriminant functions. An observation is assigned to a group if
the value for the discriminant function for the group is the smallest. It was
extended from the linear classifier to the quadratic case by Marks and
Dunn (1974). Because of the partitioning of the space in this way for
clusters, any collinear variables in the data must be removed and any
numerical variables converted to factors. Here the lda implementation in
caret is used, but you should note that there are many flavours of
discriminant analysis included in caret (see
https://topepo.github.io/caret/train-models-by-tag.html#discriminant-
analysis).
k-nearest neighbour
The kNN algorithm operates under the assumption that records with
similar values of, for example, Price in the properties data have
similar attributes – they are expected to be close in the multidimensional
feature space described earlier. In this sense this algorithm seeks to
model continuous variables or classes, given other nearby values. The
algorithm is relatively straightforward: for each yi, select the k observations closest to observation i in the feature space and predict yi to be the arithmetic mean of y over those observations. Thus the kNN algorithm is based on similarity to the k nearest neighbours in attribute space rather than geographical space.
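A minimal sketch of this idea in base R, using entirely synthetic data and k = 3, is:
set.seed(1)
X  <- scale(matrix(rnorm(200), ncol = 2))   # two rescaled features for 100 observations
y  <- rowSums(X) + rnorm(100, sd = 0.1)     # a target related to the features
x0 <- c(0.5, -0.2)                          # a new observation
d  <- sqrt(colSums((t(X) - x0)^2))          # distances in feature space
mean(y[order(d)[1:3]])                      # prediction: mean y of the 3 nearest neighbours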
Bagged regression trees
Bootstrap aggregating or bagging (Breiman, 1996) seeks to overcome
the high variance in regression trees by generating multiple models with
the same parameters. The variance issue with regression trees is that
although the initial partitions at the top of the tree will be similar between
runs and sampled data, there can be large differences in the branches
lower down between individual trees. This is because later (deeper)
nodes tend to overfit to specific sample data attributes in order to further
partition the data. As a result, samples that are only slightly different can
result in variable models and differences in predicted values. This high
variance causes model instability. The models (and their predictions) are
also sensitive to the initial training data sample, and thus regression trees
suffer from poor predictive accuracy.
To overcome this, bootstrap aggregating (bagging) was proposed by
Breiman (1996) to improve regression tree performance. Bagging, as the
name suggests, generates multiple models with the same parameters
and averages the results from multiple trees. This reduces the chance of
overfitting as might arise with a single model and improves model
prediction. The bootstrap sample in bagging will on average contain 63%
of the training data, with about 37% left out of the bootstrapped sample.
This is the out-of-bag sample which is used to determine the model’s
accuracy through a cross-validation process. There are three steps to
bagging:
1. Create a number of samples from the training data. These are
termed bootstrapped samples because they are repeatedly sampled
from a training set, before the model is computed from them. These
samples contain slightly different data but with the same distribution
properties of the full dataset.
2. For each bootstrap sample create (train) a regression tree.
3. Determine the average predictions from each tree to generate an
overall average predicted value.
Random forests
Bagged regression trees may still suffer from high variance, with
associated reduced predictive power. They may exhibit tree correlation
as the trees generated by the bagging process may not be completely
independent of each other due to all predictor variables being considered
at every split of every tree. As a result, trees from different bootstrap
samples may have a similar structure to each other, and this can prevent
bagging from optimally reducing the variance of the model prediction and
performance. Random forests (Breiman, 2001) seek to reduce tree
correlation. They build large collections of decorrelated trees by adding
randomness to the tree construction process. They do this by using a
bootstrap process and by split-variable randomisation. Bootstrapping is
similar to bagging, but repeatedly resamples from the original sample.
Trees that are grown from a bootstrap resampled dataset are more likely
to be decorrelated. The split-variable randomisation introduces random
splits and noise to the response, limiting the search for the split variable
to a random subset of the variables. The basic regression random forest
algorithm proceeds as follows:
1. Select the number of trees ('ntrees')
2. for i in 'ntrees' do
3. | Create a bootstrap sample of the original
data
4. | Grow a regression tree to the bootstrapped
data
5. | for each split do
6. | | Randomly select 'm' variables from 'p'
possible variables
7. | | Identify the best variable / split among
'm'
8. | | Split into two child nodes
9. | end
10. end
Gradient boosting machine
Boosting seeks to convert weak learning trees into strong learning ones,
with each tree fitted on a slightly modified version of the original data.
Gradient boosting machines (Friedman, 2001) use a loss function to
indicate how well a model fits the underlying data. It trains many trees
(models) in a gradual, additive and sequential manner, and
parameterises subsequent trees by evaluated losses in previous ones
(boosting them). The loss function is specific to the model objectives. For
the house price models here, the loss function quantifies the error
between true and predicted (modelled) house prices. GBMs have some
advantages for inference as the results from the boosted regression trees
are easily explainable and the importance of the inputs can easily be
retrieved.
Support vector machine
Support vector machine (SVM) analyses can also be used for
classification and regression (Vapnik, 1995). SVM is a non-parametric
approach that relies on kernel functions. For regression, SVM model
outputs do not depend on distributions of input variables (unlike ordinary
linear regression). Rather SVM uses kernel functions that allow non-
linear models to be constructed under the principle of maximal margin,
which focuses on keeping the error within a certain margin rather than minimising it directly. The kernel functions undertake complex data transformations and hyperplanes are then sought that separate (or, for regression, fit) the transformed data. The support vectors are the data points closest to the hyperplane; they determine its position and orientation, the margins are maximised with respect to them, and removing a support vector changes the position of the hyperplane.
7.5.3 Prediction
The code below creates six prediction models using standard linear
regression (SLR), k-nearest neighbour (kNN), bagged regression trees
(BRTs), random forests (RF), gradient boosting machines (GBMs) and
support vector machine (SVM). Each of the models is constructed with
the rescaled training data (train_z), the same control parameters (a
10-fold cross-validated model) and the default model tuning parameters.
The predictive models are applied to the normalised test data (test_z)
to generate predictions of known house price values which are then
evaluated. Finally, the predictions from the different models are
compared.
The code below creates each of the models, with some of them (e.g. the
random forest) taking some time to run:
lmFit = train (Price~.,data = train_z, method =
"lm",
trControl = ctrl, verbose =
FALSE)
knnFit = train(Price~.,data = train_z, method =
"knn",
trControl = ctrl, verbose =
FALSE)
tbFit = train(Price~.,data = train_z, method =
"treebag",
trControl = ctrl, verbose =
FALSE)
rfFit = train(Price~.,data = train_z, method =
"rf",
trControl = ctrl, verbose =
FALSE)
gbmFit = train(Price~.,data = train_z, method =
"gbm",
trControl = ctrl, verbose =
FALSE)
svmFit = train(Price~.,data = train_z, method =
"svmLinear",
trControl = ctrl, verbose =
FALSE)
Remember that you can save R objects to allow you to come back to
these without running them again:
# the models
save(list =
c("lmFit","knnFit","tbFit","rfFit","gbmFit","svmFit"),
file = "all_pred_fits.RData")
For all of the models, the outputs can be examined in the same way (with
a model named xxFit):
Tuning parameters can be examined using modelLookup.
The model results can be examined through xxFit$results.
The best fitted model is reported in xxFit$finalModel.
Predictions can be made using the predict function, which
automatically takes the best model: pred.xx = predict(xxFit,
newdata = test_z).
The prediction can then be evaluated using the postResample
function to compare the known test values of Price with the
predicted ones: postResample(pred = pred.xx, obs =
test_z$Price).
The predicted values for each individual model can be plotted against the
observed values as in Figure 7.3.
The effectiveness of model prediction can also be compared, using the
postResample function in caret with the generic predict function.
Table 7.1 shows the accuracy of the predicted house price values when
compared against the known, observed house price values in the test
data for each model. Here, it looks like the random forest approach is
generating the model with the strongest fit.
# generate the predictions for each model
pred.lm = postResample(pred = predict(lmFit,
newdata = test_z),
obs = test_z$Price)
pred.knn = postResample(pred = predict(knnFit,
newdata = test_z),
obs = test_z$Price)
pred.tb = postResample(pred = predict(tbFit,
newdata = test_z),
obs = test_z$Price)
pred.rf = postResample(pred = predict(rfFit,
newdata = test_z),
obs = test_z$Price)
pred.gbm = postResample(pred = predict(gbmFit,
newdata = test_z),
obs = test_z$Price)
pred.svm = postResample(pred = predict(svmFit,
newdata = test_z),
obs = test_z$Price)
# Extract the model validations
df_tab = rbind(
  lmFit$results[which.min(lmFit$results$Rsquared), 2:4],
  knnFit$results[which.min(knnFit$results$Rsquared), 2:4],
  tbFit$results[which.min(tbFit$results$Rsquared), 2:4],
  rfFit$results[which.min(rfFit$results$Rsquared), 2:4],
  gbmFit$results[which.min(gbmFit$results$Rsquared), 5:7],
  svmFit$results[which.min(svmFit$results$Rsquared), 2:4])
colnames(df_tab) = paste0("Model",
colnames(df_tab))
# Extract the prediction validations
df_tab2 = t(cbind(pred.lm, pred.knn, pred.tb,
pred.rf, pred.gbm, pred.svm))
colnames(df_tab2) = paste0("Prediction",
colnames(df_tab2))
# Combine
df_tab = data.frame(df_tab, df_tab2)
rownames(df_tab) = c("SLR", "kNN", "BRT", "RF",
"GBM", "SVM")
# Print out
df_tab
Table 7.1
In summary, there are a large number of possible refinements and
choices in the models used for prediction:
Different models will be better suited to the input data than others,
suggesting the need to explore a number of model types and
families.
The model predictions can be evaluated by the degree to which they
predict observed, known values in the test data.
The model algorithms can be tuned beyond the caret defaults
through a tuning grid, and details of the tuning parameters are listed
using the modelLookup("<method>") function.
The models could be run outside of caret to improve their speed
and tuning options as some take a long time to run with even modest
data dimensions.
You may wish to save the results of this and the preceding sections (e.g.
save.image(file = "section 7.5.3.RData")).
7.5.4 Inference
In contrast to prediction, where the aim is to create a model able to
reliably predict the response variable, given a set of input variables, the
aim of inference is understanding, specifically, process understanding.
The code below defines X and Y to illustrate a different caret syntax
used in the inferential models. It drops one of the life-stage/age variables
(u25) because groups of variables adding to 1, 100, and so on across all
records can confound statistical models. It also converts the logical true
or false values to characters and the Beds variable from an ordered
factor to numeric values:
X = data_anal %>% select(-Price, -u25) %>%
mutate_if(is_logical,as.character) %>%
mutate(Beds = as.numeric(Beds)) %>%
mutate_if(is_double,scale) %>% data.frame()
Y = data_anal["Price"]
The data considerations for inference are slightly different than for
prediction because the aims are different. First, data do not have to be
split into training and validation subsets. In fact they should not be split, as there is a danger that some of the potential for understanding will be lost
through the split, and there is no need to hold data back for training and
validation. This means that the full structure of the data can be exploited
by the model. Second, there is a critical need to consider variable
selection (also known as model selection), particularly in the context of
collinearity. Model selection is an important component of any regression
analysis, but more so for understanding than for prediction, as indicated
above: simply, if the aim is understanding and two variables are
correlated (i.e. essentially have the same relationship with the target
variable) then identifying the effects of each predictor variable on the
target variable may be difficult. Determining which predictor variables to
include in the analysis is not so important for prediction, where the aim is
simply to identify the model with the strongest prediction accuracy.
Collinearity occurs when pairs of predictor variables have a strong
positive or negative relationship between each other. Strong collinearity
can reduce model reliability and precision and can result in unstable
parameter estimates, inflated standard errors and inferential biases
(Dormann et al., 2013). As a result, model extrapolation may be
erroneous and there may be problems in separating variable effects
(Meloun et al., 2002). The risk of collinearity increases as more predictors are introduced. It is typically
considered a potential problem when data pairs have correlations of less
than –0.8 or greater than +0.8 (Comber and Harris, 2018). Collinearity
can be handled by variable reduction and transformation techniques such
as principal components analysis (PCA) regression or by related
approaches such as partial least squares regression (e.g. Frank and
Friedman, 1993). Other approaches for variable selection and handling
collinearity include penalised approaches such as the elastic net, a hybrid
of ridge regression and the lasso – the least absolute shrinkage and
selection operator (Zou and Hastie, 2005). The key point is that the
failure to correctly specify a model when collinearity is present can result
in a loss of precision and power in the coefficient estimates, resulting in
poor inference and process understanding.
We can investigate collinearity among the variables in a number of ways.
The simplest is to construct a linear regression model and then examine
the variance inflation factors (VIFs) using the vif function in the car
package. Heuristics for interpreting these are taken from Belsley et al.
(2005) and O’Brien (2007): collinearity is likely to be a potential problem
with VIFs greater than 10 for a given predictor. The code below
constructs a regression model and then returns the input variables with a
VIF greater than 2. The results suggest that only very weak collinearity
exists within these data:
library(car)
m = lm(Price~.,data=cbind(X,Y))
vif(m)[ vif(m)>2]
## Garden Terraced Detached Semi.Detached
## 2.095882 2.093380 4.556362 2.334352
## Easting u16 u45 u65
## 2.733526 2.316360 2.371342 2.105384
## o65
## 2.357073
The code below creates six inference models using standard linear
regression (SLR), k-nearest neighbour (kNN), bagged regression trees
(BRTs), random forests (RF), gradient boosting machines (GBMs) and
support vector machine (SVM). None of these are tuned beyond the
defaults in caret. However, tuning has been heavily illustrated in
previous sections. The aim here is to illustrate the different types of
models, how they can be compared and evaluated and then used to
generate understanding of the factors associated with house price. The
code below creates each of the models, with the RF taking some time to
run:
ctrl=
trainControl(method="repeatedcv",number=10,repeats=5)
set.seed(123)
lmFit = train(Price~.,data = cbind(X,Y),method
= "lm",
trControl = ctrl)
knnFit = train(Price~.,data = cbind(X,Y),method
= "knn",
trControl = ctrl,trace = F)
tbFit = train(Price~.,data = cbind(X,Y),method
= "treebag",
trControl = ctrl)
rfFit = train(Price~.,data = cbind(X,Y),method
= "rf",
trControl = ctrl,verbose =
F,importance = T)
gbmFit = train(Price~.,data = cbind(X,Y),method
= "gbm",
trControl = ctrl,verbose = F)
svmFit = train(Price~.,data = cbind(X,Y),method
= "svmLinear",
trControl = ctrl,verbose = F)
Table 7.2
Next, a series of predictive models are constructed and evaluated
internally, and their accuracy is reported using the accuracy and kappa
measures as described above (Table 7.3). Note that again these might
take a few minutes to run and you could consider including verbose =
F for some of them:
Here we see that the most accurate model is the random forest model.
Table 7.4
For classification models constructed using caret, each predictor
variable has a separate variable importance for each class, except for
classification trees, bagged trees and boosted trees (see
https://topepo.github.io/caret/variable-importance.html). However, for the
other models, the variable importance for each class can be examined,
as is shown in Figure 7.5 for the SVM model.
data.frame(varImp(svmMod)$importance,
           var = rownames(varImp(svmMod)$importance)) %>%
  pivot_longer(-var) %>%
  ggplot(aes(reorder(var, value), value, fill = var)) +
  geom_col() + coord_flip() +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "", y = "Variable Importance for SVM clusters",
       fill = "Attribute") +
  facet_wrap(~name, ncol = 3, scales = "fixed") +
  theme(axis.text.y = element_text(size = 7),
        legend.position = "none")
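The loop that populates the wss (within-cluster sum of squares) and smallest.clus vectors used in the plots below is not reproduced in this excerpt; a sketch mirroring the LSOA version later in this section (assuming the same columns of oa are dropped as in the mapping code below) is:
set.seed(1234)
smallest.clus <- wss <- rep(0, 100)
for (i in 1:100) {
  clus <- kmeans(st_drop_geometry(oa[, -c(1, 9)]),
                 centers = i, iter.max = 100, nstart = 20)
  wss[i] <- clus$tot.withinss
  smallest.clus[i] <- min(clus$size)
}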
The WCSS values can be plotted against the cluster sizes as a useful
diagnostic exercise (as in Figure 7.7). This shows that as the number of
clusters increases, the within-cluster variance decreases, as you would
expect – each group becomes more homogeneous as the number of
groups increases. The plot of the smallest cluster size against the total
number of clusters shows the tipping point at which cluster size starts to
drop off – again indicating very small, niche and not very universal
clusters. Here, it seems that a value of k around 8 has a good position on
the elbow of the scree plot and results in a reasonable smallest cluster
size:
par(mfrow = c(1,2)) # set plot parameters
# the WCSS scree plot
plot(1:100, wss[1:100], type = "h", xlab =
"Number of Clusters",
ylab = "Within Cluster Sum of Squares")
# the smallest cluster size
plot(1:100, smallest.clus[1:100], type = "h",
xlab = "Number of Clusters",
ylab = "Smallest Cluster Size")
par(mfrow = c(1,1)) # reset the plot parameters
Figure 7.7 Cluster scree plot (left) and smallest cluster size plot (right)
The classification is then recalculated for eight clusters, as this provides a reasonable trade-off between WCSS and minimum cluster size, and the results can be mapped:
set.seed(1234) # reproducibility!
cls <- kmeans(st_drop_geometry(oa[,-c(1, 9)]),
centers = 8, iter.max = 100,
nstart = 20)
oa$cls = cls$cluster
# map
tm_shape(oa) +
tm_fill("cls", title = "Class No.", style =
"cat",
palette = "Set1")
# remove the 'cls' variable
oa = oa [, -11]
Here we can see that the first four components account for 82.3% of the
variance in the data, which can be used as input to cluster analysis. This
is incredibly useful when the input data have many variables and the
results can be evaluated and mapped as before:
cls <-
kmeans(PCA$scores[,1:4],centers=8,iter.max=100,nstart=20)
The code above creates the PCA and lists the proportion of variance
explained by each individual component. The next step is to investigate
the cumulative amount of variance explained. The aim here is to
determine the number of components explaining 70–90% of the variation
in the data:
cumsum(PCA$sdev^2/sum(PCA$sdev^2 ))
##    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6
## 0.1432658 0.2388074 0.2882204 0.3330168 0.3700920 0.4041500
##    Comp.7    Comp.8    Comp.9   Comp.10   Comp.11   Comp.12
## 0.4354055 0.4648313 0.4937707 0.5201652 0.5448618 0.5692588
##   Comp.13   Comp.14   Comp.15   Comp.16   Comp.17   Comp.18
## 0.5929655 0.6160866 0.6386988 0.6599927 0.6803223 0.6997149
##   Comp.19   Comp.20   Comp.21   Comp.22   Comp.23   Comp.24
## 0.7185753 0.7364051 0.7537062 0.7705425 0.7870188 0.8030244
##   Comp.25   Comp.26   Comp.27   Comp.28   Comp.29   Comp.30
## 0.8177391 0.8318939 0.8453906 0.8581611 0.8708262 0.8825205
##   Comp.31   Comp.32   Comp.33   Comp.34   Comp.35   Comp.36
## 0.8940712 0.9048314 0.9154465 0.9252348 0.9333868 0.9410854
##   Comp.37   Comp.38   Comp.39   Comp.40   Comp.41   Comp.42
## 0.9486426 0.9556952 0.9623954 0.9688771 0.9749741 0.9807231
##   Comp.43   Comp.44   Comp.45   Comp.46   Comp.47   Comp.48
## 0.9853435 0.9894028 0.9934334 0.9961548 0.9987706 1.0000000
##   Comp.49
## 1.0000000
Here we can see that the first 24 components account for 80.3% of the
variance in the data. We will use these for a k-means cluster analysis.
Again different cluster sizes are evaluated:
set.seed(1234) # Reproducible outcome
smallest.clus <- wss <- rep(0, 100) # define 2
variables
for (i in 1:100 ) {
clus <- kmeans(PCA$scores[, 1:24],
centers = i, iter.max = 100,
nstart = 20)
wss[i] <- clus$tot.withinss
smallest.clus[i] <- min(clus$size)
if(i%% 10 == 0) cat("progress:", i, "% \n")
}
We can now plot the sums of squares and the cluster sizes as before,
with the results suggesting that 10 clusters are appropriate, providing a
reasonable trade-off between the WCSS and minimum cluster size:
plot(1:100, wss[1:100], type = "h", main =
"Cluster Scree Plot",
xlab = "Number of Clusters",
ylab = "Within Cluster Sum of Squares")
plot(1:100, smallest.clus[1:100], type = "h",
main = "Smallest Cluster Plot",
xlab = "Number of Clusters" ,
ylab = "Smallest Cluster Size" )
set.seed(1234)
clus <- kmeans(PCA$scores[, 1:24],
centers = 10, iter.max = 100,
nstart = 20)
LSOAclusters = clus$cluster
The above code used the ddply function in the plyr package. This is
used to split the original input data (the class_data data frame) by the
LSOAclusters variable which indicates the cluster membership of each
census area. This is an example of a split–apply–combine operation. The
data frame is split into a number of sub-frames according to the value of
LSOAclusters, then an operation is applied to each sub-frame (in this
case computing a list of column means – the expression
numcolwise(mean) takes a function (mean) and creates a new
function, which is then applied to each numeric column in the input data),
and finally these are combined together to create a single data frame.
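A minimal sketch of the ddply call described above (the class_data and LSOAclusters objects follow the text; the exact call is an assumption) is:
library(plyr)
cluster_means <- ddply(data.frame(class_data, LSOAclusters = LSOAclusters),
                       "LSOAclusters", numcolwise(mean))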
The result can be explored visually. The heatmap function provides a
visual representation of the z_scores variable (Figure 7.8). It generates
a grid-based image of the values in the two-dimensional array. However,
note that the column and row labels are both categorical, non-ordinal
variables. Although the cluster numbers are integers, no significance is
attached to their values – they merely serve as labels. This implies that
both rows and columns may be permutated, without loss of generality.
Doing this can sometimes aid the clarity of the visualisation. In this case,
rows are reordered on the basis of applying a dendrogram ordering
algorithm to the result of a hierarchical clustering algorithm, where the
values along each row are treated as a vector. The clustering is carried
out on the basis of Euclidean distance between the vectors using the
complete linkage approach (Sørensen, 1948), and the reordering is
based on the reorder.dendrogram function in the stats package included in standard distributions of R. As noted, this reordering
is applied to the rows, but following this, a reordering is also applied to
the columns – this time treating the columns as vectors, and computing
Euclidean distances between these.
library(RColorBrewer)
heatmap(t(z_scores),
        scale = 'none',
        col = brewer.pal(6, 'BrBG'),
        breaks = c(-1e10, -2, -1, 0, 1, 2, +1e10),
        xlab = 'Cluster Number', cexRow = 0.6,
        add.expr = abline(h = (0:49) + 0.5, v = (0:10) + 0.5, col = 'white'))
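To make the reordering explicit, the sketch below illustrates the kind of computation heatmap performs internally for the rows. It is an illustration based on the defaults described above, not the function's exact code.
# illustration of the row reordering: complete-linkage hierarchical clustering
# on Euclidean distances between the rows of t(z_scores), then dendrogram
# reordering weighted by the row means (heatmap's default behaviour)
x  <- t(z_scores)
hc <- hclust(dist(x, method = "euclidean"), method = "complete")
dd <- reorder(as.dendrogram(hc), rowMeans(x))
row_order <- order.dendrogram(dd)   # the permutation applied to the heatmap rows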
Key Points
Cartograms break the link between the mapped size of the reporting
units on the map and their actual physical size in reality.
Areas (units) are rescaled according to some attribute, while seeking to preserve as much of the topology (i.e. the spatial relationships between units) as possible.
Three types of cartogram were illustrated: contiguous, non-
contiguous and Dorling.
The critical assumption in the use of cartograms is that the original
units have some sort of meaning.
Figure 8.2 Original map of LSOA unemployment rates in Nottingham and
the different cartograms
8.4 HEXAGONAL BINNING AND TILE MAPS
Different flavours of cartogram provide a way to overcome the invisibility problem in standard choropleth maps by spatially exaggerating areas with high values and reducing the map space given to areas with low values. However, all cartograms, and especially contiguous cartograms, impose a high level of distortion, the distortion problem (Harris et al., 2018): the shapes and locations of areas on the map are changed, so that areas with the lowest values are both allocated the least map space and distorted. Cartograms can also massively over- or underemphasise the differences between values at different locations, resulting in different kinds of misrepresentation (Harris et al., 2018).
This presents a challenge: cartograms can fail if the level of geographical
distortion makes the map difficult to interpret, despite the aim of
improving the communication of spatial data (Harris et al., 2017b). There
are a number of potential approaches for dealing with misrepresentation.
One critical but frequently overlooked solution is to make sure the original
map is always presented alongside the cartogram.
Another is to carefully rescale the attribute value being mapped in order to achieve an acceptable balance between invisibility and distortion, for example using square roots or logs of counts (or squares or exponentials of proportions). Figure 8.3 compares contiguous cartograms scaled in this way.
# rescale the values
nottingham$u_sq = (nottingham$unemployed)^2
nottingham$u_exp = exp(nottingham$unemployed)
# create cartograms
n_cart_sq <- cartogram_cont(nottingham, "u_sq", itermax = 10)
n_cart_ex <- cartogram_cont(nottingham, "u_exp", itermax = 10)
# create tmap plots
p3 = tm_shape(n_cart_sq) +
  tm_polygons("u_sq", palette = "viridis", style = "kmeans") +
  tm_layout(title = "Squares", frame = F, legend.show = FALSE)
p4 = tm_shape(n_cart_ex) +
  tm_polygons("u_exp", palette = "viridis", style = "kmeans") +
  tm_layout(title = "Exponential", frame = F, legend.show = FALSE)
# plot with original
tmap_arrange(p1, p2, p3, p4, ncol = 2)
Figure 8.3 The effect of rescaling values in different ways on cartogram
distortion
A further solution is to use equal area tile maps – an equal area
cartogram – giving equal space to all map areas and making them all
equally visible. Commonly used tile shapes include squares and
hexagons, which are shaded in the same way as a normal choropleth.
The geogrid package provides functions to do this.
First, an initial grid of n cells is established, where n is the number of polygons in the input data; the arrangement of the cells mimics the coverage of the input data:
hg = calculate_grid(nottingham, learning_rate = 0.05,
                    grid_type = "hexagonal", verbose = F)
The spatial properties of the result can be compared with the input as in
Figure 8.4:
tm_shape(hg[[2]], bbox = nottingham) + tm_borders() +
  tm_shape(nottingham) + tm_borders(col = "darkgrey") +
  tm_layout(title = "Hexbins", frame = F, legend.show = FALSE)
Figure 8.4 The hexagonal grid and the original LSOA data
Then each of the original areas (polygons) is allocated to the nearest grid cell. This requires a combinatorial optimisation routine to assign each original area to a new grid cell, in such a way that the distances between the old and new locations are minimised across all locations, using Kuhn’s Hungarian method (Kuhn, 1955). This can take time to solve:
hg = assign_polygons(nottingham, hg)
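To illustrate the underlying assignment problem (for illustration only; this is not how geogrid is implemented internally), the Hungarian method can be run on a toy set of locations using the clue package:
# toy assignment problem: allocate 5 'areas' to 5 'grid cells' so that the
# total old-to-new distance is minimised (Hungarian method via clue::solve_LSAP)
library(clue)
set.seed(1)
old_xy <- matrix(runif(10), ncol = 2)   # original area centroids
new_xy <- matrix(runif(10), ncol = 2)   # candidate grid cell centres
cost   <- as.matrix(dist(rbind(old_xy, new_xy)))[1:5, 6:10]
assignment <- solve_LSAP(cost)
cbind(area = 1:5, cell = as.integer(assignment))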
And of course the same can be done to create a square grid – again this
takes time:
sg = calculate_grid(nottingham, learning_rate = 0.05,
                    grid_type = "regular", verbose = F)
sg = assign_polygons(nottingham, sg)
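A minimal tmap sketch of mapping the two gridded layouts is given below (an assumption, not the book's exact code for Figure 8.5); assign_polygons carries the original attributes, including unemployed, across to the new geometries:
# sketch: map unemployment on the hexagonal and square grid layouts
p_hex = tm_shape(hg) +
  tm_polygons("unemployed", palette = "viridis", style = "kmeans") +
  tm_layout(title = "Hexagonal", frame = F, legend.show = FALSE)
p_sq = tm_shape(sg) +
  tm_polygons("unemployed", palette = "viridis", style = "kmeans") +
  tm_layout(title = "Square", frame = F, legend.show = FALSE)
tmap_arrange(p_hex, p_sq, ncol = 2)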
Figure 8.5 Hexagonal and square bin grids showing the spatial distribution of unemployment alongside the original areas
Key Points
Cartograms can suffer from distortion problems which over- or
underemphasise the differences between values at different
locations.
Such misrepresentations can be overcome by always presenting the
original map alongside the cartogram, by carefully rescaling the
attribute value that is being mapped or by using equal area tile
maps, which use heuristic searches to allocate each original area to
a tile.
8.5 SPATIAL BINNING DATA: A SMALL
WORKED EXAMPLE
Equal area tiles can also be used as a spatial framework to collate,
aggregate and summarise other data – to bin data. The basic idea is the
same as the use of geom_hex to summarise the number of data points
with similar x and y values, as was done in Chapter 5. But rather than
plot them in a scatter plot, with the bin shades indicating the number of
data points in each x−y range, here the approach used is to summarise
data points falling within certain spatial locations.
The principle can be applied to spatial data points that record the
occurrence of something, with longitude and latitude or easting and
northing being used instead of x and y.
First, it is instructive to examine a familiar example. Figure 8.6 shows the
properties data in Liverpool, introduced in Chapter 7, summarised
over hexbins representing LSOAs in Liverpool. The code below loads the
properties simple features object of houses for sale in Liverpool. It
then uses the lsoa_sf data and the st_make_grid function in sf to
create a set of 500 m hexbins covering the study area. Load the Chapter
7 data with the properties layer:
load("ch7.RData")
The code below replicates the steps outlined in Chapter 4, which also
describes each step individually. In brief, the steps are as follows:
1. Extract the prescriptions of interest, summarising the total number
(n) and cost (tot_cost) for each GP practice.
2. Determine the distributions of patients across all LSOAs by GP
practice.
3. Summarise and link the prescriptions distributions over the LSOAs:
the costs, the counts of prescriptions and the prescribing rates.
4. Extract and link to the LSOA socio-economic data from the social
database table.
The code for undertaking these steps is below:
tbl(db, "prescriptions") %>%
# step 1
filter(bnf_code %like% '040303%') %>%
group_by(practice_id) %>%
summarise (prac_cost = sum(act_cost, na.rm =
T),
prac_items = sum(items, na.rm = T))
%>%
ungroup() %>%
# step 2
left_join(
tbl(db, "patients") %>%
group_by(practice_id) %>%
summarise(prac_pats = sum(count, na.rm =
T)) %>%
ungroup() %>%
left_join(tbl(db, "patients")) %>%
mutate(pats_prop = as.numeric(count) /
as.numeric(prac_pats))
) %>%
# step 3
mutate(lsoa_cost = prac_cost*pats_prop,
lsoa_items = prac_items*pats_prop) %>%
group_by(lsoa_id) %>%
summarise(lsoa_tot_cost = sum(lsoa_cost,
na.rm = T),
lsoa_tot_items = sum(lsoa_items,
na.rm = T)) %>%
ungroup() %>%
filter(!is.na(lsoa_tot_cost)) %>%
# step 4
left_join(
tbl(db, "social") %>%
mutate(population =
as.numeric(population))
) %>%
mutate(cost_pp = lsoa_tot_cost / population,
items_pp = lsoa_tot_items/population)
%>%
collect() -> lsoa_result
The database connection can now be closed and the result examined,
after a bit of reordering:
dbDisconnect(db)
names(lsoa_result)
##  [1] "lsoa_id"        "lsoa_tot_cost"  "lsoa_tot_items"
##  [4] "population"     "employed"       "unemployed"
##  [7] "noqual"         "l4qual"         "ptlt15"
## [10] "pt1630"         "ft3148"         "ft49"
## [13] "llti"           "ruc11_code"     "ruc11"
## [16] "oac_code"       "oac"            "cost_pp"
## [19] "items_pp"
lsoa_result = lsoa_result[, c(1, 2, 3, 18, 19, 4:17)]
lsoa_result %>% arrange(-cost_pp)
## # A tibble: 32,935 x 19
##    lsoa_id  lsoa_tot_cost lsoa_tot_items cost_pp items_pp
##    <chr>            <dbl>          <dbl>   <dbl>    <dbl>
##  1 E01018…         11003.          9537.    5.27     4.57
##  2 E01017…          4578.          3109.    3.90     2.65
##  3 E01015…          6475.          4955.    3.80     2.91
##  4 E01019…          5600.          3921.    3.61     2.52
##  5 E01015…          3886.          2990.    3.59     2.76
##  6 E01015…          7135.          5475.    3.51     2.69
##  7 E01015…          4372.          3364.    3.32     2.55
##  8 E01023…          4995.          2716.    3.03     1.65
##  9 E01022…          4220.          2532.    2.89     1.73
## 10 E01033…          3312.          2433.    2.89     2.12
## # … with 32,925 more rows, and 14 more variables: population <dbl>,
## #   employed <dbl>, unemployed <dbl>, noqual <dbl>, l4qual <dbl>,
## #   ptlt15 <dbl>, pt1630 <dbl>, ft3148 <dbl>, ft49 <dbl>, llti <dbl>,
## #   ruc11_code <chr>, ruc11 <chr>, oac_code <int>, oac <chr>
8.6.3 Mapping
It would be relatively straightforward to simply map the counts or rates.
The code below reloads the Chapter 4 data and then uses tmap to do
this:
load("ch4_db.RData")
lsoa_sf %>% left_join(lsoa_result) %>%
tm_shape() + tm_fill("cost_pp", style =
"kmeans",
palette = "YlOrRd",
legend.hist = T) +
tm_layout(frame = F, legend.outside = T,
legend.hist.width = 1,
legend.format = list(digits = 1),
legend.outside.position = c("left",
"top"),
legend.text.size = 0.7,
legend.title.size = 1)
Then the intersection layer, ol, which now contains an attribute of the hexbin it intersects with (HexID), can be summarised over the hexbins. The code below does this and links back to hex to create hex_ssri:
st_drop_geometry(ol) %>% group_by(HexID) %>%
  summarise(hex_cost = sum(lsoa_tot_cost),
            hex_items = sum(lsoa_tot_items),
            hex_pop = sum(population)) %>%
  ungroup() %>%
  mutate(cost_pp = hex_cost / hex_pop,
         items_pp = hex_items / hex_pop) %>%
  right_join(hex) -> hex_ssri
# add the geometry back
st_geometry(hex_ssri) = hg
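One way the hex_ssri2 object used below could be produced (an assumption, not the book's exact code) is by areal-weighted interpolation of the LSOA counts onto the hexbins with sf::st_interpolate_aw, apportioning each LSOA's totals according to its area of overlap with each hexbin:
# sketch: areal-weighted interpolation of LSOA totals onto the hexbins,
# followed by per-person rates (assumes hex is the hexbin sf layer used above)
library(sf)
library(dplyr)
lsoa_counts <- lsoa_sf %>%
  left_join(lsoa_result) %>%
  select(lsoa_tot_cost, lsoa_tot_items, population)
hex_ssri2 <- st_interpolate_aw(lsoa_counts, hex, extensive = TRUE) %>%
  mutate(cost_pp = lsoa_tot_cost / population,
         items_pp = lsoa_tot_items / population)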
The rates can be calculated from the interpolated count data and mapped
as in Figure 8.9:
# map
tm_shape(hex_ssri2) +
  tm_polygons(c("cost_pp", "items_pp"),
              palette = "viridis", style = "kmeans")
The two approaches show the same broad pattern, with the differences between the binned results in Figures 8.8 and 8.9 occurring mainly in the rural and coastal fringes (e.g. on the east and north-east coasts). These differences reflect the distinction between allocation based on LSOA geometric centroids and allocation based on the proportional areal overlap of the LSOAs with the hexbins.