Machine Learning On Geographical Data Using Python
Apress Standard
The publisher, the authors and the editors are safe to assume that the
advice and information in this book are believed to be true and
accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, expressed or implied, with
respect to the material contained herein or for any errors or omissions
that may have been made. The publisher remains neutral with regard
to jurisdictional claims in published maps and institutional affiliations.
Source Code
All source code used in the book can be downloaded from
github.com/apress/machine-learning-geographic-data-python.
Any source code or other supplementary material referenced by the
author in this book is available to readers on GitHub
(https://github.com/Apress). For more detailed information, please
visit http://www.apress.com/source-code.
Table of Contents
Part I: General Introduction
Chapter 1: Introduction to Geodata
Reading Guide for This Book
Geodata Definitions
Cartesian Coordinates
Polar Coordinates and Degrees
The Difference with Reality
Geographic Information Systems and Common Tools
What Are Geographic Information Systems
Standard Formats of Geodata
Shapefile
Google KML File
GeoJSON
TIFF/JPEG/PNG
CSV/TXT/Excel
Overview of Python Tools for Geodata
Key Takeaways
Chapter 2: Coordinate Systems and Projections
Coordinate Systems
Geographic Coordinate Systems
Projected Coordinate Systems
Local Coordinate Systems
Which Coordinate System to Choose
Playing Around with Some Maps
Example: Working with Own Data
Key Takeaways
Chapter 3: Geodata Data Types
Vector vs. Raster Data
Dealing with Attributes in Vector and Raster
Points
Definition of a Point
Importing an Example Point Dataset in Python
Some Basic Operations with Points
Lines
Definition of a Line
An Example Line Dataset in Python
Polygons
Definition of a Polygon
An Example Polygon Dataset in Python
Some Simple Operations with Polygons
Rasters/Grids
Definition of a Grid or Raster
Importing a Raster Dataset in Python
Key Takeaways
Chapter 4: Creating Maps
Mapping Using Geopandas and Matplotlib
Getting a Dataset into Python
Making a Basic Plot
Plot Title
Plot Legend
Mapping a Point Dataset with Geopandas and Matplotlib
Concluding on Mapping with Geopandas and Matplotlib
Making a Map with Cartopy
Concluding on Mapping with Cartopy
Making a Map with Plotly
Concluding on Mapping with Plotly
Making a Map with Folium
Concluding on Mapping with Folium
Key Takeaways
Part II: GIS Operations
Chapter 5: Clipping and Intersecting
What Is Clipping?
A Schematic Example of Clipping
What Happens in Practice When Clipping?
Clipping in Python
What Is Intersecting?
What Happens in Practice When Intersecting?
Conceptual Examples of Intersecting Geodata
Intersecting in Python
Difference Between Clipping and Intersecting
Key Takeaways
Chapter 6: Buffers
What Are Buffers?
A Schematic Example of Buffering
What Happens in Practice When Buffering?
Creating Buffers in Python
Creating Buffers Around Points in Python
Creating Buffers Around Lines in Python
Creating Buffers Around Polygons in Python
Combining Buffers and Set Operations
Key Takeaways
Chapter 7: Merge and Dissolve
The Merge Operation
What Is a Merge?
A Schematic Example of Merging
Merging in Python
Row-Wise Merging in Python
Attribute Join in Python
Spatial Join in Python
The Dissolve Operation
What Is the Dissolve Operation?
Schematic Overview of the Dissolve Operation
The Dissolve Operation in Python
Key Takeaways
Chapter 8: Erase
The Erase Operation
Schematic Overview of Spatially Erasing Points
Schematic Overview of Spatially Erasing Lines
Schematic Overview of Spatially Erasing Polygons
Erase vs. Other Operations
Erase vs. Deleting a Feature
Erase vs. Clip
Erase vs. Overlay
Erasing in Python
Erasing Portugal from Iberia to Obtain Spain
Erasing Points in Portugal from the Dataset
Cutting Lines to Be Only in Spain
Key Takeaways
Part III: Machine Learning and Mathematics
Chapter 9: Interpolation
What Is Interpolation?
Different Types of Interpolation
Linear Interpolation
Polynomial Interpolation
Nearest Neighbor Interpolation
From One-Dimensional to Spatial Interpolation
Spatial Interpolation in Python
Linear Interpolation Using Scipy Interp2d
Kriging
Linear Ordinary Kriging
Gaussian Ordinary Kriging
Exponential Ordinary Kriging
Conclusion on Interpolation Methods
Key Takeaways
Chapter 10: Classification
Quick Intro to Machine Learning
Quick Intro to Classification
Spatial Classification Use Case
Feature Engineering with Additional Data
Importing and Inspecting the Data
Spatial Operations for Feature Engineering
Reorganizing and Standardizing the Data
Modeling
Model Benchmarking
Key Takeaways
Chapter 11: Regression
Introduction to Regression
Spatial Regression Use Case
Importing and Preparing Data
Iteration 1 of Data Exploration
Iteration 1 of the Model
Iteration 2 of Data Exploration
Iteration 2 of the Model
Iteration 3 of the Model
Iteration 4 of the Model
Interpretation of Iteration 4 Model
Key Takeaways
Chapter 12: Clustering
Introduction to Unsupervised Modeling
Introduction to Clustering
Different Clustering Models
Spatial Clustering Use Case
Importing and Inspecting the Data
Cluster Model for One Person
Tuning the Clustering Model
Applying the Model to All Data
Key Takeaways
Chapter 13: Conclusion
What You Should Remember from This Book
Recap of Chapter 1 – Introduction to Geodata
Recap of Chapter 2 – Coordinate Systems and Projections
Recap of Chapter 3 – Geodata Data Types
Recap of Chapter 4 – Creating Maps
Recap of Chapter 5 – Clipping and Intersecting
Recap of Chapter 6 – Buffers
Recap of Chapter 7 – Merge and Dissolve
Recap of Chapter 8 – Erase
Recap of Chapter 9 – Interpolation
Recap of Chapter 10 – Classification
Recap of Chapter 11 – Regression
Recap of Chapter 12 – Clustering
Further Learning Path
Going into Specialized GIS
Specializing in Machine Learning
Remote Sensing and Image Treatment
Other Specialties
Key Takeaways
Index
About the Author
Joos Korstanje
is a data scientist with over five years of
industry experience in developing
machine learning tools. He has a double
MSc in Applied Data Science and in
Environmental Science and has
extensive experience working with
geodata use cases. He has worked at a
number of large companies in the
Netherlands and France, developing
machine learning for a variety of tools.
His experience in writing and teaching
has motivated him to write this book on
machine learning for geodata with
Python.
About the Technical Reviewer
Xiaochi Liu
is a PhD researcher and data scientist at
Macquarie University, specializing in
machine learning, explainable artificial
intelligence, spatial analysis, and their
novel application in environmental and
public health. He is a programming
enthusiast using Python and R to
conduct end-to-end data analysis. His
current research applies cutting-edge AI
technologies to untangle the causal
nexus between trace metal
contamination and human health to
develop evidence-based intervention
strategies for mitigating environmental exposure.
Part I
General Introduction
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
J. Korstanje, Machine Learning on Geographical Data Using Python
https://doi.org/10.1007/978-1-4842-8287-8_1
1. Introduction to Geodata
Joos Korstanje1
(1) VIELS MAISONS, France
Geodata De initions
To get started, I want to cover the basics of coordinate systems in the
simplest mathematical situation: the Euclidean space. Although the world
does not respect the hypotheses made by Euclidean geometry, it is a great
entry point into a deeper understanding of coordinate systems.
A two-dimensional Euclidean space is often depicted as shown in
Figure 1-1.
Cartesian Coordinates
To locate points in the Euclidean space, we can use the Cartesian
coordinate system. This coordinate system specifies each point uniquely
by a pair of numerical coordinates. For example, look at the coordinate
system in Figure 1-2, in which two points are located: a square and a
triangle.
The square is located at x = 2 (horizontal axis) and y = 1 (vertical axis). The
triangle is located at x = -2 and y = -1.
Figure 1-2 Two points in a coordinate system. Image by author
The point where the x and y axes meet is called the origin, and
distances are measured from there. Cartesian coordinates are among the
most well-known coordinate systems and work easily and intuitively in
the Euclidean space.
In this schematic drawing, the star is designated as the pole, and the
thick black line to the right is chosen as the polar axis. This system is
quite different from the Cartesian system but still allows us to identify
the exact same points: just in a different way.
The points are identified by two components: an angle with respect to
the polar axis and a distance. The square that used to be referred to as
Cartesian coordinate (2,1) can be referred to by an angle from the polar
axis and a distance.
This is shown in Figure 1-4.
Figure 1-4 A point in the polar coordinate system. Image by author
At this point, you can measure the distance and the angle and obtain
the coordinate in the polar system. Judged by the eye alone, we could say
that the angle is probably more or less 30° and the distance is slightly
above 2. We would need more precise measurement tools and a more
precise drawing to obtain a more exact value.
There are trigonometric computations that we can use to convert
between polar and Cartesian coordinates. The first set of formulas allows
you to go from polar to Cartesian:
x = r cos(φ)
y = r sin(φ)
The letter r signifies the distance and the letter φ is the angle. You can go
the other way as well, using the following formulas:
r = √(x² + y²)
φ = arctan(y / x)
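As a quick illustration, the following Python snippet converts the square's Cartesian coordinate (2, 1) into polar form and back, using nothing but the standard math module:
import math

r = math.sqrt(2**2 + 1**2)      # distance from the origin, about 2.24
phi = math.atan2(1, 2)          # angle with the polar axis, in radians

x = r * math.cos(phi)           # back to Cartesian: 2.0
y = r * math.sin(phi)           # back to Cartesian: 1.0
print(r, math.degrees(phi), x, y)   # the angle is roughly 26.6 degrees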
ArcGIS
ArcGIS, made by ESRI, is arguably the most famous software package for
working with Geographic Information Systems. It has a very large
number of functionalities that can be accessed through a user-friendly
point-and-click interface, and visual programming of geodata processing
pipelines is also possible. Python integration is even possible for those
who have specific tasks for which there are no preexisting tools in ArcGIS.
Among its tools are also AI and data science options.
ArcGIS is a great software package for working with geodata. Yet there is one
big disadvantage: it is paid, proprietary software. It is
therefore accessible only to companies or individuals that have no
difficulty paying the considerably high price. Even though it may be
worth its price, you'll need to be able to pay or convince your company to
pay for such software. Unfortunately, this is often not the case.
QGIS and Other Open Source ArcGIS Alternatives
Open source developers have jumped into this open niche of GIS systems
by developing open source (and therefore free to use) alternatives. These
include QGIS, GRASS GIS, PostGIS, and more.
The clear advantage of this is that they are free to use. Yet their
functionality is often much more limited. In most of them, users have the
ability to code their own modules in case some of the needed tools are
not available.
This approach can be a good fit for your needs if you are not afraid to
commit to a system like QGIS and fill the gaps that you may eventually
encounter.
Python/R Programming
Finally, you can use Python or R programming for working with geodata
as well. Programming, especially in Python or R, is a very common skill
among data professionals nowadays.
Whereas programming skills were less widespread a few years back, the
boom in data science, machine learning, and artificial intelligence has
made languages like Python commonplace throughout
the workforce.
Now that many are able to code or have access to courses to learn
how to code, the need for full-fledged GIS software decreases. The availability of a
number of well-functioning geodata packages is enough for many to get
started.
Python or R programming is a great tool for treating geodata with
common or more modern methods. By using these programming
languages, you can easily apply tools from other libraries to your geodata,
without having to convert this to QGIS modules, for example.
The only problem that is not very well solved by programming
languages is long-term geodata storage. For this, you will need a
database. Cloud-based databases are nowadays relatively easy to arrange
and manage, and this problem is therefore relatively easily solved.
Shapefile
The shapefile is a very commonly used file format for geodata because it
is the standard format for ArcGIS. The shapefile is not very friendly for
being used outside of ArcGIS, but due to the popularity of ArcGIS, you will
likely encounter shapefiles at some point.
The shapefile is not really a single file. It is actually a collection of files
that are stored together in one and the same directory, all having the
same name. You have the following files that make up a shapefile:
– myfile.shp: The main file, also called the shapefile (confusing but true)
– myfile.shx: The shapefile index file
– myfile.dbf: The shapefile data file that stores attribute data
– myfile.prj: Optional file that stores spatial reference and projection
metadata
As an example, let’s look at an open data dataset containing the
municipalities of the Paris region that is provided by the French
government. This dataset is freely available at
https://geo.data.gouv.fr/en/datasets/8fadd7040c4b94f2c318a0971e8faedb7b5675d6
On this website, you can download the data in SHP/L93 format, which
gives you a zipped directory. Figure 1-6 shows what this contains.
Figure 1-6 The inside of the shapefile. Image by author Data source: Ministry of DINSIC. Original
data downloaded from
https://geo.data.gouv.fr/en/datasets/8fadd7040c4b94f2c318a0971e8faedb7b5675d6, updated on 1
July 2016. Open Licence 2.0 (www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf)
As you can see, there are the .shp file (the main file), the .shx file (the
index file), the .dbf file containing the attributes, and finally the optional
.prj file.
For this exercise, if you want to follow along, you can use your local
environment or a Google Colab Notebook at
https://colab.research.google.com/.
You have to make sure that in your environment, you install
geopandas:
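In a notebook environment such as Colab, a minimal way to do this is the following (in a local environment, use pip or conda from the command line instead):
!pip install geopandas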
Then, make sure that in your environment you have a directory called
Communes_MGP.shp in which you have the four files:
– Communes_MGP.shp
– Communes_MGP.dbf
– Communes_MGP.prj
– Communes_MGP.shx
In a local environment, you need to put the "sample_data" file in the
same directory as the notebook, but when you are working on Colab, you
will need to upload the whole folder to your working environment, by
clicking the folder icon and then dragging and dropping the whole folder
onto it. You can then execute the Python code in Code Block 1-1 to
have a peek inside the data.
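A minimal version of this step, assuming the folder layout described above (the exact path is an assumption, so adjust it to your own setup), looks as follows:
import geopandas as gpd

# geopandas reads the main .shp file and picks up the companion
# .shx/.dbf/.prj files from the same folder automatically
shapefile = gpd.read_file("Communes_MGP.shp/Communes_MGP.shp")
print(shapefile)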
To make something more visual, you can use the code in Code Block 1-
2.
shapefile.plot()
Code Block 1-2 Plotting the shapefile
You will obtain the map corresponding to this dataset as in Figure 1-8.
Figure 1-8 The map resulting from Code Block 1-2. Image by author Data source: Ministry of
DINSIC. Original data downloaded from
https://geo.data.gouv.fr/en/datasets/8fadd7040c4b94f2c318a0971e8faedb7b5675d6, updated on 1
July 2016. Open Licence 2.0 (www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf)
To get a KML file into Python, we can again use geopandas. This time,
however, it is a bit less straightforward. You'll also need the Fiona
package to obtain a KML driver. The total code is shown in Code Block 1-3.
import fiona
import geopandas as gpd
gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
# The file name below is an assumption; use the path of your downloaded KML
kmlfile = gpd.read_file("Communes_MGP.kml")
print(kmlfile)
Figure 1-10 The KML data shown in Python. Image by author Data source: Ministry of DINSIC.
Original data downloaded from
https://geo.data.gouv.fr/en/datasets/8fadd7040c4b94f2c318a0971e8faedb7b5675d6, updated on 1
July 2016. Open Licence 2.0 (www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf)
As before, you can plot this geodataframe to obtain a basic map
containing the municipalities of the area of Paris and around. This is
done in Code Block 1-4.
kmlfile.plot()
Code Block 1-4 Plotting the KML file data
Figure 1-11 The plot resulting from Code Block 1-4. Screenshot by author Data source: Ministry of
DINSIC. Original data downloaded from
https://geo.data.gouv.fr/en/datasets/8fadd7040c4b94f2c318a0971e8faedb7b5675d6, updated on 1
July 2016. Open Licence 2.0 (www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf)
GeoJSON
The json format is a data format that is well known and loved by
developers. Json is widely used in communication between different
information systems, for example, in website and Internet
communication.
The json format is loved because it is very easy to parse, and this
makes it a perfect storage format for open source and other
developer-oriented tools.
Json is a key-value format, which is much like the dictionary in
Python. The whole is surrounded by curly braces. As an example, I could
write myself as a json object as in this example:
{ "first_name": "joos",
"last_name": "korstanje",
"job": "data scientist" }
As you can see, this is a very flexible format, and it is very easy to
adapt to all kinds of circumstances. You might easily add GPS coordinates
like this:
{ "first_name": "joos",
"last_name": "korstanje",
"job": "data scientist",
"latitude": "48.8566° N",
"longitude": "2.3522° E" }
You can get a GeoJSON file easily into the geopandas library using the
code in Code Block 1-5.
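A minimal version of this step, in which the file name is an assumption (use the name of your downloaded GeoJSON file), looks like this:
import geopandas as gpd

# Read the GeoJSON directly into a geodataframe
geojsonfile = gpd.read_file("Communes_MGP.geojson")
print(geojsonfile)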
As expected, the data looks exactly like before (Figure 1-13). This is
because it is transformed into a geodataframe, and therefore the original
representation as json is not maintained anymore.
Figure 1-13 The geojson content in Python. Image by author Data source: Ministry of DINSIC.
Original data downloaded from
https://geo.data.gouv.fr/en/datasets/8fadd7040c4b94f2c318a0971e8faedb7b5675d6, updated on 1
July 2016. Open Licence 2.0 (www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf)
You can make the plot of this geodataframe to obtain a map, using the
code in Code Block 1-6.
geojsonfile.plot()
Code Block 1-6 Plotting the geojson data
TIFF/JPEG/PNG
Image file types can also be used to store geodata. After all, many maps
are 2D images that lend themselves well to being stored as an image. Some
of the standard formats to store images are TIFF, JPEG, and PNG.
– The TIFF format is an uncompressed image. A georeferenced TIFF
image is called a GeoTIFF, and it consists of a directory with a TIFF file
and a .tfw (world file).
– The better-known JPEG file type stores compressed image data. When
storing a JPEG in the same folder as a .jpw (world file), it becomes a
GeoJPEG.
– The PNG format is another well-known image file format. You can
georeference this file type as well when using it together with a .pgw
(world file).
Image file types are generally used to store raster data. For now,
consider that raster data is image-like (one value per pixel), whereas
vector data contains objects like lines, points, and polygons. We'll get to
the differences between raster and vector data in Chapter 3.
On the following website, you can download a GeoTIFF file that
contains an interpolated terrain model of Kerbernez in France:
https://geo.data.gouv.fr/en/datasets/b0a420b9e003d45aaf0670446f0d600df14430cb
You can use the code in Code Block 1-7 to read and show the raster file
in Python.
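A minimal version of this step uses the rasterio package; the file name below matches the GeoTIFF used again later in this book, but adjust it to the name of your downloaded file:
import rasterio
from rasterio.plot import show

# Open the GeoTIFF and display it with the default color scale
img = rasterio.open("ore-kbz-mnt-litto3d-5m.tif")
show(img)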
Note Depending on your OS, you may obtain a .tiff file format rather
than a .tif when downloading the data. In this case, you can simply
change the path to become .tiff, and the result should be the same. In
both cases, you will obtain the image shown in Figure 1-15.
Figure 1-15 The plot resulting from Code Block 1-7. Image by author Data source: Ministry of
DINSIC. Original data downloaded from
https://geo.data.gouv.fr/en/datasets/b0a420b9e003d45aaf0670446f0d600df14430cb, updated on
“unknown.” Open Licence 2.0 (www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf)
CSV/TXT/Excel
The same file as used in the first three examples is also available in CSV.
When downloading it and opening it with a text viewer, you will observe
something like Figure 1-16.
Figure 1-16 The contents of the CSV file. Image by author Data source: Ministry of DINSIC. Original
data downloaded from
https://geo.data.gouv.fr/en/datasets/b0a420b9e003d45aaf0670446f0d600df14430cb, updated on
“unknown.” Open Licence 2.0 (www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf)
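Reading such a CSV works with plain pandas; the file name, separator, and column layout below are assumptions, so inspect the downloaded file to confirm them before converting the coordinate or geometry column into a geopandas geometry (as shown in later chapters):
import pandas as pd

# Load the CSV as a regular, non-spatial dataframe first
csvdata = pd.read_csv("Communes_MGP.csv", sep=";")
print(csvdata.head())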
The important thing to take away from this part of the chapter is that
geodata is “just data,” but with geographic references. These can be
stored in different formats or in different coordinate systems, which can make
things complicated, but in the end you must simply make sure that you
have some sort of understanding of what you have in your data.
You can use many different tools for working with geodata. The goal of
those tools is generally to make your life easier. As a last step for this
introduction, let’s have a short introduction to the different Python tools
that you may encounter on your geodata journey.
Overview of Python Tools for Geodata
Here is a list of Python packages that you may want to look into on your
journey into geodata with Python:
Geopandas
General GIS tool with a pandas-like code syntax that makes it very
accessible for the data science world.
Fiona
Reading and writing geospatial data.
Rasterio
Python package for reading and writing raster data.
GDAL/OGR
A Python package that can be used for translating between different GIS
file formats.
RSGISLIB
A package containing remote sensing tools together with raster
processing and analysis.
PyProj
A package that can transform coordinates between different geographic
reference systems.
Geopy
Find postal addresses using coordinates or the inverse.
Shapely
Manipulation of planar geometric objects.
PySAL
Spatial analysis package in Python.
Scipy.spatial
Spatial algorithms based on the famous scipy package for data science.
Cartopy
Package for drawing maps.
GeoViews
Package for interactive maps.
A small reminder: As Python is an open source environment and those
libraries are mainly developed and maintained by unpaid open source
developers, there is always a chance that something changes or
becomes unavailable. This is the risk of working with open source. In
most cases, there are no big problems, but they can and do
sometimes happen.
Key Takeaways
1.
Cartesian coordinates and polar coordinates are two alternative
coordinate systems that can indicate points in a two-dimensional
Euclidean space.
2.
The world is an ellipsoid, which makes the two-dimensional
Euclidean space a bad representation. Other coordinate systems
exist for this real-world scenario.
3.
Geodata is data that contains geospatial references. Geodata can
come in many different shapes and sizes. As long as you have
software implementation (or the skills to build it), you will be able to
convert between data formats.
4.
A number of Python packages exist that do a lot of the heavy lifting
for you.
5.
The advantage of using Python is that you can have a lot of autonomy
on your geodata treatment and that you can benefit from the large
number of geodata and other data science and AI packages in the
ecosystem.
6. A potential disadvantage of Python is that the software is open
source, meaning that you have no guarantee that your preferred
libraries still exist in the future. Python is also not suitable for long-
term data storage and needs to be complemented with such a data
storage solution (e.g., databases or file storage).
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
J. Korstanje, Machine Learning on Geographical Data Using Python
https://doi.org/10.1007/978-1-4842-8287-8_2
Figure 2-1 Airplane routes are not straight on a map. Image adapted from
https://en.wikipedia.org/wiki/World_map#/media/File:Blue_Marble_2002.png (Public Domain
Image. 10 February 2002)
Let’s now consider an example where you are holding a round soccer
ball. When going from one point to another on a ball, you will intuitively
be able to say which path is the fastest. If you are looking straight at the
ball, when following your finger going from one point to another, you will
see your hand making a shape like in Figure 2-2.
Figure 2-2 The shortest path on a ball is not a straight line in two-dimensional view. Image by
author
Coordinate Systems
While the former discussion was merely intuitive, it is now time to slowly
get to more official definitions of the concepts that you have seen. As we
are ignoring the height of a point (e.g., with respect to sea level) for the
moment, we can identify three types of coordinate systems:
– Geographic Coordinate Systems
– Projected Coordinate Systems
– Local Coordinate Systems
Let’s go over all three of them.
Geographic Coordinate Systems
Geographic Coordinate Systems are the coordinate systems that we have
been talking about in the previous part. They respect the fact that the
world is an ellipsoid, and they, therefore, express points using degrees or
radians latitude and longitude.
As they respect the ellipsoid property of the earth, it is very hard to
make maps or plots with such coordinate systems.
X and Y Coordinates
When working with Projected Coordinate Systems, we do not talk about
latitude and longitude anymore. Latitude and longitude are relevant
only for measurements on the globe (ellipsoid); on a flat surface, we can
drop this complexity. Once the three-dimensional lat/long
coordinates have been converted to the coordinates of their projection,
we simply talk about x and y coordinates.
X is generally the distance to the east starting from the origin and y
the distance to the north starting from the origin. The location of the
origin depends on the projection that you are using. The measurement
unit also changes from one Projected Coordinate System to another.
Conformal Projections
If shapes are important for your use case, you may want to use a
conformal projection. Conformal projections are designed to preserve
shapes. At the cost of distorting the areas on your map, this category of
projections guarantees that all of the angles are preserved, and this
makes sure that you see the “real” shapes on the map.
Mercator
The Mercator map is very well known, and it is the standard map
projection for many projects. Its advantage is that it has north on top and
south on the bottom while preserving local directions and shapes.
Unfortunately, locations far away from the equator are strongly
inflated, for example, Greenland and Antarctica, while zones on the
equator look too small in comparison (e.g., Africa).
The map looks like shown in Figure 2-5.
Figure 2-5 The world seen in a Mercator projection Source:
https://commons.wikimedia.org/wiki/File:Mercator_projection_of_world_with_grid.png. Public
Domain
Equidistant Projections
As the name indicates, you should use equidistant projections if you want
a map that respects distances. In the two previously discussed projection
types, there is no guarantee that distance between two points is
respected. As you can imagine, this will be a problem for many use cases.
Equidistant projections are there to save you if distances are key to your
solution.
Azimuthal Equidistant Projection
One example of an equidistant projection is the azimuthal equidistant
projection, also called Postel or zenithal equidistant. It preserves
distances from the center and looks as shown in Figure 2-7.
Figure 2-7 The world seen in an azimuthal equidistant projection Source:
https://commons.wikimedia.org/wiki/File:Azimuthal_equidistant_projection_of_world_with_grid.pn
g. Public Domain
Then you can use the code in Code Block 2-2 to import your map and
show the data that is contained within it.
import fiona
import geopandas as gpd
gpd.io.file.fiona.drvsupport.supported_drivers['KML']
= 'rw'
kmlfile =
gpd.read_file("the/path/to/the/exported/file.kml")
print(kmlfile)
Code Block 2-2 Importing the data
You'll find that there is just one line in this dataframe and that it
contains a polygon called France. Figure 2-11 shows this.
Figure 2-11 The contents of the dataframe. Image by author
print(kmlfile.loc[0,'geometry'])
Code Block 2-3 Extracting the geometry from the dataframe
You will see that the data of this polygon is a sequence of coordinates
indicating the contours. This looks like Code Block 2-4.
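The plotting step itself is again just the geodataframe's plot method; a minimal version, reusing the kmlfile object imported in Code Block 2-2, would be:
kmlfile.plot()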
You will obtain a map of the polygon. You should recognize the exact
shape of the polygon, as it was defined in your map or in the example
map, depending on which one you used.
kmlfile.crs
Code Block 2-6 Extracting the coordinate system
You’ll see the result like in Figure 2-13 being shown in your notebook.
Figure 2-13 The result of Code Block 2-6. Image by author
It may be interesting to see what happens when we plot the map into
a very different coordinate system. Let’s try to convert this map into a
different coordinate system using the geopandas library. Let’s change
from the geographic WGS 84 into the projected Europe Lambert
conformal conic map projection, which is also known as ESRI:102014.
The code in Code Block 2-7 makes the transformation from the source
coordinate system to the target coordinate system.
proj_kml = kmlfile.to_crs('ESRI:102014')
proj_kml
Code Block 2-7 Changing the coordinate system
Figure 2-14 The resulting dataframe from Code Block 2-7. Image by author
import matplotlib.pyplot as plt
proj_kml.plot()
plt.title('ESRI:102014 map')
Code Block 2-8 Plotting the map
The result is shown in Figure 2-15.
Figure 2-15 The plot resulting from Code Block 2-8. Image by author
Although differences here are small, they can have a serious effect on
your application. It is important to understand here that none of the
maps are “wrong.” They just use a different mathematical formula for
projecting a 3D curved piece of land onto a 2D image.
Key Takeaways
1.
Coordinate systems are mathematical descriptions of the earth that
allow us to communicate about locations precisely.
2.
Many coordinate systems exist, and each has its own advantages and
imperfections. One must choose a coordinate system depending on
their use case.
3.
Geographic Coordinate Systems use degrees and try to model the
Earth as an ellipsoid or sphere.
4.
Projected Coordinate Systems propose methods to convert the 3D
reality onto a 2D map. This comes at the cost of some features of
reality, which cannot be represented perfectly in 2D.
5. There are a number of well-known projection categories. Equidistant
makes sure that distances are not disturbed. Equal area projections
make sure that areas are respected. Conformal projections maintain
shape. Azimuthal projections keep directions the same.
6.
Throughout the previous chapters, you have been (secretly) exposed to a number
of different geodata data types. In Chapter 1, we talked about identifying
points inside a coordinate system. In the previous chapter, you saw how a polygon
in the shape of the country France was created. You have also seen an example of a
TIFF data file being imported into Python.
Understanding geodata data types is key to working efficiently with geodata.
In regular, tabular datasets, it is generally not too costly to transform from one
data type to another. Also, it is generally quite easy to say which data type is the
“best” data type for a given variable or a given data point.
In geodata, the choice of data type has a much bigger impact. Transforming
polygons of the shapes of countries into points is not a trivial task, as this would
require defining (artificially) where you'd want to put each point. This would typically
be the "middle," which would often require quite costly computations.
The other way around, however, would not be possible anymore. Once you
have a point dataset with the centers of countries, you would never be able to find
the countries’ exact boundaries anymore.
This problem is illustrated using the two images in Figures 3-1 and 3-2. The first
shows a map with the contours of the countries of the world, on which
some black crosses indicate some of the countries' center points. In the second
image, you see the same map, but with the polygons deleted. You can see clearly
that once the polygon information is lost, you cannot recover it.
This may be acceptable for some use cases, but it will be a problem for many other
use cases.
In this chapter, you will see the four main types of geodata data types, so that
you will be comfortable working with all types, and you will be able to decide on
the type of data to use.
Figure 3-1 Putting center points in polygons is possible. Image adapted from geopandas (BSD 3 Clause License)
Figure 3-2 Going from points back to polygons is not possible. Image by author
For raster data, the storage is generally image-like. As explained before, each
pixel has a value. It is therefore common to store the data as a two-dimensional
table in which each row represents a row of pixels, and each column represents a
column of pixels. The values in your data table represent the values of one and
only one variable. Working with raster data can be a bit harder to get into, as this
image-like data format is not very accommodating to adding additional data.
Figure 3-4 shows an example of this.
We will now get to an in-depth description of each of the data types that you
are likely to encounter, and you will see how to work with them in Python.
Points
The simplest data type is probably the point. You have seen some examples of
point data throughout the earlier chapters, and you have seen before that the
point is one of the subtypes of vector data.
Points are part of vector data, as each point is an object on the map that has its
own coordinates and that can have any number of attributes necessary. Point
datasets are great for identifying locations of specific landmarks or other types of
locations. Points cannot store anything like the shape or the size of landmarks, so
it is important that you use points only if you do not need such information.
Definition of a Point
In mathematics, a point is generally said to be an exact location that has no length,
width, or thickness. This is an interesting and important concept to understand
about point data, as in geodata, the same is true.
A point consists only of one exact location, indicated by one coordinate pair
(be it x and y, or latitude and longitude). Coordinates are numerical values,
meaning that they can take an infinite number of decimals. The number 2.0, for
example, is different from 2.1. Yet 2.01 is also different, 2.001 is a different
location again, and 2.0001 is another, different, location.
Even if two points are very close to each other, it would theoretically not be
correct to say that they are touching each other: as long as they are not in the same
location, there will always be a small distance between the points.
Another consideration is that if you have a point object, you cannot tell
anything about its size. Although you could make points larger and smaller on the
map, your point still stays at size 0. It is really just a location.
Of course, this is an extract, and the real list of variables about the squirrels is
much longer. What is interesting to see is how the KML data format has stored
point data just by having coordinates with it. Python (or any other geodata tool)
will recognize the format and will be able to automatically import this the right
way.
To import the data into Python, we can use the same code that was used in the
previous chapter. It uses Fiona and geopandas to import the KML ile into a
geopandas dataframe. The code is shown in Code Block 3-1.
import fiona
import geopandas as gpd
gpd.io.file.fiona.drvsupport.supported_drivers['KML'] =
'rw'
kmlfile = gpd.read_file("2018 Central Park Squirrel Census
- Squirrel Data.kml")
print(kmlfile)
Code Block 3-1 Importing the Squirrel data
You will see the dataframe, containing geometry, being printed as shown in
Figure 3-6.
Figure 3-6 Capture of the Squirrel data. Image by author Data source: NYC OpenData. 2018 Central Park
Squirrel Census
You can clearly see that each line is noted as follows: POINT (coordinate
coordinate). The coordinate system should be located in the geodataframe’s
attributes, and you can look at it using the code in Code Block 3-2.
kmlfile.crs
Code Block 3-2 Inspecting the coordinate system
You’ll see the info about the coordinate system being printed, as shown in
Figure 3-7.
Figure 3-7 The output from Code Block 3-2. Image by author Data source: NYC OpenData. 2018 Central Park
Squirrel Census
You can plot the map to see the squirrel sightings using the code in
Code Block 3-3. It is not very pretty for now, but additional visualization
techniques will be discussed in Chapter 4. For now, let's focus on the data formats.
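A minimal version of this plotting step, reusing the kmlfile geodataframe imported in Code Block 3-1, is simply:
kmlfile.plot()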
kmlfile.columns
Code Block 3-4 Inspecting the columns
You’ll see that only the data shown in Figure 3-9 has been successfully
imported.
Figure 3-9 The output from Code Block 3-4. Image by author
Now, this would be a great setback with any no-code geodata program, but as
we are using Python, we have the full autonomy to find a way to repair this
problem. I am not saying that it is great that we have to parse the XML ourselves,
but at least we are not blocked at this point.
XML parsing can be done using the xml library. XML is a tree-based data
format, and using the xml ElementTree API, you can loop through the different levels
of the tree and work your way down. Code Block 3-5 shows how to do this.
import xml.etree.ElementTree as ET
tree = ET.parse("2018 Central Park Squirrel Census - Squirrel Data.kml")
root = tree.getroot()
# Collect one row of attribute values per Placemark; the nesting assumed here
# (root > Document > Folder > Placemark > ExtendedData > ...) may need adjusting
df = []
for placemark in root[0][0]:
    df_row = []
    for elementdata in placemark:
        for x in elementdata:
            if len(x):
                df_row.append(x[0].text)
    df.append(df_row)
We can now (finally) apply our filter on the column shift, using the code in
Code Block 3-6.
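A minimal sketch of this filtering step, assuming the parsed rows were loaded into a pandas DataFrame named squirrel_df with 'x', 'y', and 'shift' columns (all of these names are assumptions), would be:
# Split the sightings into morning (AM) and afternoon/evening (PM) observations
AM_data = squirrel_df[squirrel_df['shift'] == 'AM']
PM_data = squirrel_df[squirrel_df['shift'] == 'PM']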
AM_geodata = gpd.GeoDataFrame(AM_data,
geometry=gpd.points_from_xy(AM_data['x'], AM_data['y']))
PM_geodata = gpd.GeoDataFrame(PM_data,
geometry=gpd.points_from_xy(PM_data['x'], PM_data['y']))
Code Block 3-7 Create geometry format
AM_geodata.plot()
plt.title('AM squirrels')
PM_geodata.plot()
plt.title('PM squirrels')
Code Block 3-8 Building the two plots
The result is shown in Figure 3-11. You now have the maps necessary to
investigate differences in AM and PM squirrels. Again, visual parameters can be
improved here, but that will be covered in Chapter 4. For now, we focus on the
data types and their possibilities.
Figure 3-11 The maps resulting from Code Block 3-8 Image by author Data source: NYC OpenData. 2018
Central Park Squirrel Census
Lines
Line data is the second category of vector data in the world of geospatial data.
They are the logical next step after points. Let's get into the definitions straight
away.
Definition of a Line
Lines are also well-known mathematical objects. In mathematics, we generally
consider straight lines that go from one point to a second point. Lines have no
width, but they do have a length.
In geodata, line datasets contain not just one line, but many lines. Line
segments are straight, and therefore they only need a from point and a to point.
This means that a line segment needs two sets of coordinates (one for the first
point and one for the second point).
Lines consist of multiple line segments, and they can therefore take different
forms, consisting of straight line segments and multiple points. Lines in geodata
can therefore represent the shape of features in addition to length.
import pandas as pd
flights_data = pd.read_csv('flights.csv')
flights_data
Code Block 3-9 Import the flights data in Python
Figure 3-12 The flights data. Image by author Data source: www.kaggle.com/usdot/flight-delays, Public
Domain
geolookup = pd.read_csv('airports.csv')
geolookup
Code Block 3-10 Importing the airports data in Python
As you can see inside the data, the airports.csv is a file with geolocation
information, as it contains the latitude and longitude of all the referenced
airports. The flights.csv contains a large number of airplane routes in the USA,
identified by origin and destination airport. Our goal is to convert the routes into
georeferenced line data: a line with a from and to coordinate for each airplane
route.
Let’s start by converting the latitude and longitude variables into a point, so
that the geometry can be recognized in further operations. The following code
loops through the rows of the dataframe to generate a new variable. The whole
operation is done twice, so as to generate a "to/destination" lookup dataframe and a
"from/source" lookup dataframe. This is shown in Code Block 3-11.
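A minimal sketch of this step is given below; the column names IATA_CODE, LATITUDE, and LONGITUDE come from the Kaggle airports.csv, while the lookup names and the row-by-row construction are assumptions:
from shapely.geometry import Point

# Build a 'from' and a 'to' lookup, each mapping an airport code to a Point
from_lookup = geolookup.copy()
to_lookup = geolookup.copy()
from_lookup['geometry_from'] = [Point(row['LONGITUDE'], row['LATITUDE'])
                                for i, row in geolookup.iterrows()]
to_lookup['geometry_to'] = [Point(row['LONGITUDE'], row['LATITUDE'])
                            for i, row in geolookup.iterrows()]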
As the data types are not aligned, the easiest hack here is to convert all the
numbers to strings. There are some missing codes, and it would be better to
solve this by inspecting the data quality issues, but for this introductory example, the
string conversion does the job for us. You can also see that some columns are
dropped here. This is done in Code Block 3-12.
flights_data['ORIGIN_AIRPORT'] =
flights_data['ORIGIN_AIRPORT'].map(str)
flights_data['DESTINATION_AIRPORT'] =
flights_data['DESTINATION_AIRPORT'].map(str)
flights_data = flights_data[['ORIGIN_AIRPORT',
'DESTINATION_AIRPORT']]
Code Block 3-12 Converting the data – part 2
We now get to the step of merging the flights dataframe with
the from and to geographical lookups that we just created. The code in Code Block
3-13 merges two times (once with the from coordinates and once with the to
coordinates).
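A minimal sketch of this double merge, reusing the from_lookup and to_lookup dataframes sketched earlier (the key column names are assumptions based on the Kaggle files), would be:
flights_data = flights_data.merge(from_lookup[['IATA_CODE', 'geometry_from']],
                                  left_on='ORIGIN_AIRPORT',
                                  right_on='IATA_CODE', how='left')
flights_data = flights_data.merge(to_lookup[['IATA_CODE', 'geometry_to']],
                                  left_on='DESTINATION_AIRPORT',
                                  right_on='IATA_CODE', how='left')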
After running this code, you will end up with a dataframe that still contains
one row per route, but it has now got two georeference columns: the from
coordinate and the to coordinate. This result is shown in Figure 3-14.
Figure 3-14 The dataframe resulting from Code Block 3-13. Image by author Data source:
www.kaggle.com/usdot/flight-delays, Public Domain
The final step of the conversion process is to make lines out of these to and
from points. This can be done using the LineString function as shown in Code
Block 3-14.
from shapely.geometry import LineString

lines = []
for i, row in flights_data.iterrows():
    try:
        point_from = row['geometry_from']
        point_to = row['geometry_to']
        lines.append(LineString([point_from, point_to]))
    except:
        # some data lines are faulty so we ignore them
        pass
You will end up with a new geometry variable that contains only LINESTRINGS.
Inside each LINESTRING, you see the four values for the two coordinates (x and y
from, and x and y to). This is shown in Figure 3-15.
Figure 3-15 Linestring geometry. Image by author Data source: www.kaggle.com/usdot/flight-delays, Public
Domain
Now that you have created your own line dataset, let's make a quick
visualization as a final step. As before, you can simply use the plot functionality to
generate a basic plot of your lines. This is shown in Code Block 3-15.
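A minimal sketch of this plotting step, wrapping the list of LineStrings from Code Block 3-14 in a GeoDataFrame, would be:
import geopandas as gpd

# Put the LineStrings in a geodataframe so that the plot method is available
lines_gdf = gpd.GeoDataFrame(geometry=lines)
lines_gdf.plot()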
You should now obtain the map of the USA given in Figure 3-16. You clearly see
all the airplane trajectories expressed as straight lines. Clearly, not all of it is
correct as flights do not take a straight line (as seen in a previous chapter).
However, it gives a good overview of how to work with line data, and it is
interesting to see that we can even recognize the USA map by just using flight
lines (with some imagination).
Figure 3-16 Plot resulting from Code Block 3-15. Image by author Data source: www.kaggle.com/usdot/flight-
delays, Public Domain
Polygons
Polygons are the next step in complexity after points and lines. They are the third
and last category of vector geodata.
Definition of a Polygon
In mathematics, polygons are defined as two-dimensional shapes, made up of
lines that connect to make a closed shape. Examples are triangles, rectangles,
pentagons, etc. A circle is not officially a polygon as it is not made up of straight
lines, but you could imagine a lot of very small straight lines being able to
approximate a circle relatively well.
In geodata, the definition of the polygon is not much different. It is simply a list
of points that together make up a closed shape. Polygons are generally a much
more realistic representation of the real world. Landmarks are often identified by
points, but on a very close-up map, you would need to represent the
landmark as a polygon (the contour) for it to be useful. Roads could be well represented
by lines (remember that lines have no width) but would have to be replaced by
polygons once the map is zoomed in far enough to see houses, roads, etc.
Polygons are the data type that has the most information as they are able to
store location (just like points and lines), length (just like lines), and also area and
perimeter.
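The import of this dataset is again a one-liner with geopandas; using the world dataset bundled with geopandas is an assumption based on the data credit of Figure 3-17:
import geopandas as gpd

# World country polygons (a mix of polygons and multipolygons)
geojsonfile = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
print(geojsonfile)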
You’ll see the content of the polygon dataset in Figure 3-17. It contains some
polygons and some multipolygons (polygons that consist of multiple polygons,
e.g., the USA includes Alaska, which is not connected to the rest of its territory, so
multiple polygons are needed to describe the country).
Figure 3-17 Content of the polygon data. Image by author. Source: geopandas, BSD 3 Clause Licence
You can easily create a map, as we did before, using the plot function. This is
demonstrated in Code Block 3-17, and this time, it will automatically plot the
polygons.
geojsonfile.plot()
Code Block 3-17 Plotting polygons
Figure 3-18 The plot of polygons as created in Code Block 3-17. Image by author Source: geopandas, BSD 3
Clause Licence
geojsonfile['area'] = geojsonfile['geometry'].apply(lambda
x: x.area)
geojsonfile.sort_values('area').head(10)
Code Block 3-18 Working with the area
In Figure 3-19, you'll see the first ten rows of this data, which are the world's
smallest countries in terms of surface area.
Figure 3-19 The first ten rows of the data. Image by author Source: geopandas, BSD 3 Clause Licence
We can also compute the length of the borders by calculating the length of each
polygon's outline. The length attribute allows us to do so. You can use the code in
Code Block 3-19 to identify the ten countries with the longest contours.
geojsonfile['length'] =
geojsonfile['geometry'].apply(lambda x: x.length)
geojsonfile.sort_values('length',
ascending=False).head(10)
Code Block 3-19 Identify the ten countries with the longest contours
You'll see the result in Figure 3-20, with Antarctica being the winner. Be careful,
though, as this may be distorted by the choice of coordinate system. You may remember
that some commonly used coordinate systems have strong distortions toward
the poles and make more central locations smaller. This could influence the types
of computations that are being done here. If a very precise result is needed, you’d
need to tackle this question, but for a general idea of the countries with the
longest borders, the current approach will do.
Figure 3-20 Dataset resulting from Code Block 3-19. Image by author
Rasters/Grids
Raster data, also called grid data, is the counterpart of vector data. If you're used
to working with digital images in Python, you might find raster data quite similar.
If you're used to working with dataframes, it may be a bit more abstract and may
take a moment to get used to.
import rasterio
griddata = r'ore-kbz-mnt-litto3d-5m.tif'
img = rasterio.open(griddata)
matrix = img.read()
matrix
Code Block 3-20 Opening the raster data
As you can see in Figure 3-21, this data looks nothing like a geodataframe
whatsoever. Rather, it is just a matrix full of the values of the one (and only one)
variable that is contained in this data.
Figure 3-21 The raster data in Python. Image by author Data source: Ministry of DINSIC,
https://geo.data.gouv.fr/en/datasets/b0a420b9e003d45aaf0670446f0d600df14430cb. Creation date:
Unknown. Open Licence 2.0: www.etalab.gouv.fr/wp-content/uploads/2018/11/open-licence.pdf
You can plot this data using the default color scale, and you will see what this
numerical representation actually contains. As humans, we are particularly bad at
reading and interpreting something from a large matrix like the one earlier, but
when we see it color-coded into a map, we can get a much better feeling of what
we are looking at. The code in Code Block 3-21 does exactly that.
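A minimal version of this plotting step, using matplotlib to color-code the single band read in Code Block 3-20, would be:
import matplotlib.pyplot as plt

# Show the first (and only) band of the raster with the default color scale
plt.imshow(matrix[0])
plt.colorbar()
plt.show()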
Raster data is a bit more limited than vector data in terms of adding data to it.
Adding more variables would be quite complex, except for making the array into a
3D array, where the third dimension contains additional data. However, for plotting,
this would not be of any help, as the plot color would still be one color per pixel,
and you could never show multiple variables for each pixel with this approach.
Raster data is still a very important data type that you will often need and
often use. Any value that needs to be measured continuously over a large area is more
suited to raster format. Examples like height maps, pollution maps, density maps, and
much more are all only solvable with rasters. Raster use cases are generally a bit
more mathematically complex, as they often use a lot of matrix computations.
You’ll see examples of these mathematical operations throughout the later
chapters of the book.
Key Takeaways
1.
There are two main categories of geodata: vector and raster. They have
fundamentally different ways of storing data.
2.
Vector data stores objects and stores the geospatial references for those
objects.
3.
Raster data cuts an area into equal-sized squares and stores a data value for
each of those squares.
4.
There are three main types of vector data: point, line, and polygon.
5.
Points are zero-dimensional, and they have no size. They are only indicated
by a single x,y coordinate. Points are great for indicating the location of
objects.
6.
Lines are one-dimensional. They have a length, but no width. They are
indicated by two or more points in a sequence. Lines are great for indicating
line-shaped things like rivers and roads.
7. Polygons are two-dimensional objects. They have a shape and size. Polygons
are great when your objects are polygons and when you need to retain this
information. Polygons can indicate the location of objects if you also need to
locate their contour. It can also apply for rivers and roads when you also need
to store data about their exact shape and width. Polygons are the data type
that can retain the largest amount of information among the three vector data
types.
8.
Raster data is suitable for measurements that are continuous over an area,
like height maps, density maps, heat maps, etc.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
J. Korstanje, Machine Learning on Geographical Data Using Python
https://doi.org/10.1007/978-1-4842-8287-8_4
4. Creating Maps
Joos Korstanje1
(1) VIELS MAISONS, France
Mapmaking is one of the earliest and most obvious use cases of the field of
geodata. Maps are a special form of data visualization: they have a lot of standards
and are therefore easily recognizable and interpretable for almost anyone.
Just like other data visualization methods, maps are a powerful tool to share a
message about a dataset. Visualization tools are often wrongly interpreted as an
objective depiction of the truth, whereas in reality, map makers and visualization
builders have a huge power to decide what goes on the map and what is left out.
An example is color scale picking on maps. People are so familiar with some
visualization techniques that when they see them, they automatically believe
them.
Imagine a map showing pollution levels in a specific region. If you wanted
people to believe that pollution is not a big problem in the area, you could build
and share a map that shows areas with low pollution as dark green and very
strongly polluted areas as light green. Add to that a small, unreadable legend, and
people will easily conclude that there is no big pollution problem.
If you want to argue the other side, you could publish an alternative map that
shows the exact same values but depicts strong pollution as dark red and
slight pollution as light red. When people see this map, they will immediately be
tempted to conclude that pollution is a huge problem in your area and that it
needs immediate action.
It is important to understand that there is no single objective truth in visualization
choices. There are, however, a number of levers in mapmaking that you should master
well in order to create maps for your specific purpose. Whether your purpose is
making objective maps, beautiful maps, or communicating a message, there are a
number of tools and best practices that you will discover in this chapter. Those
are important to remember when making maps and will come in handy when
interpreting maps as well.
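The starting point for the first example is a geodataframe with the world's country polygons. A minimal version of this import step (the use of the geopandas sample dataset is an assumption based on the data credits of the figures in this chapter) is:
import geopandas as gpd
import matplotlib.pyplot as plt

# World country polygons as a geodataframe
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
world.head()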
Once you execute this code, you'll see the first five lines of the geodataframe
containing the world's countries, as displayed in Figure 4-1.
Figure 4-1 The data. Image by author. Data source: geopandas, BSD 3 Clause Licence
For this example, we’ll make a map that is color-coded: colors will be based on
the area of the countries. To get there, we need to add a column to the
geodataframe that contains the countries’ areas. This can be obtained using Code
Block 4-2.
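A minimal version of this step (note that with unprojected lat/long coordinates the computed area is only a rough, relative measure, which is fine for color-coding) would be:
# Add a column with the area of each country polygon
world['area'] = world.geometry.area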
If you now look at the dataframe again, you’ll see that an additional column is
indeed present, as shown in Figure 4-2. It contains the area of each country and
will help us in the mapmaking process.
Figure 4-2 The data with an additional column. Image by author. Data source: geopandas, BSD 3 Clause Licence
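A first, plain version of the map can be made with the plot method; the larger figure size below is an assumption based on the caption of Figure 4-3:
# A basic plot of the polygons, without any color-coding
world.plot(figsize=(15, 10))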
If you do this, you’ll obtain a plot that just contains the polygons, just like in
Figure 4-3. There is no additional color-coding going on.
Figure 4-3 Larger plot size. Image by author. Data source: geopandas, BSD 3 Clause Licence
As the goal of our exercise is to color-code countries based on their total area,
we’ll need to start improving on this map with additional plotting parameters.
Adding color-coding to a plot is fairly simple using geopandas and matplotlib.
The plot method can take an argument column, and when specifying a column
name there, the map will automatically be color-coded based on this column.
In our example, we want to color-code with the newly generated variable
called area, so we’ll need to specify column=’area’ in the plot arguments. This is
done in Code Block 4-4.
world.plot(column='area', cmap='Greys')
Code Block 4-4 Adding a color-coded column
You will see the black and white coded map as shown in Figure 4-4.
Figure 4-4 The grayscale map resulting from Code Block 4-4. Image by author Data source: geopandas, BSD 3
Clause Licence
Plot Title
Let’s continue working on this map a bit more. One important thing to add to any
visualization, including maps, is a title. A title will allow readers to easily
understand what the goal of your map is.
When making maps with geopandas and matplotlib, you can use the
matplotlib command plt.title to easily add a title on top of your map. The example
in Code Block 4-5 shows you how it’s done.
world.plot(column='area', cmap='Greys')
plt.title('Area per country')
Code Block 4-5 Adding a plot title
You will obtain the map in Figure 4-5. It is still the same map as before, but
now has a title on top of it.
Figure 4-5 The same map with a title. Image by author Data source: geopandas, BSD 3 Clause Licence
Plot Legend
Another essential part of maps (and other visualizations) is to add a legend
whenever you use color or shape encodings. In our map, we are using color-
coding to show the area of the countries in a quick visual manner, but we have not
yet added a legend. It can therefore be confusing for readers of the map to
understand which colors indicate large areas and which indicate small areas.
In the code in Code Block 4-6, the plot method takes two additional arguments.
Legend is set to True to generate a legend. The legend_kwds takes a dictionary
with some additional parameters for the legend. The label will be the label of the
legend, and the orientation is set to horizontal to make the legend appear on the
bottom rather than on the side. A title is added at the end of the code, just like you
saw in the previous part.
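A minimal version of this step (the exact legend label text is an assumption) would be:
world.plot(column='area', cmap='Greys', legend=True,
           legend_kwds={'label': 'Area per country',
                        'orientation': 'horizontal'})
plt.title('Area per country')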
This is the final version of this map for the current example. The map does a
fairly good job at representing a numerical value for different countries. This type
of use case is easily solvable with geopandas and matplotlib. Although it may not
be the most aesthetically pleasing map, it is perfect for analytical purposes and
the like.
cities =
gpd.read_file(gpd.datasets.get_path('naturalearth_cities'))
cities.head()
Code Block 4-7 Importing the data
When executing this code, you'll see the first five lines of the dataframe, just
as shown in Figure 4-7. The column geometry shows the points, each consisting of
two coordinates just like you have seen in earlier chapters.
Figure 4-7 Head of the data. Image by author Data source: geopandas, BSD 3 Clause Licence
You can easily plot this dataset with the plot command, as we have done many
times before. This is shown in Code Block 4-8.
cities.plot()
Code Block 4-8 Plotting the cities data
You will obtain a map with only points on it, as shown in Figure 4-8.
Figure 4-8 Plot of the cities data. Image by author Data source: geopandas, BSD 3 Clause Licence
This plot is really not very readable. We need to add a background into this for
more context. We can use the world’s countries for this, using only the borders of
the countries and leaving the content white.
The code in Code Block 4-9 does exactly that. It starts with creating the fig and
ax and then sets the aspect to "equal" to make sure that the overlay will not
cause any mismatch. The world (country polygons) is then plotted using the
color white to make it seem see-through, followed by the cities with marker='x'
for cross-shaped markers and color='black' for black color.
fig, ax = plt.subplots()
ax.set_aspect('equal')
world.plot(ax=ax, color='white', edgecolor='grey')
cities.plot(ax=ax, marker='x', color='black', markersize=15)
plt.title('Cities plotted on a country border base map')
plt.show()
Code Block 4-9 Adding a background to the cities data
Figure 4-9 Adding a background to the cities data. Image by author Data source: geopandas, BSD 3 Clause
Licence
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
from matplotlib.offsetbox import AnchoredText

# The states_provinces feature is not defined in the snippet shown here;
# the definition below follows the standard Cartopy Natural Earth example
# and is an assumption.
states_provinces = cfeature.NaturalEarthFeature(
    category='cultural',
    name='admin_1_states_provinces_lines',
    scale='50m',
    facecolor='none')

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1, projection=ccrs.PlateCarree())
ax.set_extent([-10, 40, 30, 70], crs=ccrs.PlateCarree())
# background image
ax.stock_img()
ax.add_feature(cfeature.LAND)
ax.add_feature(cfeature.COASTLINE)
ax.add_feature(states_provinces, edgecolor='gray')
# Add a copyright
text = AnchoredText('\u00A9 Natural Earth; license: public domain',
                    loc=4, prop={'size': 12}, frameon=True)
ax.add_artist(text)
plt.show()
Code Block 4-10 Creating a Cartopy plot
The map resulting from this introductory Cartopy example is shown in Figure
4-10.
Figure 4-10 The Cartopy map. Image by author Data source: Natural Earth, provided through Cartopy, Public
Domain
import plotly.express as px
data = px.data.gapminder().query("year==2002")
data.head()
Code Block 4-11 Map with Plotly
The content of the first five lines of the dataframe tells us what type of
variables we have. For example, you have life expectancy, population, and gdp per
country per year. The filter on 2002 that was applied in the preceding query
means that we have only one data point per country; otherwise, plotting would be
more difficult.
Let’s create a new variable called gdp to make the plot with. This variable can
be computed using Code Block 4-12.
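The gapminder data exposes population and GDP per capita columns (pop and
gdpPercap), so a sketch of this computation could look like the following:
data['gdp'] = data['pop'] * data['gdpPercap']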
Let’s now make a bubble map in which the icon for each country is larger or
smaller based on the newly created variable gdp using Code Block 4-13.
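A minimal sketch of such a bubble map with Plotly Express, assuming the ISO country
codes in the iso_alpha column are used to place the bubbles, could be:
fig = px.scatter_geo(data, locations='iso_alpha', color='continent',
                     size='gdp', projection='natural earth')
fig.show()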
Even with this fairly simple code, you’ll obtain a quite interestingly looking
graph, as shown in Figure 4-12.
Figure 4-12 A map with Plotly. Image by author Data source: Plotly data gapminder. Source data is free data
from the World Bank via gapminder.org, CC-BY licence.
Each of the countries has a bubble that makes reference to their gdp, the
continents each have a different color, and you have a background that is the well-
known natural earth. You can hover over the data points to see more info about
each of them.
import folium
m = folium.Map(location=[48.8545, 2.2464])
m
Code Block 4-14 Mapping with Folium
Figure 4-13 A map with Folium. Image by author This Produced Work is based on the map data source from
OpenStreetMap. Map Data Source OpenStreetMap contributors. Underlying data is under Open Database Licence.
Map (Produced Work) is under the copyright of this book.
https://wiki.osmfoundation.org/wiki/Licence/Licence_and_Legal_FAQ#What_is_the_licence,_how_can_I_use_it?
Now, the interesting thing here is that this map does not contain any of your
data. It seems like it could be a map filled with complex points, polygons, labels,
and more, and deep down somewhere in the software it is. The strong point of
Folium as a visualization layer is that you do not at all need to worry about this.
All your “background data” will stay cleanly hidden from the user. You can imagine
that this would be very complex to create using the actual polygons, lines, and
points about the Paris region.
Let’s go a step further and add some data to this basemap. We’ll add two
markers (point data in Folium terminology): one for the Eiffel Tower and one for
the Arc de Triomphe.
The code in Code Block 4-15 shows a number of additions to the previous
code. First, it adds a zoom_start. This basically tells you how much zoom you want
to show when initializing the map. If you have played around with the first
example, you'll see that you can zoom out so far as to see the whole world on your
map and that you can zoom in to see a very detailed map as well. It really is very
complete. However, for a specific use case, you would probably want to focus on a
specific region or zone, and setting a zoom_start will help your users identify what
they need to look at.
Second, there are two markers added to the map. They are very intuitively
added to the map using the .add_to method. Once added to the map, you simply
show the map like before, and they will appear. You can specify a popup so that
you see additional information when clicking on your markers. Using HTML
markup, you can create whole paragraphs of information here, in case you’d want
to.
As the markers are point geometry data, they just need x and y coordinates to
be located on the map. Of course, these coordinates have to be in the correct
coordinate system, but that is nothing different from anything you’ve seen before.
import folium
m = folium.Map(location=[48.8545, 2.2464], zoom_start=11)
folium.Marker([48.8584, 2.2945], popup="Eiffel Tower").add_to(m)
folium.Marker([48.8738, 2.2950], popup="Arc de Triomphe").add_to(m)
m
Code Block 4-15 Add items to the Folium map
If you are working in a notebook, you will then be able to see the interactive
map appear as shown in Figure 4-14. It has the two markers for showing the Eiffel
Tower and the Arc de Triomphe, just like we started out to do.
Figure 4-14 Improved Folium map. Image by author This Produced Work is based on the map data source from
OpenStreetMap. Map Data Source OpenStreetMap contributors. Underlying data is under Open Database Licence.
Map (Produced Work) is under the copyright of this book.
https://wiki.osmfoundation.org/wiki/Licence/Licence_and_Legal_FAQ#What_is_the
_licence,_how_can_I_use_it?
For more details on plotting maps with Folium, I strongly recommend that you
read the documentation. There is much more documentation out there, as well as
sample maps and examples with different data types.
Key Takeaways
1.
There are many mapping libraries in Python, each with its specific advantages
and disadvantages.
2.
Using geopandas together with matplotlib is probably the easiest and most
intuitive approach to making maps with Python. This approach allows you to
work with your dataframes in an intuitive pandas-like manner in geopandas
and use the familiar matplotlib plotting syntax. Aesthetically pleasing maps
may be a little bit of work to obtain.
3.
Cartopy is an alternative that is less focused on data and more on the actual
mapping part. It is a library very specific to cartography and has good support
for different geometries, different coordinate systems, and the like.
4.
Plotly is a visualization library, and it is, therefore, less focused on the
geospatial functionalities. It does come with a powerful list of visualization
options, and it can create aesthetically pleasing maps that can really
communicate a message.
5.
Folium is a great library for creating interactive maps. The maps that you can
create even with little code are of high quality and are similar in user
experience to Google Maps and the like. The built-in background maps allow
you to make useful maps even when you have very little data to show.
6.
Having seen those multiple approaches to mapmaking, the most important
takeaway is that maps are created for a purpose. They either try to make an
objective representation of some data, or they can try to send a message.
They can also be made for having something that is nice to look at. When
choosing your method for making maps with Python, you should choose the
library and the method that best serves your purpose. This always depends
on your use case.
Part II
GIS Operations
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
J. Korstanje, Machine Learning on Geographical Data Using Python
https://doi.org/10.1007/978-1-4842-8287-8_5
What Is Clipping?
Clipping, in geoprocessing, takes one layer, an input layer, and uses a
specified boundary layer to cut out a part of the input layer. The part that
is cut out is retained for future use, and the rest is generally discarded.
The clipping operation is like a cookie cutter, in which your cookie
dough is the input layer in which a cookie-shaped part is being cut out.
In the coming part, you will see a more practical application of this
theory by applying the clipping operation in Python.
Clipping in Python
In this example, you will see how to apply a clipping operation in Python.
The dataset is a dataset that I have generated specifically for this
exercise. It contains two features:
A line that covers a part of the Seine River (a famous river that runs
through Paris, France, and also covers a large part of the country)
A polygon that covers the center of Paris
The goal of the exercise is to clip the Seine River to the Paris center
region. This is a very realistic use of the clipping operation. After all,
rivers are often multicountry objects and are often displayed in maps.
When working on a more local map, you will likely encounter the case
where you will have to clip rivers (or other lines like highways, train lines,
etc.) to a more local extent.
Let’s start with importing the dataset and opening it. You can ind the
data in the GitHub repository. For the execution of this code, I’d
recommend using a Kaggle notebook or a local environment, as Colab has
an issue with the clipping function at the time of writing.
You can import the data using geopandas, as you have learned in
previous chapters. The code for doing this is shown in Code Block 5-1.
import geopandas as gpd

gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
data = gpd.read_file('ParisSeineData.kml')
print(data)
Code Block 5-1 Importing the data
We can quickly use the geopandas built-in plot function to get a plot of
this data. Of course, you have already seen more advanced mapping
options in the previous chapters, but the goal here is just to get a quick
feel of the data we have. This is done in Code Block 5-2.
data.plot()
Code Block 5-2 Plotting the data
When using this plot method, you will observe the map in Figure 5-5,
which clearly contains the two features: the Seine River as a line and the
Paris center as a polygon.
Figure 5-5 The plot resulting from Code Block 5-2. Image by author
seine = data.iloc[0:1,:]
seine.plot()
Code Block 5-3 Extract the Seine data
You can verify in the resulting plot (Figure 5-6) that this has been
successful.
Figure 5-6 The plot resulting from Code Block 5-3. Image by author
Now, we do the same for the Paris polygon using the code in Code
Block 5-4.
paris = data.iloc[1:2,:]
paris.plot()
Code Block 5-4 Extracting the Paris data
You will obtain a plot with the Paris polygon to verify that everything
went well. This is shown in Figure 5-7.
Figure 5-7 The plot resulting from Code Block 5-4. Image by author
Now comes the more interesting part: using the Paris polygon as a
clip to the Seine River. The code to do this using geopandas is shown in
Code Block 5-5.
paris_seine = seine.clip(paris)
paris_seine
Code Block 5-5 Clipping the Seine to the Paris region
You will obtain a new version of the Seine dataset, as shown in Figure
5-8.
paris_seine.plot()
Code Block 5-6 Plotting the clipped data
Figure 5-9 The Seine River clipped to the Paris polygon. Image by author
This result shows that the goal of the exercise is met. We have
successfully imported the Seine River and Paris polygon, and we have
reduced the size of the Seine River line data to fit inside Paris.
You can imagine that this can be applied for highways, train lines,
other rivers, and other line data that you’d want to use in a map for Paris,
but that is available only for a much larger extent. The clipping operation
is fairly simple but very useful for this, and it allows you to remove
useless data from your working environment.
What Is Intersecting?
The second operation that we will be looking at is the intersection. For
those of you who are aware of set theory, this part will be relatively
straightforward. For those who are not, let's do an introduction of set
theory first.
Sets, in mathematics, are collections of unique objects. A number of
standard operations are defined for sets, and this is generally helpful in
very different problem domains, one of which is geodata.
As an example, we could imagine two sets, A and B:
– Set A contains three cities: New York, Las Vegas, and Mexico City.
– Set B contains three cities as well: Amsterdam, New York, and Paris.
There are a number of standard operations that are generally applied
to sets:
– Union: All elements of both sets
– Intersection: Elements that are in both sets
– Difference: Elements that are in one but not in the other (not
symmetrical)
– Symmetric difference: Elements that are in A but not in B or in B but
not in A
With the example sets given earlier, we would observe the following:
– The union of A and B: New York, Las Vegas, Mexico City, Amsterdam,
Paris
– The intersection of A and B: New York
– The difference of A with B: Las Vegas, Mexico City
– The difference of B with A: Amsterdam, Paris
– The symmetric difference: Las Vegas, Mexico City, Amsterdam, Paris
The diagram in Figure 5-10 shows how these come about.
Figure 5-10 part 1: Set operations. Image by author
Figure 5-10 part 2: More set operations. Image by author
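These operations map directly onto Python's built-in set type, so you can verify the
results of the example above with a few lines of code:
A = {'New York', 'Las Vegas', 'Mexico City'}
B = {'Amsterdam', 'New York', 'Paris'}

print(A | B)   # union
print(A & B)   # intersection
print(A - B)   # difference of A with B
print(B - A)   # difference of B with A
print(A ^ B)   # symmetric difference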
In the first part of this chapter, you have seen that filtering is an
important basic operation in geodata. Set theory is useful for geodata, as
it allows you to have a common language for all these filter operations.
The reason that we are presenting the intersection in the same
chapter as the clip is that they are relatively similar and are often
confused. This will allow us to see what the exact similarities and
differences are.
What Happens in Practice When Intersecting?
An intersection in set theory takes two input sets and keeps only those
items from the set that are present in both. In geodata processing, the
same is true. Consider that your sets are now geographical datasets, in
which we use the geographical location data as identifier of the objects.
The intersection of two objects will keep all features (columns) of both
datasets, but it will keep only those data points that are present in both
datasets.
As an example, let’s consider that we again use the Seine River data,
and this time we use the main road around Paris (Boulevard
Périphérique) to identify places at which we should ind bridges or
tunnels. This could be useful, for example, if we have no data about
bridges and tunnels yet, and we want to automatically identify all
locations at which we should ind bridges or tunnels.
The intersection of the two would allow us to keep both the
information about the road data and the data from the river dataset while
reducing the data to the locations where intersections are to be found.
Of course, this can be generalized to a large number of problems
where the intersection of two datasets is needed.
This basically just filters out some points, and the resulting shapes
are still points. Let’s now see what happens when applying this to two
line datasets.
Line datasets will work differently. When two lines have a part at the
exact same location, the resulting intersection of two lines could be a line.
In general, it is more likely that two lines intersect at a crossing or that
they are touching at some point. In this case, the intersection of two lines
is a point. The result is therefore generally a different shape than the
input. This is shown in the schematic drawing in Figure 5-12.
Figure 5-12 Schematic drawing of intersecting lines. Image by author
The lines intersect at three points, and the resulting dataset just
shows these three points. Let’s now see what happens when intersecting
polygons.
Conceptually, as polygons have a surface, we consider that the
intersection of two polygons is the surface that they have in common.
The result would therefore be the surface that they share, which is a
surface and therefore needs to be a polygon as well. The schematic drawing
in Figure 5-13 shows how this works.
Figure 5-13 Intersecting polygons. Image by author
Intersecting in Python
Let’s now start working on the example that was described earlier in this
chapter. We take a dataset with the Boulevard Pé riphé rique and the Seine
River, and we use the intersection of those two to identify the locations
where the Seine River crosses the Boulevard Pé riphé rique.
You can use the code in Code Block 5-7 to import the data and print
the dataset.
gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
data = gpd.read_file('ParisSeineData_example2_v2.kml')
data.head()
Code Block 5-7 Import and print the data
You will observe the data in Figure 5-14.
There are two polygons, one called Seine and one called Boulevard
Périphérique. Let’s use Code Block 5-8 to create a plot to see what this
data practically looks like. We can use the cmap to specify a colormap and
obtain different colors. You can check out the matplotlib documentation
for an overview of colormaps; there are many to choose from.
data.plot(cmap='tab10')
Code Block 5-8 Plot the data with a colormap
Figure 5-15 The plot resulting from Code Block 5-8. Image by author
Compared to the previous example, the data has been converted to
polygons here. You will see in a later chapter how to do this automatically
using buffering, but for now it has been done for you, and the polygon
data is directly available in the dataset.
We can clearly see two intersections, so we can expect two bridges (or
tunnels) to be identified. Let's now use the intersection function to find
these automatically for us.
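Before the overlay can be computed, the two polygons have to be available as
separate geodataframes called seine and periph. A sketch of this selection step,
assuming the Seine is the first row of the dataset and the Boulevard Périphérique
the second, is:
seine = data.iloc[0:1, :]
periph = data.iloc[1:2, :]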
The code in Code Block 5-9 shows how to use the overlay function in
geopandas to create an intersection.
intersection = seine.overlay(periph,
how='intersection')
intersection
Code Block 5-9 Creating an intersection
The result is a dataset with only the intersection of the two polygons,
as shown in Figure 5-16.
Figure 5-16 The plot resulting from Code Block 5-9. Image by author
intersection.plot()
Code Block 5-10 Plotting the intersection
The result may look a bit weird without context, but it basically just
shows the two bridges/tunnels of the Parisian Boulevard Pé riphé rique.
This is shown in Figure 5-17.
Figure 5-17 The two crossings of the Seine and the Boulevard Pé riphé rique. Image by author
Key Takeaways
1.
There are numerous basic geodata operations that are standardly
implemented in most geodata tools. They may seem simple at first
sight, but applying them to geodata can come with some difficulties.
2.
The clipping operation takes an input dataset and reduces its size to
an extent given by a boundary dataset. This can be done for all
geodata data types.
3.
Using clipping for raster data or points comes down to deleting the
pixels or points that are out of scope.
4.
Using clipping for lines or polygons will delete those lines and
polygons that are out of scope entirely, but will create a new, reduced
form for those features that are partly inside and partly outside of the
boundaries.
5.
The intersection operation is based on set theory and allows you to find
features that are shared between two input datasets. It is different
from clipping, as it treats the two datasets as input and therefore
keeps the features of both of them. In clipping, this is not the case, as
only the features from the input dataset are considered relevant.
6.
Intersecting points basically comes down to filtering points based on
their presence in both datasets.
7.
Intersecting lines generally results in points (either crossings
between two lines or touchpoints between two curving lines), but the
result can also be a line if two lines are perfectly equal on a part of their
trajectory.
8.
Intersecting polygons will result in one or multiple smaller polygons,
as the intersection is considered to be the area that the two polygons
have in common.
9. You have seen how to use geopandas as an easy tool for both clipping
and intersecting operations.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
J. Korstanje, Machine Learning on Geographical Data Using Python
https://doi.org/10.1007/978-1-4842-8287-8_6
6. Buffers
Joos Korstanje1
(1) VIELS MAISONS, France
Representing the road as a line will allow you to do many things, like
compute the length of a road, find crossings with other roads and make a
road network, etc. However, in real life, a road has a width as well. For
things like urban planning around the road, building a bridge, etc., you
will always need to know the width of the road at each location.
As you have seen when covering data types in Chapter 3, lines have a
length but not a width. It would not be possible to represent the width of a
line. You could however create a buffer around the line and give the buffer
a specified width. This would result in a polygon that encompasses the
road, and you would then be able to work with it as polygon data.
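As a small illustration of this idea (a sketch, not part of the book's example code),
buffering a line with shapely turns it into a polygon of the chosen width:
from shapely.geometry import LineString

road = LineString([(0, 0), (10, 0)])
road_area = road.buffer(2)  # a polygon extending 2 units around the line
print(road.geom_type, '->', road_area.geom_type)  # LineString -> Polygon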
In this schematic drawing, you see the left image containing points,
which are depicted here as stars. In the right image, you see how the
buffers are circular polygons that are formed exactly around the point.
Although it may seem difficult to find use cases for this, there are
cases where this may be useful. Imagine that your point data are sources
of sound pollution and that it is known that the sound can be heard a
given number of meters from the source. Creating buffers around the
point would help to determine regions in which the sound problems
occur.
Another, very different use case could be where you collect data
points that are not very reliable. Imagine, for example, that they are GPS
locations given by a mobile phone. If you know how much uncertainty
there is in your data points, you could create buffers around your data
points, indicating that any location inside the buffer may have been
visited by the specific mobile phone user. This can be useful for
marketing or ad recommendations and the like.
You see that the left image contains one line (the planned railroad)
and a number of houses (depicted as stars). On the top right, you see a
narrow buffer around the line, which shows the heavy impact. You could
filter out the points that are inside this heavy impact buffer to identify
them in more detail. The bottom-left graph contains houses with a
moderate impact. You could think of using set operations from the
previous chapter to select all moderate impact houses that are not inside
the heavy impact buffer (e.g., using a difference operation on the buffer,
but other approaches are possible as well).
In the left part of the schematic drawing, you see the lake polygon, an
oval. On the right, you see that a gray buffer has been created around the
lake – maybe not the best way to estimate the exact location of your path,
but definitely an easy way to create the new feature quickly in your
dataset.
Now that you have seen how buffers work in theory, it is time to move
on to some practice. In the following section, we will start applying these
operations in Python.
You will see that the data contains eight subway stations. They do not
have names as that does not really have added value for this example.
They are all point data, having a latitude and longitude. They also have a
z value (height), but it is not used and is therefore zero for all stations.
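The import of this dataset follows the same KML pattern as in the previous chapters;
a sketch of it, with a placeholder file name, would be:
import geopandas as gpd

gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
# the file name is a placeholder; use the station file from the book's repository
data = gpd.read_file('subway_stations.kml')
data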
Let’s make a quick and easy visualization to get a better feeling for the
data that we are working with. You can use the code in Code Block 6-2 to
do so.
data.plot()
Code Block 6-2 Plotting the data
This plot will show the plot of the data. This is shown in Figure 6-6.
Figure 6-6 The plot resulting from Code Block 6-2. Image by author
As you can see, the points are displayed on the map, on the subway
line that goes east-west. When we add houses to this data, we could
compute distances from each house to each subway station. However, we
could not use these points in a set operation or overlay. The overlay
method would be much easier to compute than the distance operation,
which shows why it is useful to master the buffer operation.
We can use it to combine with other features as specified in the
definition of the example. Let's now add a buffer on those points to start
creating a house selection polygon.
Creating the buffer is quite easy. It is enough to use “.buffer” and
specify the width, as is done in Code Block 6-4.
data.buffer(0.01)
Code Block 6-4 Creating the buffer
import contextily as cx
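Only the contextily import of Code Block 6-5 is shown above; the plotting lines
presumably buffer the stations, plot the result, and add a basemap, roughly along
these lines (the exact arguments are an assumption):
ax = data.buffer(0.01).plot(figsize=(15, 15), alpha=0.5)
cx.add_basemap(ax, crs=data.crs)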
Figure 6-9 The plot resulting from Code Block 6-5. Image by author using contextily source data
and image as referenced in the image
import pandas as pd
from shapely.geometry import LineString

df = pd.DataFrame(
    {
        'Name': ['metro'],
        'geometry': [LineString(data.loc[[7, 6, 5, 4, 0, 1, 2, 3],
                                         'geometry'].reset_index(drop=True))]
    }
)
gdf = gpd.GeoDataFrame(df)
gdf
Code Block 6-7 Add this LineString to our existing plot
You will see that the resulting geodataframe has exactly one line,
which is the line representing our subway, as shown in Figure 6-11.
To plot the line, let’s add this data into the plot with the background
map directly, using the code in Code Block 6-8.
import contextily as cx
# use paris data to set extent but leave invisible
ax = paris.plot(figsize=(15,15), color="None")
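The remaining lines of Code Block 6-8 are not shown above; they presumably add the
buffered stations and the subway line on top of the invisible extent and then add the
basemap, roughly as in this sketch:
data.buffer(0.01).plot(ax=ax, alpha=0.5)
gdf.plot(ax=ax, color='red')
cx.add_basemap(ax, crs=data.crs)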
You now obtain a map that has the subway station buffers and the
subway rails as a line. The result is shown in Figure 6-12.
Figure 6-12 The plot with the two data types. Image by author using contextily source data and
image as referenced in the image
gdf.buffer(0.001)
Code Block 6-9 Adding a smaller buffer
By creating the buffer in this way, you end up with a geodataframe
that contains a polygon of the buffer, rather than with the initial line data.
This is shown in Figure 6-13.
Now, you can simply add this polygon in the plot, and you'll obtain a
polygon that shows areas that you should try to find a house in and some
subareas that should be avoided. Setting the transparency using the
alpha parameter can help a lot to make more readable maps. This is done
in Code Block 6-10.
import contextily as cx
This shows the map of Paris in which the best circles for use are
marked in green, but in which the red polygon should be avoided as it is
too close to the subway line. In the following section, we will add a third
criterion on the map: proximity to a park. This will be done by creating
buffers on polygons.
In Figure 6-15, you will see that there are 18 parks in this dataset, all
identified as polygons.
Figure 6-15 The data from Code Block 6-11. Image by author
You can visualize this data directly inside our map, by adding it as
done in Code Block 6-12.
import contextily as cx
The parks are shown in the map as black contour lines. No buffers
have yet been created. This intermediate result looks as shown in Figure
6-16.
Figure 6-16 The map with the parks added to it. Image by author using contextily source data and
image as referenced in the image
parks.buffer(0.01)
Code Block 6-13 Adding the buffer to the parks
This looks like shown in Figure 6-17.
After the buffer, you have polygon data, just like you had before. Yet
the size of the polygon is now larger as it also has the buffers around the
original polygons. Let’s now add this into our plot, to see how this affects
the places in which we want to ind a house. This is done in Code Block 6-
14.
import contextily as cx
The colors and “zorder” (order of overlay) have been adjusted a bit to
make the map more readable. After all, it starts to contain a large number
of features. You will see the result shown in Figure 6-18.
Figure 6-18 The plot resulting from Code Block 6-14. Image by author using contextily source data
and image as referenced in the image
This map is a first result that you could use. Of course, you could go
even further and combine this with the theory from Chapter 5, in which
you have learned how to use operations from set theory to combine
different shapes. Let's see how to do this, with a final goal to obtain a
dataframe that only contains the areas in which we do want to find a
house, based on all three criteria from the introduction.
station_buffer = data.buffer(0.01)
rails_buffer = gdf.buffer(0.001)
park_buffer = parks.buffer(0.01)
A = gpd.GeoDataFrame({'geometry': station_buffer})
B = gpd.GeoDataFrame({'geometry': park_buffer})
C = gpd.GeoDataFrame({'geometry': rails_buffer})
Code Block 6-15 Prepare to create an intersection layer
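Code Block 6-16, which combines these three layers, is not reproduced above; using the
overlay operations from Chapter 5, a sketch of it (keep the areas near a station and a
park, and remove everything too close to the rails) could be:
A_and_B = A.overlay(B, how='intersection')
A_and_B_not_C = A_and_B.overlay(C, how='difference')
A_and_B_not_C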
You will obtain a dataset that looks like the data shown in Figure 6-19.
Figure 6-19 The data resulting from Code Block 6-16. Image by author
The data still looks like a dataframe from before. The only difference
that occurs is that the data becomes much more complex with every step,
as the shapes of our acceptable locations become less and less regular.
Let’s do a map of our inal object using Code Block 6-18.
import contextily as cx
A_and_B_not_C.plot(ax=ax, edgecolor='none',
color='green', alpha=0.8)
Figure 6-20 The inal map of the exercise. Image by author using contextily source data and image
as referenced in the image
As you can see in Figure 6-20, the green areas are now a filter that we
could use to select houses based on coordinates. This answers the
question posed in the exercise and results in an interesting map as well. If
you want to go further with this exercise, you could create a small dataset
containing point data for houses. Then, for looking up whether a house
(point data coordinate) is inside a polygon, you can use the operation
that is called “contains” or “within.” Documentation can be found here:
–
https://geopandas.org/en/stable/docs/reference/api
/geopandas.GeoSeries.within.html
–
https://geopandas.org/en/stable/docs/reference/api
/geopandas.GeoSeries.contains.html
This operation is left as an exercise, as it goes beyond the
demonstration of the buffer operation, which is the focus of this chapter.
Key Takeaways
1.
There are numerous basic geodata operations that are standardly
implemented in most geodata tools. They may seem simple at first
sight, but applying them to geodata can come with some difficulties.
2.
The buffer operation adds a polygon around a vector object. Whether
the initial object is point, line, or polygon, the result is always a
polygon.
3.
When applying a buffer, one can choose the distance of the buffer’s
boundary to the initial object. The choice depends purely on the use
case.
4.
Once buffers are computed, they can be used for mapping purposes,
or they can be used in further geospatial operations.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
J. Korstanje, Machine Learning on Geographical Data Using Python
https://doi.org/10.1007/978-1-4842-8287-8_7
What Is a Merge?
Merging geodata, just like with regular data, consists of taking multiple input
datasets and making them into a single new output feature. In the previous
chapter, you already saw a possible use case for a merge. If you remember,
multiple “suitability” polygons were created based on multiple criteria. At the
end, all of these polygons were combined into a single spatial layer. Although
another solution was used in that example, a merge could have been used to
get all those layers together in one and the same layer.
As you can see, this is a simple SQL-like join that uses a common identifier
between the two datasets to add the columns of the attribute table into the
columns of the geodata dataset.
An alternative is the spatial join, which is a bit more complex. The spatial
join also combines columns of two datasets, but rather than using a common
identifier, it uses the geographic coordinates of the two datasets. The
schematic drawing in Figure 7-3 shows how this can be imagined.
In this example, the spatial join is relatively easy, as the objects are exactly
the same in both input datasets. In reality, you may well see slight differences
in the features, but you may also have different features that you want to join.
You can specify all types of spatial join parameters to make the right
combination:
– Joining all objects that are near each other (specify a distance)
– Joining based on one object containing the other
– Joining based on intersections existing
This gives you a lot of tools to work with for combining datasets together,
both row-wise (merge) and column-wise (join). Let’s now see some examples
of the merge, attribute join, and spatial join in Python.
Merging in Python
In the coming examples, we will be looking at some easy-to-understand data.
There are multiple small datasets, and throughout the exercise, we will do all
the three types of merges.
The data contains
– A file with three polygons for Canada, USA, and Mexico
– A file with some cities of Canada
– A file with some cities of the USA
– A file with some cities of Mexico
During the exercise, we will take the following steps:
– Combine the three city files using a row-wise merge
– Add a new variable to the combined city file using an attribute lookup
– Find the country of each of the cities using a spatial lookup with the
polygon file
Let's now start by combining the three city files into a single layer with all
the cities combined.
gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
us_cities = gpd.read_file('/kaggle/input/chapter7/USCities.kml')
us_cities
Code Block 7-1 Importing the data
canada_cities = gpd.read_file('/kaggle/input/chapter7/CanadaCities.kml')
canada_cities
Code Block 7-2 Importing the Canada cities
mexico_cities = gpd.read_file('/kaggle/input/chapter7/MexicoCities.kml')
mexico_cities
Code Block 7-3 Importing the Mexico cities
We can create a map of all three of those datasets using the syntax that
you have seen earlier in this book. This is done in Code Block 7-4.
import contextily as cx
# us cities
ax = us_cities.plot(markersize=128, figsize=(15,15))
# canada cities
canada_cities.plot(ax=ax, markersize=128)
# mexico cities
mexico_cities.plot(ax = ax, markersize=128)
# contextily basemap
cx.add_basemap(ax, crs=us_cities.crs)
Code Block 7-4 Creating a map of the datasets
Now, this is not too bad already, but we actually want to have all this data
in just one layer, so that it is easier to work with. To do so, we are going to do a
row-wise merge operation. This can be done in Python using the pandas
concat method. It is shown in Code Block 7-5.
import pandas as pd
cities = pd.concat([us_cities, canada_cities,
mexico_cities])
cities
Code Block 7-5 Using concatenation
You will obtain a dataset, in which all the points are now combined. Cities
now contain the rows of all the cities of the three input geodataframes, as can
be seen in Figure 7-8.
Figure 7-8 The concatenated dataframe. Image by author
If we now plot this data, we just have to plot one layer, rather than having
to plot three times. This is done in Code Block 7-6. You can see that it has all
been successfully merged into a single layer.
ax = cities.plot(markersize=128,figsize=(15,15))
cx.add_basemap(ax, crs=us_cities.crs)
Code Block 7-6 Plotting the concatenated cities
You can also see that all points now have the same color, because they are
now all on one single dataset. This fairly simple operation of row-wise
merging will prove to be very useful in your daily GIS operations.
Now that we have combined all data into one layer, let’s add some features
using an attribute join.
lookup = pd.DataFrame({
'city': [
'Las Vegas',
'New York',
'Washington',
'Toronto',
'Quebec',
'Montreal',
'Vancouver',
'Guadalajara',
'Mexico City'
],
'population': [
1234,
2345,
3456,
4567,
4321,
5432,
6543,
1357,
2468
]
})
lookup
Code Block 7-7 Create a lookup table
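The attribute join itself (Code Block 7-8) is done with the pandas merge method on
the geodataframe; a sketch of it, assuming the city name column in the KML data is
called Name, is:
cities_new = cities.merge(lookup, how='left', left_on='Name', right_on='city')
cities_new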
You can see in the dataframe that the population column has been added,
as is shown in Figure 7-11.
Figure 7-11 The data resulting from Code Block 7-8. Image by author
You can now access this data easily, for example, if you want to do filters,
computations, etc. Another example is to use this attribute data to adjust the
size of each point on the map, depending on the (simulated) population size
(of course, this is toy data so the result is not correct, but feel free to improve
on this if you want to). The code is shown in Code Block 7-9.
ax = cities_new.plot(markersize=cities_new['population'] // 10, figsize=(15,15))
cx.add_basemap(ax, crs=us_cities.crs)
Code Block 7-9 Plot the new data
The result in Figure 7-12 shows the cities’ sizes being adapted to the value
in the column population, which was added to the dataset through an
attribute join.
Figure 7-12 The map resulting from Code Block 7-9. Image by author using contextily source data and
image as referenced in the image
countries = gpd.read_file('NorthMiddleAmerciaCountries.kml')
countries
Code Block 7-10 Importing the data
If we plot the data against the background map, you can see that the
polygons are quick approximations of the countries’ borders, just for the
purpose of this exercise. This is done in Code Block 7-11.
ax = countries.plot(figsize=(15,15), edgecolor='black',
facecolor='none')
cx.add_basemap(ax, crs=countries.crs)
Code Block 7-11 Plotting the data
You can see some distortion on this map. If you have followed along with
the theory on coordinate systems in Chapter 2, you should be able to
understand where that is coming from and have the tools to rework this
map’s coordinate system if you’d want to. For the current exercise, those
distortions are not a problem. Now, let’s add our cities onto this map, using
Code Block 7-12.
ax = countries.plot(figsize=(15,15), edgecolor='black',
facecolor='none')
cities_new.plot(ax=ax,
markersize=cities_new['population'] // 10, figsize=
(15,15))
cx.add_basemap(ax, crs=countries.crs)
Code Block 7-12 Add the cities to the map
This brings us to the topic of the spatial join. In this map, you see that
there are two datasets:
– The cities only contain information about the name of the city and the
population.
– The countries are just polygons.
It would be impossible to use an SQL-like join to add a column country to
each of the rows in the city dataset. However, we can clearly see that based on
the spatial information, it is possible to find out in which country each of the
cities is located.
The spatial join is made exactly for this purpose. It allows us to combine
two datasets column-wise, even when there is no common identifier: just
based on spatial information. This is one of those things that can be done
with geodata but not with regular data.
You can see in Code Block 7-13 how a spatial join is done between the
cities and countries datasets, based on a “within” spatial join: the city needs
to be inside the polygon to receive its attributes.
cities_3 = cities_new.sjoin(countries, how="inner", predicate='within')
cities_3
Code Block 7-13 Spatial join between the cities and the countries
Figure 7-16 The data resulting from Code Block 7-13. Image by author
You see that the name of the country has been added to the dataset of the
cities. We can now use this attribute for whatever we want to in the cities
dataset. As an example, we could give the points a color based on their
country, using Code Block 7-14.
cities_3['color'] = cities_3['index_right'].map({0: 'green', 1: 'yellow', 2: 'blue'})
ax = cities_3.plot(markersize=cities_3['population'] // 10, c=cities_3['color'], figsize=(15,15))
cx.add_basemap(ax, crs=cities_3.crs)
Code Block 7-14 Colors based on country
With this inal result, you have now seen multiple ways to combine
datasets into a single dataset:
– The row-wise concatenation operation generally called merge in GIS
– The attribute join, which is done with a geopandas method confusingly
called merge, whereas it is generally referred to as a join rather than a
merge
– The spatial join, which is a join that bases itself on spatial attributes rather
than on any common identifier
In the last part of this chapter, you’ll discover the dissolve operation,
which is often useful in case of joining many datasets.
The polygons A and B both have the value 1, so grouping by value would
combine those two polygons into one polygon. This operation can be useful
when your data is too granular, which may be because you have done a lot of
geospatial operations or may be because you have merged a large number of
data files.
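Code Block 7-15 adds such a grouping column to the countries dataset; a sketch of it,
assuming the rows are ordered Canada, USA, Mexico, could be:
# the row order of the countries file is an assumption
countries['Area'] = ['North America', 'North America', 'Middle America']
countries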
Once you execute this, you’ll see that a new column has been added to the
dataset, as shown in Figure 7-19.
Figure 7-19 The data resulting from Code Block 7-15. Image by author
Now the goal is to create two polygons: one for North America and one for
Middle America. We are going to use the dissolve method for this, as shown in
Code Block 7-16.
areas = countries.dissolve(by='Area')[['geometry']]
areas
Code Block 7-16 Dissolve operation
We can now plot this data to see what it looks like, using Code Block 7-17.
ax = areas.plot(figsize=(15,15), edgecolor='black',
facecolor='none')
cx.add_basemap(ax, crs=areas.crs)
Code Block 7-17 Plot the data
The map in Figure 7-21 shows the result of the dissolve operation.
Figure 7-21 The result of the dissolve operation. Image by author using contextily source data and image
as referenced in the image
This combined result has been grouped by the feature area, and it is a
generalized version of the input data. The dissolve operation is therefore
much like a groupby operation, which is a very useful tool to master when
working with geodata.
Key Takeaways
1.
There are numerous basic geodata operations that are standardly
implemented in most geodata tools. They may seem simple at first sight,
but applying them to geodata can come with some difficulties.
2.
The merge operation is generally used to describe a row-wise merge, in
which multiple objects are concatenated into one dataset.
3.
The attribute join, which is confusingly called merge in geopandas, does a
column-wise, SQL-like join using a common attribute between the two
input datasets.
4.
The spatial join is another column-wise join that allows you to combine two
datasets without having any common identifier. The correspondence
between the rows is instead based on their spatial relationship.
8. Erase
Joos Korstanje1
(1) VIELS MAISONS, France
In this chapter, you will learn about the erase operation. The previous
three chapters have presented a number of standard GIS operations.
Clipping and intersecting were covered in Chapter 5, buffering in Chapter
6, and merge and dissolve were covered in Chapter 7.
This chapter will be the last of those four chapters covering common
tools for geospatial analysis. Even though there are many more tools
available in the standard GIS toolbox, the goal here is to give you a good
mastery of the basics and allow you to be autonomous in learning
the other GIS operations in Python.
The chapter will start with a theoretical introduction of the erase
operation and then follow through with a number of example
implementations in Python for applying the erase on different geodata
types.
In this schematic drawing, you can see that there are three polygons
on the left (numbered 1, 2, and 3). The delete operation has deleted
polygon 2, which means that only two polygons remain in
the output on the right. Polygon 2 was deleted, or with a synonym, erased.
The table containing the data would be affected as shown in Figure 8-2.
You can see how the data table would change before and after the
operation in the schematic drawing in Figure 8-4.
Figure 8-4 The table view behind the spatial erase. Image by author
You can see that the features 2 and 5 have simply been removed by
the erase operation. This could have been done also using a drop of the
features with IDs 2 and 5. Although using a spatial eraser rather than an
eraser by ID for deleting a number of points gives the same functional
result, it can be very useful and even necessary to use a spatial erase here.
When you have an erase feature, you would not yet have the exact IDs
of the points that you want to drop. In this way, the only way to get the list
of IDs automatically is to do a spatial join, or an overlay, which is what
happens in the spatial erase.
When using more complex features like lines and polygons, the
importance of the spatial erase is even larger, as you will see now.
What happens here is quite different from what happened in the point
example. Rather than deleting or keeping entire features, the spatial
erase has now made an alteration to the features. Before and after, the
data still consists of two lines, yet they are not exactly the same lines
anymore. Only a part of each individual feature was erased, thereby not
changing the number of features but only the geometry. In the data table,
this would look something like shown in Figure 8-6.
Figure 8-6 The table view of erasing lines. Image by author
In the next section, you’ll see how this works for polygons.
In the drawing, you see that there are three polygons in the input
layer on the top left. The erase feature is a rectangular polygon. Using a
spatial erase, the output contains altered versions of polygons 2 and 3,
since the parts of them that overlaid the erase feature have been cut off.
The impact of this operation in terms of data table would also be
similar to the one on the line data. The tables corresponding to this
example can be seen in Figure 8-8.
Figure 8-8 The table view of spatially erasing on polygons. Image by author
You should now have a relatively good intuition about the spatial
erase operation. To perfect your understanding, the next section will
make an in-depth comparison between the spatial eraser and some
comparable operations, before moving on to the implementation in
Python.
Erasing in Python
In this exercise, you will be working with a small sample map that was
created specifically for this exercise. The data should not be used for any
other purpose than the exercise as it isn't very precise, but that is not a
problem for now, as the goal here is to master the geospatial analysis
tools.
During the coming exercises, you will be working with a mixed
dataset of Iberia, which is the peninsula containing Spain and Portugal.
The goal of the exercise is to create a map of Spain out of this data,
although there is no polygon that indicates the exact region of Spain: this
must be created by removing Portugal from Iberia.
I recommend running this code in Kaggle notebooks or in a local
environment, as there are some problems in Google Colab for creating
overlays. To get started with the exercise, you can import the data using
the code in Code Block 8-1.
import geopandas as gpd

gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
all_data = gpd.read_file('chapter_08_data.kml')
all_data
Code Block 8-1 Importing the data
You will obtain a dataframe like the one in Figure 8-9.
Within this dataframe, you can see that there is a mix of data types or
geometries. The first two rows contain polygons of Iberia (which is the
contour of Spain plus Portugal). Then you have a number of roads, which
are lines, followed by a number of cities, which are points.
Let’s create a quick map to see what we are working with exactly
using Code Block 8-2. You can use the code hereafter to do so. If you are
not yet familiar with these methods for plotting, I recommend going back
into earlier chapters to get more familiar with this. From here on, we will
go a bit faster over the basics of data imports, file formats, data types,
and mapping as they have all been extensively covered in earlier parts of
this book.
import contextily as cx
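Only the import of Code Block 8-2 is shown above; the plotting lines presumably
follow the familiar pattern, roughly as in this sketch:
ax = all_data.plot(figsize=(15, 15))
cx.add_basemap(ax, crs=all_data.crs)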
As a result of this code, you will see the map shown in Figure 8-10.
Figure 8-10 The map resulting from Code Block 8-2. Image by author
This map is not very clear, as it still contains multiple types of data
that are also overlapping. As the goal here is to filter out some data, let's
do that first, before working on improved visualizations.
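Code Block 8-3 is not reproduced above; assuming the Iberia polygon is the first row
of the dataset, the selection could be a simple slice:
iberia = all_data.iloc[0:1, :]
iberia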
Figure 8-11 The data resulting from Code Block 8-3. Image by author
Now, let’s do the same for Portugal, using the code in Code Block 8-4.
Figure 8-12 The data resulting from Code Block 8-4. Image by author
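The Spain polygon is then created by removing Portugal from Iberia with a difference
overlay and plotting the result; a sketch of the corresponding Code Block 8-5 is:
spain = iberia.overlay(portugal, how='difference')
spain.plot()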
Figure 8-13 The map resulting from Code Block 8-5. Image by author
spain
Code Block 8-6 Print the data
We can reset the name to Spain using the code in Code Block 8-7.
spain.Name = 'Spain'
spain
Code Block 8-7 Setting the name to Spain
The resulting dataframe now has the correct value in the column
Name, as can be seen in Figure 8-15.
Figure 8-15 The Spain data with the correct name. Image by author
Let’s plot all that we have done until here using a background map, so
that we can keep on adding to this map of Spain in the following
exercises. The code to create this plot is shown in Code Block 8-8.
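Code Block 8-8 follows the same plotting pattern as the maps earlier in this book; a
sketch of it is:
ax = spain.plot(figsize=(15, 15), edgecolor='black', facecolor='none')
cx.add_basemap(ax, crs=spain.crs)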
If you are familiar with the shape of Spain, you will see that it
corresponds quite well on this map. We have successfully created a
polygon for the country of Spain, just using a spatial operation with two
other polygons. You can imagine that such work can occur regularly when
working with spatial data, whether it is for spatial analysis, mapping and
visualizations, or even for feature engineering in machine learning.
In the following section, you will continue this exercise by also
removing the Portuguese cities from our data, so that we only retain
relevant cities for our Spanish dataset.
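Code Block 8-9 is not shown above; one way to keep only the city points (not
necessarily the exact code used in the book) is to filter on the geometry type:
cities = all_data[all_data.geom_type == 'Point']
cities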
You will obtain the dataset as shown in Figure 8-17, which contains
only cities.
Figure 8-17 The dataset resulting from Code Block 8-9. Image by author
Now that we have a dataset with only cities, we still need to filter out
the cities of Spain and remove the cities of Portugal. As you can see, there
is no other column that we could use to apply this filter, and it would be
quite cumbersome to make a manual list of all the cities that are Spanish
vs. Portuguese. Even if it would be doable for the current exercise, it
would be much more work if we had a larger dataset, so it is not a good
practice.
Code Block 8-10 shows how to remove all the cities that have an
overlay with the Portugal polygon. Setting the how parameter to
"difference" means that they are removed rather than retained. As a
reminder, you have seen other parameters like intersection and union
being used in previous chapters. If you don't remember what the other
versions do, it would be good to have a quick look back at this point.
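A sketch of Code Block 8-10, assuming the cities geodataframe from the previous step
and the portugal polygon created earlier, is:
spanish_cities = cities.overlay(portugal, how='difference')
spanish_cities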
Figure 8-18 The dataset resulting from Code Block 8-10. Image by author
When comparing this with the previous dataset, you can see that
indeed a number of cities have been removed. The Spanish cities that are
kept are Bilbao, Barcelona, Madrid, Seville, Malaga, and Santiago de
Compostela. The cities that are Portuguese have been removed: Porto,
Lisbon, and Faro. This was the goal of the exercise, so we can consider it
successful.
As a last step, it would be good to add this all to the map that we
started to make in the previous section. Let’s add the Spanish cities onto
the map of the Spanish polygon using the code in Code Block 8-11.
ax = spain.plot(figsize=(15,15), edgecolor='black',
facecolor='none')
spanish_cities.plot(ax=ax, markersize=128)
cx.add_basemap(ax, crs=spain.crs)
Code Block 8-11 Add the Spanish cities on the map
This code will result in the map shown in Figure 8-19, which contains
the polygon of the country Spain, the cities of Spain, and a contextily
basemap for nicer visualization.
Figure 8-19 The map resulting from Code Block 8-11. Image by author
We have now done two parts of the exercise. We have seen how to cut
the polygon, and we have filtered out the cities of Spain. The only thing
that remains to be done is to resize the roads and make sure to filter out
only those parts of the roads that are inside of the Spain polygon. This
will be the goal of the next section.
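Code Block 8-12 selects the roads; one way to do this (not necessarily the exact code
used in the book) is again a filter on the geometry type:
roads = all_data[all_data.geom_type == 'LineString']
roads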
You now obtain a dataset that contains only the roads, just like shown
in Figure 8-20.
Figure 8-20 The dataset resulting from Code Block 8-12. Image by author
The problem is not really clear from the data, so let’s make a plot to
see what is wrong about those LineStrings using the code in Code Block
8-13.
ax = spain.plot(figsize=(15,15), edgecolor='black',
facecolor='none')
spanish_cities.plot(ax=ax, markersize=128)
roads.plot(ax=ax, linewidth=4, edgecolor='grey')
cx.add_basemap(ax, crs=spain.crs)
Code Block 8-13 Plot the data
This code will generate the map in Figure 8-21, which shows that
there are a lot of parts of road that are still inside Portugal, which we do
not want for our map of Spain.
Figure 8-21 The map resulting from Code Block 8-13. Image by author
Indeed, you can see here that there is one road (from Porto to Lisbon)
that needs to be removed entirely. There are also three roads that start in
Madrid and end up in Portugal, so we need to cut off the Portuguese part
of those roads.
This is all easy to execute using a difference operation within an
overlay again, as is done in the code in Code Block 8-14.
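A sketch of Code Block 8-14, which applies the difference overlay to the roads, is:
spanish_roads = roads.overlay(portugal, how='difference')
spanish_roads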
Figure 8-22 The data resulting from Code Block 8-14. Image by author
You can clearly see that some roads are entirely removed from the
dataset, because they were entirely inside of Portugal. Roads that were
partly in Portugal and partly in Spain were merely altered, whereas roads
that were entirely in Spain are kept entirely.
Let’s now add the roads to the overall map with the country polygon
and the cities, to inish our inal map of only Spanish features. This is
done in Code Block 8-15.
ax = spain.plot(figsize=(15,15), edgecolor='black',
facecolor='none')
spanish_cities.plot(ax=ax, markersize=128)
spanish_roads.plot(ax=ax, linewidth=4,
edgecolor='grey')
cx.add_basemap(ax, crs=spain.crs)
Code Block 8-15 Add the roads to the overall map
The resulting map is shown in Figure 8-23.
Figure 8-23 The map resulting from Code Block 8-15. Image by author
Key Takeaways
1. The erase operation has multiple interpretations. In spatial analysis,
its definition is erasing features or parts of features based on a
spatial overlay with a specified erase feature.
2.
Depending on exact implementation, erasing is basically the same as
the difference operation in overlays, which is one of the set theory
operations.
3.
You can use the difference overlay to erase data from vector datasets
(points, lines, or polygons).
4.
When erasing on points, you will end up erasing or keeping the entire
point, as it is not possible to cut points in multiple parts.
5.
When erasing on lines or polygons, you can erase the complete
feature if it is entirely overlaying with the erase feature, but if the
feature is only partly overlaying, the feature will be altered rather
than removed.
Part III
Machine Learning and Mathematics
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
J. Korstanje, Machine Learning on Geographical Data Using Python
https://doi.org/10.1007/978-1-4842-8287-8_9
9. Interpolation
Joos Korstanje1
(1) VIELS MAISONS, France
After having covered the fundamentals of spatial data in the first four chapters of this
book, and a number of basic GIS operations in the past four chapters, it is now time to
move on to the last four chapters in which you will see a number of statistics and machine
learning techniques being applied to spatial data.
This chapter will cover interpolation, which is a good entry into machine learning. The
chapter will start by covering definitions and intuitive explanations of interpolation and
then move on to some example use cases in Python.
What Is Interpolation?
Interpolation is a task that is relatively intuitive for most people. From a high-level
perspective, interpolation means to fill in missing values in a sequence of numbers. For
example, let's take the list of numbers:
1, 2, 3, 4, ???, 6, 7, 8, 9, 10
Many would easily be able to find that the number 5 should be at the place where the
??? is written. Let's try to understand why this is so easy. If we want to represent this list
graphically, we could plot the value against the position (index) in the list, as shown in
Figure 9-1.
Figure 9-1 Interpolating in a list of values. Image by author
When seeing this, we would very easily be inclined to think that this data follows a
straight line, as can be seen in Figure 9-2.
Figure 9-2 The interpolated line. Image by author
As we have no idea where these numbers came from, it is hard to say whether this is
true or not, but it seems logical to assume that they came from a straight line. Now, let’s
try another example. To give you a more complex example, try it with the following:
1, ???, 4, ???, 16
If you are able to find it, your most likely guess would be the doubling function, which
could be presented graphically as shown in Figure 9-3.
When doing interpolation, we try to find the best estimate for a value in between other
values based on a mathematical formula that seems to fit our data. Although interpolation
is not necessarily a method in the family of machine learning methods, it is a great way to
start discovering the field of machine learning. After all, interpolation is about finding a
formula that best represents the data, which is fundamentally what machine learning
is about as well. But more on that in the next chapter. Let's first deep dive into a bit of the
technical details of how interpolation works and how it can be applied on spatial data.
Linear Interpolation
The most straightforward method for interpolation is linear interpolation. Linear
interpolation comes down to drawing a straight line from each point to the next and
estimating the in-between values to be on that line. The graph in Figure 9-4 shows an
example.
Although it seems not such a bad idea, it is not really precise either. The advantage of
linear interpolation is that generally it is not very wrong: you do not risk estimating values
that are way out of bounds, so it is a good first method to try.
The mathematical function for linear interpolation is the following:
y = y0 + (x - x0) * (y1 - y0) / (x1 - x0)
If you input the value for x at which you want to compute a new y, and the values of x
and y of the point before (x0, y0) and after (x1, y1) your new point, you obtain the new y
value of your point.
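A minimal Python sketch of this formula (an illustration only, not one of the chapter's numbered listings), applied to the missing fifth value of the list at the start of this chapter:
def linear_interpolation(x, x0, y0, x1, y1):
    # straight-line estimate of y at position x, between (x0, y0) and (x1, y1)
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

# position 5 lies between the known points (4, 4) and (6, 6)
print(linear_interpolation(5, 4, 4, 6, 6))  # 5.0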
Polynomial Interpolation
Polynomial interpolation is a bit better for estimating such functions, as polynomial
functions can actually be curved. As long as you can find an appropriate polynomial
function, you can generally find a relatively good approximation. This could be something
like Figure 9-5.
Figure 9-5 Polynomial interpolation. Image by author
A risk of polynomial estimation is that it might be very difficult to actually find a
polynomial function that fits with your data. If the identified polynomial is highly complex,
there is a big risk of having some "crazy curves" somewhere in your function, which will
make some of your interpolated values very wrong.
It would be a bit much to go into a full theory of polynomials here, but in short, the
formula of a polynomial is any form of a function that contains squared effects, such as
y = a * x^2 + b * x + c
Many, many other forms of polynomials exist. If you are not aware of polynomials, it
would be worth checking out some online resources on the topic.
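As a hedged illustration, the sketch below fits a degree-2 polynomial through the known values of the second example (1, ???, 4, ???, 16) with NumPy and evaluates it at the two gaps; the library calls and values are an assumption, not the book's own listing:
import numpy as np

# known positions and values from the example 1, ???, 4, ???, 16
x_known = np.array([1, 3, 5])
y_known = np.array([1, 4, 16])

# fit a second-degree polynomial through the known points and evaluate it at the gaps
coefficients = np.polyfit(x_known, y_known, deg=2)
print(np.polyval(coefficients, [2, 4]))  # close to, but not exactly, the doubling values 2 and 8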
Figure 9-6 Adding nearest neighbor interpolation to the graph. Image by author
This nearest neighbor interpolation will assign the value that is the same value as the
closest point. The line shape is therefore a piecewise function: when arriving closer (on
the x axis) to the next point, the interpolated value (y axis) makes a jump to the next value
on the y axis. As you can see, this really isn’t the best idea for the curve at hand, but in
other situations, it can be a good and easy-to-use interpolation method.
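A minimal sketch of nearest neighbor interpolation with SciPy, again on the example values (an illustration only):
import numpy as np
from scipy.interpolate import interp1d

x_known = np.array([1, 3, 5])
y_known = np.array([1, 4, 16])

# each new x simply takes the y value of the closest known x
nearest = interp1d(x_known, y_known, kind='nearest')
print(nearest([2.2, 4.6]))  # picks the values of the closest known points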
Depending on where we live, we want to have the most appropriate value for ourselves.
In the north of the country, it is 10 degrees Celsius; in the south, it is 0 degrees Celsius. Let's
use a linear interpolation for this, with the result shown in Figure 9-8.
Figure 9-8 The result of the exercise using linear interpolation. Image by author
This linear approach does not look too bad, and it is easy to compute by hand for this
data. Let’s also see what would have happened with a nearest neighbor interpolation,
which is also easy to do by hand. It is shown in Figure 9-9.
Figure 9-9 The result of the exercise using nearest neighbor interpolation. Image by author
The middle part has been left out, as defining ties is not that simple, yet you can get the
idea of what would have happened with a nearest neighbor interpolation in this example.
For the moment, we will not go deeper into the mathematical definitions, but if you
want to go deeper, you will find many resources online. For example, you could get started
here: https://towardsdatascience.com/polynomial-interpolation-3463ea4b63dd.
For now, we will focus on applications to geodata in Python.
data = { 'point1': {
'lat': 0,
'long': 0,
'temp': 0 },
'point2': {
'lat': 10,
'long': 10,
'temp': 20 },
'point3' : {
'lat': 0,
'long': 10,
'temp': 10 },
'point4': {
'lat': 10,
'long': 0,
'temp': 30 }
}
Code Block 9-1 The data
Now, the first thing that we can do is to make a dataframe from this dictionary and
get this data into a geodataframe. For this, the easiest is to make a regular pandas
dataframe first, using the code in Code Block 9-2.
import pandas as pd
df = pd.DataFrame.from_dict(data, orient='index')
df
Code Block 9-2 Creating a dataframe
As a next step, let’s convert this dataframe into a geopandas geodataframe, while
specifying the geometry to be point data, with latitude and longitude. This is done in Code
Block 9-3.
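A minimal sketch of such a conversion and plot, assuming the column names from Code Block 9-1 (the original Code Block 9-3 may differ in detail):
import geopandas as gpd

# build point geometries from longitude/latitude and size the markers by temperature
gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['long'], df['lat']))
gdf.plot(markersize=gdf['temp'] * 10 + 10)  # the scaling factor is an arbitrary choice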
The plot that results from this code is shown in Figure 9-11.
Figure 9-11 The plot resulting from Code Block 9-3. Image by author
As you can see in this plot, there are four points with a different size. From a high-level
perspective, it seems quite doable to find intermediate values to fill in between the points.
What is needed to do so, however, is to find a mathematical formula in Python that
represents this interpolation and then use it to predict the interpolated values.
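One way to obtain such a formula is SciPy's griddata; the sketch below (the function name my_interpolation_function matches its later use, but the use of griddata is an assumption) returns a one-element array, which is why it is indexed with [0] further down:
import numpy as np
from scipy.interpolate import griddata

known_points = df[['lat', 'long']].values
known_temps = df['temp'].values

def my_interpolation_function(lat, long):
    # linear 2D interpolation of the temperature at (lat, long); returns a 1-element array
    return griddata(known_points, known_temps, np.array([[lat, long]]), method='linear')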
Now that we have this function, we can call it on new points. However, we first need to
define which points we are going to use for the interpolation. As we have four points in a
square organization, let’s interpolate at the point exactly in the middle and the points that
are in the middle along the sides. We can create this new df using the code in Code Block 9-
5.
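A sketch of such a dataframe of new points; the exact coordinates below are an assumption based on this description:
import pandas as pd

# the center of the 10-by-10 square and the midpoints of its four sides
new_df = pd.DataFrame({
    'lat':  [5, 0, 5, 10, 5],
    'long': [5, 5, 0, 5, 10],
})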
The data here only has latitude and longitude, but it does not yet have the estimated
temperature. After all, the goal is to use our interpolation function to obtain these
estimated temperatures.
In Code Block 9-6, you can see how to loop through the new points and call the
interpolation function to estimate the temperature at that location. Keep in mind that this
interpolation function is the mathematical definition of a linear interpolation, based on
the input data that we have given.
interpolated_temps = []
for i, row in new_df.iterrows():
    interpolated_temps.append(
        my_interpolation_function(row['lat'], row['long'])[0])
new_df['temp'] = interpolated_temps
new_df
Code Block 9-6 Applying the interpolation
You can see the numerical estimations of these results in Figure 9-13.
Figure 9-13 The estimations resulting from Code Block 9-6. Image by author
The linear interpolation is the most straightforward, and the predictions look solid. It
would be hard to say whether they are good or not, as we do not have any ground truth
value in interpolation use cases, yet we can at least say that nothing looks too strange.
Now that we have estimated them, we should try to do some sort of analysis.
Combining everything into one dataframe so that we can rebuild the plot is done
in Code Block 9-7.
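A minimal sketch of such a combination, assuming the df and new_df objects built earlier (the original Code Block 9-7 may differ):
import pandas as pd
import geopandas as gpd

# stack the measured and the interpolated points and rebuild the plot
all_points = pd.concat([df[['lat', 'long', 'temp']], new_df], ignore_index=True)
all_points_gdf = gpd.GeoDataFrame(
    all_points, geometry=gpd.points_from_xy(all_points['long'], all_points['lat']))
all_points_gdf.plot(markersize=all_points_gdf['temp'] * 10 + 10)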
Even though we do not have an exact metric to say whether this interpolation is good
or bad, we can at least say that the interpolation seems more or less logical to the eye,
which is comforting. This first try appears rather successful. Let's try out some more
advanced methods in the next section, to see how results may differ with different
methods.
Kriging
In the first part of this chapter, you have discovered some basic, fundamental approaches
to interpolation. The thing about interpolation is that you can make it as simple, or as
complex, as you want.
Although the fundamental approaches discussed earlier are often satisfactory in
practical results and use cases, there are some much more advanced techniques that we
need to cover as well.
In this second part of the chapter, we will look at Kriging as an interpolation method.
Kriging is a much more advanced mathematical definition of interpolation. Although it
would surpass the level of this book to go into too much mathematical detail here, those
readers who are at ease with more mathematical details can check out some
online resources like https://en.wikipedia.org/wiki/Kriging and
www.publichealth.columbia.edu/research/population-health-
methods/kriging-interpolation.
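Ordinary Kriging is available in the pykrige library; a minimal sketch on the four temperature points, with a linear variogram (the library choice and settings are assumptions and may differ from the chapter's own listings):
from pykrige.ok import OrdinaryKriging

# ordinary Kriging with a linear variogram on the four known temperature points
ok_model = OrdinaryKriging(
    df['long'].values, df['lat'].values, df['temp'].values,
    variogram_model='linear')

# estimate the temperature at the five new points defined earlier
z_pred, ss = ok_model.execute('points', new_df['long'].values, new_df['lat'].values)
print(z_pred)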
Figure 9-15 The interpolated values with Linear Ordinary Kriging. Image by author
Interestingly, some of these estimated values are not the same at all. Let’s plot them to
see whether there is anything weird or different going on in the plot, using the code in
Code Block 9-9.
There is nothing too wrongly estimated if we judge by the plot, so there is no reason to
discount these results. As we have no metric for good or bad interpolation, this must be
seen as just an alternative estimation. Let's see what happens when using other
settings for Kriging in the next section.
Interestingly, the estimates for point5 and point9 change quite drastically again! Let’s
make a plot again to see if anything weird is occurring during this interpolation. This is
done in Code Block 9-11.
Figure 9-18 The result from Gaussian Ordinary Kriging. Image by author
Again, when looking at this plot, it cannot be said that this interpolation is wrong in
any way. It is different from the others, but just as valid.
Figure 9-19 The result from Exponential Ordinary Kriging. Image by author
Interestingly, again point5 and point9 are the ones that change a lot, while the others
stay the same. For coherence, let’s make the plot of this interpolation as well, using Code
Block 9-13.
Again, there is nothing obviously wrong with this plot, yet its results are again different
from before. It would only make sense to wonder which of them is right. Let's conclude on this
in the next section.
Even for such a simple interpolation example, we see spectacularly large differences in
the estimations of points 5 (middle bottom in the graph) and 9 (right middle in the graph).
Now the big question here is of course whether we can say that any of those are better
than the others. Unfortunately, when applying mathematical models to data where there is
no ground truth, you just don’t know. You can build models that are useful to your use case,
you can use human and business logic to assess different estimates, and you can use rules
of thumb like Occam’s razor (keep the simplest possible model) for your decision to retain
one model over the other.
Alternatively, you can also turn to supervised machine learning for this. Classification
and regression will be covered in the coming two chapters, and they are also methods for
estimating data points that we don’t know, yet they are focused much more on
performance metrics to evaluate the fit of our data to reality, which is often missing in
interpolation use cases.
In conclusion, although there is not necessarily only one good answer, it is always
useful to have a basic working knowledge of interpolation. Especially in spatial use cases,
it is often necessary to convert data measured at specific points (like temperature
stations and much more) into a more continuous view over a larger two-dimensional
surface (like countries, regions, and the like). You have seen in this chapter that relatively
simple interpolations are already quite efficient in some use cases and that there is a vast
complexity to be discovered for those who want to go into more depth.
Key Takeaways
1.
Interpolation is the task of estimating unknown values in between a number of known
values, which comes down to estimating values on unmeasured locations.
2.
We generally define a mathematical function or formula based on the known values
and then use this function to estimate the values that we do not know.
3.
There are many mathematical "base" formulas that you can apply to your points, and
depending on the formula you choose, you may end up with quite different results.
4.
When interpolating, we generally strive to obtain estimates for points for which we do
not have a ground truth value; that is, we really don't know which value is wrong or
correct. Cross-validation and other evaluation methods can be used and will be
covered in the coming chapters on machine learning.
5.
In the case where multiple interpolation methods give different results, we often need
to make a choice based on common sense, business logic, or domain knowledge.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
J. Korstanje, Machine Learning on Geographical Data Using Python
https://doi.org/10.1007/978-1-4842-8287-8_10
10. Classification
Joos Korstanje1
(1) VIELS MAISONS, France
With the current chapter, you are now arriving at one of the main parts of
the book about machine learning, namely, classification. Classification is,
next to regression and clustering, one of the three main tasks in machine
learning, and they will all be covered in this book.
Machine learning is a very large topic, and it would be impossible to
cover all of machine learning in just these three chapters. The choice has
been made to focus on applying machine learning models to spatial
data. The focus is therefore on presenting interesting and realistic use
cases for machine learning on spatial data while showing how spatial
data can be used as an added value with respect to regular data.
There will not be very advanced mathematical, statistical, or
algorithmic discussions in these chapters. There are many standard
resources out there for those readers who want to gain a deep and
thorough mathematical understanding of machine learning in general.
The chapter will start with a general introduction of what
classification is, what we can use it for, and some models and tools that
you'll need for doing classification, and then we'll dive into a deep spatial
classification use case for the remainder of the chapter. Let's now start
with some definitions and introductions first.
When looking at this data, you will see something like Figure 10-1.
Figure 10-1 The data resulting from Code Block 10-1. Image by author
The dataset is a bit more complex than what we have worked with in
previous chapters, so let’s make sure to have a good understanding of
what we are working with.
The first row of the geodataframe contains an object called the mall.
This polygon is the one that covers the entire area of the mall, which is
the extent of our study. It is here just for informative purposes, and we
won’t need it during the exercise.
The following features from rows 1 to 7 present areas of the mall.
They are also polygons. Each area can either be one shop, a group of
shops, a whole wing, or whatnot, but they generally regroup a certain
type of store. We will be able to use this information for our model.
The remaining data are 20 itineraries. Each itinerary is represented
as a LineString, that is, a line, which is just a sequence of points that has
been followed by each of the 20 participants in the study. The name of
each of the LineStrings is either Bought Yes, meaning that they have used
the coupon after the study (indicating the product interests them), or
Bought No, indicating that the coupon was not used and therefore that
the client is probably not interested in the product.
Let’s now move on to make a combined plot of all this data to get an
even better feel of what we are working with. This can be done using
Code Block 10-2.
all_data.plot(figsize=(15,15),alpha=0.1)
Code Block 10-2 Plotting the data
When executing this code, you will end up with the map in Figure 10-
2. Of course, it is not the most visual map, but the goal here is to put
everything together in a quick image to see what is going on in the data.
Figure 10-2 The map resulting from Code Block 10-2
In this map, the outermost light-gray contours are the contours of the
large polygon that sets the total mall area. Within this, you see a number
of smaller polygons, which indicate the areas of interest for our study,
each of which has a specific group of store types inside it. Finally, you
also see the lines crisscrossing, which represent the 20 participants of
the study making their movements throughout the mall during their visit.
What we want to do now is to use the information of the store
segment polygons to annotate the trips of each participant. It would be
great to end up with a percentage of time that each participant has spent
in each type of store, so that we can build a model that learns a
relationship between the types of stores that were visited in the mall and
the potential interests in the new restaurant.
As a first step toward this model, let's separate the data to obtain
datasets with only one data type. For this, we will need to separate the
information polygons from the participant itineraries. Using all that you
have seen earlier in the book, that should not be too hard. The code in
Code Block 10-3 shows how to get the info polygons in a new dataset.
info_polygons = all_data.loc[1:7,:]
info_polygons
Code Block 10-3 Select the info polygons into a separate dataset
Figure 10-3 The data resulting from Code Block 10-3. Image by author
Let’s extract the itineraries as well, using the code in Code Block 10-4.
itineraries = all_data.loc[8:,:]
itineraries
Code Block 10-4 Selecting the itineraries
import pandas as pd
from shapely.geometry.point import Point

results = []
# one row per point: loop over the itineraries and unpack each LineString into its points
# (the loop is reconstructed here; the Name column is assumed to hold the Bought Yes/No label)
for index, row in itineraries.iterrows():
    for coord in row['geometry'].coords:
        results.append([index, row['Name'], Point(coord)])

results_df = pd.DataFrame(results)
results_df.columns = ['client_id', 'target', 'point']
results_df
Code Block 10-5 Get the data from a wide data format to a long data format
The result of this code is the data in a long format: one row per point
instead of one row per participant. A part of the data is shown in Figure
10-5.
Figure 10-5 A part of the data resulting from Code Block 10-5. Image by author
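To attach the store segment information to each point, the long-format table can be turned into a geodataframe and spatially joined with the info polygons. A minimal sketch, assuming the objects built above (the original Code Block 10-7 may differ in detail):
import geopandas as gpd

# turn the long-format point table into a geodataframe
points_gdf = gpd.GeoDataFrame(results_df, geometry=results_df['point'])

# look up, for every point, the store segment polygon it falls within
# (older geopandas versions use op= instead of predicate=)
joined_data = gpd.sjoin(points_gdf, info_polygons, how='left', predicate='within')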
Figure 10-6 The data resulting from Code Block 10-7. Image by author
You can see that for most points the operation has been successful.
For a number of points, however, it seems that NA, or missing, values have
been introduced. This is explained by the presence of points that do not
overlap with any of the store information polygons and therefore
have no lookup information. It would be good to do something about
this. Before deciding what to do with the NAs, let's use the code in Code
Block 10-8 to count, for each client, the number of points for which there is no
reference information.
# inspect NA
joined_data['na'] = joined_data.Name.isna()
joined_data.groupby('client_id').na.sum()
Code Block 10-8 Inspect NA
# drop na
joined_data = joined_data.dropna()
joined_data
Code Block 10-9 Drop NAs
location_behavior = joined_data.pivot_table(
    index='client_id', columns='Name',
    values='target', aggfunc='count').fillna(0)
location_behavior
Code Block 10-10 The groupby to obtain location behavior
# standardize
location_behavior = location_behavior.div(
location_behavior.sum(axis=1), axis=0 )
location_behavior
Code Block 10-11 Standardize the data
Modeling
Let’s now keep the data this way for the model – for inputting the data
into the model. Let’s move away from the dataframe format and use the
code in Code Block 10-12 to convert the data into numpy arrays.
X = location_behavior.values
X
Code Block 10-12 Convert into numpy
You can do the same to obtain an array for the target, also called y.
This is done in Code Block 10-13.
y = itineraries.Name.values
y
Code Block 10-13 Get y as an array
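The train-test split itself (done in Code Block 10-14) can be sketched with scikit-learn as follows; the test fraction and the stratification are assumptions:
from sklearn.model_selection import train_test_split

# hold out part of the participants for evaluation; stratify to keep both classes in the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)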
After this step, you end up with four datasets: X_train and y_train are
the parts of X and y that we will use for training, and X_test and y_test will
be used for evaluation.
We now have all the elements to start building a model. The first
model that we are going to build here is the logistic regression. As we do
not have tons of data, we can exclude the use of complex models like
random forests, xgboost, and the like, although they could definitely
replace the logistic regression if we had more data in this use case.
Thanks to the easy-to-use modeling interface of scikit-learn, it is really
easy to replace one model by another, as you’ll see throughout the
remainder of the example.
The code in Code Block 10-15 first initiates a logistic regression and
then fits the model using the training data.
# logistic regression
from sklearn.linear_model import LogisticRegression
my_lr = LogisticRegression()
my_lr.fit(X_train, y_train)
Code Block 10-15 Logistic regression
preds = my_lr.predict(X_test)
preds
Code Block 10-16 Prediction
The array contains the predictions for each of the rows in X_test, as
shown in Figure 10-13.
We do have the actual truth for these participants as well. After all,
they are not really new participants, but rather the subset of participants
that we chose to keep apart for evaluation and for whom we know
whether they used the coupon. We can compare the predictions to the actual
ground truth, using the code in Code Block 10-17.
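A simple way to make this comparison is to put the predictions and the ground truth side by side (a sketch; the original Code Block 10-17 may differ):
import pandas as pd

# side-by-side view of predicted and true labels for the test participants
pd.DataFrame({'predicted': preds, 'actual': y_test})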
The test set is rather small in this case, and we can manually conclude
that the model is actually predicting quite well. In use cases with more
data, it would be better to summarize this performance using other
methods. One great way to analyze classification models is the confusion
matrix. It shows in one graph all the data that are correctly predicted, but
also which observations are wrongly predicted and, in that case, which errors
were made how many times. The code in Code Block 10-18 shows how to
create such a confusion matrix for this use case.
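A sketch of such a confusion matrix plot with scikit-learn (the exact plotting calls in the original Code Block 10-18 may differ):
from sklearn.metrics import ConfusionMatrixDisplay

# rows show the true classes, columns the predicted classes
ConfusionMatrixDisplay.from_predictions(y_test, preds)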
Figure 10-15 The plot resulting from Code Block 10-18. Image by author
In this graph, you see that most predictions were correct and only one
mistake was made. This mistake was a participant who did not buy,
whereas the model predicted that they would buy with the coupon.
Model Benchmarking
The model made one mistake, so we can conclude that it is quite a good
model. However, for completeness, it would be good to try out another
model. Feel free to test out any classification model from scikit-learn, but
due to the relatively small amount of data, let’s try out a decision tree
model here. The code in Code Block 10-19 goes through the exact same
steps as before but simply with a different model.
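A minimal sketch of such a benchmark with a decision tree, with hyperparameters left at their defaults (the original Code Block 10-19 may differ in detail):
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import ConfusionMatrixDisplay

# same steps as before, with a decision tree instead of the logistic regression
my_dt = DecisionTreeClassifier()
my_dt.fit(X_train, y_train)
dt_preds = my_dt.predict(X_test)
ConfusionMatrixDisplay.from_predictions(y_test, dt_preds)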
Figure 10-16 The resulting dataframe from Code Block 10-19. Image by author
The result is shown in Figure 10-17 and indeed shows two errors,
both cases of participants who did not buy in reality. It seems that people
who did not buy are a little bit harder to detect than the opposite, even
though more evidence would be needed to investigate this further.
Key Takeaways
1. Classification is an area in supervised machine learning that deals
with models that learn how to use independent variables to predict a
categorical target variable.
2.
Feature engineering together with spatial operations can be used to
get spatial data into a machine learning format. It is important to end
up with variable definitions that will be useful for the classification
task at hand.
3.
Train-test splits are necessary for model evaluation, as models tend
to overfit on the training data.
4.
The confusion matrix is a great tool for evaluating classification
models' performances.
5.
Model benchmarking is the task of using multiple different machine
learning models on the same task, so that the best performing model
can be found and retained for the future.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
J. Korstanje, Machine Learning on Geographical Data Using Python
https://doi.org/10.1007/978-1-4842-8287-8_11
11. Regression
Joos Korstanje1
(1) VIELS MAISONS, France
In the previous two chapters, you have learned about the fundamentals of machine
learning use cases using spatial data. You have first seen several methods of
interpolation. Interpolation was presented as an introduction to machine learning,
in which a theory-based interpolation function is defined to fill in unknown values of
the target variable.
The next step moved from this unsupervised approach to a supervised approach,
in which we build models to predict values of which we have ground truth values. By
applying a train-test-split, this ground truth is then used to compute a performance
metric.
The previous chapter showed how to use supervised models for classification. In
classification models, unlike with interpolation, the target variable is a categorical
variable. The example shown used a binary target variable, which classified people
into two categories: buyers and nonbuyers.
In this chapter, you will see how to build supervised models for target variables
that are numeric. This is called regression. Although regression, just like
interpolation, is used to estimate a numeric target, the methods are actually
generally closer to the supervised classification methods.
In regression, the use of metrics and the building of models with the best
performance on those metrics will be as essential as it was in classification. The
models are adapted to take into account a numeric target variable, and the metrics
need to be chosen differently to take into account the fact that targets are numeric.
The chapter will start with a general introduction of what regression models are
and what we can use them for. The rest of the chapter will present an in-depth
analysis of a regression model with spatial data, during which numerous theoretical
concepts will be presented.
Introduction to Regression
Although the goal of this book is not to present deep mathematical content on
machine learning, let's start by exploring the general idea behind regression models
anyway. Keep in mind that there are many resources that will be able to fill in this
theory and that the goal of the current book is to present how regression models can
be combined with spatial data analysis and modeling.
Let’s start this section by considering one of the simplest cases of regression
modeling: the simple linear regression. In simple linear regression, we have one
numeric target variable (y variable) and one numeric predictor variable (X variable).
In this example, let’s consider a dataset in which we want to predict a person’s
weekly weight loss based on the number of hours that a person has worked out in
that same week. We expect to see a positive relationship between the two. Figure
11-1 shows the weekly weight loss plotted against the weekly workout hours.
This graph shows a clear positive relationship between workout and weight loss.
We could find the mathematical definition of the straight line going through those
points and then use this mathematical formula as a model to estimate weekly weight
loss as a function of the number of hours worked out. This can be shown graphically
in Figure 11-2.
Figure 11-2 The simple linear regression added to the graph. Image by author
In formula form, this straight line is
y = a * x + b
which would translate to the following for this example:
Weight_Loss = a * Workout + b
Mathematical procedures to determine the best-fitting values for a and b exist and
can be used to estimate this model. The exact mathematics behind this will be left
for further reading so as not to go out of scope for the current book. However, it is
important to understand the general idea behind estimating such a model.
It is also important to consider which next steps are possible, so let's spend some
time considering those. Firstly, the current model uses only a single explanatory
variable (workout), which is not really representative of how one would go about
losing weight.
In reality, one could consider that the food quantity is also a very important
factor in losing weight. This would need an extension of the mathematical formula to
become something like the following:
Weight_Loss = a * Workout + b * Food_Quantity + c
import geopandas as gpd

gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
geodata = gpd.read_file('chapter 11 data.kml')
geodata.head()
Code Block 11-1 Importing the data
import pandas as pd
apartment_data = pd.read_excel('house_data.xlsx')
apartment_data.head()
Code Block 11-2 Importing the house data
As you can see from this image, the data contains the following variables:
– Apt ID: The identifier of each apartment
– Price: The price of each apartment on Airbnb
– MaxGuest: The maximum number of guests allowed in the apartment
– IncludesBreakfast: 1 if breakfast is included and 0 otherwise
The Apt ID is not in the same format as the identifier in the geodata. It is
necessary to convert the values in order to make them correspond. This will allow us
to join the two datasets together in a later step. This is done using the code in Code
Block 11-3.
After this operation, the dataset now looks as shown in Figure 11-5.
Figure 11-5 The data resulting from Code Block 11-3. Image by author
Now that the two datasets have an identifier that corresponds, it is time to start
the merge operation. This merge will bring all columns into the same dataset, which
will make working with the data easier. This merge is done using the code in Code
Block 11-4.
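A sketch of such a merge; the join keys used below are assumptions based on the description above and may not match the original Code Block 11-4:
# join the apartment attributes onto the geodata using the shared identifier
merged_data = geodata.merge(apartment_data, left_on='Name', right_on='Apt ID')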
We now have all the columns inside the same dataframe. This concludes the data
preparation phase. As a last step, let’s do a visualization of the apartment locations
within Amsterdam, to get a better feeling for the data. This is done using the code in
Code Block 11-5.
import contextily as cx
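# The plotting part of Code Block 11-5 can be sketched as follows,
# assuming the merged geodataframe from the previous step:
ax = merged_data.plot(figsize=(15, 15), markersize=64)
cx.add_basemap(ax, crs=merged_data.crs)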
Figure 11-7 The map resulting from Code Block 11-5. Image by author using contextily source data and image as
referenced in the image
You can see that the apartments used in this study are pretty well spread
throughout the center of Amsterdam. In the next section, we will do more in-depth
exploration of the dataset.
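The price histogram discussed below (created in Code Block 11-6) can be sketched as:
# histogram of the nightly prices
merged_data['Price'].hist()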
This histogram shows us that the prices are all between 90 and 170, with the
majority being at 130. The data does not seem to follow a perfectly normal
distribution, although we do see more data points being closer to the center than
further away.
If we needed to give a very quick-and-dirty estimate of the most
appropriate price for our Airbnb, we could simply use the average
price of Airbnbs in the center of Amsterdam. The code in Code Block 11-7 computes
this mean.
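A sketch of this computation:
merged_data['Price'].mean()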
The result is 133.75, which tells us that setting this price would probably be a
more or less usable estimate if we had nothing more precise. Of course, as prices
range from 90 to 170, we could either:
– Lose money due to underpricing: If our Airbnb is actually worth 170 and we
choose to price it at 133.75, we would be losing the difference (170 – 133.75) each
night.
– Lose money due to overpricing: If our Airbnb is actually worth 90 and we choose
to price it at 133.75, we will probably have a very hard time finding guests, and
our booking numbers will be very low.
Clearly, it would be very valuable to have a better understanding of the factors
influencing Airbnb prices so that we can find the best price for our apartment.
As a next step, let's find out how the number of guests can influence Airbnb
prices. The code in Code Block 11-8 creates a scatter plot of Price against MaxGuests
to visually inspect the relationship between those variables.
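A sketch of this scatter plot:
import matplotlib.pyplot as plt

plt.scatter(merged_data['MaxGuests'], merged_data['Price'])
plt.show()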
Although the trend is less clear than the one observed in the theoretical example
in the beginning of this chapter, we can clearly see that higher values on the x axis
(MaxGuests) generally have higher values on the y axis (Price). Figure 11-9 shows
this.
Figure 11-9 The scatter plot of Price against MaxGuests. Image by author
The strength of a linear relationship can also be measured using a more
quantitative approach. The Pearson correlation coefficient is a score between
–1 and 1 that gives this indication. A value of 0 means no correlation, a value close to
–1 means a negative correlation between the two, and a value close to 1 means a
positive correlation between the variables.
The correlation coefficient can be computed using the code in Code Block 11-9.
import numpy as np
np.corrcoef(merged_data['MaxGuests'], merged_data['Price'])
Code Block 11-9 Compute the correlation coefficient
This will give you the correlation matrix as shown in Figure 11-10.
The resulting correlation coefficient between MaxGuests and Price is 0.453. This
is a fairly strong positive correlation, indicating that the number of guests has a
strong positive impact on the price that we can ask for an Airbnb. In short, Airbnbs
for more people can ask a higher price, whereas Airbnbs for a small number of
guests should be priced lower.
As a next step, let’s see whether we can also use the variable IncludesBreakfast
for setting the price of our Airbnb. As the breakfast variable is categorical (yes or
no), it is better to use a different technique for investigating this relationship. The
code in Code Block 11-10 creates a boxplot to answer this question.
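Such a boxplot can be sketched directly with pandas (the original Code Block 11-10 may differ):
# one box per value of IncludesBreakfast
merged_data.boxplot(column='Price', by='IncludesBreakfast')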
This boxplot shows us that Airbnbs that propose a breakfast are generally able to
ask a higher price than Airbnbs that do not propose one. Depending on whether you
propose a breakfast, you should price your apartment accordingly.
X = merged_data[['IncludesBreakfast', 'MaxGuests']]
y = merged_data['Price']
Code Block 11-11 Creating X and y objects
We will use a linear model for this phase of modeling. The scikit-learn
implementation of the linear model can be estimated using the code in Code Block
11-12.
# first version lets just do a quick and dirty non geo model
from sklearn.linear_model import LinearRegression
lin_reg_1 = LinearRegression()
lin_reg_1.fit(X, y)
Code Block 11-12 Linear regression
Now that the model has been fitted, we have the mathematical definition (with
the estimated coefficients) inside our linear regression object.
Interpretation of Iteration 1 Model
To interpret what this model has learned, we can inspect the coefficients. The code in
Code Block 11-13 shows how to see the coefficients that the model has estimated.
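A sketch of this inspection:
# one coefficient per column of X, plus the intercept
print(lin_reg_1.coef_, lin_reg_1.intercept_)
Before refitting and evaluating, X and y are also split into training and test data; a minimal sketch, where the split settings are assumptions:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)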
Let’s now it the model again, but this time only on the training data. This is done
in Code Block 11-15.
lin_reg_2 = LinearRegression()
lin_reg_2.fit(X_train, y_train)
Code Block 11-15 Fit the model on the train data
To estimate the performance, we use the estimated model (in this case, the
coefficients and the linear model formula) to predict prices on the test
data. This is done in Code Block 11-16.
pred_reg_2 = lin_reg_2.predict(X_test)
Code Block 11-16 Predict on the test set
We can use these predicted values together with the real, known prices of the test
set to compute the R2 scores. This is done in Code Block 11-17.
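A sketch of this computation with scikit-learn:
from sklearn.metrics import r2_score

r2_score(y_test, pred_reg_2)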
The resulting R2 score is 0.1007. Although not a great result, the score shows that
the model has some predictive value and would be a better basis for pricing than
simply using the mean.
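The next iteration adds location information by extracting latitude and longitude from the point geometries (this is what Code Block 11-18 does); a minimal sketch, assuming point geometries in merged_data:
merged_data['long'] = merged_data.geometry.x
merged_data['lat'] = merged_data.geometry.y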
Figure 11-12 The dataset resulting from Code Block 11-18. Image by author
Let’s see how latitude and longitude are related to the price by making scatter
plots of price vs. latitude and price vs. longitude. The first scatter plot is created in
Code Block 11-19.
plt.scatter(merged_data['lat'], merged_data['Price'])
Code Block 11-19 Create the scatter plot
Figure 11-13 The scatter plot resulting from Code Block 11-19. Image by author
There does not seem to be much of a trend in this scatter plot. Prices range
between 90 and 170 regardless of the latitude. Let's use the code in Code Block
11-20 to check whether this is true for longitude as well.
plt.scatter(merged_data['long'], merged_data['Price'])
Code Block 11-20 Create a scatter plot with longitude
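Between these two plots, a MarkerSize column is created so that the price can be shown as the size of each marker; the scaling below is only an assumption of how it might be built:
# turn the price into a usable marker size; the exact scaling is arbitrary
merged_data['MarkerSize'] = merged_data['Price'] - merged_data['Price'].min() + 10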
We can now use this size setting when creating the scatter plot. This is done in
Code Block 11-22.
plt.scatter(merged_data['long'], merged_data['lat'],
s=merged_data['MarkerSize'], c='none', edgecolors='black')
Code Block 11-22 Create a map with size of marker
This graph shows that there are no linear relationships, but that we could expect
the model to learn some areas that have generally high or generally low prices. This
would mean that we may need to change to a nonlinear model to fit this reality better.
# add features
X2 = merged_data[['IncludesBreakfast', 'MaxGuests', 'lat',
'long']]
y = merged_data['Price']
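The second iteration splits this extended feature set and fits a decision tree regressor; a minimal sketch, where the split settings are assumptions and the original listing may differ:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

X2_train, X2_test, y_train, y_test = train_test_split(X2, y, test_size=0.25, random_state=42)

dt_reg = DecisionTreeRegressor()
dt_reg.fit(X2_train, y_train)
print(r2_score(y_test, dt_reg.predict(X2_test)))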
The score that this model obtains is –0.04. Unexpectedly, we have a much worse
result than in the previous step. Be careful here, as the DecisionTree results will be
different for each execution due to randomness in the model building phase. You will
probably have a different result than the one presented here, but if you try out
different runs, you will see that the average performance is worse than the previous
iteration.
The DecisionTreeRegressor, just like many other models, can be tuned using a
large number of hyperparameters. In this iteration, no hyperparameters were
specified, which means that only default values were used.
As we have a strong intuition that nonlinear models should be able to obtain
better results than a linear model, let’s play around with hyperparameters in the
next iteration.
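A sketch of such a tuning loop over the max_depth hyperparameter (the exact grid of values is an assumption):
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

for max_depth in range(1, 11):
    dt_reg = DecisionTreeRegressor(max_depth=max_depth, random_state=42)
    dt_reg.fit(X2_train, y_train)
    print(max_depth, r2_score(y_test, dt_reg.predict(X2_test)))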
Figure 11-16 The result of the model tuning loop. Image by author
In this output, you can see that the max_depth of 3 has resulted in an R2 score of
0.54, much better than the result of –0.04. Tuning on max_depth has clearly had an
important impact on the model’s performance. Many other trials and iterations
would be possible, but that is left as an exercise. For now, the DecisionTreeRegressor
with max_depth = 3 is retained as the final regression model.
from sklearn import tree
import matplotlib.pyplot as plt

plt.figure(figsize=(15,15))
tree.plot_tree(dt_reg_5, feature_names=X2_train.columns)
plt.show()
Code Block 11-26 Generate the tree plot
Figure 11-17 The tree plot resulting from Code Block 11-26. Image by author
We can clearly see which nodes have learned which trends. We can see that
latitude and longitude are used multiple times by the model, which allows the model
to split out specific areas on the map that are to be priced lower or higher.
As this is the final model for the current use case, and we know that the R2 score
tells us that the model is a much better estimation than using just an average price,
we can be confident that pricing our Airbnb using the decision tree model will result
in a more appropriate price for our apartment.
The goal of the use case has therefore been reached: we have created a
regression model to use both spatial data and apartment data to make the best
possible price estimation for an Airbnb in Amsterdam.
Key Takeaways
1.
Regression is an area in supervised machine learning that deals with models
that learn how to use independent variables to predict a numeric target variable.
2.
Feature engineering, spatial data, and other data can be used to feed this
regression model.
3.
The R2 score is a metric that can be used for evaluating regression models.
4.
Linear regression is one of the most common regression models, but many
alternative models, including Decision Tree, Random Forest, or Boosting, can be
used to challenge its performances in a model benchmark.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
J. Korstanje, Machine Learning on Geographical Data Using Python
https://doi.org/10.1007/978-1-4842-8287-8_12
12. Clustering
Joos Korstanje1
(1) VIELS MAISONS, France
In this fourth and last chapter on machine learning, we will cover clustering. To get
this technique in perspective, let’s do a small recap of what we have gone through in
terms of machine learning until now.
The machine learning topics started after the introduction of interpolation. In
interpolation, we tried to estimate a target variable for locations at which the value of
this target variable is unknown. Interpolation uses a mathematical formula to decide
on the best possible theoretical way to interpolate these values.
After interpolation, we covered classi ication and regression, which are the two
main categories in supervised modeling. In supervised modeling, we build a model
that uses X variables to predict a target (y) variable. The great thing about supervised
models is that we have a large number of performance metrics available that can help
us in tuning and improving the model.
Introduction to Clustering
In clustering, the goal is to identify clusters, or groups, of observations based on some
measure of similarity or distance. As mentioned before, there is no target variable
here: we simply use all of the available variables about each observation to create
groups of similar observations.
Let’s consider a simple and often used example. In the graph in Figure 12-1, you’ll
see a number of people (each person is an observation) for whom we have collected
the spending on two product groups at a supermarket: snacks and fast food is the first
category, and healthy products is the second.
As this data has only two variables, it is relatively easy to identify three groups of
clients in this database. A subjective proposal for boundaries is presented in the graph
in Figure 12-2.
Figure 12-2 The graph showing a clustering example. Image by author
In this graph, you see that the clients have been divided into three groups:
1.
Unhealthy spenders: A cluster of clients who spend a lot in the category snacks
and fast food, but not much in the category healthy
2.
Healthy spenders: A cluster of clients who spend a lot on healthy products but not
much on snacks and fast food
3.
Small spenders: People who do not spend a lot at all
An example of a way in which a supermarket could use such a clustering is sending
personalized advertisements or discount coupons to those clients who they know
will be interested.
import geopandas as gpd

gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
geodata = gpd.read_file('chapter_12_data.kml')
geodata.head()
Code Block 12-1 Importing the data
The data are stored as one LineString for each person. There are no additional
variables available. Let’s now make a simple plot to have a better idea of the type of
trajectories that we are working with. This can be done using the code in Code Block
12-2.
geodata.plot()
Code Block 12-2 Plotting the data
To add a bit of context to these trajectories, we can add a background map to this
graph using the code in Code Block 12-3.
import contextily as cx
ax = geodata.plot(figsize=(15,15), markersize=64)
cx.add_basemap(ax, crs = geodata.crs)
Code Block 12-3 Plotting with a background map
Figure 12-5 The map resulting from Code Block 12-3. Image by author using contextily source data and image as
referenced in the image
The three trajectories are based in the city of Brussels. For each of the three
trajectories, you can visually identify a similar pattern: there are clustered parts
where there are multiple points in the same neighborhood, indicating points of
interest. Then there are also some parts where there is a real line-like pattern which
indicates movements from one point of interest to another.
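The first participant can be selected from the geodataframe (this is what Code Block 12-4 does); a minimal sketch, assuming Person 1 sits at index 0:
# keep the selection as a geodataframe by indexing with a list
one_person = geodata.loc[[0], :]
one_person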
Figure 12-6 The result from Code Block 12-4. Image by author
Let’s plot the trajectory of this person in order to have a more detailed vision of
the behavior of this person. This map can be made using the code in Code Block 12-5.
ax = one_person.plot(figsize=(15,15), markersize=64)
cx.add_basemap(ax, crs = one_person.crs)
Code Block 12-5 Creating a map of the trajectory of Person 1
Figure 12-7 The map resulting from Code Block 12-5. Image by author using contextily source data and image as
referenced in the image
You can see from this visualization that the person has been at two locations for a
longer period: one location at the top left of the map and a second point of interest at
the bottom right. We want to build a clustering model that is capable of
capturing these two locations.
To start building a clustering model for Person 1, we need to convert the
LineString into points. After all, we are going to cluster individual points to identify
clusters of points. This is done using the code in Code Block 12-6.
import pandas as pd
one_person_points_df = pd.DataFrame(
[x.strip('(').strip(')').strip('0').strip(' ').split(' ')
for x in str(one_person.loc[0, 'geometry'])
[13:].split(',')],
columns=['long','lat']
)
one_person_points_df = one_person_points_df.astype(float)
one_person_points_df.head()
Code Block 12-6 Convert the LineString into points
The data format that results from this code is shown in Figure 12-8.
Figure 12-8 The new data format of latitude and longitude as separate columns. Image by author
Now that we have the right data format, it is time to apply a clustering method. As
our data is in latitude and longitude, the distance between two points should be
defined using the haversine distance. We choose to use the OPTICS clustering method, as
it applies well to spatial data. Its behavior is the following:
– OPTICS decides by itself on the number of clusters that it wants to use. This is opposed
to a number of models in which the user has to decide on the number of clusters.
– OPTICS can be tuned to influence the number of clusters that the model chooses.
This is important, as the default settings may not result in the exact number of
clusters that we want to obtain.
– OPTICS is able to discard points: when points are far away from all identified
clusters, they can be coded as –1, meaning an outlier data point. This will be
important in the case of spatial clustering, as there will be many data points on the
transportation parts of the trajectory that are quite far away from the
cluster centers. This option is not available in all clustering methods, but it is there
in OPTICS and some other methods like DBSCAN.
Let’s start with an OPTICS clustering that uses the default settings. This is done in
the code in Code Block 12-7.
from sklearn.cluster import OPTICS
import numpy as np

clustering = OPTICS(metric='haversine')
one_person_points_df.loc[:, 'cluster'] = clustering.fit_predict(
    np.radians(one_person_points_df[['lat', 'long']]))
Code Block 12-7 Apply the OPTICS clustering
The previous code has created a column called cluster in the dataset, which now
contains the cluster that the model has found for each row, each data point. The code
in Code Block 12-8 shows how to have an idea of how the clusters are distributed.
one_person_points_df['cluster'].value_counts()
Code Block 12-8 Show the value counts
Now, as said before, the cluster –1 identifies outliers. Let's delete them from the
data with the code in Code Block 12-9.
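A sketch of this filtering step, mirroring the same line in Code Block 12-13 further down:
# keep only the points that were assigned to a real cluster
one_person_points_df = one_person_points_df[one_person_points_df['cluster'] != -1]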
We can now compute the central points of each cluster by computing the median
point with a groupby operation. This is done in Code Block 12-10.
medians_of_POI = one_person_points_df.groupby(
    ['cluster'])[['lat', 'long']].median().reset_index(drop=False)
medians_of_POI
Code Block 12-10 Compute medians of clusters
Let’s plot those central coordinates on a map using the code in Code Block 12-11.
The basic plot with the three central points is shown in Figure 12-11.
Figure 12-11 The plot with the three central points of Person 1. Image by author
Let’s use the code in Code Block 12-12 to add more context to this map.
ax = one_person.plot(figsize=(15,15))
medians_of_POI_gdf.plot(ax=ax,markersize=128)
cx.add_basemap(ax, crs = one_person.crs)
Code Block 12-12 Plot a basemap behind the central points
Figure 12-12 Plotting the central points to a background map. Image by author using contextily source data and
image as referenced in the image
This map shows that the clustering was not totally successful. The cluster centroid
at the top left did correctly identify a point of interest, and so did the one at the
bottom right. However, there is one additional centroid in the middle that should not
have been identified as a point of interest. In the next section, we will tune the model
to improve this result.
clustering = OPTICS(
    min_samples=10,
    max_eps=2.,
    min_cluster_size=8,
    xi=0.05,
    metric='haversine')
one_person_points_df.loc[:, 'cluster'] = clustering.fit_predict(
    np.radians(one_person_points_df[['lat', 'long']]))
one_person_points_df = one_person_points_df[
    one_person_points_df['cluster'] != -1]
medians_of_POI = one_person_points_df.groupby(
    ['cluster'])[['lat', 'long']].median().reset_index(drop=False)
print(medians_of_POI)
medians_of_POI_gdf = gpd.GeoDataFrame(
    medians_of_POI,
    geometry=[Point(x) for x in
              zip(list(medians_of_POI['long']),
                  list(medians_of_POI['lat']))])
ax = one_person.plot(figsize=(15,15))
medians_of_POI_gdf.plot(ax=ax, markersize=128)
cx.add_basemap(ax, crs=one_person.crs)
Code Block 12-13 Applying the OPTICS with different settings
Figure 12-13 The map resulting from Code Block 12-13. Image by author using contextily data and image as
referenced in the map
As you can see, the model has correctly identified the two points (top left and
bottom right) and no other points. The model is therefore successful, at least for this
person. In the next section, we will apply this to the other data as well and see whether
the new cluster settings give correct results for them as well.
# loop over all participants; the loop header and the point-conversion step are
# implied by the use of `row` in the original listing
for i, row in geodata.iterrows():
    # convert this person's LineString into a long/lat dataframe, as in Code Block 12-6
    one_person_points_df = pd.DataFrame(
        [x.strip('(').strip(')').strip('0').strip(' ').split(' ')
         for x in str(row['geometry'])[13:].split(',')],
        columns=['long', 'lat']).astype(float)

    clustering = OPTICS(
        min_samples=10,
        max_eps=2.,
        min_cluster_size=8,
        xi=0.05,
        metric='haversine')
    one_person_points_df.loc[:, 'cluster'] = clustering.fit_predict(
        np.radians(one_person_points_df[['lat', 'long']]))
    one_person_points_df = one_person_points_df[
        one_person_points_df['cluster'] != -1]
    medians_of_POI = one_person_points_df.groupby(
        ['cluster'])[['lat', 'long']].median().reset_index(drop=False)
    print(medians_of_POI)
    medians_of_POI_gdf = gpd.GeoDataFrame(
        medians_of_POI,
        geometry=[Point(x) for x in
                  zip(list(medians_of_POI['long']),
                      list(medians_of_POI['lat']))])
    ax = gpd.GeoDataFrame([row],
                          geometry=[row['geometry']]).plot(figsize=(15,15))
    medians_of_POI_gdf.plot(ax=ax, markersize=128)
    plt.show()
Code Block 12-14 Apply the model to all data
The resulting output and graphs will be shown hereafter in Figures 12-14, 12-15,
and 12-16.
This first map shows the result that we have already seen before. Indeed, for
Person 1, the OPTICS model has correctly identified the two points of interest. Figure
12-15 shows the results for Person 2.
Figure 12-15 The three central points of Person 2 against their trajectory. Image by author
For Person 2, we can see that there are three points of interest, and the OPTICS
model has correctly identified those three centroids. The model is therefore
considered successful for this person. Let's now check the output for the third person
in Figure 12-16.
Figure 12-16 The two centroids of Person 3 against their trajectory
This result for Person 3 is also successful. There were two points of interest in the
trajectory of Person 3, and the OPTICS model has correctly identified those two.
Key Takeaways
1.
Unsupervised machine learning is a counterpart to supervised machine learning.
In supervised machine learning, there is a ground truth with a target variable. In
unsupervised machine learning, there is no target variable.
2.
Feature reduction is a family of methods in unsupervised machine learning, in
which the goal is to redefine variables. It is not very different to apply feature
reduction in spatial use cases.
3.
Clustering is a family of methods in unsupervised machine learning that focuses
on finding groups of observations that are fairly similar. When working with
spatial data, there are some specifics to take into account when clustering.
4.
The OPTICS clustering model with haversine distance was used to identify points
of interest in the trajectories of three people in Brussels. Although the default
OPTICS model did not find those points of interest correctly, a manual tuning has
resulted in a model that correctly identifies the points of interest of each of the
three people observed in the data.
© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
J. Korstanje, Machine Learning on Geographical Data Using Python
https://doi.org/10.1007/978-1-4842-8287-8_13
13. Conclusion
Joos Korstanje1
(1) VIELS MAISONS, France
Key Takeaways
1.
Throughout this book, you have seen three main topics:
a.
Spatial data, its theoretical specificities, and managing spatial
data in Python
b.
GIS spatial operations in Python
c.
Machine learning on spatial data and specific considerations to
adapt regular machine learning to the case of spatial data
2. This chapter has presented a number of ideas for further learning
in the form of potential learning paths:
a.
Specialize in GIS by going into more detail of different GIS tools
and mapmaking.
b. Specialize in machine learning by studying machine learning
theory and practice in more detail.
c.
Going into advanced earth observation use cases and
combining this with the study of the field of computer vision.
d.
Other ideas include data engineering, by focusing on efficiently
storing geodata long term, or any other field that has a heavy
component of spatial data, like meteorology, hydrology, and
much more.
Index
A
Albers equal area conic projection
Azimuthal equidistant projection
Azimuthal/true direction projection
B
Babinet projection
Buffering operations
data type
definition
difference creation
GIS spatial operations
intersection operation
line data
point data
polygon
Python
data resulting
house searching criteria
LineString object
point data
polygons
visualization
schematic diagram
set operations
standard operations
C
Cartesian coordinate system
Cartopy
Classification
data modeling
array resulting
dataframe format
error analysis
logistic regression
plot resulting
predictions
resulting comparison
stratification
GIS spatial operations
machine learning
model benchmarking
reorganization/standardization
spatial communication
advantage/disadvantage
data resulting
feature engineering
geodataframe
importing data
map resulting
operation
resulting dataframe
source code
truncated version
use case
wide/long data format
Clipping operation
definition
differences
GIS spatial operations
line data
Python
dataset
features
plot resulting
seine dataset
source code
schematic drawing
Clustering
background map
central points
chart plotting
definition
extract data
GIS spatial operations
graph representation
importing/inspecting data
latitude and longitude
LineString model
map creation
models
OPTICS method
result information
resulting map
source code
spatial data
tuning models
map resulting
OPTICS model
source code
Conformal projections
Conic equidistant projection
Coordinate systems
airplane navigation
Cartesian
geographic systems
GIS spatial operations
local system
maps
dataframe
ESRI:102014
export map
features
Google My Maps
libraries installation
plotting process
polygon-shaped map data
polar system
projected system
time and pose problems
two-dimensional Euclidean space
two-dimensional view
types
D
Data types
GIS spatial operations
lines
airports data
dataframe
definition
LineString geometry
mathematical objects
merging data
plot resulting
Python
points
See Point data
polygon information
polygons
definition
operations
Python dataset
rasters/grids
definition
Python
vector/raster data
Dissolve operation
definition
GIS spatial operations
grouped dataset
Python
schematic drawing
Doubly equidistant projection
See Two-point equidistant projection
E
Elliptical projection
Equal area projections
Equidistant projections
Erase operation
clipping
deleting/dropping
GIS spatial operations
line data
overlay
points
polygons
Python
data resulting
data table
definition
Iberia
line data
map resulting
plot resulting
point data
Spain data
visualization
schematic drawing
spatial operations
table view
F
Folium map
G
Geodata system
CSV/TXT/Excel
definition
distance/direction
GIS
See Geographic Information Systems (GIS)
JSON format
KML file
magnetic direction measurements
Python packages
shapefile
TIFF/JPEG/PNG images
Geographic coordinate systems
ETRS89
latitude/longitude
WGS 84/EPSG:4326
Geographic Information Systems (GIS)
ArcGIS
coordinate systems and projections
database system
intensive workloads
machine learning
open source
Python/R programming
remote sensing/image treatment
specialization
Geopandas/matplotlib map
color-coded column
dataset
documentation
grayscale map
legend
plot method
point dataset
title image
Global Positioning System (GPS)
H
Homolographic projection
I, J
Interpolation
benchmark
classification/regression
curved line
definition
GIS spatial operations
Kriging
See Kriging
linear
list graphical process
nearest neighbor
one-dimensional/spatial interpolation
polynomial functions
Python
dataframe
data points
geodataframe
numerical estimations
plot resulting
2D linear function
straight line
Intersecting operation
buffering operation
conceptual data
differences
geographical datasets
GIS spatial operations
line datasets
polygons
Python
colormap
import data
overlay function
plot resulting
schematic drawing
set operations
standard operations
K
Kriging solutions
exponential setting
fundamental approaches
Gaussian
linear
plot
L
Lambert conformal conic projection
Lambert equal area azimuthal
Linear interpolation
Local Coordinate Systems
M
Mapmaking
Cartopy
color scale picking
folium
geopandas/matplotlib
additional column
color-coded column
documentation
grayscale map
legend
plot method
point dataset
title image
GIS spatial operations
Plotly
Mercator map projection
Merge operation
attribute join
de inition
GIS spatial operations
Python
attribute join
concatenation
datasets
lookup table
map resulting
row-wise
spatial information
types
schematic drawing
spatial join
Mollweide projection
N
Nearest neighbor interpolation
O
Overlay operation
P, Q
Plotly map
Point data
de inition
filter morning vs. afternoon
geometry format
operations
Python
content
coordinate system
data information
graph image
plotting information
squirrel data
XML parsing
Polar coordinate system
components
definition
formulas
radians vs. degrees
schematic drawing
trigonometric computations
Polynomial interpolation
Potential learning paths
Projected coordinate systems
azimuthal/true direction
conformal projection
equal area
equidistant
features
x and y coordinates
R
Raster vs. vector data
Regression models
data exploration
exploration and model
GIS spatial operations
importing/preparing data
linear
mathematical form
metrics/building models
modeling process
code results
decision tree
DecisionTreeRegressor
geographic data
interpretation
linear model
max_depth
prediction
R2 score evaluation
train and test
numeric target/predictor variable
target variable
S
Spline/piecewise polynomial
Supervised models
T
Two-point equidistant projection
U
Unsupervised models
V, W, X, Y, Z
Vector vs. raster data
Visualization method