Statistics and Data Visualization in Climate Science With R and Python
A comprehensive overview of essential statistical concepts, useful statistical methods, data visualization, machine learning, and modern computing tools for the climate sciences and many other fields, such as geography and environmental engineering. It is an invaluable reference for students and
researchers in climatology and its connected fields who wish to learn data science, statistics, R and
Python programming. The examples and exercises in the book empower readers to work on real cli-
mate data from station observations, remote sensing, and simulated results. For example, students
can use R or Python code to read and plot the global warming data and the global precipitation data
in netCDF, csv, txt, or JSON; and compute and interpret empirical orthogonal functions. The book’s
computer code and real-world data allow readers to fully utilize the modern computing technology
and updated datasets. Online supplementary resources include R code and Python code, data files,
figure files, tutorials, slides, and sample syllabi.
Samuel S. P. Shen is Distinguished Professor of Mathematics and Statistics at San Diego State Uni-
versity, and Visiting Research Mathematician at Scripps Institution of Oceanography, University of
California – San Diego. Formerly, he was McCalla Professor of Mathematical and Statistical Sci-
ences at the University of Alberta, Canada, and President of the Canadian Applied and Industrial
Mathematics Society. He has held visiting positions at the NASA Goddard Space Flight Center,
the NOAA Climate Prediction Center, and the University of Tokyo. Shen holds a B.Sc. degree in
Engineering Mechanics and a Ph.D. degree in Applied Mathematics.
Gerald R. North is University Distinguished Professor Emeritus and former Head of the Department
of Atmospheric Science at Texas A&M University. His research focuses on modern and paleo-
climate analysis, satellite remote sensing, climate and hydrology modeling, and statistical methods in
atmospheric science. He is an elected fellow of the American Geophysical Union and the American
Meteorological Society. He has received several awards including the Harold J. Haynes Endowed
Chair in Geosciences of Texas A&M University, the Jule G. Charney Medal from the American Meteorological Society, and the Scientific Achievement Medal from NASA. North holds both B.Sc.
and Ph.D. degrees in Physics.
‘Statistics and Data Visualization in Climate Science with R and Python by Sam Shen and
Jerry North is a fabulous addition to the set of tools for scientists, educators and students
who are interested in working with data relevant to climate variability and change . . . I can
testify that this book is an enormous help to someone like me. I no longer can simply
ask my grad students and postdocs to download and analyze datasets, but I still want
to ask questions and find data-based answers. This book perfectly fills the 40-year gap
since I last had to do all these things myself, and I can’t wait to begin to use it . . . I am
certain that teachers will find the book and supporting materials extremely beneficial as
well. Professors Shen and North have created a resource of enormous benefit to climate
scientists.’
Dr Phillip A. Arkin, University of Maryland
‘This book is a gem. It is the proverbial fishing rod to those interested in statistical analysis
of climate data and visualization that facilitates insightful interpretation. By providing a
plethora of actual examples and R and Python scripts, it lays out the “learning by doing”
foundation upon which students and professionals alike can build their own applications
to explore climate data. This book will become an invaluable desktop reference in Climate
Statistics.’
Professor Ana P. Barros, University of Illinois Urbana-Champaign
‘A valuable toolkit of practical statistical methods and skills for using computers to analyze
and visualize large data sets, this unique book empowers readers to gain physical under-
standing from climate data. The authors have carried out fundamental research in this field,
and they are master teachers who have taught the material often. Their expertise is evident
throughout the book.’
Professor Richard C. J. Somerville, University of California, San Diego
‘This book is written by experts in the field, working on the frontiers of climate science. It
enables instructors to “flip the classroom”, and highly motivated students to visualize and
analyze their own data sets. The book clearly and succinctly summarizes the applicable sta-
tistical principles and formalisms and goes on to provide detailed tutorials on how to apply
them, starting with very simple tasks and moving on to illustrate more advanced, state-of-
the-art techniques. Having this book readily available should reduce the time required for
advanced undergraduate and graduate students to achieve sufficient proficiency in research
methodology to become productive scientists in their own right.’
Professor John M. Wallace, University of Washington
Statistics and Data Visualization
in Climate Science with
R and Python
SAMUEL S. P. SHEN
San Diego State University
GERALD R. NORTH
Texas A&M University
Shaftesbury Road, Cambridge CB2 8EA, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi – 110025, India
103 Penang Road, #05–06/07, Visioncrest Commercial, Singapore 238467
www.cambridge.org
Information on this title: www.cambridge.org/9781108842570
DOI: 10.1017/9781108903578
© Cambridge University Press & Assessment 2023
This publication is in copyright. Subject to statutory exception and to the provisions
of relevant collective licensing agreements, no reproduction of any part may take
place without the written permission of Cambridge University Press & Assessment.
First published 2023
Printed in the United Kingdom by CPI Group Ltd, Croydon CR0 4YY
A catalogue record for this publication is available from the British Library.
Library of Congress Cataloging-in-Publication Data
Names: Shen, Samuel S., author. | North, Gerald R., author.
Title: Statistics and data visualization in climate science with R and Python / Samuel S. P. Shen,
Gerald R. North.
Description: Cambridge ; New York : Cambridge University Press, 2023. | Includes bibliographical
references and index.
Identifiers: LCCN 2023029571 | ISBN 9781108842570 (hardback) | ISBN 9781108829465 (paperback) |
ISBN 9781108903578 (ebook)
Subjects: LCSH: Climatology – Statistical methods. | Climatology – Data processing. | Information
visualization. | R (Computer program language) | Python (Computer program language)
Classification: LCC QC874.5 .S48 2023 | DDC 551.63/3–dc23/eng/20230816
LC record available at https://lccn.loc.gov/2023029571
ISBN 978-1-108-84257-0 Hardback
Cambridge University Press & Assessment has no responsibility for the persistence
or accuracy of URLs for external or third-party internet websites referred to in this
publication and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
Give the pupils something to do, not something to learn; and doing is of such a nature
as to demand thinking; learning naturally results.
— John Dewey
Preface
The learning goal of this book is to help you master the commonly used statistical methods in climate science and apply them, with computers, to real climate data analysis and visualization. We hope that this book will quickly improve your data science and statistics skills. You will feel comfortable exploring large datasets in various kinds of data formats,
compute their statistics, generate statistical plots, visualize the data, interpret your results,
and present your results. We appreciate the psychological value of instant gratification,
which is adopted in the design of most video games. We want you to gain some useful skill after you have spent two or three hours reading the book and interacting with it: running and modifying our R or Python code, and exploring
the climate data visualization technologies, such as 4-dimensional visual delivery (4DVD)
of big climate data (www.4dvd.org). You might have noticed that we use the word “skill”
rather than “knowledge” gained from our book. We believe that you need skills and your
knowledge comes from your skills and your practice of the skills. We emphasize “do!” Our
book is designed in such a way that you will use your computer to interact with the book,
for example running R code in R Studio or Python code in Jupyter Notebook and Colab
to reproduce the figures in the book, or modifying the code for the analyses and plots
of your own datasets. Your interaction with the book is essential to effectively improve
your data science and statistics skills, which can be applied not only to climate data but
also to other datasets of your interest. We expect that your skill improvement will make
you more efficient when handling the data in your study or job, and hence will bring you
more opportunities. Specifically, you will be able to plot big data in the file formats of
.nc, .csv, .json, .txt, compute statistical parameters (e.g., mean, standard devia-
tion, quantiles, test statistics, empirical orthogonal functions, principal components, and
multivariate regressions), and generate various kinds of figures (e.g., surface air temper-
ature for January 1942, atmospheric carbon dioxide data of Mauna Loa, pie chart, box
plot, histogram, periodogram, and chi-square fit to monthly precipitation data). If you are already experienced in data analysis and statistics, you may appreciate our explanation of the assumptions and limitations of statistical methods, such as the four assumptions
for a linear regression, the test of serial correlation, and intuition of the theory behind
covariances and spectral analyses.
To help you quickly learn your desired skills, we try our best to make each chapter as self-contained as possible. This allows you to read the chapter of your interest without
reading the previous chapters or with minimum references. We are able to do so because of
your interaction with the book through your computer and the Internet with some help from
learning resources freely available at our book website www.climatestatistics.org.
Who should read this book? We wrote this book for the following groups of people: (i)
undergraduate and graduate students majoring in atmospheric sciences, oceanic sciences,
geography, and climate science in general to help them quickly gain the data handling
skills for research, for supporting course work, or for job interviews; (ii) climate science
professionals who may wish to check whether they have correctly applied statistical meth-
ods or answer questions from their junior colleagues, such as how to make inference on
regression; (iii) college students in other majors who wish to use climate data as examples
to learn data science skills and R or Python programming so that they can compete well
in the job market; (iv) college professors who wish to modernize their courses or curric-
ula amid the development of the digital economy and who wish to find a textbook that
allows them to effectively engage students to learn data science skills; and (v) senior or
retired scientists and engineers who wish to use an interactive book to sharpen their mind
by mimicking computer programs and generating beautiful figures.
Why is such a book needed? A simple answer is that the need is wide and urgent. Almost all climate scientists wonder at one time or another whether a statistical method they or their students have used in a project
is sound. What are the assumptions and limitations of the method? They may wish to tell
a better story about the method and its results, which are appreciated and agreed upon
by professional statisticians. Students who take statistics courses taught by faculty from
the Department of Statistics often learn very little about real climate datasets, and find that what they have learned is largely disjoint from real climate science problems. For example, a statistics course rarely offers the chance to analyze a dataset of more than 2.0 gigabytes in a .nc file, or to plot 2D or 3D graphics or maps. This is because traditional books are theory based or built on small datasets, partly owing to the once-limited access to laptop computers. Some books are more advanced and research oriented, but they still use figures adopted from publications that are not easily reproducible by readers. Consequently, most students of statistics courses forget what they have learned after the exams, and can hardly tell a story about the usage of statistical methods. We wrote this book to fill this gap by linking assumptions and computing tools to real-world data and practical problems. We intend to train you in useful skills through your interactions with the book, so that you will not feel disconnected or unmotivated, because you have a chance to interact with the book and work on your own datasets.
Both authors have taught climate statistics courses many times in the USA, Canada,
Japan, and China. In the earlier years, we taught using the old method, although we once in
a while used some real climate data. We learned the shortcomings of the traditional method
of teaching: one-directional instruction and lecture from professors to students. One of us
(S. S. P. S.) changed the one-directional method to the interaction method of instruction
in 2015, when it was feasible to require every student to bring a laptop computer to the
classroom. This interactive learning has helped students effectively in the job market. The
method has already been used by another book authored by S. S. P. S. with Richard C. J.
Somerville, Climate Mathematics: Theory and Applications, also published by Cambridge
University Press in 2019.
We wrote this statistics and data visualization book partly because of our experience in
research and teaching. The book is built on class notes we wrote at Texas A&M University
(eight times taught as a semester-long course by G. R. N.), University of Alberta (twice
taught as a semester-long course by S. S. P. S.), University of Tokyo (once taught as a short
course on error estimation by S. S. P. S.), the NOAA National Centers for Environmental Information (once taught as a short course mainly about R and its use in analyzing cli-
mate data with different formats and visualizing the analysis results by S. S. P. S.), and
the Chinese Academy of Sciences (once taught as a short course mainly about data visu-
alization by R by S. S. P. S.). The courses were at the graduate level, and the audiences were graduate students majoring not only in atmospheric and oceanic sciences, but also in engineering, mathematics, and statistics. Some audience members were climate research professionals.
The main purpose of the courses was to prepare students to correctly and efficiently use
statistical theory and methods to analyze climate data, visualize the data, and interpret data
analysis results.
Another motivation for writing the current book is the set of problems prevalent in traditional statistics courses for today's climate science. Climate science students may find traditional statistics courses inadequate, as such courses often focus on the mathematical power of statistical theories, but not much on practical data analysis examples, and even less, or not at all, on the visualization of modern climate data. Such a course does not engage with
the students’ research work and their career. For example, students need to plot data in
both 2D and 3D, and compute the empirical orthogonal functions for large data matrices.
Our book fills the gap and provides both computer codes and data that allow students to carry out analysis and visualization, and to modify our codes for their research and jobs. Our book may be considered a toolbox for climate data. Because we purposely limited the book size to avoid overwhelming the audience with a thick book, we leave some computer codes on our book's website www.climatestatistics.org, including
tutorials for R and Python, and some examples. The freely available toolbox on our website
will be convenient to use and will be updated with new data, computer codes, additional
exercise problems, and other resources to empower your interaction with the book. The
toolbox also allows a reader to try our learning-by-doing method first, before purchasing
our book.
We follow the education theory of learning-by-doing, which in our case means using your
computer and our R or Python code to interact with our book and modifying our computer
codes and websites to analyze your own data and generate corresponding figures. Learning-
by-doing is the core methodology of the progressive education of John Dewey (1859–
1952), an American philosopher and educator. Dewey felt that the experience of students
and teachers together yields extra value for both. Instructors are not just to lecture and project authority; instead, they are to collaborate with students and guide them to gain experience in solving problems of their interest. Although Dewey's education theory was
established initially for schoolchildren, we feel that the same is applicable to undergraduate
and graduate students. Our way of learning-by-doing is to enable students to use R or
Python code and other resources in the book and its website to reproduce the figures and
numerical results in the book. Further, students can modify the computer code and solve
their own problems, such as visualizing the climate data in a similar format and of their
own interest, or analyzing their own data using a similar or a modified method. Thus,
audience interaction is the main innovative feature of our book, allowing the audience to
gain experience of practicing, thinking, applying, and consequently understanding. The
ancient Chinese educator Confucius (551–479 BC) said, “I hear, and I forget; I see, and I
remember; and I do, and I understand.” Although John Dewey and Confucius were more
than 2,000 years apart, they had a similar philosophy of learning-by-doing.
As illustrated in Figure 0.1, our pedagogy has three stages: do, reflect, and apply. Coau-
thor S. S. P. S. practiced this pedagogy in recent years. He presents a question or a problem
at the beginning of a class. Then he asks students to orally ask the same question, or
describe the same problem, or answer his question in their own words. For example, why is
the tiny amount of carbon dioxide in the atmosphere important for climate change? Next,
he and students search and re-search for data and literature, work on the observed carbon
dioxide data at Mauna Loa using computer data visualization, and discuss the structure of
greenhouse gasses whose molecules have three or more atoms. To understand the data bet-
ter, they use more data analysis methods, such as the time series decomposition. Next, he
encourages his students to share this experience with their grandparents, other family mem-
bers, or friends. Finally, students apply the skills gained to solve their own problems with
their own data, by doing homework, working on projects, finding employment, or making
innovations. In this cycle, students have gathered experience and skills to improve their
life and to engage with the community. In the short term, students trained in this progres-
sive cycle of learning-by-doing have a better chance of becoming good problem solvers,
smooth story narrators, and active leaders in a research project in a lab or an internship
company, and consequently of becoming competitive in the job market. In the long term, students trained in this cycle are likely to become life-long learners and educators. John
Dewey said: “Education is not preparation for life but life itself.” We would like to modify
this to “Education is not only preparation for a better life, but also is life itself.”
Dewey’s progressive education theory is in a sharp contrast to the traditional learning
process based on the logic method, which aims at cultivating high-achieving scholars.
The commonly used pedagogy of lecture-read-homework-exam often uses the logic-based
approach. The culmination of logic-based education is that the instructors have the pleasure of presenting their methods and theories, while creative students go on to produce new or better theories. Many outstanding scholars went through their education this way,
and many excellent textbooks were written for this approach.

[Figure 0.1 here: a pie diagram of the learning cycle with five stages — 1. Goal-set: describe the problem, ask questions; 2. Do: research, assumptions, formulas, solutions; 3. Reflect: share and tell, experience, feedback, enjoy; 4. Succeed: apply skills, examples, generalization, innovations; 5. Apply: employment, creation, leadership, reward.]

Figure 0.1 John Dewey's pedagogy of learning-by-doing: A cycle from a problem to skill training via solving a problem, to reflection, to success and applications, and to reward and new problems. It is a cycle of understanding and progress. The size of each piece of the "Do, Reflect, Apply" pie approximates the proportion of the learning effort. "Do" is emphasized here, indicated by the largest piece of the pie.

However, our book does not
follow this approach, since our book is written for the general population of students, not
just for the elite class or even scholars-to-be. If you wish to be such an ambitious scholar,
then you may use our book very differently: you can read the book quickly and critically
for gaining knowledge instead of skills, and skip our reader–book interaction functions.
Our pedagogy is result-oriented. Using car drivers and car mechanics as metaphors, our book targets the 99% or more of readers who are car drivers, and provides some clues for the less than 1% of the audience who wish to become car mechanics. Good drivers understand the limits
of a car, are sensitive to abnormal noise or motion of a car, and can efficiently and safely
get from A to B. Therefore, this book emphasizes (i) assumptions of a method, (ii) core and
concise formulas, (iii) product development by a computer, and (iv) result interpretation.
This is in contrast with traditional books, which are often in an expanded format of (ii) with
many mathematical derivations, and challenge the mathematical capability of most readers.
Our philosophy is to minimize the mathematical challenge, but to deepen the understanding of the ideas using visual tools and storytelling. Our audience will be able
to make an accurate problem statement, ask pointed questions, set up statistical models,
solve the models, and interpret solutions. Therefore, instead of aiming our training at the
few “mechanics” and thus incurring a high risk of failure, we wish to train a large number
of good “drivers” with a large probability of success.
The book has an extensive coverage of statistical methods useful in modern climate
science, ranging from probability density functions, machine learning basics, modern mul-
tivariate regression, and climate data analysis examples, to a variety of computer codes
written in R and Python for climate data visualization. The details are listed in the Table of
Contents. Each chapter starts with an outline of the materials and ends with a summary of
the methods discussed in the chapter. Examples are included for each major concept. The
website for this book www.climatestatistics.org is cross-linked with the CUP site
www.cambridge.org/climatestatistics.
Material-wise, our book has the following four distinct features:
(1) The book treats statistics as a language for every climate scientist to master. The book
includes carefully selected statistical methods for modern climate science students.
Every statistics formula in the book has a climate science interpretation, and every
core climate data analysis is rigorously examined with statistical assumptions and
limitations.
(2) The book describes a complete procedure for statistical modeling for climate data.
The book discusses random fields, their meaning in statistical models of climate data,
model solutions and interpretations, and machine learning.
(3) The book includes free computer codes in both R and Python for readers to conven-
iently visualize the commonly used climate data. The book has computer codes to
analyze NCEP/NCAR Reanalysis data for EOF patterns, and has examples to analyze
the NOAAGlobalTemp data for global warming. The book website contains many
online resources, such as an instructors’ guide, PowerPoint slides, images, and tutori-
als for R and Python coding. Readers can use the given computer codes to reproduce
and modify all the figures in the book, and generate high-quality figures from their
own datasets for reports and publications.
(4) This textbook has a large number of exercise problems and a solutions manual. Every chapter has at least 20 exercise problems for homework, tests, and term projects. A solutions manual with both R and Python codes is available to instructors from Cambridge University Press.
Acknowledgments
We thank our employers San Diego State University, University of Alberta, Texas A&M
University, and NASA Goddard Space Flight Center for their support of our teaching and
research on the materials included in this book. We appreciate that our employers gave
us freedom to lecture using the many non-traditional materials included in this book. Of
course, we are indebted to the students in our classes and labs who motivated us to teach
better. The interactions with students gave us much joy when developing materials for this
book and some students contributed directly. For example, Kaelia Okamura led the Python
code development based on the original R code written by Samuel Shen; Kaelia Oka-
mura, Thomas Bui, Danielle Lafarga, Joaquin Stawsky, Ian Ravenscroft, Elizabeth Reddy,
Matthew Meier, Tejan Patel, Yiyi Li, Julien Pierret, Ryan Lafler, Anna Wade, Gabriel
Smith, Braandon Gradylagrimas, and many other students contributed to the development
of the computer code, book website, solutions manuals, and datasets. Several students in
this group were awarded fellowships by the NOAA Education Partnership Program with
Minority Serving Institutions, USA. We thank NOAA for this support.
We thank the friendly editorial team of Cambridge University Press (CUP) for their pro-
fessional and high-quality work. In particular, we thank our book editor Matt Lloyd for
his maximum patience with our writing progress, his professional guidance on the fig-
ure styles, and his advice on the book audience. We thank Ursula Acton for her excellent
text copy-editing. We thank Reshma Xavier and Sapphire Duveau for their efficient man-
agement of the production of our book. We thank Sarah Lambert and Rowan Groat for
their clear instructions for us to prepare the supplementary materials for the book. Our
interactions with CUP were always pleasant and productive.
How to Use This Book
The course prerequisite for this book is one semester of calculus and one semester of
linear algebra. These two prerequisite courses are included in the courses of a standard
undergraduate curriculum of atmospheric science, oceanography, or climate science.
In lieu of these two courses for satisfying the prerequisite, you can also use chapters
3 and 4, and appendices of the book Climate Mathematics: Theory and Applications by
Samuel S. P. Shen and Richard C. J. Somerville, Cambridge University Press, 2019, 391pp.
In addition, more than 50% of this book, including Chapters 1–4 and most of Chapters 7 and 9, requires only high school mathematics. These materials are sufficient for a one-semester course on computer programming and data science for students in majors that do not require calculus or linear algebra; geography, environmental science, and sustainability are often such majors.
How Each Audience Group Can Make the Best Use of This Book
The book can be used as the textbook for a variety of courses, such as Statistical Meth-
ods [for Climate Science], Statistics and Data Visualization [for Climate Science], Climate
Data Analysis, Climate Statistics, and Climate Data Visualization. The course can be taught
at graduate student, advanced undergraduate student, and climate science professional lev-
els. The book can also be used as a reference manual or a toolkit for a lab. The potential
audience of this book and their purposes may be classified into the following groups.
User Group I: Textbook for an upper division undergraduate course with a title
like Climate Statistics, or Data Analysis, or Climate Data Science, or Computer
Programming and Data Analysis, or Statistical Methods in Atmospheric Sciences.
The course can use most of the materials in the book but exclude more advanced
topics, such as Sections 4.2, 4.3, 6.4, 6.5, 7.5–7.7, 8.3, and 8.4.
User Group II: Textbook for a graduate course with some undergraduate
seniors enrolled. The course may be titled as in Group I, or with the word "Advanced," e.g., Advanced Climate Data Science. The instructor might choose more advanced topics and omit some elementary sections that the students are already familiar with or can quickly review by themselves. The elementary sections include Sections 2.1, 4.1, and 5.1–5.4.
User Group III: Data science skills may help students in some traditional climate-
related humanities majors to find employment. Data science courses, such as Python
Programming and Data Science or Python Programming for Environmental Data,
have been listed as electives. These students may not have mastered the methods
of calculus and linear algebra. These courses may use materials from Chapters 1 to
4, and most of Chapters 7 and 9.
User Group IV: Textbook for a short course or workshop for climate scientists
with a focus on data analysis and visualization. The instructor may select materials
based on the purpose of the short course. For example, if the course is The Basics
of Machine Learning for Climate Science, then use Chapter 9; if it is The Basics
of Climate Data Visualization, then use Chapter 1; and if it is Python Coding for
Climate Data, then select examples in Chapters 1, 4, 8, and 9.
User Group V: Reference manual for climate professionals who wish to use this
book to train their junior colleagues in their labs for data analysis, or use this book
as a toolkit for their research. For example, when you wish to use the Kendall tau
test for statistical significance, you can find a numerical example and its Python or
R code in this book; when your students wish to analyze and plot a certain dataset,
you may refer them to this book to find proper examples and computer codes.
These are to make your work easy and efficient, which is a practical purpose we
bore in mind when writing this book.
User Group VI: Some senior or retired scientists and engineers may wish to keep
their minds sharp by playing with R or Python codes and climate data, such as the
CO2 concentration data shown in Figure 7.3. Instead of learning many statistical
concepts and methods, this book will allow them to play with a few computer codes
and generate some interesting figures (e.g., Figs. 7.3 and 8.4) to discuss with their
friends.
This is an interactive textbook designed for readers in the digital age. The book con-
tains many R and Python codes, which can also be found at the book’s website
www.climatestatistics.org. The relevant climate data used in the computer codes can
also be downloaded directly from the website. An efficient way of learning from this book
is by doing: to reproduce the figures and numerical results using the computer codes from
either the book or its website. You can modify the code to alter the figures for your purposes
and your datasets. These figures and computer codes can help you practice what you have
learned and better understand the ideas and theory behind each method. We emphasize the
value of practice in acquiring quantitative analysis skills.
For instructors of this textbook, we have tried to make this book user-friendly for both
students and instructors. As an instructor, you can simply follow the book when teaching
in a classroom, and you can assign the exercise problems at the end of each chapter. A
solutions manual in R and Python with answers for all the problems is available through
the publisher to qualified instructors.
1 Basics of Climate Data Arrays, Statistics, and Visualization
People talk about climate data frequently, read or imagine climate data, and yet rarely play
with or use climate data, because they often think it takes a computer expert to do that.
However, that is changing. With today’s technology, anyone can use a computer to play
with climate data, such as a sequence of temperature values of a weather station at different observation times; a matrix of data for a station with temperature, air pressure, precipitation, wind speed, and wind direction at different times; or an array of temperature data on a 5-degree latitude–longitude grid for the entire world for different months. The first is a vector, the second is a variable-time matrix, and the third a space-time 3-dimensional array. When
considering temperature variation in time at different air pressure levels and different water
depths, we need to add one more dimension: altitude. The temperature data for the Earth's ocean and atmosphere thus form a 4-dimensional array, with 3D space and 1D time. This
chapter attempts to provide basic statistical and computing methods to describe and visu-
alize some simple climate datasets. As the book progresses, more complex statistics and
data visualization will be introduced.
We use both R and Python computer codes in this book for computing and visualization.
Our method descriptions are stated in R. The Python code corresponding to each R code is included in a box with a light yellow background. You can also learn the two computer languages and
their applications to climate data from the book Climate Mathematics: Theory and Appli-
cations (Shen and Somerville 2019) and its website www.climatemathematics.org.
The climate data used in this book are included in the data.zip file downloadable from
our book website www.climatestatistics.org. You can also obtain the updated data
from the original data providers, such as www.esrl.noaa.gov and www.ncei.noaa.gov.
After studying this chapter, a reader should be able to analyze simple climate datasets,
compute data statistics, and plot the data in various ways.
In a list of popular climate datasets, the global average annual mean surface air tempera-
ture anomalies might be on top. Here, an anomaly means the temperature departure from the normal temperature, i.e., the climatology. Climatology is usually defined as the mean of
temperature data in a given period of time, such as from 1971 to 2020. Thus, temperature
anomaly data are the differences of the temperature data minus the climatology.
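As a minimal numerical illustration of this definition (a sketch with made-up values, not data from the book), the anomaly computation in R is simply a subtraction:
# hypothetical annual mean temperatures in degrees C; illustrative values only
tmean <- c(14.1, 14.3, 13.9, 14.5, 14.2)
climatology <- mean(tmean)        # mean over the chosen base period
anomaly <- tmean - climatology    # departures from the climatology
anomaly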
This section will use the global average annual mean surface air temperature anomaly
dataset as an example to describe some basic statistical and computing methods.
The data are part of the dataset named the NOAA Merged Land Ocean Global Sur-
face Temperature Analysis (NOAAGlobalTemp) V4. The dataset was generated by the
NOAA National Centers for Environmental Information (NCEI) (Smith et al. 2008; Vose
et al. 2012). Here, NOAA stands for the United States National Oceanic and Atmospheric
Administration.
The anomalies are with respect to the 1971–2000 climatology, i.e., 1971–2000 mean. An
anomaly of a weather station datum is defined by the datum minus the station climatology.
The first anomaly datum, −0.37°C, indexed by [1] in the above data table, corresponds
to 1880 and the last to 2018, a total of 139 years. The last row is indexed from [133] to
[139].
One might be interested in various kinds of statistical characteristics of the data, such
as mean, variance, standard deviation, skewness, kurtosis, median, 5th percentile, 95th
percentile, and other quantiles. Is the data’s probabilistic distribution approximately nor-
mal? What does the box plot look like? Are there any outliers? What is a good graphic
representation of the data, i.e., popularly known as a climate figure?
When considering global climate changes, why do scientists often use anomalies, instead
of the full values directly from the observed thermometer readings? This is because the
observational estimates of the global average annual mean surface temperature are less
accurate than the similar estimates for the changes from year to year. There is a con-
cept of characteristic spatial correlation length scale for a climate variable, such as surface
temperature. The length scale is often computed from anomalies.
The use of anomalies is also a way of reducing or eliminating individual station biases.
A simple example of such biases is that due to station location, which is usually fixed
in a long period of time. It is easy to understand, for instance, that a station located in
the valley of a mountainous region might report surface temperatures that are higher than
the true average surface temperature for the entire region and cannot be used to describe
the behavior of climate change in the region. However, the anomalies at the valley station
may synchronously reflect the characteristics of the anomalies for the region. Many online
materials give justifications and examples on the use of climate data anomalies, e.g., NOAA
NCEI (2021).
The global average annual mean temperature anomalies quoted here are also impor-
tant for analyzing climate simulations. When we average over a large scale, many random
errors cancel out. When we investigate the response to such large-scale perturbations as the variations of solar brightness or atmospheric carbon dioxide, these averaged data can help
validate and improve climate models. See the examples in the book by Hennemuth et al.
(2013) that includes many statistical analyses of both observed and model data.
Many different ways have been employed to visualize the global average annual mean
temperature anomalies. The following three are popular ones appearing in scientific and
news publications: (a) a simple point-line graph, (b) a curve of staircase steps, and (c)
a color bar chart, as shown in Figures 1.1–1.3. This subsection shows how to generate
these figures by R and Python computer programming languages. The Python codes are in
yellow boxes. To download and run the codes, visit www.climatestatistics.org.
Figure 1.1 Point-line graph of the 1880–2018 global average annual mean temperature anomalies with respect to the
1971–2000 climatology, based on the NOAAGlobalTemp V4 data.
Figure 1.1 is a simple point-line graph of the global average annual mean temperature anomalies. It is plotted from the NOAAGlobalTemp V4 data quoted in Section 1.1.1. The figure can be generated by the following computer code.
# R plot Fig. 1.1: A simple line graph of data
# go to your working directory
setwd("/Users/sshen/climstats")
# read the data file from the folder named "data"
NOAAtemp = read.table(
  "data/aravg.ann.land_ocean.90S.90N.v4.0.1.201907.txt",
  header = FALSE) # read from the data folder
dim(NOAAtemp) # check the data matrix dimension
# [1] 140   6   # 140 years from 1880 to 2019
# 2019 will be excluded since data are only up to July 2019
# col1 is year, col2 is anomalies, col3-6 are data errors
par(mar = c(3.5, 3.5, 2.5, 1), mgp = c(2, 0.8, 0))
plot(NOAAtemp[1:139, 1], NOAAtemp[1:139, 2],
     type = "l", col = "brown", lwd = 3,
     cex.lab = 1.2, cex.axis = 1.2,
     main = "Global Land-Ocean Average Annual Mean
     Surface Temperature Anomalies: 1880-2018",
     xlab = "Year",
     ylab = expression(
       paste("Temperature Anomaly [", degree, "C]")))
Figure 1.2 Staircase chart of the 1880–2018 global average annual mean temperature anomalies.
Figure 1.3 Color bar chart of the 1880–2018 global average annual mean temperature anomalies.
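The book's original code for Figures 1.2 and 1.3 is available on the book website; the following minimal R sketch (our simplified version, using the NOAAtemp data frame read above and plainer styling than the published figures) shows the two plot types:
# Staircase chart: a step-type line plot (cf. Fig. 1.2)
yr <- NOAAtemp[1:139, 1]; anom <- NOAAtemp[1:139, 2]
plot(yr, anom, type = "s",
     xlab = "Year", ylab = "Temperature Anomaly [deg C]",
     main = "Global Temperature Anomalies: Staircase Chart")
# Color bar chart: red bars for positive anomalies, blue for negative (cf. Fig. 1.3)
barplot(anom, names.arg = yr, border = NA,
        col = ifelse(anom >= 0, "red", "blue"),
        xlab = "Year", ylab = "Temperature Anomaly [deg C]",
        main = "Global Temperature Anomalies: Color Bar Chart")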
The commonly used basic statistical indices include mean, variance, standard devia-
tion, skewness, kurtosis, and quantiles. We first use R to calculate these indices for the
global average annual mean temperature anomalies. Then we describe their mathematical
formulas and interpret the numerical results.
# R code for computing statistical indices
setwd("/Users/sshen/climstats")
NOAAtemp = read.table(
  "data/aravg.ann.land_ocean.90S.90N.v4.0.1.201907.txt",
  header = FALSE)
temp2018 = NOAAtemp[1:139, 2] # use the temp data up to 2018
head(temp2018) # show the first six values
# [1] -0.370221 -0.319993 -0.320088 -0.396044 -0.458355 -0.470374
mean(temp2018) # mean
# [1] -0.1858632
sd(temp2018) # standard deviation
# [1] 0.324757
var(temp2018) # variance
# [1] 0.1054671
library(e1071)
# This R library is needed to compute the following parameters
# install.packages("e1071") # if it is not in your computer
skewness(temp2018)
# [1] 0.7742704
kurtosis(temp2018)
# [1] -0.2619131
median(temp2018)
# [1] -0.274434
quantile(temp2018, probs = c(0.05, 0.25, 0.75, 0.95))
#         5%         25%        75%        95%
# -0.5764861  -0.4119770  0.0155245  0.4132383
We use $x = \{x_1, x_2, \ldots, x_n\}$ to denote the sampling data for a time series. The statistical indices computed by this R code are based on the following mathematical formulas for mean, variance, standard deviation, skewness, and kurtosis:
\begin{align}
\text{Mean:} \quad & \mu(x) = \frac{1}{n}\sum_{k=1}^{n} x_k, \tag{1.3}\\
\text{Variance by unbiased estimate:} \quad & \sigma^2(x) = \frac{1}{n-1}\sum_{k=1}^{n}\big(x_k - \mu(x)\big)^2, \tag{1.4}\\
\text{Standard deviation:} \quad & \sigma(x) = \big(\sigma^2(x)\big)^{1/2}, \tag{1.5}\\
\text{Skewness:} \quad & \gamma_3(x) = \frac{1}{n}\sum_{k=1}^{n}\left(\frac{x_k - \mu(x)}{\sigma}\right)^{3}, \tag{1.6}\\
\text{Kurtosis:} \quad & \gamma_4(x) = \frac{1}{n}\sum_{k=1}^{n}\left(\frac{x_k - \mu(x)}{\sigma}\right)^{4} - 3. \tag{1.7}
\end{align}
A quantile cuts a sorted sequence of data. For example, the 25th quantile, also called 25th
percentile or 25% quantile, is the value at which 25% of the sorted data is smaller than this
value and 75% is larger than this value. The 50th percentile is also known as the median.
# Python: percentiles (assumes numpy is imported as np and temp2018
# holds the 139 anomaly values, as in the full script)
import numpy as np
print("The 5th, 25th, 75th and 95th percentiles are:")
probs = [5, 25, 75, 95]
print([round(np.percentile(temp2018, p), 5) for p in probs])
The meaning of these indices may be explained as follows. The mean is the simple
average of samples. The variance of climate data reflects the strength of variations of a
climate system and has units of the square of the data entries, such as [°C]². You may have
noticed the denominator n − 1 instead of n, which is for an estimate of unbiased sample
variance. The standard deviation describes the spread of the sample entries and has the
same units as the data. A large standard deviation means that the samples have a broad
spread.
Skewness is a dimensionless quantity. It measures the asymmetry of sample data. Zero
skewness means a symmetric distribution about the sample mean. For example, the skew-
ness of a normal distribution is zero. Negative skewness denotes a skew to the left, meaning
the existence of a long tail on the left side of the distribution. Positive skewness implies a
long tail on the right side.
The words “kurtosis” and “kurtic” are Greek in origin and indicate peakedness. Kurtosis
is also dimensionless and indicates the degree of peakedness of a probability distribution.
The kurtosis of a normal distribution1 is zero when 3 is subtracted as in Eq. (1.7). Positive
kurtosis means a high peak at the mean, thus the distribution shape is slim and tall. This
is referred to as leptokurtic. “Lepto” is Greek in origin and means thin or fine. Negative
kurtosis indicates a low peak at the mean, thus the distribution shape is fat and short,
referred to as platykurtic. “Platy” is also Greek in origin and means flat or broad.
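As a cross-check of Eqs. (1.3)–(1.7), the following short R sketch (our addition, assuming temp2018 has been defined as in the code above) computes the same indices directly from the formulas; the results should match the values printed earlier:
n <- length(temp2018)
mu <- mean(temp2018)                              # Eq. (1.3)
varx <- sum((temp2018 - mu)^2) / (n - 1)          # Eq. (1.4), unbiased estimate
sdx <- sqrt(varx)                                 # Eq. (1.5)
skew <- sum(((temp2018 - mu) / sdx)^3) / n        # Eq. (1.6)
kurt <- sum(((temp2018 - mu) / sdx)^4) / n - 3    # Eq. (1.7)
c(mean = mu, variance = varx, sd = sdx, skewness = skew, kurtosis = kurt)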
For the 139 years of the NOAAGlobalTemp global average annual mean temperature anomalies, the mean is −0.1859°C, which means that the 1880–2018 average is lower than the average during the climatology period 1971–2000. During the climatology period, the temperature anomaly average is approximately zero, and can be computed by the R command mean(temp2018[92:121]).
The variance of the data in the 139 years from 1880 to 2018 is 0.1055 [°C]², and the corresponding standard deviation is 0.3248 [°C]. The skewness is 0.7743, meaning the data are skewed with a long tail on the right, thus with more extreme high temperatures than extreme low temperatures, as shown by Figure 1.4. The kurtosis is −0.2619, meaning the distribution is flatter than a normal distribution, as shown in the histogram in Figure 1.4.
The median is −0.2744°C and is a number characterizing a set of samples such that 50%
of the sample values are less than the median, and another 50% are greater than the median.
To find the median, sort the samples from the smallest to the largest. The median is then
the sample value in the middle. If the number of the samples is even, then the median is
equal to the mean of the two middle sample values.
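A tiny R illustration of this sorting rule, with four made-up values (our own toy example):
x <- c(0.3, -0.2, 0.1, -0.5)   # four toy values
sort(x)                        # -0.5 -0.2  0.1  0.3
median(x)                      # (-0.2 + 0.1)/2 = -0.05, the mean of the two middle values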
1 Chapter 2 will have a detailed description of the normal distribution and other probabilistic distributions.
Quantiles are defined in a way similar to median by sorting. For example, 25-percentile
(also called the 25th percentile) −0.4120°C is a value such that 25% of sample data are
less than this value. By definition, 60-percentile is thus larger than 50-percentile. Here, per-
centile is a description of quantile relative to 100. Obviously, 100-percentile is the largest
datum, and 0-percentile is the smallest one. Often, a box plot is used to show the typical
quantiles (see Fig. 1.5).
The 50-percentile (or 50th percentile) −0.2744°C is equal to the median. If the distribution is symmetric, then the median is equal to the mean. Otherwise, these two quantities are not equal. If the skew is to the right, then the mean is on the right of the median: the mean is greater than the median. If the skew is to the left, then the mean is on the left of the median: the mean is less than the median. Our 139 years of temperature data are right skewed and have mean equal to −0.1859°C, greater than their median equal to −0.2744°C.
We will use the 139 years of NOAAGlobalTemp temperature data and R to illustrate some
commonly used statistical figures: histogram, boxplot, Q-Q plot, and linear regression trend
line.
Figure 1.4 Histogram and its normal distribution fit of the global average annual mean temperature anomalies from 1880 to
2018. Each small interval in the horizontal coordinate is called a bin. The frequency in the vertical coordinate is the
number of temperature anomalies in a given bin. For example, the frequency for the bin [−0.5, −0.4]°C is 17.
The shape of the histogram agrees with the characteristics predicted by the statistical
indices in the previous subsection:
(i) The distribution is asymmetric and skewed to the right with skewness equal to
0.7743.
(ii) The distribution is platykurtic with a kurtosis equal to −0.2619, i.e., it is flatter than
the standard normal distribution indicated by the blue curve.
Figure 1.5 is the box plot of the 1880-2018 NOAAGlobalTemp global average annual mean
temperature anomaly data, and can be made from the following R command.
# R plot Fig. 1.5: Box plot
boxplot(NOAAtemp[1:139, 2], ylim = c(-0.8, 0.8),
        ylab = expression(paste(
          "Temperature anomalies [", degree, "C]")),
        width = NULL, cex.lab = 1.2, cex.axis = 1.2)
Figure 1.5 Box plot of the global average annual mean temperature anomalies from 1880 to 2018.

The box's lower boundary is the first quartile, i.e., the 25th percentile −0.4120°C, and its upper boundary is the third quartile, i.e., the 75th percentile 0.0155°C. The lower whisker extends down to −0.6872°C. The points outside the two whiskers are considered outliers. Our dataset has two outliers, which are 0.6607 and 0.7011°C, and are denoted by two small circles in the box plot. The two outliers occurred in 2015 and 2016, respectively.
Figure 1.6 shows quantile-quantile (Q-Q) plots, also denoted by q-q plots, qq-plots, or
QQ-plots.
Figure 1.6 Black empty-circle points are the Q-Q plot of the standardized global average annual mean temperature anomalies versus standard normal distribution. The purple points are the Q-Q plot for the data simulated by rnorm(139). The red is the distribution reference line of N(0, 1).
The function of a Q-Q plot is to compare the distribution of a given set of data with a
specific reference distribution, such as a standard normal distribution with zero mean and
standard deviation equal to one, denoted by N(0, 1). A Q-Q plot lines up the percentiles of
data on the vertical axis and the same number of percentiles of the specific reference dis-
tribution on the horizontal axis. The pairs of the quantiles (xi , yi ), i = 1, 2, . . . , n determine
the points on the Q-Q plot. Here, xi and yi correspond to the same cumulative percentage
or probability pi for both x and y variables, where pi monotonically increases from approx-
imately 0 to 1 as i goes from 1 to n. A red Q-Q reference line is plotted as if the vertical
axis values are also the quantiles of the given specific distribution. Thus, the Q-Q reference
line should be diagonal.
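To make this construction concrete, the following R sketch (our addition, not the book's code) builds the Q-Q points by hand: the sorted standardized anomalies supply the sample quantiles on the vertical axis, and qnorm applied to the plotting positions ppoints(n) supplies the matching N(0, 1) quantiles on the horizontal axis. The base R function qqnorm used in the code below performs essentially the same construction.
temp2018 <- NOAAtemp[1:139, 2]
tstand <- (temp2018 - mean(temp2018)) / sd(temp2018)  # standardized anomalies
n <- length(tstand)
p <- ppoints(n)               # n cumulative probabilities between 0 and 1
xq <- qnorm(p)                # N(0,1) quantiles: horizontal axis
yq <- sort(tstand)            # sample quantiles: vertical axis
plot(xq, yq, xlab = "Quantile of N(0,1)",
     ylab = "Quantile of temperature anomalies")
abline(0, 1, col = "red")     # the N(0,1) reference line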
The black empty circles in Figure 1.6 compare the quantiles of the standardized global average annual mean temperature anomalies marked on the vertical axis with those of the standard normal distribution marked on the horizontal axis. The standardized anomalies are the anomalies minus their mean, divided by the sample standard deviation. The purple dots show a Q-Q plot of a set of 139 random numbers simulated by the standard normal distribution.
As expected, the simulated points are located close to the red diagonal line, which is the
distribution reference line of N(0, 1). On the other hand, the temperature Q-Q plot shows a
considerable degree of scattering of the points away from the reference line. We may intu-
itively conclude that the global average annual temperature anomalies from 1880 to 2018
are not exactly distributed according to a normal (also known as Gaussian) distribution.
However, we may also conclude that the distribution of these temperatures is not very far
away from the normal distribution either, because the points on the Q-Q plot are not very
far away from the distribution reference line, and also because even the simulated N(0, 1)
points are noticeably off the reference line for the extremes.
Figure 1.6 can be generated by the following computer code.
# R plot Fig. 1.6: Q-Q plot for the standardized
# global average annual mean temperature anomalies
temp2018 <- NOAAtemp[1:139, 2]
tstand <- (temp2018 - mean(temp2018)) / sd(temp2018)
set.seed(101)
qn <- rnorm(139)   # simulate 139 points by N(0,1)
qns <- sort(qn)    # sort the points
qq2 <- qqnorm(qns, col = "blue", lwd = 2)
In the R code, we first standardize (also called normalize) the global average annual
mean temperature data by subtracting the data mean and dividing by the data’s stand-
ard deviation. Then, we use these 139 years of standardized global average annual mean
temperature anomalies to generate a Q-Q plot, which is shown in Figure 1.6.
Climate data analysis often involves plotting a linear trend line for time series data, such as
the linear trend for the global average annual mean surface temperature anomalies, shown
in Figure 1.7. The R code for plotting a linear trend line of data sequence y and time
sequence t is abline(lm(y ~ t)).
Figure 1.7 can be generated by the following computer code.
# R plot Fig. 1.7: Data line graph with a linear trend line
par(mar = c(3.5, 3.5, 2.5, 1), mgp = c(2, 0.8, 0))
plot(NOAAtemp[1:139, 1], NOAAtemp[1:139, 2],
     type = "l", col = "brown", lwd = 3,
     main = "Global Land-Ocean Average Annual Mean
     Surface Temperature Anomaly: 1880-2018",
     cex.lab = 1.2, cex.axis = 1.2,
     xlab = "Year",
     ylab = expression(paste(
       "Temperature Anomaly [", degree, "C]"))
)
abline(lm(NOAAtemp[1:139, 2] ~ NOAAtemp[1:139, 1]),
       lwd = 3, col = "blue")
Figure 1.7 Linear trend line with the 1880–2018 global average annual mean surface temperature based on the
NOAAGlobalTemp V4.0 dataset.
lm(NOAAtemp[1:139, 2] ~ NOAAtemp[1:139, 1])
# (Intercept)  NOAAtemp[1:139, 1]
#  -13.872921            0.007023
# Trend 0.7023 degC/100a
text(1930, 0.5,
     expression(paste("Linear trend: 0.7023",
                      degree, "C/100a")),
     cex = 1.5, col = "blue")
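The slope reported by lm() is in °C per year because the predictor is the year; multiplying by 100 gives the 0.7023°C per century (°C/100a) quoted in the text label. A short sketch (our addition) extracts the slope programmatically:
reg <- lm(NOAAtemp[1:139, 2] ~ NOAAtemp[1:139, 1])
slope_per_year <- coef(reg)[2]                 # about 0.007023 deg C per year
slope_per_century <- 100 * slope_per_year      # about 0.7023 deg C per 100 years
slope_per_century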
1.3 Read netCDF Data File and Plot Spatial Data Maps
Climate data are at spatiotemporal points, such as at the grid points on the Earth’s surface
and at a sequence of time. NetCDF (Network Common Data Form) is a popular file format
for modern climate data with spatial locations and temporal records. The gridded NOAA-
GlobalTemp data have a netCDF version, and can be downloaded from
www.esrl.noaa.gov/psd/data/gridded/data.noaaglobaltemp.html
The data are written into a 3D array, with 2D latitude–longitude for space, and 1D for
time. R and Python can read and plot the netCDF data. We use the NOAAGlobalTemp
as an example to illustrate the netCDF data reading and plotting. Figure 1.8 displays a
temperature anomaly map for the entire globe for December 2015.
Figure 1.8 can be generated by the following computer code.
# R read the netCDF data: NOAAGlobalTemp
setwd("/Users/sshen/climstats")
# install.packages("ncdf4")
library(ncdf4)
nc = ncdf4::nc_open("data/air.mon.anom.nc")
nc  # describes details of the dataset
Lat <- ncvar_get(nc, "lat")
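The book's full reading code continues beyond this point; a hedged sketch of the likely continuation is given below. The variable names "lon" and "time" are our assumptions about the file's contents (check them by printing nc), while the parameter name "air" is the one mentioned in the Panoply instructions later in this section; the array name NOAAgridT matches the Hovmöller code in Section 1.4, and the time index 1632 for December 2015 assumes the monthly records start in January 1880.
# a sketch of the likely continuation (names are assumptions; check with print(nc))
Lon <- ncvar_get(nc, "lon")
Time <- ncvar_get(nc, "time")
NOAAgridT <- ncvar_get(nc, "air")   # anomaly array: lon x lat x time
dim(NOAAgridT)
# December 2015 is month (2015 - 1880)*12 + 12 = 1632 if records start in Jan 1880
mapmat <- NOAAgridT[, , 1632]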
Figure 1.8 The surface temperature anomalies of December 2015 with respect to the 1971–2000 climatology (data source: The NOAAGlobalTemp V4.0 gridded monthly data).
# Python plot of Fig. 1.8 (yellow box): continues from earlier steps in the
# full script that define lat, lon, mapmat, clev, myColMap, fig, and ax
# create map
dmap = Basemap(projection='cyl', llcrnrlat=min(lat),
               urcrnrlat=max(lat), resolution='c',
               llcrnrlon=min(lon), urcrnrlon=max(lon))
# draw coastlines, state and country boundaries, edge of map
dmap.drawcoastlines()
dmap.drawstates()
dmap.drawcountries()
# convert latitude/longitude values to plot x/y values
x, y = dmap(*np.meshgrid(lon, lat))
# draw filled contours
cnplot = dmap.contourf(x, y, mapmat, clev, cmap=myColMap)
# tick marks
ax.set_xticks([0, 50, 100, 150, 200, 250, 300, 350])
ax.set_yticks([-50, 0, 50])
ax.tick_params(length=6, width=2, labelsize=20)
# add colorbar
# pad: distance between map and colorbar
cbar = dmap.colorbar(cnplot, pad="4%", drawedges=True,
                     shrink=0.55, ticks=[-6, -4, -2, 0, 2, 4, 6])
# add colorbar title
cbar.ax.set_title('[$\degree$C]', size=17, pad=10)
cbar.ax.tick_params(labelsize=15)
# add plot title
plt.title('NOAAGlobalTemp Anomalies Dec 2015',
          size=25, fontweight="bold", pad=15)
# label x and y
plt.xlabel('Longitude', size=25, labelpad=20)
plt.ylabel('Latitude', size=25, labelpad=10)
# display on screen
plt.show()
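For readers who prefer to stay in R, a minimal filled-contour sketch of the same December 2015 map is given below (our addition, assuming Lon, Lat, and mapmat from the reading sketch above; it lacks the color scheme and styling of Figure 1.8):
# minimal R sketch of the Dec 2015 anomaly map
library(maps)
mapmat <- pmax(pmin(mapmat, 6), -6)   # clip values to [-6, 6] for plotting
filled.contour(Lon, Lat, mapmat,
  color.palette = colorRampPalette(c("blue", "white", "red")),
  plot.title = title(main = "NOAAGlobalTemp Anomalies Dec 2015",
                     xlab = "Longitude", ylab = "Latitude"),
  plot.axes = {axis(1); axis(2); map("world2", add = TRUE)})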
You can also use the Panoply software package to plot the map (see Fig. 1.9). This is a very
powerful data visualization tool developed by NASA specifically for displaying netCDF
files. The software package is free and can be downloaded from
www.giss.nasa.gov/tools/panoply.
To make a Panoply plot, open Panoply and choose Open from the File menu; a file browser will open, allowing you to navigate to the directory containing the netCDF file you wish to plot. In our case, the file is air.mon.anom.nc. Choose the climate parameter air.
Click on Create Plot. A map will show up. Then you have many choices to modify the
map, ranging from the data Array, Scale, and Map, etc., and finally produce the figure.
You can then tune the figure by choosing different graphics parameters underneath the
figure, such as Array(s) to choose which month to plot, Map to choose the map projection
types and the map layout options, and Labels to type proper captions and labels.
To learn more about Panoply, please use an online Panoply tutorial, such as the NASA
Panoply help page www.giss.nasa.gov/tools/panoply/help/.
Figure 1.9 A Panoply plot of Robinson projection map for the surface temperature anomalies of December 2015.
Compared with the R or Python map in Figure 1.8, the Panoply map may appear more
visually appealing. However, R and Python have the advantage of flexibility
and can deal with all kinds of data. For example, the Plotly graphing library in
R https://plotly.com/r/ and in Python https://plotly.com/python/ can even
make interactive and 3D graphics. You may find some high-quality figures from the Inter-
governmental Panel on Climate Change (IPCC) report (2021) www.ipcc.ch and reproduce
them using the computing tools described here.
1.4 1D-Space-1D-Time Data and Hovmöller Diagram
A very useful climate data visualization technique is the Hovmöller diagram. It displays
how a climate variable varies with respect to time along a given line section of latitude,
longitude, or altitude. It usually has the abscissa for time and ordinate for the line section.
A Hovmöller diagram can conveniently show time evolution of a spatial pattern, such as
wave motion from south to north or from west to east.
Figure 1.10 is a Hovmöller diagram for the sea surface temperature (SST) anomalies
at a longitude equal to 240◦ , i.e., 120◦ W, in a latitude interval [30◦ S, 30◦ N], with time
range from January 1989 to December 2018. When strong warm (red) anomaly strips extend from
south to north, a strong El Niño occurs, such as those in the 1997–1998 and 2015–2016
winters.
Figure 1.10 Hovmöller diagram for the gridded NOAAGlobalTemp monthly anomalies at longitude 120◦ W and a latitude interval
[30◦ S, 30◦ N].
The Hovmöller diagram Figure 1.10 may be plotted by the following computer code.
# R plot Fig. 1.10: Hovmoller diagram
library(maps)
mapmat = NOAAgridT[30, 12:24, 1309:1668]
# Longitude = 240 deg, Lat = [-30, 30] deg
# Time = Jan 1989 - Dec 2018: 30 years
mapmat = pmax(pmin(mapmat, 2), -2)  # put values in [-2, 2]
par(mar = c(4, 5, 3, 0))
int = seq(-2, 2, length.out = 81)
rgb.palette = colorRampPalette(c('black', 'blue',
    'darkgreen', 'green', 'yellow', 'pink', 'red', 'maroon'),
    interpolate = 'spline')
par(mar = c(3.5, 3.5, 2.5, 1), mgp = c(2.4, 0.8, 0))
x = seq(1989, 2018, len = 360)
y = seq(-30, 30, by = 5)
filled.contour(x, y, t(mapmat),
    color.palette = rgb.palette, levels = int,
    plot.title = title(main =
        "Hovmoller diagram of the NOAAGlobalTemp Anomalies",
        xlab = "Time", ylab = "Latitude", cex.lab = 1.2),
    plot.axes = {axis(1, cex.axis = 1.2);
        axis(2, cex.axis = 1.2);
        map('world2', add = TRUE); grid()},
    key.title = title(main =
        expression(paste("[", degree, "C]"))),
    key.axes = {axis(4, cex.axis = 1.2)})
# plot functions
myColMap = LinearSegmentedColormap.from_list(
    name = 'my_list',
    colors = ['black', 'blue', 'darkgreen', 'green', 'lime',
              'yellow', 'pink', 'red', 'maroon'], N = 100)
clev2 = np.linspace(mapmat2.min(), mapmat2.max(), 501)
contf = plt.contourf(time, lat3, mapmat2,
                     clev2, cmap = myColMap)
plt.text(2019.2, 31.5,
         "[$\degree$C]", color = 'black', size = 23)
plt.title("Hovmoller diagram of the\n\
NOAAGlobalTemp Anomalies",
          fontweight = "bold", size = 25, pad = 20)
plt.xlabel("Time", size = 25, labelpad = 20)
plt.ylabel("Latitude", size = 25, labelpad = 12)
colbar = plt.colorbar(contf, drawedges = False,
                      ticks = [-2, -1, 0, 1, 2])
1.5 4D netCDF File and Its Map Plotting
4D climate data have three spatial dimensions plus time. For example, the NCEP
Global Ocean Data Assimilation System (GODAS) monthly water temperature data are
at 40 depth levels ranging from 5 meters to 4,478 meters, at 1/3-degree latitude by
1-degree longitude horizontal resolution, and begin in January 1980. The NOAA-CIRES
20th Century Reanalysis (20CR) monthly air temperature data are at 24 pressure
levels ranging from 1,000 mb to 10 mb, at 2-degree latitude–longitude horizontal
resolution, and begin in January 1851.
GODAS data can be downloaded from NOAA ESRL,
www.esrl.noaa.gov/psd/data/gridded/data.godas.html.
The data for each year are stored in a netCDF file of about 140 MB. The following R code can
read the GODAS 2015 data into R.
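A minimal reading sketch is shown below; the file name data/godas2015.nc and the variable name pottmp (potential temperature) are assumptions that should be verified against the printout of nc.
# A sketch for reading the GODAS 2015 netCDF file
library(ncdf4)
setwd("/Users/sshen/climstats")
nc <- nc_open("data/godas2015.nc")   # file name assumed
nc                                   # list dimensions and variables
lon1 <- ncvar_get(nc, "lon")         # 360 longitudes
lat1 <- ncvar_get(nc, "lat")         # 418 latitudes
level <- ncvar_get(nc, "level")      # 40 depth levels [m]
godasT <- ncvar_get(nc, "pottmp")    # 'pottmp' assumed; 4D array: lon x lat x level x month
dim(godasT)                          # 360 x 418 x 40 x 12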
Figure 1.11 shows the 2015 annual mean water temperature at 195 meters depth based
on the GODAS data. At this depth, the equatorial upwelling is apparent: cooler water
from the deep ocean rises in the equatorial region and cools the water there. As Figure 1.11
shows, the equatorial water at this level is no longer the warmest and is cooler than the
water in some subtropical regions.
Figure 1.11 can be generated by the following computer code.
# R plot Fig. 1.11: The ocean potential temperature
# the 20th layer from surface: 195 meters depth
# compute 2015 annual mean temperature at 20th layer
library(maps)
climmat = matrix(0, nrow = 360, ncol = 418)
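The remainder of the script is not reproduced here. A possible continuation, which averages the 20th depth layer over the 12 months of 2015 and maps the result, is sketched below; it assumes the array godasT from the reading sketch above.
# average the 20th depth layer (about 195 m) over the 12 months of 2015
for (m in 1:12) {
  climmat = climmat + godasT[, , 20, m]/12
}
filled.contour(lon1, lat1, climmat,
               color.palette = colorRampPalette(c("blue", "yellow", "red")),
               plot.title = title(main = "GODAS 2015 annual mean temperature at 195 m depth",
                                  xlab = "Longitude", ylab = "Latitude"))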
Figure 1.11 The 2015 annual mean water temperature at 195 meters depth based on GODAS data.
1.6 Paraview, 4DVD, and Other Tools
Besides using R, Python, and Panoply to plot climate data, other software packages
may also be used to visualize data for some specific purposes, such as Paraview for 3D
visualization and 4DVD for fast climate diagnostics and data delivery.
1.6.1 Paraview
1.6.2 4DVD
4DVD (4-dimensional visual delivery of big climate data) is a fast data visualization and
delivery software system at www.4dvd.org. It optimally harnesses the power of distributed
computing, databases, and data storage to allow a large number of general public users
to quickly visualize climate data. For example, teachers and students can use 4DVD to
Figure 1.12 A 3D view of the 2015 annual mean water temperature based on the GODAS data, showing the surface of the
Northern Hemisphere and a cross-sectional map along the equator from the surface down to 2,000 meters depth. The
dark color indicates land; the red, yellow, and blue colors indicate temperature.
Figure 1.13 The 4DVD screenshot for the NOAA-CIRES 20th Century Reanalysis temperature at 750 millibar height level for June
2014.
visualize climate model data in classrooms and download the visualized data instantly.
The 4DVD website has a tutorial for users.
Here, we provide an example of 4DVD visualization of the NOAA-CIRES 20CR climate
model data for atmosphere. The 4DVD can display a temperature map at a given time and
given pressure level, as shown in Figure 1.13 for January 1851 and 750 millibar pressure
level, or one can obtain a time series of the monthly air temperature for a specific grid box
from January 1851. In fact, 4DVD can show multiple time series for the same latitude-
longitude location but at different pressure levels. The 4DVD allows a user not only to
visualize the data but also to download the data behind the displayed figure. In
this sense, 4DVD is like a machine that plays data, whereas a regular DVD player,
popular for about 30 years since the 1980s, plays DVD discs with music and movies.
Besides R, Python, Panoply, ParaView, and 4DVD, there are many other data visualization
and delivery software systems. A few popular and free ones are listed as follows.
Nullschool https://nullschool.net is a beautiful visualizer of wind, ocean
flows, and many other climate parameters. It is driven by data from a global numerical
weather prediction system, the Global Forecast System (GFS), run by the United
States National Weather Service (NWS).
Ventusky www.ventusky.com has both a website and a smartphone app. It has an attractive
and user-friendly interface that allows users to get digital weather information instantly
around the globe.
Climate Reanalyzer climatereanalyzer.org is a comprehensive tool for climate data
plotting and download. It has a user-friendly interface for reanalysis and historical station
data. The data can be exported in either CSV (comma-separated values) format or JSON
(JavaScript object notation) format.
Google Earth Engine earthengine.google.com provides visualization tools together
with a huge multi-petabytes storage of climate data. Its modeling and satellite data sources
are from multiple countries.
Giovanni giovanni.gsfc.nasa.gov is an online climate data plotting tool with an
interface. It allows users to download the plotted figures in different formats, such as png.
It is supported by various kinds of NASA climate datasets.
Climate Engine climateengine.org is a web application for plotting climate and
remote sensing data. Similar to Giovanni, it has a tabular interface for a user to
customize data and maps.
NOAA Climate at a Glance www.ncdc.noaa.gov/cag is a data visualization tool
mainly for visualizing the observed climate data over the United States. It has functions
for both spatial maps and historical time series.
Web-based Reanalyses Intercomparison Tools (WRIT) psl.noaa.gov/data/writ is
similar to Giovanni, in that WRIT also has an interface table for a user to enter plot param-
eters. The plot (in postscript format) and its data (in netCDF format) can be downloaded.
WRIT is designed for the data from climate reanalysis models.
the cosine function as a blue thick dashed curve.” ChatGPT will give you the following
Python code:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, 100)
y = np.cos(x)
# plot the cosine curve as a thick blue dashed line
plt.plot(x, y, 'b--', linewidth = 3)
plt.show()
You can copy the above code and paste it to your Jupyter Notebook cell, then run the
code to produce the curve of the cosine function in the interval [−π, π]. With this sample
and according to your needs, you can modify the code for different functions, intervals,
colors, thicknesses, titles, axis labels, and more.
You can also ask ChatGPT to work on a dataset, such as “Write an R code to plot the
NOAAGlobalTempts.csv data and its trend line." ChatGPT will give you an R code.
Copy and paste the code into RStudio, set the working directory to the folder that contains
the NOAAGlobalTemp annual time series data file, NOAAGlobalTempts.csv, and run the code
to plot the data and the linear trend line. Of course, you can modify the code according
to your requirements. The R code is as follows.
setwd ( ’/ Users / sshen / climmath / data ’)
library ( tidyverse )
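The rest of the generated script is not reproduced here. A possible continuation in the tidyverse style is sketched below; the column names Year and Anomaly are hypothetical and should be replaced by the actual column names in NOAAGlobalTempts.csv.
# A sketch of reading the csv and plotting the data with a linear trend line
# (column names Year and Anomaly are hypothetical)
df <- read_csv("NOAAGlobalTempts.csv")
ggplot(df, aes(x = Year, y = Anomaly)) +
  geom_line(color = "black") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "NOAA Global Average Annual Mean Temperature Anomalies",
       x = "Year", y = "Temperature Anomaly [deg C]")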
efficiently use it to give you hints and inspire your ideas. You still have to produce your
own work.
This chapter has provided a brief introduction to useful statistical concepts and methods
for climate science and has included the following material.
(i) Formulas to compute the most commonly used statistical indices:
• Mean as a simple average, median as a datum whose value is in the middle of
the sorted data sequence, i.e., the median is larger than 50% of the data and
smaller than the remaining 50%,
• Standard deviation as a measure of the width of the probability distribution,
• Variance as the square of the standard deviation,
• Skewness as a measure of the degree of asymmetry in distribution, and
• Kurtosis as a measure of the peakedness of the data distribution compared to
that of the normal distribution.
(ii) The commonly used statistical plots:
• Histogram for displaying the probability distribution of data,
• Linear regression line for providing a linear model for data,
• Box plot for quantifying the probability distribution of data, and
• Q-Q plot for checking whether the data are normally distributed.
(iii) Reading a netCDF file and plotting a 2D map with R and Python. Other data
visualization tools, such as Panoply, Paraview, 4DVD, Plotly, and Nullschool, were
briefly introduced. Online tools like 4DVD and Nullschool may be used for
classroom teaching and learning.
Computer codes and climate data examples are given to demonstrate these concepts
and the use of the relevant tools and formulas. With this background, plus some R or
Python programming skill, you will have sufficient knowledge to carry out basic
statistical analyses of climate data.
References and Further Reading
[1] B. Hennemuth, S. Bender, K. Bulow, et al., 2013: Statistical Methods for the Analysis
of Simulated and Observed Climate Data, Applied in Projects and Institutions Dealing
with Climate Change Impact and Adaptation. CSC Report 13, Climate Service Center,
Germany.
www.climate-service-center.de/imperia/md/content/csc/
projekte/csc-report13_englisch_final-mit_umschlag.pdf
This free statistics recipe book outlines numerous methods for climate data
analysis, collected and edited by climate science professionals, and has
examples of real climate data.
[2] IPCC, 2021: AR6 Climate Change 2021: The Physical Science Basis. Contribution
of Working Group I to the Sixth Assessment Report of the Intergovernmental Panel
on Climate Change [V. Masson-Delmotte, P. Zhai, A. Pirani et al. (eds.)]. Cambridge
University Press.
This is the famous IPCC report available for free at the IPCC website,
www.ipcc.ch/report/ar6/wg1/. It includes many high-quality figures
plotted from climate data.
[3] NCEI, 2021: Anomalies vs. Temperature. Last accessed on 12 May 2021.
www.ncdc.noaa.gov/monitoring-references/dyk/
anomalies-vs-temperature
These authors are among the main contributors who reconstructed sea surface
temperature (SST). They began their endeavor in the early 1990s.
[6] R. S. Vose, D. Arndt, V. F. Banzon et al., 2012: NOAA’s merged land-ocean surface
temperature analysis. Bulletin of the American Meteorological Society, 93, 1677–1685.
These authors are experts on the reconstruction of both SST and the land sur-
face air temperature, and have published many papers in data quality control
and uncertainty quantifications.
Exercises
1.1 The NOAA Merged Land Ocean Global Surface Temperature Analysis (NOAAGlobal-
Temp) V5 includes the global land-spatial average annual mean temperature anomalies
as shown here:
1880 -0.843351 0.031336 0.009789 0.000850 0.020698
1881 -0.778600 0.031363 0.009789 0.000877 0.020698
1882 -0.802413 0.031384 0.009789 0.000897 0.020698
......
The first column is for time in years, and the second is the temperature anomalies in
[Kelvin]. Columns 3–6 are data errors. This data file for the land temperature can be
downloaded from the NOAAGlobalTemp website
www.ncei.noaa.gov/data/noaa-global-surface-temperature/v5/access/
timeseries
NOAAGlobalTemp_v5.0.0_gridded_s188001_e202104_c20210509T133251.nc
data file godas2015.nc in data.zip for this book or download the netCDF GODAS
data from a NOAA website, such as
www.esrl.noaa.gov/psd/data/gridded/data.godas.html
1.21 Use Panoply to plot the same maps as the previous problem but with the Robinson
projection.
1.22 Use Plotly in R or Python to plot four December 2015 potential temperature maps at the
depth layers 5, 25, 105, and 195 meters based on the GODAS data. Try to place the four
maps together to make a 3D visualization.
1.23 Plot four June 2015 potential temperature maps at the depth layers 5, 25, 105, and 195
meters using the same netCDF GODAS data file as the previous problem.
1.24 Use Plotly in R or Python to plot four June 2015 potential temperature maps at the depth
layers 5, 25, 105, and 195 meters based on the GODAS data. Try to place the four maps
together to make a 3D visualization.
1.25 Plot three cross-sectional maps of the December 2015 potential temperature at 10◦ S,
equator, and 10◦ N using the same netCDF GODAS data file as the previous problem.
Each map is on a 40 × 360 grid, with 40 depth levels and 1-deg longitude resolution for
the equator.
1.26 Do the same as the previous problem but for the June 2015 GODAS potential
temperature data.
1.27 Use R or Python to plot a cross-sectional map of the December 2015 potential temperature
along a meridional line at longitude 170◦ E using the same netCDF GODAS data file as
the previous problem.
1.28 Use R or Python to plot a cross-sectional map of the June 2015 potential temperature
along a meridional line at longitude 170◦ E using the same netCDF GODAS data file
as the previous problem.
2 Elementary Probability and Statistics
This chapter describes the basic probability and statistics that are especially useful in cli-
mate science. Numerous textbooks cover the subjects in this chapter. Our focus is on the
climate applications of probability and statistics. In particular, the emphasis is on the use
of modern software that is now available for students and journeyman climate scientists. It
is increasingly important to pull up a dataset that is accessible from the Internet and have a
quick look at the data in various forms without worrying about the details of mathematical
proofs and long derivations. Special attention is also given to the clarity of the assumptions
and limitations of the statistical methods when applied to climate datasets. Several appli-
cation examples have been included, such as the probability of dry spells and binomial
distribution, the probability of the number of storms in a given time interval based on the
Poisson distribution, the random precipitation trigger based on the exponential distribution,
and the Standardized Precipitation Index based on the Gamma distribution.
2.1 Random Variables
2.1.1 Definition
Example 2.1 Let X be the number of wet days in any two successive days in New York
City, when the weather is classified into two types: Wet (W) if the day has more than 0.2
mm/day precipitation, and Dry (D) otherwise. Consider two sequential days denoted by
the outcomes D and W. There are only four possible outcomes {DD, DW, WD, WW} and
three possible values {0, 1, 2} of x, as shown in Table 2.1. Figure 2.1 shows the precipitation
data and the D and W days for New York City (based on the Global Historical Clima-
tology Network-Daily (GHCN-D) station USC00280907 at latitude 40°54′00″ N, longitude
74°24′00″ W) from June 1, 2019 to August 31, 2019. The probabilities in the fourth column
of Table 2.1 are also computed from the daily station data.
Figure 2.1 The daily precipitation data for New York City from June 1, 2019 to August 31, 2019, based on the GHCN-D station
USC00280907 at latitude 40°54′00″ N, longitude 74°24′00″ W. Red dots indicate dry days and blue dots
indicate wet days.
dayNum = 92
nycP = nycDat[1:dayNum, c(3, 6)]
dw = c()
for (i in 1:dayNum){ if (nycP[i, 2] >= 0.2) {dw[i] = 1}
  else {dw[i] = 0} }
dw
n0 = which(dw == 0)
n1 = which(dw == 1)
m = dayNum - 1
par(mfrow = c(1, 1))
par(mar = c(2.5, 4.5, 2.5, 4))
plot(1:dayNum, nycP[, 2], type = "s",
     main = "Daily Precipitation of New York City:
     1 Jun 2019 - 31 Aug 2019",
     ylab = "Precipitation [mm/day]",
     xaxt = "n", xlab = "",
     cex.lab = 1.4, cex.axis = 1.4)
axis(1, at = c(1, 30, 61, 92),
     labels = c("1 June", "1 July", "1 Aug", "31 Aug"),
     cex.axis = 1.4)
par(new = TRUE)
plot(n0 + 1.0, dw[n0] - 0.2, cex.lab = 1.4,
     ylim = c(0, 15), pch = 15, cex = 0.4, col = "red",
     axes = FALSE, xlab = "", ylab = "")
points(n1 + 1, dw[n1], pch = 15, col = "blue", cex = 0.4)
axis(4, col = "blue", col.axis = "blue",
     at = c(0, 1), labels = c("D", "W"),
     las = 2, cex.axis = 1)
mtext("Dry or Wet Days",
      col = "blue", side = 4, line = 1.5, cex = 1.4)
Here, X is a random variable that is a function mapping the weather event outcomes
{DD, DW, WD, WW} to the real values {0, 1, 2}, and associates each real value with a
probability. The RV takes a value only when the outcome of the random event is known, such as X = 2
when the outcome of the weather event is WW. The probability of X = 2 is 0.25, denoted
by P(X = 2) = 0.25, if the day-to-day weather is independent, i.e., unrelated. Similarly,
P(X = 0) = 0.25. However, P(X = 1) = 0.5 because X = 1 corresponds to two outcomes
{DW, WD}.
However, the real weather has a serial correlation and the day-to-day weather is not
independent. The probability for each of the weather event outcomes {DD, DW, WD, WW}
is not the same. For this particular dataset, Table 2.1 indicates that the probabilities for the
outcomes
{DD, DW, WD, WW}
are 0.3626, 0.1978, 0.1978, and 0.2418, respectively. They are computed from the number of
transitions in the fifth column of Table 2.1. For example, 33/(33 + 18 + 18 + 22) = 0.3626
is larger than the expected 0.25 under the assumption of independent weather events. This
means that New York has a better chance of another dry day, when today is dry, than of
a wet day. This is because the probability of {DD} is 0.3626, while that of {DW} is only
0.1978.
When the RV takes on only discrete real values, such as the number of wet days, the
RV is called a discrete RV. Otherwise, it is called a continuous RV. For example, let Y be the
total precipitation at a station over two consecutive days; Y can take any nonnegative
real value and is a continuous RV. The probability P(Y = 1.5) does not make sense anymore.
Instead, the probability of Y lying in an interval, such as P(1.0 < Y < 1.5), can be defined. Most
RVs in climate science are continuous, such as temperature, pressure, and atmospheric
moisture.
The data for a random variable consist of the x values of X based on the outcomes
of events that have already occurred, such as the observed temperature or atmospheric
pressure. A forecast or prediction for a random variable means drawing conclusions about X
based on the possible x values and their associated probabilities for events yet to occur, such as
the weather forecast of precipitation for the next day: How much will it rain? What is the
probability of rain? What is the probability of a given rain amount in a given interval? We
will discuss more on these issues in the next chapter.
In a sense, understanding RVs can help one better understand weather data and weather
forecasting. You should be cautioned that Table 2.1 uses data from only one time period, not
the entire history of New York weather.
Based on 92 days of New York City daily data, we consider the first 91 days and use the
last day only as an indicator for transition. The number of dry days is 51, and that of wet
days is 40. An estimate of the dry day probability is
p_D = 51/91 = 0.5604, (2.1)
and the wet day probability is
p_W = 40/91 = 0.4396. (2.2)
The probabilities of the events of the two consecutive days are shown in Table 2.1, such
as
P_WW = 0.2418. (2.3)
Thus, we denote this as P(X = 2) = 0.2418 when we use the random variable notation to
describe the probability. Similarly, P(X = 0) = 0.3626. However, X = 1 has two cases: DW
and WD. Thus,
P(X = 1) = P(DW) + P(WD) = 0.1978 + 0.1978 = 0.3956. (2.4)
Also related to this example is the probability of a wet day today given that
yesterday was a dry day. This is called a conditional probability, denoted by P(W|D) in
this case. It can be estimated by the number of DW transition days divided by the total
number of dry days:
P(W|D) = 18/51 = 0.3529. (2.5)
Similarly,
P(D|D) = 33/51 = 0.6471. (2.6)
The event DW occurs when the first day is dry, with probability P(D), and the second
day changes from dry to wet, with conditional probability P(W|D). Thus,
P(DW) = P(D) P(W|D) = (51/91) × (18/51) = 18/91 = 0.1978. (2.7)
This agrees with the fourth column of Table 2.1.
This is Bayes’ theorem, sometimes called Bayes’ formula, often written as in conditional
probability:
P(D)P(W |D)
P(D|W ) = . (2.10)
P(W )
Bayes’ theorem holds for any two random events A and B:
P(A)P(B|A)
P(A|B) = . (2.11)
P(B)
This notation for the Bayes’ theorem is most commonly used in books and the Internet.
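As a quick check, the New York City estimates from Table 2.1 can be substituted into Eq. (2.11) in R:
# Verify Bayes' theorem with the New York City dry/wet estimates
pD <- 51/91            # P(D)
pW <- 40/91            # P(W)
pWgD <- 18/51          # P(W|D)
pDgW <- pD*pWgD/pW     # P(D|W) by Bayes' theorem
pDgW                   # 0.45
# equivalently, 18 DW transitions out of 40 wet days: 18/40 = 0.45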
Let p denote the conditional probability of the second day being dry when the first day is
dry, i.e.,
p = P(D|D). (2.12)
Let q denote the conditional probability of the second day being wet when the first day is
wet, i.e.,
q = P(W|W). (2.13)
In the case of the New York City example, p = 33/51 = 0.6471 and q = 22/40 = 0.55.
The second day is either dry or wet, given that the first day is dry. The mathematical
expression for this sentence is
P(D|D) + P(W|D) = 1, (2.14)
or
p + P(W|D) = 1. (2.15)
Similarly,
P(D|W) + P(W|W) = 1 (2.16)
means that the second day is either dry or wet, given that the first day is wet.
The above leads to the following equations:
P(W|D) = 1 − P(D|D) = 1 − p,  P(D|W) = 1 − P(W|W) = 1 − q. (2.17)
1 − CPD_N = 1 − (1 − p^N) = p^N. (2.21)
The mean length L_D of a dry spell is the expected value E[n] when we treat n as an RV:
L_D = Σ_{n=1}^{∞} n (1 − p) p^{n−1} = 1/(1 − p), (2.22)
because
Σ_{n=1}^{∞} n x^n = x/(1 − x)². (2.23)
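A quick numerical check of Eq. (2.22) in R, using the New York City value p = P(D|D) = 0.6471, is shown below; the infinite sum is truncated at a large n.
# Numerical check of the mean dry spell length, Eq. (2.22)
p <- 0.6471
n <- 1:1000
LD_sum <- sum(n*(1 - p)*p^(n - 1))  # truncated sum of Eq. (2.22)
LD_formula <- 1/(1 - p)             # closed form
c(LD_sum, LD_formula)               # both about 2.83 days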
The R command rnorm(4) generates four random numbers following the standard normal
distribution N(0, 1²) with mean equal to 0 and standard deviation equal to 1:
0.6341538 0.4210989 1.1772246 1.0102169
To enable the generation of the same four random numbers, use set.seed:
set.seed(153)
rnorm(4)
Here, 153 is an arbitrary seed number you choose to assign. If you want to obtain the same
four random numbers, it is necessary to apply set.seed(153) each time before running
rnorm(4).
The R command rnorm(10, mean=8, sd=3) generates ten random numbers accord-
ing to the normal distribution N(8, 3²).
In general, rname is the standard R command for generating random numbers with
distribution name.
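For example, the same pattern applies to other distributions discussed in this chapter; the following lines are a simple sketch.
# Examples of the rname pattern for some other distributions
set.seed(153)
rbinom(4, size = 20, prob = 0.3)      # binomial with n = 20, p = 0.3
rpois(4, lambda = 5)                  # Poisson with mean rate 5
rexp(4, rate = 2)                     # exponential with rate 2
rgamma(4, shape = 1.5, rate = 0.02)   # Gamma with given shape and rate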
We use F(N) to denote the probability of occurrence of a dry spell of length equal to one
day, two days, three days, . . ., or N days, i.e., F(N) = P(n ≤ N). Then, F(N) is equal to the
sum from n = 1 to N of the probabilities in (2.18):
F(N) = Σ_{n=1}^{N} P(D_n) = Σ_{n=1}^{N} (1 − p) p^{n−1}
     = (1 − p) Σ_{n=1}^{N} p^{n−1} = (1 − p) × (1 − p^N)/(1 − p) = 1 − p^N. (2.25)
When N is large, it should cover all the possibilities, and F(N) should approach one, i.e.,
F(N) → 1 as N → ∞.
Figure 2.2 (a) Probability distribution function (PDF) of the New York City dry spell. (b) Cumulative distribution function (CDF) of
the New York City dry spell.
The function F(N) gives the cumulative probability of P(Dn ) up to N and is called the
cumulative distribution function (CDF), shown in Figure 2.2(b).
Strictly speaking, the PDF for a discrete RV should be called a probability mass function
(PMF), or discrete PDF. However, we usually do not make a clear distinction in climate
science, since the problem itself defines whether an RV is discrete or continuous.
Figure 2.2 can be plotted by the following computer code.
# R plot Fig. 2.2: PDF and CDF
p = 0.6471
n = 1:12
pdfn = (1 - p)*p^n
par(mar = c(4.5, 4.5, 3, 0.5))
plot(n, pdfn,
     ylim = c(0, 0.25), type = "h", lwd = 15,
     cex.lab = 1.5, cex.axis = 1.5,
     xlab = "Length of Dry Spell: n", ylab = "Probability",
     main = "Probability Distribution Function of Dry Spell")
text(1.5, 0.25, "(a)", cex = 1.5)
p = 0.6471
N = 1:12
cdfN = 1 - p^N
par(mar = c(4.5, 4.5, 3, 0.5))
plot(N, cdfN,
     ylim = c(0, 1), type = "s", lwd = 3,
     cex.lab = 1.5, cex.axis = 1.5,
     xlab = "Length of Dry Spell: N", ylab = "Cumulative Probability",
     main = "Cumulative Distribution Function of Dry Spell")
text(1.5, 0.95, "(b)", cex = 1.5)
# create subplots
fig, ax = plt.subplots(2, 1, figsize = (12, 12))
Usually, we use f(x) to denote the PDF, and F(x) the CDF. If X is a discrete RV, then
F(x) = Σ_{t=t_0}^{x} f(t). (2.27)
In general, any function f(x) ≥ 0 can be a PDF if the integration or summation over its
domain is 1. For example, if p > 0, q > 0, and p + q = 1, then
1 = p + q = (p + q)^n = Σ_{k=0}^{n} C_k^n p^k q^{n−k}, (2.30)
where C_k^n is the number of combinations of k distinct elements chosen out of n distinct elements:
C_k^n = n!/(k!(n − k)!). (2.31)
Therefore,
f(k) = C_k^n p^k q^{n−k} (2.32)
is a PDF, and
F(K) = Σ_{k=0}^{K} C_k^n p^k q^{n−k} (2.33)
is a CDF. This is the binomial distribution, for which the random event has only two possible
outcomes, such as dry day or wet day, success or failure, connected or disconnected, alive
or dead, 0 or 1. If the outcomes are dry or wet, then f(k) may be interpreted as the probability of k dry
days among n days. These k dry days can be interwoven with wet days, which is different
from a dry spell of k days; the latter consists of k consecutive dry days. Figure 2.3 shows the PDF
and CDF of the binomial distribution for n = 20, p = 0.3.
Figure 2.3 PDF and CDF of the binomial distribution for n = 20, p = 0.3.
When the dry and wet days are interwoven, there are C_k^n ways of selecting which k of the n days
are dry (each such arrangement of k dry days has probability p^k), and the remaining (n − k) days are wet (probability q^{n−k}).
Thus, the probability of exactly k dry days is
C_k^n p^k q^{n−k} = f(k). (2.34)
This is another way to explain the coefficients of the binomial expansion. The function
f(k) is the probability of k dry days among the total n days, with the normalization property
Σ_{k=0}^{n} f(k) = 1. (2.35)
This is intuitive since all the possibilities are exhausted when k runs from 0 to n.
We denote the binomial distribution by B(p, n), and f (k) is called the probability mass
function (PMF).
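A quick R check of Eqs. (2.32) and (2.35) is sketched below, using the New York City dry-day probability p = 0.5604 as an illustrative value.
# Binomial PMF from Eq. (2.32) and its normalization, Eq. (2.35)
n <- 7; p <- 0.5604
k <- 0:n
f <- choose(n, k)*p^k*(1 - p)^(n - k)        # Eq. (2.32)
all.equal(f, dbinom(k, size = n, prob = p))  # matches R's built-in binomial PMF
sum(f)                                       # equals 1, Eq. (2.35)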
Example 2.2 Tossing a coin yields a binomial random variable. Its sample space is Ω =
{heads, tails}. The sample space in probability theory is defined as the set of all possible out-
comes of an experiment; here, the experiment is tossing a coin. A real-valued function is
defined on the sample space and has only two values: x = 1 if it is a head, and x = 0 if it
is a tail. The probabilities defined on the real values are p_X(1) = 0.5 and p_X(0) = 0.5. The
PDF of the number of heads in n tosses is
f(k) = C_k^n (1/2)^n, (2.36)
where n is the total number of tosses (each toss is an experiment with sample space Ω).
k = np.arange(0, n + 1)
pdf = stats.binom.pmf
cdf = stats.binom.cdf
fig, ax1 = plt.subplots(figsize = (12, 8))
ax2 = ax1.twinx()
# plot cdf
ax2.plot(k, cdf(k, n, p), 'bo-')
ax2.set_ylabel("Cumulative Probability",
               size = 25, color = 'b', labelpad = 15)
ax2.tick_params(length = 6, width = 2,
                labelsize = 21, color = 'b', labelcolor = 'b')
ax2.spines['right'].set_color('b')
ax2.set_ylim(0,)
ax2.text(10, 0.2, "PDF", color = 'k', size = 26)
plt.savefig("Figure0203.eps")
In the R code in Example 2.2, dbinom provides data for the PDF, and pbinom yields data
for the CDF. The commands dname and pname apply to all the commonly used proba-
bility distributions, such as dnorm and pnorm for normal distributions. For example, if
x=seq(-5,5, len=101), then plot(x, dnorm(x)) plots the PDF of the standard nor-
mal distribution, as shown in Figure 2.4, which can be produced by the following computer
code.
# R plot Fig. 2.4: PDF and CDF of normal distribution
x = seq(-5, 5, len = 101)
y = dnorm(x)
plot(x, y,
     type = "l", lty = 1,
     xlab = "x", ylab = "Probability Density f(x)",
     main = "Standard Normal Distribution",
     cex.lab = 1.4, cex.axis = 1.4)
Figure 2.4 PDF and CDF of the standard normal distribution N(0, 1).
ax2 = ax1.twinx()
plt.show()
Note that for a continuous RV, the probability at a single point, P(x), does not make any
sense, but the probability in x's small neighborhood of length dx, denoted by f(x)dx, is
meaningful. The probability of a continuous RV in an interval (a, b) is also meaningful and
is defined as the area underneath the PDF:
P_ab = ∫_a^b f(x) dx = F(b) − F(a). (2.37)
For example, the standard normal distribution has the following PDF formula:
f(x) = (1/√(2π)) exp(−x²/2). (2.38)
If a = −1.96 and b = 1.96, then
P_ab = ∫_{−1.96}^{1.96} f(x) dx = 0.95. (2.39)
These numbers will be used in the description of confidence level and confidence interval
in Chapter 3.
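Equation (2.39) can be verified in R with one line:
pnorm(1.96) - pnorm(-1.96)       # 0.9500042, i.e., about 0.95
integrate(dnorm, -1.96, 1.96)    # the same result by numerical integration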
A histogram shows the distribution of data from a single realization, also called a sample.
When it is normalized, a histogram may be considered an approximation to
the PDF of the random process from which the data are taken or generated. Figure 2.5
shows the histogram computed from 200 random numbers generated by R, and its
PDF fit, where the PDF is subject to an area normalization, which is 200 for this
figure.
Figure 2.5 Histogram and its PDF fit for 200 random numbers from a normal distribution with mean 5. The blue short vertical
lines are the 200 random numbers x generated by the R code rnorm(200, mean=5, sd=3).
# plot histogram
counts, bins, bars = ax.hist(x, bins = 17,
    range = (-5, 15), color = 'gray', edgecolor = 'k')
# generate 200 zeros
y = np.zeros(200)
plt.show()
Please note that since the data are randomly generated, your histogram might differ from
the one shown here.
2.3.1 Definitions
The expected value, also commonly referred to as mean value, average value, expectation,
or mathematical expectation, of an RV is defined as
E[X] = ∫_D x f(x) dx ≡ µ_x, (2.40)
where f (x) is the PDF of X , and D is the domain of x, which is defined from the set of
all the event outcomes, i.e., the sample space. The domain D is (−∞, ∞) for the standard
normal distribution.
The expected value of a function of a random variable g(X) is similarly defined:
E[g(X)] = ∫_D g(x) f(x) dx. (2.41)
The variance of X is defined as
Var[X] = E[(X − µ_x)²] = ∫_D (x − µ_x)² f(x) dx. (2.42)
This expression is interpreted as the second centered moment of the distribution, while
the expected value defined in Eq. (2.40) is the first moment. In general, the nth centered
moment is defined by
µ_n = E[(X − µ_x)^n]. (2.43)
Another commonly used notation for variance is var, and many computer languages use
var(x) as the command to calculate the variance of the data string x.
The standard deviation, abbreviated SD, is the square root of the variance:
SD[X] = √Var[X]. (2.45)
It has the same units as x. We also use sd to denote standard deviation, and sd(x) is often
a computer command to calculate the standard deviation of the data string x.
Integration cannot be applied to a discrete random variable and has to be replaced by
summation. The expected value of a discrete random variable is defined by a summation
as follows:
E[X] = Σ_n x_n p(x_n), (2.46)
where p(x_n) is the probability of x_n, i.e., p is the PDF of the discrete random variable.
For example, for a binomial distribution B(p, n), the expected value of k is np and can be
derived from the following formula:
E[K] = Σ_{k=0}^{n} k C_k^n p^k (1 − p)^{n−k} = np. (2.47)
Here a and b are ordinary numbers, X is a random variable, and Y = a + bX. The integral definition of
the expected value implies that
E[Y] = a + bE[X]. (2.50)
The sum X + Y of two random variables X and Y has the following property:
E[X + Y] = E[X] + E[Y]. (2.52)
When two random variables X and Y (also called variates) are present, their PDF is a joint
distribution f(x, y), with
∫∫_D f(x, y) dx dy = 1. (2.57)
The function f_1(x) = ∫_{D_x} f(x, y) dy is a PDF for X, and is called the marginal distribution function, where D_x is a cross section
of the 2-dimensional domain D for a fixed x.
Figure 2.6 shows a simulated joint distribution: a normal distribution N(5, 3²) in x and
a uniform distribution U(0, 5) in y. The locations of the black dots are determined by the
coordinates (x, y). The bar chart on top is the histogram of the x data, and the bar chart on the
right is the histogram of the y data.
Figure 2.6 Simulated joint and marginal distributions: Normal distribution in x and uniform distribution in y.
# plot points
ax[1, 0].plot(x, y, 'ko', linewidth = 2)
ax[1, 0].set_xlabel("$x$", size = 25, labelpad = 20)
ax[1, 0].set_ylabel("$y$", size = 25, labelpad = 15)
ax[1, 0].tick_params(length = 6, width = 2, labelsize = 20)
ax[1, 0].set_xticks(np.linspace(-5, 15, 5))
We can also define a conditional probability density for x for a given y, denoted f(x|y). If
the marginal distribution f_2(y) is also known, we have f(x, y) = f(x|y) f_2(y).
We can also state independence from the joint distribution perspective. The RVs X and Y
are independent (i.e., the occurrence of y does not alter the probability of the occurrence of
x) if and only if f(x, y) = f_1(x) f_2(y).
The example shown in Figure 2.6 has independent x and y. Chapters 3, 4, and 7 have more
examples and discussions on independence and serial correlations.
The last part of Eq. (2.62) is called the covariance of X and Y, denoted by
Cov[X, Y] = E[(X − E[X])(Y − E[Y])]. (2.63)
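A small simulated check of Eq. (2.63) in R is sketched below; the data are artificial and only illustrate that the built-in cov() agrees with the definition.
# Covariance from the definition vs. R's cov()
set.seed(1)
x <- rnorm(1000, mean = 5, sd = 3)
y <- 2*x + rnorm(1000)                 # y is correlated with x
cov(x, y)                              # close to 2*Var[x] = 18
mean((x - mean(x))*(y - mean(y)))      # definition; differs only by the 1/(n-1) vs 1/n factor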
The Poisson distribution is a discrete probability distribution that gives the probability of a
number of events occurring in a given time interval. For example, what is the probability
of having ten rain storms over New York in June when the average number of June storms
based on the historical data is five? The mean occurrence rate is denoted by λ. The PDF of
the Poisson distribution is
f(k) = e^{−λ} λ^k / k!,  k = 0, 1, 2, . . . (2.65)
It can be proved that
E[X] = λ, (2.66)
Var[X] = λ. (2.67)
Figure 2.7 shows the PDF, also known as PMF, for a discrete RV, and CDF of the Poisson
distribution. If the average number of June storms is five, then the probability of having ten
storms in June is f (10) = 0.0181 or approximately 2%, which can be computed by the R
command dpois(10, lambda=5).
Figure 2.7 can be generated by the following computer code.
# R plot Fig. 2.7: Poisson distribution
k = 0:20
y = dpois(k, lambda = 5)
par(mar = c(4.5, 4.5, 2, 4.5))
Figure 2.7 PDF and CDF of a Poisson distribution of mean rate equal to 5.
plot(k, y,
     type = "p", lty = 1, xlab = "k", ylab = "",
     main = "Poisson Distribution: Mean Rate = 5",
     cex.lab = 1.4, cex.axis = 1.4)
mtext(side = 2, line = 3, cex = 1.4, 'Probability f(k)')
par(new = TRUE)
plot(k, ppois(k, lambda = 5), type = "p", pch = 16,
     lty = 2, col = "blue",
     axes = FALSE, ylab = "", xlab = "", cex.axis = 1.4)
axis(4, col = "blue", col.axis = "blue", cex.axis = 1.4)
mtext("Cumulative Probability F(k)", col = "blue",
      side = 4, line = 3, cex = 1.4)
ax2 = ax1.twinx()
The exponential distribution has the PDF f(t) = λ e^{−λt} for t ≥ 0, where λ is the mean rate of occurrence, i.e., the number of occurrences per unit time.
Its CDF is
F(t) = 1 − e^{−λt}. (2.69)
One can also relate this result to the Poisson distribution. According to the Poisson PDF,
the probability of zero precipitation events occurring in the time interval (t, t + Δt) is
p(0; λΔt) = e^{−λΔt} (λΔt)^0 / 0! = e^{−λΔt}. (2.73)
Then, the probability for a nonzero precipitation event to occur in the time interval is 1 − e^{−λΔt}.
For a normal RV X ∼ N(µ, σ²),
E[X] = µ, (2.77)
Var[X] = σ². (2.78)
If X_1 ∼ N(µ_1, σ_1²) and X_2 ∼ N(µ_2, σ_2²) are independent, then their mean is also a normal
RV with
X̄ = (X_1 + X_2)/2 ∼ N((µ_1 + µ_2)/2, (σ_1² + σ_2²)/2²). (2.79)
Note that the mean has a smaller variance, i.e., is more accurate.
If the RVs X_1, X_2, . . . , X_n are independent and follow the same normal distribution
N(µ, σ²), then their mean is normally distributed N(µ, σ_A²) with
σ_A² = σ²/n. (2.80)
This conclusion is particularly useful in science experiments. Repeating an experiment n
times reduces the experimental error σ dramatically:
σ_A = σ/√n. (2.81)
This is called the standard error of an experiment repeated n times, or n repeated
measurements.
Let X_1, X_2, . . . , X_n be independent RVs from a population with the same mean µ and
finite variance σ², but not necessarily normally distributed. Their average
X̄ = (1/n) Σ_{i=1}^{n} X_i (2.82)
is still approximately normally distributed as long as n is sufficiently large, with
X̄ ∼ N(µ, σ²/n). (2.83)
This formula is the classical version of the central limit theorem (CLT), and supports our
intuition that we will have a reduced standard error σ/√n if we repeat our measurements
and take their average.
The CLT states that, so long as the variances of the terms are finite, the sum approaches a
normal distribution as n goes to infinity, whether or not the RV of each term is normal.
Nearly all variates in climate science satisfy this criterion.
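The CLT can be illustrated with a brief simulation; the sketch below averages uniform random numbers, which individually are far from normal, and compares the histogram of the averages with the normal density predicted by Eq. (2.83).
# A CLT illustration: means of n = 100 uniform random numbers
set.seed(101)
xbar <- replicate(10000, mean(runif(100)))   # 10,000 sample means
hist(xbar, breaks = 40, probability = TRUE,
     main = "Means of 100 uniform random numbers", xlab = "Sample mean")
curve(dnorm(x, mean = 0.5, sd = sqrt(1/12)/sqrt(100)),
      add = TRUE, col = "blue", lwd = 2)     # CLT prediction: N(0.5, (1/12)/100)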
One distribution that is rare but violates the finite variance assumption of the CLT is the
Cauchy distribution C(x_0, γ), whose PDF is
f(x) = 1 / (πγ [1 + ((x − x_0)/γ)²]),  −∞ < x < ∞, (2.84)
where x_0 is called the location parameter and γ is called the scale parameter. The Cauchy
distribution does not have a well-defined mean, and its variance is infinite. Figure 2.8 shows the fat tail of the Cauchy
distribution compared to a normal distribution with the same peak.
Figure 2.8 can be generated by the following computer code.
# R plot Fig. 2.8: Normal and Cauchy distributions
x = seq(0, 20, len = 201)
ycauchy = dcauchy(x, location = 10, scale = 1)
ygauss = dnorm(x, mean = 10, sd = sqrt(pi/2))
plot(x, ycauchy, type = "l", lwd = 2,
     ylab = "Density",
     main = "Comparison between Cauchy and Gaussian Distributions",
     cex.lab = 1.4, cex.axis = 1.4)
legend(-1, 0.35, legend = "Cauchy distribution",
       cex = 1.2, bty = "n", lty = 1, lwd = 2)
lines(x, ygauss, lty = 2)
legend(-1, 0.32, legend = "Normal distribution",
       cex = 1.2, bty = "n", lty = 2)
Figure 2.8 Compare the PDF of the normal distribution N(10, π/2) and the Cauchy distribution C(10, 1).
The chi-square distribution deals with the sum of squares of independent standard normal RVs:
X = Σ_{n=1}^{k} Z_n²,  Z_n ∼ N(0, 1). (2.85)
Here, the integer k is called the degrees of freedom (dof). We thus also use χ²_k to denote
the χ² distribution. If k = 2m with m a positive integer, then Γ(m) = (m − 1)! and
f(x) = x^{m−1} e^{−x/2} / (2^m (m − 1)!). (2.88)
The expected value and variance of the chi-square χ² variable X are given as follows:
E[X] = k, (2.89)
Var[X] = 2k. (2.90)
The following simple R command can plot the probability density function (PDF) of a
χ² variable:
x = seq(1, 10, by = 0.2)
plot(x, dchisq(x, df = 4))
The histogram of these w data fits the χ²_6 distribution, as shown in Figure 2.9.
The R code for generating Figure 2.9 is as follows.
# R plot Fig. 2.9: Chi-square data and fit
setwd("/Users/sshen/climstats")
da1 = read.csv("EdmontonT.csv", header = TRUE)
x = da1[, 3]
m1 = mean(x)
s1 = sd(x)
xa = (x - m1)/s1
y = matrix(xa[1:1632], ncol = 6, byrow = TRUE)
Figure 2.9 Histogram and its chi-square fit. The chi-square data are derived from Edmonton monthly surface air temperature
anomaly data: Jan 1880–Dec 2015.
w = rowSums(y^2)
hist(w, breaks = seq(0, 40, by = 1),
     xlim = c(0, 40), ylim = c(0, 50),
     main = "Histogram and its Chi-square Fit
     for Edmonton temperature data",
     xlab = expression("Chi-square data ["~degree~C~"]"^2),
     cex.lab = 1.4, cex.axis = 1.4)
par(new = TRUE)
density = function(x) dchisq(x, df = 6)
x = seq(0, 40, by = 0.1)
plot(x, density(x), type = "l", lwd = 3,
     col = "blue", ylim = c(0, 0.15),
     axes = FALSE, bty = "n", xlab = "", ylab = "")
axis(4, cex.axis = 1.4, col = 'blue', col.axis = 'blue')
mtext("Chi-square Density", cex = 1.4,
      side = 4, line = 3, col = "blue")
text(20, 0.1, cex = 1.4, col = "blue",
     "Chi-square distribution: df = 6")
# plot histogram
ax.hist(w, bins = 41, edgecolor = 'k',
        color = 'white', linewidth = 3)
ax.set_title("Histogram and its Chi-square Fit\n\
for Edmonton temperature data")
ax.set_xlabel("Chi-square data [$\degree$C]$^2$",
              size = 22)
ax.set_ylabel("Frequency", size = 22, labelpad = 20)
ax.tick_params(length = 6, width = 2, labelsize = 21)
ax.set_xticks([0, 10, 20, 30, 40])
ax.set_yticks([0, 10, 20, 30, 40, 50])
ax.spines['top'].set_visible(False)
ax.text(10, 35, "Chi-square distribution: df = 6",
        size = 20, fontdict = font)
ax1 = ax.twinx()
# plot chi-square fit
ax1.plot(x, cpdf, color = 'b', linewidth = 3)
ax1.set_ylabel("Chi-square Density", size = 22,
               color = 'b', labelpad = 20)
ax1.set_ylim(0,)
ax1.tick_params(length = 6, width = 2, color = 'b',
                labelcolor = 'b', labelsize = 21)
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_color('b')
ax1.set_yticks([0.00, 0.05, 0.10, 0.15])
plt.show()
A popular application of the chi-square distribution in data analysis is the chi-square
test for goodness of fit to a given distribution, continuous or discrete.
The test statistic is the sum of the squared differences between observed and expected counts,
each divided by the expected count:
χ² = Σ_{i=1}^{n} (O_i − E_i)²/E_i, (2.92)
where O_i is the number of observations in the ith category, E_i is the expected number
of observations in the ith category according to the given distribution, and n is the number
of categories. A perfect fit implies χ² = 0, while a large χ² value means a poor fit.
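As a rough illustration of Eq. (2.92), not taken from the main text, one may test whether the New York City day-to-day weather of Table 2.1 is consistent with independence:
# Chi-square statistic for the NYC transition counts vs. independence
O <- c(33, 18, 18, 22)                   # observed DD, DW, WD, WW counts
pD <- 51/91; pW <- 40/91
E <- 91*c(pD*pD, pD*pW, pW*pD, pW*pW)    # expected counts under independence
chi2 <- sum((O - E)^2/E)
chi2                                     # about 3.5
qchisq(0.95, df = 1)                     # about 3.84, the 5% critical value for a 2 x 2 table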
The lognormal distribution is simply the case in which the logarithm of the variable is
normally distributed. Its PDF is
f(x) = (1/(xσ√(2π))) exp(−(ln x − µ)²/(2σ²)),  0 < x < ∞. (2.93)
The distribution of weekly, 10-day, or even monthly total precipitation at some stations
may be approximated by the lognormal distribution. The June precipitation of Madison,
Wisconsin, USA, from 1941 to 2019 fits the lognormal distribution well. The R command for
the fitting (from the fitdistrplus package) is
fitdist(x, distr = "lnorm", method = "mle")
where x is the Madison precipitation data stream, and mle stands for maximum likelihood
estimation. The fitting results are µ = 4.52 and σ = 0.64.
value and variance from the fitted distribution are
2
E[X] = e4.52+0.64 /2 = 112.81, (2.96)
h 2
i 2
Var[X] = e0.64 − 1 e2×4.52+0.64 = 6602.91. (2.97)
This model expected value 112.81 [mm] is approximately equal to the sample mean of
79 years of June precipitation which is 109.79 [mm]. The model standard deviation is
√
6602.91 = 81.25, much larger than the sample standard deviation 63.94 [mm] from the
Madison precipitation data.
The aforementioned statistics results can be computed by the following computer code.
# R lognormal distribution of precipitation
setwd("/Users/sshen/climstats")
Madison = read.csv("data/MadisonP.csv", header = TRUE)
Madison[1:2, ]  # Take a look at the first two lines of the data
daP = matrix(Madison[, 7], ncol = 12, byrow = TRUE)
x = daP[, 6]  # The June precipitation data
n = length(x)
m1 = mean(x)
A = log(m1) - sum(log(x))/n
c = (1/(4*A))*(1 + sqrt(1 + 4*A/3))
beta = c/m1
mu = mean(log(x))
sig = sd(log(x))
round(c(mu, sig), digits = 2)
#[1] 4.52 0.65
modelmean = exp(mu + 0.5*sig^2)
modelsd = sqrt((exp(sig^2) - 1)*exp(2*mu + sig^2))
datamean = mean(x)
datasd = sd(x)
round(c(modelmean, datamean, modelsd, datasd), digits = 2)
#[1] 112.81 109.79 81.26 63.94
If you wish to plot the fitted lognormal distribution, you may use the following simple
R command.
# R plot lognormal distribution
x = seq(0, 600, by = 0.1)
plot(x, dlnorm(x, meanlog = 4.52, sdlog = 0.64))
The monthly precipitation of stations or grid boxes can often be modeled by the Gamma
distribution. For example, the June precipitation of Omaha, Nebraska, USA, from 1948 to
2019 fits the Gamma distribution well with rate β = 0.0189 and shape c = 1.5176. Data
source: Climate Data Online (CDO), NOAA,
www.ncdc.noaa.gov/cdo-web.
The R command is
fitdist(x, distr = "gamma", method = "mle")
where x is the June precipitation data stream of 72 years (i.e., 72 numbers) from 1948 to
2019.
The theoretical expected value from the fitted Gamma distribution is thus c/β =
1.5176/0.0189 = 80.30 [mm]. The sample mean is 80.33 [mm], approximately
equal to the theoretical expectation 80.30 [mm]. The theoretical standard deviation is
√c/β = √1.5176/0.0189 = 65.18 [mm]. The sample standard deviation is 58.83 [mm], which
is of the same order as the theoretical standard deviation, although the approximation is
not very good.
The Gamma distribution has a popular application in meteorological drought monitor-
ing. It is used to define the Standardized Precipitation Index (SPI), which is a widely used
index on a range of timescales from a month to a year.
The parameters of the Gamma distribution can be directly estimated from data using the
following formulas:
c = (1/(4A)) (1 + √(1 + 4A/3)), (2.101)
b = x̄/c, (2.102)
where
A = ln(x̄) − (Σ_{i=1}^{n} ln(x_i))/n. (2.103)
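The following sketch applies Eqs. (2.101)–(2.103) to a short hypothetical precipitation record and then converts the data to SPI values by the usual equal-probability transform to the standard normal distribution; the data values are made up for illustration only.
# Gamma parameters from Eqs. (2.101)-(2.103) and a simple SPI computation
x <- c(30, 55, 80, 120, 40, 95, 60, 150, 70, 85, 45, 110)  # hypothetical totals [mm]
n <- length(x)
A <- log(mean(x)) - sum(log(x))/n              # Eq. (2.103)
c <- (1/(4*A))*(1 + sqrt(1 + 4*A/3))           # shape, Eq. (2.101)
b <- mean(x)/c                                 # scale, Eq. (2.102)
spi <- qnorm(pgamma(x, shape = c, scale = b))  # equal-probability transform to N(0,1)
round(spi, 2)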
same as the standard normal distribution N(0, 1). Even when dof = 6, the t-distribution is
already very close to the standard normal distribution. Thus, the t-distribution is meaningfully
different from the standard normal distribution only when the sample size is small, say,
n = 5.
The mathematical formula of the PDF for the t-distribution is quite complicated. The
formula is rarely used and is not presented in this book.
This chapter has described the basics of a random variable, PDF or PMF, CDF, CLT,
and some commonly used probability distributions in climate science. Some important
concepts are as follows.
(i) Definition of RV: When talking about an RV X , we mean three things: (a) A sample
space that contains all the possible outcomes of the random events, (b) a function
that maps the outcomes to real values, and (c) the probability associated with the real
values in a discrete way or over an interval. Therefore, a random variable is differ-
ent from a regular real-valued variable that can take a given value in its domain. A
random variable takes a value only after the outcome of a random event is known,
because an RV is itself a function. When the value is taken, we then ask what the
measure of uncertainty of the value is.
(ii) Commonly used probability distributions: We have presented some distributions
commonly used in climate science, including the normal distribution, binomial
distribution, Poisson distribution, exponential distribution, chi-square distribution,
lognormal distribution, and Gamma distribution. We have provided both R and
Python codes to plot the PDFs and CDFs of these distributions. We have also fit-
ted real climate data to appropriate distributions. Following our examples, you may
wish to fit your own data to appropriate distributions.
(iii) The central limit theorem (CLT) deals with the mean of independent RVs of finite
variance. The CLT claims that the mean is approximately normal if the number of RVs
is large. The key assumption is the finite variance of each RV.
(iv) The chi-square distribution χ²_k deals with the sum of squares of k independent stand-
ard normal RVs. The chi-square distribution is very useful in climate science, for
example, in analyzing the goodness of fit of a model.
References and Further Reading
[2] R. A. Johnson and G. K. Bhattacharyya, 1996: Statistics: Principles and Methods. 3rd
ed., John Wiley & Sons.
This is a popular textbook for a basic statistics course. It emphasizes clear con-
cepts and includes descriptions of statistical methods and their assumptions.
Many real-world data are analyzed in the book. The 8th edition was published
in 2019.
[3] H. Von Storch and F. W. Zwiers, 2001: Statistical Analysis in Climate Research.
Cambridge University Press.
[4] D. S. Wilks, 2011: Statistical Methods in the Atmospheric Sciences. 3rd ed., Academic
Press.
This excellent textbook is easy to read and contains many simple examples of
analyzing real climate data. It is not only a good reference manual for climate
scientists but also a guide tool that helps scientists in other fields make sense
of the data analysis results in climate literature.
Exercises
2.1 Use the data in Table 2.1 to calculate the mean length of wet spell in the summer for
New York City, following the theory of conditional probability.
2.2 Use the data in Table 2.1 to calculate the probability of exactly n successive dry days in
the summer for New York City for n = 1, 2, 3, 4, 5, 6, and 7. Hint: The (n + 1)th day must
be a wet day.
2.3 Use the data in Table 2.1 to calculate the probability of exactly n successive wet days in
the summer for New York City for n = 1, 2, 3, 4, 5, 6, and 7. Hint: The (n + 1)th day must
be a dry day.
2.4 Write down the details of proving that the expected number of occurrences of a binomial
event in B(p, n) is
E[K] = Σ_{k=0}^{n} k C_k^n p^k (1 − p)^{n−k} = np, (2.111)
where C_k^n is
C_k^n = n!/(k!(n − k)!). (2.112)
2.5 The variance of the number of occurrences of a binomial event in B(p, n) is np(1 − p).
Write down the mathematical details to derive the following:
Var[K] = Σ_{k=0}^{n} (k − np)² C_k^n p^k (1 − p)^{n−k} = np(1 − p). (2.113)
and
Var[X] = ∫_{−∞}^{∞} (x − µ)² f(x) dx = σ². (2.118)
2.9 Following Figure 2.6, use R or Python to simulate and plot the joint and marginal
distributions under the assumption of normal distribution in both x and y.
2.10 (a) Use R or Python to fit the January monthly precipitation data of Omaha, Nebraska,
USA, from 1950 to 2019 to the Gamma distribution. You may find the data from
the NOAA Climate Data Online (CDO) or another source.
(b) Plot the fitted distribution and the histogram of data on the same figure. You may
put the density scale of the fitted distribution on the right in blue,
as shown in Figure 2.9.
2.11 Repeat the Gamma distribution problem for the June monthly precipitation data of a
location of your interest. You may find the monthly data from your data source or from
the NOAA Climate Data Online (CDO).
2.12 (a) Use R or Python to fit the January precipitation data of Madison, Wisconsin, USA,
from 1950 to 2019 to the lognormal distribution. You may find the data from the
NOAA Climate Data Online (CDO) or another source.
(b) Plot the fitted distribution and the histogram of data on the same figure. See Figure
2.9 as a reference.
2.13 Repeat the lognormal distribution problem for the June precipitation data of a location of
your interest. You may find the monthly data from your data source or from the NOAA
Climate Data Online (CDO).
2.14 Use R or Python to plot the Poisson distributions with five different rates λ on the same
figure. Explain why the PDF peaks at k = λ .
2.15 Use R or Python to plot the exponential distributions with five different rates λ on the
same figure.
2.16 Use R or Python to plot the lognormal distributions with five different sets of µ and σ
values on the same figure.
2.17 Use R or Python to plot the Gamma distributions with five different sets of a, b, and c
values on the same figure.
2.18 Use R or Python to plot chi-square distributions with different degrees of freedom on
the same figure.
2.19 (a) Identify a GHCN station of your interest that has at least 30 years of high-quality
monthly precipitation records, and download the data.
(b) Use R or Python to plot the May precipitation data against the time [year].
(c) Compute the standard precipitation index (SPI) from the May precipitation data for
the driest year.
2.20 Compute the summer SPI indices based on the June-July-Aug mean precipitation at
Omaha, Nebraska, USA, from 1950 to 2019. You may find the data from the NOAA
Climate Data Online (CDO) or another source.
2.21 Repeat the previous problem for the June precipitation data.
2.22 (a) Compute the standardized anomalies of the NOAAGlobalTemp monthly tempera-
ture data from January 1880 to December 2015 over the 5-degree grid box centered
at (47.5◦ N, 107.5◦ W). This data stream has 1,632 months and can be broken into
272 6-month periods. Then, compute

X = ∑_{n=1}^{6} T_n²    (2.119)
A goal of climate statistics is to make estimates from climate data and use the estimates
to make quantitative decisions with a given probability of success, described by a confidence interval. For example, based on the estimate from the NOAAGlobalTemp data and
given 95% probability, what is the interval in which the true value of the global average
decade mean of the 2010s lies? Was the 1980–2009 global temperature significantly dif-
ferent from the 1950 to 1979 temperature, given the 5% probability of being wrong? To
answer questions like these is to make a conclusive decision based on the estimate of a
parameter, such as the global average annual mean temperature, from data. With this in
mind, we introduce the basic methods of parameter estimation from data, and then intro-
duce decision-making using confidence intervals and hypothesis testing. The procedures of
estimation and decision-making are constrained by sample size. Climate data are usually
serially correlated, and the individual data entries may not be independent. The actual sam-
ple size, neff, may be much smaller than the number of data records n. The uncertainty of the estimates from the climate data is much larger when neff is taken into account. Examples
are provided to explain the correct use of neff . The incorrect use of sample size by climate
scientists can lead to erroneous decisions. We provide a way to test serial correlation and
compute actual sample size.
Fundamentally, three elements are involved: data, estimate, and decision-making. Then
we may ask our first questions: What are climate data? How are they defined? Can we
clearly describe the main differences between the definition and observation of climate
data?
In statistics, a population refers to the set of all members in a specific group about which
information is sought, for example, El Niño events. A sample drawn from a population in
the course of investigation is a subset of observations, such as the El Niño events observed
between 1950 and 2000. The sample data are the quantitative values of the samples, such as
the maximum sea surface temperature anomalies of El Niños over a specific region over the
Eastern Tropical Pacific. Our goal here is to use the sample data (x1 , x2 , . . . , xn ) to estimate
a statistical parameter for the population, such as the sample mean:
x̄ = (1/n) ∑_{i=1}^{n} x_i.    (3.1)
The true population mean µ is never known, but in our imagination it exists. We wish to
use the sample mean x̄ as an estimate for the true mean. We assume that the samples are
random variables. Then x̄ is also a random variable. Some textbooks would use uppercase
letters to denote the random variables to explain statistical concepts and lowercase letters to
denote data for computational convenience. The two notations often get confused in those
books. In this chapter, we use lowercase letters for both random variables and data. The
lowercase letters (x1 , x2 , . . . , xn ) and x̄ can be interpreted as random variables in statistical
interpretation and also treated as data in computing.
The central limit theorem (CLT) states that the sample mean x̄ is normally distributed
when the sample size n is sufficiently large. Some textbooks consider 30 to be “large” and
others 50. Climate scientists often use 30. The CLT needs an assumption that each sample
xi has finite variance. Climate data usually satisfy this assumption.
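As a quick check of this statement, you may simulate sample means from a skewed population. The following R sketch (the exponential population and sample counts are illustrative assumptions, not data from this book) shows that means of n = 30 values are already roughly bell-shaped.
# A minimal sketch illustrating the CLT with a skewed population
means30 = replicate(10000, mean(rexp(30, rate = 1))) # 10,000 sample means, n = 30
hist(means30) # roughly bell-shaped, centered near the population mean 1
qqnorm(means30) # points nearly on a straight line: approximately normal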
We do not expect that the true mean µ is equal to the sample mean, rather we ask a
question: what is the probability that the true mean lies in an interval around the sample
mean x̄? Apparently, if the interval is large, it has a better chance for the true mean to be
inside the interval. However, a large interval is less accurate and less useful, because in this
case the true mean µ may be far away from the sample mean x̄.
If the population has small variability, then it is easier to obtain accurate observations.
The sample data will have a smaller standard deviation. Hence even a small interval may
have a big chance to contain the true mean.
Therefore, the interval depends on two factors: variability of the population, measured by
the standard deviation σ , and number of the observations, given by the sample size n. Our
intuition is that we would get a more accurate estimate for the population mean when we
have a large sample size n. These two factors lead us to define the sample standard deviation
and the standard deviation of the sample mean. The unbiased estimate of the sample standard
deviation is defined by
s = √( (1/(n − 1)) ∑_{i=1}^{n} (x_i − x̄)² ).    (3.2)
Further, when the random sample data (x1 , x2 , . . . , xn ) are independent from each other and
are from the same population with variance equal to σ 2 , the standard deviation of the
sample mean x̄ is
SD[x̄] = s/√n.    (3.3)
This is also called the standard error of the sample mean, denoted by SE[x̄] or SE. The
standard error quantifies the size of the “error bar” x̄ ± SE when approximating the true
mean using the sample mean.
(SD[x̄])² = Var[x̄]
         = Var[(x1 + x2 + · · · + xn)/n]
         = Var[x1/n] + Var[x2/n] + · · · + Var[xn/n]
         = Var[x1]/n² + Var[x2]/n² + · · · + Var[xn]/n²
         = σ²/n² + σ²/n² + · · · + σ²/n²
         = σ²/n.    (3.4)
This description and derivation imply that sample standard deviation s estimates the
spread of the population, while standard deviation of the mean SD(x̄) is an indicator of the
accuracy of using the sample mean x̄ to approximate the population mean µ . This is why
SD[x̄] is also called the standard error. To further understand this concept, we use numerical
simulation results shown in Figure 3.1. We generate 1,000,000 normally distributed and
independent data according to N(0, 102 ). So the population mean is µ = 0 and standard
deviation is σ = 10. These are given for the simulation. Figure 3.1(a) shows the histogram
of the 1,000,000 data. This histogram has a standard deviation equal to 10. Then, we take
sample mean of size n = 100. The 1,000,000 data allow us to take the sample mean 10,000
times, which leads to 10,000 sample means x̄. Figure 3.1(b) shows the histogram of these
10,000 sample means x̄, whose standard deviation is SD(x̄) = σ/√n = 10/√100 = 1.
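A minimal R sketch of this simulation (the exact listing is not shown here, so the commands below are an assumed reconstruction) is:
# Sketch of the Figure 3.1 simulation
da = rnorm(1000000, mean = 0, sd = 10) # 1,000,000 values from N(0, 10^2)
sd(da) # approximately 10, as in Fig. 3.1(a)
xbar = rowMeans(matrix(da, ncol = 100)) # 10,000 sample means of size n = 100
sd(xbar) # approximately 10/sqrt(100) = 1, as in Fig. 3.1(b)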
Figure 3.1 (a) Histogram of a population x, and (b) histogram of sample mean x̄ generated from the numerical simulation according to the normal distribution N(0, 10²).
However, most climate data x1 , x2 , . . . , xn are not independent, and are correlated with
cor(xi, xj) ≠ 0. Then, the n sample data have some redundancy. The extreme case is that all x1, x2, ..., xn are equal, so the n sample data really amount to only one datum. Thus, x1, x2, ..., xn do not serve as n independent pieces of information, and the mean x̄ will not achieve the purpose of shrinking the standard error s/√n. Then, how many independ-
ent pieces of information are in the data x1 , x2 , . . . , xn ? We use neff to denote the number,
called the effective sample size, which is less than n. Thus, a more accurate estimate of the
standard error is
SE = s/√neff.    (3.5)
More details about the effective sample size can be found in Section 3.3.
In this expression, we treat both x̄ and EM as random variables, and the unknown true mean
µ as a fixed value. When the CI (x̄ − EM, x̄ + EM) is estimated with numerical values, we
have then made a single realization of the CI. A particular realization may not cover the
true mean, although we are 95% confident that it does because about 95% of realizations
do.
One can use computers to simulate this statement by repeatedly generating normally
distributed numbers (x1 , x2 , . . . , xn ) m times with given mean and standard deviation. We
then compute the intervals (x̄ − EM, x̄ + EM) using EM = 1.96 s/√n and verify that among
the m number of experiments, 95% of them have the true mean inside the interval (x̄ −
EM, x̄ + EM). The computer code is below.
# Verification of 95% CI by numerical simulation
m = 10000 # 10,000 experiments
x = 1:m
n = 30 # sample size n
truemean = 8
da = matrix(rnorm(n*m, mean = truemean, sd = 1), nrow = m)
esmean = rowMeans(da) # sample mean
library(matrixStats)
essd = rowSds(da) # sample SD
upperci = esmean + 1.96*essd/sqrt(n) # interval upper limit
lowerci = esmean - 1.96*essd/sqrt(n) # interval lower limit
l = 0
for (k in 1:m){
  if (upperci[k] >= truemean & lowerci[k] <= truemean)
    l = l + 1
} # Determine if the true mean is inside the interval
l/m # Percentage of truth
#[1] 0.9425 # which is approximately 0.95
You can also plot and visualize these simulations. See the exercise problems of this
chapter.
Similarly, the probability for the true mean to be inside the error bar x̄ ± SE is 0.68. See
Figure 3.2 for the confidence intervals at 95% and 68% confidence levels.
When the sample size n goes to infinity, the error margin (EM) goes to zero. Accordingly,
the sample mean is equal to the true mean. This conclusion is correct with 95% probability,
and wrong with 5% probability.
One can also understand the sample confidence interval through a new variable:

z = (x̄ − µ)/(s/√n),    (3.9)

which is a normally distributed variable with mean equal to zero and standard deviation equal to one, i.e., it is a standard normal distribution. The variable y = −z also satisfies the standard normal distribution. So, the probability of −1 < z < 1 is 0.68, and that of −1.96 < z < 1.96 is 0.95. The set −1.96 < z < 1.96 is equivalent to x̄ − 1.96s/√n < µ < x̄ + 1.96s/√n. Thus, the probability of the true mean being in the confidence interval of the sample mean x̄ − 1.96s/√n < µ < x̄ + 1.96s/√n is 0.95. This is visually displayed in Figure 3.2.
In addition, the formulation x̄ = µ + zs/√n corresponds to a standard statistics problem for an instrument with observational errors:

y = x ± ε,    (3.10)
Figure 3.2 Schematic illustration of confidence intervals and confidence levels of a sample mean for a large sample size. The confidence interval at the 95% confidence level is between the two red points, and that at 68% is between the two blue points. SE stands for the standard error, and 1.96 SE is approximately regarded as 2 SE in this figure.
where ε stands for error, x is the true but never-known value to be observed, and y is the observational data. Thus, data are equal to the truth plus random errors. The expected value of the error is zero and the standard deviation of the error is s/√n, also called the standard error.
The 95% confidence level comes into the equation when we require that the observed
value must lie in the interval (µ − EM, µ + EM) with a probability equal to 0.95. This
corresponds to the requirement that the standard normal random variable z is found in a
symmetric interval (z− , z+ ) with a probability equal to 0.95, which implies that z− = −1.96
and z+ = 1.96. Thus, the confidence interval of the sample mean at the 95% confidence
level is
(x̄ − 1.96s/√n, x̄ + 1.96s/√n),    (3.11)

or

(x̄ − z_{α/2} s/√n, x̄ + z_{α/2} s/√n),    (3.12)
where zα/2 = z0.05/2 = 1.96. So, 1 − α = 0.95 is used to represent the probability of the
variate lying inside the confidence interval, while α = 0.05 is the “tail probability” outside
the confidence interval. Outside the confidence interval means occurring on either the left
side or the right side of the distribution indicated by the red area of Figure 3.2. Each side
represents α/2 = 0.025 tail probability.
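These probabilities can be verified with a few lines of R; the sketch below simply evaluates the standard normal quantile and tail areas quoted above.
# Relation between the confidence level and z_alpha/2
alpha = 0.05
qnorm(1 - alpha/2) # 1.959964, i.e., z_alpha/2 is approximately 1.96
pnorm(1.96) - pnorm(-1.96) # 0.95, probability inside 1.96 standard errors
pnorm(1) - pnorm(-1) # 0.68, probability inside one standard error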
Figure 3.2 can be plotted by the following computer code.
# Python plot Fig. 3.2 (excerpt): the imports and the definitions of
# intg, a, b, a1, b1, and the polygon patches poly_, poly, poly1
# appear earlier in the full listing and are omitted here
x = np.linspace(-3, 3)
y = intg(x)
# Set up figure
fig, ax = plt.subplots(figsize = (12, 10))
ax.plot(x, y, 'black', linewidth = 1)
ax.set_ylim(bottom = 0)
ix = np.linspace(a, b)
ix1 = np.linspace(a1, b1)
# Adding polygons for the shaded probability regions
ax.add_patch(poly_)
ax.add_patch(poly)
ax.add_patch(poly1)
ax.set_xticks((a, a1, 0, b1, b))
ax.set_xticklabels(('', '', '', '', ''))
plt.show()
In practice, we often regard 1.96 as 2.0, and the 2σ -error bar as the 95% confidence
interval.
Example 3.1 Figure 3.3 shows the January surface air temperature anomalies of
Edmonton, Canada, based on the 1971–2000 monthly climatology. We compute the mean
and its confidence interval for the first 50 years from 1880 to 1929 at the 95% confidence
level. The result is as follows:
(i) Sample mean: x̄ = −2.47◦ C (with sample size n = 50).
(ii) Sample standard deviation: s = 4.95◦ C.
(iii) Error margin at the 95% confidence level without the consideration of serial correlation: EM = 1.96 × s/√n = 1.37°C.
(iv) Confidence interval of the mean at the 95% confidence level: x̄ ± EM, i.e., [−3.84, −1.10]°C.
Figure 3.3 The January mean surface air temperature anomalies from 1850 to 2016 based on the 1971–2000 monthly climatology for the 5° × 5° grid box centered around (52.5°N, 112.5°W), which covers Edmonton, Canada (data source: NOAAGlobalTemp Version 3.0). The 1880–1929 mean is −2.47°C and the 1997–2016 mean is 1.53°C.
See Figure 3.3 for the mean x̄ = −2.47◦ C and the confidence interval [−3.84, −1.10]◦ C.
These statistics can be computed using the following computer code.
# R code for Edmonton data statistics from NOAAGlobalTemp
setwd('/Users/sshen/climstats')
da1 = read.csv("data/Lat52.5_Lon-112.5.csv", header = TRUE)
dim(da1)
#[1] 1642 3 # 1642 months: Jan 1880 - Oct 2016
da1[1:2,]
#        Date   Value
# 1 1880-01-01 -7.9609
# 2 1880-02-01 -4.2510
jan = seq(1, 1642, by = 12)
Tjan = da1[jan,]
TJ50 = Tjan[1:50, 3]
xbar = mean(TJ50)
sdEdm = sd(TJ50)
EM = 1.96*sdEdm/sqrt(50)
CIupper = xbar + EM
CIlower = xbar - EM
round(c(xbar, sdEdm, EM, CIlower, CIupper), digits = 2)
#[1] -2.47  4.95  1.37 -3.84 -1.10
When the serial correlation is considered, the error margin will be larger. We will discuss
this later in this chapter.
# Python code for the Edmonton statistics and Fig. 3.3 (excerpt):
# the data reading and the variables TJ50, t, m19972006, m18801929
# are defined earlier in the full listing and are omitted here
EM = 1.96*stdev(TJ50)/np.sqrt(50)
plt.plot(t[117:137], np.repeat(m19972006, 20),
         color = 'blue')
plt.plot(t[0:50], np.repeat(m18801929, 50),
         color = 'darkgreen')
plt.text(1985, -15,
         '1997-2016' '\n' 'mean = 1.53$\degree$C',
         color = 'blue', fontsize = 20)
plt.text(1880, -15,
         '1880-1929' '\n' 'mean = -2.47$\degree$C',
         color = 'blue', fontsize = 20)
plt.show()
One type of decision-making process is based on deterministic logic. If it rains, you cover yourself when walking outside, perhaps with an umbrella or a raincoat. If the room is hot, you turn on the air conditioning.
Another type of decision-making involves incomplete information and uncertainties.
This process is not deterministic but probabilistic. For example, Figure 3.3 seems to sug-
gest that in the last 20 years (1997–2016), Edmonton January temperature anomalies are
greater than zero. However, the temperature goes up and down. It is unclear whether the
suggested “greater than zero” claim is true. We need data and procedures to make a deci-
sion: accept or reject the claim. This decision can be correct or wrong with four possible
outcomes, as shown in Table 3.1, the decision contingency table.
Table 3.1 The decision contingency table

                                               DECISION
  ACTUAL       Accept H0                               Accept H1
  H0 True      Correct decision                        Type I error: false alarm
               Confidence level: 1 − α                 Significance level: α
                                                       Error probability: α
                                                       Risk of false cases
  H1 True      Type II error: missed detection         Correct decision
               Error probability: β                    Power: 1 − β
The true and false determination of an event, such as the positive average of anomalies,
and the correct and wrong decisions on the event, form four cases: TT (true event cor-
rectly detected as true), FF (false event correctly detected as false), TF (true event, but
wrongly detected as false), and FT (false event, but wrongly detected as true). TF is a
missed detection, and FT is a false alarm. These four cases form a 2 × 2 table, called the
contingency table for decision-making. The correct detections (TT and FF) are diagonal ele-
ments, and the wrong detections (TF and FT) are off diagonal elements. Table 3.1 describes
the contingency table in formal statistical terms.
We wish to make a decision based on the data: can we statistically justify that a sugges-
tion from data makes sense, with the significance level 0.05, or 5%? Figure 3.4 shows the
significance level as the tail probability on the right of the critical value xc .
Making a decision with probability considerations goes through the following five steps.
The process is called hypothesis testing, which has to consider uncertainties.
The first step is to assume that the situation under consideration is the same as what
was before: business as usual. This is called the null hypothesis, denoted by H0 . Here,
the word “null” may be understood as a synonym for nullified, abandoned, nonexistent,
or unchanged. However, the business-as-usual is not deterministic and has uncertainty.
Thus, H0 has a probabilistic distribution, illustrated by the solid bell-shape curve shown in
Figure 3.4.
The second step is to state the hypothesis of a changed scenario. In general, this occurs
with a small probability, such as a fire accident in a building. This hypothesis is called the
alternative hypothesis, denoted by H1 , whose distribution is shifted to the right in Figure 3.4
by the “Difference” marked in the figure. When the difference is large, the two distributions
have little overlap. Because H1 occurs with a small probability, accepting H1 requires strong
evidence for H1 to be established. For example, strong evidence must be present in the
diagnostic process in order to conclude that a patient has cancer; a minor symptom is
insufficient. However, the required strength of evidence depends on the specific problem.
The fire alarm does not need strong evidence, while the cancer conclusion requires very
strong evidence. We see many false fire alarms, but very few false cancer diagnoses. Global
warming is in the nature of a fire alarm because the stakes are too high.
The third step is directly related to the aforementioned decision: what is the small-tail
probability α , indicated by the pink area in Figure 3.4, that the alternative scenario is true?
When the difference is large, we can be more certain that the alternative hypothesis should
be true. The chance of being wrong in accepting H1 is α, because the incorrect acceptance of H1 implies that the reality is H0, but we have incorrectly rejected H0. We thus should use
the tail distribution of H0, i.e., the pink region, which is determined by the H0 distribution and the critical value xc.

Figure 3.4 The distributions of null and alternative hypotheses: critical value, p-value, and significance level α.

Hence, when α is chosen and the H0 distribution is predetermined
by the nature of the problem, then xc can be calculated. For example, if H0 is normally
distributed N(0, 1), and α = 0.025, then xc = 1.96.
The fourth step is to calculate an index from the data, such as mean and variance. The
index is called the test statistic. The distributions of H0 and H1 depend on the definition of
the test statistic. For example, the distributions are normal if the test statistic xs is the mean
with a large sample size; or the distributions are student-t if the sample size is small and the
standard deviation of the population is unknown and is estimated as the sample standard
deviation.
The fifth step is to use the xs value to make a decision based on the distributions of H0 and
H1 . If xs is in the small-tail probability region of the H0 distribution (i.e., xs > xc in Figure
3.4), then the alternative hypothesis H1 is accepted. The probability of this decision being
wrong is equal to α . This is the chance of a false alarm: claiming a significant difference
but the reality is business as usual. Namely, a fire alarm is triggered when there is no fire.
This is called the Type I error, or α -error, in the statistical decision process. See Table 3.1
and next section for further explanation of the decision errors.
Figure 3.4 may be plotted by the following computer code.
# R plot Fig. 3.4: Null and alternative hypotheses
setwd("/Users/sshen/climstats")
setEPS() # save the .eps figure file to the working directory
postscript("fig0304.eps", height = 7, width = 10)
x = seq(-3, 6, len = 1000)
par(mar = c(0.5, 0, 2.5, 0.0))
plot(x, dnorm(x, 0, 1), type = "l",
     lwd = 2, bty = 'n')
# (the remaining plotting commands of the full listing are omitted here)
When the distributions of H0 and H1 are determined, the values of α and β are determined
by the critical value xc . If xc is large, it requires a larger difference in order to reject H0 and
accept H1 . Accepting H1 when H0 is true, is a false alarm, i.e., the fire siren is triggered, but
there is no fire. This is called a Type I error of decision. The probability of Type I error is
equal to α : the tail probability of H0 because H0 is true. A smaller α means a less sensitive
detector, and less chance of a false alarm, because a smaller α means a larger xc , and hence
a larger difference is needed for accepting H1 , i.e., much stronger evidence is required to
claim a fire event.
Of course, a smaller α means a larger β: the tail probability of H1. A small α for an
insensitive detector means a larger chance of missing detection: a real fire not detected,
a patient’s real cancer not diagnosed, or a real climate change not identified. This kind of
missed detection is called a Type II error: H1 is true but is wrongly rejected. The probability
of the wrong rejection of H1 is β . To make β smaller, we have to make α larger, to be a
more sensitive detector.
Then how can we minimize our probability of wrong decision? Figure 3.4 suggests two
scenarios. The first is wide separation of the two distributions H0 and H1 . If the difference
is large, then the H1 is evident, and accepting H1 has a very small chance of being wrong.
The second is that two distributions are slim and tall, namely, having a very small standard
deviation s, which means accurate sensors, or a large sample size n. This is the common-sense practice of seeing several doctors for separate opinions when a patient feels seriously ill but the cause has not been diagnosed.
If the instruments and sample size are determined, then a more sensitive detector (i.e.,
a larger α ) implies a smaller probability of Type II error (i.e., a smaller β ). One cannot
require both α and β to be small at the same time. Thus, you may choose a predetermined
α appropriate to your problem. If, for example, you design a fall detection sensor to go with
smart rollator walkers for seniors, this could be a matter of life or death, and thus must be
very sensitive so a larger α , say 10% or higher, should be used. Our Earth’s climate change
detection must also be very sensitive, since we cannot afford to allow a missed detection
to ruin our only Earth. When testing the quality of shoes, on the other hand, the detection
does not need to be that sensitive. A smaller α , say 5%, would be appropriate.
The significance of difference to be checked in hypothesis testing is measured by the
significance level α and the power of statistical inference
P = 1−β (3.13)
in Figure 3.4. The power is between α and 1. Two extreme cases corresponding to P = 1
and P = α are as follows:
All the other cases are in between these two extremes. The tail probability on the right
of the test statistic is called the p-value. As discussed earlier, if the p-value is less than α ,
then we can conclude that the alternative hypothesis is accepted at the α significance level,
and that the statistical inference power is 1 − β. The p-value, determined by xs and the H0 distribution, can be interpreted as the probability of obtaining sample data at least as extreme as those observed when H0 is true. For
example, a medical study produced a p-value 0.01. This means that if the medication has
no effect (the H0 scenario) as a whole, one would conclude a valid medical effect in only
1% of studies due to random data error. A large p-value thus means that there is a larger
chance for the sample data to support the null hypothesis. Thus, the p-value measures the
compatibility of the sample data and null hypothesis.
The sample size is n = 20. We will show later that the samples are independent and have no serial correlation. Since n < 30, the sample size is treated as small. The mean of the samples should follow a t-distribution. The null and alternative hypotheses are H0: x̄ = 0°C and H1: x̄ > 0°C, respectively.
The numerical results are as follows:
The test statistic ts > tc is in the rejection domain of H0. Thus, H1 is accepted. The mean temperature anomalies of Edmonton from 1997 to 2016 are significantly greater than zero at the 5% significance level. The p-value at ts = 2.33 is 0.0154, which is less than 2%. This substantiates the acceptance of H1.
The computer code for calculating this result is as follows.
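A minimal R sketch of this calculation (the data file and the January row indices are assumptions carried over from the earlier Edmonton listing):
# Sketch: one-sample t-test for the 1997-2016 Edmonton January anomalies
da1 = read.csv("data/Lat52.5_Lon-112.5.csv", header = TRUE)
TJ = da1[seq(1, 1642, by = 12), 3] # January anomalies, 1880-2016
TJ20 = TJ[118:137] # the 20 Januaries of 1997-2016
n = 20
ts = mean(TJ20)/(sd(TJ20)/sqrt(n)) # t-statistic, about 2.33
tc = qt(0.95, df = n - 1) # one-sided critical value at the 5% level
pvalue = 1 - pt(ts, df = n - 1) # about 0.015
c(ts, tc, pvalue)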
Randomly flipping coins, tossing dice, or drawing playing cards can generate independent
samples. Each sample is independent of the others. As derived earlier, if n samples x1, x2, . . ., xn are independent and from the same population, and each sample has the same standard deviation σ estimated by
s = √( ∑_{i=1}^{n} (x_i − x̄)² / (n − 1) ),    (3.14)

where

x̄ = (1/n) ∑_{i=1}^{n} x_i    (3.15)

is the sample mean, then the standard deviation of the mean x̄, also known as the standard error of the mean, is

SE = s/√n.    (3.16)
However, many (if not most) climate datasets are recorded from a sequence of times
and violate this independence assumption. For example, a string of n monthly temperature
anomaly records at a location may be serially correlated, i.e., not independent. Then, the standard error of the mean of the n records is larger than s/√n.
There is often a serial correlation and this could reduce the sample size n to a smaller
value, the effective sample size neff . This is important in estimating standard deviation σ̂ of a
mean x̄, because neff < n and hence s/√n leads to an erroneous underestimate of the spread of expected values in the mean estimate x̄. The correct estimate of the standard error for the mean should be

SE = s/√neff.    (3.17)
Thus, if we overcount the number of independent pieces of information, we will be led into
the trap of claiming more confidence in a parameter’s value than we are entitled to.
Use ρ to denote the lag-1 autocorrelation coefficient. Figure 3.5 shows the AR(1) time series and their lagged autocorrelations for two different values of ρ: 0.9 and 0.6. The figure shows that AR(1) processes have an initial memory depending on ρ and become random after the memory. With the ρ value, the neff value can be computed from

neff ≈ n (1 − ρ)/(1 + ρ).    (3.19)
Figure 3.5 Realizations of AR(1) time series (a) and (b) for n = 1,000, but only the first 200 are shown, and their autocorrelations (c) and (d). The blue dots in Panels (c) and (d) are the exact correlations, and the black circles indicate the correlations computed from the simulated data.
The parameter ρ is the lag-1 correlation coefficient and can be estimated from the data:

ρ̂ = ∑_{t=2}^{n} (x_t − x̄_+)(x_{t−1} − x̄_−) / [ √(∑_{t=2}^{n} (x_t − x̄_+)²) √(∑_{t=2}^{n} (x_{t−1} − x̄_−)²) ],    (3.20)

where

x̄_+ = ∑_{t=2}^{n} x_t / (n − 1),    (3.21)

x̄_− = ∑_{t=2}^{n} x_{t−1} / (n − 1).    (3.22)
If there is no serial correlation, then ρ̂ = 0 and neff = n. White noise is such an example.
If there is a perfect correlation, then ρ̂ = 1 and neff = 0, for example if the data are all the
same, which is a trivial case and does not need further analysis.
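The following R sketch applies formula (3.19) to a simulated AR(1) series, with the lag-1 correlation estimated by cor() as a close stand-in for formula (3.20); the series and its parameters are illustrative assumptions.
# Sketch: lag-1 autocorrelation and effective sample size of an AR(1) series
set.seed(1)
n = 1000; rho = 0.6
x = as.numeric(arima.sim(list(ar = rho), n = n)) # simulated AR(1) data
rhohat = cor(x[2:n], x[1:(n - 1)]) # estimate of the lag-1 correlation
neff = n*(1 - rhohat)/(1 + rhohat) # formula (3.19)
c(rhohat, neff) # rhohat near 0.6, neff near 0.25*n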
The approximation formula for effective degrees of freedom can have a large error for a
large ρ close to 1.0, and a short time series (i.e., a small n value). The estimation formula
(3.20) for ρ also has a large error for a short time series data string. In practice, you may use
your own experience and the nature of the problem to determine when the approximation
formulas (3.19) and (3.20) are acceptable. More discussion of simulations and applications involving independence and serial correlation of time series is included in Chapters 6 and 7, where sampling errors and forecasting are described.
# Python plot Fig. 3.5 (continuation of the full listing; the figure and
# axes setup and the rho = 0.9 panels appear earlier and are omitted here)
                  size = 25)
# theoretical
rho_theory = np.zeros(M)
for m in range(M):
    rho_theory[m] = lamb**m
axs[1, 0].scatter(np.linspace(1, 10, 10), rho_theory,
                  color = 'b', marker = '+')
axs[1, 0].tick_params(axis = 'both', which = 'major',
                      labelsize = 15)
# For lambda = 0.6
lamb = 0.6; alpha = 0.25; y0 = 4
n = 1000 # length of array
x = np.zeros(n)
w = np.random.normal(size = 1000)
x[0] = y0
for t in range(1, n):
    x[t] = x[t - 1]*lamb + alpha*w[t]
# time values from 200-400
t200_400 = np.linspace(201, 400, 200)
axs[0, 1].plot(t200_400, x[200:400], color = 'k')
axs[0, 1].set_title(r"AR(1) Time Series: $\rho = 0.6$",
                    size = 25)
axs[0, 1].set_xlabel('Time', size = 20)
axs[0, 1].set_ylabel('x', size = 20)
axs[0, 1].tick_params(axis = 'both', which = 'major',
                      labelsize = 15)
# Python plot Fig. 3.5 continued
# correlation
M = 10
rho = np.zeros(M)
for m in range(M):
    rho[m] = np.corrcoef(x[0:n - m], x[m:n + 1])[0, 1]
axs[1, 1].scatter(np.linspace(1, 10, 10), rho,
                  color = 'none', edgecolor = 'k')
axs[1, 1].set_ylim(0, 1.2)
axs[1, 1].set_xlabel('Time lag', size = 20)
axs[1, 1].set_ylabel('Auto-correlation', size = 20)
axs[1, 1].set_title(r'Lagged auto-correlation: $\rho = 0.6$',
                    size = 25)
# theoretical
rho_theory = np.zeros(M)
for m in range(M):
    rho_theory[m] = lamb**m
axs[1, 1].scatter(np.linspace(1, 10, 10),
                  rho_theory, color = 'b', marker = '+')
axs[1, 1].tick_params(axis = 'both', which = 'major',
                      labelsize = 15)
Example 3.2 Figure 3.6 shows the mean NOAAGlobalTemp data from 1880 to 2019. The 1920–1949 mean is indicated by the leftmost thick solid green horizontal line. We calculate the confidence interval of the temperature data in the period 1920–1949. The results are as follows: x̄ = −0.38°C, s = 0.16°C, ρ̂ = 0.78, n = 30, neff = 4, tc = t_{0.975,4} = 2.78.
Figure 3.6 NOAA global average annual mean temperature from 1880 to 2019, with a linear trend of 0.74°C per century (data source: NOAAGlobalTemp V5, 2020).
The confidence interval is computed by substituting the above into the following formula:

x̄ ± tc × s/√neff.    (3.23)
You may use the following computer code to compute the statistics of this example.
# R code for the 1920-1949 NOAAGlobalTemp statistics
setwd("/Users/sshen/climstats")
da1 = read.table("data/NOAAGlobalTempAnn2019.txt",
                 header = FALSE) # read data
Ta = da1[41:70, 2]
n = 30
xbar = mean(Ta)
s = sd(Ta)
r1 = cor(Ta[1:29], Ta[2:30])
neff = n*(1 - r1)/(1 + r1)
neff = 4 # [1] 3.677746 approximately equal to 4
tc0 = qt(0.975, 29, lower.tail = TRUE)
tc = qt(0.975, 3, lower.tail = TRUE)
CI1 = xbar - tc0*s/sqrt(n); CI2 = xbar + tc0*s/sqrt(n)
CI3 = xbar - tc*s/sqrt(neff); CI4 = xbar + tc*s/sqrt(neff)
print(paste('rho =', round(r1, digits = 2),
            'neff =', round(neff, digits = 2)))
#[1] "rho = 0.78 neff = 4"
round(c(xbar, s, r1, CI1, CI2, CI3, CI4, tc), digits = 2)
#[1] -0.38  0.16  0.78 -0.44 -0.32 -0.63 -0.13  3.18
# Python code for the 1920-1949 NOAAGlobalTemp statistics (excerpt)
neff = n*(1 - r1)/(1 + r1)
print('rho1 =', np.round(r1, 2), 'neff =', np.round(neff, 2))
neff = 4 # rho1 = 0.78 neff = 4 # 3.68 rounded to 4
tc0 = scipy.stats.t.ppf(q = 0.975, df = 29)
tc = scipy.stats.t.ppf(q = 0.975, df = 4)
CI1 = xbar - tc0*s/np.sqrt(n); CI2 = xbar + tc0*s/np.sqrt(n)
CI3 = xbar - tc*s/np.sqrt(neff); CI4 = xbar + tc*s/np.sqrt(neff)
result = np.array([xbar, s, r1, CI1, CI2, CI3, CI4, tc])
print(np.round(result, 2))
#[-0.38  0.16  0.78 -0.44 -0.32 -0.63 -0.13  2.78]
Example 3.3 Is there a significant difference between the 1950–1979 and 1980–2009
means, indicated by the two thick horizontal green lines on the right in Figure 3.6, of the
global average temperature anomalies based on the NOAAGlobalTemp data V5?
The relevant statistics are as follows:
1950–1979: x̄1 = −0.29°C, s1 = 0.11°C, n = 30, ρ1 = 0.22, n1eff = 19
1980–2009: x̄2 = 0.12°C, s2 = 0.16°C, n = 30, ρ1 = 0.75, n2eff = 4
The t-statistic is

ts = (x̄1 − x̄2) / √(s1²/n1eff + s2²/n2eff) = −4.89.    (3.24)

The degrees of freedom are n1eff + n2eff − 2 = 21. The critical t value for x̄1 ≠ x̄2 at the two-sided 5% significance level is −2.08 or 2.08. Thus, ts = −4.89 is in the H0 rejection
region and we conclude that there is a significant difference between the 1950–1979 and
the 1980–2009 temperature anomalies.
If the serial correlation is not considered, then ts = −9.53, and H0 is also rejected. This
is a stronger conclusion of rejection and has a smaller p-value.
These statistics may be computed using the following computer code.
# R code for the 1980-2009 NOAAGlobalTemp statistics (excerpt)
Ta = da1[101:130, 2]
n = 30
xbar = mean(Ta)
s = sd(Ta)
r1 = cor(Ta[1:29], Ta[2:30])
neff = n*(1 - r1)/(1 + r1)
neff # [1] 4.322418 approximately equal to 4
neff = 4
round(c(xbar, s, r1, neff), digits = 2)
#[1] 0.12 0.16 0.75 4.00
# Python code for the 1980-2009 NOAAGlobalTemp statistics (excerpt)
Ta = da1[100:130] # 1980-2009
n = 30
xbar = np.mean(Ta)
s = np.std(Ta)
r1, _ = pearsonr(Ta[0:29], Ta[1:30])
neff = n*(1 - r1)/(1 + r1)
print(neff) # 4.3224 approximately equal to 4
neff = 4
res2 = np.array([xbar, s, r1, neff])
print(np.round(res2, 2))
# 0.12 0.16 0.75 4.00
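The two-sample t-statistic of Equation (3.24) is not computed in the excerpt above; a minimal R sketch using the statistics quoted in this example is:
# Sketch: t-statistic of Example 3.3 from the quoted statistics
x1bar = -0.29; s1 = 0.11; n1eff = 19 # 1950-1979
x2bar = 0.12; s2 = 0.16; n2eff = 4 # 1980-2009
ts = (x1bar - x2bar)/sqrt(s1^2/n1eff + s2^2/n2eff) # about -4.89
df = n1eff + n2eff - 2 # 21 degrees of freedom
qt(0.025, df) # about -2.08, the two-sided critical value
c(ts, df)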
This section uses examples to answer two types of questions using the chi-square distribu-
tion. First, given the weather distribution of a location based on the long historical record, is
the observed weather of this month significantly different from the given distribution? Sec-
ond, after fitting climate data to a certain distribution, how good is the fit? The chi-square
statistic measures the difference between the histogram and the proposed PDF, discrete or
continuous. A small chi-square statistic means a good fit.
Based on past observations, the mean numbers of clear, partly cloudy, and cloudy days of Dodge City, Kansas, USA, in October are 15, 7, and 9, respectively (data source: National Weather Service, NOAA, www.weather.gov/ddc/cloudydays).
If a future October observes 9 clear days, 6 partly cloudy days, and 16 cloudy days, is this October weather significantly different from the historical record?
The expected data are E1 = 15, E2 = 7, E3 = 9, respectively. The observed data are O1 =
9, O2 = 6, O3 = 16. We use the chi-square statistic to measure the differences between the
observed and expected data as follows:
χ² = ∑_{i=1}^{3} (Oi − Ei)²/Ei = 7.9873.    (3.25)
If this χ 2 -statistic is small, then there is no significant difference. Then, how small should
the χ 2 -statistic be for a given significance level? This decision question can be answered
by the following χ 2 calculations.
This chi-square distribution has 2 degrees of freedom, equal to the number of categories
minus the number of constraints. We have three categories, and one constraint. The con-
straint is that the total number of days of the three categories has to be equal to n = 31
for October. The tail probability of the χ 2 (2) distribution in the region x > xs = 7.9873
is 0.0184, which is the p-value and is less than 5%. Thus, this future observed October
weather is significantly different from the past record at the 5% significance level.
Another argument is that the statistic xs = 7.9873 > xc = 5.99 hence is in the rejection
region. Thus, the null hypothesis of no significant difference should be rejected.
The computer code for computing the above statistics is as follows.
# Python code: Chi-square test for Dodge City, Kansas, USA
from scipy.stats import chi2
((9 - 15)**2)/15 + ((6 - 7)**2)/7 + ((16 - 9)**2)/9
# 7.987302 # This is the chi-square statistic
1 - chi2.cdf(7.987302, df = 2) # Compute the tail probability
# 0.01843229 # p-value
1 - chi2.cdf(5.99, df = 2) # Compute the tail probability
# 0.05003663 # Thus, xc = 5.99
In Chapter 2, the Omaha, Nebraska, June precipitation from 1948 to 2019 was fitted to a
Gamma distribution with shape equal to 1.5176, and rate equal to 0.0189. See Figure 3.7
for the histogram and its fit to the Gamma distribution.
Figure 3.7 Histogram and its fit to a Gamma distribution for the June precipitation data from 1948 to 2019 for Omaha, Nebraska, USA (data source: Climate Data Online (CDO), NOAA, www.ncdc.noaa.gov/cdo-web).
library(fitdistrplus)
fitdist(y, distr = "gamma", method = "mle")
#       estimate   Std. Error
# shape 1.51760921 0.229382819
# rate  0.01889428 0.003365757
Next we examine goodness of fit based on the chi-square test. The chi-square statistic
can be computed by the following formula:
χ² = ∑_{i=1}^{12} (Oi − Ei)²/Ei = 6.3508.    (3.26)
# R code: Chi-square test for the goodness of fit: Omaha prcp
setwd("/Users/sshen/climstats")
Omaha = read.csv("data/OmahaP.csv", header = TRUE)
dim(Omaha)
#[1] 864 7 : Jan 1948 - Dec 2019: 864 months, 72 years * 12
daP = matrix(Omaha[, 7], ncol = 12, byrow = TRUE)
y = daP[, 6] # Omaha June precipitation data
n = 72 # Total number of observations
m = 12 # 12 bins for the histogram in [0, 300] mm
p1 = pgamma(seq(0, 300, by = 300/m),
            shape = 1.5176, rate = 0.0189)
p1[m + 1] = 1
p2 = p1[2:(m + 1)] - p1[1:m]
y # The 72 years of Omaha June precipitation
cuts = cut(y, breaks = seq(0, 300, by = 300/m))
# The cut function in R assigns values into bins
Oi <- c(t(table(cuts))) # Extract the cut results
Ei = round(p2*n, digits = 1) # Theoretical results
rbind(Oi, Ei)
#    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
# Oi    9 17.0 16.0  7.0  8.0  5.0  5.0  1.0  2.0   1.0   1.0   0.0
# Ei   13 15.6 12.8  9.5  6.8  4.7  3.2  2.1  1.4   0.9   0.6   1.2
sum(((Oi - Ei)^2)/Ei)
#[1] 6.350832 # Chi-square statistic
1 - pchisq(19.68, df = 11) # Compute the tail probability
#[1] 0.04992718 # Thus, xc = 19.68
1 - pchisq(6.3508, df = 11) # Compute the tail probability
#[1] 0.8489629 # p-value
The critical value for the 5% significance level of the χ 2 (11) is xc = 19.68, since
P(χ 2 (11) > 19.68) = 0.04993. Our statistic xs = 6.3508 is much smaller than xc , which
implies that there is no significant difference between the histogram and the fitted Gamma
distribution. Therefore, we conclude that the Gamma distribution is a very good fit to the
data.
Of course, you can use the computed p-value p = 0.8490 > 0.05 to conclude the good
fit.
The chi-square test for goodness of fit examines the sum of the squared differences between
the histogram of the observed data and the expected PDF. The Kolmogorov–Smirnov (K-S)
test for goodness of fit examines the maximum difference between the cumulative percent-
age of the observed data and the expected CDF. The K-S test can be applied to check
differences between two cumulative distributions, such as if a dataset satisfies a given dis-
tribution, or if two datasets are from the same distribution. However, the K-S test assumes
that the parameters of the expected CDF are independent of the sample data (i.e., the
observed data), while the chi-square test does not have such an assumption.
The test statistic is
Dn = max_x |Fobs(x) − Fexp(x)|,    (3.27)
where Fexp (x) is the expected CDF, and Fobs (x) is the cumulative distribution from n inde-
pendent and identically distributed samples and can be computed by ranking the observed
data. The R command for computing Fobs (x) is ecdf(obs.data).
If n is sufficiently large, √n Dn approximately follows the Kolmogorov distribution, which can help determine the critical value for Dn. The Kolmogorov distribution is quite complex and
is not discussed in this book. In applications, computer software packages for the K-S test
give the p-value, from which the decision of accepting or rejecting the null hypothesis can
be made: accept the null hypothesis at the 5% significance level if the p-value is greater
than 0.05.
We use R command x = rnorm(60) to generate data x. Then check if x is from the
standard normal population N(0, 1) by using the K-S test. The answer is of course, yes.
The computer code for the K-S test is as follows.
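The original listing is not reproduced in this excerpt; a minimal R sketch of the test just described is:
# Sketch: K-S test of whether x comes from N(0, 1)
x = rnorm(60) # 60 samples from the standard normal
ks.test(x, "pnorm", mean = 0, sd = 1)
# A large p-value (about 0.5 here) retains the null hypothesis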
The large p-value 0.4997 implies that the null hypothesis is accepted and the data are
indeed from the standard normal population.
If the data u are from a uniform distribution, then the K-S test will give a very small p-value, approximately zero (p-value = 1.366e-14), as shown in the K-S test computer code.
Thus, there is a significant difference between the data distribution and N(0, 12 ). Of course,
this is expected, since these sample data u are from a uniform distribution and are not from
the N(0, 12 ) population.
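A minimal sketch of that uniform-data test (the sample size 60 is an assumption matching the previous example):
# Sketch: K-S test of uniform data against N(0, 1)
u = runif(60) # uniform data on (0, 1)
ks.test(u, "pnorm", mean = 0, sd = 1)
# The p-value is essentially zero: u is not from N(0, 1)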
The K-S test is applicable to any distribution, and is often used in climate science.
However, the K-S test should not be used when the parameters of the expected CDF are
estimated from the sample data. If the sample data are used to estimate the model parame-
ter and to compute the K-S statistic Dn , then the Dn value may be erroneously reduced. The
consequence is the exaggeration of Type II error: the null hypothesis should be rejected,
but is not. Thus, it may lead to a false negative and miss detecting a signal.
The following computer code tests that the standard anomalies of the monthly Edmon-
ton temperature data from January 1880 to December 2015 are from the standard normal
population. The almost-zero p-value implies the significant difference between the Edmon-
ton data and the standard normal distribution. Figure 3.8 shows the difference between the
cumulative distribution from data and that from N(0, 1).
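The listing itself is not shown in this excerpt; a minimal R sketch (the file name, column, and row range are assumptions based on the Mann–Kendall example later in this chapter) is:
# Sketch: K-S test of the standardized Edmonton anomalies against N(0, 1)
da1 = read.csv("data/EdmontonT.csv", header = TRUE)
x = da1[1:1632, 3] # Jan 1880 - Dec 2015, 1,632 months (assumed rows)
xa = (x - mean(x))/sd(x) # standardized anomalies
ks.test(xa, "pnorm", mean = 0, sd = 1)
# Dn about 0.074 with a near-zero p-value: reject N(0, 1)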
Figure 3.8 Cumulative distributions of the observed data and a model (K-S test: Dn = 0.0742, p-value ≈ 0). The observed data are the standardized monthly surface air temperature anomalies of Edmonton from January 1880 to December 2015. The model is N(0, 1²), where Dn is equal to the maximum difference between the two CDF curves.
You can use the K-S test to check if the Omaha June precipitation data from 1948 to 2019
follows a Gamma distribution. The p-value of the K-S test is 0.8893. The null hypothesis
is retained, and the Gamma distribution is a good fit to the Omaha June precipitation data.
The K-S test leads to the same conclusion as the χ 2 -test. The difference between the tests
is that the K-S test is based on a continuous distribution, while the χ 2 -test is applied to a
discrete histogram in this Omaha June precipitation example. You may use the following
computer code to compute the K-S statistic and its corresponding p-value.
# R: K-S test for Omaha June precip vs Gamma
Omaha = read.csv("data/OmahaP.csv", header = TRUE)
daP = matrix(Omaha[, 7], ncol = 12, byrow = TRUE)
y = daP[, 6] # June precipitation 1948-2019
# install.packages('fitdistrplus')
library(fitdistrplus)
omahaPfit = fitdist(y, distr = "gamma", method = "mle")
ks.test(y, "pgamma", shape = 1.51760921, rate = 0.01889428)
# D = 0.066249, p-value = 0.8893 # y is the Omaha data string
# You may verify the K-S statistic using another command
gofstat(omahaPfit)
# Kolmogorov-Smirnov statistic 0.06624946
The K-S test can also check if two sample datasets x and y are from the same population.
The R command is ks.test(x, y). The Python command is
stats.ks_2samp(x, y)
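A minimal R sketch of the two-sample version, with illustrative simulated data:
# Sketch: two-sample K-S test
x = rnorm(60)
y = rnorm(80, mean = 0.5)
ks.test(x, y) # a small p-value suggests the two samples differ in distribution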
Consider whether two random variables X and Y have a linear relationship

Y = aX + b.    (3.28)
The strength of this relationship is determined by the Pearson correlation coefficient, com-
puted from the sample data x and y. If X and Y are normally distributed, then the nonzero
correlation coefficient r determines the existence of a linear relationship:
r = ∑_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / [ √(∑_{i=1}^{n} (x_i − x̄)²) √(∑_{i=1}^{n} (y_i − ȳ)²) ],    (3.29)
where x̄ is the mean of data x = (x1 , x2 , . . . , xn ), and ȳ is the mean of data y = (y1 , y2 , . . . , yn ).
The null hypothesis is r = 0 for the nonexistence of the linear relationship. The
alternative hypothesis is, of course, r ≠ 0.
For given sample data (x, y), the t -statistic for the hypothesis testing is
t = r √( (n − 2)/(1 − r²) ),    (3.30)
which satisfies a t -distribution with degrees of freedom equal to n − 2.
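A minimal R sketch of this test with illustrative simulated data; the built-in cor.test performs the same calculation in one call:
# Sketch: t-test for a correlation coefficient
x = rnorm(40)
y = 0.5*x + rnorm(40)
r = cor(x, y)
t = r*sqrt((40 - 2)/(1 - r^2)) # formula (3.30)
cor.test(x, y)$statistic # the same t value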
To avoid erroneous conclusions, we should use caution when applying this test. First,
this test is based on a linear relationship. A very good nonlinear relationship between x
and y data may still lead to mistaken acceptance of the no-relationship null hypothesis. For
example, if 100 pairs of data (x, y) lie uniformly on a unit circle centered around zero, then
the nonlinear relationship should exist. However, the correlation coefficient is zero, and the
existence of a linear relationship is rejected. Therefore, rejection of a linear relationship by
the correlation coefficient does not exclude the existence of a nonlinear relationship. It is
important that statistical analysis should incorporate physics, and statistical results should
have physical interpretations.
Second, the t -statistic assumes that the sample data are from normal distributions.
When the sample size is large, say greater than 50, this assumption can be regarded as
approximately true even if x and y are from nonnormal distributions.
Third, the correlation coefficient is sensitive to the data and can be greatly altered by
some outliers. If the outliers are erroneous, then the test conclusion can be wrong too. For
example, Figure 3.9 shows such a case. The random data points around the origin appar-
ently show the nonexistence of any relationship. Of course, these random points do not
imply a linear relationship either. The correlation coefficient is only 0.19. However, when
an outlier data pair (1, 1) is added to the random data, the correlation coefficient becomes 0.83, which supports a strong linear relationship. Clearly, if the data pair (1, 1) is an erroneous outlier, then the rejection of the null hypothesis is a wrong decision.
Figure 3.9 Correlation coefficient is sensitive to outliers.
Figure 3.9 and its related statistics can be calculated by the following computer code.
# R plot Fig. 3.9: Correlation coefficient sensitive to outliers
setEPS() # save the .eps figure
postscript("fig0309.eps", height = 5.6, width = 8)
par(mar = c(4.2, 4.5, 2.5, 4.5))
setwd("/Users/sshen/climstats")
par(mar = c(4, 4, 0.5, 0.5))
x = c(0.2*runif(50), 1)
y = c(0.2*runif(50), 1)
plot(x, y, pch = 19, cex = 0.5,
     cex.axis = 1.4, cex.lab = 1.4)
dev.off()
# t-statistic
n = 51
r = cor(x, y)
t = r*(sqrt(n - 2))/sqrt(1 - r^2)
t
#[1] 10.19999
qt(0.975, df = 49)
#[1] 2.009575 # This is the critical t value
1 - pt(10.19999, df = 49)
#[1] 5.195844e-14 # a very small p-value
The critical t value at the two-sided 5% significance level is approximately 2.01 in the R code calculation, and the p-value is almost zero. Thus, the null hypothesis is rejected at the 5% significance level. However, this rejection is caused entirely by the outlier data pair (1, 1), and is most likely wrong. To avoid this kind of error, one may try a Kendall tau test to make a decision.
For the same data, the Kendall tau test accepts the null hypothesis, because the p-value is 0.2422 and is larger than 0.05. Thus, we conclude that the linear relationship does not
exist. The computer code for the Kendall tau test is as follows.
# R code for a Kendall tau test
# install.packages("Kendall")
library(Kendall)
x = c(0.2*runif(50), 1)
y = c(0.2*runif(50), 1)
Kendall(x, y)
# tau = 0.114, 2-sided pvalue = 0.24216
The Kendall tau score here is defined according to the positive or negative slope of the
line segment determined by two pairs of data (xi , yi ), (x j , y j ):
m_ij = (y_j − y_i)/(x_j − x_i).    (3.31)

If m_ij > 0, the two pairs are said to be concordant, else discordant. If m_ij = 0, then the two pairs are called a tie. When x_j = x_i, the data pair is excluded from the Kendall tau score calculation. Finally, the Kendall tau score is defined as

τ = (Nc − Nd)/(Nc + Nd),    (3.32)
where Nc is the number of concordant pairs and Nd is the number of discordant pairs. A tie
adds 1/2 to both Nc and Nd . In this way, the tau score is insensitive to outliers because the
tau score depends on the ratios of data (i.e., the slope), not the data themselves.
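A minimal R sketch of this counting, with a small illustrative dataset and no ties:
# Sketch: Kendall tau from concordant and discordant pairs
x = c(1, 2, 3, 4); y = c(1, 3, 2, 4)
pairs = combn(4, 2) # all 6 index pairs
slopes = (y[pairs[2, ]] - y[pairs[1, ]])/(x[pairs[2, ]] - x[pairs[1, ]])
Nc = sum(slopes > 0); Nd = sum(slopes < 0) # concordant and discordant counts
(Nc - Nd)/(Nc + Nd) # 0.667
cor(x, y, method = "kendall") # the same value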
The Kendall tau test has the advantage of being insensitive to outliers. It also does not
assume the distributions of data. However, it still requires data independence. When the
data are serially correlated, the Kendall tau score needs to be revised.
Although our Kendall tau test result has successfully accepted the null hypothesis, once
in a while the randomly generated data (x, y) may still have a small p-value in the Ken-
dall tau test, and hence reject the null hypothesis and support the existence of a linear
relationship. This implies the importance of interpreting the statistical results in terms of
climate science. There is no definite way to statistically determine whether the data (1, 1)
are erroneous and hence should be excluded, but there might be a physical way to deter-
mine whether the data (1, 1) are unlikely to occur and to justify whether a linear relationship
is reasonable. Statistical methods help identify signals in climate data, and climate science
helps interpret the statistical results. This science-data-statistics-science cycle can go on
for multiple rounds, in order to obtain correct and useful conclusions.
In the Kendall tau case, if x is monotonic time, then the slope signs depend on y data only.
We can define the so-called Mann–Kendall score for the y data:

S = [ ∑_{j=1}^{n−1} ∑_{i=j+1}^{n} sgn(y_j − y_i) ] / 2,    (3.33)

where the sgn function is defined as

sgn(x) = 1 if x > 0, 0 if x = 0, and −1 if x < 0.    (3.34)
The R command for the Mann–Kendall test is MannKendall(data). For the monthly
Edmonton standardized temperature data from January 1880 to December 2015, the Mann–
Kendall trend test result is S = 206254 and the p-value is approximately zero. Thus, there is a significant trend in the Edmonton temperature data.
The computer code is as follows.
# R code for Mann-Kendall test: Edmonton data
setwd("/Users/sshen/climstats")
# Read Edmonton data from the gridded NOAAGlobalTemp
da1 = read.csv("data/EdmontonT.csv", header = TRUE)
x = da1[, 3]
m1 = mean(x)
s1 = sd(x)
xa = (x - m1)/s1 # standardized anomalies
# install.packages('Kendall')
library(Kendall)
summary(MannKendall(xa))
# Score = 206254, Var(Score) = 492349056
# tau = 0.153, 2-sided pvalue = < 2.22e-16
This chapter has provided the basics of estimation and decision-making based on data.
These methods are commonly used in simple statistical analyses of scientific data, and
are helpful in understanding the statistical results in most literature. However, when data
involve both time and space, they often require spatial statistics methods, such as empirical
orthogonal functions for spatial patterns and principal components for temporal patterns.
The space-time data involve considerable matrix theory and will be discussed in Chapters
5 and 6.
This chapter began by discussing the fundamental concept of estimating the standard
deviation for a population and estimating the standard error of the mean from the point
of view of the accuracy assessment. A numerical simulation was provided to illustrate the
concept of standard error (see Fig. 3.1). From this concept, many statistics questions arise.
Among them, we have discussed the following topics:
(i) Search for the true mean by a confidence interval;
(ii) Contingency table for decision-making with probabilities of false alarm and missed
detection;
(iii) Steps of hypothesis testing as the standard procedure of decision-making based on
data;
(iv) Serial correlation of data, which leads to a reduced effective sample size;
(v) A suite of typical climate statistical decision examples: Determine the goodness of fit
to a model by χ 2 -distribution, determine change of a probability distribution by the
Kolmogorov–Smirnov test, determine a significant correlation by a t -test or a non-
parametric Kendall tau test, and determine a significant trend by the nonparametric
Mann–Kendall test.
Real climate data examples were provided to illustrate the applications of the statistical
methods. You can directly apply these methods to the climate data of your choice. You
can also use these methods to verify the statistical conclusions in your climate science
literature.
References and Further Reading
[1] L. Chihara and T. Hesterberg, 2011: Mathematical Statistics with Resampling and R.
Wiley.
Chapters 1–3 and 6–8 of the book are a good reference to the materials
presented in our chapter and the book has many R codes and numerical
examples.
[2] R. A. Johnson and C. K. Bhattacharyya, 1996: Statistics: Principles and Methods. 3rd
ed., John Wiley & Sons.
This book uses real climate data from both observations and numerical models,
and has computer codes of R and Python to read the data and reproduce the
figures and statistics in the book.
[4] H. Von Storch and F. W. Zwiers, 2001: Statistical Analysis in Climate Research.
Cambridge University Press.
[5] D. S. Wilks, 2011: Statistical Methods in the Atmospheric Sciences. 3rd ed., Academic
Press.
This excellent textbook is easy to read and contains many simple examples of
analyzing real climate data. It is not only a good reference manual for climate
scientists, but also a guide tool that helps scientists in other fields make sense
of the data analysis results in climate literature.
Exercises
3.1 Visualize a confidence interval.
(a) Use computer to generate 30 normally distributed numbers x1 , x2 , . . . , xn (n = 30)
with mean equal to 8 and standard deviation equal to 2.
(b) Compute the mean x̄ and standard deviation s from the 30 numbers.
(c) On the x-axis, plot the dots of x1 , x2 , . . . , xn (n = 30).
(d) On the same x-axis, plot the confidence interval based on the formula

(x̄ − 1.96 s/√n, x̄ + 1.96 s/√n).    (3.35)
(e) Plot the point for the true mean 8 and check if the true mean is inside the confidence
interval.
3.2 Visualize 20 confidence intervals on a single figure by repeating the numerical experi-
ment of the previous problem 20 times. Hint: One way to display the result is to use the
horizontal axis as the order of experiment from 1 to 20 and the vertical axis for the CI, and to
plot a blue horizontal line for the true mean. Then visually check whether the blue horizontal
line intersects with all the CIs.
3.3 (a) Repeat the numerical simulation of histograms of Figure 3.1 with a total of
2,000,000 normally distributed data according to N(6, 9²), and n = 100.
(b) Discuss the two histograms and explain the meaning of standard error.
3.4 Find the confidence interval of the NOAAGlobalTemp data of the global average annual
mean for the period of 1961–1990 when the serial correlation is taken into account.
What would be the confidence interval if the serial correlation is ignored?
3.5 Use the NOAAGlobalTemp data of the global average annual mean and go through the
t -test procedures to find out whether there is a significant difference between the 1920–
1949 and 1950–1979 temperature anomalies, when the serial correlation is taken into
account.
3.6 Use the t-test to determine if there is a significant nonzero trend in the NOAAGlobal-
Temp data of the global average annual mean during the period of 1971–2010 at the 5%
significance level. What is the p-value?
3.7 Repeat the previous problem for the monthly data of Januaries.
3.8 Repeat the previous problem for the monthly data of Julies.
3.9 Use the chi-square test to check if the January monthly precipitation data of Omaha,
Nebraska, USA, from 1948 to 2019 fits well to a Gamma distribution. What are the
shape and rate of the fitting? You can find the data on the NOAA Climate Data Online
website www.ncdc.noaa.gov/cdo-web.
3.10 Do the same for the July monthly precipitation data of Omaha from 1948 to 2019.
3.11 Do the same for the annual mean precipitation data of Omaha from 1948 to 2019.
3.12 Use the chi-square test to check if the June monthly precipitation data of Omaha,
Nebraska, USA, from 1948 to 2019 fits well to a lognormal distribution. Find µ and
σ in the lognormal distribution.
3.13 Fit a long-term monthly precipitation data of a station of your choice to the Gamma
distribution. Choose a station and a month, such as Paris and June. The record should be
at least 50 years long. Use the chi-square test to check if the monthly precipitation data
fits well to a Gamma distribution. What are the shape and rate of the fitting?
3.14 Examine the long-term monthly surface air temperature data of a station of your choice.
Explore if the standardized anomaly data satisfy the standard normal distribution using
the K-S test.
3.15 Examine the long-term monthly precipitation data of a station of your choice. Explore
if the standardized anomaly data satisfy the standard normal distribution using the K-S
test.
3.16 For the standardized January monthly mean precipitation and temperature data of Mad-
ison, Wisconsin, USA, from 1961 to 2010, use a t -test to examine whether there exists a
significant correlation between temperature and precipitation. You can find the data on
the NOAA Climate Data Online website www.ncdc.noaa.gov/cdo-web.
3.17 Examine the same correlation using the Kendall tau test.
3.18 (a) Use the Mann–Kendall trend test to examine whether there is a significant positive
temporal trend in the Madison precipitation data at the 5% significance level. What
is the p-value?
(b) Do the same for the temperature data.
3.19 Use the Mann–Kendall trend test to examine whether there is a significant positive tem-
poral trend in the NOAAGlobalTemp annual mean data from 1880 to 2019 at the 5%
significance level. What is the p-value? See Figure 3.6 for the linear trend line.
3.20 Use the Mann–Kendall trend test to examine whether there is a significant positive tem-
poral trend in the NOAAGlobalTemp September monthly mean data from 1880 to 2019
at the 2.5% significance level. What is the p-value?
3.21 Use the Mann–Kendall trend test to examine whether there is a temporal trend (posi-
tive or negative) in the global annual mean precipitation from 1979 to 2018 at the 5%
significance level. What is the p-value? You may use the NASA Global Precipitation
Climatology Project (GPCP data).
4 Regression Models and Methods
The word “regression” means “a return to a previous and less advanced or worse form,
state, condition, or way of behaving,” according to the Cambridge dictionary. The first part
of the word – “regress” – originates from the Latin regressus, past participle of regredi (“to
go back”), from re- (“back”) + gradi (“to go”). Thus, “regress” means “return, to go back”
and is in contrast to the commonly used word “progress.” The regression in statistical data
analysis refers to a process of returning from irregular and complex data to a simpler and
less perfect state, which is called a model and can be expressed as a curve, a surface, or
a function. The function or curve, less complex or less advanced than the irregular data
pattern, describes a way of behaving or a relationship. This chapter covers linear models in
both uni- and multivariate regressions, least square estimations of parameters, confidence
intervals and inference of the parameters, and fittings of polynomials and other nonlinear
curves. By running diagnostic studies on residuals, we explain the assumptions of a linear
regression model: linearity, homogeneity, independence, and normality. As usual, we use
examples of real climate data and provide both R and Python codes.
The simplest model of regression is a straight line, i.e., a linear model. This involves only
two variables x and y. Therefore, a simple linear regression means going from the data pairs
(xi , yi )(i = 1, 2, . . . , n) on the xy-plane back to a simple straight line model:
y = a + bx. (4.1)
The data will determine the values of a and b by the criterion of the best fit. The model can
be used for prediction: predict y when a value of x is given.
Figure 4.1 The dots are the scatter plot of the data of elevation and 1981–2010 average July surface air temperature Tmean taken from the 24 USHCN stations. The thick straight line is the linear regression model computed from the data. (Figure annotations: y = 33.48 − 0.0070 x, R-squared = 0.96, temperature lapse rate 7.0°C/1.0 km; vertical axis: Temperature [°C].)
The dots in Figure 4.1 are the scatter plot of the data of elevation and 1981–2010 average July surface air
temperature Tmean of the 24 Colorado stations in the United States Historical Climatological
Network (USHCN) (Menne et al. 2009). The straight line of the figure corresponds to the
linear model:
y = 33.48 − 0.0070x. (4.2)
The vertical axis is for temperature, which is the 30-year average of the July daily mean
temperature (Tmean) from 1981 to 2010, computed from the USHCN monthly data, which
have been adjusted for the time of observation bias (TOB). Some stations had missing data
denoted by −9,999. When computing the 30-year average, the entries of −9,999 were
omitted. Thus, some averages were computed from fewer than 30 years. The resulting
average temperature data in the unit of ◦ C for the 24 stations are as follows:
The Colorado July temperature lapse rate (TLR) 7.0°C/1.0 km is consistent with earlier
studies using 104 stations on the west slope of Colorado for an elevation range of
1,500–3,600 meters: 6.9°C/1.0 km (Fall 1997). For the annual mean temperature, the TLR
is 6.0°C/1.0 km. These two results are comparable to a TLR study for the Southern Ecuadorian
Andes in the elevation range of 2,610–4,200 meters (Cordova et al. 2016), where the
annual mean temperature TLR is 6.88°C/1.0 km.
The TLR may be applied to approximately predict the temperature at a given location in a
mountain region. This is useful in cases when it is hard to maintain a weather station
due to high elevation or complex terrain, while it is relatively easy to obtain elevation data
from a Geographical Information System (GIS) or a digital elevation model (DEM)
dataset.
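As a quick illustration of such a prediction, the fitted line can be evaluated at any elevation. The following is a minimal R sketch, not the book's own code; the elevation values 2000 and 3500 m are hypothetical, and x and y are the Colorado elevation and July Tmean data used for Figure 4.1.

# Minimal sketch: predict July Tmean from elevation with the fitted line
reg = lm(y ~ x)                           # x: elevation [m], y: July Tmean [deg C]
newElev = data.frame(x = c(2000, 3500))   # hypothetical elevations [m]
predict(reg, newdata = newElev)           # approximately 33.48 - 0.0070 * elevation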
This subsection describes the statistical concepts and mathematical theory of the afore-
mentioned linear regression procedure. We pay special attention to the assumptions about
the linear regression model.
Y = a + bx + ε. (4.3)
Here,
Yi = a + bxi + εi , i = 1, 2, . . . , n, (4.5)
The last equation means that the error terms εi at the observational points xi are
uncorrelated with each other.
The observations at the given points xi are yi . Both are fixed values, such as temperature
y1 = 22.064◦ C at the given elevation x1 = 1671.5 meters in the Colorado July temperature
example. Here, y1 is a sample value for the random variable Y1 . The corresponding error
datum is e1 , which is called a residual and is regarded as a sample value of ε1 with
respect to the linear model.
Figure 4.2 The condition of the least sum of the residual squares SSE is equivalent to the orthogonality condition e · xa = 0. (The figure sketches the vectors ya and b̂ xa, the shortest residual vector e, and two longer residual vectors.)
The b̂ estimate formula (4.21) can be derived from the best fit condition which minimizes
the mean square errors (MSE)
MSE = |e|²/n = ( ∑_{i=1}^{n} e_i² )/n = ( ∑_{i=1}^{n} (y_i − ŷ_i)² )/n, (4.25)
equivalently minimizing the sum of the square errors (SSE)
SSE = |e|² = ∑_{i=1}^{n} e_i² = ∑_{i=1}^{n} (y_i − ŷ_i)² = n × MSE, (4.26)
where |e| denotes the Euclidean length of vector e. Thus, this method is also called the least
square estimate. The least square condition is equivalent to the following orthogonality
condition (see Fig. 4.2)1 :
e · xa = 0. (4.27)
Inserting Eq. (4.20) into the linear model with data (4.12) leads to
ya,i = b̂ xa,i + ei , i = 1, 2, . . . , n, (4.29)
1 SSE can be written as
SSE = |e|² = |ya − b̂ xa|².
When b̂ is optimized to minimize SSE, the derivative of SSE with respect to b̂ is zero:
d(SSE)/db̂ = −2 (ya − b̂ xa) · xa = −2 e · xa = 0.
The orthogonality condition, e · xa = 0, can also be derived from a geometric point of view. The three vectors b̂ xa, e and ya in ya = b̂ xa + e form a triangle. The side e, which depends on b̂, is the shortest when b̂ xa is
perpendicular to e, achieved by adjusting the parameter b̂.
or
ya = b̂xa + e. (4.30)
The dot product of both sides of Eq. (4.30) with xa yields
b̂ = (ya · xa − e · xa) / (xa · xa). (4.31)
With the orthogonality condition Eq. (4.27), e · xa = 0, Eq. (4.31) is reduced to the least
square estimate of b: Eq. (4.21).
In summary, the estimates â and b̂ are derived by requiring (i) zero mean of the residuals,
and (ii) minimum sum of the residual squares.
As shown in the computer code for Figure 4.1, the R command for the simple lin-
ear regression is lm(y ∼ x). The TLR dataset yields the estimate of the intercept â =
33.476216 and slope b̂ = −0.006982:
lm(y ~ x)
#(Intercept)            x
#  33.476216    -0.006982
reg = lm(y ~ x)
round(reg$residuals, digits = 5)
mean(reg$residuals)
#[1] 1.62043e-17
xa = x - mean(x)
sum(xa * reg$residuals)
#[1] -2.83773e-13
# Python: verify the orthogonality condition
import numpy as np
b1, b0 = np.polyfit(x, y, 1)                        # slope and intercept of the linear fit
reg_residuals = np.array(y) - (b0 + b1 * np.array(x))
xa = np.array(x) - np.mean(x)
print(np.sum(xa * reg_residuals))
print(np.dot(xa, reg_residuals))  # orthogonality
# -1.48929757415317e-11  # should be almost zero
Once again, we emphasize that two assumptions are used in the derivation of the formu-
las to estimate â and b̂:
(i) The unbiased model assumption: The mean residual is zero ē = 0, and
(ii) The optimization assumption: The residual vector e is perpendicular to the x-anomaly
vector xa = x − x̄.
Under these two assumptions, the estimators â and b̂ are highly sensitive to the Y outliers,
particularly the outlier data corresponding to the smallest and largest x values. One outlier
can completely change the â and b̂ values, in agreement with our intuition. This is an
endpoint problem often encountered in data analysis. To suppress the sensitivity, many
robust regression methods have been developed and R packages are available, such as the
Robust Regression, by the UCLA Institute for Digital Research & Education:
https://stats.idre.ucla.edu/r/dae/robust-regression/
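As one hedged example of such a remedy, the rlm() function in the MASS package (not used elsewhere in this chapter) fits a robust line that down-weights outliers. A minimal sketch, assuming the same x and y data as before, is as follows.

# Sketch: robust linear regression that down-weights Y outliers
library(MASS)
rob = rlm(y ~ x)
coef(rob)        # compare with coef(lm(y ~ x))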
Figure 4.2 can be generated by the following computer code.
# R plot Fig. 4.2: Geometric derivation of the least squares
par(mar = c(0.0, 0.5, 0.0, 0.5))
plot(0, 0, xlim = c(0, 5.2), ylim = c(0, 2.2),
     axes = FALSE, xlab = "", ylab = "")
arrows(0, 0, 4, 0, angle = 5, code = 2, lwd = 3, length = 0.5)
arrows(4, 0, 4, 2, angle = 5, code = 2, lwd = 3, length = 0.5)
arrows(0, 0, 4, 2, angle = 5, code = 2, lwd = 3, length = 0.5)
arrows(5, 0, 4, 2, angle = 7, code = 2, lwd = 2, lty = 3, length = 0.5)
arrows(0, 0, 5, 0, angle = 7, code = 2, lwd = 2, lty = 3, length = 0.5)
arrows(3, 0, 4, 2, angle = 7, code = 2, lwd = 2, lty = 3, length = 0.5)
arrows(0, 0, 3, 0, angle = 7, code = 2, lwd = 2, lty = 3, length = 0.5)
segments(3.9, 0, 3.9, 0.1)
segments(3.9, 0.1, 4.0, 0.1)
text(2, 0.2, expression(hat(b) ~ bold(x)[a]), cex = 2)
text(2, 1.2, expression(bold(y)[a]), cex = 2)
text(4.1, 1, expression(bold(e)), cex = 2)
text(3.8, 0.6, expression(paste("Shortest ", bold(e))),
     cex = 1.5, srt = 90)
text(3.4, 1.1, expression(paste("Longer ", bold(e))),
     cex = 1.5, srt = 71)
text(4.6, 1.1, expression(paste("Longer ", bold(e))),
     cex = 1.5, srt = -71)
Case (b): Perfect noise with zero correlation, rxy = 0, which implies
b̂ = 0. (4.37)
Since xa/|xa| is a unit vector in the direction of xa, the slope is the projection of the
y anomaly data vector onto the direction of the x anomaly data vector, further normalized
by the Euclidean length of the x anomaly data vector.
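A minimal R sketch of this projection interpretation, assuming the Colorado x and y data from Figure 4.1, is as follows.

# Sketch: the least-square slope as a projection of the y anomalies on the x anomalies
xa = x - mean(x)
ya = y - mean(y)
bhat = sum(ya * xa) / sum(xa * xa)   # projection divided by |xa|^2, as in Eq. (4.21)
bhat                                 # should be close to -0.006982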
The variances MV, YV, and R2 can be computed by the following computer code in
multiple ways.
# R code for computing MV
var(reg$fitted.values)
#[1] 15.22721
# Or another way
yhat = reg$fitted.values
var(yhat)
#[1] 15.22721
# Or still another way
n = 24
sum((yhat - mean(yhat))^2)/(n - 1)
#[1] 15.22721
cor(x, y)
#[1] -0.9813858
(cor(x, y))^2
#[1] 0.9631181  # This is the R-squared value
# Python code for computing YV
import numpy as np
from statistics import variance
n = 24
print('YV =', np.sum((np.array(y) - np.mean(y))**2)/(n - 1))
# YV = 15.81033
# Or another way
print('YV =', variance(y))
# YV = 15.81032641847826
SSE = ∑_{i=1}^{n} e_i² = ∑_{i=1}^{n} [y_i − (â + b̂ x_i)]². (4.47)
This approach can be found in most statistics books and on the Internet. The terminology of
"least square" regression comes from this minimization principle.
This minimization is with respect to â and b̂. The minimization condition leads to two
linear equations that determine â and b̂. The “least square” minimization condition for b̂ is
equivalent to the orthogonality condition (4.27). Geometrically, the shortest distance from
a point to a line is the length of the line segment that is orthogonal to the line and connects
the point to the line (see Fig. 4.2). Thus, the orthogonality condition and the minimization
condition are equivalent.
Another commonly used method to estimate the parameters is to maximize a likelihood
function defined as
L(a, b, σ) = ∏_{i=1}^{n} ( 1/(√(2π) σ) ) exp( −(y_i − (a + b x_i))² / (2σ²) ). (4.48)
The solution for the maximum L(a, b, σ ) yields the estimate of â, b̂, and σ̂ 2 . The result
is exactly the same as the ones obtained by the method of least squares or perpendicular
projection.
However, the maximum likelihood approach shown here explicitly assumes the normal
distribution of the error term εi and the response variable Yi. For the least squares
approach, the assumption of a normal distribution is needed only for the inference of
the parameters â, b̂, and σ̂², not for their estimation.
because
∑_{i=1}^{n} (x_i − x̄) = 0. (4.50)
The variance of B is
Var(B) = σ² / ‖xa‖². (4.52)
Hence,
Var(B) = σ² / Sxx. (4.55)
When n goes to infinity, Sxx also goes to infinity. Thus,
Var(B) → 0 as n → ∞. (4.56)
This means that more data points yield a better estimate for the slope, supporting our
intuition.
The intercept estimator is
A = Ȳ − B x̄. (4.57)
The quantity Sxx/n may be considered the estimated variance of the xi data. The variance of A can be derived as
follows:
Var[A] = Var[Ȳ] + Var[B] x̄²
       = σ²/n + (σ²/Sxx) x̄²
       = (σ²/n) ( 1 + x̄²/(Sxx/n) ). (4.61)
This expression implies that as n goes to infinity, the variance Var[A] also goes to zero.
Thus, the intercept estimate is better when there are more data points.
The estimators A and B follow normal distributions for a given σ².
When σ² is unknown and is to be estimated by s², then A and B follow a t-distribution.
To be exact, the t-statistics with n − 2 dof can be defined as
Ta = (â − a) / ( s √(1/n + x̄²/Sxx) ), (4.72)
Tb = (b̂ − b) / ( s/√Sxx ), (4.73)
where a and b are given, and â and b̂ are estimated from data. The confidence interval
(−tα/2,n−2 , tα/2,n−2) for Ta and Tb leads to the confidence intervals for â and b̂. Thus, these
two statistics Ta and Tb can be used for a hypothesis test.
H0 : b = β, (4.74)
H1 : b ≠ β. (4.75)
We define a t-statistic
Tb = (b̂ − β) / ( s/√Sxx ). (4.76)
At the α × 100% significance level, we reject H0 if |Tb| > t1−α/2, n−2.
For the Colorado July TLR example, if we want to test whether the TLR is equal to
7.3◦ C/km, we calculate
Tb = ( −0.0069818 − (−0.0073) ) / 0.0002913 = 1.092345. (4.78)
The quantile is
t0.975,22 = 2.073873. (4.79)
Thus, |Tb | < t0.975,22 , and hence the null hypothesis is not rejected at the 5% significance
level. In other words, the obtained TLR 7.0◦ C/km is not significantly different from the
given value 7.3◦ C/km.
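A minimal R sketch of this test, using the slope estimate and standard error quoted above (which would normally be read from summary(lm(y ~ x))), is as follows.

# Sketch: t test of H0: TLR slope b = -0.0073 for the Colorado data
bhat = -0.0069818; se_b = 0.0002913   # slope and its standard error from the fit
Tb = (bhat - (-0.0073)) / se_b
Tb                                    # 1.092345
qt(0.975, df = 22)                    # 2.073873; |Tb| is smaller, so H0 is not rejected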
Figure 4.3 The confidence interval of the fitted model based on the Colorado July mean temperature data and their corresponding station elevations (red lines), and the confidence interval of the response variable Tmean (blue lines). (Figure annotations: y = 33.48 − 0.0070 x, R-squared = 0.96, temperature lapse rate 7.0°C/km; vertical axis: Temperature [°C].)
The confidence interval is shown by the blue lines in Figure 4.3. It means that, with 95%
chance, the Y values are between the blue lines. As expected, almost all the Colorado TLR
observed data lie between the blue lines. There is only a 5% chance for a data point to lie
outside the blue lines. According to Eq. (4.85), the blue lines also diverge, but very slowly
because
(x* − x̄)² / Sxx < 0.1310
is much smaller than 1 + 1/n = 1.0417.
Figure 4.3 can be produced by the following computer code.
# R plot Fig. 4.3: Confidence intervals of a regression model
setwd("/Users/sshen/climstats")
# Confidence interval of the linear model
x1 = seq(max(x), min(x), len = 100)
n = 24
xbar = mean(x)
reg = lm(y ~ x)
SSE = sum((reg$residuals)^2)
s_squared = SSE/(length(y) - 2)
s = sqrt(s_squared)
modTLR = 33.476216 - 0.006982*x1
xbar = mean(x)
Sxx = sum((x - xbar)^2)
CIupperModel = modTLR +
  qt(.975, df = n - 2)*s*sqrt((1/n) + (x1 - xbar)^2/Sxx)
CIlowerModel = modTLR -
  qt(.975, df = n - 2)*s*sqrt((1/n) + (x1 - xbar)^2/Sxx)
CIupperResponse = modTLR +
  qt(.975, df = n - 2)*s*sqrt(1 + (1/n) + (x1 - xbar)^2/Sxx)
CIlowerResponse = modTLR -
  qt(.975, df = n - 2)*s*sqrt(1 + (1/n) + (x1 - xbar)^2/Sxx)
Figure 4.4 Scatter plot of the residuals of the linear regression for the Colorado July mean temperature data against the elevations. The data are from the 24 USHCN stations in Colorado, and the July Tmean is the 1981–2010 average temperature. (Axes: Elevation [m] on the horizontal axis; residual temperature [°C] on the vertical axis.)
A residual scatter plot like Figure 4.4 helps check the linear regression assumptions. If some assumptions are violated for a given dataset, we can
seek other methods to remediate the problems. One way is to make a data transformation
or to use a new regression function so that the assumptions are satisfied for the transformed
data or the new functional model. Another way is to use a nonparametric method,
which does not assume any distribution. Comprehensive residual analysis methods and
the corresponding R and Python codes can be found online; this book is limited to the
discussion of a few examples.
Violation of independence assumption (iv) often occurs when y is a time series, in which
case there could be correlations from one time to later times. This serial correlation effect
leads to fewer independent time intervals, i.e., the dof can be reduced, and the statistical
inference needs special attention.
# R plot Fig. 4.4: Regression residuals
reg = lm(y ~ x)
setEPS()  # Plot the figure and save the file
postscript("fig0404.eps", height = 4.5, width = 8)
par(mar = c(4.5, 4.5, 2.0, 0.5))
plot(x, reg$residuals, pch = 5,
     ylim = c(-2, 2), xlim = c(1000, 2800),
     xlab = "Elevation [m]",
     ylab = bquote("Residual Temp [" ~ degree ~ "C]"),
     main = "Residuals of the Colorado July Tmean vs. Elevation",
     cex.lab = 1.5, cex.axis = 1.5, cex.main = 1.2)
dev.off()
# Python plot of the regression residuals
import numpy as np
import matplotlib.pyplot as plt
r = np.array(y) - (33.476216 - 0.006982*np.array(x))  # residuals from the fitted line
fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(x, r, 'kd')
ax.set_title("Residuals of the Colorado 1981-2010 July\nTmean vs. Elevation",
             fontweight='bold', size=25, pad=20)
ax.set_xlabel("Elevation $[m]$", size=25, labelpad=20)
ax.set_ylabel(r"Residual Temp [$\degree$C]", size=25, labelpad=20)
ax.tick_params(length=6, width=2, labelsize=20)
ax.set_xticks(np.linspace(1000, 3000, 5))
ax.set_yticks(np.linspace(-2, 2, 5))
plt.show()
Figure 4.5 Q-Q plot of the linear regression residuals for the Colorado July mean temperature data at the 24 USHCN stations. The diagonal straight line is the theoretical line for the standard normal distribution. (Horizontal axis: theoretical quantiles.)
However, the Q-Q plot is only a subjective visual test. A quantitative test may be used,
such as the Kolmogorov–Smirnov (KS) test or Shapiro–Wilk (SW) test. The KS-test statis-
tic and its p-value for the Colorado TLR data are D = 0.20833 and p-value = 0.686. The
null hypothesis is not rejected, i.e., the temperature residuals from the linear regression are
not significantly different from the normal distribution. The normality assumption is valid.
Figure 4.5 can be generated by the following computer code.
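The book's full plotting code is not reproduced here; a minimal R sketch of a Q-Q plot and a KS test on the standardized residuals, assuming the reg object from the earlier lm(y ~ x) fit, is as follows.

# Sketch: Q-Q plot and KS normality test for the regression residuals
resi_std = (reg$residuals - mean(reg$residuals)) / sd(reg$residuals)
qqnorm(resi_std); qqline(resi_std)   # compare with the standard normal line
ks.test(resi_std, "pnorm")            # the text quotes D = 0.20833, p-value = 0.686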
The DW test is meant for a time series. We thus sort the Tmean data according to the
ascending order of elevation, and then compute DW.
When Cov(εi , εj ) = σ²δij , i.e., when the error terms are uncorrelated, we have DW ≈ 2. For the Colorado July TLR data, DW =
2.3072 and the p-value = 0.7062, which can be computed by the following code.
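A minimal sketch of such a computation, assuming the lmtest package is installed and x and y are the elevation and Tmean data, is as follows.

# Sketch: Durbin-Watson test on the Tmean data sorted by ascending elevation
library(lmtest)
ord = order(x)                  # ascending order of elevation
dwtest(lm(y[ord] ~ x[ord]))     # the text quotes DW = 2.3072, p-value = 0.7062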
The large p-value 0.7062 implies that the null hypothesis of no serial correlation (i.e.,
independence) is not rejected, because DW = 2.3072 is not too far away from 2, which
indicates independence of the error terms. Some literature uses 1.5 < DW < 2.5 to conclude
independence without checking the p-value.
When the independence assumption is violated, the data have serial correlation, which
reduces the effective sample size and hence enlarges the confidence intervals for the
regression results, making the results less reliable.
The large p-value 0.6374 of the MK test for residuals implies that the residuals have
no trend, linear or nonlinear. The small p-value 2.261e-09 of the MK test for the sorted
temperature implies that the sorted July Tmean according to elevation has a significant
trend.
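A minimal sketch of these two Mann–Kendall tests, assuming the Kendall package and the elevation-sorted data described above, is as follows.

# Sketch: Mann-Kendall trend tests for the sorted residuals and the sorted Tmean
library(Kendall)
ord = order(x)                     # ascending elevation
MannKendall(reg$residuals[ord])    # residuals: large p-value, no significant trend
MannKendall(y[ord])                # sorted Tmean: very small p-value, significant trend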
A linear trend is frequently used to describe the global average temperature. However, the
temperature data as a function of time are clearly nonlinear. When the scatter plot of residuals
shows a clear pattern, one or more of the four assumptions are usually violated. When that happens,
there are different ways to remediate the situation. For example, when linearity is violated,
the problem may be intrinsically nonlinear, and a polynomial model may be a better fit than
a linear model. When the constant variance assumption is violated, known as heteroscedastic data in
the statistics literature, the data standardization (i.e., the anomaly data divided by
the data standard deviation) or a logarithmic transform for positive-valued data may be
used to transform the data and make them homoscedastic, i.e., of constant variance.
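A minimal sketch of these two transforms, for a hypothetical positive-valued data vector z, is as follows.

# Sketch: common transforms for heteroscedastic data
z = c(2.1, 0.8, 5.6, 3.3, 1.2)    # hypothetical positive-valued data
z_std = (z - mean(z)) / sd(z)     # standardized anomalies
z_log = log(z)                    # logarithmic transform for positive data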
This section uses the method of matrices, which will be described in Chapter 5. If you are
not familiar with matrices, you may come back to this section after reading the first two
sections of Chapter 5.
In the previous simple linear regression, the July temperature was assumed to linearly
depend only on the vertical coordinate of the station: elevation. However, the temperature
may also depend on the horizontal coordinates of the station: latitude and longitude. This
subsection deals with the problem of more than one explanatory variable. A multivariate
linear regression model (or multiple linear regression model) can be expressed as follows:
Y = b0 + b1 x1 + b2 x2 + · · · + bm xm + ε, (4.87)
lat=c(
39.9919, 38.4600, 39.2203, 38.8236, 39.2425, 37.6742,
39.6261, 38.4775, 40.6147, 40.2600, 39.1653, 38.5258,
37.7717, 38.0494, 38.0936, 38.0636, 37.1742, 38.4858,
38.0392, 38.0858, 40.4883, 37.9492, 37.1786, 40.0583
)
lon=c(
-105.2667, -105.2256, -105.2783, -102.3486, -107.9631, -106.3247,
-106.0353, -102.7808, -105.1314, -103.8156, -108.7331, -106.9675,
-107.1097, -102.1236, -102.6306, -103.2153, -105.9392, -107.8792,
-103.6933, -106.1444, -106.8233, -107.8733, -104.4869, -102.2189
)
The elevation and temperature data are the same as those at the beginning of this chapter.
The computer code for the multiple linear regression of three variables is as follows.
# R code for the TLR multivariate linear regression
elev = x; temp = y  # The x and y data were entered earlier
dat = cbind(lat, lon, elev, temp)
datdf = data.frame(dat)
datdf[1:2, ]  # Show the data of the first two stations
#      lat       lon   elev   temp
#  39.9919 -105.2667 1671.5 22.064
#  38.4600 -105.2256 1635.6 23.591
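The regression call itself is not shown above; a minimal sketch that reproduces the coefficients quoted below (the object name reg3v is ours) is as follows.

# Sketch: three-variable linear regression of temperature on lat, lon, and elevation
reg3v = lm(temp ~ lat + lon + elev, data = datdf)
summary(reg3v)
# the elevation coefficient is about -0.0075694, i.e., a TLR of about 7.6 deg C/km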
According to this three-variable linear regression, the TLR is the regression coefficient
for elevation, i.e., −0.0075694. This means 7.6◦ C/km, larger than the TLR estimated
earlier using only one variable: elevation.
The quantile of t0.975,20 = 2.085963. The dof is 20 now because four parameters have
been estimated. The confidence interval of TLR can be computed as follows:
(−0.0075694 − 2.085963 × 0.0003298, −0.0075694 + 2.085963 × 0.0003298)
= (−0.008257351, −0.006881449). (4.89)
The confidence interval for TLR at the 95% confidence level is (6.9, 8.3)◦ C/km.
The R² value is very large, 0.979, which means that the linear model can explain
about 98% of the variance of the observed temperature data.
The data and their corresponding regression coefficients are written in matrix form as
follows:
y = [ y1 ]
    [ y2 ]
    [ ⋮  ]
    [ yn ],   (4.92)
X = [ 1  x11  x21  · · ·  xm1 ]
    [ 1  x12  x22  · · ·  xm2 ]
    [ ⋮    ⋮    ⋮    ⋱    ⋮  ]
    [ 1  x1n  x2n  · · ·  xmn ],   (4.93)
b̂ = [ b̂0 ]
    [ b̂1 ]
    [ b̂2 ]
    [ ⋮  ]
    [ b̂m ].   (4.94)
The data, linear model prediction, and the corresponding residuals (i.e., the prediction
errors of the linear model) can be written as follows:
yn×1 = Xn×(m+1) b̂(m+1)×1 + en×1 (4.95)
where the residual vector is
en×1 = [ e1, e2, . . . , en ]ᵗ. (4.96)
Multiplying Eq. (4.95) by Xᵗ from the left and enforcing
(Xᵗ)(m+1)×n en×1 = 0, (4.97)
you can obtain the estimate of the regression coefficients
b̂(m+1)×1 = ( (Xᵗ)(m+1)×n Xn×(m+1) )⁻¹ (Xᵗ)(m+1)×n yn×1. (4.98)
Condition (4.97) means each column vector of the Xn×(m+1) matrix is perpendicular to
the residual vector en×1 . For the first column, the condition corresponds to the assumption
of zero mean of the model error:
∑_{i=1}^{n} e_i = 0. (4.99)
For the remaining columns, the condition means that the residual vector is perpendicular to
the data vector of each explanatory variable. Together, these conditions imply that the Euclidean
length of the residual vector is minimized, i.e., the sum of squared errors (SSE) is minimized.
Therefore, Eq. (4.98) is the least square estimate of the regression coefficients.
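A minimal R sketch of Eq. (4.98), using the Colorado lat, lon, elev, and temp data from this section, is as follows.

# Sketch: least-square estimate b-hat = (X^t X)^(-1) X^t y computed directly
X = cbind(1, lat, lon, elev)                # n x (m+1) design matrix
bhat = solve(t(X) %*% X, t(X) %*% temp)     # same as coef(lm(temp ~ lat + lon + elev))
bhat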
When the data have strong nonlinearity, the scatter plot of residuals will show an obvious
pattern, as shown in Figure 4.6 for the linear regression of the global average annual mean
surface air temperature anomalies from 1880 to 2018 with respect to the 1971–2000 climatology.
The linearity assumption of the simple linear regression is clearly violated. The
independence assumption is also violated.
Figure 4.6 (a) Linear regression of the global average annual mean land and ocean surface air temperature anomalies with respect to the 1971–2000 climatology based on the NOAAGlobalTemp dataset (Zhang et al. 2019); (b) scatter plot of the linear regression residuals.
The following computer code can plot Figure 4.6 and provide diagnostics of the linear
regression.
# R plot Fig. 4.6: Regression diagnostics
setwd("/Users/sshen/climstats")
dtmean <- read.table(
  "data/aravg.ann.land_ocean.90S.90N.v5.0.0.201909.txt",
  header = F)
dim(dtmean)
#[1] 140 6
x = dtmean[1:139, 1]
y = dtmean[1:139, 2]
reg = lm(y ~ x)  # linear regression
reg
#(Intercept)      yrtime
# -14.574841    0.007348
CIupperModelr = modT +
  qt(.975, df = 5)*s*sqrt((1/n) + (x - xbar)^2/Sxx)
CIlowerModelr = modT -
  qt(.975, df = 5)*s*sqrt((1/n) + (x - xbar)^2/Sxx)
CIupperResponser = modT +
  qt(.975, df = 5)*s*sqrt(1 + (1/n) + (x - xbar)^2/Sxx)
CIlowerResponser = modT -
  qt(.975, df = 5)*s*sqrt(1 + (1/n) + (x - xbar)^2/Sxx)
text(1940, 0.5,
     bquote("Linear trend: 0.7348" ~ degree ~ "C per century"),
     col = "black", cex = 1.4)
text(1880, 0.9, "(a)", cex = 1.4)
par(mar = c(4.5, 4.5, 0, 0.7))
plot(x, reg$residuals, ylim = c(-0.6, 0.6),
     pch = 5, cex.lab = 1.4, cex.axis = 1.4,
     yaxt = 'n', xlab = "Year",
     ylab = bquote("Residuals [" ~ degree ~ "C]"))
Based on the visual check of Figure 4.6(b), the constant variance assumption seems
satisfied. The KS test shows that the normality assumption is also satisfied.
# Kolmogorov-Smirnov (KS) test for normality
library(fitdistrplus)
The DW test shows that the independence assumption is violated. This implies the exist-
ence of serial correlation, which in turn implies a smaller dof, and hence larger t1−α,dof .
Thus, the confidence intervals for model Ŷ are wider than the red lines in Figure 4.6(a), and
those for the Y values are wider than the blue lines in the same figure. Therefore, the regres-
sion results are less reliable. In this case, you need to compute the effective dof (edof),
which is less than n − 2. For the NOAAGlobalTemp time series, the one-year time lag
serial correlation is ρ1 = 0.9271, and the edof under the assumption of a first-order
autoregressive process AR(1) is approximately equal to
edof = (1 − ρ1)/(1 + ρ1) × dof. (4.105)
Consequently, the edof is reduced to 5, compared to the dof of 137 in the case of no serial
correlation. R can compute qt(.975, df=137), which yields 1.977431, and qt(.975, df=5),
which yields 2.570582. Thus, the actual confidence interval at the 95% confidence level for
the linear regression is about 30% wider. Namely, the dotted color lines are farther apart than
the solid color lines.
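A minimal R sketch of this edof adjustment, using the lag-1 autocorrelation quoted in the text, is as follows.

# Sketch: effective degrees of freedom for AR(1) serial correlation, Eq. (4.105)
rho1 = 0.9271                                 # lag-1 autocorrelation of the time series
edof = round((1 - rho1) / (1 + rho1) * 137)   # 137 = n - 2 without serial correlation
edof                                          # 5
qt(0.975, df = 137); qt(0.975, df = 5)        # 1.977431 and 2.570582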
The W pattern of the residuals shown in Figure 4.6(b) implies that the linearity assump-
tion is violated. To remediate the nonlinearity, we fit the data to a nonlinear model and
hope the residuals do not show clear patterns. We will use the third-order polynomial to
illustrate the procedure.
Figure 4.7(a) shows the third-order polynomial fitting to the NOAAGlobalTemp dataset:
y = b0 + b1 x + b2 x2 + b3 x3 + ε. (4.106)
This fitting can be estimated by the method of multiple linear regression with three
variables:
x1 = x, x2 = x2 , x3 = x3 . (4.107)
Then, a computer code for the multiple linear regression can be used to estimate the
coefficients of the polynomial and plot Figure 4.7.
# R plot Fig. 4.7: Polynomial fitting
x1 = x
x2 = x1^2
x3 = x1^3
dat3 = data.frame(cbind(x1, x2, x3, y))
reg3 = lm(y ~ x1 + x2 + x3, data = dat3)
# or simply use
# reg3 = lm(y ~ x + I(x^2) + I(x^3))
setEPS()  # Plot the figure and save the file
postscript("fig0407.eps", height = 8, width = 8)
par(mfrow = c(2, 1))
par(mar = c(0, 4.5, 2.5, 0.7))
plot(x, y, type = "o", xaxt = "n",
     cex.lab = 1.4, cex.axis = 1.4, xlab = "Year",
     ylab = bquote("Temperature [" ~ degree ~ "C]"),
     main = "Global Annual Mean Surface Temperature Anomalies")
Figure 4.7 (a) Fit of a third-order polynomial to the global average annual mean land and ocean surface air temperature anomalies with respect to the 1971–2000 climatology based on the NOAAGlobalTemp dataset (Zhang et al. 2019); (b) scatter plot of the corresponding residuals.
In the R function poly(), the option raw=TRUE means using the raw polynomial form written
like formula (4.106). Another option is raw=FALSE, which means that the data are fitted to an
orthogonal polynomial.
Comparison of Figures 4.6(b) and 4.7(b) shows that the W-shape pattern of the residuals
in the third-order polynomial fitting is weaker than that for the linear regression. None-
theless, the W-shape pattern is still clear. Figure 4.7(b) visually suggests that the constant
variance assumption is satisfied. The KS test shows that the normality assumption is also
satisfied. However, the DW test shows the existence of serial correlation. The computing
of these tests is left as exercise problems.
You can try to fit a ninth-order orthogonal polynomial. This can eliminate the W-
shape nonlinear pattern of the residuals. You can also try to fit other polynomials or
functions.
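A minimal sketch of such a ninth-order orthogonal polynomial fit, assuming the x and y time series used for Figure 4.7, is as follows.

# Sketch: ninth-order orthogonal polynomial fit and its residual plot
reg9 = lm(y ~ poly(x, 9, raw = FALSE))
plot(x, y, type = "l")
lines(x, reg9$fitted.values, col = "red")   # fitted ninth-order polynomial
plot(x, reg9$residuals)                     # check whether the W-shape pattern is gone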
This chapter has introduced three regression methods commonly used in climate science:
(a) simple linear regression of a single variate, (b) multiple linear regression, and (c) nonlin-
ear fitting. For (a) and (b) we used the data of surface air temperature and elevation of the 24
USHCN stations in the state of Colorado, USA. For (c) we used the NOAAGlobalTemp’s
global average annual mean temperature data from 1880 to 2018.
A linear regression model has four fundamental assumptions:
(i) Approximate linearity between x and Y ,
(ii) Normal distribution of ε and Y ,
(iii) Constant variance of ε and Y , and
(iv) Independence of the error terms: Cov[εi , εj ] = σ²δij .
Assumptions (i) and (iii) can be verified by visually examining the scatter plot of residuals.
Assumption (ii) may be verified by the KS test. Assumption (iv) can be verified by the
DW test. When one or more assumptions are violated, you may improve your model, for
example, by using a nonlinear model or transforming the data.
References and Further Reading
[1] M. Cordova, R. Celleri, C.J. Shellito et al., 2016: Near-surface air temperature
lapse rate over complex terrain in the Southern Ecuadorian Andes: implications for
temperature mapping. Arctic, Antarctic, and Alpine Research, 48, 673–684.
Figure 3 of this paper shows the linear regression with R2 = 0.98 for the tem-
perature data of nine stations whose elevations are in the range from 2,610 to
4,200 meters. The regression implies a lapse rate of 6.88◦ C/km for the annual
mean temperature.
[2] P. L. Fall, 1997: Timberline fluctuations and late Quaternary paleoclimates in the
Southern Rocky Mountains, Colorado. Geological Society of America Bulletin, 109,
1306–1320.
Figure 2a of this paper shows a lapse rate of 6.9◦ C/km for the mean July
temperature based on the data of 104 stations and a linear regression with
R2 = 0.86. Figure 2b shows the annual mean temperature TLR equal to
6.0◦ C/km with R2 = 0.80.
[3] F. A. Graybill and H. K. Iyer, 1994: Regression Analysis: Concepts and Applications.
Duxbury Press.
[4] M. J. Menne, C. N. Williams, and R. S. Vose, 2009: The United States Historical
Climatology Network monthly temperature data Version 2. Bulletin of the American
Meteorological Society, 90, 993–1007.
This is one of the series of papers on the USHCN datasets prepared at the
NOAA National Centers for Environmental Information and widely used
since the early 1990s. The network includes 1,218 stations and has both
monthly and daily data of temperature and precipitation publicly available.
www.ncdc.noaa.gov/ushcn
[5] H.-M., Zhang, B. Huang, J. Lawrimore, M. Menne, and T. M. Smith, 2019: NOAA
Global Surface Temperature Dataset (NOAAGlobalTemp), Version 5.0 (Time Series).
NOAA National Centers for Environmental Information. doi:10.7289/V5FN144H.
Last accessed March 2021.
Exercises
4.1 (a) Compute the temperature lapse rate for August using the TOB mean temperature
data from the 24 USHCN stations in Colorado during the period of 1981–2010.
(b) Plot the figure similar to Figure 4.1 for the data and results of (a).
4.2 Repeat the previous problem but for January. Compare your January results with those
of July in Figure 4.1 and its related text.
4.3 (a) Compute the temperature lapse rate for the annual mean temperature based on the
24 USHCN stations in Colorado during the period of 1981–2010.
(b) Plot the figure similar to Figure 4.1 for the data and results of (a).
4.4 (a) Compute the annual mean temperature lapse rate for a high mountainous region
and for a period of your choice.
(b) Plot the figure similar to Figure 4.1 for the data and results of (a).
4.5 Show that the orthogonality condition Eq. (4.27)
e · xa = 0 (4.108)
is equivalent to the condition of minimizing SSE given condition (4.19)
∑_{i=1}^{n} e_i = 0. (4.109)
4.6 Show that when m = 1, the confidence interval formula for the multiple linear regression
(4.102) is reduced to the confidence interval formula for the simple linear regression
(4.82).
4.7 Examine the global average December temperature anomalies from 1880 to 2018 in the
dataset of the NOAAGlobalTemp.
(a) Make a linear regression of the temperature anomalies against time.
(b) Compute the confidence intervals of the fitted model at the 95% confidence level.
(c) Compute the confidence intervals of the anomaly data at the 95% confidence level.
(d) On the same figure similar to Figure 4.6(a), plot the scatter plot of the anomaly
data against time, and plot the confidence intervals computed in (b) and (c).
4.8 Make a diagnostic analysis for the regression results of the previous problem.
(a) Produce a scatter plot of the residuals against time.
(b) Visually check whether the assumptions of linearity and constant variance are
satisfied.
(c) Use the KS test to check the normality assumption on residuals.
(d) Use the DW test to check the independence assumption on residuals.
(e) When serial correlation is considered, find the effective degrees of freedom (edof).
(f) Compute the confidence intervals in Steps (b) and (c) in Problem 4.7 using edof.
(g) Produce a scatter plot of the anomaly data against time, and plot the confidence
intervals on the same figure using the results of Step (f), similar to Figure 4.6(a).
4.9 (a) Fit the NOAAGlobalTemp’s global average annual mean data from 1880 to 2018
to a ninth-order orthogonal polynomial.
(b) Plot the data and the fitted polynomial function on the same figure.
(c) Produce a scatter plot of the residuals of the fitting against time as a different figure
from (b).
4.10 Make a diagnostic analysis for the above regression and examine the regression assump-
tions. In particular, use the KS test to verify the normality assumption, and the DW test
to verify the independence assumption.
4.11 (a) Use multiple linear regression to compute the 12th-order polynomial fitting of the
NOAAGlobalTemp’s global average annual mean data from 1880 to 2018:
T = b0 + b1 t + b2 t² + · · · + b12 t¹² + ε. (4.110)
(b) Plot the data and the fitted polynomial function on the same figure.
(c) Produce a scatter plot of the residuals of the fitting against time as a different figure
from (b).
4.12 Make a diagnostic analysis for the previous regression and examine the regression
assumptions. In particular, use the KS test to verify the normality assumption, and the
DW test to verify the independence assumption.
4.13 (a) Fit the global average January monthly mean temperature anomaly data from
1880 to 2018 in the NOAAGlobalTemp dataset to a third-order orthogonal poly-
nomial. The global average monthly mean NOAAGlobalTemp time series data are
included in the book’s master dataset named data.zip. You can also download
the updated data from the Internet.
(b) Plot the data and the fitted polynomial function on the same figure.
(c) Produce a scatter plot of the residuals of the fitting against time in another figure.
4.14 Make a diagnostic analysis for the previous January data fit following the procedures
in this chapter. In particular, use the KS test to verify the normality assumption, and the
DW test to verify the independence assumption.
4.15 (a) Fit the global average July monthly mean temperature anomaly data from 1880 to
2018 in the NOAAGlobalTemp dataset to a third-order orthogonal polynomial.
(b) Plot the data and the fitted polynomial function on the same figure.
(c) Produce a scatter plot of the residuals of the fitting against time in another figure.
4.16 Make a diagnostic analysis for the previous July data fit following the procedures in this
chapter. In particular, use the KS test to verify the normality assumption, and the DW
test to verify the independence assumption.
4.17 Use the gridded monthly NOAAGlobalTemp dataset and make a third-order polynomial
fit to the January monthly mean temperature anomaly data from 1880 to 2018 in a
grid box that covers Tokyo, Japan. The gridded monthly NOAAGlobalTemp dataset is
included in the book’s master data file data.zip. You can also download the updated
data from the Internet.
4.18 Use the gridded monthly NOAAGlobalTemp dataset and make a third-order polynomial
fit to the January monthly mean temperature anomaly data from 1880 to 2018 in a grid
box that covers Bonn, Germany.
4.19 Use the gridded monthly NOAAGlobalTemp dataset and make a fifth-order polynomial
fit to the monthly mean temperature anomaly data for a grid box, a month, and a period
of time of your choice. For example, you may choose your hometown grid box, the
month you were born, and the period of 1900–1999.
4.20 Make a diagnostic analysis for the previous data fit following the procedures in this
chapter.
4.21 Use the January time series data of a USHCN station of your choice and fit a third-
order polynomial. Make a diagnostic analysis for the fit. The updated monthly USHCN
station data may be downloaded from the Internet.
5 Matrices for Climate Data
Matrices appear everywhere in climate science. For example, climate data may be
written as a matrix – a 2-dimensional rectangular array of numbers or symbols – and
most data analyses and multivariate statistical studies require the use of matrices. The
study of matrices is often included in a course known as linear algebra. This chap-
ter is limited to (i) describing the basic matrix methods needed for this book, such as
the inverse of a matrix and the eigenvector decomposition of a matrix, and (ii) pre-
senting matrix application examples of real climate data, such as the sea level pressure
data of Darwin and Tahiti. From climate data matrices, we wish to extract helpful
information, such as the spatial patterns of climate dynamics (e.g., El Niño Southern
Oscillation), and temporal occurrence of the patterns. These are related to eigenvec-
tors and eigenvalues of matrices. This chapter features the space-time data arrangement,
which uses rows of a matrix for spatial locations, and columns for temporal steps. The
singular value decomposition (SVD) helps reveal the spatial and temporal features of
climate dynamics as singular vectors and the strength of their variability as singular
values.
To better focus on matrix theory, some application examples of linear algebra, such
as the balance of chemical reaction equations, are not included in the main text, but are
arranged as exercise problems. We have also designed exercise problems for the matrix
analysis of real climate data from both observations and models.
Matrices have appeared earlier in this book, e.g., the matrix formulation of the multiple
linear regression in Chapter 4. A matrix is a rectangular array of numbers (or even
expressions), often denoted by an uppercase letter in boldface or plain type:
A = [ a11  a12  · · ·  a1p ]
    [ a21  a22  · · ·  a2p ]
    [ · · ·  · · ·  · · ·  · · · ]
    [ an1  an2  · · ·  anp ].   (5.1)
Figure 5.1 A subset of the monthly surface air temperature anomalies from the NOAAGlobalTemp Version 4.0 dataset.
This is an n × p matrix, in which ai j are called elements or entries of the matrix A, i is the
row index from 1 to n, and j is the column index from 1 to p. The dimensions of a matrix
are denoted by subscript, e.g., An×p indicating that the matrix A has n rows and p columns.
The matrix may also be indicated by square brackets around a typical element A = [ai j ] or
sometimes {A}i j , maybe even Ai j . If n = p, then the array is a square matrix.
Figure 5.1 is an example of a space-time climate data matrix. It is a subset of the 5◦ × 5◦
gridded monthly surface air temperature anomalies from the NOAA Merged Land Ocean
Global Surface Temperature Analysis (NOAAGlobalTemp) (Version 4.0). The rows are
indexed according to the spatial locations prescribed by the latitude and longitude of the
centroid of a 5◦ × 5◦ grid box (see the entries of the first two columns in boldface). The
columns are indexed according to time (see the first-row entries in boldface). The other
entries are the temperature anomalies with respect to the 1971–2000 monthly climatology.
The anomalies are arranged according to the locations by rows and the time by columns.
The units for the anomaly data are ◦ C.
For a given month, the spatial temperature data on the Earth’s surface is itself a 2-
dimensional array. To make a space-time matrix, we assign each grid box a unique index
s from 1 to n if the spatial region has n grid boxes. The index assignment is subjective,
depending on the application needs. The commonly used way is to fix a latitude and
increase the index number as the longitude increases, as indicated by the first two columns
of the data matrix shown in Figure 5.1. When the longitude is finished at this latitude
band, go to the next latitude band until the completion of the latitude range. This can go
from south to north, or from north to south. Of course, one can fix the longitude first, and
increase the index according to the ascending or descending order of latitudes.
Following this spatial index as the row number, the climate data for a given month is a
column vector. If the dataset has data for p months, then the space-time data matrix has
p columns. If the dataset has n grid boxes, then the data forms an n × p space-time data
matrix. You can conveniently use row or column operations of a computer language to cal-
culate statistics of the dataset, such as spatial average, temporal mean, temporal variance,
etc.
For more explicit indication of space and time, you may use s for the row index and t
for the column index in a space-time data matrix. Thus, [Ast ] indicates a space-time data
matrix (s for space and t for time).
This space-time indexing can be extended to the data in 3D space and 1D time,
as long as we can assign a unique ID s from 1 to n for a 3D grid box and a
unique ID t for time. The presently popular netCDF (Network Common Data Form)
data format in climate science, denoted by .nc, uses this index procedure for a 4D
dataset. For example, to express the output of a 3D climate model, you can index
longitude first for a given latitude and altitude. When the longitudes are exhausted,
move to the next latitude; when the latitudes are exhausted, move to the next altitude,
until the last layer of the atmosphere or ocean is reached. Eventually, a space-time data
matrix [Ast] is formed.
To visualize the row data of a space-time data matrix [Ast ], just plot a line graph of
the row data against time. To visualize the column data of a space-time data matrix [Ast ],
you need to convert the column vector into a 2D pixel format for a 2D domain (e.g., the
contiguous United States (CONUS) region), or a 3D data array format for a 3D domain
(e.g., the CONUS atmosphere domain from the 1,000 mb surface level to the 10 mb height
level). This means that the climate data are represented in another matrix format, such as
the surface air temperature anomaly data on a 5-degree latitude–longitude grid for the entire
world for December 2015 visualized by Figure 1.8. The data behind the figure is a 36 × 72
data matrix on the grid whose rows are for latitude and columns for longitude. This data
matrix is in space-space pixel format like the data for a photo. Each time corresponds to a
new space-space pixel data matrix. Thus, the latitude–longitude-time forms a 3D data array.
With elevation, then latitude–longitude-altitude-time forms a 4D data array, which is often
written in the netCDF file in climate science. You can use the 4DVD data visualization
tool www.4dvd.org, described in Chapter 1, to visualize the 4D Reanalysis data array as
an example to understand the space-time data plot, and netCDF data structure.
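As a small illustration of converting one column of a space-time matrix back into a map, the following R sketch assumes a hypothetical 2592 × p anomaly matrix A on the 5-degree global grid, ordered longitude-first within each latitude band.

# Sketch: reshape one monthly column into a 72 x 36 longitude-latitude grid
onemonth = A[, 1]                              # anomalies of the first month
grid = matrix(onemonth, nrow = 72, ncol = 36)  # rows: 72 longitudes, columns: 36 latitudes
image(grid)                                    # quick-look map of the anomaly field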
The coordinates (32.5, 262.5) in the sixth row of Figure 5.1 indicate a 5◦ × 5◦
grid box centered at (32.5◦ N, 97.5◦ W). This box covers part of Texas, USA. The
large temperature anomalies in the summer of 1934 (2.64°C for June, 2.12°C for
July, and 2.36°C for August) occurred during the 1930s Dust Bowl period. The hot summer
of 1934 accompanied a wave of severe drought. The disastrous dust storms in the 1930s over
the American and Canadian prairies destroyed many farms and greatly damaged the
ecology.
This section provides a concise list summarizing fundamental properties and commonly
used operations of matrices. We limit our material to the basics that are sufficient for this
book.
(i) Zero matrix: A zero matrix has every entry equal to zero, 0 = [0], or explicitly
0 = [ 0  0  · · ·  0 ]
    [ 0  0  · · ·  0 ]
    [ · · ·  · · ·  · · ·  · · · ]
    [ 0  0  · · ·  0 ].   (5.2)
(ii) Identity matrix: An identity matrix is a square matrix whose diagonal entries are all
equal to one and whose off-diagonal entries are all equal to zero, and is denoted by I or I:
I = [ 1  0  · · ·  0 ]
    [ 0  1  · · ·  0 ]
    [ · · ·  · · ·  · · ·  · · · ]
    [ 0  0  · · ·  1 ].   (5.3)
People also use the following notation
I = [δij], (5.4)
where
δij = 1 when i = j, and δij = 0 otherwise. (5.5)
(iii) Diagonal matrix: A diagonal matrix has all of its off-diagonal entries equal to zero and can be written as
D = [di δij], (5.6)
where di is the ith diagonal entry.
(iv) Matrix transpose: The transpose of a matrix A, denoted by At, interchanges the rows and columns of A:
(At)ij = Aji. (5.7)
For example, if
A = [ 1  2  3 ]
    [ 4  5  6 ],   (5.8)
then
At = [ 1  4 ]
     [ 2  5 ]
     [ 3  6 ].   (5.9)
Here, : in the first position of the double-index subscript means the inclusion of all
the rows.
(ix) Dot product of two vectors: Two vectors of the same dimension can form a dot
product that is equal to the sum of the products of the corresponding entries:
u · v = u1 v1 + u2 v2 + · · · + un vn . (5.14)
The dot product is also called an inner product.
For example, if
u = (1, 2, 3), v = (4, 5, 6), (5.15)
then
u · v = 1 × 4 + 2 × 5 + 3 × 6 = 32. (5.16)
The amplitude of a vector u of dimension n is defined as
|u| = √( u1² + u2² + · · · + un² ). (5.17)
Sometimes, the amplitude is also called length, or Euclidean length, or magnitude.
Please do not mix the concept of Euclidean length of a vector with the dimensional
length of a vector. The latter means the number of entries of a vector, i.e., n.
If the Euclidean length of u is equal to one, we say that the u is a unit vector. If
every element of u is zero, then we say that u is a zero vector.
By the definition of dot product, we have
|u|2 = u · u. (5.18)
If u · v = 0, we say that u and v are orthogonal. Further, if u · v = 0 and |u| = |v| = 1,
then we say that u and v are orthonormal.
(x) Matrix multiplication: The product of matrix An×p and matrix B p×m is an n × m
matrix Cn×m whose element ci j is the dot product of the ith row vector of A and jth
column vector of B:
ci j = ai: · b: j . (5.19)
We denote
Cn×m = An×p B p×m , (5.20)
or simply
C = AB. (5.21)
Note that the number of columns of A and the number of rows of B must be the same
before the multiplication AB can be made, because the dot product ai: · b: j requires
this condition. This is referred to as the dimension-matching condition for matrix
multiplication. If this condition is violated, the two matrices cannot be multiplied.
For example, for the following two matrices
A3×2 = [ 1  0 ]       B2×2 = [ 0  −1 ]
       [ 0  4 ]              [ 1   2 ],   (5.22)
       [ 3  2 ]
we can compute
A3×2 B2×2 = [ 0  −1 ]
            [ 4   8 ]
            [ 2   1 ].   (5.23)
However, the expression
B2×2 A3×2 (5.24)
is not defined, because the dimensions do not match. Thus, for matrix multiplication
of two matrices, their order is important. The product BA may not be equal to AB
even when both are defined. That is, the commutative law does not hold for matrix
multiplication.
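A minimal R sketch of the product in Eq. (5.23) and of the dimension-matching condition is as follows.

# Sketch: matrix multiplication and the dimension-matching condition
A = matrix(c(1, 0,  0, 4,  3, 2), nrow = 3, byrow = TRUE)
B = matrix(c(0, -1, 1, 2), nrow = 2, byrow = TRUE)
A %*% B      # the 3 x 2 product in Eq. (5.23)
# B %*% A    # would fail: non-conformable arguments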
The dot product of two vectors can be written as the product of two matrices. If
both u and v are n-dimensional column vectors, then
u · v = ut v. (5.25)
The right-hand side is a 1 × n matrix times an n × 1 matrix, and the product is a 1 × 1
matrix, whose element is the result of the dot product. Computer programs usually
calculate a dot product using this process of matrix multiplication.
A scalar can always multiply a matrix, which is defined as follows. Given a scalar
c and a matrix A, their product is
cA = [c aij] = Ac. (5.26)
The scalar multiplication can be extended to multiple vectors
(u1 , u2 , · · · , up )
or matrices to form a linear combination:
u = c1 u1 + c2 u2 + · · · + cp up , (5.27)
where c1 , c2 , . . . , cp are coefficients of the linear combination and at least one of the
coefficients is nonzero. The multivariate linear regression discussed at the end of the previous
chapter is a linear combination. This is a very useful mathematical expression in
data science.
(xi) Matrix inversion: For a given square matrix A, if there is a matrix B such that
BA = AB = I, (5.28)
then B is called the inverse matrix of A, denoted by A−1 , i.e.,
A−1 A = AA−1 = I. (5.29)
Not all the matrices have an inverse. If a matrix has an inverse, then the matrix is
said to be invertible. Equivalently, A−1 exists.
As an example of matrix inversion, given
A = [ 1  −1 ]
    [ 1   2 ],   (5.30)
we have
A−1 = [  2/3  1/3 ]
      [ −1/3  1/3 ].   (5.31)
Hand calculation for the inverse of a small matrix is already very difficult, and
that for the inverse of a large matrix is almost impossible. Computers can do the
calculations for us, as will be shown in examples later.
According to the definition of inverse, we have the following formula for the
inverse of the product of two matrices:
(AB)−1 = B−1 A−1 , (5.32)
if both A and B are invertible matrices. Please note the order switch of the matrices.
With the definition of an inverse matrix, we can define the matrix division by
A/B = AB−1 (5.33)
when B−1 exists. In matrix operations, we usually do not use the concept of matrix
division, but always use the matrix inverse and matrix multiplication.
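A minimal R sketch, computing the inverse of the 2 × 2 example in Eq. (5.30) with solve(), is as follows.

# Sketch: matrix inverse with solve()
A = matrix(c(1, -1, 1, 2), nrow = 2, byrow = TRUE)
solve(A)          # the inverse in Eq. (5.31)
solve(A) %*% A    # identity matrix, up to rounding errors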
(xii) More properties of the matrix transpose:
(At)t = A, (5.34)
(A + B)t = At + Bt, (5.35)
(AB)t = Bt At, (5.36)
(A−1)t = (At)−1. (5.37)
(xiii) Orthogonal matrices: An orthogonal matrix1 is one whose row vectors are orthonor-
mal. In this case, the inverse matrix can be easily found: it is its transpose. That is, if
A is an orthogonal matrix, then
A−1 = At . (5.38)
The proof of this claim is very simple. The orthonormal property of the row
vectors of A implies that
AAt = I. (5.39)
By the definition of matrix inverse, At is the inverse matrix of A.
If A is an orthogonal matrix, its column vectors are also orthonormal. This can be
proved by multiplying both sides of the above by At from the left:
At AAt = At I. (5.40)
Then multiply both sides of this equation by (At )−1 from the right:
At A(At (At )−1 ) = At I(At )−1 , (5.41)
which yields
At A = I. (5.42)
1 Although orthogonal matrix is a standard mathematical terminology, it is acceptable if you call it orthonormal
matrix.
is an orthogonal matrix for any given real number θ . You can easily verify this using
the trigonometrical identity sin2 θ + cos2 θ = 1. We thus have
−1 cos θ sin θ
T = , (5.44)
− sin θ cos θ
You can easily verify that T−1 T = I by hand calculation of the product of the two
matrices in this equation.
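The same check can be done on a computer; the following R sketch (an illustration, not the book's code) builds the matrix of Eq. (5.43) for θ = π/6 and confirms that its inverse equals its transpose:

theta = pi/6
T1 = matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow = 2)  # rotation matrix
round(t(T1) %*% T1, 10)       # the 2 x 2 identity matrix
round(solve(T1) - t(T1), 10)  # a zero matrix: the inverse equals the transpose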
A meteorologist needs to decide what instruments to order under the following
constraint. She is given a budget of $1,000 to purchase 30 instruments for her observational
sites. Her supplier offers two products: the first costs $30 per set, and the second $40
per set. She would like to buy as many of the second type of instrument as possible
under the budget constraint. How many instruments of the second kind can she buy?
This problem leads to the following linear system of two equations:
Ax = b, (5.47)
where
A = \begin{pmatrix} 30 & 40 \\ 1 & 1 \end{pmatrix}, (5.48)
x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, (5.49)
b = \begin{pmatrix} 1000 \\ 30 \end{pmatrix}. (5.50)
Then, the solution of this system may be compactly expressed as
x = A^{-1} b. (5.52)
This expression is convenient for mathematical proofs, but is rarely used for solving a
linear system, because finding an inverse matrix is computationally costly. One way to solve
a linear system is to use Gaussian elimination. The corresponding computing procedure is
called the row operation on a matrix. There are numerous ways of solving a linear system.
Some are particularly efficient for a certain type of system, such as a sparse matrix or a matrix
with a diagonal band of width equal to 3 or 5. Efficient algorithms for linear systems,
particularly extremely large systems, are a perennial research topic. In this book, we use a
computer to solve a linear system without studying the algorithm details. The R and Python
commands are as follows.
solve(A, b)                # This is the R code for finding x
numpy.linalg.solve(A, b)   # This is the Python code
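For the instrument-purchase problem above, a short R computation (an illustrative sketch) gives the answer directly:

A = matrix(c(30, 1, 40, 1), nrow = 2)  # coefficient matrix of Eq. (5.48)
b = c(1000, 30)
solve(A, b)
# [1] 20 10   # x1 = 20 sets of the first type, x2 = 10 of the second

Thus, she can buy at most 10 instruments of the second kind.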
A linear transformation converts a vector x_{n×1} into y_{m×1} by multiplication with a
matrix T_{m×n}:
y_{m×1} = T_{m×n} x_{n×1}. (5.53)
Usually, the linear transformation Tx changes both the direction and magnitude of x. However,
if T is an orthogonal matrix, then Tx does not change the magnitude of x, and changes
only the direction. This claim can be simply proved by the following formula:
|Tx|^2 = (Tx)^t (Tx) = x^t (T^t T) x = x^t x = |x|^2.
Next we consider the linear independence of vectors. Suppose the column vectors
x_1, x_2, . . . , x_p of a square matrix X are linearly dependent; then one of them, say x_1, can be
written as a linear combination of the others:
x_1 = d_2 x_2 + · · · + d_p x_p, (5.58)
where at least one of the coefficients d_2, d_3, · · · , d_p is nonzero. Thus, the linear system of
equations for c_1, c_2, c_3, · · · , c_p,
c_1 x_1 + c_2 x_2 + · · · + c_p x_p = 0, (5.59)
has a nonzero solution, e.g., c_1 = 1, c_2 = -d_2, . . . , c_p = -d_p. In matrix form, this system is
Xc = 0, (5.60)
where X = [x_1 x_2 · · · x_p] and c = (c_1, c_2, . . . , c_p)^t,
and 0 is the p-dimensional zero column vector. If X^{-1} exists, then the solution of this matrix equation is
c = X^{-1} 0 = 0. (5.63)
However, c must not be zero. This contradiction implies that X−1 does not exist if the col-
umn vectors are linearly dependent. In other words, if X−1 exists, then its column vectors
are linearly independent.
Consider vectors in a 3-dimensional space. Any two column vectors x2 and x3 define a
plane. If x1 can be written as a linear combination of x2 and x3 , then it must lie in the same
plane. So, x1 , x2 and x3 are linearly dependent. The matrix [x1 x2 x3 ]3×3 is not invertible.
5.3.4 Determinants
For a square matrix A, a convenient notation and concept is its determinant. It is a scalar
and is denoted by det[A] or |A|. For a 2 × 2 matrix
A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, (5.64)
its determinant is
det[A] = ad − cb. (5.65)
For a high-dimensional matrix, the determinant computation is quite complex and is com-
putationally expensive. We usually do not need to calculate the determinant of a large
matrix, say A172×172 . The computer command for computing the determinant of a small
square matrix is as follows:
det(A)             # R command for determinant
np.linalg.det(A)   # Python command for determinant
Two 2-dimensional column vectors x_1 and x_2 span a parallelogram, whose area S is
equal to the absolute value of the determinant of the matrix A = [x_1 x_2] consisting of the
two vectors:
S = |det[x_1 x_2]| . (5.66)
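A quick R illustration (not the book's own code) of Eq. (5.66):

x1 = c(3, 0)
x2 = c(1, 2)
abs(det(cbind(x1, x2)))   # area of the parallelogram spanned by x1 and x2: 6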
Some useful properties of determinants are listed below:
(a) The determinant of a diagonal matrix is the product of its diagonal elements.
(b) If a determinant has a zero row or column, the determinant is zero.
(c) The determinant does not change after a matrix transpose, i.e., det[At ] = det[A].
(d) The determinant of the product of two matrices: det[AB] = det[A]det[B].
(e) The determinant of the product of a matrix with a scalar: det[cA] = c^n det[A], if A is an
n × n matrix.
(f) The determinant of an orthogonal matrix is equal to 1 or −1.
The rank of A is the greatest number of columns of the matrix that are linearly independent,
and is denoted by r[A].
For a 3 × 3 matrix A, we treat each column as a 3-dimensional vector. If all three lie
along a line (i.e., are collinear), the rank of A is one. If all three of the vectors lie in a plane,
but are not collinear, the rank is two. If the three vectors neither are collinear nor lie in a
common plane, the rank is three.
If the rank of a square matrix is less than its dimension, then at least one column can be
a linear combination of other columns, which implies that the determinant vanishes. If A
has rank r, it is possible to find r linearly independent columns, and all the other columns
are linear combinations of these r independent columns.
Some properties about the matrix rank are listed below:
(a) If det[A_{n×n}] ≠ 0, then r[A] = n, and the matrix A_{n×n} is invertible and is said to be
nonsingular.
(b) If det[A] = 0, then the rank of A is less than n, and A is not invertible and is said to be
singular.
(c) If B is multiplied by a nonsingular matrix A, the product has the same rank as B.
(d) 0 ≤ r[An×p ] ≤ min(n, p).
(e) r[A] = r[At ].
(f) r[A B] ≤ min(r[A], r[B]).
(g) r[A At ] = r[At A] = r[A].
(h) r[A + B] ≤ r[A] + r[B].
Computers can easily demonstrate the matrix computations following the theories
presented in this chapter so far. The computer code is below.
# R code: Computational examples of matrices
A = matrix(c(1, 0, 0, 4, 3, 2), nrow = 3, byrow = TRUE)
B = matrix(c(0, 1, -1, 2), nrow = 2)  # form a matrix by columns
C = A %*% B   # matrix multiplication
C
# [1,]    0   -1
# [2,]    4    8
# [3,]    2    1
t(C)   # transpose matrix of C
# [1,]    0    4    2
# [2,]   -1    8    1
library(Matrix)
rankMatrix(A)   # find the rank of a matrix
# [1] 2   # rank(A) = 2
# Orthogonal matrices
p = sqrt(2)/2
Q = matrix(c(p, -p, p, p), nrow = 2)
Q   # an orthogonal matrix
#            [,1]      [,2]
# [1,]  0.7071068 0.7071068
# [2,] -0.7071068 0.7071068
Q %*% t(Q)   # verify Q is an orthogonal matrix
#      [,1] [,2]
# [1,]    1    0
# [2,]    0    1
det(Q)   # the determinant of an orthogonal matrix is 1 or -1
# [1] 1
# Python code: computational examples of matrices
import numpy as np

# Matrix multiplication
A = [[1, 0], [0, 4], [3, 2]]
B = [[0, -1], [1, 2]]
C = np.matmul(A, B)   # or C = np.dot(A, B)
print('C =', C)
# C = [[ 0 -1]
#      [ 4  8]
#      [ 2  1]]
print('Transpose matrix of C =', C.transpose())
# Transpose matrix of C = [[ 0  4  2]
#                          [-1  8  1]]

# Matrix inversion
A = [[1, -1], [1, 2]]
np.linalg.inv(A)   # compute the inverse of A
# array([[ 0.66666667, 0.33333333],
#        [-0.33333333, 0.33333333]])

# An orthogonal matrix
p = np.sqrt(2)/2
Q = [[p, p], [-p, p]]
print('Orthogonal matrix Q =', np.round(Q, 2))
T = np.transpose(Q)
print('Q times transpose of Q =', np.matmul(Q, T))
print('Determinant of Q =', np.linalg.det(Q))
# Orthogonal matrix Q = [[ 0.71 0.71]
#                        [-0.71 0.71]]
# Q times transpose of Q = [[1. 0.]
#                           [0. 1.]]
# Determinant of Q = 1.0
Figure 5.2 An eigenvector v, a non-eigenvector u, and their linear transforms by matrix A: Av and Au. Here, Av and v are parallel, and Au and u are not parallel.
Consider the 2 × 2 matrix A whose rows are (1, 2) and (2, 1), i.e., the matrix used in
Figure 5.2 and Eq. (5.71). For most vectors, such as u = (1, 0), the product Au = (1, 2)
points in a direction different from u (the blue vectors in Figure 5.2). However, there
exist some special vectors v such that Av is parallel to v. For example, v = (1, 1) is such a
vector, since Av = (3, 3) is in the same direction as v = (1, 1). See the red vectors in
Figure 5.2. If two vectors are parallel, then one vector is a scalar multiple of the other,
e.g., (3, 3) = 3(1, 1). We denote this scalar by λ. Thus,
Av = λ v. (5.69)
These vectors v are special to A: they maintain their own orientation when multiplied by A, and
are called eigenvectors. Here, "eigen" is from German, meaning "self," "own," "particular,"
or "special."² The corresponding scalars λ are called eigenvalues. The formula (5.69) is a
mathematical definition of the eigenvalue problem for matrix A.
If v is an eigenvector, then any scalar multiple cv is also an eigenvector, since
A(cv) = cAv = cλv = λ(cv).
Namely, all the vectors in the direction of v are also eigenvectors. The eigenvectors of
length 1 are called unit eigenvectors, or unitary eigenvectors, and are unique up to a positive
or negative sign. Most computer programs output unit eigenvectors. Thus, an eigenvector
describes a direction or an orientation. Each square matrix has its own special orientations.
The aforementioned vector
v = \begin{pmatrix} 1 \\ 1 \end{pmatrix} (5.70)
is an eigenvector that maintains its own direction after being multiplied by A:
\begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 3 \\ 3 \end{pmatrix} = 3 \begin{pmatrix} 1 \\ 1 \end{pmatrix}. (5.71)
Thus, v = (1, 1)^t is an eigenvector of A and λ = 3 is an eigenvalue of A. The corresponding unit eigenvector is
e = v/|v| = \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}.
Another eigenvector for the above matrix A is v_2 = (1, −1):
\begin{pmatrix} 1 & 2 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ -1 \end{pmatrix} = -1 \times \begin{pmatrix} 1 \\ -1 \end{pmatrix}. (5.72)
The second eigenvalue is λ2 = −1.
The computer code for the eigenvalues and eigenvectors of the above matrix A is as
follows.
# R code for eigenvectors and eigenvalues
A = matrix(c(1, 2, 2, 1), nrow = 2)
eigen(A)
# $values
# [1]  3 -1
# $vectors
#           [,1]       [,2]
# [1,] 0.7071068 -0.7071068
# [2,] 0.7071068  0.7071068
2 The “eigen” part in the word “eigenvector” is from German or Dutch and means “self” or “own,” as in “one’s
own.” Thus, an eigenvector v is A’s “own” vector. In English books, the word “eigenvector” is the standard
translation of the German word “eigenvektor.” The word “eigenvalue” is translated from the German word
“eigenwert,” as “wert” means “value.” Instead of eigenvector, some English publications use “characteristic
vector,” which indicates “characteristics of a matrix,” or “its own property of a matrix.” German mathematician
David Hilbert (1862–1943) was the first to use “eigenvektor,” and “eigenwert” in his 1904 article about a
general theory of linear integral equations.
For an N × N matrix A, the eigenvalue problem (5.69) can be rewritten as
(A − λI)v = 0. (5.73)
A nonzero solution v exists only if the determinant of the coefficient matrix vanishes:
det(A − λI) = 0. (5.74)
Expanding the determinant out leads to an Nth degree polynomial in λ, which has exactly
N roots. Some roots may be repeated and hence counted multiple times toward N. Some
roots may be complex numbers, which are not discussed in this book. Each root is an
eigenvalue and corresponds to an eigenvector.
This concise determinant expression is tidy and useful for mathematical proofs, but is
not used as a computer algorithm for calculating eigenvalues or eigenvectors because it
is computationally costly or even impossible for a large matrix. Nonetheless, some old
textbooks of linear algebra defined eigenvalues using Eq. (5.74).
Figure 5.2 may be generated by the following computer code.
# R plot Fig. 5.2: An eigenvector v vs a non-eigenvector u
setwd('/Users/sshen/climstats')
setEPS()   # plot the figure and save the file
postscript("fig0502.eps", width = 6)
par(mar = c(4.5, 4.5, 2.0, 0.5))
plot(9, 9,
     main = 'An eigenvector vs a non-eigenvector',
     cex.axis = 1.4, cex.lab = 1.4,
     xlim = c(0, 3), ylim = c(0, 3),
     xlab = bquote(x[1]), ylab = bquote(x[2]))
arrows(0, 0, 1, 0, length = 0.25,
       angle = 8, lwd = 5, col = 'blue')
arrows(0, 0, 1, 2, length = 0.3,
       angle = 8, lwd = 2, col = 'blue', lty = 3)
arrows(0, 0, 1, 1, length = 0.25,
       angle = 8, lwd = 5, col = 'red')
arrows(0, 0, 3, 3, length = 0.3,
       angle = 8, lwd = 2, col = 'red', lty = 3)
text(1.4, 0.1, 'Non-eigenvector u', cex = 1.4, col = 'blue')
text(1.0, 2.1, 'Au', cex = 1.4, col = 'blue')
text(1.5, 0.9, 'Eigenvector v', cex = 1.4, col = 'red')
text(2.8, 2.95, 'Av', cex = 1.4, col = 'red')
dev.off()
# Python plot of Fig. 5.2 (continuation). The opening lines below restore the
# setup assumed by this fragment; the arrow-style settings kw1 and kw2 are
# reconstructed approximations, not the book's original values.
import matplotlib.pyplot as plt
import matplotlib.patches as patches
kw1 = dict(arrowstyle = '-|>', mutation_scale = 20, color = 'blue')  # assumed
kw2 = dict(arrowstyle = '-|>', mutation_scale = 20, color = 'red')   # assumed
plt.figure(figsize = (8, 8))
plt.xlim(0, 3)
plt.ylim(0, 3)
a1 = patches.FancyArrowPatch((0, 0), (1, 0), **kw1, linewidth = 5)
a2 = patches.FancyArrowPatch((0, 0), (1, 2), **kw1)
a3 = patches.FancyArrowPatch((0, 0), (1, 1), **kw2, linewidth = 5)
a4 = patches.FancyArrowPatch((0, 0), (3, 3), **kw2)
for a in [a1, a2, a3, a4]:
    plt.gca().add_patch(a)
plt.title('An eigenvector v vs a non-eigenvector u')
plt.xlabel(r'$x_1$', fontsize = 25)
plt.ylabel(r'$x_2$', fontsize = 25)
plt.text(0.6, 0.1, 'Non-eigenvector u', color = 'blue', fontsize = 25)
plt.text(0.8, 2.0, 'Au', color = 'blue', fontsize = 25)
plt.text(1.03, 0.85, 'Eigenvector v', color = 'red', fontsize = 25)
plt.text(2.7, 2.9, 'Av', color = 'red', fontsize = 25)
plt.show()
(c) If all the eigenvalues of A_{n×n} are positive, then the quadratic form
Q(x) = x^t A x = \sum_{i,j=1}^{n} a_{ij} x_i x_j (5.76)
is also a positive scalar for any nonzero vector x. A quadratic form may be used to
express kinetic energy or total variance in climate science. The kinetic energy in cli-
mate science is always positive. For all unit vectors x, the maximum quadratic form
is equal to the largest eigenvalue of A. The maximum is achieved when x is the
corresponding eigenvector of A.
(d) The rank of matrix A is equal to the number of nonzero eigenvalues, where the
multiplicity of repeated eigenvalues is counted.
(e) Eigenvalues of a diagonal matrix are equal to the diagonal elements.
(f) For a symmetric matrix A_{n×n}, its n unit column eigenvectors form an orthogonal matrix
Q_{n×n} such that A_{n×n} can be diagonalized by Q_{n×n} in the following way:
A_{n×n} = Q_{n×n} D_{n×n} (Q_{n×n})^t, (5.77)
where D_{n×n} is the diagonal matrix of the eigenvalues of A. For example, for the 2 × 2
symmetric matrix A of Eq. (5.71), the eigenvalues and unit eigenvectors are
λ_1 = 3, λ_2 = −1, (5.79)
v_1 = \begin{pmatrix} \sqrt{2}/2 \\ \sqrt{2}/2 \end{pmatrix}, \quad v_2 = \begin{pmatrix} -\sqrt{2}/2 \\ \sqrt{2}/2 \end{pmatrix}. (5.80)
Thus, the orthogonal matrix Q and the diagonal matrix D are as follows:
Q = \begin{pmatrix} \sqrt{2}/2 & \sqrt{2}/2 \\ \sqrt{2}/2 & -\sqrt{2}/2 \end{pmatrix}, \quad D = \begin{pmatrix} 3 & 0 \\ 0 & -1 \end{pmatrix}. (5.81)
With Q, A, and D, you can easily verify Eq. (5.77) by hand calculation or by computer
coding.
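A minimal R check (an illustrative sketch) of this diagonalization for the matrix A of Eq. (5.71):

A = matrix(c(1, 2, 2, 1), nrow = 2)
Q = eigen(A)$vectors        # unit eigenvectors as columns
D = diag(eigen(A)$values)   # diagonal matrix of the eigenvalues 3 and -1
Q %*% D %*% t(Q)            # recovers A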
Equation (5.77) can also be written as
A_{n×n} = \sum_{k=1}^{n} λ_k\, q^{(k)}_{n×1} (q^{(k)})^t_{1×n}, (5.83)
where q^{(k)} is the kth column of Q.
C = Q D_c Q^t . (5.87)

Figure 5.3 Schematic diagrams of SVD: A = UDV^t. Case 1: n > m (top panel). Case 2: n < m (bottom panel).
# Python check of the singular values via the eigenvalues of B B^t
# (the import and the matrix A are restored from the earlier, not-shown code;
#  A is the 2 x 3 data matrix of Eqs. (5.91)-(5.92))
import numpy as np
A = [[-1, 0, -2], [1, 2, 3]]
B = np.array(A)
C = np.matmul(B, B.T)   # B times B transpose
valC, vecC = np.linalg.eig(C)
np.sqrt(valC)
# array([1.0855144, 4.22157062])
The computer code gives the following SVD results, expressed in mathematical form:
(a) The vector form:
A = \begin{pmatrix} -1 & 0 & -2 \\ 1 & 2 & 3 \end{pmatrix}
  = 4.22 \times \begin{pmatrix} -0.48 \\ 0.88 \end{pmatrix} \begin{pmatrix} 0.32 & 0.42 & 0.85 \end{pmatrix}
  + 1.09 \times \begin{pmatrix} 0.88 \\ 0.48 \end{pmatrix} \begin{pmatrix} -0.37 & 0.88 & -0.29 \end{pmatrix}, (5.91)
and
(b) The matrix form:
A = \begin{pmatrix} -1 & 0 & -2 \\ 1 & 2 & 3 \end{pmatrix}
  = \begin{pmatrix} -0.48 & 0.88 \\ 0.88 & 0.48 \end{pmatrix}
    \begin{pmatrix} 4.22 & 0 \\ 0 & 1.09 \end{pmatrix}
    \begin{pmatrix} 0.32 & 0.42 & 0.85 \\ -0.37 & 0.88 & -0.29 \end{pmatrix}. (5.92)
If we use only the first singular vectors to approximate A from the two triplets of singular
vectors and singular values, the result is as follows.
# Data reconstruction by singular vectors: R code
# (A and svdA = svd(A) are restored here from the earlier, not-shown code)
A = matrix(c(-1, 1, 0, 2, -2, 3), nrow = 2)
svdA = svd(A)
round(svdA$d[1]*svdA$u[, 1] %*% t(svdA$v[, 1]),
      digits = 1)
#      [,1] [,2] [,3]
# [1,] -0.7 -0.8 -1.7
# [2,]  1.2  1.5  3.2
It is quite close to A.
If we use both singular vectors to reconstruct A, then the reconstruction is exact without
errors, as expected.
round(svdA$d[1]*svdA$u[, 1] %*% t(svdA$v[, 1]) +
      svdA$d[2]*svdA$u[, 2] %*% t(svdA$v[, 2]),
      digits = 2)
#      [,1] [,2] [,3]
# [1,]   -1    0   -2
# [2,]    1    2    3
Figure 5.3 for the schematic diagram of SVD may be plotted by the following computer
code.
# Spatial matrix U
segments(x0 = c(7, 7, 10, 10),
         y0 = c(6, 12, 12, 6) + 1,
         x1 = c(7, 10, 10, 7),
         y1 = c(12, 12, 6, 6) + 1,
         col = c('blue', 'blue', 'blue', 'blue'), lwd = 3)
segments(x0 = c(7.5, 8),
         y0 = c(6, 6) + 1,
         x1 = c(7.5, 8),
         y1 = c(12, 12) + 1,
         lwd = 1.3, lty = 3, col = 'blue')
text(6.2, 9 + 1, 'n', srt = 90, col = 'blue', cex = 1.4)
text(8.5, 12.8 + 1, 'm', col = 'red', cex = 1.4)
text(9, 9 + 1, '...', cex = 1.4, col = 'blue')
text(8.7, 5.0 + 1, bquote(U[n %*% m]), cex = 2.5, col = 'blue')
# Singular value diagonal matrix D
segments(x0 = c(12, 12, 15, 15),
         y0 = c(9, 12, 12, 9) + 1,
         x1 = c(12, 15, 15, 12),
         y1 = c(12, 12, 9, 9) + 1,
         col = c('brown', 'brown', 'brown', 'brown'), lwd = 3)
segments(x0 = 12, y0 = 12 + 1, x1 = 15, y1 = 9 + 1, lty = 3,
         col = c('brown'), lwd = 1.3)   # diagonal line
text(11.2, 10.5 + 1, 'm', srt = 90, col = 'red', cex = 1.4)
text(13.5, 12.8 + 1, 'm', col = 'red', cex = 1.4)
text(14.1, 11.3 + 1, '0', col = 'brown', cex = 1.4)
text(12.9, 10.0 + 1, '0', col = 'brown', cex = 1.4)
text(13.9, 8.0 + 1, bquote(D[m %*% m]), cex = 2.5, col = 'brown')
# Temporal matrix V
segments(x0 = c(17, 17, 20, 20),
         y0 = c(9, 12, 12, 9) + 1,
         x1 = c(17, 20, 20, 17),
         y1 = c(12, 12, 9, 9) + 1,
         col = c('red', 'red', 'red', 'red'), lwd = 3)
segments(x0 = c(17, 17),
         y0 = c(11.5, 10.8) + 1,
         x1 = c(20, 20),
         y1 = c(11.5, 10.8) + 1,
         col = c('red', 'red'), lty = 3, lwd = 1.3)
text(16.2, 10.5 + 1, 'm', srt = 90, col = 'red', cex = 1.4)
text(18.5, 12.5 + 1, 'm', col = 'red', cex = 1.4)
text(19.5, 8 + 1, bquote((V^t)[m %*% m]), cex = 2.5, col = 'red')
text(18.5, 10 + 1, '...', col = 'red', srt = 90, cex = 1.4)
# Space-time data matrix B when n < m
segments(x0 = c(0, 0, 6, 6),
         y0 = c(0, 3, 3, 0),
         x1 = c(0, 6, 6, 0),
         y1 = c(3, 3, 0, 0),
         col = c('blue', 'red', 'blue', 'red'), lwd = 3)
segments(x0 = c(1, 2, 5),
         y0 = c(0, 0, 0),
         x1 = c(1, 2, 5),
         y1 = c(3, 3, 3),
         lwd = 1.3, lty = 3)
text(-0.8, 1.5, 'n', srt = 90, col = 'blue', cex = 1.4)
text(3, 3.8, 'm', col = 'red', cex = 1.4)
This Python code generates the top-left rectangular box in Figure 5.3. The remaining
code for other boxes is highly repetitive and can be found from the book website.
5.6 SVD for the Standardized Sea Level Pressure Data of Tahiti and Darwin
The Southern Oscillation Index (SOI) is an indicator for El Niño or La Niña. It is computed
as the difference of sea level pressure (SLP) of Tahiti (17.75◦ S, 149.42◦ W) minus that
of Darwin (12.46◦ S, 130.84◦ E). An SVD analysis of the SLP data can substantiate this
calculation formula.
The following shows the data matrix of the standardized SLP anomalies of Tahiti and
Darwin from 2009 to 2015, and its SVD.
# R SVD analysis for the weighted SOI from SLP data
setwd("/Users/sshen/climmath")
Pda <- read.table("data/PSTANDdarwin.txt", header = F)
dim(Pda)
# [1] 65 13   # monthly Darwin data from 1951-2015
pdaDec <- Pda[, 13]   # Darwin Dec standardized SLP anomalies data
Pta <- read.table("data/PSTANDtahiti.txt", header = F)
ptaDec = Pta[, 13]    # Tahiti Dec standardized SLP anomalies
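The remaining steps of the script are not shown above; a minimal R sketch of them is given below. It assumes that rows 59 to 65 of the yearly files correspond to December 2009–2015 (the files cover 1951–2015) and that the Darwin series forms the first row of the data matrix, consistent with the interpretation that follows.

# A sketch (not the book's own script): form the 2 x 7 data matrix for
# December 2009-2015 and compute its SVD
A = rbind(pdaDec[59:65], ptaDec[59:65])  # row 1: Darwin, row 2: Tahiti
svdA = svd(A)
svdA$u   # spatial singular vectors (EOFs): weights for Darwin and Tahiti
svdA$d   # singular values
svdA$v   # temporal singular vectors (PCs) for 2009-2015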
UDV^t (5.93)
The first column vector (−0.66, 0.75) of the spatial pattern matrix U may be interpreted
to be associated with the SOI, which puts a negative weight −0.66 on Darwin, and a pos-
itive weight 0.75 on Tahiti. The weighted sum is approximately equal to the difference of
Tahiti’s SLP minus that of Darwin, which is the definition of SOI. The index measures
large-scale ENSO dynamics of the tropical Pacific (Trenberth 2020). The magnitude of the
vector (−0.66, 0.75) is approximately 1, because U is a unitary matrix. The corresponding
first temporal singular vector has a distinctly large value 0.72 in December 2010, which
was a strong La Niña month. In this month, Darwin had a strong negative SLP anomaly,
while Tahiti had a strong positive SLP anomaly. This situation enhanced the easterly trade
winds in the tropical Pacific and caused abnormally high precipitation in Australia in the
2010–2011 La Niña period.
The second column vector (0.75, 0.66) of the spatial pattern matrix U also has climate
implications. The weighted sum with two equal positive weights measures the small-scale
tropical Pacific dynamics (Trenberth 2020).
The column vectors of the spatial matrix U are spatial singular vectors, also called EOFs,
while those of the temporal matrix V are temporal singular vectors, also called PCs. The
first few EOFs often have climate dynamic interpretations, such as El Niño Southern Oscil-
lation (ENSO). If EOF1 corresponds to El Niño and shows some typical ENSO properties,
such as the opposite signs of SLP anomalies of Darwin and Tahiti, then PC1 shows a tem-
poral pattern, e.g., the extreme values of PC1 indicating both the occurrence time and the
strength of El Niño. The diagonal elements of matrix D are singular values, also known as
eigenvalues.
An eigenvector v of a square matrix C is a special vector such that C’s action on v does
not change its orientation, i.e., Cv is parallel to v. This statement implies the existence of a
scalar λ, called an eigenvalue (also known as a singular value or characteristic value), such that
Cv = λ v. (5.95)
We have also discussed the matrix method of solving a system of linear equations
Ax = b, (5.96)
linear independence of vectors, linear transform, and other basic matrix methods. These
methods are useful for the chapters on covariance, EOFs, spectral analysis, regression anal-
ysis, and machine learning. You may focus on the computing methods and computer code
of the relevant methods. The mathematical proofs of this chapter, although helpful for
exploring new mathematical methods, are not necessarily needed to read the other chap-
ters of this book. If you are interested in an in-depth mathematical exploration of matrix
theory, you may wish to read the books by Horn and Johnson (1985) and Strang (2016).
References and Further Reading
[1] G. H. Golub and C. Reinsch, 1970: Singular value decomposition and least squares
solutions. Numerische Mathematik, 14, 403–420.
[2] R. A. Horn and C. R. Johnson, 1985: Matrix Analysis. Cambridge University Press.
[3] G. Strang, 2016: Introduction to Linear Algebra. 5th ed., Wellesley-Cambridge Press.
[4] G. W. Stewart, 1993: On the early history of the singular value decomposition. SIAM
Review, 35, 551–566.
This paper describes the contributions from five mathematicians in the period
of 1814–1955 to the development of the basic SVD theory.
[5] K. Trenberth, and National Center for Atmospheric Research Staff (eds.), 2020: The
Climate Data Guide: Southern Oscillation Indices: Signal, Noise and Tahiti/Darwin
SLP (SOI). https://climatedataguide.ucar.edu/climate-data/southern-oscillation-indices-signal-noise-and-tahitidarwin-slp-soi
This site describes the optimal indices for large- and small-scale dynamics.
[6] A. Tucker, 1993: The growing importance of linear algebra in undergraduate mathe-
matics. College Mathematics Journal, 24, 3–9.
This paper describes the historical development of linear algebra, such as the
term “matrix” being coined by J. J. Sylvester in 1848, and pointed out that
“tools of linear algebra find use in almost all academic fields and throughout
modern society.” The use of linear algebra in the big data era is now even more
popular.
Exercises
5.1 Write a computer code to
(a) Read the NOAAGlobalTemp data file, and
(b) Generate a 4 × 8 space-time data matrix for the December mean surface air tem-
perature anomaly data of four grid boxes and eight years. Hint: You may find the
NOAA Global Surface Temperature (NOAAGlobalTemp) dataset online. You can
use either netCDF format or CSV format.
5.2 Write a computer code to find the inverse of the following matrix.
#      [,1] [,2] [,3]
# [1,]  1.7 -0.7  1.3
# [2,] -1.6 -1.4  0.4
# [3,] -1.5 -0.3  0.6
5.3 Write a computer code to solve the following linear system of equations:
Ax = b, (5.97)
where
A = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 0 \end{pmatrix}, \quad x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ -1 \\ 0 \end{pmatrix}. (5.98)
5.4 The following equation
\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} (5.99)
has infinitely many solutions, and cannot be directly solved by a simple computer
command, such as solve(A, b).
(a) Show that the three row vectors of the coefficient matrix are not linearly independ-
ent.
(b) Because of the dependence, the linear system has only two independent equations.
Thus, reduce the linear system to two equations by treating x3 as an arbitrary value
while treating x1 and x2 as variables.
(c) Solve the two equations for x1 and x2 and express them in terms of x3 . The infinite
possibilities of x3 imply infinitely many solutions of the original system.
5.5 Ethane is a gas similar to the greenhouse gas methane and can burn with oxygen to form
carbon dioxide and water:
C2 H6 + O2 −→ CO2 + H2 O. (5.100)
Given two ethane molecules, how many molecules of oxygen, carbon dioxide and water
are required for this chemical reaction equation to be balanced? Hint: Assume x, y, and z
molecules of oxygen, carbon dioxide, and water, respectively, and use the balance of the
number of atoms of carbon, hydrogen, and oxygen to form linear equations. Solve the
system of linear equations.
5.6 Carry out the same procedure as the previous problem but for the burning of methane
CH4 .
5.7 (a) Use matrix multiplication to show that the vector
u = \begin{pmatrix} 1 \\ 1 \end{pmatrix} (5.101)
is not an eigenvector of the following matrix:
A = \begin{pmatrix} 0 & 4 \\ -2 & -7 \end{pmatrix}. (5.102)
(b) Find all the unit eigenvectors of matrix A in (a).
5.8 Use hand calculation to compute the matrix multiplication UDV^t, where the data of the
relevant matrices are given by the following R output.
A = matrix(c(1, -1, 1, 1), nrow = 2)
A
#      [,1] [,2]
# [1,]    1    1
# [2,]   -1    1
svd(A)
# $d
# [1] 1.414214 1.414214
# $u
#            [,1]      [,2]
# [1,] -0.7071068 0.7071068
# [2,]  0.7071068 0.7071068
# $v
#      [,1] [,2]
# [1,]   -1    0
# [2,]    0    1
5.9 Use a computer to find the matrices U, D and V of the SVD for the following data matrix:
A = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 \\ 6 & 7 & 8 & 9 & 10 \\ 11 & 12 & 13 & 14 & 15 \end{pmatrix}. (5.103)
5.10 Use the first singular vectors from the result of Exercise 5.9 to approximately reconstruct
the data matrix A using
B = d_1 u_1 v_1^t . (5.104)
Describe the goodness of the approximation using text, limited to 20 to 100 words.
5.11 For the following data matrix
A = \begin{pmatrix} 1.2 & -0.5 & 0.9 & -0.6 \\ 1.0 & -0.7 & -0.4 & 0.9 \\ -0.2 & 1.1 & 1.6 & -0.4 \end{pmatrix}, (5.105)
(a) Use a computer to find the eigenvectors and eigenvalues of matrix AAt .
(b) Use a computer to find the eigenvectors and eigenvalues of matrix At A.
(c) Use a computer to calculate SVD of A.
(d) Compare the singular vectors and singular values in Step (c) with the eigenval-
ues and eigenvectors computed in Steps (a) and (b). Use text to discuss your
comparison.
(e) Use a computer to verify that the column vectors of U and V in the SVD of Step
(c) are orthonormal to each other.
5.12 Conduct an SVD analysis of the December standardized anomalies data of sea level
pressure (SLP) at Darwin and Tahiti from 1961 to 2010. You can find the data from the
Internet or from the website of this book.
(a) Write a computer code to organize the data into a 2 × 50 space-time data matrix.
(b) Make the SVD calculation for this space-time matrix.
(c) Plot the first singular vector in V against time from 1961 to 2010 as a time series
curve, which is called the first principal component, denoted by PC1.
(d) Interpret the first singular vector in U, which is called the first empirical orthogonal
function (EOF1), as weights of Darwin and Tahiti stations.
(e) Check the historical El Niño events between 1961 and 2010 from the Internet, and
interpret the extreme values of PC1.
5.13 Plot PC2 against time from the SVD analysis in the previous problem. Discuss the
singular values λ1 and λ2 . Interpret PC2 in reference to Trenberth (2020).
5.14 Conduct an SVD analysis similar to the previous two problems for the January
standardized SLP anomalies data at Darwin and Tahiti from 1961 to 2010.
5.15 Conduct an SVD analysis similar to the previous problem for the monthly standardized
SLP anomalies data at Darwin and Tahiti from 1961 to 2010. This problem includes
anomalies for every month. The space-time data is a 2 × 600 matrix.
5.16 For the observed data of the monthly surface air temperature at five stations of your
interest from January 1961 to December 2010, form a space-time data matrix, compute
the SVD of this matrix, and interpret your results from the perspective of climate science.
Plot PC1, PC2, and PC3 against time. You may find your data from the internet, e.g., the
NOAA Climate Data Online website www.ncdc.noaa.gov/cdo-web
5.17 Do the same analysis as the previous problem, but for the monthly precipitation data at
the same stations in the same time period.
5.18 For a Reanalysis dataset, conduct an SVD analysis similar to the previous problem for
the monthly surface temperature data over 10 grid boxes of your choice in the time
period of 1961–2010. The space-time data is a 10 × 600 matrix. You can use your pre-
ferred Reanalysis dataset and download the data from the Internet, e.g., NCEP/NCAR
Reanalysis, and ECMWF ERA.
5.19 Let C = AA^t, where A is any real-valued rectangular matrix. Show that
(a) the eigenvalues of C are nonnegative;
(b) if u is an eigenvector of C and v = A^t u, then v is an eigenvector of C′ = A^t A.
5.20 Given that
A = \begin{pmatrix} 2 & -1 \\ -1 & 3 \end{pmatrix}, \quad x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, (5.106)
write down the second-order polynomial corresponding to the following matrix expres-
sion:
P(x_1, x_2) = x^t A A^t x. (5.107)
This is known as the quadratic form of the matrix AAt .
5.21 If x is a unit vector, use calculus to find the maximum value of P(x1 , x2 ) in the previous
problem. How is your solution related to the eigenvalue of AAt ?
6 Covariance Matrices, EOFs, and PCs
Covariance of climate data at two spatial locations is a scalar. The covariances of climate
data at many locations form a matrix. The eigenvectors of a covariance matrix are defined
as empirical orthogonal functions (EOFs), which vary in space. The orthonormal projec-
tions of the climate data on EOFs yield principal components (PCs), which vary in time.
These EOFs and PCs are commonly used in climate data analysis. Physically they may
be interpreted as the spatial and temporal patterns or dynamics of a climate process. The
eigenvalues of the covariance matrix represent the variances of the climate field for differ-
ent EOF patterns. The EOFs defined by a covariance matrix are mathematically equivalent
to the SVD definition of EOFs, and the SVD definition is computationally more conven-
ient when the space-time data matrix is not too big. The covariance definition of EOFs
provides ways of interpreting climate dynamics, such as how variance is distributed across
the different EOF components.
This chapter describes covariance and EOFs for both climate data and stochastic cli-
mate fields. The EOF patterns can be thought of as statistical modes. The chapter not only
includes the rigorous theory of EOFs and their analytic representations, but also discusses
commonly encountered problems in EOF calculations and applications, such as area-factor,
time-factor, and sampling errors of eigenvalues and EOFs. We pay particular attention to
independent samples, North’s rule of thumb, and mode mixing.
6.1 From a Space-Time Data Matrix to a Covariance Matrix

A space-time data matrix can be used to define a covariance matrix. Let X_{n×p} be a space-time
data matrix with n spatial stations or grid boxes, and p time steps:
X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} . (6.1)
Each row vector xi: is the time series data for a station i or a region or a grid box (either
2D or 3D), and each column vector x:t is the spatial map data for a given time step t.
Here, ":" means all the elements of the ith row or tth column. Each pair of stations i, j has
a covariance
C_{ij} = Cov(x_{i:}, x_{j:}). (6.2)
These C_{ij}, also denoted by C(i, j), form the covariance matrix, which is clearly
symmetric.
For simplicity, suppose the sample mean of the station data is zero (averaging through
time for each station):
x̄_i = \frac{1}{p} \sum_{t=1}^{p} x_{it} = 0. (6.3)
In other words, we are dealing with anomaly data, which means the station means have
been removed. This last step simplifies the notation. We have the sample covariance
S_{ij} = \frac{1}{p} \sum_{t=1}^{p} x_{it} x_{jt} = \sum_{t=1}^{p} \frac{x_{it}}{\sqrt{p}} \frac{x_{jt}}{\sqrt{p}} . (6.4)
Here 1/\sqrt{p} is called the time-factor of climate data in the time interval of length p.
Similarly, the space-factor of climate data can be introduced, since each station or grid
box may represent different sizes of areas in the case of a surface domain, volumes in the
case of a 3D domain, and lengths in the case of a line segment domain. If the ith grid box has
an area A_i, then the area-factor for this box is \sqrt{A_i}. The climate data x_{it} with the area-factor
are then
\sqrt{A_i}\, x_{it} . (6.5)
The climate data matrix with both space- and time-factors can be defined as
X_f = \left[ \sqrt{\frac{A_i}{p}}\, x_{it} \right], (6.6)
where
\sqrt{\frac{A_i}{p}} (6.7)
are the space-time factors. As an example, for a 5° × 5° grid and 50 years of data, for the
analysis of annual data we have the space-time factor as
\sqrt{\frac{A_i}{p}} = C \sqrt{\cos(φ_i)}, (6.8)
where φ_i is the latitude (in radians, not in degrees) of the centroid of the grid box i, and C
is a constant, approximated by
C ≈ \frac{5 \times 6400}{180} \times \frac{1}{\sqrt{50}} \; [\mathrm{km\ yr^{-1/2}}]. (6.9)
Here, the Earth is approximately regarded as a sphere with its radius equal to 6,400 km.
Practical climate data analyses often drop the dimensional constant and use only the
dimensionless scaling factors w_i = \sqrt{\cos(φ_i)}, since a constant multiplying a covariance
matrix does not change the eigenvectors and other useful properties of the matrix.
Therefore, climate data analyses often use the dimensionless space-time scaling factors
instead of the dimensional space-time factors:
X_f = [w_i x_{it}] . (6.10)
Finally, the sample covariance matrix can be written as
S = X_f X_f^t . (6.11)
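The following small R sketch (synthetic data, not the book's code) shows how Eqs. (6.10) and (6.11) are used in practice; the 1/\sqrt{p} time-factor is omitted, as discussed above.

set.seed(1)
lat = c(-2.5, 2.5, 7.5)             # latitudes of 3 grid boxes [deg]
X = matrix(rnorm(3*10), nrow = 3)   # anomalies: 3 boxes x 10 time steps
w = sqrt(cos(lat*pi/180))           # dimensionless area-factors w_i
Xf = w*X                            # scale row i of X by w_i
S = Xf %*% t(Xf)                    # 3 x 3 sample covariance matrix
S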
Example 6.1 Compute and visualize the sample covariance matrix with the space-time
factor for the December SAT anomalies using the NOAAGlobalTemp data
(anomalies with respect to the 1971–2000 monthly climatology) over a zonal band at 2.5°S
from 0°E to 360° (Figure 6.1(a)) and a meridional band at 2.5°E from 90°S to 90°N
(Figure 6.1(b)). The monthly NOAAGlobalTemp data are the anomalies on a 5° × 5° grid
from −87.5° to 87.5° in latitude and from 2.5° to 357.5° in longitude.
The zonal band under our consideration consists of 72 grid boxes of 5°. Thus, the
sample covariance matrix is a 72 × 72 matrix, as shown in Figure 6.1(a).
The meridional band has only 36 grid boxes. Thus, the sample covariance matrix is
a 36 × 36 matrix as shown in Figure 6.1(b). The white region of high latitude indicates
missing data.
Figures 6.1(a) and (b) show not only the symmetry around the upward diagonal line, but
also the spatial variation of elements of covariance matrices. The figure allows us to iden-
tify the locations of large covariance. For example, the red region of Figure 6.1(a) shows the
large covariance over the eastern Tropical Pacific with longitude approximately between
180◦ and 280◦ . Figure 6.1(c) shows the December 1997 warm El Niño SAT anomalies
over the eastern Tropical Pacific. The covariance matrices of Figures 6.1(a) and (b) were
computed from 50 years of December SAT anomaly data from 1961 to 2010, with SAT of
December 1997 being one of the 50 time series data samples.
Figure 6.1 The December SAT covariance matrix computed from the NOAAGlobalTemp data from 1961 to 2010: (a) the covariance of the zonal band at 2.5°S, and (b) the covariance of the meridional band at 2.5°E. Panel (c) shows the positions of the zonal band and meridional band indicated by the thick dashed black lines. Data source: Smith et al. 2008; Vose et al. 2012. https://psl.noaa.gov/data/gridded/data.noaaglobaltemp.html
# R code: read the NOAAGlobalTemp gridded data and plot Fig. 6.1(c)
# The opening lines below restore the setup assumed by this fragment;
# the file name and the packages are assumptions, and the variable names
# follow the conventions used later in this code.
library(ncdf4)
library(chron)   # for month.day.year()
nc = nc_open("data/air.mon.anom.nc")   # assumed NOAAGlobalTemp file name
lon <- ncvar_get(nc, "lon")
lat <- ncvar_get(nc, "lat")
time <- ncvar_get(nc, "time")
# The dataset begins 1880-01-01
sat <- ncvar_get(nc, "air")
dim(sat)
# [1] 72 36 1674
# 1674 months = 1880-01 to 2019-06, 139 years 6 mons
Dec = seq(12, 1674, by = 12)
Decsat = sat[, , Dec]
N = 72*36
P = length(Dec)
STsat = matrix(0, nrow = N, ncol = P)
for (k in 1:P) {STsat[, k] = as.vector(Decsat[, , k])}

# Plot Fig. 6.1(c)
month.day.year(time[1416], c(month = 1, day = 1, year = 1800))
# Dec 1997
mapmat = sat[, , 1416]
mapmat = pmax(pmin(mapmat, 5), -5)
int = seq(-5, 5, length.out = 51)
rgb.palette = colorRampPalette(c('black', 'blue',
    'darkgreen', 'green', 'white', 'yellow', 'pink',
    'red', 'maroon'), interpolate = 'spline')
setEPS()   # plot the figure and save the file
postscript("fig0601c.eps", width = 7, height = 3.5)
par(mar = c(4.2, 5.0, 1.8, 0.0))
par(cex.axis = 0.9, cex.lab = 0.9, cex.main = 0.8)
library(maps)
filled.contour(lon, lat, mapmat,
    color.palette = rgb.palette, levels = int,
    plot.title = title(main = "December 1997 SAT Anomalies",
        xlab = "Longitude", ylab = "Latitude"),
    plot.axes = {axis(1); axis(2);
        map('world2', add = TRUE); grid()},
    key.title = title(main = expression(degree*"C")))
segments(x0 = -20, y0 = -2.5, x1 = 255, y1 = -2.5, lwd = 5, lty = 2)
6.2.1 Defining EOFs and PCs from the Sample Covariance Matrix
In addition to the SVD definition of EOFs and PCs, we can also define EOFs as the
eigenvectors uk of the sample spatial covariance matrix S:
Suk = λk uk , k = 1, 2, . . . , s. (6.12)
Here, λk are eigenvalues and s is the rank of S.
We have seen in the previous chapter that the eigenvectors of a symmetric matrix are
orthogonal to one another if eigenvalues are distinct. Sometimes an eigenvalue is a repeated
root from the eigen-equation (6.12). This same eigenvalue corresponds to several eigen-
vectors, referred to as a degenerate multiplet of eigenvectors, which form a subspace of
dimension more than one. Any linear combination of the eigenvectors in this subspace is
also an eigenvector, i.e., any vector in this subspace is an eigenvector. Thus, the eigenvec-
tor’s orientation in this subspace is arbitrary. Usually one chooses a set of unit orthogonal
eigenvectors that span the subspace.
The EOFs are conventionally normalized into unit vectors, i.e., orthonormal vectors with
u_k · u_l = δ_{kl} , (6.13)
where
δ_{kl} = \begin{cases} 1, & \text{if } k = l, \\ 0, & \text{if } k \neq l \end{cases} (6.14)
is the Kronecker delta.
The eigenvalues λ_k are nonnegative, which can be shown as follows. Substituting the
sample covariance matrix S = X_f X_f^t of the data matrix X_f with space-time factor into the
EOF definition equation (6.12) yields
X X^t u_k = λ_k u_k . (6.15)
Here, for simplicity, we have dropped the subscript f for the space-time factor and used
X to represent the data matrix X_f with space-time factor. Multiplying this equation by the
row vector u_k^t leads to
λ_k = |X^t u_k|^2 ≥ 0, (6.16)
since
u_k^t u_k = 1. (6.17)
The expression X^t u_k can be regarded as the projection of the space-time data onto the EOF
u_k. Since X is anomaly data, |X^t u_k|^2 may be regarded as the variance associated with the
eigenvector u_k. Thus, Equation (6.16)
implies that the eigenvalue λk is equal to the variance explained by the corresponding kth
EOF mode.
When λ_k > 0, the temporal vector
v_k = \frac{1}{\sqrt{λ_k}} X^t u_k , \quad k = 1, 2, . . . , s, (6.18)
is the kth principal component (PC) of the space-time data. The PCs defined this way are
also orthonormal:
v_k^t v_l = \left( \frac{1}{\sqrt{λ_k}} X^t u_k \right)^t \left( \frac{1}{\sqrt{λ_l}} X^t u_l \right)
        = \frac{1}{\sqrt{λ_k}\sqrt{λ_l}}\, u_k^t X X^t u_l
        = \frac{1}{\sqrt{λ_k}\sqrt{λ_l}}\, u_k^t λ_l u_l
        = δ_{kl} . (6.19)
From the perspective of the space-time decomposition of the data matrix X, the principal
component matrix may be regarded as the temporal coefficients of the spatial EOFs. More-
over, principal components are orthogonal with respect to time. This may be interpreted to
mean that (i) individual terms making up the principal component time series are statisti-
cally independent from each other, and (ii) the variance associated with the time series of
EOF k is equal to the eigenvalue λk .
The EOFs and PCs are exactly the column vectors of the spatial eigenvector matrix U
and the temporal eigenvector matrix V of the SVD in the previous chapter. The space-
time data matrix X can be exactly represented by all the nonzero eigenvalues and their
corresponding EOFs and PCs:
X = \sum_k \sqrt{λ_k}\, u_k v_k^t . (6.20)
The sample covariance matrix itself has the SVD S = UΛU^t, where Λ is the square diagonal matrix [λ_k δ_{kl}]. Be careful that this SVD is not a space-time
decomposition since the matrix S is the spatial sample covariance matrix whose rows and
columns are both for space. Thus, this is a space-space decomposition. Similarly, you can
construct a temporal sample covariance matrix, and make a time-time SVD for this matrix.
This idea can help with the SVD computation when the space-time data matrix is extremely
large, e.g., more than 200 GB, and the time dimension is relatively small, e.g., 500.
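A small R sketch (synthetic data, illustrative only) verifies that the eigen-decomposition of the spatial covariance matrix and the SVD of the data matrix give the same eigenvalues and the same EOFs (up to sign):

set.seed(2)
X = matrix(rnorm(4*6), nrow = 4)  # 4 spatial points x 6 time steps of anomalies
S = X %*% t(X)                    # spatial sample covariance matrix
eigS = eigen(S)
svdX = svd(X)
eigS$values                       # eigenvalues lambda_k
svdX$d^2                          # squared singular values: the same numbers
cbind(eigS$vectors[, 1], svdX$u[, 1])  # EOF1 from both methods (sign may differ)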
Figure 6.2 A scree plot for the eigenvalues of the covariance matrix shown in Figure 6.1(a) for the SAT anomalies over an equatorial zonal band at 2.5°S.
The total amount of variance explained by the first K EOF modes, denoted by Q_K, is
Q_K = \sum_{k=1}^{K} q_k = \frac{\sum_{k=1}^{K} λ_k}{\sum_{k=1}^{N} λ_k} \times 100\%, \quad K = 1, 2, . . . , N. (6.23)
Figure 6.2 shows a scree plot for the NOAAGlobalTemp December SAT anomalies in
1961–2010 over an equatorial zonal band at 2.5°S. The first EOF mode explains 74%
of the total variance. This implies the importance of the large-scale tropical dynamics. As the
latitude of the zonal band becomes higher, the first EOF mode explains less variance. For
example, for the zonal band at 47.5°N, the first mode explains only 30% of the total variance,
while the second one explains 24%.
Depending on the practical climate problems, you may choose to include the first K
modes in your analysis so that up to 80% or 90% variance is explained.
Figure 6.2 can be plotted by the following computer code.
# R plot Fig. 6.2: Scree plot
K = 10
eigCov = eigen(covBand)
# covBand is for the equatorial zonal band in Fig. 6.1(a)
lam = eigCov$values
lamK = lam[1:K]
setEPS()   # plot the figure and save the file
postscript("fig0602.eps", width = 6, height = 4)
par(mar = c(4, 4, 2, 4), mgp = c(2.2, 0.7, 0))
plot(1:K, 100*lamK/sum(lam), ylim = c(0, 80), type = "o",
     ylab = "Percentage of Variance [%]",
     xlab = "EOF Mode Number",
     cex.lab = 1.2, cex.axis = 1.1, lwd = 2,
     main = "Scree Plot of the First 10 Eigenvalues")
legend(3, 30, col = c("black"), lty = 1, lwd = 2.0,
       legend = c("Percentage Variance"), bty = "n",
       text.font = 2, cex = 1.0, text.col = "black")
par(new = TRUE)
# Python code fragment (continuation of the scree-plot script; K, lam, lamK,
# and the axis object ax are defined in the preceding, not-shown lines)
ax1 = ax.twinx()
ax1.plot(np.linspace(1, K, K), np.cumsum(100*lamK/np.sum(lam)),
         'bo-', linewidth = 3)
ax1.set_ylabel("Cumulative Variance [%]", color = 'blue', labelpad = 10)
ax1.tick_params(length = 6, width = 2,
                labelsize = 21, color = 'b', labelcolor = 'b')
ax1.set_yticks([60, 70, 80, 90, 100])
ax1.spines['right'].set_color('b')
When the number of spatial points N is much larger than the number of time steps P, one
may exchange the roles of space and time to make a smaller P × P temporal sample covariance
matrix, which has the same eigenvalues as the spatial sample covariance matrix, and whose
eigenvectors are the PCs. Usually, we prefer to deal with a smaller matrix.
The P × P temporal sample covariance matrix T defined for the N × P anomaly data
matrix with scaling factors X_f is as follows:
T = X^t X. (6.24)
Again, we have dropped the subscript f in this formula for simplicity. This temporal sample
covariance matrix can be diagonalized by the PCs v_k:
T = V Λ V^t . (6.25)
Mathematically, this is the same as the spatial covariance matrix diagonalized by EOFs uk .
Representing the time-space data matrix X^t by a set of independent PC vectors also leads
to the SVD of X^t:
X^t = \sum_k \sqrt{λ_k}\, v_k u_k^t . (6.26)
The equivalence of the eigenvalue problems for the spatial and temporal covariance matrices
can be easily shown. Multiplying the eigenvalue problem for the spatial covariance matrix
(6.15) by the data matrix X^t leads to
X^t X X^t u_k = λ_k X^t u_k . (6.27)
Using the definition of the PCs in Eq. (6.18), this becomes
X^t X v_k = λ_k v_k , (6.28)
or
T v_k = λ_k v_k . (6.29)
Whether to use the spatial or temporal sample covariance matrix depends on the need
and the matrix size. One would often choose a smaller matrix. However, if your only pur-
pose is to compute EOFs, you may efficiently use SVD to compute the EOFs directly
for the space-time data matrix without computing either the spatial or temporal sample
covariance matrix.
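The equivalence can also be seen numerically; the following R sketch (synthetic data, illustrative only) shows that the N × N spatial and P × P temporal sample covariance matrices share the same nonzero eigenvalues:

set.seed(3)
X = matrix(rnorm(3*8), nrow = 3)  # N = 3 spatial points, P = 8 time steps
S = X %*% t(X)                    # 3 x 3 spatial covariance matrix
Tcov = t(X) %*% X                 # 8 x 8 temporal covariance matrix
round(eigen(S)$values, 4)
round(eigen(Tcov)$values[1:3], 4) # the first 3 eigenvalues match those of S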
6.3 Climate Field and Its EOFs
A climate field refers to a climate variable, such as the 2-meter surface air temperature
T (r,t), defined over a spatial region A and a time interval [0, P]. Here, r is the spatial
position vector in region A, and t is the time variable in the interval [0, P].
The SVD for a space-time data matrix
X = \sum_{m=1}^{P} \sqrt{λ_m}\, u_m (v_m)^t (6.31)
The summation goes to infinity, instead of P or N for a space-time data matrix of finite
order, because a continuous field may be considered as being composed of infinitely many
points. Here, ψ_m(r) are the continuous-field version of EOFs, known in the literature
as the Karhunen–Loève functions, and satisfy the orthonormal condition in the sense of
integration:
\int_A ψ_m(r) ψ_n(r)\, dA = δ_{mn} . (6.33)
Here ∆A_i is the area (or volume in the case of a 3D domain) of the ith subdomain or grid
box, and \sqrt{∆A_i} is the area-factor (or the volume-factor in the case of a 3D domain). The
EOFs with the area-factor are denoted by
E_m(r_i) = ψ_m(r_i) \sqrt{∆A_i} . (6.35)
The spatial projection of climate data onto EOFs also has a climate field counterpart:
T_m(t) = \frac{1}{λ_m} \int_A T(r,t) ψ_m(r)\, dA . (6.36)
T_m(t) ≈ \frac{1}{λ_m} \sum_{i=1}^{N} T(r_i,t)\sqrt{∆A_i}\; ψ_m(r_i)\sqrt{∆A_i} . (6.37)
Here, T(r_i,t)\sqrt{∆A_i} are the data with the area-factor.
A climate field T(r,t) with r ∈ A and t ∈ [0, P] is often stochastic, because a climate variable
at a point or over a grid box or a region is considered to be an ensemble mean and has
uncertainties. Thus, T(r,t) follows a probabilistic distribution. When T(r,t) is a stationary
time series, the covariance function of T(r,t) is defined as
K(r_1, r_2) = E[T(r_1,t)\, T(r_2,t)], (6.38)
which does not depend on t. In discrete form, this leads to the eigenvalue equation
\sum_{j=1}^{N} K_f(r_i, r_j) E_m(r_j) = λ'_m E_m(r_i), (6.43)
where
λ'_m = \frac{λ_m}{P} . (6.44)
This eigenvalue equation (6.43) is equivalent to Eq. (6.12).
In summary, to compute the sample covariance matrix, we compute the data matrix with
space-time factor X_f, and then use S = X_f X_f^t to compute the covariance matrix. To compute
EOFs and PCs, we directly apply SVD to X_f. The EOFs E_m(r_i) have the area-factors and
are normalized as a vector:
\sum_{i=1}^{N} (E_m(r_i))^2 = 1. (6.45)
These EOFs E_m(r_i) are called the geometric EOFs, in contrast to the physical EOFs ψ_m(r_i)
used for climate dynamical interpretations. To obtain the physical EOFs ψ_m(r_i), you just remove
the area-factor from the geometric EOFs E_m(r_i) by a division:
ψ_m(r_i) = \frac{1}{\sqrt{A_i}} E_m(r_i). (6.46)
The time-factor 1/\sqrt{P} rescales the eigenvalues (λ_m becomes λ_m/P), but does not change
the percentage of variance explained by the mth EOF mode. In data analysis, we often only
need to know the percentage of variance explained, not the eigenvalues themselves. Thus,
this justifies ignoring the time-factor when computing the covariance matrix or EOFs. For
the same reason, if all area-factors are approximately the same size, the area-factor can also
be ignored, and one can directly compute the EOFs from the space-time data.
6.4 Generating Random Fields

With prescribed EOFs and variances, one can generate a synthetic random space-time field
Y_{xr} = \sum_{m=1}^{M} \sqrt{λ_m}\, E_m(x) B_m(r). (6.47)
Here, E_m(x) are given orthonormal EOF vectors and satisfy the orthonormal condition:
\sum_{x=1}^{N} E_m(x) E_n(x) = \begin{cases} 1 & m = n \\ 0 & m \neq n \end{cases} . (6.48)
We use M EOFs with M ≤ N. The symbol λ_m represents the prescribed variance associated
with E_m(x), and B_m(r) represents PCs generated by a random generator. The PC vectors
also satisfy the orthonormal condition:
\sum_{r=1}^{M} B_m(r) B_n(r) = \begin{cases} 1 & m = n \\ 0 & m \neq n \end{cases} . (6.49)
Of course, the actual numerical results satisfy this condition only approximately because
the Bm (r) are an approximation from a random generator, such as the R command rs <-
rmvnorm(Ms, zeromean, univar), in the example of this section.
We can interpret Bm (r) as the loading of the random field along the EOF direction Em (x)
for the rth realization. The orthonormal condition here is interpreted as the independence
of PCs from each other, i.e., when PCs are treated as random variables, they strictly satisfy
the following condition:
E[Bm Bn ] = δmn . (6.50)
In this way, we can construct simulations of a random field Yxr with the known EOFs
and the prescribed spectra of eigenvalues λm , m = 1, 2, . . . , M. An example follows.
Example 6.2 Generate a random field with random and independent PCs.
Consider a random field based on the prescribed EOFs over a spatial line segment [0, π]:
ψ_m(s) = \sqrt{\frac{2}{π}} \sin(ms), \quad m = 1, 2, . . . , M, \; 0 ≤ s ≤ π. (6.51)
The prescribed eigenvalues λ_m, m = 1, 2, . . . , 5 are
10.0, 9.0, 4.1, 4.0, 2.0.
We wish to generate a random array with these given EOFs and eigenvalues.
Given positive integers M and N, discretize the line segment [0, π] by N intervals, each
of which has a length equal to π/N. The following defines M vectors in the N-dimensional
Euclidean space R^N for different m:
E_m(x_i) = \sqrt{\frac{π}{N}} \sqrt{\frac{2}{π}} \sin(m x_i), (6.52)
where
x_i = i\,\frac{π}{N}, \quad i = 1, 2, . . . , N, \; m = 1, 2, . . . , M. (6.53)
These M vectors are approximately orthonormal in R^N if N is sufficiently large. The length-factor
\sqrt{π/N} normalizes the discrete EOF E_m(x_i).
The following N × N matrix C,
C_{ij} = \sum_{m=1}^{M} λ_m E_m(x_i) E_m(x_j), \quad i, j = 1, 2, . . . , N, (6.54)
can be regarded as a synthetic covariance matrix, which has Em (xi ) as its EOFs.
Then, we can use random PCs and the given variances to generate a random field with
these specified EOFs. The following computer codes generate M random PCs with Ms
samples.
# R code for generating a random space-time field
# Step 1: Generate EOFs
N <- 100   # number of spatial points
eof <- function(n, x) (sin(n*x)/sqrt(pi/2))*sqrt(pi/N)

# Python code for generating a random space-time field
# (the import and the eof() definition below restore the setup assumed
#  by this fragment, mirroring the R function above)
import numpy as np
N = 100   # number of spatial points
def eof(n, x):
    return (np.sin(n*x)/np.sqrt(np.pi/2))*np.sqrt(np.pi/N)
x = np.linspace(0, np.pi, N)
print(sum(eof(4, x)**2))   # verify the normalization condition
# 0.99
print(sum((eof(1, x)*eof(2, x))))   # verify orthogonality
# 2.5343225781848666e-18
# Generate PCs
Mode = 5
Ms = 1000
univar = np.diag(np.repeat(1, Mode))
zeromean = np.repeat(0, Mode)
rs = np.asmatrix(np.random.multivariate_normal(zeromean,
                                               univar, Ms))
pcm = np.asmatrix(np.zeros([Ms, Mode]))
Figure 6.3 shows this random field with the first 100 realizations. For a given rth realiza-
tion, the field in [0, π] shows the approximate characteristics of EOFs. For a given x value,
the samples along r are not autocorrelated because the PCs are randomly generated. The
following computer code uses the Durbin–Watson test to verify that Yxr at x = −2.5704 is
not auto-correlated.
Figure 6.3 A realization of a random field over [0, π] with the first 100 independent samples.
# DW = 2.0344, p-value = 0.696
# implying no significant serial correlation
The same DW test procedure can also be applied to the PCs to confirm the absence of
significant autocorrelation.
Because the PCs have no autocorrelation, Figure 6.3 does not show any wave propagation.
A plot of a real climate field over a line segment and a time period is called a Hovmöller
diagram (see Fig. 1.10) and usually shows patterns of wave propagation or other coherent
structures, due to serial correlations. The space-time data are not independent in time. For
a given spatial point, the temporal samples (i.e., a realization of a time series) are
autocorrelated. The effective degrees of freedom are fewer than the number of temporal
samples.
# Python plot of Fig. 6.3 (continuation; Yxr, r, x, clev2, and myColMap
# are defined in the preceding, not-shown lines of this script)
contf = plt.contourf(r, x, Yxr[:, 1:100], clev2, cmap = myColMap)
colbar = plt.colorbar(contf, drawedges = False,
                      ticks = [-0.06, -0.04, -0.02, 0, 0.02, 0.04])
plt.title("A random field realization from given EOFs", pad = 20)
plt.xlabel("r", labelpad = 20)
plt.ylabel("x", labelpad = 20)
plt.text(104, 3.2, "Value", size = 17)
plt.savefig("fig0603.eps")   # save figure
6.5 Sampling Errors for EOFs

A stationary climate anomaly field T(r,t) over region A with N grid boxes or subregions
has a covariance matrix, which is defined as an ensemble mean:
Σ_{ij} = E[T(r_i,t)\, T(r_j,t)]. (6.55)
Thus defined, Σ_{ij} may be regarded as the "true" covariance matrix. The exact "true" covariance
matrix can never be known, but it can be approximated by a mathematical model. In
practice, the approximation is a sample covariance matrix Σ̂_{ij} computed from the sample
data as demonstrated earlier. The covariance Σ_{ij} does not depend on time t because of the
stationarity assumption (see Eq. (6.38)).
Then, our question becomes: what is the error when using the sample covariance matrix
to approximate the “true” covariance matrix? Specifically, what are the corresponding
errors of EOFs and their corresponding eigenvalues? This section provides partial answers
to these questions in both mathematical theory and numerical examples. Our common
sense is that the larger the size R of the independent realizations, the better the approxima-
tion. When the sample size goes to infinity, the sample covariance by definition becomes
the "true" covariance matrix. The ideas and methods of this section follow those of North
et al. (1982).
This subsection reviews the concepts of sample mean, sample variance, and their standard
errors. Then we extend these concepts to the sample eigenvalues and sample EOFs.
Let x_1, x_2, . . . , x_R be the R independent samples of a normally distributed random variable
X with mean µ and standard deviation σ. Then, the sample mean
x̄ = \frac{\sum_{i=1}^{R} x_i}{R} (6.56)
is an unbiased estimator of the true population mean:
E[x̄] = µ. (6.57)
This is because
E[x̄] = \frac{1}{R} \sum_{i=1}^{R} E[x_i] = \frac{Rµ}{R} = µ.
The sample variance
s^2 = \frac{1}{R-1} \sum_{i=1}^{R} (x_i - x̄)^2
estimates the population variance σ^2, and the scaled quantity
\frac{(R-1)s^2}{σ^2} \sim χ^2_{R-1} (6.63)
follows a chi-square distribution with R − 1 degrees of freedom and has a variance equal
to 2(R − 1):
\mathrm{Var}\!\left[\frac{(R-1)s^2}{σ^2}\right] = 2(R-1), (6.64)
or
\mathrm{Var}[s^2] = \frac{2σ^4}{R-1} . (6.65)
This subsection estimates the errors of eigenvectors for given eigenvalues and sample size.
Again, we use the asymptotic approximation method with a "small" parameter ε (Anderson
1963):
ε = \sqrt{\frac{2}{R}} . (6.72)
The eigenvalue problem for the true covariance matrix for the ℓth mode is
Σ u_ℓ = λ_ℓ u_ℓ , (6.73)
and that for the sample covariance matrix is
Σ̂ û_ℓ = λ̂_ℓ û_ℓ . (6.74)
The asymptotic method, also called the perturbation method, is to write λ̂_ℓ and û_ℓ as an
expansion according to powers of ε:
λ̂_ℓ = λ_ℓ + ε λ_ℓ^{(1)} + ε^2 λ_ℓ^{(2)} + · · · , (6.75)
û_ℓ = u_ℓ + ε u_ℓ^{(1)} + ε^2 u_ℓ^{(2)} + · · · . (6.76)
After steps of derivation, the first-order errors of the eigenvalues and eigenvectors become
δλ_ℓ = ε λ_ℓ^{(1)} ≈ λ_ℓ \sqrt{\frac{2}{R}} . (6.77)
The corresponding approximation for the eigenvector error is
δu_ℓ = ε u_ℓ^{(1)} ≈ \frac{δλ_ℓ}{∆λ_ℓ}\, u_{ℓ'} , (6.78)
where
∆λ_ℓ = λ_ℓ − λ_{ℓ'} (6.79)
is the eigenvalue gap, and λ_{ℓ'} is the eigenvalue closest to λ_ℓ. Because λ_ℓ is a non-increasing
sequence, either ℓ' = ℓ + 1 or ℓ' = ℓ − 1. In the derivation of Eq. (6.78), we have ignored
all the terms
\frac{λ_k}{λ_ℓ − λ_k}
except k = ℓ'.
The convenient estimations of Eqs. (6.77) and (6.78) are known as North’s rule of thumb
for variance and EOFs, following the original paper North et al. (1982). The rule basically
says that when two eigenvalues are close to each other, the corresponding EOF can have
a large error, amplified by the inverse of the small gap between two eigenvalues, but the
error can be reduced by a sufficiently large number of independent samples. Further, the
sampling error vector of an EOF u` is approximately a scalar multiplication of its neighbor
EOF u`0 .
In practice, many estimates of EOFs are taken from time series wherein there is corre-
lation of nearby temporal terms, known as serial correlation. This will tend to reduce the
number of independent degrees of freedom in the time segment. The serial correlation relevant here is that of the time series associated with a particular EOF; typically, long correlation times correspond to large-scale spatial EOF modes. As in regression problems, accounting for the serial correlation helps correct an erroneously small confidence interval estimate.
Two synthetic examples will be presented in the next two subsections to illustrate the
EOF errors.
The true field of nature is rarely known and can only be estimated from data. We thus use synthetic data to illustrate the errors of sample EOFs when two eigenvalues are close to each other. Let the eigenvalues λ_k be
10.0, 9.0, 4.1, 4.0, 2.0.
So, λ1 and λ2 are close, and λ3 and λ4 are close. We consider two sample sizes: R = 100,
and R = 1, 000. The scree plot and the associated bars of standard errors are shown in
Figure 6.4. The figure shows that with 100 samples, λ1 and λ2 are not well separated, and
neither are λ3 and λ4 . With 1,000 samples, λ1 and λ2 are well separated, but λ3 and λ4 are
still not.
Figure 6.4 A scree plot for the specified eigenvalues and the bars of standard errors $\lambda_\ell \pm \lambda_\ell\sqrt{2/R}$.
The scree plot Figure 6.4 can be generated by the following computer code.
for i in range(5):
    values = lam[i] - sd2[i]
    minus2.append(values)
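A minimal R sketch that produces a comparable scree plot with the rule-of-thumb error bars for R = 100 and R = 1,000 (the variable names are illustrative and not necessarily those of the book's full code) is as follows.

# Minimal sketch: scree plot with error bars lambda*sqrt(2/R)
lam = c(10.0, 9.0, 4.1, 4.0, 2.0)
err100 = lam*sqrt(2/100)     # error bars for R = 100
err1000 = lam*sqrt(2/1000)   # error bars for R = 1,000
plot(1:5, lam, type = 'o', pch = 16, ylim = c(0, 12),
     xlab = 'Mode', ylab = expression('Variance ' * lambda))
arrows(1:5, lam - err100, 1:5, lam + err100,
       angle = 90, code = 3, length = 0.1)                # R = 100
arrows(1:5 + 0.1, lam - err1000, 1:5 + 0.1, lam + err1000,
       angle = 90, code = 3, length = 0.1, col = 'blue')  # R = 1,000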
The exact orthonormal EOF modes are defined in the spatial domain [0, π]:
$$E_k(i) = \sqrt{\frac{2}{N}}\,\sin(k x_i), \qquad (6.80)$$
where x_i = (i/N)π, i = 1, 2, . . . , N are the N sampling locations in the spatial domain [0, π].
We use random generators to generate a random sequence of length R as PCs. The PCs,
eigenvalues and EOFs form an N × R random matrix, which is considered R independent
samples of a “true” field in the spatial domain [0, π]. We then use SVD for the data matrix
to compute sample EOFs from these generated sample data and compare the sample EOFs
with the exact EOFs to show the EOF errors. The main feature of the EOF errors is the
mode mixing: the two EOFs with the closest eigenvalues are mixed with each other and are not clearly separated when the sample size is insufficient.

Figure 6.5 True EOFs and their sampling counterparts: the red curves are the true EOFs, and the blue curves are the sample EOFs. The left column shows the first five EOFs with sample size R = 100, and the right column is for R = 1,000.
The upper left panel of Figure 6.5 can be generated by the following computer code. The
other panels of Figure 6.5 may be generated by entering corresponding parameters in the
code. Because this is a simulation of a random field, your result may not be the same as
what is shown here. You may repeat your numerical experiment many times and observe
the result variations.
# Generate PCs
library(mvtnorm)   # needed for rmvnorm()
M = 100            # sample size (100 samples, per dim(rs) below)
Mode = 5           # number of EOF modes
lam = c(10.0, 9.0, 4.1, 4.0, 2.0)
round(sqrt(2/M)*lam, digits = 2)
#[1] 1.41 1.27 0.58 0.57 0.28
univar = diag(rep(1, Mode))   # SD = 1
zeromean = rep(0, Mode)       # mean = 0
rs <- rmvnorm(M, zeromean, univar)
dim(rs)   # 100 samples and 5 modes
#[1] 100 5
mean(rs[, 1])
#[1] -0.02524037
var(rs[, 1])
#[1] 0.9108917
t <- seq(0, 2*pi, len = M)
a51 <- rs[, 1]*sqrt(1/M)
sum(a51^2)
#[1] 0.9026492 is the variance approximation
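The code above generates only the random PCs. The following minimal R sketch completes the remaining steps under stated assumptions (N = 100 spatial grid points on [0, π]; the function name eof() and the object name svdda are chosen to match their use in the code for Figure 6.6 below): it builds the synthetic data matrix from the exact EOFs of Eq. (6.80), recovers the sample EOFs by SVD, and plots the sample EOF1 against the true EOF1.

# Minimal sketch (assumed setup): 1D synthetic field, sample EOFs by SVD
library(mvtnorm)
N = 100; M = 100; Mode = 5
lam = c(10.0, 9.0, 4.1, 4.0, 2.0)
x = (1:N)*pi/N                            # spatial grid on (0, pi]
eof = function(k, x) sqrt(2/N)*sin(k*x)   # exact EOFs, Eq. (6.80)
E = sapply(1:Mode, function(k) eof(k, x)) # N x Mode matrix of true EOFs
set.seed(102)
pcm = rmvnorm(M, rep(0, Mode), diag(Mode))*sqrt(1/M)  # normalized PCs
da = E %*% diag(sqrt(lam)) %*% t(pcm)     # N x M synthetic anomaly data
svdda = svd(da)                           # sample EOFs are columns of svdda$u
plot(x, svdda$u[, 1], type = 'l', col = 'blue',
     xlab = 'x', ylab = 'EOFs [dimensionless]',
     main = 'Sample EOF1 (blue) vs true EOF1 (red)')
lines(x, eof(1, x), col = 'red')
# note: the sign of an SVD eigenvector is arbitrary; flip it if needed

Changing M to 1,000 yields sample EOFs analogous to those in the right column of Figure 6.5.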
The left column of Figure 6.5 shows the true EOFs (red) and the sampling EOFs (blue)
with sample size equal to R = 100. The right column is for R = 1, 000. The scree plot shows
that λ1 = 10.0 and λ2 = 9.0 are well separated when R = 1, 000, but not when R = 100.
Thus, EOF1 and EOF2 are well separated for R = 1, 000 and have little sampling error.
The sample EOF1 and EOF2 are close to the true EOF1 and EOF2, as shown in the first
two panels of the right column. In contrast, EOF1 and EOF2 are mixed for R = 100 as
shown in the first two panels of the left column. They have relatively large sampling errors.
The sample EOF1 has two local maxima, but the true EOF1 has only one maximum. The
sample EOF2 has only one clear local minimum and has an unclear local maximum, while
the true EOF2 clearly has one local maximum and one local minimum. Thus, the sample
EOF1 and EOF2 are mixed.
The scree plot Figure 6.4 shows that eigenvalues λ3 = 4.1 and λ4 = 4.0 are not well
separated even for R = 1, 000. Thus, EOF3 and EOF4 are mixed for both R = 100 and
R = 1, 000 and are expected to have large sampling errors. The sample EOF3 patterns in
Figure 6.5 show four local extrema, while true EOF3 has only three extrema. The sample
EOF4 patterns have three extrema, while the true EOF4 has four extrema. Thus, the sample
EOF3 and EOF4 are mixed.
The scree plot Figure 6.4 also shows that λ5 = 2.0 is well separated from the other
four eigenvalues for both 100 and 1,000 samples. The sample EOF5 is very close to true
EOF5 as shown in Figure 6.5 for both 100 and 1,000 samples. EOF5 has a more complex
pattern than the first four EOFs, but a smaller sampling error for a given sample size. This
seemingly counterintuitive result implies the seriousness of the mode mixing problem and
the high level of difficulty of obtaining correct EOFs when two or more eigenvalues are
close to each other.
Thus, if an eigenvalue is well separated from other eigenvalues, the corresponding EOFs
have a small sampling error even with a small sample size. The complexity of the EOF
patterns is less important than the separation of the eigenvalues. When two neighboring
EOFs are mixed, the sampling error for each EOF can be very large. The error size can be
comparable to its neighbor EOF, according to North’s rule-of-thumb formula Eq. (6.78).
For example, the sampling error of EOF4 is comparable to that of EOF3.
Figure 6.6 shows a simulation example for the error of EOF4 δ u4 , which has a shape
and magnitude comparable to the neighboring EOF3 u3 for R = 100. Figure 6.6 may be
generated by the following computer code. Again, because this is a simulation of a random
field, the result of your first try may not be the same as what is shown here. You can repeat
your experiment many times and observe the result variations.
# R plot Fig. 6.6: EOF4 error
k = 4   # Use the SVD result from 100 samples R = 100
plot(x, svdda$u[, k], type = "l", ylim = c(-0.33, 0.33),
     xlab = "x", ylab = "EOFs [dimensionless]",
     main = "EOF4 error (black) vs Exact EOF3 (Orange)",
     col = "blue",
     cex.lab = 1.4, cex.axis = 1.4, cex.main = 1.4)
legend(0.2, 0.37, bty = "n", cex = 1.4, text.col = 'blue',
       lwd = 1.2, legend = "Sample EOF4", col = "blue")
lines(x, eof(k, x), col = 'red')
legend(0.2, 0.41, bty = "n", cex = 1.4, text.col = 'red',
       lwd = 1.2, legend = "True EOF4", col = "red")
lines(x, svdda$u[, k] - eof(k, x), lwd = 2, col = 'black')
legend(0.2, 0.33, bty = "n", cex = 1.4, text.col = 'black',
       lwd = 2, legend = "Sample EOF4 - True EOF4", col = "black")
lines(x, -eof(3, x), col = 'orange')
Figure 6.6 Sampling errors of EOF4 for R = 100: the red curve is the true EOF4, the blue curve is the sample EOF4, the black curve is the difference of the sample EOF4 minus the true EOF4, and the orange curve is the true EOF3.
The thick black line in Figure 6.6 is the difference of the sample EOF4 minus the true EOF4 for R = 100, which has a shape and size similar to EOF3, shown by the orange curve. Note that the sample EOF4 in Figure 6.6 is different from that in Figure 6.5, because the random fields are generated by the random PCs and are different for each generation. Therefore, although in general the sample EOFs have large errors when two neighboring eigenvalues are not well separated, there can be some individual cases that still produce accurate sample EOFs with not well-separated eigenvalues. However, the latter is
rare. For these reasons, when two eigenvalues are close to each other, we need to carefully
examine whether sample EOFs can represent the behavior of a real physical field.
In summary, when two neighboring eigenvalues are well separated, it is certain that
the sampling error is relatively small, as shown in Figure 6.5 for EOF5. Almost every
simulation yields an accurate sampling EOF5. In contrast, uncertainties, mode mixing,
and large errors likely exist among the corresponding sample EOFs when the neighboring
eigenvalues are not well separated. Keep in mind that the individual random simulation
results may have large differences from one another.
One can also test the errors of the sample eigenvalues. The sample eigenvalues are as follows for 100 and 1,000 samples:

R = 100:    10.8149   8.3288   4.3849   3.4097   1.8479
R = 1,000:   9.9948   8.8308   4.1748   3.8353   1.9690
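With the scaling used in the 1D sketch above, the sample eigenvalues are simply the squared singular values of the synthetic data matrix, so they can be checked directly (the values differ from one simulation to another):

# Sample eigenvalues from the SVD of the synthetic data matrix da
round(svd(da)$d[1:5]^2, 4)   # compare with 10.0, 9.0, 4.1, 4.0, 2.0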
We consider the following given eigenvalues and EOFs in a 2D square [0, π] × [0, π]: λ = 10.0, 9.0, 4.1, 4.0, and 2.0, and
$$\psi_k(x, y) = \frac{2}{\pi}\,\sin(kx)\sin(ky). \qquad (6.82)$$
On the N × N grid, the discretized EOFs are
$$E_k(i, j) = \frac{2}{N}\,\sin(k x_i)\sin(k y_j), \qquad (6.83)$$
where
$$x_i = i\pi/N, \quad y_j = j\pi/N, \quad i, j = 1, 2, \ldots, N \qquad (6.84)$$
are the coordinates of the N × N sample grid points over the 2D square domain.
functions Ek (i, j) are considered to be the true EOFs and are shown in the left column of
Figure 6.7.
We then generate a random field using R = 100 samples based on these given eigen-
values and EOFs, and perform the SVD on the generated field to recover the EOFs. These
are the sample EOFs, which are shown in the middle column of Figure 6.7. Compared with
the left column of the true EOFs, the middle column displays the obvious distortion of the
spatial patterns for the first four EOFs, but not for EOF5 whose eigenvalue λ5 = 2 is well
separated from other eigenvalues.
Figure 6.7 True EOFs and their sampling counterparts: the left column shows the true EOFs, the middle column shows the sample EOFs with sample size 100, and the right column shows the sample EOFs with sample size 1,000.
For sample size R = 1, 000, the distortion of EOF1 and EOF2 is less than that for R =
100, but is still obvious compared with the sample EOF5 that has little distortion. Even
with the increased sample size, EOF3 is still severely distorted, and so is EOF4.
Thus, for the 2D case, we have the same conclusion about the EOF sampling errors:
When eigenvalues are close to one another, the sampling error is large. Increasing the
sample size may not effectively reduce the error.
The computer code for generating Figure 6.7 is as follows.
# Generate PCs
library(mvtnorm)   # needed for rmvnorm()
M = 100            # sample size (per dim(rs) below)
Mode = 5           # number of EOF modes
lam = c(10.0, 9.0, 4.1, 4.0, 2.0)
univar = diag(rep(1, Mode))
zeromean = rep(0, Mode)
rs <- rmvnorm(M, zeromean, univar)
dim(rs)
#[1] 100 5
t <- seq(0, len = M)
a51 <- rs[, 1]*sqrt(1/M)
pcm = matrix(0, nrow = M, ncol = Mode)
for (m in 1:Mode){ pcm[, m] = rs[, m]*sqrt(1/M) }
dim(pcm)
#[1] 100 5
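The code above again generates only the PCs. A minimal R sketch of the remaining 2D steps, under assumed settings (a 100 × 100 grid; the plotting choices are illustrative), builds the 2D field from Eq. (6.83), recovers the sample EOFs by SVD, and maps the first sample EOF.

# Minimal sketch: 2D synthetic field from Eq. (6.83), sample EOFs by SVD
library(mvtnorm)
N = 100; M = 100; Mode = 5
lam = c(10.0, 9.0, 4.1, 4.0, 2.0)
xi = (1:N)*pi/N
yj = (1:N)*pi/N
# true 2D EOFs, each flattened into a column of an (N*N) x Mode matrix
E2d = sapply(1:Mode, function(k)
  as.vector(outer(xi, yj, function(x, y) (2/N)*sin(k*x)*sin(k*y))))
set.seed(101)
pcm = rmvnorm(M, rep(0, Mode), diag(Mode))*sqrt(1/M)
da2d = E2d %*% diag(sqrt(lam)) %*% t(pcm)    # (N*N) x M synthetic data
svd2d = svd(da2d)
# map the first sample EOF on the N x N grid
sampleEOF1 = matrix(svd2d$u[, 1], nrow = N)
filled.contour(xi, yj, sampleEOF1,
               plot.title = title(main = 'Sample EOF1 (R = 100)',
                                  xlab = 'x', ylab = 'y'))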
Practical estimates of the (true) EOFs are usually computed from space-time data, which
might be serially correlated. The serial correlation can expand the estimates of the con-
fidence interval for an eigenvalue. Since the EOFs are sensitive to the nearness of
neighboring eigenvalues, the serial correlation may impact the stability of EOF shapes.
The number of degrees of freedom for a time series at a given spatial location is usually
smaller than the number of steps in the time series. The PC time series may be less serially
correlated, but not totally uncorrelated. The relationship between the eigenvalue and the
variance of its corresponding PC time series may require reevaluation.
6.6 Chapter Summary
This chapter describes covariance, EOFs, and PCs. The spatial covariance matrix may be
computed as the product of a space-time anomaly matrix and its transpose. The eigenvec-
tors of the spatial covariance matrix are called EOFs. The orthonormal projections of the anomaly data matrix on the EOFs are the PCs, each being the time series associated with
an individual EOF. When the number of columns of the space-time anomaly data matrix is
much smaller than the number of rows, you may wish to compute the temporal covariance
matrix that is equal to the transposed space-time anomaly data matrix times the space-time
anomaly data matrix. The order of this temporal covariance matrix is smaller than that of
the spatial covariance matrix. The eigenvectors of the temporal covariance matrix are PCs.
The orthonormal projections of the anomaly data on the PCs lead to the EOFs. The idea of comput-
ing a temporal covariance matrix is useful when dealing with a big anomaly dataset with
a relatively small number of columns. When you are dealing with an anomaly data matrix
of less than 1.0 GB, you can obtain the EOFs and PCs more directly using SVD, and you
do not need to compute the covariance matrix. However, the covariance matrix itself may
have climate science implications, as shown in Figure 6.1.
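The following minimal R sketch, with a small synthetic anomaly matrix (all dimensions and object names are illustrative), summarizes these three equivalent routes to the EOFs and PCs: the large spatial covariance matrix, the smaller temporal covariance matrix, and the SVD of the anomaly matrix itself.

# Minimal sketch: EOFs and PCs from spatial covariance, temporal
# covariance, and SVD of a synthetic N x Y space-time anomaly matrix
set.seed(1)
N = 200; Y = 10                      # many grid boxes, few time steps
A = matrix(rnorm(N*Y), nrow = N)     # synthetic space-time anomalies
Cs = A %*% t(A)                      # N x N spatial covariance (unscaled)
Ct = t(A) %*% A                      # Y x Y temporal covariance (unscaled)
svdA = svd(A)
eofs_svd = svdA$u                    # EOFs directly from the SVD
pcs_svd  = svdA$v                    # PCs directly from the SVD
eofs_cov = eigen(Cs)$vectors[, 1:Y]  # EOFs from the large spatial matrix
pcs_cov  = eigen(Ct)$vectors         # PCs from the small temporal matrix
# eigenvalues agree with the squared singular values (vector signs may differ)
max(abs(eigen(Ct)$values - svdA$d^2))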
The first few EOFs, particularly the first two or three, often have clear climate science interpretations. The higher-order EOFs and PCs may be noise in the sense of climate physics, although they may provide a better mathematical approximation in climate data analysis and modeling. You always have to decide how many EOFs to use in your data analysis. There is no specific rule for this decision, but you may use the total variance explained, such as 80%, to justify the number of EOFs used. See Section 6.2.2 and its scree plot for details. Note that the sample EOFs form a complete set of basis vectors, and can be used to represent the field as a set of statistical modes.
A climate field is often continuous in both space and time, yet our data analysis, obser-
vations, and modeling are often in grid boxes in 2D or 3D space and at discrete steps in
time. The concepts of covariance and its eigenvector for the continuous climate field are
expressed by integrals, which must be discretized for numerical calculations. The factors
of area, volume, and time must be used in this discretization process. See Section 6.3.
In general, climate fields can be taken to be random for statistical modeling. Section 6.4
discusses the method of generating a random climate field with given standard deviations
and EOFs. Section 6.5 discusses the sampling errors of sample EOFs and sample eigen-
values. We presented North’s rule-of-thumb: the sampling error of an EOF becomes large
when its eigenvalue is close to another EOF’s eigenvalue. Examples for EOF sampling
errors were provided for both 1D and 2D domains.
References and Further Reading
[1] T. W. Anderson, 1963: Asymptotic theory for principal component analysis. Annals of
Mathematical Statistics, 34, 122–148.
This paper used the perturbation method to estimate the errors of sample EOFs.
The original derivation of North's rule-of-thumb followed this paper and
the perturbation theory in quantum mechanics.
[2] D. N. Lawley, 1956: Tests of significance for the latent roots of covariance and
correlation matrices. Biometrika, 43, 128–136.
This was an early paper that studies the errors of eigenvalues of a covariance
matrix using the perturbation method.
[3] G. R. North, T. L. Bell, R. F. Cahalan, and F. J. Moeng, 1982: Sampling errors in the estimation of empirical orthogonal functions. Monthly Weather Review, 110, 699–706.
This is the original paper of North's rule-of-thumb. Numerous papers have been motivated by this theory that quantifies the sampling errors of EOFs.
This was an earlier paper that described the merged land-ocean monthly SAT
anomaly data from 1880.
[5] H. Von Storch and F. W. Zwiers, 2001: Statistical Analysis in Climate Research.
Cambridge University Press.
[6] R. S. Vose, D. Arndt, V. F. Banzon et al., 2012: NOAA’s merged land-ocean surface
temperature analysis. Bulletin of the American Meteorological Society, 93, 1677–1685.
This paper provided a comprehensive summary of the quality control and other
scientific procedures that were used to produce the dataset.
[7] D. S. Wilks, 2011: Statistical Methods in the Atmospheric Sciences. 3rd ed., Academic
Press.
This excellent textbook is easy to read and contains many simple examples of
analyzing real climate data. It is not only a good reference manual for climate
scientists, but also a guide tool that helps scientists in other fields make sense
of the data analysis results in climate literature.
Exercises
6.1 (a) Instead of the 2.5◦S zonal band in Section 6.1, compute the covariance matrix for the January surface air temperature anomalies of the NOAAGlobalTemp data on a mid-latitude zonal band from 1961 to 1990. Take the area-factor $\sqrt{\cos(\text{latitude})}$ into account. You may choose any single latitude band between 25◦N and 65◦N.
Hint: Leave out the rows and columns in the space-time data matrix when the row
or column has at least one missing datum.
(b) Plot the covariance matrix similar to Figure 6.1a for the result of (a).
6.2 (a) Compute the eigenvalues and eigenvectors of the covariance matrix in Exercise 6.1.
(b) Plot the scree plot similar to Figure 6.2.
(c) Each EOF is a function of longitude and is thus a curve. Plot the first four EOFs
on the same figure using different colors. The horizontal axis is longitude and the
vertical axis is the EOF values.
6.3 (a) Develop a space-time data matrix for the NOAAGlobalTemp surface air tempera-
ture anomalies on the same zonal band as in Exercise 6.1(a). Remove the rows and
columns with missing data. The entire row or column is removed if it contains at
least one missing datum.
(b) Use SVD to calculate the eigenvalues, EOFs, and PCs of the space-time data
matrix.
6.4 (a) Do the same as Exercise 6.3(a) but include the area-factor $\sqrt{\cos(\text{latitude})}$ in the space-time data matrix.
(b) Use SVD to calculate the eigenvalues, EOFs, and PCs of the space-time data matrix
with area-factor.
6.5 (a) Comparing the EOFs and eigenvalues computed in Exercises 6.2–6.4, what can
you conclude from the comparison? You may use text, numerical computing, and
graphics to answer.
(b) Does the area-factor make any difference in these three problems? Why?
6.6 Do the same as Exercise 6.1, but for a meridional band across the Pacific. Use
Figure 6.1(b) as a reference. Leave out the rows and columns which have even a single
missing datum.
6.7 (a) Compute the December climatology of the monthly 1,000 mb (i.e., 1,000 hPa) air
temperature data of a reanalysis product, such as NCEP/NCAR Reanalysis, using
the 1951–2010 mean for the entire globe.
(b) Compute the standard deviation of the same data for every grid box.
(c) Plot the maps of climatology and standard deviation.
6.8 (a) For the previous problem, form a space-time data matrix with area-factor for the
anomalies of December air temperature at 1,000 mb level.
(b) Apply SVD to the space-time matrix to compute eigenvalues, EOFs, and PCs.
(c) Plot the scree plot.
(d) Plot the maps of the first four EOFs.
(e) Plot the line graphs of the first four PCs on the same figure using different colors.
6.9 (a) Repeat the previous problem but for standardized anomalies, which are equal to
the anomalies divided by the standard deviation for each grid box.
(b) Comparing the EOFs and PCs of Step (a) of this problem with the EOF and PC
results of the previous problem, discuss the differences and the common features.
6.10 (a) Compute the December climatology of the monthly 1,000 mb geopotential height
data of a reanalysis product, such as NCEP/NCAR Reanalysis, using the 1951–
2010 mean for the entire globe.
(b) Compute the standard deviation of the same data for every grid box.
(c) Plot the maps of climatology and standard deviation.
6.11 (a) For the previous problem, form a space-time data matrix with area-factor for the
December standardized anomalies of the geopotential height data at the monthly
1,000 mb level.
(b) Apply SVD to the space-time matrix to compute eigenvalues, EOFs, and PCs.
(c) Plot the scree plot.
(d) Plot the maps of the first four EOFs.
(e) Plot the line graphs of the first four PCs on the same figure using different colors.
6.12 Develop a list of area factors for the 48 states of the contiguous United States, based on
the theory of Section 6.3 and data you can find from the Internet.
6.13 (a) The Global Precipitation Climatology Project (GPCP) monthly precipitation
dataset from January 1979 is the observed precipitation data on a 2.5◦ × 2.5◦ global
grid. It has 144 × 72 = 10368 grid boxes, with latitude and longitude ranges in
88.75◦ N–88.75◦ S, and 1.25◦ E–358.75◦ E. You can download this dataset from
6.14 (a) Compute the area-weighted spatial covariance matrix for the January standardized
anomalies of the GPCP data using the 1981–2020 data computed in the previous
problem. The result should be a 10368 × 10368 matrix.
(b) Compute the eigenvectors and eigenvalues of the above spatial covariance matrix.
(c) Plot the scree plot.
(d) Plot the maps of the first four EOFs.
(e) PCs are the anomaly data projections on the EOFs using matrix multiplication.
Compute the PCs. Plot the line graphs of the first four PCs on the same figure
using different colors.
6.15 (a) Following the temporal covariance theory in Sections 6.2 and 6.3, compute the
area-weighted temporal covariance matrix for the January standardized anomalies
of the GPCP data using the 1981–2020 climatology and standard deviation. The
result should be a 40 × 40 matrix.
(b) Compute the eigenvectors and eigenvalues of the above temporal covariance
matrix.
(c) Plot the scree plot.
(d) The eigenvectors here are PCs. Plot the line graphs of the first four PCs on the same
figure using different colors.
(e) EOFs can be computed by the anomaly data projection on the PCs. Compute the
EOFs. Plot the maps of the first four EOFs.
6.16 (a) Generate the space-time area-weighted standardized anomaly matrix from the
January GPCP data from 1981 to 2020. The result should be a 10368 × 40 matrix.
(b) Compute the SVD for this space-time data matrix.
(c) The columns of the U-matrix from the SVD are the EOFs, and the columns of the V-matrix from the SVD are the PCs. Plot the maps of the first four EOFs from this computation, and plot the line graphs of the first four PCs from the same computation. Arrange the figure
into a 4 × 2 matrix, and place the EOFs on the left column and the corresponding
PCs on the right column.
6.17 (a) Square the diagonal elements of the D-matrix from the SVD procedure in Exer-
cise 6.16. Regard the squared diagonal elements as eigenvalues and plot the
corresponding scree plot.
(b) Compare the scree plot from this SVD procedure and that from the spatial or
temporal covariance matrix. What can you conclude?
6.18 (a) Compute the physical EOFs of the January GPCP precipitation data from 1981 to
2020 based on the theory in Section 6.3.
(b) Plot the maps of the first four physical EOFs in (a).
6.19 Starting from the perturbation expansion equations in Section 6.5.3, write down the detailed steps of the derivation of the formula for the EOF sampling error (Eq. (6.78)):
$$\delta\mathbf{u}_\ell \approx \frac{\delta\lambda_\ell}{\Delta\lambda_\ell}\,\mathbf{u}_{\ell'}. \qquad (6.85)$$
6.20 Use both mathematical formulas and physical intuition to justify why area-factors are
needed for computing the EOFs for a climate field across a large range of latitude, e.g.,
from 0◦ N to 75◦ N.
6.21 (a) Following the theory of Sections 6.4 and 6.5, use a computer to generate random
PCs and then to generate a random field with the prescribed 1D EOFs over a spatial
line segment [−π, π],
$$\psi_m(x) = \frac{1}{\sqrt{\pi}}\cos(mx), \quad m = 1, 2, \ldots, M, \qquad (6.86)$$
and a set of eigenvalues of your choice. Work on two cases of spatial sampling: 100 and 2,000 samples. Do not forget the line segment factor.
(b) Plot the random field similar to Figure 6.3.
6.22 (a) Apply SVD to the random field in Exercise 6.21 to recover the EOFs.
(b) Plot the recovered EOFs and compare them with the originally prescribed EOFs in
the interval [−π, π].
6.23 Repeat the 2D example in Section 6.5 but use a new set of eigenvalues: λ = 12.0, 11.0,
6.1, 6.0, and 1.0 with the spatial sample sizes: 200 and 3,000 samples.
7 Introduction to Time Series
Roughly speaking, a time series is a string of data indexed according to time, such as a
series of daily air temperature data of San Diego in the last 1,000 days, and a series of the
daily closing values of the Dow Jones Industrial Average in the previous three months. For
these time series data, we often would like to know the following: What is their trend? Is
there evidence of cyclic behavior? Is there random behavior? Eventually, can we use these properties to make predictions? For example, a bride may want to plan for a wedding
on the second Saturday of July of next year. She may use the temperature seasonal cycle
to plan some logistics, such as clothes and food. She also needs to consider randomness
of rain, snow, or a cold front, although she might choose to ignore the climate trend in her
approximation.
Mathematically, a time series is defined as a sequence of random variables, indexed
by time t, and is denoted by Xt . This means that for each time index t, the RV Xt has
a probability distribution, ensemble mean (i.e., expected value), variance, skewness, etc.
The time series as a whole may show trend, cycle, and random noise. A given string of
data indexed by time is a realization of a discrete time series, and is denoted by xt . A
time series may be regarded as a collection of infinitely many realizations, which makes it
different from a deterministic function of time that has a unique value for a given time. A
stream of data ordered by time is an individual case drawn from the collection. This chapter
will describe methods for time series analysis, including methods to quantify trends,
cycles, and properties of randomness. In practice, a time series data analysis is for a given
dataset, and the randomness property is understood to be part of the analysis but may not
be included explicitly at the beginning.
When t takes discrete values, Xt is a discrete time series. When t takes continuous values,
Xt is a continuous time series. If not specified, the time series dealt with in this book
are discrete. This chapter begins with the time series data of CO2 , and covers the basic
time series terminologies and methods, including white noise, random walk, stochastic
processes, stationarity, moving average, autoregressive processes, Brownian motion, and
data-model fitting.
7.1 Examples of Time Series Data

This section describes two examples of time series data: the monthly data of carbon dioxide
concentration at Mauna Loa, Hawaii, USA, since 1958, and the daily minimum tempera-
ture data of St. Paul, Minnesota, USA, from 1941 to 1950. We will briefly show the features
of trend, seasonal cycle, random residual, and forecast. These results will help illustrate the time series methods to be presented in the subsequent sections.

Figure 7.1 The Keeling curve (red): the monthly atmospheric CO2 concentration data at Mauna Loa Observatory from March 1958 to July 2020. The black curve is the nonlinear trend of the Keeling curve. The trend was the result of the removal of the seasonal oscillation and the computing of a moving average. The data and documents are from the NOAA Global Monitoring Laboratory (GML) website: https://gml.noaa.gov/ccgg/trends/mlo.html. Access date: August 2020.
Figure 7.1 shows the monthly atmospheric carbon dioxide (CO2) concentration data at Mauna Loa Observatory (19.5362◦N, 155.5763◦W), elevation 3,394 meters, on the Island of Hawaii, USA. The figure shows the data from March 1958 to July 2020. C. David Keeling of the Scripps Institution of Oceanography (SIO) started the observation at a NOAA facility. Since May 1974, NOAA has been making its own CO2 measurements, running in parallel with the SIO observations.
The red curve in Figure 7.1 displays the monthly CO2 data. The up and down fluctua-
tions of the red curve, in an approximately ±3 ppm (parts per million) range, are due to the
seasonal cycle of CO2 . The black curve was plotted according to the monthly trend data
included in the same data table downloaded from NOAA GML. The trend curve demon-
strates a steady increase of the CO2 concentration. According to the trend data, the CO2
concentration increased from approximately 314 ppm in 1958 to 413 ppm in 2020; an
alarming 30% increase in 63 years!
The CO2 seasonal cycle is due to the growth of plants and their leaves, and to the decom-
position of the dead plants and fallen leaves. The seasonal cycle of the CO2 concentration
peaks in May, and reaches the bottom in October. From November, the dead plants and
leaves break down. In this decomposition process, microbes respire and produce CO2 , and
hence the CO2 concentration begins to increase from November until May. In the spring,
tree leaves grow, photosynthesis increases, and the plants and leaves absorb much CO2
in the atmosphere. The photosynthesis burst in June and the following summer months
makes the CO2 concentration decrease from June until reaching the lowest level in Octo-
ber. Although the phase of the biological cycle in the Southern Hemisphere is exactly
opposite, the smaller land area does not alter the CO2 seasonal cycle dominated by the
Northern Hemisphere.
Figure 7.1 can be plotted by the following computer code.
# R plot Fig. 7.1: Mauna Loa CO2 March 1958 - July 2020
setwd("/Users/sshen/climstats")
co2m = read.table("data/co2m.txt", header = TRUE)
dim(co2m)
#[1] 749 7
co2m[1:3, ]
#  year mon     date average interpolated  trend days
#1 1958   3 1958.208  315.71       315.71 314.62   -1
#2 1958   4 1958.292  317.45       317.45 315.29   -1
#3 1958   5 1958.375  317.50       317.50 314.71   -1
mon = co2m[, 3]
co2 = co2m[, 5]
setEPS()   # save the .eps figure file to the working directory
postscript("fig0701.eps", height = 8, width = 10)
par(mar = c(2.2, 4.5, 2, 0.5))
plot(mon, co2, type = "l", col = "red",
     main = expression(paste("Monthly Atmospheric ",
             CO[2], " at Mauna Loa Observatory")),
     xlab = "",
     ylab = "Parts Per Million [ppm]",
     cex.axis = 1.5, cex.lab = 1.5, cex.main = 1.6)
text(1975, 410, "Scripps Institution of Oceanography", cex = 1.5)
text(1975, 400, "NOAA Global Monitoring Laboratory", cex = 1.5)
lines(mon, co2m[, 6])   # plot the trend data
dev.off()
The CO2 data has typical components of a time series: a trend due to the global greenhouse
gas increase, a seasonal cycle due to the seasonal cycle of plants, and random residuals
due to the random fluctuations of the plant’s growth and climate conditions. This may be
expressed by the following formula:
xt = Tt + St + Nt , (7.1)
where xt stands for the time series data, Tt for trend, St for seasonal cycle, and Nt for random
residuals (also called noise, or random error). We call this the ETS (i.e., error, trend, and
season) decomposition of a time series.1 A computer code can be used to decompose the
data into the three components, as shown in Figure 7.2. This figure shows an additive
model, as expressed by Eq. (7.1). You can also make a multiplicative decomposition, which
is not discussed in this book.
The computer code for plotting Figure 7.2 is as follows.
# R plot Fig. 7.2: Time series decomposition
co2.ts = ts(co2, start = c(1958, 3), end = c(2020, 7),
            frequency = 12)
co2.ts   # Display time series with month and year
#         Jan    Feb    Mar    Apr    May    Jun    Jul    Aug
#1958                315.71 317.45 317.50 317.10 315.86 314.93
#1959 315.62 316.38 316.71 317.72 318.29 318.15 316.54 314.80
#1960 316.43 316.97 317.58 319.02 320.03 319.59 318.18 315.91
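The code above constructs the monthly time series object co2.ts; the decomposition itself is obtained with the decompose() function (a minimal sketch; the object name co2.decompose matches its use later in this section):

co2.decompose <- decompose(co2.ts)   # additive decomposition: trend, seasonal, random
plot(co2.decompose)                  # gives a plot analogous to Fig. 7.2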
1 Although ETS is a commonly used term for error, trend, and season in the statistics literature, it might be more convenient to call Eq. (7.1) a TSN decomposition, standing for trend, season, and noise. The noise can be due to random errors or to the natural variability of the variable.
Figure 7.2 The Keeling curve as time series data and its components of trend, seasonal cycle, and random residual (panels from top to bottom: observed, trend, seasonal, random).
The R command decompose(co2.ts) for decomposing time series data co2.ts is for
those data that have a known major seasonal cycle. Many climate variables do, such as the
midlatitude surface air temperature and precipitation. The trend is computed by a moving
average with its window length equal to a year, i.e., 12 months for the monthly data, and
365 days for the daily data. R has different moving average commands based on different
algorithms. One of them is included in the R package itsmr, which has a moving average
smoothing function smooth.ma(x, q), where x is the time series data, and q is the length
of the moving average window, thus 12 in the case of our monthly CO2 data.
After the trend is removed by y = x - smooth.ma(x, q), we have the de-trended
data y. We then compute the climatology of the de-trended data for the entire data history.
This January to December climatology cycle is extended to the entire data history for every
year to form the seasonal cycle of the time series. The random residuals are then equal to
the de-trended data y minus the seasonal cycle.
Note that this time series decomposition procedure is for data that have a clear cycle
with a fixed period. When the cycle is quasiperiodic with a variable period, such as the
data dominated by the El Niño signal, this procedure may not yield meaningful results.
Let us further examine the results shown in Figure 7.2. The trend here is largely
monotonic and increases from 315.41 ppm in 1958 to 412.81 ppm in 2020.
summary(co2.decompose$trend, na.rm = TRUE)
#  Min. 1st Qu. Median   Mean 3rd Qu.  Max.  NA's
# 315.4   330.1  352.9  355.2   377.9 412.8    12
Here, the trend line value 315.41 ppm in Figure 7.2 is for September 1958, and is slightly
different from the trend value 314.62 for March 1958 in the original CO2 data shown in
Figure 7.1. The trend line value 412.81 ppm is for January 2020 in Figure 7.2, and is also
slightly different from the trend value 414.04 for July 2020 in the original CO2 data shown
in Figure 7.1. The small difference is due to (i) the different ways to compute the trend for
Figures 7.1 and 7.2 and (ii) the different beginning month and end month.
The seasonal cycle has an amplitude approximately equal to 3 ppm.
round(summary(co2.decompose$seasonal, na.rm = TRUE),
      digits = 3)
#   Min. 1st Qu. Median  Mean 3rd Qu.  Max.
# -3.256  -1.484  0.678 0.013   2.318 3.031
The mean is 0.013 ppm, not zero. The seasonal oscillation trough is −3.26 ppm. The sea-
sonal oscillation crest is 3.03 ppm. Thus, the seasonal oscillation is not symmetric around
zero, which may be a consequence of nonlinear properties of CO2 variation. You may plot
a few cycles to see the asymmetry and the nonlinear nature using R:
plot(co2.decompose$seasonal[1:50], type = 'l')
The random residuals are generally bounded between −1.0 ppm and 1.0 ppm, with two
exceptions: August 1973 and April 2016, as found by the following R code.
round(summary(co2.decompose$random, na.rm = TRUE),
      digits = 4)
#    Min. 1st Qu.  Median    Mean 3rd Qu.   Max. NA's
# -0.9628 -0.1817  0.0034 -0.0009  0.1818 1.2659   12
sd(co2.decompose$random, na.rm = TRUE)
#[1] 0.2953659
which(co2.decompose$random > 1.2659369)
#[1] 698
co2m[698, ]
#    year mon     date average interpolated trend days
#698 2016   4 2016.292  407.45       407.45 404.6   25
The mean of the random residuals is approximately zero. The standard deviation is approx-
imately 0.3 ppm. The random component is relatively small compared with the seasonal
cycle and trend. This makes the CO2 forecast easier since the trend and seasonal cycle can
provide excellent forecasting skills.
The strong trend and seasonal cycle together with the weak random residuals make the
forecast for CO2 relatively reliable. Figure 7.3 shows the forecast of 48 months ahead (the
blue line). The upper and lower bounds of the blue forecasting line are indicated by the
shades. The forecast was made based on the aforementioned error, trend, and seasonal
cycle (ETS) theory. The detailed theory of time series forecasting is beyond the scope of
this book. Readers are referred to online materials and specialized books and papers.
Figure 7.3 The blue line is the forecast of the Keeling curve with a lead of 48 months based on the ETS method. The red line is the observed CO2 data, and the black line, almost covered by the red line, is the ETS fitting result.
The gap between the red and blue lines is due to a drop of 1.8 ppm in the CO2 concentration from July 2020, the last observed value, to August 2020, the first forecast value. A drop in the range of 1.5–2.6 ppm from July to August has occurred every year in the past.
Figure 7.3 can be reproduced by the following computer code.
# R plot Fig. 7.3: Time series forecast for CO2
# by ETS (Error, Trend, and Seasonal) smoothing
library(forecast)
co2forecasts <- ets(co2.ts, model = "AAA")
# ETS model AAA means exponential smoothing with
# season and trend
# Forecast 48 months into the future
co2forecasts <- forecast(co2forecasts, h = 48)
plot(co2forecasts,
     ylab = "Parts Per Million [ppm]",
     main = "Observed (red) and ETS Forecast (blue)",
     cex.axis = 1.5, cex.lab = 1.5, cex.main = 1.2)
lines(co2.ts, col = 'red')
# set up plot
fig, ax = plt.subplots()
# labels
plt.title('Observed (red) and ETS Forecast (blue)', pad = 15)
plt.ylabel('Parts Per Million [ppm]', labelpad = 15)
The monthly CO2 time series has a strong trend, a clear seasonal cycle, and a weak ran-
dom component. However, many climate data time series have a large random component,
particularly the time series of short time scales, such as the daily data and hourly data.
Although the seasonal cycles are usually clear, the trend can be weak, nonexistent, or non-
linear. Figure 7.4 shows the data of ten years (January 1, 1941 to December 31, 1950) of
daily minimum temperature at St. Paul, Minnesota, USA (44.8831◦ N, 93.2289◦ W) with
an elevation of 265.8 meters.
Figure 7.4 shows (i) a clear and strong seasonal cycle of about −20◦C in winter and 20◦C in summer; (ii) random residuals bounded in (−20, 20)◦C, with fluctuations often of ±3 or 4◦C that are obviously larger in winter than in summer; and (iii) a trend bounded by (1, 4)◦C, which is relatively weak compared to the amplitude of the seasonal cycle and random residuals; further, the trend component is nonlinear, varies slowly up and down in time, and is not monotonic. Some extreme values in the ETS components may correspond to abnormal weather phenomena and may deserve detailed scrutiny. Thus, the ETS decomposition is a helpful diagnostic tool for climate time series data.
Figure 7.4 The daily minimum temperature observed data of the St. Paul station, MN, USA, from January 1, 1941 to December 31, 1950, and their ETS components of trend, seasonal cycle, and random residual (panels from top to bottom: observed, trend, seasonal, random).
The large random component and the non-monotonic and nonlinear trend in Figure 7.4
make the forecast of the daily minimum temperature of St. Paul have large errors if the fore-
cast is based solely on the ETS time series method. This is very different from the Mauna
Loa CO2 forecast shown in Figure 7.3. The practical daily temperature and precipitation
forecast often uses many other constraints, such as atmospheric pressure, precipitable water
in the air, and more atmospheric dynamical properties. Three-dimensional observations
based on stations, radar, satellite and other instruments, together with numerical models
are often combined to make modern weather forecasts.
Figure 7.4 can be reproduced by computer code analogous to that for Figure 7.2.
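A minimal R sketch of such code, assuming the daily minimum temperatures are stored in a file data/StPaulTmin.csv with a column Tmin covering January 1, 1941 to December 31, 1950 (both the file name and the column name are hypothetical), is:

# Minimal sketch with a hypothetical data file for the St. Paul Tmin series
da = read.csv('data/StPaulTmin.csv', header = TRUE)
tmin.ts = ts(da$Tmin, start = c(1941, 1), frequency = 365)
tmin.decompose = decompose(tmin.ts)   # trend, seasonal, random components
plot(tmin.decompose)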
Because a leap year has 366 days, strictly speaking, using frequency = 365 in the
time series decomposition is only an approximate period of the annual cycle. However,
since our data string covers only 10 years, the error caused by the leap year is small.
Compared with the monthly CO2 data in Figure 7.2, the daily temperature data at St. Paul
in Figure 7.4 has a much stronger noise component, and a nonlinear and non-monotonic
trend. The seasonal signals are clear and stable in both the monthly CO2 and the daily Tmin
data strings. Most climate data have a clear seasonal signal in the ETS decomposition,
because of the intrinsic seasonality of weather.
7.2 White Noise

A time series W_t is called white noise (WN) if each random variable W_t, for a given discrete time t, follows a zero-mean normal distribution, and the random variables at different times t and t′ are uncorrelated. Namely,
$$W_t \sim N(0, \sigma^2), \qquad \mathrm{Cov}[W_t, W_{t'}] = \sigma^2 \delta_{tt'},$$
where δ_tt′ is the Kronecker delta. That is, each member in a white noise time series is uncorrelated with all past and future members. This can also be expressed as the autocorrelation function (ACF) with a time lag τ:
$$\rho(\tau) = \begin{cases} 1, & \text{if } \tau = 0,\\ 0, & \text{otherwise}. \end{cases}$$
The name “white noise” is composed of (i) “noise,” because Wt has zero autocorrelation
with both its past and future, and (ii) “white,” because the time series has neither trend
nor seasonal cycle, i.e., the spectral power of the entire time series is evenly distributed
among all frequencies, as in white light, which has a uniform spectrum across all frequencies in nature. The spectral theory of a time series will be described in Chapter 8 of
this book.
Figure 7.5 shows a realization of white noise with 500 time steps, generated by the following computer code.

# R plot Fig. 7.5: Generate white noise
set.seed(119)   # set seed to ensure the same simulation result
wt = rnorm(500) # standard normal white noise, 500 steps (per the text)
plot(wt, type = 'l', xlab = "Time", ylab = expression(W[t]))
Figure 7.5 A simulated time series of white noise of 500 time steps.
Figure 7.5 shows complete randomness without an obvious trend or a seasonal cycle.
Figure 7.6 shows a histogram and ACF ρ(τ) of the time series. The blue line over the his-
togram is the pdf of the standard normal distribution, which is a good fit to the histogram.
This implies that the simulated time series is approximately normally distributed with zero
mean and standard deviation equal to one. The ACF figure shows that the autocorrelations
are mostly close to zero and bounded in [−0.1, 0.1] for the time lag τ = 1, 2, 3, . . .. We
regard the ACF as approximately zero for any positive integer time lag.
Figure 7.6 shows the histogram, its PDF fit, and the ACF of a realization of white noise
with 500 time steps, generated by the following computer code.
# Plot ACF
sm.graphics.tsa.plot_acf(white_noise_ser,
                         ax = ax[1], color = 'k')
# Add labels and ticks for ACF
ax[1].set_title('Auto-correlation of White Noise',
                pad = 20, size = 20)
ax[1].set_xlabel('Lag', labelpad = 15, size = 20)
ax[1].set_ylabel('ACF', labelpad = 15, size = 20)
ax[1].set_xticks([0, 5, 10, 15, 20, 25])
Figure 7.6 Left: Histogram of the simulated time series of white noise. Right: The autocorrelation function of the white noise.
White noise is the simplest discrete time series. It provides building blocks of almost all
time series models. However, most time series in nature are not white noise, because the
elements are correlated. Next, you will learn how to use the white noise to build time series
models.
7.3 Random Walk

Random walk (RW) literally means a random next step, forward or backward from the present position, when walking on a line. Both the size and the direction of the next step are random. Let X_t be a random walk time series, or represent the stochastic process of a random walk. By definition, the next step X_{t+1} is equal to the present position X_t plus a completely random step modeled by a white noise W_t:
$$X_{t+1} = X_t + \delta + W_t,$$
where δ is a fixed constant called drift, W_t ∼ N(0, σ²), and σ² is the variance of the white noise.
From the initial position X0 at time t = 0, we can derive the position XT at time t = T as
follows:
$$X_1 = X_0 + \delta + W_0, \qquad (7.7)$$
$$X_2 = X_1 + \delta + W_1 = X_0 + 2\delta + (W_0 + W_1), \qquad (7.8)$$
$$\cdots\cdots \qquad (7.9)$$
$$X_T = X_0 + T\delta + (W_0 + W_1 + \cdots + W_{T-1}). \qquad (7.10)$$
Thus, the final result of a random walk from the initial position is the sum of white noise
of all the previous steps.
We can also write it as
$$X_T - X_0 = (\delta + \bar{W}_{T-1})\,T, \qquad (7.11)$$
where
$$\bar{W}_{T-1} = \frac{\sum_{t=0}^{T-1} W_t}{T} \qquad (7.12)$$
is the mean of the white noise of the previous steps: W0 ,W1 , . . . ,WT −1 . Thus, the departure
of T steps from the initial position X0 by a random walk is proportional to time length T
with a coefficient equal to the drift δ plus the mean white noise W̄T −1 .
The mean of X_T is equal to the mean of the initial position plus δT:
$$E[X_T] = E[X_0] + \delta T. \qquad (7.13)$$
Therefore, the expected value of the terminal position is equal to the expected value of
the initial position plus an integrated drift δ T . This is reasonable since the white noise is
completely random and does not have a preferred direction. This drift formulation may be
used to model a systematic drift of satellite remote sensing measurements, or a drift due to
ambient wind or ocean current.
The variance of X_T grows linearly with T, as shown:
$$\mathrm{Var}[X_T] = \mathrm{Var}[X_0] + \sum_{t=0}^{T-1}\mathrm{Var}[W_t] = \mathrm{Var}[X_0] + \sigma^2 T. \qquad (7.14)$$
This result is important and can be used to develop stochastic models for science and
engineering, such as Brownian motion, which describes a random motion of fine particles
suspended in a liquid or gas. Brownian motion is named after Scottish botanist Robert
Brown (1773–1858) who, in 1827, described the random motion of pollen grains of a plant suspended in water under a microscope. Of course, the proper mathematical model of the Brownian
motion should use (x, y) coordinates as a function of t. Both x(t) and y(t) are generated
by a sum of many random walk steps simulated by white noise according to Eq. (7.10).
This is a 2-dimensional random walk. Three-dimensional Brownian motion can be used to
model a diffusive process in climate science, such as heat diffusion in both water and air,
and aerosol particle diffusion in atmosphere.
Figure 7.7 shows five realizations of random walk time series of 1,000 steps with drift δ = 0.05, standard deviation of the white noise σ = 0.1, 0.4, 0.7, 1.0, and 1.5, and x_0 = 0.
When the standard deviation is relatively small compared to the drift, the drift plays the
dominant role in leading the walk directions. When the standard deviation is large, the large
variance in the white noise part imposes much uncertainty and makes the final position
highly uncertain; it is even uncertain whether the final position is on the positive or negative side.
Figure 7.7 Left: The first 1,000 steps of five realizations of random walk for various values of standard deviation and a fixed drift value. Right: One hundred realizations of random walk, showing that the standard deviation SD[X_t] of the 100 realizations, indicated by the red line, is proportional to $\sqrt{t}$, indicated by the blue line.

Figure 7.7 can be generated by the following computer code.

# R plot Fig. 7.7: Random walk time series
# (a) Five realizations with different sigma values
n = 1000   # Total number of time steps
m = 5      # Number of time series realizations
a = 0.05   # Drift delta = 0.05
b = c(0.1, 0.4, 0.7, 1.0, 1.5)   # SD sigma
# Generate the random walk time series data
x = matrix(rep(0, m*n), nrow = m)
for (i in 1:m){
  w = rnorm(n, mean = 0, sd = b[i])
  for (j in 1:(n - 1)){
    x[i, j + 1] = a + x[i, j] + w[j]
  }
}
# Plot the five time series realizations
par(mfrow = c(1, 2))
par(mar = c(4.5, 4.5, 2.5, 0.5))
plot(x[1, ], type = 'l', ylim = c(-20, 100),
     xlab = "Time steps: t", ylab = expression(X[t]),
     main = expression('Five realizations of random walks:'
                       ~ delta ~ '= 0.05'),
     cex.lab = 1.5, cex.axis = 1.5, cex.main = 1.2)
lines(x[2, ], type = 'l', col = 'blue')
lines(x[3, ], type = 'l', col = 'red')
lines(x[4, ], type = 'l', col = 'orange')
lines(x[5, ], type = 'l', col = 'brown')
legend(-100, 110,
       legend = c(expression(sigma ~ '= 0.1'),
                  expression(sigma ~ '= 0.4'),
                  expression(sigma ~ '= 0.7'),
                  expression(sigma ~ '= 1.0'),
                  expression(sigma ~ '= 1.5')),
       col = c('black', 'blue', 'red', 'orange', 'brown'),
       x.intersp = 0.2, y.intersp = 0.4,
       seg.len = 0.6, lty = 1, bty = 'n', cex = 1.3)
text(20, -20, "(a)", cex = 1.4)
a = 0.0        # Drift delta = 0
b = rep(1, m)  # SD sigma is the same
# Generate the random walk time series data
x = matrix(rep(0, m*n), nrow = m)
for (i in 1:m){
  w = rnorm(n, mean = 0, sd = b[i])
  for (j in 1:(n - 1)){
    x[i, j + 1] = a + x[i, j] + w[j]
  }
}
# Plot the series realizations
par(mar = c(4.5, 4.5, 2.5, 0.8))
plot(x[1, ], type = 'l', ylim = c(-100, 100),
     xlab = "Time steps: t", ylab = expression(X[t]),
     main = expression('100 realizations of random walks:'
                       ~ delta ~ '= 0,' ~~ sigma ~ '= 1'),
     cex.lab = 1.5, cex.axis = 1.5, cex.main = 1.2)
for (i in 2:m){
  lines(x[i, ], type = 'l')
}
library(matrixStats)
y = colSds(x)
lines(y, type = 'l', lwd = 2, col = 'red')
lines(-y, type = 'l', lwd = 2, col = 'red')
z = sqrt(1:n)
lines(z, type = 'l', lwd = 2, col = 'blue')
lines(-z, type = 'l', lwd = 2, col = 'blue')
legend(-150, 120,
       legend = c('Standard deviation of the simulations SD(t)',
                  expression('Theoretical formula: SD(t)' %prop% sqrt(t))),
       col = c('red', 'blue'), cex = 1.3,
       x.intersp = 0.2, y.intersp = 0.3,
       seg.len = 0.4, lty = 1, bty = 'n', lwd = 2)
text(20, -100, "(b)", cex = 1.4)
dev.off()   # go back to R's default figure setting
for j in range(n - 1):
    X[i, j + 1] = drift + X[i, j] + white_noise[j]

# Fig. 7.7(b): 100 realizations of
# random walk to show variance increasing with time
seed(100)   # set seed
n = 1000    # Total number of time steps
m = 100     # Number of time series realizations
drift = 0   # Drift delta = 0
sd = np.ones(m)   # SD sigma is the same for all realizations
# Generate the random walk time series data
X = np.zeros((m, n))
for i in range(m):
    white_noise = [gauss(0.0, sd[i]) for k in range(n)]
    for j in range(n - 1):
        X[i, j + 1] = drift + X[i, j] + white_noise[j]
# sd of cols of X
col_sd = np.std(X, axis = 0)
plt.plot(col_sd, color = 'red', linewidth = 2)
# Add labels
plt.title(r'100 RW Realizations: $\delta = 0, \sigma = 1$',
          pad = 15)
plt.ylabel(r'$X_{t}$', labelpad = 10)
plt.xlabel('Time steps: t', labelpad = 15)
ax.set_ylim(-100, 110)
ax.set_yticks([-100, -50, 0, 50, 100])
7.4.2 Stationarity
Stationary literally means "not moving" or "not changing in condition or quantity." A time
series Xt is stationary if none of the probability characteristics of each random variable Xt
change with respect to t. This basically means that the mean, variance, skewness, kurtosis,
and all the higher statistical moments do not vary with time. This is called strict-sense
stationarity. A milder form of stationarity for a time series is that only the mean and auto-
covariance do not vary with time. In this case, we call it wide-sense stationarity, or weak
stationarity. In climate science, we often deal with weakly stationary time series.
The two conditions for wide-sense stationarity can be expressed by mathematical formulas as follows. The expected value does not change:
$$E[X_t] = E[X_{t'}], \qquad (7.15)$$
and the autocovariance depends only on the time lag τ = |t − t′|:
$$\mathrm{Cov}[X_t, X_{t'}] = h(\tau), \qquad (7.16)$$
where the auto-covariance h is a function determined by the particular stochastic process being studied. Because $h(0) = \mathrm{Var}[X_t] = \mathrm{Var}[X_{t'}]$, the variance of a weakly stationary time series does not change with time. Because the variance is constant, the ACF can be calculated as follows:
$$\mathrm{Cor}[X_t, X_{t+\tau}] = \frac{\mathrm{Cov}[X_t, X_{t'}]}{\sqrt{\mathrm{Var}[X_t]\,\mathrm{Var}[X_{t'}]}} = \frac{h(\tau)}{h(0)} \equiv \rho(\tau). \qquad (7.17)$$
Namely, the ACF depends on only the time lag τ, not on the beginning time t.
Example 7.1 The standard white noise is a weakly stationary time series because E[W_t] = 0, Var[W_t] = 1, and the ACF
$$\rho(\tau) = \begin{cases} 1, & \text{if } \tau = 0,\\ 0, & \text{otherwise} \end{cases} \qquad (7.18)$$
is independent of t.
Example 7.2 A random walk is nonstationary. If there is a nonzero drift, then the mean
E[Xt ] = E[X0 ] + δt varies with t, where δ is the drift. The first condition of stationarity
is violated, and hence the random walk is nonstationary. Even in the absence of drift,
the variance Var[Xt ] = Var[X0 ] + σ 2t also varies with t, where σ 2 is the variance of the
white noise that builds the random walk. Hence the random walk without a drift is still
nonstationary.
The augmented Dickey–Fuller (ADF) test, named after American statisticians David Dickey (1945–) and Wayne Fuller (1931–), may be used to check whether a time series is stationary. R has a function for the ADF test. The null hypothesis is that the time series is nonstationary, and the alternative hypothesis is that it is stationary. For example, we can use it to test whether the blue time series in the left panel of Figure 7.7 is nonstationary.
library(tseries)   # tseries is an R package
adf.test(x[2, ])
# Dickey-Fuller = -1.4192, Lag order = 9, p-value = 0.8242
The large p-value implies that the null hypothesis cannot be rejected, and hence the random walk time series is regarded as nonstationary.
Similarly, we can test that a white noise time series is stationary.
library ( tseries ) # tseries is an R package
adf . test ( rnorm ( 5 0 0 ))
# Dickey - Fuller = -8 . 4 8 5 1 , Lag order = 7 , p - value = 0 . 0 1
The small p-value implies that the null hypothesis is rejected in favor of the alternative, and hence the white noise time series is stationary.
The corresponding Python code is as follows.
7.5 Moving Average Time Series

A moving average (MA) stochastic process is a weighted average of white noise. MA(1)
denotes a moving averaging process that smoothes the white noise at the present time step
and the previous step by a weighted average:
Xt = aWt + bWt−1 , (7.19)
where Wt ∼ N(0, σ 2 ) are Gaussian white noise; a, b, and σ are constants; and
a + b = 1. (7.20)
The three parameters a, b, and σ can be used to fit data or a model. The constraint Eq.
(7.20) may not be necessary when modeling a stochastic process.
Figure 7.8 shows three realizations of moving average time series. The black line is an
MA(1) time series with uniform weights. The blue is an MA(1) with uneven weights with
a larger weight at time t and a smaller one for t − 1. The red is a MA(5) time series with
uniform weights. In MA(5), Xt is the weighted average of the last five steps including the
present step. One can of course extend this concept to MA(q) for any positive integer q.
Figure 7.8 The first 120 steps of three realizations of moving average time series.

Figure 7.8 can be generated by the following computer code.

# R plot Fig. 7.8: Moving average time series
set.seed(65)
n = 124
w = rnorm(n)
x1 = x2 = x3 = rep(0, n)
for (t in 5:n){
  x1[t] = 0.5*w[t] + 0.5*w[t-1]
  x2[t] = 0.9*w[t] + 0.1*w[t-1]
  x3[t] = (1/5)*(w[t] + w[t-1] + w[t-2] + w[t-3] + w[t-4])
}
par(mar = c(4.3, 4.5, 2, 0.2))
plot(x1[5:n], ylim = c(-2.3, 3.2), type = 'l',
     main = "Realizations of moving average time series",
     xlab = "Time steps", ylab = expression(X[t]),
     cex.lab = 1.3, cex.axis = 1.3, lwd = 2)
lines(x2[5:n], col = 'blue', lwd = 1)
lines(x3[5:n], col = 'red', lwd = 2)
legend(5, 3.8, bty = 'n', lty = 1,
       legend = (c("MA(1): a = 0.5, b = 0.5",
                   "MA(1): a = 0.9, b = 0.1",
                   "MA(5): Uniform weights = 1/5")),
       col = c('black', 'blue', 'red'),
       lwd = c(2, 1, 2))
# Corresponding Python code fragment (setup lines added here; the original
# excerpt assumes n and the white noise w are defined as in the R code above)
import numpy as np
np.random.seed(65)
n = 124
w = np.random.normal(size = n)
ma1_uniform = np.zeros(n)
ma1_uneven = np.zeros(n)
ma5_uniform = np.zeros(n)
# generate three MA time series
for t in range(4, n):
    ma1_uniform[t] = .5*w[t] + .5*w[t-1]
    ma1_uneven[t] = .9*w[t] + .1*w[t-1]
    ma5_uniform[t] = (w[t] + w[t-1] + w[t-2]
                      + w[t-3] + w[t-4]) / 5
It is known that white noise is a stationary process. We may claim that a moving average
(smoothing) of white noise is also stationary. This claim is true, and we use the MA(1) time series as an
example to justify it.
First, consider the mean of Xt :
E[Xt ] = 0, (7.21)
since E[Wt ] = 0.
Next, we consider the auto-covariance of Xt :
Cov[X_t, X_{t+τ}] = σ² [ (a² + b²) δ_{0,τ} + ab (δ_{1,τ} + δ_{−1,τ}) ],    (7.22)

where δ_{0,τ} is the Kronecker delta function. The auto-covariance Cov[X_t, X_{t+τ}] is independent of t. The time independence of the mean and auto-covariance of X_t implies that the
MA(1) time series is weakly stationary.
Because

Var[X_t] = Cov[X_t, X_{t+0}] = σ² (a² + b²),    (7.23)

the ACF of an MA(1) is

ρ(τ) = \frac{Cov[X_t, X_{t+τ}]}{Var[X_t]} = δ_{0,τ} + \frac{ab (δ_{1,τ} + δ_{−1,τ})}{a² + b²}.    (7.24)
7.6 Autoregressive Process
The motion of a Brownian particle mentioned in Section 7.3 may be modeled by the
following differential equation:
m \frac{dv}{dt} = −bv + f(t),    (7.28)
where m is mass of the particle, v is velocity, t is time, dv/dt is the derivative of v with
respect to t, b is a frictional damping coefficient, and f (t) is white noise forcing due to ran-
dom buffeting by molecules. The damping provides a stabilizing feedback to the stochastic
system to prevent the variance of v from growing indefinitely. This problem was originally
solved by Albert Einstein (1879–1955) as part of his contributions to statistical mechanics.
An equation that includes derivatives is called a differential equation. The solution of
a deterministic differential equation without any random element is a function. When a
random forcing is involved, the differential equation becomes stochastic and its solution is
a time series. The study of differential equations belongs to a branch of mathematics and is
not included in the scope of our book. Here we only discuss the finite difference solution of
Eq. (7.28). By finite difference, we mean to discretize the stochastic differential equation
(7.28) with a time step size ∆t:
m \frac{v(t_n + Δt) − v(t_n)}{Δt} = −b v(t_n) + f(t_n),    (7.29)
where tn = n∆t is the time at the nth time step. Denote Xn = v(tn ) and Xn+1 = v(tn + ∆t).
Then
Xn+1 = λ Xn +Wn , (7.30)
where λ = 1 − b∆t/m is a decay parameter with a bound 0 ≤ λ < 1, and Wn = f (tn )∆t/m
is white noise. The condition 0 ≤ λ implies that ∆t ≤ m/b, a restriction on the time step
size.
This model implies that the time series one step ahead is a regression to the current step
with white noise as its intercept. This is a first-order autoregressive model, denoted by
AR(1).
We claim that the AR(1) process is an infinitely long moving average process MA(∞).
This can be demonstrated by the following derivations.
Multiplying Eq. (7.30) by (1/λ)^{n+1} yields

\frac{X_{n+1}}{λ^{n+1}} − \frac{X_n}{λ^n} = \frac{W_n}{λ^{n+1}}.    (7.31)

Summing this equation over n from 0 to N − 1 results in

\frac{X_N}{λ^N} − \frac{X_0}{λ^0} = \frac{1}{λ} \sum_{n=0}^{N−1} \frac{W_n}{λ^n}.    (7.32)

Multiplying Eq. (7.32) by λ^N gives

X_N = λ^N X_0 + λ^{N−1} \sum_{n=0}^{N−1} \frac{W_n}{λ^n}.    (7.33)

This formula means that X_N is the sum of the decaying initial condition λ^N X_0 plus a weighted
sum of the white noise terms from time step 0 to N − 1. Consequently, you may regard the
AR(1) process as MA(∞) when N is large.
The first term of Eq. (7.33) represents the decay of the initial condition X_0 when |λ| < 1.
When N is large enough, λ^N < 1/e and hence the process forgets the initial condition after
the time step N. This condition leads to

N > \frac{1}{|\ln λ|}.    (7.34)

If λ = 0.9, then N > 10, and if λ = 0.7, then N > 3.
The second term of Eq. (7.33) is an MA(N) process and is hence stationary. Thus, the AR(1)
process is stationary as time goes to infinity. This is called asymptotic stationarity.
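The following minimal Python sketch (with illustrative parameter values) verifies Eq. (7.33) numerically: iterating the AR(1) recursion (7.30) gives the same value as the decaying initial condition plus the weighted sum of the white noise terms.

import numpy as np
np.random.seed(1)
lam, N, x0 = 0.9, 50, 4.0
w = np.random.normal(size=N)       # white noise W_0, ..., W_{N-1}
x = x0
for n in range(N):                 # AR(1) recursion, Eq. (7.30)
    x = lam*x + w[n]
x_ma = lam**N * x0 + sum(lam**(N-1-n) * w[n] for n in range(N))  # MA form, Eq. (7.33)
print(np.isclose(x, x_ma))         # True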
Figure 7.9 shows four realizations of an AR(1) time series in color and their mean in a
thick black line. The parameters are x0 = 4, λ = 0.9, σ = 0.25. All four realizations start
at the same point: x0 = 4. The decay parameter λ = 0.9 leads to 1/| ln λ | = 9.5, implying
that the initial condition is forgotten when time reaches a number larger than 10, say 20 or
30. The figure supports this claim. The simulations show that the decay is completed in the
first 30 time steps, and the stationary random variations dominate thereafter. The mean of
the four time series realizations shows the decay even more clearly. The
white noise has zero mean and 0.25 as its standard deviation, which determines the range
of fluctuations of the time series.
Figure 7.9 can be generated by the following computer code.
# R plot Fig. 7.9: Autoregressive time series AR(1)
set.seed(791)
n = 121    # Number of time steps
m = 4      # Number of realizations
lam = 0.9  # Decay parameter
x = matrix(rep(4, n*m), nrow = m)  # x0 = 4
# Simulate the time series data
Figure 7.9 The first 120 steps of four realizations of an AR(1) time series (the four color lines) and their mean (the thick black line)
for (k in 1:m){
  for (i in 1:(n-1)){
    x[k, i+1] = x[k, i]*lam +
      rnorm(1, mean = 0, sd = 0.25)
  }
}
# Plot the realizations and their mean
plot.ts(x[1,], type = 'l', ylim = c(-2, 4),
        main = "Realizations of an AR(1) time series and their mean",
        xlab = "Time steps", ylab = expression(X[t]),
        cex.lab = 1.5, cex.axis = 1.5,
        col = 'purple')
lines(x[2,], col = 'blue')
lines(x[3,], col = 'red')
lines(x[4,], col = 'orange')
lines(colMeans(x), lwd = 3)
This computer code is based on the definition of AR(1): Eq. (7.30). One can also do the
AR(1) simulation using the moving average formula (7.33).
Based on Eq. (7.33), when X0 = 0, the AR(1) process in terms of white noise is
X_N = λ^{N−1} \sum_{n=0}^{N−1} \frac{W_n}{λ^n}.    (7.35)
The variance of X_N is

Var[X_N] = Cov[X_N, X_N] = σ² \frac{1 − λ^{2N}}{1 − λ²}.    (7.39)

This is an increasing function of N. When N is large, λ^{2N} is negligible because |λ| < 1.
Hence,

Var[X_N] → \frac{σ²}{1 − λ²}  as N → ∞.    (7.40)
So, the variance approaches a constant from below as N goes to ∞.
The autocorrelation function (ACF) of AR(1) with a lag l is thus

ρ_l = \frac{Cov[X_N, X_{N+l}]}{Var[X_N]} = λ^l.    (7.41)
The zero mean, the independence of the autocorrelation from N, and the existence of the
limit of variance in Eq. (7.40) imply that the AR(1) process becomes approximately sta-
tionary when N is large. After this time, the initial condition is forgotten, and the stochastic
process completes its transition to a stationary process. Therefore, the AR(1) process is
asymptotically stationary.
Another perspective is the decaying property of the ACF ρ_l. Since the decay parameter satisfies |λ| < 1,
λ^l becomes very small when the time lag l is large. Thus, the ACF approaches zero as the time
lag increases. The initial condition X_0 is forgotten after a certain number of steps in the
AR(1) process.
The third property is the sum of the autocorrelation over all time lags, denoted by τ. This
sum is equal to the characteristic time scale of the AR(1) process, as justified below:

τ = \sum_{l=0}^{∞} ρ_l = \sum_{l=0}^{∞} λ^l = \frac{1}{1 − λ}.    (7.42)

This agrees with the dynamical equation (7.28) for the Brownian motion, where

γ = \frac{b}{m} Δt = 1 − λ    (7.43)
is a parameter related to the damping coefficient b, mass m, and time step size ∆t. Its inverse
is the correlation interval or characteristic time scale of decay τ = 1/γ whose time unit is
∆t. Namely, τ∆t is the characteristic time.
When λ → 1, the time scale τ → ∞. This is because λ → 1 implies that the frictional damping
coefficient b in the equation of motion (7.28) of the Brownian particle becomes zero. Thus, the damping
mechanism disappears, which implies an infinitely long time scale. In the absence of
damping, with b = 0, the Brownian motion becomes a random walk X_{n+1} = X_n + W_n. The
variance of this random walk grows linearly with time, in contrast to the damped Brownian
motion (the AR(1) process), whose variance approaches the constant σ²/(1 − λ²) described in Eq.
(7.39).
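As a numerical check of Eq. (7.42), the following Python sketch (with illustrative parameter values) sums the sample ACF of a long simulated AR(1) series and compares it with the theoretical time scale 1/(1 − λ).

import numpy as np
np.random.seed(42)
lam, sigma, n = 0.9, 0.25, 50000
x = np.zeros(n)
for i in range(1, n):                    # simulate AR(1), Eq. (7.30)
    x[i] = lam*x[i-1] + np.random.normal(0, sigma)
acf_sum = 1.0                            # lag-0 autocorrelation
for l in range(1, 60):
    acf_sum += np.corrcoef(x[:-l], x[l:])[0, 1]
print(acf_sum, 1/(1 - lam))              # both are approximately 10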
Figure 7.10 shows the ACF for two different decay parameters: λ = 0.9 and λ = 0.6. The
corresponding time scales are thus τ = 10 and τ = 2.5. The figure shows that ACF is less
than 1/e = 0.37 when the lag is larger than 11 for λ = 0.9, and 4 for λ = 0.6. Therefore,
the simulation agrees well with the theoretical result based on mathematical derivations.
Figure 7.10 ACFs and the corresponding time scales and decay parameters.
The fluctuations of small ACF values (mostly bounded inside [−0.1, 0.1] indicated by
two blue dashed lines in Figure 7.10) for large time lags are likely due to the insufficient
length of the sample data string and the noise in numerical simulations. Theoretically, these
values should be almost zero, because ρl = λ l → 0 as l → ∞.
Figure 7.10 may be generated by the following computer code.
# R plot Fig. 7.10: ACF of AR(1)
setwd('/Users/sshen/climstats')
n = 481  # Number of time steps
m = 2    # Number of realizations
lam = c(0.9, 0.6)  # Decay parameter
x = matrix(rep(4, n*m), nrow = m)  # x0 = 4
# Simulate the time series data
for (k in 1:m){
  for (i in 1:(n-1)){
    x[k, i+1] = x[k, i]*lam[k] +
      rnorm(1, mean = 0, sd = 0.25)
  }
}
# Plot the auto-correlation function
setEPS()  # Automatically saves the .eps file
postscript("fig0710.eps", width = 10, height = 5)
par(mfrow = c(1, 2))
par(mar = c(4.5, 4.5, 3, 1))
acf(x[1,], lag.max = 36,
    main = 'Auto-correlation function of AR(1)',
    xlab = 'Time lag',
    cex.lab = 1.5, cex.axis = 1.5)
text(20, 0.8, bquote('Decay parameter' ~ lambda == 0.9),
     cex = 1.5)
par(mar = c(4.5, 4.5, 3, 0.3))
acf(x[2,], lag.max = 36,
    main = 'Auto-correlation function of AR(1)',
    xlab = 'Time lag', col = 'red',
    cex.lab = 1.5, cex.axis = 1.5)
text(20, 0.8, expression('Decay parameter' ~ lambda == 0.6),
     col = 'red', cex = 1.5)
dev.off()
# add labels
ax[0].set_title('Auto-correlation function of AR(1)',
                pad = 15)
ax[0].set_xlabel('Time lag', labelpad = 15)
ax[0].set_ylabel('ACF', labelpad = -5)
ax[0].set_xticks([0, 5, 15, 25, 35])
ax[0].set_yticks([-0.2, 0.2, 0.6, 1.0])
ax[0].annotate(r'Decay parameter $\lambda$ = 0.9',
               xy = (5, .9), size = 18)
ax[1].set_title('Auto-correlation function of AR(1)',
                pad = 15)
ax[1].set_xlabel('Time lag', labelpad = 15)
ax[1].set_ylabel('ACF', labelpad = 10)
ax[1].set_xticks([0, 5, 15, 25, 35])
ax[1].set_yticks([0, 0.4, 0.8])
ax[1].annotate(r'Decay parameter $\lambda$ = 0.6',
               xy = (5, .9), color = 'red', size = 18)
plt.tight_layout()
plt.savefig("fig0710.pdf")  # save figure
plt.show()
7.7 Fit Time Series Models to Data
In the previous sections about time series, we described mathematical properties and sim-
ulations of MA and AR models. This section answers a question: For a given time series
dataset, can we find an appropriate time series model so that the model can best fit the
data? This is also called a model estimation problem.
Let us find out how to use the given data x1 , x2 , . . . , xN in an MA(1) process
Xn = aWn + bWn−1 , Wn ∼ N(0, σ 2 ) (7.44)
to estimate the model parameters a, b, and σ . The idea is to use the correlation coefficient
ρ computed from the data.
Neither a = 0 nor σ = 0, because the former would mean white noise and the latter a
deterministic process, and our MA(1) is neither. Thus, aσ ≠ 0. We can normalize the MA
model Eq. (7.44) by dividing both sides of the equation by aσ. This leads to
Yn = Zn + θ Zn−1 , Zn ∼ N(0, 1) (7.45)
where
Y_n = \frac{X_n}{aσ},   Z_n = \frac{W_n}{σ},   θ = \frac{b}{a}.    (7.46)
Compute the ACF ρ(τ) as follows:

ρ(τ) = \frac{E[Y_{n+τ} Y_n]}{E[Y_n²]}
     = \frac{E[(Z_{n+τ} + θ Z_{n+τ−1})(Z_n + θ Z_{n−1})]}{E[(Z_n + θ Z_{n−1})(Z_n + θ Z_{n−1})]}
     = \begin{cases} 1, & \text{when } τ = 0 \\ \frac{θ}{1 + θ²}, & \text{when } τ = 1 \\ 0, & \text{when } τ > 1 \end{cases}    (7.47)

because

E[Z_{n+τ} Z_n] = δ_{τ,0}.    (7.48)

This formula provides an equation between the model parameter θ and the data correlation
ρ(1):

ρ(1)(1 + θ²) = θ.    (7.49)
Equation (7.49) is a quadratic in θ and has two solutions; hence the MA(1) model is not unique. However,
the two roots are reciprocals of each other (their product is 1), and the two models represent the same process.
Therefore, for a given data sequence x1 , x2 , . . . , xN , we may compute the lag-1 correlation
ρ(1) using the following formula:
ρ(1) ≈ \frac{\sum_{n=1}^{N−1} (x_{n+1} − x̄)(x_n − x̄)}{\sum_{n=1}^{N−1} (x_n − x̄)²},    (7.50)

where

x̄ = \frac{\sum_{n=1}^{N} x_n}{N}    (7.51)
is the data mean. Then, the MA(1) parameter θ in the model (7.45) can be computed by
solving Eq. (7.49).
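The sketch below is a minimal Python illustration of this procedure on simulated MA(1) data (the parameter values are illustrative): compute the lag-1 correlation by Eqs. (7.50)–(7.51) and solve the quadratic equation (7.49) for θ.

import numpy as np
np.random.seed(101)
w = np.random.normal(size=5001)
x = 0.8*w[1:] + 0.2*w[:-1]       # simulated MA(1) with a = 0.8, b = 0.2, so theta = 0.25
xbar = x.mean()                   # Eq. (7.51)
rho1 = np.sum((x[1:] - xbar)*(x[:-1] - xbar)) / np.sum((x[:-1] - xbar)**2)  # Eq. (7.50)
theta = np.roots([rho1, -1, rho1])  # roots of rho1*theta^2 - theta + rho1 = 0, Eq. (7.49)
print(theta)                        # two reciprocal roots; the one with |theta| < 1 is near 0.25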
Similar to the model estimation for MA(1), we can also use the data sequence x_1, x_2, . . . , x_N
to estimate the AR(1) model:

X_{n+1} = λ X_n + W_n.    (7.52)

The ACF of this AR(1) model is

ρ(τ) = λ^τ.    (7.53)

Thus, we can use the given data sequence and formulas (7.50) and (7.51) to estimate ρ(1),
which serves as the estimate of λ in the AR(1) model (7.52). Consequently, the fitted AR(1) model is
X_{n+1} = ρ(1) X_n + W_n.
A random walk

X_t = X_{t−1} + W_t    (7.56)

can be rewritten as

X_t − X_{t−1} = W_t.    (7.57)

The right-hand side is white noise. The left-hand side is a difference, which may be
regarded as a discretization of the first derivative, as explained for the motion equation
of a Brownian particle. It is known that the random walk is nonstationary, but its difference
time series

X_t^{(1)} = X_t − X_{t−1}    (7.58)

is stationary.
It sometimes happens that the difference X_t^{(1)}, also called the first difference, of a
nonstationary time series becomes stationary. For example, a monthly air temperature
anomaly time series is often nonstationary, but its month-to-month changes may be sta-
tionary, although the stationarity still needs to be rigorously verified by a proper statistical
hypothesis test, such as the ADF test discussed earlier in this chapter. The method of dif-
ference time series has been used in analyzing real climate data. Peterson et al. (1998) used
the first difference X_t^{(1)} method to include more long-term stations to calculate the global
average temperature change. Smith et al. (1998) used the method for studying the variation
of the historical SST data.
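A minimal Python sketch (illustrative, using a simulated random walk rather than real station data) shows the idea: the original series fails the ADF stationarity test, but its first difference passes. It assumes the statsmodels package.

import numpy as np
from statsmodels.tsa.stattools import adfuller
np.random.seed(3)
x = np.cumsum(np.random.normal(size=1000))   # a nonstationary random walk
print(adfuller(x)[1])                         # large p-value: nonstationary
print(adfuller(np.diff(x))[1])                # small p-value: the first difference is stationary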
Sometimes, the first difference time series is still nonstationary, but the second or third
difference time series is stationary. You may use X_t^{(d)} to denote the difference time series
of the dth-order difference, which may be regarded as a formula approximating the
dth-order derivative of X, e.g.,

X_t^{(2)} = X_t − 2X_{t−1} + X_{t−2}    (7.59)

for d = 2. Of course, there are cases where the difference time series of any order is
still nonstationary, although the method of difference time series is a powerful tool for
analyzing climate data sequences.
A further extension is to the sum of an AR(p) and an MA(q). The resulting model is called
ARIMA(p, d, q):

X_t^{(d)} = c + φ_1 X_{t−1} + φ_2 X_{t−2} + · · · + φ_p X_{t−p} + θ_1 W_{t−1} + θ_2 W_{t−2} + · · · + θ_q W_{t−q} + W_t.    (7.67)

The letter "I" in ARIMA stands for "integrated." It means the ARIMA model development
is as if integrating a dth-order differential equation. R has a command to fit time series data
to an ARIMA model.
The R command arima(ts, order = c(p, d, q)) can fit the time series data ts
to an ARIMA(p, d, q) model and calculate all the model parameters. Since Brownian
motion is common in nature, climate science often uses AR(1) models, although the gen-
eral ARIMA(p, d, q) model may be applied in some special cases. In the following, we
show an example of fitting the Mauna Loa CO2 data to an AR(1) model (see Fig. 7.11).
We will also show that an MA(1) model is not a good fit to the data sequence.
Figure 7.11 AR(1) and MA(1) fitting to the monthly Mauna Loa CO2 data
Figure 7.11 shows a very good AR(1) fit by the R command arima(co2.ts, order
= c(1,0,0)): the fitted AR(1) model data (the black dashed line) almost overlap
with the observed data.
The arima fitting is based on the least squares principle, which may be illustrated by the
AR(1) process as an example. The arima function is designed for a time series of zero mean,
so the mean µ is estimated as a parameter and the AR(1) model is written as
X_t − µ = φ(X_{t−1} − µ) + W_t. The arima fitting algorithm minimizes the sum of the residual squares:

S_c(φ, µ) = \sum_{t=2}^{n} W_t².    (7.70)
This is also called the conditional sum-of-squares (CSS) function. Given the time series
data x_t of n time steps, the CSS is a function of φ and µ:

S_c(φ, µ) = \sum_{t=2}^{n} [x_t − µ − φ(x_{t−1} − µ)]².    (7.71)

We use e_t to denote the term inside the square brackets for time step t, i.e.,
e_t = x_t − µ − φ(x_{t−1} − µ).
Minimization of S_c(φ, µ) yields the parameter estimates φ̂ and µ̂. It can be proven that
φ̂ ≈ ρ(1), (7.73)
which is the lag-1 autocorrelation. For the CO2 data, our arima fitting for the AR(1) model
leads to
φ = 0.9996343,  µ = 358.3138084,    (7.74)

or

X_t = 0.1310354 + 0.9996343 X_{t−1} + W_t.    (7.76)
The fitted AR(1) model data x̂t is equal to the observed data minus the residuals
x̂t = xt − et . (7.77)
These are the fitted data depicted by the black dashed line in Figure 7.11. We note that the
CO2 data
xt = x̂t + et , (7.78)
are only a single realization of the stochastic process represented by (7.76), and we cannot
recover the CO2 data by simulation based on (7.76).
Figure 7.11 also shows that the MA(1) model (the blue dotted line) is not a good fit to the
CO2 data. The MA(1) model fitting is also based on the principle of least squares.
We have tested other ARIMA model fittings, such as ARIMA(2, 0, 0) and ARIMA(0, 1,
1). Both models also fit the data well.
The ACF and partial ACF (PACF) are often used to make a calculated guess of the best
ARIMA model. When the ACF ρ(τ) decays geometrically with the time lag τ (like λ^τ for an
AR process), and the PACF drops suddenly to a small value at lag p + 1, AR(p) may be a good model. In the
case of the CO2 data, PACF(1) is almost one, but PACF(2) is around −0.15. This suggests
that AR(1) may be a good model for the CO2 data. Details about the PACF can be found in
modern time series books, such as Shumway and Stoffer (2011).
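A minimal Python sketch (illustrative, using a simulated AR(1) series rather than the CO2 data) of this diagnostic: the PACF is close to one at lag 1 and drops to near zero at lag 2, suggesting an AR(1) model. It assumes the statsmodels package.

import numpy as np
from statsmodels.tsa.stattools import pacf
np.random.seed(7)
n, lam = 2000, 0.9
x = np.zeros(n)
for i in range(1, n):
    x[i] = lam*x[i-1] + np.random.normal()
print(np.round(pacf(x, nlags=3), 2))   # approximately [1.0, 0.9, 0.0, 0.0]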
Figure 7.11 can be reproduced by the following computer code.
# R plot Fig. 7.11: ARIMA model fitting
setwd("/Users/sshen/climstats")
co2m = read.table("data/co2m.txt", header = TRUE)
mon = co2m[, 3]
co2 = co2m[, 5]
co2.ts = ts(co2, start = c(1958, 3), end = c(2020, 7),
            frequency = 12)
# Then fit an AR(1) model
co2.AR1 <- arima(co2.ts, order = c(1, 0, 0))
# Obtain the AR(1) model fit data
AR1_fit <- co2.ts - residuals(co2.AR1)
# Fit an MA(1) model
co2.MA1 <- arima(co2.ts, order = c(0, 0, 1))
# Obtain the MA(1) model fit data
MA1_fit <- co2.ts - residuals(co2.MA1)

# Corresponding Python code fragment (assumes co2_series and data_dates were created earlier)
# Fit AR(1) model
AR1_model = ARIMA(co2_series, order = (1, 0, 0))
AR1_model_fit = AR1_model.fit()
AR1_residuals = AR1_model_fit.resid
AR1_fit = pd.Series(co2_series.values -
                    AR1_residuals.values, data_dates)
# Fit MA(1) model
MA1_model = ARIMA(co2_series, order = (0, 0, 1))
MA1_model_fit = MA1_model.fit()
MA1_residuals = MA1_model_fit.resid
MA1_fit = pd.Series(co2_series.values -
                    MA1_residuals.values, data_dates)
The residuals for the AR(1) model fitting can be verified as follows for t = 2.
mu = 358.3138084
phi = 0.9996343
co2[2] - mu - phi*(co2[1] - mu)
# [1] 1.72442
residuals(co2.AR1)[2]
# [1] 1.724418
7.8 Chapter Summary
This chapter has included the basic theory and computer code of time series for climate
data analysis. Time series is an important branch of statistics. Many excellent books are
available, such as the modern textbook with R and real climate data by Shumway and
Stoffer, and the classical textbook by Box and Jenkins. Our chapter is different from the
comprehensive treatment of time series in these books. Instead, we present a few carefully
selected time series methods, concepts, datasets, R code, and Python code that are useful
in climate data analysis. These methods and concepts are summarized as follows.
(i) ETS decomposition of a time series data sequence: Many time series can be decom-
posed into three ETS components: seasonal cycle (S), trend (T), and random error
(E). R code and Python code are included to make the ETS decomposition and to
generate corresponding graphics. The monthly atmospheric carbon dioxide data and
the daily minimum temperature data are used as examples.
(ii) White noise time series: A white noise Wt at a given time t is normally distributed:
Wt ∼ N(0, σ 2 ), where the zero mean and standard deviation σ do not vary in time.
The autocorrelation with a nonzero lag is zero. We use white noise as building blocks
for the commonly used time series: random walk (RW), autoregression (AR), and
moving average (MA).
(iii) Random walk: The difference time series of a random walk is white noise, i.e.,
each step increment of a RW is W_t. This concept can be extended from 1-dimensional to
n-dimensional random variables at different time steps. An important result of the
random walk is that its variance grows linearly with time.
(iv) Stationary versus nonstationary: Many methods used in published research papers
use statistical or mathematical methods that require the assumption of stationarity.
Section 7.4 suggests you pay careful attention to the concept of stationarity and
provides you a method to test the stationarity of a time series. When the original
data sequence is not stationary, you may consider its difference time series which
may become stationary. The difference time series method is a very important tool
to homogenize the historical climate data from observational stations (see Peterson
et al. 1998 and Smith et al. 1998).
(v) AR(1) process: The AR(1) model is often applicable to fit a climate data sequence
because its difference time series corresponds to the commonly used first derivative
of a climate variable with respect to time. The AR(1) model provides a very good fit
to the Mauna Loa CO2 data. Mathematically, an AR(1) time series may be regarded
as MA(∞).
(vi) When does a time series forget its initial condition? This is an important problem in
climate science, and can be explored via the ACF method.
(vii) Concise theory of the RW, AR, MA, and ARIMA models: We have presented short
derivations of the mathematical theory on white noise, random walk, autoregression
time series, moving average time series, and ARIMA time series. Our approach of
concise derivation was designed to help a climate scientist readily learn the method,
theory, and coding for analyzing real data; it is not intended as a mathematical treatise
for professional statisticians. A proper choice
of the statistical model to analyze a given set of climate time series data is often
motivated by common sense, prior evidence of physics, or climate science theories,
in addition to the statistical model assumptions and hypothesis tests.
References and Further Reading
This paper (Peterson et al. 1998) explains the detailed procedures of the first difference method to
homogenize the station data for the monthly surface air temperature.
[2] M. Romer, R. Heckard, and J. Fricks, 2020: Applied Time Series Analysis,
https://online.stat.psu.edu/stat510/lesson/5/5.1, PennState Statistics
Online. Access date: September 2020.
This time series analysis text with R has many examples of real data. It has
excellent materials on time series decomposition.
[4] R. H. Shumway and D. S. Stoffer, 2016: Time Series Analysis and Its Applications:
With R Examples. 4th ed., Springer.
This book includes many inspiring examples. R code is provided for each
example. The book contains some examples of climate datasets, such as global
average annual mean temperature, precipitation, soil surface temperature, fish
population, El Niño, wind speed, and dew point.
[5] T. M. Smith, R. E. Livezey, and S. S. P. Shen, 1998: An improved method for interpo-
lating sparse and irregularly distributed data onto a regular grid. Journal of Climate,
11, 1717–1729.
This paper includes the concept of first guess, which can form a difference time
series.
Exercises
7.1 Write a computer code to make ETS decomposition for the monthly total precipitation at
Omaha, Nebraska, USA (Station code: USW00014942, 41.3102◦ N, 95.8991◦ W) from
January 1948 to December 2017. Plot the observed data sequence and its ETS compo-
nents. Use 100–300 words to comment on the seasonal, trend, and error components you
have obtained. The data can be downloaded from internet sites such as
www.ncdc.noaa.gov/cdo-web.
You can also use OmahaP.csv data file from data.zip for this book. The file data.zip
can be downloaded from the book website
www.climatestatistics.org.
7.2 Plot the histogram of the random error component from the previous problem on Omaha
precipitation. Comment on the probabilistic distribution of the error data based on this
histogram.
7.3 Write a computer code to make ETS decomposition for the monthly minimum sur-
face air temperature data for a station of your choice from January 1948 to December
2017. Plot the observed data sequence and its ETS components. Use 100–300 words to
comment on the seasonal, trend, and error components you have obtained.
7.4 Following Figure 7.3, use the data from the previous problem and the ETS forecasting
method to forecast the monthly minimum surface air temperature for the next 12 months:
January to December 2018. Compare your forecast data with the observed data of the
same station in the forecasting period January to December 2018.
7.5 Write a computer code to generate two realizations of 1,000 time steps for a white noise
time series with zero mean and standard deviation equal to 1.0. Plot the two realizations
on the same figure. See Figure 7.5 and its computer code for reference.
7.6 Plot the two histograms and two ACF functions for the two realizations of the white
noise in the previous problem. See Figure 7.6 and its computer code for reference.
7.7 From the random walk model defined by

    X_{t+1} = X_t + δ + W_t,   t = 0, 1, . . . , T − 1,

where δ is a fixed constant called drift, W_t ∼ N(0, σ²), and σ² is the variance of the
white noise, show that the mean of X_T is equal to the mean of the initial position plus δT:

    E[X_T] = E[X_0] + δT.    (7.80)
Please include all the details in your mathematical derivation.
7.8 From the random walk X_t defined in the previous problem, show that the variance of X_T
grows linearly with T in the following way:

    Var[X_T] = Var[X_0] + \sum_{t=0}^{T−1} Var[W_t] = Var[X_0] + σ²T.    (7.81)
7.19 Use the ARIMA fitting method to fit the observed CO2 data in Figure 7.1 to an AR(2)
model. Plot the observed CO2 data and the fitted model data on the same figure. You may
reference Figure 7.11 and its relevant computer code, formulas, and numerical results.
7.20 Use the ARIMA fitting method to fit the Omaha monthly precipitation data in Exercise
7.1 to an AR(1) model. Plot the observed data and the fitted model data on the same
figure. You may reference Figure 7.11 and its relevant computer code, formulas, and
numerical results.
8 Spectral Analysis of Time Series
Climate has many cyclic properties, such as seasonal and diurnal cycles. Some cycles are
more definite, e.g., the sunset time of London, UK. Others are less certain, e.g., the mon-
soon cycles and rainy seasons of India. Still others are quasiperiodic with cycles of variable
periods, e.g., El Niño Southern Oscillation and Pacific Decadal Oscillation. In general,
properties of a cyclic phenomenon critically depend on the frequency of the cycles. For
example, the color of light depends on the frequency of electromagnetic waves: red cor-
responding to the energy in the range of relatively lower frequencies around 400 THz
(1 THz = 1012 Hz, 1 Hz = 1 cycle per second), and violet to higher frequencies around
700 THz. Light is generally a superposition of many colors (frequencies). The brightness of
each color is the spectral power of the corresponding frequency. Spectra can also be used to
diagnose sound waves. We can tell if a voice is from a man or a woman, because women’s
voices usually have more energy in higher frequencies while men’s have more energy in
relatively lower frequencies. The spectra of temperature, precipitation, atmospheric pres-
sure, and wind speed often are distributed in frequencies far lower than light and sound.
Spectral analysis, by name, is to quantify the frequencies and their corresponding energies.
Climate spectra can help characterize the properties of climate dynamics. This chapter
will describe the basic spectral analysis of climate data time series. Both R and Python
codes are provided to facilitate readers to reproduce the figures and numerical results in the
book.
The word “spectral” was derived from Latin “spectrum,” meaning appearance or look.
Further, “spec” means to look at or regard. In science, spectrum, or spectra in its plural
form, means the set of colors into which a beam of light can be separated. The colors
are lined up according to their wavelength. For example, the seven colors isolated by a
prism from white sunlight are red, orange, yellow, green, blue, indigo, and violet, lined up
according to the wavelength of each color, from red in longer wavelengths around 700 nm
(1 nm = 10−9 m) to violet in shorter wavelengths around 400 nm. Besides color, light has
another important property: brightness, i.e., the energy in a given color. Spectral analysis
studies both color and brightness. In general, it examines wavelength and its associated
energy for any kind of variations, such as climate.
Figure 8.1 Sine function of different amplitude A, period τ, and phase φ .
t = seq(0, 2, len = 1000)
y1 = sin(2*pi*t)                 # A = 1, tau = 1, phi = 0
y2 = sin(2*pi*t/0.5)             # A = 1, tau = 0.5, phi = 0
y3 = 0.5*sin(2*pi*t - pi/2)      # A = 0.5, tau = 1, phi = -pi/2
y4 = 0.5*sin(2*pi*t/0.5 + pi/6)  # A = 0.5, tau = 0.5, phi = pi/6
plot(t, y1,
     type = 'l', lwd = 2,
     xlab = "", ylab = "T(t)",
     cex.lab = 1.5, cex.axis = 1.5,
     cex.main = 1.5,
     main = expression(paste(
       'Function T(t): A = 1, ', tau, ' = 1, ', phi, ' = 0')))
plot(t, y2,
     type = 'l', lwd = 2,
     xlab = "", ylab = "T(t)",
     cex.lab = 1.5, cex.axis = 1.5,
     cex.main = 1.5, col = 'red',
     main = expression(paste(
       'Function T(t): A = 1, ', tau, ' = 0.5, ', phi, ' = 0')))
plot(t, y3,
     type = 'l', lwd = 2,
     xlab = "Time t", ylab = "T(t)",
     cex.lab = 1.5, cex.axis = 1.5,
     cex.main = 1.5, col = 'blue',
     main = expression(paste(
       'Function T(t): A = 0.5, ', tau, ' = 1, ', phi, ' = ', -pi/2)))
plot(t, y4,
     type = 'l', lwd = 2,
     xlab = "Time t", ylab = "T(t)",
     cex.lab = 1.5, cex.axis = 1.5,
     cex.main = 1.5, col = 'purple',
     main = expression(paste(
       'Function T(t): A = 0.5, ', tau, ' = 0.5, ', phi, ' = ', pi/6)))
dev.off()
Why do we choose sine or cosine? First, sine and cosine functions have important
properties of orthogonality, expressed as follows:
\int_0^1 \sin(2πmx) \sin(2πnx)\, dx = 0,    (8.2)
if m and n are different integers. This is like the zero dot product of two orthogonal vectors.
The orthogonality property turns out to be extremely useful in the analysis of all kinds of
signals, whether climate, electrical, or acoustical. Second, sine and cosine functions are
the simplest periodic orthogonal functions and can be modeled by the x and y coordinates
of a point on a unit circle, or by the harmonics of a simple pendulum oscillation. These
two properties make sine and cosine functions convenient to use in many signal analysis
problems. The signal analysis based on sine or cosine functions and their summations is
known as the Fourier analysis, named after the French mathematician and physicist Jean-
Baptiste Joseph Fourier (1768–1830). This chapter is mainly about the Fourier analysis of
time series.
Figure 8.2 shows a periodic function, whose oscillation is not as regular as a simple
sine function. It is a linear combination of the four sine functions shown in Figure 8.1.
Figure 8.2 Superposition of the four harmonics from Figure 8.1.
The linear combination is also called superposition. In fact, almost all of the periodic func-
tions in geoscience can be expressed as a superposition of simple harmonics. While the
precise statement and proof of this claim are not the aim of this book, we show the intu-
ition, numerical computing, and physical meaning of Fourier analysis, i.e., the spectral
analysis.
Figure 8.2 can be generated by the following computer code.
# R plot Fig. 8.2: Wave superposition
setEPS()
postscript("fig0802.eps", height = 6, width = 10)
par(mar = c(4.5, 4.8, 2.5, 0.5))
plot(t, y1 + y2 + 2*y3 + y4,
     type = 'l', lwd = 4,
     xlab = "Time t", ylab = "T(t)",
     cex.lab = 1.5, cex.axis = 1.5,
     cex.main = 1.5, col = 'blue',
     main = 'Superposition of several harmonics')
dev.off()
The previous section shows that a harmonic oscillation gives a signal with a specified
period τ, whose inverse is the frequency f:

f = \frac{1}{τ}.    (8.3)
If τ = 0.005 [sec], then f = 200 [Hz], meaning 200 cycles per second. The unit Hz,
pronounced as Hertz, is after German physicist Heinrich Hertz (1857–1894).
The frequency of an average adult woman’s voice ranges approximately from 165 to
255 Hz, while that of a man ranges from 85 to 155 Hz. Because women and men’s voice
frequency ranges are not overlapping, our ears can easily detect whether the voice is a
woman’s or a man’s. This is a simple signal detection method: separate signals according to
frequency. This kind of detection is called filtering in engineering. The filtering algorithm
or device is called a filter.
In climate science, we often deal with slower oscillations, such as annual or semiannual
cycles. Thus, the unit Hz is rarely used for the climate data of temperature, precipitation,
or atmospheric pressure. Nonetheless, the meteorological observations often involve high-
frequency instruments, such as meteorology radiosonde operating at a radio frequency
around 403 or 1680 MHz (megahertz) (1 MHz = 106 Hz), weather radar waves in a fre-
quency range of 1–40 GHz (gigahertz) (1 GHz = 109 Hz), and infrared satellite remote
sensing in a range of 10–400 THz (terahertz) (1 THz = 1012 Hz).
If we regard all the signals as superpositions of harmonic oscillations expressed in Eq.
(8.1) as indicated by the Fourier analysis theory, then the frequencies f = 1/τ and the cor-
responding amplitudes A can determine the signal up to a phase shift φ , which determines
when the signal begins in the cycle. The square of the amplitude of a given frequency is
called a spectral component, which varies according to frequency. The plot of the spectral
component as a function of frequency is called power spectrum, or periodogram. Con-
ventionally, engineering often uses periodogram to mean discrete power spectrum. This
chapter focuses on the discrete case. We will use the term periodogram.
We can use these ideas to analyze climate data for signals, such as the surface air
temperature variations of Tokyo, Japan. The mathematical procedure is to transform cli-
mate data in the time domain into the amplitudes squared in the frequency domain, i.e., a
frequency–spectrum relationship.
Consider a time series data vector X = (x_1, x_2, · · · , x_M); the sine functions above can be used to define its discrete sine transform (DST).
Similarly, one can define discrete cosine transform (DCT). Although DCT has been used
frequently in image compression, climate scientists often use the discrete transform involv-
ing both sine and cosine, which are connected to a complex-valued function through
Euler’s formula:
eiθ = cos θ + i sin θ , (8.8)
√
where i = −1 is the imaginary unit, and θ is an angle in geometry or a phase lag in
waves. Leonhard Euler (1707–1783) was a Swiss mathematician and physicist.
Figure 8.3 Polar expression of a complex number z = x + iy on a complex plane.
import pylab as pl
from matplotlib import collections as mc
fig, ax = plt.subplots(figsize = (10, 10))
ax.plot(x, y, color = 'k')  # plot circle
ax.set_xlim(-14, 14)
ax.set_ylim(-14, 14)
ax.axis('off')  # hide axis
# add arrows
ax.annotate("", xy = (14, 0), xytext = (-13, 0),
            arrowprops = dict(width = 0.5, color = 'k'))
ax.annotate("", xy = (0, 14), xytext = (0, -14),
            arrowprops = dict(width = 0.5, color = 'k'))
ax.plot(x2, y2, color = 'k')  # plot angle
# plot line segments
segments = [[(x1, 0), (x1, y1)], [(0, 0), (x1, y1)]]
ls = mc.LineCollection(segments, colors = 'k', linewidths = 2)
ax.add_collection(ls)
# add text annotations
ax.text(10.3, -1.5, "Real Axis", size = 18)
ax.text(0.5, 12.3, "Imaginary Axis", size = 18)
ax.text(-1, -1, "0", size = 18)
ax.text(1, -1.2, r"x = rcos$\theta$", size = 18)
ax.text(5.2, 3, r"y = rsin$\theta$", size = 18)
ax.text(2, 4.7, "r", size = 18)
ax.text(2, 1.3, r"$\theta$", size = 18)
ax.text(5, 9, "(x, y)", size = 18)
ax.text(4.5, 10.5,
        r"z = re$^{i\theta}$ = rcos$\theta$ + $i$rsin$\theta$",
        size = 18)
plt.savefig("fig0803.eps")  # save figure
plt.show()
where t is an integer time step, and k is an integer frequency. It can be proven that this
matrix U is a unitary matrix satisfying

U^H U = I.    (8.21)

The complex unitary matrix U can be decomposed into a real part A plus an imaginary
part B as follows:

U = A + iB,    (8.22)

where

A = Re(U),   B = Im(U).    (8.23)
Figures 8.4(a) and (b) show the real part A and the imaginary part B of the unitary DFT
transformation matrix U when M = 200.
Figure 8.4 Real part and imaginary part of a 200 × 200-order unitary DFT transformation matrix U defined by (8.20).
The real and imaginary parts have many symmetry and other properties: for example, AB^t = 0,
BA^t = 0, and AA^t and BB^t are diagonal matrices.
These properties and patterns help decompose a time series into many components at dif-
ferent frequencies and hence help with the signal analysis of a climate data time series. Also
see the patterns of DFT matrices A and B for M = 32 produced by Zalkow and Müller
(2022).
Figure 8.4 may be generated by the following computer code.
# R plot Fig. 8.4: The unitary DFT matrix
M = 200
i = complex(real = 0, imaginary = 1)
time_freq = outer(0:(M-1), 0:(M-1))
U = exp(i*2*pi*time_freq/M) / sqrt(M)
Ure = Re(U)  # Real part of U
Uim = Im(U)  # Imaginary part of U
setEPS()
postscript("fig0804a.eps", height = 8.1, width = 10)
par(mar = c(4.2, 5.5, 1.5, 0))
# Plot the real part
filled.contour(0:(M-1), 0:(M-1), Ure,
               color.palette = heat.colors,
               # xlab = 't', ylab = 'k',
               plot.axes = {
                 axis(1, cex.axis = 1.8)
                 axis(2, cex.axis = 1.8)},
               plot.title = {
                 title(main = '(a) Real Part of the DFT Unitary Matrix',
                       xlab = "t", cex.lab = 2, cex.main = 1.5)
                 mtext("k", 2, cex = 2, line = 4, las = 1)
               },
               key.axes = axis(4, cex.axis = 2)
)
dev.off()
# Plot the imaginary part
setEPS()
postscript("fig0804b.eps", height = 8.1, width = 10)
par(mar = c(4.2, 5.5, 1.5, 0))
filled.contour(0:(M-1), 0:(M-1), Uim,
               color.palette = rainbow,
               # xlab = 't', ylab = 'k',
               plot.axes = {
                 axis(1, cex.axis = 1.8)
                 axis(2, cex.axis = 1.8)},
               plot.title = {
                 title(main = '(b) Imaginary Part of the DFT Unitary Matrix',
                       xlab = "t", cex.lab = 2, cex.main = 1.5)
                 mtext("k", 2, cex = 2, line = 4, las = 1)
               },
               key.axes = axis(4, cex.axis = 2)
)
dev.off()
# Corresponding Python code fragment (the first line is completed here;
# M and i = 1j are assumed to be defined as in the R code above)
time_freq = np.outer(np.linspace(0, M-1, M),
                     np.linspace(0, M-1, M))
# Construct the unitary DFT matrix U
U = np.exp(i*2*np.pi*time_freq/M) / np.sqrt(M)
Ure = np.real(U)  # get real part of U
Uim = np.imag(U)  # get imaginary part of U
The vector

X̃ = U^H X    (8.25)
is called the DFT of the time series data vector X. Each entry in the complex vector X̃
corresponds to a harmonic oscillation of a fixed frequency. Thus, the DFT transform means
that a time series X can be decomposed into M harmonic oscillations. Based on this idea,
you can reconstruct the original time series from X̃ by multiplying Eq. (8.25) by U from
left:
X = U X̃. (8.26)
In signal analysis, the recovery of the original signal X from spectra X̃ is referred to as
reconstruction or decoding. Although the inverse DFT reconstruction U X̃ is exact, we will
see later that some noise-filtered reconstruction is only an approximation. The approximate
reconstruction is often used in image compression or transmission, since only the important
parts of the spectra are used and other parts are filtered out. So we may regard U X̃ as a
reconstruction, denoted by Xrecon .
When working on a very long data stream, say M = 10^7, computing the DFT X̃ = U^H X
by a direct multiplication with the unitary matrix U takes too much time and computer
memory, measured in the order of O(M²). There is a faster algorithm to do the
computing for DFT, called the fast Fourier transform (FFT). The FFT algorithm commonly
used today was invented by Cooley and Tukey (1965) and continues to be improved. The
computational complexity of the Cooley–Tukey FFT algorithm is O(M log(M)). For a data
sequence of length equal to a million, data size approximately being 4 MB, the FFT's
computational complexity is in the order of 6 × 10^6, while the direct DFT's is 10^{12}, a huge
difference in terms of computing time and memory. This complexity of order 10^{12} means
that even when the 128 GB (i.e., in the order of 10^{11} bytes) memory of your computer
is completely exhausted, you still cannot perform the discrete Fourier transform for a data
sequence of length M = 10^6, but the FFT computation can be performed easily by the same
computer. If it takes a computer 1 second to compute the FFT for M = 10^6, then it will take
about 50 hours to compute the corresponding DFT assuming that the computer has enough
memory. In this case, FFT is 0.2 million times faster than DFT! In fact, since the 1960s,
people have almost exclusively used FFT for practical data analysis. The DFT formulation
is only used for mathematical proofs and for interpreting FFT results. Numerous soft-
ware packages are available for various kinds of FFT algorithms. Almost all computer
languages have an FFT command. The R command for FFT is simply fft(ts), and Python
provides fft(ts) after the import from numpy.fft import fft, ifft.
Different FFT algorithms may use different normalization conventions in the DFT and
iDFT. The FFT result of one FFT algorithm may be equal to another FFT result multi-
plied or divided by a normalization factor, which is often √M. For example, the FFT result
of R version 3.6.3 is equal to the DFT result computed from Eq. (8.25) multiplied by
√M. In the previous DFT R code example, this claim can be verified as follows:
round(fft(ts), digits = 2)
# [1] 19.31+0.00i  -1.55+2.73i  -1.25+1.24i
round(sqrt(M)*ts_dft, digits = 2)
# [1,] 19.31+0.00i
# [2,] -1.55+2.73i
# [3,] -1.25+1.24i
# ......
The detailed FFT algorithm is beyond the scope of this book. Interested readers can find
the relevant materials from numerical analysis books, e.g., Press et al. (2007).
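The following minimal Python sketch (illustrative) makes the same normalization check with numpy: the FFT result equals the unitary DFT of Eq. (8.25) multiplied by √M.

import numpy as np
M = 8
ts = np.random.normal(size=M)
k = np.arange(M)
U = np.exp(1j*2*np.pi*np.outer(k, k)/M) / np.sqrt(M)   # unitary DFT matrix, Eq. (8.20)
ts_dft = np.conj(U).T @ ts                              # X_tilde = U^H X, Eq. (8.25)
print(np.allclose(np.fft.fft(ts), np.sqrt(M)*ts_dft))   # True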
Figure 8.5 Real part and imaginary part of the first four DFT harmonic functions \frac{1}{\sqrt{M}} e^{2πikt/M}, k = 0, 1, 2, 3, with M = 200.
The annual cycle corresponds to the period equal to pk = 12 [month], which means
a frequency fk = 1 [cycle/year]. The semiannual cycle means pk = 6 [month] and fk =
2 [cycle/year].
If it is daily data, then the period unit is [day]:

p_k = \frac{M}{k}  [day],    (8.31)

the frequency unit is

f_k = \frac{1}{p_k}  [cycle/day],    (8.32)

and the annual cycle is

f_k = 365.25 × \frac{1}{p_k}  [cycle/year].    (8.33)

Both climate science and mathematics allow noninteger frequencies.
For the general time step Δt [unit], the period and frequency can similarly be computed:

p_k = \frac{M}{k}  [Δt],    (8.34)

f_k = \frac{1}{p_k}  [cycle/Δt].    (8.35)

For example, when working on the four-time daily output of a Reanalysis climate model,
we have Δt = 6 [hour] and

p_k = 6 × \frac{M}{k}  [hour],    (8.36)

f_k = \frac{1}{p_k}  [cycle/hour],    (8.37)

or

f_k = 8,766 × \frac{1}{p_k}  [cycle/year].    (8.38)
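A minimal Python sketch (illustrative) of these conversions for M = 120 months of data: the harmonic index k = 10 corresponds to the annual cycle and k = 20 to the semiannual cycle.

M = 120                      # ten years of monthly data
for k in (10, 20):
    p_k = M/k                # period in months, Eq. (8.34) with a one-month time step
    f_year = 12*k/M          # frequency in cycles per year
    print(k, p_k, f_year)    # k = 10: 12-month period, 1 cycle/yr; k = 20: 6 months, 2 cycles/yr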
Figure 8.5 may be produced by the following computer code.
# R plot Fig. 8.5: Re and Im of the first four harmonics in DFT
M = 200
time = 1:200
i = complex(real = 0, imaginary = 1)
time_freq = outer(0:(M-1), 0:(M-1))
U = exp(i*2*pi*time_freq/M) / sqrt(M)
Ure = Re(U)  # Real part of U
Uim = Im(U)  # Imaginary part of U
setEPS()
postscript("fig0805.eps", height = 6, width = 8)
layout(matrix(c(1,2,3,4,5,6,7,8),
              nrow = 4, ncol = 2),
       heights = c(0.92, 0.7, 0.7, 1.16,
                   0.92, 0.7, 0.7, 1.16),
       widths = c(4, 3.3)  # Widths of the 2 columns
)
par(mar = c(0,5,3,0))  # Zero space between (a) and (b)
plot(time, Ure[4,], pch = 16, cex = 0.3, xaxt = "n",
     yaxt = "n", xlab = "", ylab = "k = 3",
     cex.axis = 1.6, cex.lab = 1.6,
     main = 'Real part of DFT harmonics')
par(mar = c(0,5,0,0))
plot(time, Ure[3,], pch = 16, cex = 0.3, xaxt = "n",
     yaxt = "n", xlab = "", ylab = "k = 2",
     cex.axis = 1.6, cex.lab = 1.6)
par(mar = c(0,5,0,0))
plot(time, Ure[2,], pch = 16, cex = 0.3, xaxt = "n",
     yaxt = "n", ylim = c(-0.1, 0.1),
     xlab = "", ylab = "k = 1",
     cex.axis = 1.6, cex.lab = 1.6)
par(mar = c(6,5,0,0))
plot(time, Ure[1,], pch = 16, cex = 0.3,
     xaxt = "n", yaxt = "n",
     ylim = c(-0.1, 0.1),
     xlab = "t", ylab = "k = 0",
     cex.axis = 1.6, cex.lab = 1.6)
axis(1, at = c(0, 50, 100, 150), cex.axis = 1.6)
axis(2, at = c(-0.1, 0, 0.1), cex.axis = 1.6)

# real k = 1
ax[2,0].plot(time, Ure[1,:], 'k-')
ax[2,0].set_ylabel("k = 1", labelpad = 40, size = 17)
ax[2,0].axes.yaxis.set_ticks([])
# imaginary k = 1
ax[2,1].plot(time, Uim[1,:], 'k-')
ax[2,1].axes.yaxis.set_ticks([])
# real k = 0
ax[3,0].plot(time, Ure[0,:], 'k-')
ax[3,0].set_xlabel("t", size = 17, labelpad = 10)
ax[3,0].set_ylabel("k = 0", labelpad = 15, size = 17)
ax[3,0].set_ylim(-0.1, 0.1)
ax[3,0].tick_params(axis = 'y', labelsize = 15,
                    labelrotation = 90)
ax[3,0].tick_params(axis = 'x', labelsize = 15)
ax[3,0].set_yticks([-0.1, 0, 0.1])
# imaginary k = 0
ax[3,1].plot(time, Uim[0,:], 'k-')
ax[3,1].set_xlabel("t", size = 17, labelpad = 10)
ax[3,1].axes.yaxis.set_ticks([])
plt.savefig("fig0805.eps")  # save figure
plt.show()
The periodogram is also called the spectrum, or power spectrum, since the periodogram
magnitude |X̃(k)|² is an indicator of the energy, or variance, of the oscillation at the kth
harmonic:

X̂(k, t) = X̃(k) × \frac{1}{\sqrt{M}} e^{2πikt/M}.    (8.40)
In signal analysis, the word “power” in the “power spectra” often corresponds to energy or
variance, not power in the sense of physics, where it is energy divided by time.
A large value of |X̃(k)|2 indicates that the time series has significant energy or variance
at frequency k. Most climate signals have an annual cycle. Thus, the corresponding peri-
odogram should have a relatively large value at a proper number k so that the frequency
k corresponds to a year. A time series can have several obvious large values as peaks in
its periodogram. The frequencies k for the peak values of the periodogram of a woman’s
voice have larger k values than those for a man.
Figure 8.6 shows the monthly temperature time series for years 2011–20 from the NCEP/NCAR
Reanalysis data over the grid box that covers Tokyo (35.67◦N, 139.65◦E), Japan.1
The Reanalysis data have a spatial resolution of 2.5◦ latitude–longitude. The grid box that
covers Tokyo is centered at 35◦ N, 140◦ E. The Tokyo temperature time series shows a
clear annual cycle: There were 10 cycles from 2011 to 2020. A small semiannual cycle can
also be seen, but seems not to appear in every year. In fact, there even exists a three-month
cycle, which cannot be seen from Figure 8.6, but can be seen from its periodogram, which
is Figure 8.7.
Figure 8.6 Sample time series of the monthly surface temperature data of Tokyo, Japan, from the NCEP/NCAR Reanalysis data in
the period of January 2011–December 2020.
For the periodogram of the Tokyo monthly temperature data from 2011 to 2020, we
have M = 120 [months], and the time step equal to ∆t = 1 [month]. Before applying the
FFT, the mean of the Tokyo monthly temperature data is computed to be 26.43◦ C and
is removed. Otherwise, the large mean signal can obscure the oscillatory signals that are
more interesting to us. Thus, FFT is applied to the anomaly data with respect to the ten-year
mean 26.43◦ C.
In the periodogram formula |X̃(k)|2 , k represents the number of cycles in the M months
(i.e., 120 months or 10 years). Figure 8.7(a) is the periodogram |X̃(k)|2 . Although this
figure is mathematically sound, its interpretation of k may be confusing. Based on the fre-
quency and period interpretation in the previous section, the periodogram may be plotted
versus cycles per year as shown in Figure 8.7(b), cycles per month as shown in Figure
8.7(c), or period [unit: month] in Figure 8.7(d). Figures 8.7(b) and (d) are often the pre-
ferred options, which clearly shows two peaks: The larger one corresponds to an annual
cycle that has one cycle per year, and the smaller one corresponds to the semiannual cycle
that has two cycles per year. The larger peak’s period is 12 months and the smaller peak’s
period is 6 months, as shown in Figure 8.7(d). The annual cycle is due to the Earth’s incli-
nation angle when orbiting around the sun. The semiannual cycle is mainly due to the fact
that the Sun crosses the equator twice a year. Both sea surface temperature and sea level of
the Japan Sea have semiannual cycles.
1 NCEP stands for the NOAA National Centers for Environmental Prediction. NCAR stands for the U.S. National
Center for Atmospheric Research.
Figure 8.7 Periodogram of the monthly surface temperature data of Tokyo, Japan, from the NCEP/NCAR Reanalysis data in the
period of January 2011–December 2020.
Figures 8.6 and 8.7 may be plotted by the following computer code.
# R plot of Figs. 8.6 and 8.7: Tokyo temperature and spectra
library(ncdf4)
nc = ncdf4::nc_open("/Users/sshen/climstats/data/air.mon.mean.nc")
Lon <- ncvar_get(nc, "lon")
Lat1 <- ncvar_get(nc, "lat")
Time <- ncvar_get(nc, "time")
library(chron)
month.day.year(1297320/24,
               c(month = 1, day = 1, year = 1800))
# 1948-01-01
NcepT <- ncvar_get(nc, "air")
dim(NcepT)
# [1] 144  73 878,
# i.e., 878 months = 1948-01 to 2021-02, 73 years 2 mons
# Tokyo (35.67N, 139.65E) monthly temperature data 2011-2020
Lat1[23]
# [1] 35 oN
Lon[57]
# [1] 140
f_mon = kk/M
plot(f_mon, Mod(TokyoT_FFT)[1:60]^2,
     type = 'l', lwd = 2,
     xlab = 'Cycles per month', ylab = 'Spectral Power',
     cex.lab = 1.5, cex.axis = 1.5,
     main = 'Periodogram in terms of month')
text(0.45, 21000, '(c)', cex = 2)
# axis(1, at = c(0.08, 0.17, 0.33, 0.50), cex.axis = 1.6)
f_year = 12*kk/M
plot(f_year, Mod(TokyoT_FFT)[1:60]^2,
     type = 'l', lwd = 2,
     xlab = 'Cycles per year', ylab = 'Spectral Power',
     cex.lab = 1.5, cex.axis = 1.5,
     main = 'Periodogram in terms of year')
text(5.5, 21000, '(b)', cex = 2)
tau = 120/kk[0:60]
plot(tau, Mod(TokyoT_FFT)[1:60]^2,
     log = 'x', xaxt = "n",
     type = 'l', lwd = 2,
     xlab = 'Period in Months (in log scale)',
     ylab = 'Spectral Power',
     cex.lab = 1.5, cex.axis = 1.5,
     main = 'Periodogram in terms of period')
text(90, 21000, '(d)', cex = 2)
axis(1, at = c(3, 6, 12, 24, 48, 96), cex.axis = 1.6)
dev.off()

# plotting
fig, ax = plt.subplots()
ax.plot(t1, TokyoT, 'k-', linewidth = 2)
ax.set_title("NCEP Monthly Tokyo Temperature: 2011-2020",
             pad = 15)
ax.set_xlabel("Time [month]", labelpad = 15)
ax.set_ylabel("Temperature [$\degree$C]", labelpad = 15)
ax.set_xticks([2012, 2014, 2016, 2018, 2020])
ax.set_yticks([22, 24, 26, 28])
# save figure
The spectra characterize physical properties of a signal, such as voice tone, color, and seasonal variations of climate. For instance, the second har-
monic in the Tokyo spectrum can be traced to the semiannual harmonic in the sunlight
driver.
Given the same variance, Eq. (8.42) implies that one system may have more variance at a
lower frequency |X̃(1)|2 , and another more variance at a higher frequency |X̃(6)|2 . Thus, if
one increases the annual cycle by an amount of variance E and another increases the diurnal
cycle by the same amount of variance E, they result in the same total variance increase.
However, the two resulting systems have different changes, one with an enhanced annual
cycle, and another with an enhanced daily cycle. Using hourly data, one may explore how
the climate change is manifested in the climate variability at daily, annual, and other time
scales. See the climate change application examples of Parseval’s identity in Dillon et al.
(2016).
The proof of Parseval's theorem is very simple. The iDFT formula (8.26) implies
that

|X|² = |U X̃|²
     = (U X̃)^H (U X̃)
     = X̃^H (U^H U) X̃
     = X̃^H X̃
     = |X̃|².    (8.43)
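A minimal Python sketch (illustrative) verifying this identity numerically with the unitary normalization:

import numpy as np
M = 100
x = np.random.normal(size=M)
x_tilde = np.fft.fft(x) / np.sqrt(M)    # unitary-normalized DFT
print(np.allclose(np.sum(x**2), np.sum(np.abs(x_tilde)**2)))   # True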
Furthermore, for a white noise time series X with Var[X_t] = σ², the covariance matrix of its DFT is

E[X̃ X̃^H] = U^H E[X X^H] U = σ² U^H U = σ² I,

i.e.,

E[X̃(k) X̃^*(l)] = σ² δ_{kl},

where δ_{kl} is the Kronecker delta. This expression implies that the spectra of different
frequencies (when k ≠ l) of white noise are uncorrelated. The expected value of the
periodogram (when k = l) is a constant σ², i.e.,

E[|X̃(k)|²] = σ²,   k = 1, 2, . . . , M.    (8.46)
Example 8.1 Calculate the power spectral density (PSD) function of the following
damped Brownian motion equation:
dx x
+ = Z(t), (8.51)
dt τ0
where τ0 is the characteristic time scale, Z(t) ∼ N(0, σ 2 ) is white noise, t is time in
(−∞, ∞), and x is the random variable in the damped Brownian motion. This models a
continuous AR(1) process.
The Fourier transform of this differential equation yields
−2πi f x̃ + x̃/τ0 = Z̃. (8.52)
This leads to
x̃ = τ0 Z̃(f) / (1 − 2πiτ0 f), (8.53)
where
Z̃ = FT(Z). (8.54)
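Taking the squared modulus of (8.53) and using E|Z̃(f)|² = σ² leads to the PSD p(f) = τ0²σ²/(1 + 4π²τ0²f²), the function referred to as (8.55) in Exercises 8.20 and 8.21. The short R sketch below plots this shape; the values of τ0 and σ are illustrative assumptions.
# Plot the PSD shape of the damped Brownian motion (continuous AR(1)) model
tau0 = 1.0     # characteristic time scale (illustrative value)
sigma = 1.0    # white-noise standard deviation (illustrative value)
f = seq(-2, 2, by = 0.01)
psd = tau0^2 * sigma^2 / (1 + 4 * pi^2 * tau0^2 * f^2)
plot(f, psd, type = 'l', lwd = 2,
     xlab = 'Frequency f', ylab = 'Power spectral density',
     main = 'PSD of the damped Brownian motion model')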
The SVD expression of a space-time data matrix has a counterpart for a piecewise continuous function g(x,t):
g(x,t) = ∑_k g_k ψ_k(x) G_k(t),
where g_k corresponds to the eigenvalue, ψ_k(x) to the spatial eigenvector, and G_k(t) to the temporal eigenvector. A special case of this is that x has only one point, and the function g(x,t) reduces to a continuous time series x(t), which can have the following expansion:
x(t) = ∑_{k=−∞}^{∞} x̃(k) (1/√T) e^{i2πkt/T}. (8.57)
This infinite series is called the Fourier series over the time interval [−T /2, T /2]. The
quantities x̃(k) are called Fourier coefficients:
x̃(k) = ∫_{−T/2}^{T/2} x(t) (1/√T) e^{−i2πkt/T} dt. (8.58)
Although
G_k(t) = (1/√T) e^{i2πkt/T} (8.59)
are orthonormal,
∫_{−T/2}^{T/2} G_k(t) G_l*(t) dt = δ_kl, (8.60)
they are prescribed in advance, in contrast to SVD where the temporal eigenvectors Gk (t)
are determined by a space-time data matrix g(x,t).
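The orthonormality relation (8.60) is easy to verify numerically; the R sketch below uses T = 2 (written Tp to avoid clashing with R's T, the shorthand for TRUE) and two illustrative harmonics.
# Numerical check of the orthonormality relation (8.60)
Tp = 2
Gk = function(t, k) exp(1i * 2 * pi * k * t / Tp) / sqrt(Tp)
inner = function(k, l) {
  # real part of the inner product; the imaginary part integrates to zero
  integrate(function(t) Re(Gk(t, k) * Conj(Gk(t, l))), -Tp/2, Tp/2)$value
}
inner(3, 5)   # approximately 0 because k != l
inner(4, 4)   # equals 1 because k == l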
The infinite sequence
|x̃(k)|² (k = 0, ±1, ±2, . . .)
is called the power spectrum of x(t). This infinite sequence is in contrast to the finite periodogram
|x̃(k)|² (k = 0, 1, 2, . . . , M − 1)
of a discrete time series of length M.
Similar to the energy identity of DFT for a finite discrete time series, the spectra of the continuous x(t) also have an energy identity:
∑_{k=−∞}^{∞} |x̃(k)|² = ∫_{−T/2}^{T/2} x²(t) dt. (8.61)
If x(t) are anomalies of a climate variable, then the right-hand side of Eq. (8.61) is the
variance of the sample variable, and hence the sum of the power spectra is equal to the
variance of the anomalies.
Because ei2πkt/T are periodic functions in t ∈ (−∞, ∞), the Fourier expansion (8.57) is
by default a periodic function over (−∞, ∞) with period equal to T . In practice, x(t) may
not be a continuous function. If there is a jump discontinuity, the series (8.57) converges
to the midpoint of the jump, i.e., (x(c− ) + x(c+ ))/2, where c is the discontinuity point and
x(c− ) is the x value on the left side of c, and x(c+ ) the x value on the right side of c. This
statement is intuitively reasonable because sine and cosine are continuous functions, and it
can be rigorously proved. The proof and the other detailed Fourier series theory are beyond
the scope of this book and can be found in many advanced calculus textbooks, such as
Khuri (2003).
and x(t + 2) = x(t) when the time domain is extended to t ∈ (−∞, ∞). Thus, the period is
T = 2.
Let
R_K(t) = ∑_{k=−K}^{K} [8/(iπ(2k−1))] e^{iπ(2k−1)t}, −1 < t < 1 (8.64)
be a finite sum of the Fourier series (8.62). This is called the partial sum. Figure 8.8 shows
the approximations to x(t) by the real part of RK (t) for K = 3, 10, and 100. The figure shows
that the approximations are excellent in the neighborhood of t = −0.5 and 0.5 where the
function x(t) is smooth and does not have much change, and the approximation is bad at
the discontinuity point t = 0, and the end points, which are also discontinuous points when
the function is periodically extended to infinity. This observation is generally true. The approximation is best at differentiable points (i.e., smooth points), next best at continuous but nondifferentiable points (e.g., sharp cusps), and worst at discontinuous points. At a jump discontinuity, the approximation oscillates very fast and has large errors. This is called the Gibbs phenomenon, named after the American scientist Josiah Willard Gibbs (1839–1903).
Figure 8.8 Approximation to a discontinuous function x(t) by partial sums of its Fourier series (sums of 7, 21, and 201 terms).
Why the Gibbs phenomenon occurs is an important question in physics and is beyond the scope of this book.
Another question concerns the proper value K in order to have a good approximation.
This depends on the smoothness of the function. If a function x(t) is differentiable in the
entire interval, then |x̃(k)| is proportional to 1/k3 and hence does not need a large K to get
a good approximation. We say that the partial sum RK (t) converges very fast. If x(t) has
some nondifferentiable points but is continuous everywhere in the interval, then |x̃(k)| is
proportional to 1/k2 . Thus, the RK (t) convergence is slower. For a discontinuous function
like (8.63), |x̃(k)| is proportional to 1/k, and the RK (t) convergence is very slow.
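A minimal R sketch of the partial sum (8.64) reproduces this behavior; it is an illustration, not the code used to produce Figure 8.8.
# Partial sums R_K(t) of Eq. (8.64) and the Gibbs phenomenon near t = 0
RK = function(t, K) {
  s = 0
  for (k in (-K):K) {
    s = s + (8 / (1i * pi * (2 * k - 1))) * exp(1i * pi * (2 * k - 1) * t)
  }
  Re(s)  # plot the real part of R_K(t), as in Figure 8.8
}
t = seq(-1, 1, by = 0.002)
plot(t, RK(t, 100), type = 'l', col = 'red',
     xlab = 't', ylab = 'x(t)', main = 'Partial sums of a Fourier series')
lines(t, RK(t, 10), col = 'blue')
lines(t, RK(t, 3), col = 'black')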
the perspective of light, sound, or music. You may say that the spectral analysis treats cli-
mate data as music notes and makes the climate data time series “sing” so that scientists
can “hear” and detect climate signals (Lau et al. 1996). The essential components in the
spectral analysis are period or frequency, amplitude, phase, and energy, as well as their var-
iation in space and time. The spectral properties refer to various kinds of relationships among
these components.
This chapter has discussed the following materials:
Computer codes and examples based on real climate data in this chapter may help you
perform spectral analysis on your own time series data. The analysis may make your data
“sing”! Thus, spectral analysis allows you to explore your time domain climate data in the
spectral space. In this way, you may better quantify many properties of climate variation,
such as annual cycles, diurnal cycles, and the El Niño Southern Oscillation.
Note that we have only considered nonrandom functions on a discrete time domain with
the exception of the white noise example, which used expectation values. In climate data
analysis, we often estimate the spectrum of a supposedly infinitely long time series from the finite sample data. We treat the finite time series as if it were a realization from a population of infinitely many realizations. We need to bear in mind that the (sample) periodogram
is computed from a finite data segment and is only an approximation to the spectrum from
the ideal infinitely long stochastic time series. You can simulate this using short time series
data of white noise and their sample periodogram. The ideal spectrum is flat, but the sample
periodogram is not flat. Similar experiments can be done for AR(1) processes.
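The following R sketch carries out this suggested experiment: it simulates a short white-noise series and compares its sample periodogram with the flat ideal spectrum.
# Sample periodogram of short white-noise data vs the flat ideal spectrum
set.seed(42)
M = 128
x = rnorm(M, mean = 0, sd = 1)        # white noise with sigma = 1
pgram = Mod(fft(x))^2 / M             # sample periodogram, expected value sigma^2
plot(0:(M/2), pgram[1:(M/2 + 1)], type = 'l',
     xlab = 'Harmonic k', ylab = 'Periodogram',
     main = 'Sample periodogram of white noise')
abline(h = 1, col = 'red', lwd = 2)   # the ideal flat spectrum sigma^2 = 1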
Modern statistical research over many decades has produced numerous ways to treat the data and obtain more accurate estimates of the true spectrum. Modern texts have covered these methods, and many R and Python codes are available to assist. Our
book does not exhaust all these methods.
References and Further Reading
[1] J. W. Cooley and J. W. Tukey, 1965: An algorithm for the machine calculation of
complex Fourier series. Mathematics of Computation, 19, 297–301.
This seminal work established the most commonly used FFT algorithm, known
as the Cooley–Tukey algorithm. The main idea is to break a large DFT into
many smaller DFTs that can be handled by a computer.
[2] M. E. Dillon, H. A. Woods, G. Wang et al., 2016: Life in the frequency domain: The
biological impacts of changes in climate variability at multiple time scales. Integrative
and Comparative Biology, 56, 14–30.
[3] A. Khuri, 2003: Advanced Calculus with Applications in Statistics. 2nd ed., Wiley-
Interscience.
This book has a chapter devoted to the Fourier series and its statistics applica-
tions. It has rigorous proofs of convergence, continuity, and differentiability.
[4] K. M. Lau and H. Weng, 1996: Climate signal detection using wavelet transform:
How to make a time series sing. Bulletin of the American Meteorological Society, 76,
2391–2402.
This beautifully written paper has an attractive title showing the core spirit of
spectral analysis: make time series data sing. In their spectral analysis, instead
of sine and cosine functions, they used wavelet functions in their transforma-
tion matrices. Wavelets are an efficient way to treat the data when frequencies
change with time.
[5] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, 2007: Numerical Recipes: The Art of Scientific Computing. 3rd ed., Cambridge University Press.
This has been the most popular numerical analysis tool book since the 1980s. It contains many kinds of numerical algorithms, including the FFT, used
in engineering and science.
[6] R. H. Shumway, D. S. Stoffer, 2016: Time Series Analysis and Its Applications With R
Examples. 4th ed., Springer.
The R code and examples in chapter 4 of this book are helpful references.
www.audiolabs-erlangen.de/resources/MIR/FMP/C2/C2_DFT-FFT.html
This is the website of resources for the book by M. Müller entitled Fundamentals of Music Processing Using Python and Jupyter Notebooks, 2nd
ed., Springer, 2021. The website is hosted at the International Audio Labs,
Erlangen, Germany.
Exercises
8.1 Use a computer to plot six cosine functions, each of which has a different amplitude A,
period τ, and phase φ . You can reference Figure 8.1 for this problem.
8.2 Plot the sum of the six cosine functions in the previous problem. You can reference
Figure 8.2 for this problem.
8.3 (a) For M = 10 and on an M × M grid, make a pixel-color plot for the discrete sine
transform (DST) matrix Φ defined by Eq. (8.4). You can reference Figure 8.4 for
this problem.
(b) Make a similar plot for M = 30, and compare the two plots. Hint: You may use
the R command image(x, y, MatrixData), or Python command ax.imshow
(MatrixData).
8.4 This exercise is to test the idea of using DST to filter out noise. Because of the limitation
of computer memory, the DST method can only be tested for a short sequence of data.
You may choose M = 51 or smaller.
(a) Regard ys = 10 sin(t) as signal and yn ∼ N(0, 3²) as noise. Plot the signal ys , noise
yn , and signal plus noise ys + yn on the same figure for 0 ≤ t ≤ 10. See Figure 8.9
as a reference.
(b) Make the DST of the data of signal plus noise ys + yn .
(c) Identify the DST component with the largest absolute value, and replace all the
other components by zero. This is the filtering step, i.e., filtering out certain
harmonics.
Figure 8.9 Data, signal, and noise for a discrete sine transform (DST) filter: M = 51 for 50 time steps in the time interval [0, 10]. The small black circles indicate data at discrete time tm (m = 1, 2, . . . , M) in the continuous time interval 0 ≤ t ≤ 10.
(d) Make a reconstruction based on this modified DST vector with only one nonzero
component. Plot this reconstruction and compare it with the signal in Step (a).
(e) Replace the DST component with the largest absolute value by zero, and make
a reconstruction based on this modified DST vector. Plot this reconstruction and
compare it with the noise in Step (a).
Figure 8.9 can be plotted by the following computer code.
# R plot Fig. 8.9: Data = Signal + Noise for Exercise 8.4
setwd('~/climstats')
setEPS() # Automatically saves the .eps file
postscript("fig0809.eps", height = 5, width = 7)
par(mar = c(4.5, 4, 2.5, 0.2))
set.seed(101)
M = 51
t = seq(0, 10, len = M)
ys = 10*sin(t)
yn = rnorm(M, 0, 3)
yd = ys + yn
plot(t, yd, type = 'o', lwd = 2,
     ylim = c(-20, 20),
     xlab = 't', ylab = 'y',
     main = 'Data, signal, and noise for a DST filter',
     cex.lab = 1.4, cex.axis = 1.4)
legend(0, -16, 'Data = Signal + Noise',
       lty = 1, bty = 'n', lwd = 2, cex = 1.4)
lines(t, ys, col = 'blue', lwd = 4)
legend(0, -10, 'Signal', cex = 1.4,
       lty = 1, bty = 'n', lwd = 4, col = 'blue')
lines(t, yn, col = 'brown')
legend(0, -13, 'Noise', cex = 1.4,
       lty = 1, bty = 'n', col = 'brown')
dev.off()
(c) Compute and plot the periodogram with the period in months as the horizontal axis
(see Figure 8.7(d)).
(d) Make climate interpretations of these three figures in text (limited to 30–200 words).
8.16 Verify Parseval's identity (8.41) for the monthly surface air temperature data X of Central England from January 1659 to December 2021.
8.17 Use FFT to compute and then plot the sample periodogram of the daily mean surface
air temperature Tmin of St. Paul, Minnesota, USA, from January 1, 1941 to December
31, 1949. You can use the txt data StPaulStn.txt from the book data file data.zip
downloadable at the book website. Or you can download the data online from the Global
Historical Climatology Network-Daily (GHCN-D) using the station ID: USW00014927.
(a) Plot the St. Paul data time series from January 1, 1941 to December 31, 1949, a
figure similar to Figure 8.6.
(b) Plot the periodogram with the cycles per year as the horizontal axis (see Figure
8.7(b)).
(c) Plot the periodogram with the period in days as the horizontal axis (see Figure
8.7(d)).
(d) Make climate interpretations of these three figures in text. Please comment on the seasonality of Tmin.
8.18 (a) Isolate the annual component from the FFT analysis of the St. Paul Tmin data from January 1, 1941 to December 31, 1949 in the previous problem. Here, the annual component refers to the reconstructed FFT component as a function of time with a
period approximately equal to 365 days. Plot the annual component as a function
of time from January 1, 1941 to December 31, 1949.
(b) Compute and plot the monthly climatology of this dataset based on the 1941–1949
mean for each month in January, February, . . ., December. You may compute the monthly data first, and then compute the nine-year (1941–1949) mean. You can exclude February
29 from your computing. However, you can also write your code in such a way that
the monthly climatology is computed directly from the daily data.
(c) Compare the figures from Steps (a) and (b) and comment on the annual cycles
defined in different ways.
8.19 (a) Compute and plot the daily anomalies of the St. Paul Tmin from January 1, 1941 to
December 31, 1949 based on the monthly climatology from the previous problem.
(b) Compute and plot the periodogram of the daily anomalies with the period as the
horizontal axis (see Figure 8.7(d)). The period units are in days.
8.20 Evaluate the following integral:
∫_0^∞ τ0²σ² / (1 + 4π²τ0²f²) df, (8.72)
where the integrand is the PSD function defined by (8.55). If the integral is regarded as the total power, explain how the total power depends on the time scale τ0 and variance σ².
8.21 Plot the power spectral density (PSD) as a function of f ∈ (−∞, ∞) for three pairs of
different values of τ0 and σ 2 :
p(f; τ0, σ²) = τ0²σ² / (1 + 4π²τ0²f²). (8.73)
Plot all three curves on the same figure, and explain their differences.
8.22 For the following Fourier series
|sin t| = 2/π − (4/π) ∑_{k=1}^{∞} cos(2kt)/(4k² − 1), (8.74)
plot the function |sin t| and the partial sums
R_K(t) = 2/π − (4/π) ∑_{k=1}^{K} cos(2kt)/(4k² − 1) (8.75)
for K = 1, 2, 9 over the domain t ∈ [−2π, 2π] on the same figure. Comment on the
accuracy of using the partial sum of a Fourier series to approximate a function.
8.23 (a) For the previous problem, plot R3 (t). This is the result of removing the high-
frequency components, i.e., filtering out all the frequencies k ≥ 4. This procedure
is a low-pass filter.
(b) For the previous problem, plot f (t) − R3 (t). This is the result of removing the low-
frequency components, i.e., filtering out all the frequencies k ≤ 3. This procedure
is a high-pass filter.
8.24 Apply a low-pass filter and a high-pass filter to the monthly mean surface air temperature
Tavg of Chicago, USA, from January 1971 to December 2020. You can use the CSV
data ChicagoTavg.csv from the book data file data.zip downloadable at the book
website. Or you can download the data online from the Global Historical Climatology
Network (GHCN) using the station ID: USC00111577. Plot the data and the filtered
results. You can choose your preferred cutoff frequency for your filters.
8.25 (a) Approximate the annual cycle of Chicago surface air temperature Tavg by its
monthly climatology as the mean from 1971 to 2020. Plot the monthly climatology.
(b) Approximate the annual cycle of Chicago surface air temperature Tavg by the har-
monic component of period equal to 12 months. Plot this harmonic on the same
figure as the monthly climatology in Step (a).
(c) Compare the two curves and describe their differences in text (limited to 20–80
words).
8.26 A periodic function in [0, π]
is extended to (−∞, ∞) periodically with a period equal to π. This periodic function has
the following Fourier series:
f(t) = (8/π) ∑_{k=1}^{∞} sin((2k−1)t) / (2k−1)³. (8.77)
This periodic function is extended from (−π, π] to (−∞, ∞) with a period equal to
2π. The Fourier series of this function is given as
f(t) = 1/4 + (2/π) ∑_{k=1}^{∞} (1/k) sin(kπ/4) cos(kt). (8.80)
Plot the function f(t) and the partial sums
R_K(t) = 1/4 + (2/π) ∑_{k=1}^{K} (1/k) sin(kπ/4) cos(kt) (8.81)
for K = 3, 9, 201 over the domain t ∈ (−3π, 3π] on the same figure.
(b) Compare the coefficients of the Fourier series and note the convergence rate of the
series. A smoother function has a faster convergence 1/k3 , and a discontinuous
function has a very slow convergence rate 1/k. Discuss your numerical results of
convergence.
9 Introduction to Machine Learning
Machine learning (ML) is a branch of science that uses data and algorithms to mimic
how human beings learn. The accuracy of ML results can be gradually improved based on
new training data and algorithm updates. For example, a baby learns how to pick an orange
from a fruit plate containing apples, bananas, and oranges. Another baby learns how to sort
out different kinds of fruits from a basket into three categories without naming the fruits.
Then, how does ML work? It is basically a decision process for clustering, classification,
or prediction, based on the input data, decision criteria, and algorithms. It does not stop
here, however. It further validates the decision results and quantifies errors. The errors and
the updated data will help update the algorithms and improve the results.
Machine learning has recently become very popular in climate science due to the avail-
ability of powerful and convenient resources of computing. It has been used to predict
weather and climate and to develop climate models. This chapter is a brief introduction to
ML and provides basic ideas and examples. Our materials will help readers understand and
improve the more complex ML algorithms used in climate science, so that they can go a
step beyond only applying the ML software packages as a black box. We also provide R
and Python codes for some basic ML algorithms, such as K-means for clustering, the sup-
port vector machine for the maximum separation of sets, random forest of decision trees
for classification and regression, and neural network training and predictions.
Artificial intelligence (AI) aims to give machines intelligence, for example, the ability to learn automatically from past data without explicit human programming. Machine learning is a subset of AI. Our chapter here focuses on ML, not general AI.
A few toddlers at a daycare center may learn how to grab the candies near them. A fight may break out over a candy at a location that is not obviously closer to one toddler than to another. The K-means method can help divide the candies among the toddlers in a fair way.
Based on the historical weather data over a country, can ML decide the climate regimes for the country? Can ML determine the ecoregions of a country? Can ML define the regimes of wildfire over a region? The K-means clustering method can be useful for answering these questions.
The aim of K-means clustering is to divide N points into K clusters so that the total within
cluster sum of squares (tWCSS) is minimized. Here we use 2D data points to describe
tWCSS and the K-means algorithm, although the K-means method can be formulated in
higher dimensions. We regard the data (x1 , x2 , . . . , xN ) as the 2D coordinates of N points.
For example, we treat x1 = (1.2, −7.3) as observational data. Assume that these N points
can be divided into K clusters (C1 ,C2 , . . . ,CK ), where K is subjectively given by you, the
user who wishes to divide the N points into K clusters by the K-means method. Let x ∈ Ci ,
i.e., points within cluster Ci . The number of points in cluster Ci is unknown and is to
be determined by the K-means algorithm. The total WCSS is defined by the following
formula:
tWCSS = ∑_{i=1}^{K} ( ∑_{x∈C_i} ‖x − µ_i‖² ), (9.1)
where
µ_i = (1/K_i) ∑_{x∈C_i} x (9.2)
is the mean of the points within the cluster C_i, where K_i is the number of points in C_i. Thus, we have K means, hence the name of the K-means method. The part in the parentheses in Eq. (9.1) is called the within cluster sum of squares (WCSS):
WCSS_i = ∑_{x∈C_i} ‖x − µ_i‖². (9.3)
This is defined for each cluster, and tWCSS is the sum of these WCSSi for all the K
clusters.
Some publications use WCSS or Tot WCSS, instead of tWCSS, to express the right-
hand side of Eq. (9.1). Be careful when reading literature concerning the WCSS definition.
Some computer software for K-means may even use a different tWCSS definition.
Our K-means computing algorithm is to minimize the tWCSS by optimally organizing
the N points into K clusters. This algorithm can assign each data point to a cluster. If we
regard µ i as the centroid or center of cluster Ci , we may say that the points in cluster Ci
have some kind of similarity, such as similar climate or similar ecological characteristics,
based on certain criteria. In the following, we use two simple cases (N = 2 and N = 3) to
illustrate the K-means method and its solutions.
(i) Case N = 2:
When N = 2, if we assume K = 2, then
µ_1 = x_1, µ_2 = x_2 (9.4)
and
tWCSS = ∑_{i=1}^{2} ∑_{x∈C_i} ‖x − µ_i‖² = 0. (9.5)
Example 9.1 Find the two K-means clusters for the following three points:
P1 (1, 1), P2 (2, 1), P3 (3, 3.5). (9.14)
This is a K-means problem with N = 3 and K = 2. The K-means clusters can be found by
a computer command, e.g., kmeans(mydata, 2) in R. Figure 9.1 shows the two K-means
clusters with their centers at
C1 (1.5, 1), C2 (3, 3.5) (9.15)
and
tWCSS = 0.5. (9.16)
Points P1 and P2 , indicated by red dots, are assigned to cluster C1 , whose center is marked
by a red cross. Point P3 , indicated by the sky blue dot, is assigned to cluster C2 , whose
Figure 9.1 The K-means clustering results for three points (N = 3) and two clusters (K = 2).
center is marked with a blue cross and obviously overlaps with P3 . This result agrees with
our intuition since P1 and P2 are together. The three points can have two other possibilities
of combination: C1 = {P1 , P3 } and C1 = {P2 , P3 }. The case C1 = {P1 , P3 } leads to tWCSS =
5.125, and C1 = {P2 , P3 } to tWCSS = 3.625. Both tWCSS are greater than 0.5. Thus, these
two combinations should not be a solution of the K-means clustering. These numerical
results may be produced by the following computer codes.
# R code: tWCSS calculation for N = 3 and K = 2
N = 3
mydata <- matrix(c(1, 1, 2, 1, 3, 3.5),
                 nrow = N, byrow = TRUE)
x1 = mydata[1, ]
x2 = mydata[2, ]
x3 = mydata[3, ]
# Case C1 = (P1, P2)
c1 = (mydata[1, ] + mydata[2, ])/2
c2 = mydata[3, ]
tWCSS = norm(x1 - c1, type = '2')^2 +
  norm(x2 - c1, type = '2')^2 +
  norm(x3 - c2, type = '2')^2
tWCSS
#[1] 0.5
# Case C1 = (P1, P3)
c1 = (mydata[1, ] + mydata[3, ])/2
c2 = mydata[2, ]
norm(x1 - c1, type = '2')^2 +
  norm(x3 - c1, type = '2')^2 +
  norm(x2 - c2, type = '2')^2
#[1] 5.125
# Case C1 = (P2, P3)
c1 = (mydata[2, ] + mydata[3, ])/2
c2 = mydata[1, ]
norm(x2 - c1, type = '2')^2 +
  norm(x3 - c1, type = '2')^2 +
  norm(x1 - c2, type = '2')^2
#[1] 3.625
# Python code: tWCSS calculation for N = 3 and K = 2
import numpy as np
mydata = np.array([[1, 1], [2, 1], [3, 3.5]])
x1 = mydata[0, :]
x2 = mydata[1, :]
x3 = mydata[2, :]
# case C1 = (P1, P2)
c1 = (x1 + x2)/2
c2 = mydata[2, :]
tWCSS = np.linalg.norm(x1 - c1)**2 +\
        np.linalg.norm(x2 - c1)**2 +\
        np.linalg.norm(x3 - c2)**2
print(tWCSS)
#0.5
# case C1 = (P1, P3)
c1 = (x1 + x3)/2
c2 = mydata[1, :]
print(np.linalg.norm(x1 - c1)**2 +\
      np.linalg.norm(x3 - c1)**2 +\
      np.linalg.norm(x2 - c2)**2)
#5.125
# case C1 = (P2, P3)
c1 = (x2 + x3)/2
c2 = mydata[0, :]
print(np.linalg.norm(x2 - c1)**2 +\
      np.linalg.norm(x3 - c1)**2 +\
      np.linalg.norm(x1 - c2)**2)
#3.625
Of course, this problem is trivial and can even be solved analytically by hand. The com-
puter solution of the K-means clustering is necessary when more points are involved and
when it is not obvious that a point belongs to a specific cluster.
# plot P1-P3
plt.scatter(mydata[:, 0], mydata[:, 1],
            color = ('r', 'r', 'dodgerblue'),
            facecolors = 'none', s = 80)
# plot C1 and C2
plt.scatter((Kclusters.cluster_centers_[0][0],
             Kclusters.cluster_centers_[1][0]),
            (Kclusters.cluster_centers_[0][1],
             Kclusters.cluster_centers_[1][1]),
            marker = 'X', color = ('dodgerblue', 'r'))
# add labels
plt.text(1.43, 0.8, "$C_1$", color = 'r', size = 14)
plt.text(3.1, 3.45, "$C_2$", color = 'dodgerblue', size = 14)
plt.text(0.95, 1.1, "$P_1$", color = 'r', size = 14)
plt.text(1.95, 1.1, "$P_2$", color = 'r', size = 14)
plt.text(2.95, 3.3, "$P_3$", color = 'dodgerblue', size = 14)
plt.show()
From Subsection 9.1.1, we may have already developed a feeling that the K-means clustering method may roughly be divided into the following two steps:
• Assignment step: Assume a number K and randomly generate K points µ_i^(1) (i = 1, 2, . . . , K) as the initial K centroids for K clusters. Without replacement, for each point x, compute the distance
d_i = ‖x − µ_i^(1)‖, i = 1, 2, . . . , K. (9.17)
Assign the point x to the cluster with the smallest distance. Then, assign the next point to a cluster until every point is assigned. Thus, the initial K clusters C_i^(1) are formed, with each C_i^(1) containing K_i^(1) data points (i = 1, 2, . . . , K).
• Update step: Compute the centroids of the initial K clusters C_i^(1):
µ_i^(2) = (1/K_i^(1)) ∑_{x∈C_i^(1)} x. (9.18)
These µ_i^(2) (i = 1, 2, . . . , K) are regarded as the updated centroids. We then repeat the assignment step to assign each point to a cluster, again according to the rule of the smallest distance. Next, we compute µ_i^(3) (i = 1, 2, . . . , K), and tWCSS^(3).
The K-means algorithm continues these iterative steps of assignment and update until the centroids µ_i^(n) (i = 1, 2, . . . , K) do not change from those in the previous assignment, µ_i^(n−1) (i = 1, 2, . . . , K). When the centroids do not change, we say that the K-means iterative sequence has converged. It can be rigorously proven that this algorithm always converges.
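To make the two steps concrete, here is a minimal from-scratch R sketch of the iteration (Euclidean distance, no safeguard against empty clusters). It is for illustration only; it is not the algorithm behind R's kmeans(), which uses more refined variants.
# A bare-bones assignment-and-update iteration for K-means
set.seed(1)
mydata = cbind(c(1, 2, 3), c(1, 1, 3.5))   # the three points of Example 9.1
K = 2
mu = mydata[sample(1:nrow(mydata), K), , drop = FALSE]  # initial centroids
for (iter in 1:100) {
  # assignment step: each point goes to the nearest centroid
  d = as.matrix(dist(rbind(mu, mydata)))[-(1:K), 1:K]
  cluster = apply(d, 1, which.min)
  # update step: recompute the centroid of each cluster
  munew = t(sapply(1:K, function(i)
    colMeans(mydata[cluster == i, , drop = FALSE])))
  if (all(munew == mu)) break   # converged: centroids unchanged
  mu = munew
}
mu        # final centroids
cluster   # cluster assignment of each point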
R and Python have K-means functions. Many software packages are freely available for
some modified K-means algorithms using different distance formulas. Climate scientists
often just use these functions or packages without actually writing an original computer
code for a specific K-means algorithm. For example, the R calculation in the previous
subsection used the command kmeans(mydata, K) to calculate the cluster results for
three points shown in Figure 9.1.
The K-means algorithm appears very simple, but it was invented only in the 1950s.
Although the K-means iterative sequence here always converges, the final result may depend on the initial centroids, and hence on the initial clusters C_i^(1) (i = 1, 2, . . . , K); therefore, each run of the R K-means command kmeans(mydata, K) may not always lead to the same solution. Further, a K-means clustering result may not be a global minimum of tWCSS.
Thus, after we have obtained our K-means results, we should try to explain the results
and see if they are reasonable. We may also run the code with different K values and check
which value of K is the most reasonable. The tWCSS score is a function of K. We may plot the function tWCSS(K) and check the shape of the curve. Usually, tWCSS(K) is a decreasing function; further, it decreases fast for K ≤ Ke, and then the decrease suddenly slows down, or the value may even increase, at K = Ke. In this sense, the curve tWCSS(K) looks as if it has an elbow at Ke. We would
choose this Ke as the optimal K. See the elbow in the example of the next subsection.
Different applications may need different optimization criteria for the selection of the
best number of clusters. When you work on your own data, try different criteria to choose
the best number of clusters. For example, the Silhouette method uses the Silhouette score
to measure the similarity level of a point with its own cluster compared to other clusters.
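As one illustration of such a criterion, the sketch below computes the average silhouette width for several values of K with R's cluster package; the synthetic data matrix is only an example.
# Average silhouette width as a function of K (synthetic example data)
library(cluster)
set.seed(2)
mydata = rbind(matrix(rnorm(100, 0, 1), ncol = 2),
               matrix(rnorm(100, 4, 1), ncol = 2))
avgwidth = rep(NA, 8)
for (K in 2:8) {
  fitK = kmeans(mydata, K)
  sil = silhouette(fitK$cluster, dist(mydata))
  avgwidth[K] = summary(sil)$avg.width
}
which.max(avgwidth)   # the K with the largest average silhouette width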
Again, a computer code for the K-means clustering may not always obtain the mini-
mum tWCSS against all partitions. Sometimes, a computer yields a “local” minimum for
tWCSS(K).
In short, when using the K-means method to divide given points into clusters, you should
run the K-means code multiple times and check the relevant solution results. Although most
K-means code yields a unique solution, multiple solutions sometimes appear, and then we
need to examine which solution is the most reasonable for our purposes.
In this subsection, we present an example using the K-means method to cluster the
observed daily weather data at the Miami International Airport in the United States. Figure 9.2(a) shows the scatter plot of the 2001 daily minimum temperature Tmin versus the direction of the fastest 2-minute wind (WDF2) at the Miami International Airport. The wind blowing from the north is zero degrees, from the east is 90 degrees, and from the west is 270 degrees. The scatter plot seems to support two clusters or weather regimes: the lower regime showing the wind direction between 0 and 180 degrees, meaning the wind from the east, northeast, or southeast; and another with the wind direction between 180 degrees and 360 degrees, meaning that the wind came from the west, northwest, or southwest.
Figure 9.2 (a) Scatter plot of the daily Tmin vs WDF2 for the Miami International Airport in 2001. (b) The K-means clusters of the daily Tmin vs WDF2 data points when assuming K = 2.
We would like to use the K-means method to make the clustering automatically and jus-
tify our previous intuitive conclusion from the scatter diagram. This example, although
simple, is closer to reality. You can apply the procedures described here to your own
data for various clustering purposes, such as defining wildfire zones, ecoregions, climate
regimes, and agricultural areas. You can find numerous application examples by an online
search using keywords like “K-means clustering for ecoregions” or “K-means clustering
for climate regimes.”
The daily weather data for the Miami International Airport can be obtained from the
Global Historical Climatology Network-Daily (GHCN-D), station code USW00012839.
The daily data from 2001 to 2020 are included in the book’s master data file data.zip
from the book website www.climatestatistics.org. The file name of the Miami data is
MiamiIntlAirport2001_2020.csv. You can access the updated data by an online search
for “NOAA Climate Data Online USW00012839.” Figure 9.2(b) shows the K-means clus-
tering result for the daily Tmin vs WDF2 for the 2001 daily data at the Miami International
Airport. The two clusters generated by the K-means method agree with our intuition by
looking at the scatter diagram Figure 9.2(a). However, the K-means clustering result Fig-
ure 9.2(b) provides a clearer picture: a red cluster with each data point indicated by a small
triangle, and a black cluster with each data point indicated by a small circle. The clus-
ter boundaries are linked by dashed lines. The large vertical gap at Tmin around 5◦ C (in
winter) between the two clusters indicates that the fastest 2-minute wind in a day has a
WDF2 angle value around 50◦ , a northeast wind for the black cluster, and around 300◦ , a
northwest wind for the red cluster.
Figure 9.2 may be generated by the following computer codes.
# R for Fig. 9.2: K-means clustering for 2001 daily weather
# data at Miami International Airport, Station ID USW00012839
setwd("~/climstats")
dat = read.csv("data/MiamiIntlAirport2001_2020.csv",
               header = TRUE)
dim(dat)
#[1] 7305 29
tmin = dat[, 'TMIN']
wdf2 = dat[, 'WDF2']
# plot the scatter diagram Tmin vs WDF2
setEPS() # Plot the data of 365 observations
postscript("fig0902a.eps", width = 5, height = 5)
par(mar = c(4.5, 4.5, 2, 4.5))
plot(tmin[2:366], wdf2[2:366],
     pch = 16, cex = 0.5,
     xlab = 'Tmin [deg C]',
     ylab = 'Wind Direction [deg]', grid())
title('(a) 2001 Daily Miami Tmin vs WDF2',
      cex.main = 0.9, line = 1)
axis(4, at = c(0, 45, 90, 135, 180, 225, 270, 315, 360),
     lab = c('N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW', 'N'))
mtext('Wind Direction', side = 4, line = 3)
dev.off()
#K-means clustering
K = 2 # assuming K = 2, i.e., 2 clusters
mydata = cbind(tmin[2:366], wdf2[2:366])
fit = kmeans(mydata, K) # K-means clustering
# Output the coordinates of the cluster centers
fit$centers
#1 18.38608 278.8608
#2 21.93357 103.9161
fit$tot.withinss # total WCSS
#[1] 457844.9  # the value may vary for each run
# Python K-means clustering
from sklearn.cluster import KMeans
fit = KMeans(n_clusters = K).fit(mydata)
Figure 9.3 (a) tWCSS scores for different K. (b) pWCSS variances for different K.
Another method to choose K is the knee rule, which is determined by the variances
explained by K clusters. When K = 1, tWCSS is the spatial variance of the entire data.
When K = N, tWCSS = 0, where N is the total number of the data points. Usually, we
have
tWCSS[1] ≥ tWCSS[K] ≥ tWCSS[N] (9.21)
for any 1 ≤ K ≤ N. Then,
pWCSS[K] = (tWCSS[1] − tWCSS[K]) / tWCSS[1] × 100% (9.22)
is defined as the percentage of variance explained by the K clusters. Usually, pWCSS[K] is an increasing function of K. When the increase rate suddenly slows down at some K, a shape like a knee appears in the pWCSS[K] curve. We then use this K as the optimal number of centers for our K-means clustering. This is the so-called knee rule. Figure 9.3(b) shows that the increase of pWCSS[K] dramatically slows down at K = 2. Thus, the knee rule also points to K = 2, providing another justification for this choice.
Figure 9.3 may be generated by the following computer codes.
# R plot Fig. 9.3: tWCSS(K) and pWCSS(K)
twcss = c()
for (K in 1:8){
  mydata = cbind(tmin[2:366], wdf2[2:366])
  twcss[K] = kmeans(mydata, K)$tot.withinss
}
twcss
par(mar = c(4.5, 6, 2, 0.5))
par(mfrow = c(1, 2))
plot(twcss/100000, type = 'o', lwd = 2,
     xlab = 'K', ylab = bquote('tWCSS [x' ~ 10^5 ~ ']'),
     main = '(a) The elbow principle from tWCSS scores',
     cex.lab = 1.5, cex.axis = 1.5)
points(2, twcss[2]/100000, pch = 16, cex = 3, col = 'blue')
text(4, 5, 'Elbow at K = 2', cex = 1.5, col = 'blue')
# compute percentage of variance explained
pWCSS = 100*(twcss[1] - twcss)/twcss[1]
plot(pWCSS, type = 'o', lwd = 2,
     xlab = 'K', ylab = 'pWCSS [percent]',
     main = '(b) The knee principle from pWCSS variance',
     cex.lab = 1.5, cex.axis = 1.5)
points(2, pWCSS[2], pch = 16, cex = 3, col = 'blue')
text(4, 80, 'Knee at K = 2', cex = 1.5, col = 'blue')
dev.off()
twcss = twcss[1:]
# plot elbow
plt.plot(np.linspace(1, 8, 8), twcss/100000, 'k', linewidth = 2)
# plot points
plt.scatter([np.linspace(1, 8, 8)], [twcss/100000], color = 'k')
# plot elbow at K = 2
plt.scatter(2, twcss[1]/100000, color = 'blue', s = 500)
# add text
plt.text(2.2, 5.2, 'Elbow at K = 2', color = 'blue', size = 20)
# add labels
plt.title("(a) The elbow principle from tWCSS scores",
          pad = 10)
plt.xlabel("K", labelpad = 10)
plt.ylabel(r"tWCSS [x $10^5$]", labelpad = 10)
plt.show()
# plot knee at K = 2
plt.scatter(2, pWCSS[1], color = 'blue', s = 500)
# add text
plt.text(2.2, 76, 'Knee at K = 2', color = 'blue',
         size = 20)
# add labels
plt.title("(b) The knee principle from pWCSS variance",
          pad = 10)
plt.xlabel("K", labelpad = 10)
plt.ylabel("pWCSS [%]", labelpad = 10)
plt.show()
(i) Data preparation: Write your data into an N × 2 matrix, where N is the total number
of data points, and 2 is for two variables, i.e., in a 2-dimensional space. The real
value matrix must have no missing data. Although our previous examples are for 2-
dimensional data, the K-means clustering can be used for higher dimensional data.
On the Internet, you can easily find examples of the K-means for data of three or
more variables.
(ii) Preliminary data analysis: You may wish to make some preliminary simple analyses
before applying the K-means method. These analyses may include plotting a scatter
343 9.1 K-Means Clustering
diagram, plotting a histogram for each variable, and computing mean, standard devi-
ation, and quantiles for each variable. If the data for the two variables have different
units, you may convert the data into their standardized anomalies. The standardized
anomaly computing procedure is called “scale” in the ML community. For example,
the R command scale(x) yields the standardized anomalies of data sequence x,
i.e., each value has the mean subtracted and is then divided by the standard deviation (see the short sketch after this list). Mathematically speaking, the scaling procedure is preferred when two variables have different units, because the sum unit1² + unit2² in tWCSS usually does not have a good interpretation.
In contrast, standardized anomalies are dimensionless and have no units. Nondi-
mensional data are often preferred when applying statistical methods. However, our
goal of the K-means clustering is to find patterns without labeling the data, which is
an unsupervised learning. The K-means results from scaled data and unscaled data
may not be the same. As long as the K-means result is reasonable and interpretable,
regardless of the scaled or unscaled data, we adopt the result. For example, in our
Miami Tmin and WDF2 example in the previous subsection, we found the K-means
clusters for both scaled and unscaled data, but we have chosen the result from the
unscaled data from our view point of interpretation. Therefore, we keep in mind that
the K-means is a result-oriented method.
(iii) Test K-means clustering: Apply a computer code of K-means to the prepared N × 2
data matrix with a K value inspired from the scatter diagram or science background
of the data. You may apply the code once to the scaled data and again to the unscaled
data, and see if the results make sense. Run your code a few times and see if the
results vary.
(iv) Determine the optimal K by the elbow rule: Compute the tWCSS scores for different
K and find the elbow location of the tWCSS(K) curve. Apply the K-means computer
code for this optimal K value. Compare this K with the K in the previous step. Check
the details of the clustering results, such as cluster assignment, tWCSS, and centers.
(v) Visualize the K-means clustering results: Use a computer code to plot the K clusters.
Different visualization codes are available for your choice.
(vi) Interpret the final result: Identify the characteristics of the clusters and interpret them
with text or more figures.
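The short R sketch below, which assumes mydata is the N × 2 matrix prepared in Step (i), compares the cluster assignments obtained from the unscaled data and from the standardized (scaled) anomalies.
# Compare K-means on unscaled data and on standardized anomalies
fit_raw    = kmeans(mydata, 2)           # unscaled data
fit_scaled = kmeans(scale(mydata), 2)    # scaled data (standardized anomalies)
# cross-tabulate the two assignments to see whether they differ
table(unscaled = fit_raw$cluster, scaled = fit_scaled$cluster)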
Applying the K-means computer code is very easy, but in the K-means data analysis, we
should avoid the following common mistakes:
• Skipping the preliminary data analysis step: Some people may quickly apply the K-
means computer code and treat the K-means results as unique, hence regard their plot of
the clusters as the end of the analysis.
• Missing the interpretation: An interpretation step, short or long, is needed. This is our purpose anyway.
• Forgetting to scale the data: We should always analyze the scaled data. At the same
time, because of the interpretation requirement, we should also check the results from
the unscaled data and see which result is more reasonable.
• Overlooking the justification of the optimal K: A complete K-means clustering should
have this step, which helps us to explore all the possible clusters.
In addition to using the standard K-means computer code you can find from the Inter-
net or that built in R or Python, you sometimes may wish to make special plots for your
clustering results with some specific features, such as a single cluster on a figure with the
marked data order, as shown in Figure 9.4. A convex hull is a simple method to plot a
cluster. By definition, the convex hull of a set of points is the smallest convex polygon that
encloses all of the points in the set. Below is the computer code to plot Figure 9.4 using
the convex hull method.
Figure 9.4 Convex hull for Cluster 1. The number above each data point is the day number in a year, e.g., January 1 being 1, February 2 being 33, and December 31 being 365 in 2001. The day numbers help identify the seasonality of the points, for example the left-most tip of the cluster being Day 4, a specific winter day.
# Python code for the convex hull plot (Fig. 9.4)
from scipy.spatial import ConvexHull  # needed for the cluster boundary
plt.xlim([0, 30])
plt.ylim([0, 400])
plt.yticks([0, 100, 200, 300])
# plot points
plt.scatter(subset["TMIN"], subset["WDF2"], color = 'k')
# add labels
for i in range(len(subset["TMIN"])):
    plt.text(subset["TMIN"].values[i],
             subset["WDF2"].values[i],
             subset["day"].values[i], size = 12)
# add boundary
for i in subset.cluster.unique():
    points = \
        subset[subset.cluster == i][['TMIN', 'WDF2']].values
    # get convex hull
    hull = ConvexHull(points)
    # get x and y coordinates
    # repeat last point to close the polygon
    x_hull = np.append(points[hull.vertices, 0],
                       points[hull.vertices, 0][0])
    y_hull = np.append(points[hull.vertices, 1],
                       points[hull.vertices, 1][0])
The K-means method can be used not only for weather data clustering and climate classification, but also for many other purposes, such as weather forecasting, storm identification,
and more. In many applications, K-means is part of a machine learning algorithm, and
takes a modified version, such as the incremental K-means for weather forecasting and the
weighted K-means for climate analysis.
The K-means clustering can organize a suite of given data points into K clusters. The data
points are not labeled. This is like organizing a basket of fruits into K piles according
to the minimum tWCSS principle. After the clustering, we may name the clusters, say
1, 2, . . . , K, or apples, oranges, peaches, . . ., grapes. Machine learning often uses the term
“cluster label” in lieu of “cluster name.” So, we say that we label a cluster when we name a
cluster. The learning process for unlabeled data is called unsupervised learning. If a basket
of fruits is sorted by a baby who cannot articulate the fruit names, he is performing an
unsupervised learning.
In contrast, support vector machine (SVM) is in general a supervised learning, working
on the labeled data, e.g., a basket of apples, oranges, peaches, and grapes. Based on the
numerical data, SVM has two purposes. First, SVM determines the maximum “distances”
between the sets of labeled data, say apples, oranges, peaches, and grapes. This is a train-
ing process and helps a computer learn the differences between different labeled sets, e.g.,
different fruits, different weather, or different climate regimes. Second, SVM makes a pre-
diction, which, based on the trained model, predicts the labels (i.e., names) of the new
and unlabeled data. Thus, SVM is a system to learn the maximum differences among the
labeled clusters of data, and to predict the labels of the new unlabeled data. This process
of training and prediction mimics a human’s learning experience. For example, over the
years, a child is trained by his parents to acquire knowledge of apples, oranges, and other
fruits. At a certain point, the child is well trained and ready to independently tell (i.e.,
predict) whether a new fruit is an apple or an orange. During this learning process, the
fruits are labeled (e.g., apples, oranges, etc.). His parents help input the data into his brain
through a teaching and learning process. His brain was trained to maximize the difference
between an apple and an orange. In the prediction stage, the child independently imports
the data using his eyes, and makes the prediction whether a fruit is an apple or an orange.
SVM mimics this process by maximizing the difference between two or more labeled cat-
egories or clusters in the training process, then, by using the trained model, SVM predicts
the categories for the new data. The key is how to quantify the difference between any two
categories.
In the following, we will illustrate these SVM learning ideas using three examples: (i) a
trivial case of two data points, (ii) training and prediction for a system of three data points,
and (iii) a more general case of many data points.
x1 + x2 = 4, (9.23)
that for the positive hyperplane (i.e., the blue dashed line) is
x1 + x2 = 6, (9.24)
and that for the negative hyperplane (i.e., the red dashed line) is
x1 + x2 = 2. (9.25)
Figure 9.5 Quantify the maximum difference between two points: the separating, positive, and negative hyperplanes for points P1 and P2, with the normal vectors n and w.
This vector is a normal vector for the separating hyperplane. The equation of the separating hyperplane may be expressed in the normal vector form
n · x = 2√2. (9.27)
The separating hyperplane can also be written in the form
w · x − b = 0, (9.28)
where
w = (0.5, 0.5) (9.29)
and
b = 2. (9.30)
An advantage of this form is that the equations for the positive and negative hyperplanes can be simply written as
w · x − b = ±1, (9.31)
and that of the separating hyperplane as
w · x − b = 0. (9.32)
Here, w is called a normal vector and is related to the unit normal vector n as follows:
w = (1/√2) n. (9.33)
Let us progress from a two-point system to a three-point system and build an SVM using a
computer code. Figure 9.6 shows the three points labeled in two categories. The two blue
points are P1 (1, 1), P2 (2, 1) which are in the first category corresponding to the categorical
value y = 1. The red point P3 (3, 3.5) is in the second category corresponding to y = −1.
Training these data is to maximize the margin size Dm, which results in the following SVM parameters:
w = (−0.28, −0.69), b = 2.24, Dm = 2.69. (9.36)
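The training code for this small example is not shown here; a sketch using the e1071 package (as in the 20-point example later in this section) recovers comparable parameters, although the signs of w and b and the exact values depend on the package's labeling convention and solver settings.
# A sketch of SVM training for the three labeled points with e1071
library(e1071)
x = matrix(c(1, 1, 2, 1, 3, 3.5), ncol = 2, byrow = TRUE)
y = c(1, 1, -1)                      # category labels of P1, P2, P3
dat = data.frame(x, y = as.factor(y))
svm3 = svm(y ~ ., data = dat, kernel = "linear",
           cost = 10, scale = FALSE, type = 'C-classification')
w = t(svm3$coefs) %*% svm3$SV        # normal vector w
b = svm3$rho                         # intercept b
Dm = 2 / norm(w, type = '2')         # margin size
w; b; Dm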
Figure 9.6 SVM training from three data points, and the prediction by the trained SVM. Two blue points and a red point are the training data; two black diamond points Q1 and Q2 are to be predicted by the SVM.
x1 = seq(0, 5, len = 31)
x2 = (b - w[1]*x1)/w[2]
x2p = (1 + b - w[1]*x1)/w[2]
x2m = (-1 + b - w[1]*x1)/w[2]
x20 = (b - w[1]*x1)/w[2]
# plot the SVM results
setEPS()
postscript("fig0906.eps", height = 7, width = 7)
par(mar = c(4.5, 4.5, 2.0, 2.0))
plot(x, col = y + 3, pch = 19,
     xlim = c(0, 6), ylim = c(0, 6),
     xlab = bquote(x[1]), ylab = bquote(x[2]),
     cex.lab = 1.5, cex.axis = 1.5,
     main = 'SVM for three points labeled in two categories')
axis(2, at = (-2):8, tck = 1, lty = 2,
     col = "grey", labels = NA)
axis(1, at = (-2):8, tck = 1, lty = 2,
     col = "grey", labels = NA)
lines(x1, x2p, lty = 2, col = 4)
lines(x1, x2m, lty = 2, col = 2)
lines(x1, x20, lwd = 1.5, col = 'purple')
xnew = matrix(c(0.5, 2.5, 4.5, 4),
              ncol = 2, byrow = TRUE)
points(xnew, pch = 18, cex = 2)
for (i in 1:2){
  text(xnew[i, 1] + 0.5, xnew[i, 2], paste('Q', i),
       cex = 1.5, col = 6 - 2*i)
}
text(2.2, 5.8, "Two blue points and a red point
are training data for an SVM",
     cex = 1.5, col = 4)
text(3.5, 4.7, "Two black diamond points
are to be predicted by the SVM",
     cex = 1.5)
dev.off()
The previous quantification of maximum separation between two points or three points can
be generalized to the separation of many points of two categories labeled (1, −1), or (1, 2),
or (A, B), or another way. What is a systematic formulation of mathematics to make the best
separation, by maximizing the margin? How can we use a computer code to implement the
separation? This subsection attempts to answer these questions.
We have n training data points:
(x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ), (9.37)
where xi is a p-dimensional vector (i.e., using p parameters to describe every data point in
a category), and yi is equal to 1 or −1, a category indicator, i = 1, 2, . . . , n.
The separating hyperplane has its linear equation as follows:
w · x − b = 0. (9.38)
Our SVM algorithm is to find the p-dimensional normal vector w and a real-valued scalar b so that the distance
Dm = 2/|w| (9.39)
between the positive and negative hyperplanes
w · x − b = ±1 (9.40)
is maximized.
The maximization is for the domain in the p-dimensional space on or above the positive hyperplane and on or below the negative hyperplane, i.e.,
w · xi − b ≥ 1 when yi = 1 (9.41)
and
w · xi − b ≤ −1 when yi = −1. (9.42)
This definition of the maximization implies that no training data points are located between
the positive and negative hyperplanes. Thus, the solution of the maximization of Dm =
2/|w| must occur at the boundary of the domain, i.e., the positive and negative hyperplanes.
The points on these hyperplanes are called the support vectors (SV). R or Python SVM
code can find these SVs.
Figure 9.7 shows an example of an SVM training with 20 training data points, and an
SVM prediction for 3 data points. The data are given below.
X1 X2 y
1 1 6.0 1
2 2 8.0 1
3 3 7.5 1
4 1 8.0 1
5 4 9.0 1
6 5 9.0 1
7 3 7.0 1
8 5 9.0 1
9 1 5.0 1
10 5 3.0 2
11 6 4.0 2
12 7 4.0 2
13 8 6.0 2
14 9 5.0 2
15 10 6.0 2
16 5 0.0 2
17 6 5.0 2
18 8 2.0 2
19 2 2.0 2
20 1 1.0 2
Figure 9.7 SVM training from 20 data points, and the prediction of 3 new data points by the trained SVM.
The following computer code can do the SVM calculation and plot Figure 9.7. We
acquire three new data points Q1 (0.5, 2.5), Q2 (7.5, 2.0), and Q3 (6.0, 9.0). We wish to use
the trained SVM to find out which point belongs to which category. The SVM prediction result is shown in Figure 9.7: Q1(0.5, 2.5) and Q2(7.5, 2.0) belong to Category 2, and Q3(6.0, 9.0) to Category 1. Again, this prediction result is obvious to us when the new data points are plotted: our eyes can see the points, and our brains can make the decision based on our trained intelligence. Real-world SVM applications, however, often involve data of
higher dimensions, in which case a visual decision is impossible and we must predict the category of the new data without plotting. The prediction is then based on the training data, the new data, and the SVM algorithm, without visualization. For example, the R prediction
of this problem can be done by the command predict(svmP, xnew) where xnew is the
new data in a 3 × 2 matrix, and svmP is the trained SVM. This command is included in the
computer code for generating Figure 9.7.
# R plot Fig. 9.7: SVM for many points
# Training data x and y
x = matrix(c(1, 6, 2, 8, 3, 7.5, 1, 8, 4, 9, 5, 9,
             3, 7, 5, 9, 1, 5,
             5, 3, 6, 4, 7, 4, 8, 6, 9, 5, 10, 6,
             5, 0, 6, 5, 8, 2, 2, 2, 1, 1),
           ncol = 2, byrow = TRUE)
y = c(1, 1, 1, 1, 1, 1,
      1, 1, 1,
      2, 2, 2, 2, 2, 2,
      2, 2, 2, 2, 2)
library(e1071)
dat = data.frame(x, y = as.factor(y))
svmP = svm(y ~ ., data = dat,
           kernel = "linear", cost = 10,
           scale = FALSE,
           type = 'C-classification')
svmP
# Number of Support Vectors: 3
svmP$SV # SVs are x[9,], x[17,], x[19,]
#9   1 5
#17  6 5
#19  2 2
delx = 1.4
dely = delx * (-w[0]/w[1])
# plot points
color = ['r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'r', 'forestgreen',
         'forestgreen', 'forestgreen', 'forestgreen',
         'forestgreen', 'forestgreen', 'forestgreen', 'forestgreen',
         'forestgreen', 'forestgreen', 'forestgreen']
plt.scatter(x[::2], x[1::2], color = color)
# plot newly predicted points
plt.scatter([0.5, 7, 6], [2.5, 2, 9], marker = "^",
            color = 'k', s = 90)
# plot lines
plt.plot(x1, x2p, color = 'red', linestyle = '--')
plt.plot(x1, x2m, color = 'forestgreen', linestyle = '--')
plt.plot(x1, x20, color = 'purple')
# add text
plt.text(5 + 2*delx, 6.5 + 2*dely, '$w \cdot x - b = 0$',
         color = 'purple', rotation = thetasvm, size = 15)
plt.text(5 - delx, 7.6 - dely, '$w \cdot x - b = 1$',
         color = 'red', rotation = thetasvm, size = 15)
plt.text(5, 4.8, '$w \cdot x - b = -1$',
         color = 'forestgreen', rotation = thetasvm, size = 15)
plt.text(1.8, 4.3, 'w', color = 'blue', size = 15)
plt.text(0.3, 2.1, '$Q_1$', color = 'k', size = 15)
plt.text(6.8, 1.6, '$Q_2$', color = 'k', size = 15)
plt.text(5.8, 8.6, '$Q_3$', color = 'k', size = 15)
SVM maximizes 2/|w|, i.e., minimizes |w|. The computer code shows that w =
(−0.39996, 0.53328). This optimization is reached at the three SVs P9 (1, 5), P17 (6, 5), and
P19 (2, 2). We can verify that the end points of these three SVs are on the two hyperplanes,
by plugging these coordinates into the following two linear equations:
w · x − b = ±1. (9.43)
Conversely, these equations have three parameters w1, w2, b, which can be determined by the three support vectors (SVs).
The linear SVM separation can have nonlinear extensions, such as circular or spherical
hypersurfaces. Nonlinear SVM is apparently needed when the data cannot be easily sep-
arated by linear hyperplanes. Both R and Python codes have function parameters for the
nonlinear SVM. Another SVM extension is from two classes to many classes. Multi-class
and nonlinear SVM are beyond the scope of this book. Interested readers are referred to dedicated machine learning books, such as those listed in the References and Further Reading section at the end of this chapter.
Random forest (RF) is a popular machine learning algorithm and belongs to the class of
supervised learning. It uses many decision trees trained from the given data, called the
training data. The training data usually include categorical data, such as weather types
(e.g., rainy, cloudy, and sunny) as labels, and numeric data, such as historic instrument
observations (e.g., atmospheric pressure, humidity, wind speed, wind direction, and air
temperature). These data form many logical decision trees that decide under what sets of
conditions the weather is considered rainy, cloudy, or sunny. Many decision trees form
a forest. Multiple decision nodes and branches for each tree are determined and trained
by using a set of random sampling procedures. The random sampling and the decision
trees lead to the name random forest. The training step results in an RF model that is a
set of decision trees. In the prediction step, you submit your new data to the trained RF
model, each of whose decision trees casts a vote for a category based on your new data. If
the category “tomorrow is rainy” receives the most votes, then you have predicted a rainy
day tomorrow. In this sense, RF is an ensemble learning method. Because of the nature
of ensemble predictions, an RF algorithm should ensure the votes among the trained trees
have as little correlation as possible.
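The voting idea can be made concrete with a short Python sketch (an illustration with scikit-learn, not one of the book's listings) that tallies the votes of the individual trees in a trained forest for one new observation. Note that scikit-learn's own predict method averages the trees' class probabilities, which usually agrees with this hard-vote tally.

# Tally the votes of the individual trees in a random forest (scikit-learn sketch)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=500, random_state=1)
rf.fit(iris.data, iris.target)

x_new = iris.data[100].reshape(1, -1)          # one new observation
# each tree returns an index into rf.classes_
votes = [int(tree.predict(x_new)[0]) for tree in rf.estimators_]
counts = np.bincount(votes, minlength=len(rf.classes_))
print(counts)                                  # number of votes for each species
print(rf.classes_[np.argmax(counts)])          # the majority-vote prediction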
The description so far is a layman's description of RF for classification, where the RF output
is categorical data, called a factor in data types. RF can also be used to fill in missing
numeric values. Missing values, or missing data in climate science, are often denoted by
NA, NaN, or -99999. RF makes a prediction for each missing value as an ensemble mean
over the trained RF trees. This is RF regression, which produces numerical results.
We use the popular iris flower dataset (Fisher 1936), frequently used in RF teaching, as
an example to explain the RF computing procedure and to interpret RF results. Sir Ronald
A. Fisher (1890–1962) was a British statistician and biologist. The Fisher dataset contains
three iris species: setosa, versicolor, and virginica. The data are the lengths and widths of
the sepal and petal of each flower. The first two of the 150 rows of the dataset are as
follows.
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1 5.1 3.5 1.4 0.2 setosa
#2 4.9 3.0 1.4 0.2 setosa
Figure 9.8 shows the entire dataset by connected dots. Figure 9.8 may be generated by the
following computer code.
# R plot Fig. 9.8: R. A. Fisher data of three iris species
setwd('~/climstats')
data(iris) # read the data already embedded in R
dim(iris)
#[1] 150 5
iris[1:2,] # check the first two rows of the data
#  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1          5.1         3.5          1.4         0.2  setosa
#2          4.9         3.0          1.4         0.2  setosa
Figure 9.8 The R. A. Fisher iris dataset of sepal length, sepal width, petal length, and petal width for flower species (Fisher 1936).
# set up plot
plt.title("R. A. Fisher data of iris flowers", pad = 10)
plt.xlabel("Sorted order of the flowers for measurement",
           labelpad = 10)
plt.ylabel("Length or width [cm]", labelpad = 10)
plt.xticks([0, 50, 100, 150])
plt.ylim(-2, 9)
plt.yticks([0, 2, 4, 6, 8])
# add legend
plt.legend()
# add text
plt.text(13, -1, "Setosa 1-50", size = 15)
plt.text(57, -1, "Versicolor 51-100", size = 15)
plt.text(107, -1, "Virginica 101-150", size = 15)
plt.show()
The first 50 rows are setosa, the smallest flowers, and the last 50 are virginica, which
have the largest petals. Given the data of sepal and petal sizes, one may correctly detect
the corresponding flower species. However, not all flowers of a species have the same size,
which makes the species detection more difficult. The RF algorithm can make the detection
with a small error. We conduct an RF experiment with this dataset. We randomly select 120
rows of data as the training data to obtain a trained RF model with 500 decision trees, and
use the remaining 30 rows as the new data for our detection. The decision trees vote on each
of the 30 rows of new data. The species that receives the most votes from the trained
RF trees is the RF detection result. By default, the decision trees in an RF algorithm are
built randomly, so the RF results from different runs on the same data may differ slightly.
The following computer code shows the RF run and its result of this experiment
for Fisher's iris data. The code also generates Figure 9.9.
# R code: RF prediction using the Fisher iris data
# install.packages("randomForest")
library(randomForest)
# set.seed(8) # run this line to get the same result
# randomly select 120 observations as training data
train_id = sort(sample(1:150, 120, replace = FALSE))
train_data = iris[train_id, ]
dim(train_data)
#[1] 120 5
# use the remaining 30 as the new data for prediction
new_data = iris[-train_id, ]
dim(new_data)
#[1] 30 5
Figure 9.9 (a) RF errors vs trees. (b) The RF's importance plot.
• Confusion matrix: This matrix displays correct and wrong classifications based on the
training data. The columns are the truth and the rows are the RF classifications. The
diagonal entries are correct classifications, and the off-diagonal entries are the wrong ones.
The 41 setosas are all correctly classified. Among the 37 versicolors, 34 are correctly
classified, but 3 are wrongly classified as virginica.
• Classification error: The last column of the confusion matrix is the classification error,
which is equal to the sum of the off-diagonal elements in that row divided by the sum of
the entire row, i.e., the ratio of the number of wrong classifications to the total number
of classifications. The errors show how good the trained RF model is.
• OOB estimate of error rate: OOB stands for out-of-bag, a term for a statistical sampling
process that uses part of the data for training the RF model and reserves the remainder for
validation. The remainder is referred to as the OOB set of data. The OOB error rate is
defined as the ratio of the number of wrong decisions based on the majority vote to the
size of the OOB set.
• Number of variables tried at each split: This is the number of variables randomly sampled
for growing the trees, and is denoted by mtry. The Fisher iris data have four variables
(p = 4). It is recommended that mtry = √p for RF classification, and mtry = p/3 for RF
regression.
• Number of trees: We want to build enough trees that the RF errors shown in
Figure 9.9(a) stabilize. Too few trees may yield results that differ substantially from run
to run.
• RF errors vs trees: The colored lines show the errors of different types of validation proce-
dures. As RF users, we pay attention to whether these errors have stabilized. The detailed
error definitions are beyond the scope of this chapter.
• Mean decrease Gini scores and the importance plot (Fig. 9.9(b)): A higher mean
decrease Gini (MDG) score indicates that the variable is more important in the model.
In our example, the petal width is the most important variable. This agrees with our intu-
ition from Figure 9.8. The figure shows that the petal width has clear distinctions among
the three species and little variance within species. The petal length also has clear dis-
tinctions, but has larger variances, and is the second most important variable. The sepal
width has almost no distinctions among the species and has the smallest MDG score,
1.5. It is the least important variable.
• RF prediction: Finally, RF predictions were made for the 30 rows of new data. The
prediction results for the first six rows of the new data are shown here. These are the
final results. Among the 30 rows, RF correctly predicted 28; rows 120 and 135 should be
virginica, but RF predicted versicolor for both. In weather forecasting, such predictions
could be the public weather outlook for the 9th day from today (e.g., sunny, rainy, or cloudy)
at 30 different locations in a country. A short Python sketch after this list illustrates how
the quantities discussed above can be computed.
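For readers working in Python, the following sketch (assuming scikit-learn, which may differ from the book's own Python listing) computes the same kinds of quantities discussed in the list above: an OOB error rate, impurity-based importance scores analogous to the mean decrease Gini, and a confusion matrix for the 30 rows of new data.

# Python sketch of RF classification for the iris data (scikit-learn assumed)
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

iris = load_iris()
rng = np.random.default_rng(8)
train_id = np.sort(rng.choice(150, size=120, replace=False))  # 120 training rows
new_id = np.setdiff1d(np.arange(150), train_id)               # 30 rows of new data

rf = RandomForestClassifier(n_estimators=500,     # number of trees
                            max_features='sqrt',  # mtry = sqrt(p) for classification
                            oob_score=True,       # out-of-bag validation
                            random_state=8)
rf.fit(iris.data[train_id], iris.target[train_id])

print('OOB error rate:', 1 - rf.oob_score_)
print('Importance scores:', rf.feature_importances_)
pred = rf.predict(iris.data[new_id])
print(confusion_matrix(iris.target[new_id], pred))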
9.3.2 RF Regression for the Daily Ozone Data of New York City
Similar to Fisher's iris data, the daily air quality data of New York City (NYC) from
May 1, 1973 to September 30, 1973 (153 days), stored in the R dataset airquality, are
another benchmark dataset for RF algorithms. The dataset contains the following weather
parameters: ozone concentration in the ground-level air, cumulative solar radiation, average
wind speed, and daily maximum air temperature. When the ozone concentration is higher than 86
parts per billion (ppb), the air is considered unhealthy. The ozone concentration is related
to solar radiation, wind, and temperature. The following list provides more information on
this airquality dataset:
• Ozone data in ppb, observed between 1300 and 1500 hours at Roosevelt Island,
New York City. Of the 153 days, 37 had no data and are denoted by NA. The data range
is 1–168 ppb. The maximum ozone level in this dataset, 168 ppb, occurred on August
26, 1973, when the cumulative solar radiation was high at 238 Lang, the average wind
speed low at 3.4 mph, and Tmax moderate at 81°F.
• Cumulative solar radiation from 0800 to 1200 hours, in Langleys (Lang or
Ly) (1 Langley = 41,868 Watt·sec/m², or 1 Watt/m² = 0.085985 Langley/hour), in the
light wavelength range 4,000–7,700 Angstroms, measured at Central Park in New York City. The
data range is 7–334 Lang.
• Average wind speed in miles per hour (mph) at 0700 and 1000 hours at LaGuardia
Airport, less than 10 km from Central Park. The data range is 1.7–20.7 mph.
• Maximum daily temperature Tmax in °F at LaGuardia Airport. The data range is 56–97
°F. The daily Tmax data can also be downloaded from the NOAA NCEI website by an
online search using the words NOAA Climate Data Online LaGuardia Airport.
While RF analysis of this airquality dataset can serve many purposes, our example is to
fill the 37 missing ozone values with RF regression results. RF uses the four aforementioned
parameters and their data to build decision trees. Figure 9.10(a) shows the original 116
observed data points as black dots and lines, and the RF regression estimates for the 37
missing data as blue dots and lines. A single blue dot means an isolated day of missing data.
A blue line indicates missing data on successive days. For better visualization, Figure 9.10(b)
shows the complete NYC daily ozone time series, which joins the observed data with the RF
regression results.

Figure 9.10 (a) New York City ozone data from May 1 to September 30, 1973. Among the 153 days, 116 had observed data (black dots and lines) and 37 had missing data. The missing data are filled by RF regression results (blue dots and lines). (b) The complete ozone data time series as a continuous curve with dots when the missing data are replaced by the RF regression.
Figure 9.10 may be generated by the following computer code.
# R plot Fig. 9.10: RF regression for ozone data
library(randomForest)
airquality[1:2,] # use R's RF benchmark data "airquality"
#  Ozone Solar.R Wind Temp Month Day
#1    41     190  7.4   67     5   1
#2    36     118  8.0   72     5   2
dim(airquality)
#[1] 153 6
ozoneRFreg = randomForest(Ozone ~ ., data = airquality,
                          mtry = 2, ntree = 500, importance = TRUE,
                          na.action = na.roughfix)
# na.roughfix allows NA to be replaced by medians
# to begin with when training the RF trees
# Note: t1 (the time axis) and ozone_filled (the RF-completed ozone series)
# are assumed to be defined in the omitted part of the full listing
par(mfrow = c(2, 1))
par(mar = c(3, 4.5, 2, 0.1))
plot(t1, airquality$Ozone,
     type = 'o', pch = 16, cex = 0.5, ylim = c(0, 170),
     xlab = '', ylab = 'Ozone [ppb]', xaxt = "n",
     main = '(a) Ozone data: Observed (black) and RF filled (blue)',
     col = 1, cex.lab = 1.3, cex.axis = 1.3)
MaySept = c("May", "Jun", "Jul", "Aug", "Sep")
axis(side = 1, at = 5:9, labels = MaySept, cex.axis = 1.3)
points(t1, ozone_filled, col = 'blue',
       type = 'o', pch = 16, cex = 0.5) # RF filled data
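The same gap-filling idea can be sketched in Python (a sketch, not one of the book's listings). It assumes the airquality data have been exported from R to a CSV file at a hypothetical path data/airquality.csv, and it trains on the 116 complete ozone cases and then predicts the 37 missing days, a slightly different route from na.roughfix.

# Python sketch: fill the missing ozone values by RF regression (scikit-learn assumed)
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

aq = pd.read_csv('data/airquality.csv')   # hypothetical export of R's airquality data
predictors = ['Solar.R', 'Wind', 'Temp', 'Month', 'Day']
# a rough fix for the few missing predictor values, similar in spirit to na.roughfix
X = aq[predictors].fillna(aq[predictors].median())
obs = aq['Ozone'].notna()                 # the 116 observed days

rf = RandomForestRegressor(n_estimators=500, max_features=2, random_state=1)
rf.fit(X[obs], aq.loc[obs, 'Ozone'])      # train on the observed days

ozone_filled = aq['Ozone'].copy()
ozone_filled[~obs] = rf.predict(X[~obs])  # fill the 37 missing days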
We have learned that the RF prediction is the result of majority votes from the trained
decision trees in a random forest. What, then, does a decision tree look like? Figure 9.11
shows a decision tree in the RF computing process for the training data of iris flowers. The
gray box on top shows the percentage of each species in the 120 rows of training data. This
particular set of training data was randomly selected from the entire R. A. Fisher dataset
which has 150 rows. Of the 120 rows of training data, 41 rows are setosa (34%), 42 rows
versicolor (35%), and 37 rows (31%) virginica. The first branch of the decision tree grows
from a condition
Petal Length < 2.5 [cm]. (9.44)
If “yes,” this iris is setosa. All 41 setosa flowers (34% of the entire training data)
have been detected. This decision is clearly supported by the real petal length data (the
green line) shown in Figure 9.8, which shows that only setosa has petal length less than
2.5.
If “no,” then two possibilities exist. This allows us to grow a new branch of the tree. The
condition for this branch is
Petal Width < 1.8 [cm]. (9.45)
This condition determines whether a flower is versicolor or virginica. After the isola-
tion of 41 setosa flowers, the remaining 79 flowers are 42 versicolor (53%) and 37
virginica (47%). The “yes” result of condition (9.45) leads to versicolor.

Figure 9.11 A trained tree in the random forest for the iris data of R. A. Fisher (1936).

This result is 93% correct and 7% incorrect. The “no” result implies virginica. This conclusion is 97% correct,
and 3% incorrect. These conclusions are supported by the real petal width data (the blue
line) shown in Figure 9.8. The petal width of versicolor iris flowers is in general less than
1.8 [cm]. However, the data have some fluctuations in both the versicolor and virginica
sections of Figure 9.8. These fluctuations lead to the decision errors.
The entire RF process for our iris species example had grown 800 such decision trees
using the 120 rows of training data. Different trees use different branch conditions. RF
algorithms grow these trees following various kinds of optimization principles, which
involve some tedious mathematics not covered here.
Figure 9.11 may be generated by the following computer code.
# R plot Fig. 9.11: a decision tree from the RF training data
setwd('~/climstats')
library(rpart)       # for growing the decision tree
library(rpart.plot)  # for plotting the tree
setEPS() # automatically saves the .eps file
postscript("fig0911.eps", width = 6, height = 6)
par(mar = c(0, 2, 1, 1))
iris_tree = rpart(Species ~ ., data = train_data)
rpart.plot(iris_tree,
           main = 'A decision tree for the RF training data')
dev.off()
9.4 Neural Network and Deep Learning
A neural network (NN) consists of a series of data fittings based on the training data,
followed by an application of the fitted NN model to the test data. The NN output can be
categorical predictions, such as a sunny or rainy day, or numerical predictions, as in a
regression. NN is also known as an artificial neural network (ANN) or a simulated neural
network (SNN). It is a popular machine learning method and a fundamental building block
of deep learning algorithms, which often involve data fitting across multiple layers,
referred to as hidden layers.
Why is it called a neural network? How does a data fitting process have anything to do
with “neural” and/or “network”? Artificial neurons were first proposed by Warren Sturgis
McCulloch (1898–1969), an American neurophysiologist, and Walter Harry Pitts (1923–
1969), an American logician, in their 1943 paper entitled “A logical calculus of ideas
immanent in nervous activity.” This mathematical paper used the terms “neuron,” “action,”
“logic expression,” and “net,” which are among the keywords in modern NN writings.
The ten theorems of McCulloch and Pitts’ paper formulate a suite of logic expressions
based on data. The word “neuron” was more a graphic indication for actions and logic
expressions than a biological reference. A layman’s understanding of NN machine learn-
ing is often incorrectly articulated in terms of biological neurons and a biological neural
network. Therefore, ANN may be a more appropriate term for NN to avoid confusion.
This book attempts to briefly introduce NN so that you can use our R or Python code for
your data and objectives, interpret your NN computing results, and understand the basic
principles of mathematical formulations of an NN algorithm. However, we do not attempt
to derive mathematical details of the NN theory. This section has two subsections: a simple
NN example of a decision system, and an example of NN prediction using the benchmark
data of Fisher’s iris flowers.
As a simple example of an NN decision system, consider a hiring problem: a senior manager has made the following decisions for six earlier job applicants, based on their TKS and CSS scores.
TKS 20 10 30 20 80 30
CSS 90 20 40 50 50 80
Decision Hire Reject Reject Reject Hire Hire
The junior manager has received the scores of three new job applicants A,
B, and C, whose TKS and CSS scores are TKS (30, 51, 72) and CSS (85,
51, 30), respectively. The senior manager's rulings seem to suggest that a candidate is
recruited when either the TKS score or the CSS score is high. Thus, the junior manager has an
easy decision for Candidate A, whose CSS of 85 is high, and who is therefore recruited. However,
the decision for Candidate B is difficult: neither TKS 51 nor CSS 51 is high, but both are
higher than those of the rejected candidates, and the sum of TKS and CSS is 102,
lower than 110, the minimum total among the recruited. Should the junior manager hire
Candidate B? The decision for Candidate C is also difficult. Candidate C seems weaker
than the recruited case with TKS 30 and CSS 80, but not much weaker. Should the junior
manager hire Candidate C? The following NN computer code may learn from the senior
manager's data and suggest an NN decision that may serve as a useful reference for the
junior manager.
# Python NN code for the hiring decision (first part of the listing)
import pandas as pd
from tensorflow.keras.models import Sequential  # assumed import; the book's full listing may differ

TKS = [20, 10, 30, 20, 80, 30]
CSS = [90, 20, 40, 50, 50, 80]
Recruited = [1, 0, 0, 0, 1, 1]
# combine multiple columns into a single data frame
df = pd.DataFrame({'TKS': TKS, 'CSS': CSS,
                   'Recruited': Recruited})
df.head()
X = df[["TKS", "CSS"]]
y = df[["Recruited"]]
# random.seed(123)
# The model is random due to too little training data
nn = Sequential()
Figure 9.12 shows a simple neural network plotted by the previous computer code. The
black numbers are called weights $w_{ij}$, which are multiplied by the input data $x_i$ for
neuron $j$. The blue numbers are called biases $b_j$, associated with neuron $j$ and indicated
by the circles pointed to by a blue arrow. The last circle on the right is the result, also
called the output layer. The first two circles on the left indicate the input data and form the
input layer. The weights and biases are mathematically aggregated in the following way:
$$ z_j = \sum_{i=1}^{n} w_{ij}\, x_i + b_j, \qquad (9.46) $$
where $z_j$ are for fitting an activation function at neuron $j$ when all the training data are used.
In the computer code, the activation function is logistic. A logistic function is defined as
$$ g(z) = \frac{1}{1 + \exp(-k(z - z_0))}, \qquad (9.47) $$
where $k$ is called the logistic growth rate, and $z_0$ is the midpoint. This function can also be
written as
$$ g(z) = \frac{1}{2} + \frac{1}{2}\tanh\!\left(\frac{k(z - z_0)}{2}\right). \qquad (9.48) $$
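A quick numerical check (a sketch, not one of the book's listings) confirms that the two forms (9.47) and (9.48) agree:

# Check numerically that Eqs. (9.47) and (9.48) define the same function
import numpy as np

k, z0 = 3.2, 1.0
z = np.linspace(-2, 4, 101)
g1 = 1.0/(1.0 + np.exp(-k*(z - z0)))          # Eq. (9.47)
g2 = 0.5 + 0.5*np.tanh(k*(z - z0)/2.0)        # Eq. (9.48)
print(np.max(np.abs(g1 - g2)))                # about 1e-16: the two forms coincide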
Figure 9.12 A simple neural network of five neurons in a hidden layer for an NN hiring decision system.
A curve of the logistic function is shown in Figure 9.13. The curve has the properties g(−∞) = 0
and g(∞) = 1. This makes the logistic activation function useful for categorical assignment:
False for 0 and True for 1.
Figure 9.13 can be plotted by the following computer code.
# R plot Fig. 9.13: Curve of a logistic function
y = seq(-2, 4, len = 101)
k = 3.2
y0 = 1
setEPS() # Automatically saves the .eps file
postscript("fig0913.eps", height = 5, width = 7)
par(mar = c(4.2, 4.2, 2, 0.5))
plot(y, 1/(1 + exp(-k*(y - y0))),
     type = 'l', col = 'blue', lwd = 2,
     xlab = 'y', ylab = 'f(y)',
     main = 'Logistic function for k = 3.2, y0 = 1.0',
     cex.lab = 1.3, cex.axis = 1.3)
dev.off()
Figure 9.13 Logistic function with growth rate k = 3.2 and midpoint y0 = 1.0.
# Python plot Fig. 9.13: curve of a logistic function
import numpy as np
import matplotlib.pyplot as plt

y = np.linspace(-2, 4, 101)   # same grid and parameters as in the R code above
k = 3.2
y0 = 1.0
# plot figure
plt.plot(y, 1/(1 + np.exp(-k*(y - y0))), 'b', linewidth = 2.5)
# add labels
plt.title('Logistic function for k = 3.2, z0 = 1.0',
          pad = 10)
plt.xlabel("z", labelpad = 10)
plt.ylabel("g(z)", labelpad = 10)
plt.show()
When the growth rate k = 1 and midpoint y0 = 0, the logistic function is often referred
to as a sigmoid function. Some Python or R code uses sigmoid to represent the activation
function.
The idea of a data-based decision process is that you integrate all the data, develop a
model through data fitting with a specified optimization, and apply the model to new data
to generate predictions. The neural network integrates data by assigning each datum a
weight, as in Eq. (9.46), and assigning each weighted sum a bias. The weighted data $z_j$ and
the decision data, 0 or 1, from the training dataset form a series of data pairs (z, d). These
data pairs are used for the logistic data fitting many times to train a neural network model.
Each time, the NN algorithm tries to optimize its data fitting, for example by minimizing the
mean square error (MSE) of the fitting. This optimization process, called back-propagation in
an NN algorithm, updates the weights and biases. This makes NN learning an optimization
process. When the fitting errors do not change much, the optimization process has con-
verged, and a logistic function model has been trained. You can now plot the trained neural
network, denoted by nn in the computer code here, and produce Figure 9.12.
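The fitting idea described above can be illustrated by a toy Python sketch (not the book's algorithm): synthetic pairs (z, d) with decisions d in {0, 1} are fitted by a logistic function, and the parameters are updated by gradient descent so that the MSE of the fit decreases.

# Toy sketch: fit a logistic function to (z, d) pairs by gradient descent on the MSE
import numpy as np

rng = np.random.default_rng(1)
z = rng.uniform(-3, 3, 200)
d = (z + 0.3*rng.normal(size=200) > 0.5).astype(float)  # synthetic 0/1 decisions

k, z0 = 1.0, 0.0                # initial guesses for growth rate and midpoint
lr = 0.5                        # learning rate
for _ in range(2000):
    g = 1.0/(1.0 + np.exp(-k*(z - z0)))
    err = g - d                 # residuals of the current fit
    dg = g*(1.0 - g)            # derivative of the logistic function w.r.t. its argument
    k -= lr*np.mean(err*dg*(z - z0))   # gradient of MSE/2 with respect to k
    z0 -= lr*np.mean(err*dg*(-k))      # gradient of MSE/2 with respect to z0
print(round(k, 2), round(z0, 2))       # fitted growth rate and midpoint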
The new data are aggregated into a data frame denoted by test in our code. You
apply the trained nn model to test and obtain the NN prediction results as probabilities,
with a higher probability suggesting a positive decision, i.e., recruited. You may use 0.5 as
your hiring probability threshold and convert each probability into a decision,
indicated by 1 (recruit) or 0 (reject). In this example, Candidates A and B are hired and
Candidate C is rejected, as suggested by this NN model.
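A minimal sketch of this decision step is shown below, using scikit-learn's MLPClassifier as a stand-in neural network (the book's own listing uses a different package). The training scores are the senior manager's six decisions, and the 0.5 threshold converts the predicted probabilities into hire/reject decisions.

# NN hiring-decision sketch with scikit-learn's MLPClassifier (a stand-in model)
import numpy as np
from sklearn.neural_network import MLPClassifier

X_train = np.array([[20, 90], [10, 20], [30, 40], [20, 50], [80, 50], [30, 80]])
y_train = np.array([1, 0, 0, 0, 1, 1])            # 1 = hire, 0 = reject
test = np.array([[30, 85], [51, 51], [72, 30]])   # Candidates A, B, C

nn = MLPClassifier(hidden_layer_sizes=(5,), activation='logistic',
                   max_iter=5000, random_state=123)  # seed fixed; otherwise runs differ
nn.fit(X_train, y_train)
prob = nn.predict_proba(test)[:, 1]                 # probability of "hire"
print(np.round(prob, 3), (prob > 0.5).astype(int))  # 0.5 threshold -> 1 or 0 decision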
When running the computer code for this hiring-data example, you may have noticed
that each run gives a different prediction result. This makes the prediction unreliable. To
improve the situation, we need more training data and more optimization using several
layers of neurons. The approach of multiple hidden layers is a type of deep learning. It
is intuitively clear why more training data are needed to train a more reliable model, since a
small dataset cannot cover all possibilities. In the next subsection, we will present an example of
NN deep learning using more than six groups of training data. Specifically, we will use 75
groups of the Fisher iris flower data.
The weights and biases of the trained model nn can be displayed by the command
nn$weights[[1]][[1]]
in the R NN package neuralnet. R can also output the initial assignment of weights and
biases:
nn$weights[[1]][[2]]
The initial weights and biases are randomly assigned, which makes the NN results differ
between runs when the amount of training data or the number of hidden layers is
insufficient.
An NN computer package can also output many other modeling results, such as an R
command:
nn$result.matrix
These output data can be used to analyze the quality of your trained model, so it is
unfair to say that NN is a black box. Yet, analyzing a trained model for a complex NN is
very difficult and requires some nontrivial mathematical preparation.
Many NN users may not have the mathematical background to analyze the trained model
based on mathematical theories. Instead, they apply the model to the new data and check
whether the NN prediction makes sense based on their experience or common sense. If not,
they train the model again with different parameters, or feed it more training data.
We wish to use the Fisher iris data to show an example of NN with a sufficiently large
training dataset and a highly reliable result. The Fisher iris data form a 150 × 5 matrix with
three iris species: 50 setosa, 50 virginica, and 50 versicolor, as described in the random
forest section, Section 9.3 (Fisher 1936). We wish to use a percentage of the data as the
training data and the rest as the test data. The following computer code uses 50% of the
data as the training data, i.e., p = 0.5 in the code. We choose two hidden layers with 10
neurons each (hidden = c(10, 10) in the code). The NN prediction result shows that NN
correctly predicted all 27 setosa irises, got 19 correct among the 20 versicolor irises but
misclassified one versicolor as virginica, and got 26 correct among the 28 virginica irises
but misclassified 2 as versicolor. Since the 75 groups of training data are plentiful, compared
to the 6 groups of training data for the hiring decision, the trained NN model for the iris
flowers does not vary much between runs. The prediction results are quite reliable. Using
more neurons and hidden layers may also help achieve reliable results. Although the results
from different runs of the NN code do not change much, small differences still exist. For
example, you may get a result of predicting 27 virginica.
When you change p to 0.8, you will have 120 rows of training data and 30 rows of test
data, as we used in Section 9.3 for the random forest example. This increase in training
data size helps further stabilize the prediction results, i.e., there is little difference between
the results of different runs. You can make some runs of the code on your own computer and
see how many incorrect predictions there are. Our example runs show only one or two
incorrect predictions among the 30 rows of test data. For example, in one run, the 9 setosa
and 12 versicolor irises were correctly predicted, and the 9 virginica irises had 8 correct
predictions and 1 wrong. The NN prediction accuracy seems similar to that of the random
forest prediction for this particular dataset.
# R NN code for the Fisher iris flower data
# Ref: https://rpubs.com/vitorhs/iris
library(neuralnet) # package providing the neuralnet() function
data(iris) # 150-by-5 iris data
# attach True or False columns to iris data
iris$setosa = iris$Species == "setosa"
iris$virginica = iris$Species == "virginica"
iris$versicolor = iris$Species == "versicolor"
p = 0.5 # assign 50% of data for training
train.idx = sample(x = nrow(iris), size = p*nrow(iris))
train = iris[train.idx,] # determine the training data
test = iris[-train.idx,] # determine the test data
dim(train) # check the train data dimension
#[1] 75 8
# use the length, width, True and False data for training
iris.nn = neuralnet(setosa + versicolor + virginica ~
                      Sepal.Length + Sepal.Width +
                      Petal.Length + Petal.Width,
                    data = train, hidden = c(10, 10),
                    rep = 5, err.fct = "ce",
                    linear.output = F, lifesign = "minimal",
                    stepmax = 1000000, threshold = 0.001)
plot(iris.nn, rep = "best") # plot the neural network
# Python NN code for the Fisher iris flower data
import pandas as pd

iris = pd.read_csv("data/iris.csv") # 150-by-5 iris data
# attach True or False columns to iris data
iris['setosa'] = iris['class'].map(lambda x:
                                   x == "Iris-setosa")
iris['versicolor'] = iris['class'].map(lambda x:
                                       x == "Iris-versicolor")
iris['virginica'] = iris['class'].map(lambda x:
                                      x == "Iris-virginica")
9.5 Chapter Summary
This chapter has described four methods of machine learning: K-means, support vector
machine (SVM), random forest (RF), and neural network (NN). In the description, we
have used the following datasets: the daily weather data at the Miami International Air-
port, the daily air quality data of New York City, and R. A. Fisher’s iris flower data
of species. We have provided both R and Python codes for these algorithms and their
application examples. The following provides a brief summary of our descriptions about
the four methods.
(i) K-means clustering: Simply speaking, this clustering method can fairly divide N
identical candies on a table among K kids according to the candy locations. It is normally
used as an unsupervised learning method. The K-means algorithm minimizes the total
within-cluster sum of squares (tWCSS).

References and Further Reading
[1] B. Boehmke and B. Greenwell, 2019: Hands-on Machine Learning with R. Chapman
and Hall.
This machine learning book has many R code samples and examples. The book
website is https://bradleyboehmke.github.io/HOML/.
This book contains the daily ozone and weather data from May 1, 1973 to
September 30, 1973 of New York City. This dataset serves as a benchmark
dataset in random forest algorithms. John M. Chambers is a Canadian statis-
tician, who developed the S programming language, and is a core member of
the R programming language project.
[3] R. A. Fisher, 1936: The use of multiple measurements in taxonomic problems. Annals
of Eugenics, 7(2), 179–188.
This paper contains the iris data frequently used in the teaching of machine
learning. The current name of the journal Annals of Eugenics is Annals of
Human Genetics. The author, Ronald Aylmer Fisher (1890–1962), was a British
mathematician, statistician, biologist, and geneticist. Modern descriptions of
this dataset and its applications in statistics and machine learning can be found
on numerous websites, such as
www.angela1c.com/projects/iris_project/the-iris-dataset/
[4] A. Géron, 2017: Hands-on Machine Learning with Scikit-Learn and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly.
[5] W. McCulloch and W. Pitts, 1943: A logical calculus of ideas immanent in nervous
activity. Bulletin of Mathematical Biophysics 5, 115–133.
This seminal work is considered the first paper to have created modern NN
theory. The paper is highly mathematical, contains ten theorems, and includes
a figure of neurons.
This is a very famous machine learning book and includes in-depth descrip-
tions of the commonly used ML algorithms.
Exercises
9.1 Use the K-means method to conduct a cluster analysis for the following five points in
the 2D space: P1 (1, 1), P2 (2, 2), P3 (2, 3), P4 (3, 4), P5 (4, 4). Assume K = 2. Plot a figure
similar to Figure 9.1. What is the final tWCSS equal to?
9.2 Given the following three points: P1 (1, 1), P2 (2, 2), P3 (2, 3), and given K = 2, conduct a
K-means cluster analysis following the method presented in Sub-section 9.1.1. Plot your
K-means clustering result.
9.3 Use the K-means method to conduct a cluster analysis for the daily Tmin and WDF2
data of the Miami International Airport in 2015, following the method presented in Sub-
section 9.1.3. Here, Tmin is the daily minimum temperature, and WDF2 denotes the
direction [in degrees] of the fastest 2-minute wind in a day. You can obtain the data from
the NOAA Climate Data Online website, or from the file MiamiIntlAirport2001_2020.csv
at the book website www.climatestatistics.org.
9.4 Use the K-means method to conduct a cluster analysis for the daily Tmax and WDF2
data of the Miami International Airport in 2015.
9.5 Use the K-means method to conduct a cluster analysis for the daily Tmin and Tmax data
of the Miami International Airport in 2015.
9.6 Use the K-means method to conduct a cluster analysis for the daily Tmin and PRCP
data of the Miami International Airport in 2015. Here, PRCP denotes the daily total
precipitation.
9.7 Identify two climate parameters of your interest, find the corresponding data online, and
conduct a K-means cluster analysis similar to the previous problem for your data.
9.8 Following the method presented in Subsection 9.2.1, conduct an SVM analysis for the
following five points in a 2D space: P1 (1, 1), P2 (2, 2), P3 (2, 3), P4 (3, 4), P5 (4, 4). The first
three points are labeled 1 and the last two are labeled 2. What are w, b, and Dm ? What
points are the support vectors?
9.9 Two new points are introduced to the previous problem: Q1 (1.5, 1) and Q2 (3, 3). Use the
SVM trained in the previous problem to find out which point belongs to which category.
9.10 From the Internet, download the historical daily data of minimum temperature (Tmin)
and average wind speed (AWND) for a month and a location of your interest. Label
your data as 1 if the daily precipitation is greater than 0.5 millimeter, and 0 otherwise.
Conduct an SVM analysis for your labeled data.
9.11 SVM forecast: From the Internet, download the historical daily data of minimum tem-
perature and sea level pressure for a month and a location of your interest. Label your
data as 1 if the total precipitation of the next day is greater than 0.5 millimeter, and
0 otherwise. Conduct an SVM analysis for your labeled data. Given the daily data of
minimum temperature and sea level pressure of a day in the same month but a different
year, use your trained SVM model to forecast whether the next day is rainy or not, i.e.,
to determine whether the next day is 1 or 0.
9.12 In order to improve the accuracy of your prediction, can you use the data of multiple
years to train your SVM model? Perform numerical experiments and show your results.
9.13 The first 50 in the 150 rows of the R. A. Fisher iris data are for setosa, rows 51–100
are for versicolor, and 101–150 are for virginica. Use the data rows 1–40, 51–90, and
101–140 to train an RF model, following the method in Subsection 9.3.1. Use the RF
model to predict the species of the remaining data of lengths and widths of petal and
sepal. You can download the R. A. Fisher dataset iris.csv from the book website or
use the data already built in R or Python software packages.
9.14 For the 150 rows of the R. A. Fisher iris data, use only 20% of the data from each species
to train your RF model. Select another 10% of the iris data of lengths and widths as the
new data for prediction. Then use the RF model to predict the species of the new data.
Discuss the errors of your RF model and your prediction.
9.15 Plot a decision tree like Figure 9.11 for the previous exercise problem.
9.16 For the same R. A. Fisher dataset, design your own RF training and prediction. Discuss
the RF prediction accuracy for this problem.
9.17 RF forecast: From the Internet, download the historical daily data of minimum temper-
ature and sea level pressure for a month and a location of your interest. Label your data
as 1 if the total precipitation of the next day is greater than 0.5 millimeter, and 0 other-
wise. Conduct an RF analysis for your labeled data. Given the daily data of minimum
temperature and sea level pressure of a day in the same month but a different year, use
your trained RF model to forecast whether the next day is rainy or not. Is your forecast
accurate? How can you improve your forecast?
9.18 Find a climate data time series with missing data and fill in the missing data using the
RF regression method described in Subsection 9.3.2.
9.19 The first 50 in the 150 rows of the R. A. Fisher iris data are for setosa, rows 51–100 are
for versicolor, and rows 101–150 are for virginica. Use the data rows 1–40, 51–90, and
101–140 to train an NN model, following the method in Subsection 9.4.2. Use the NN
model to predict the species of the remaining data of lengths and widths of petal and
sepal.
9.20 For the 150 rows of the R. A. Fisher iris data, use only 20% of the data from each species to
train your NN model. Select another 10% of the iris data of lengths and widths as the
new data for prediction. Then use the NN model to predict the species of the new data.
Discuss the errors of your NN model and your prediction.
9.21 NN forecast: From the Internet, download the historical daily data of minimum temper-
ature and sea level pressure for a month and a location of your interest. Label your data
as 1 if the total precipitation of the next day is greater than 0.5 millimeter, and 0 other-
wise. Conduct an NN analysis for your labeled data. Given the daily data of minimum
temperature and sea level pressure of a day in the same month but a different year, use
your trained NN model to forecast whether the next day is rainy or not. Is your forecast
accurate? Compare your NN forecast with your RF and SVM forecasts.
9.22 Design a machine learning project for yourself or others. What are your training data?
What are your test data? What is your training model? What is your training model
error? How would you assess your prediction error?