
Python - Concepts and Work Environment 89

Table 6.2.1.1: Some of the Python packages commonly used in climate computing and visualisation.

Package          Purpose
PyTROLL¹¹        Processing of earth observation satellite data.
Scikit Learn¹²   Machine learning library.
SciPy¹³          Libraries for mathematics, science, and engineering.

The modular Python building blocks concept can be taken one step further by
combining packages to create even larger and more complex applications. This
creates a semi-layered structure of lower to higher level Python packages (Figure
6.2.1.1). The links between packages are further explored in the following section
(Section 6.2.2).

Figure 6.2.1.1: Schematic showing the semi-layered structure of Python packages from lower-level
(bottom) to higher-level (top). Examples of dependencies are indicated for Iris (red lines) and MetPy
(blue lines).

¹¹http://pytroll.github.io
¹²https://scikit-learn.org/stable
¹³https://www.scipy.org

6.2.2 Package Dependencies


Figure 6.2.1.1 shows the semi-layered structure of Python packages. If a package
depends on other packages then these other packages are called dependencies. For
example, the dependencies for Iris and MetPy are indicated by red and blue lines,
respectively, in Figure 6.2.1.1.
The Iris and MetPy packages share some dependencies such as Cartopy, Matplotlib,
NumPy and SciPy (Figure 6.2.1.1). In theory this could become a problem because Iris
and MetPy are developed independently and they may require different versions of
shared dependency packages in order to work properly. In the past this often created
problems, breaking one of the two packages when used concurrently. Cascading or
unresolvable dependency issues were referred to as dependency hell¹⁴.
Nowadays, dependency hell can be avoided most of the time by using smart package
managers and sourcing Python packages from well maintained repositories which
are discussed in the following section (Section 6.2.3).
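
The version conflict described in this section can be illustrated with a minimal sketch. The release numbers and the requirements of the two packages below are hypothetical and chosen only for illustration; real package managers use far more elaborate version rules and solvers.

```python
def parse_version(text):
    """Turn a version string such as '1.18.2' into a comparable tuple (1, 18, 2)."""
    return tuple(int(part) for part in text.split("."))


def satisfying_versions(available, requirements):
    """Return the available versions that satisfy every (minimum, maximum) range.

    Each range includes the minimum version and excludes the maximum version.
    """
    matches = []
    for version in available:
        parsed = parse_version(version)
        if all(parse_version(low) <= parsed < parse_version(high)
               for low, high in requirements):
            matches.append(version)
    return matches


# Hypothetical releases of a shared dependency (not real release data).
available = ["1.14.0", "1.16.0", "1.18.0"]

# Hypothetical requirements of two independently developed packages.
package_a_needs = ("1.16.0", "2.0.0")
package_b_needs = ("1.10.0", "1.16.0")

compatible = satisfying_versions(available, [package_a_needs])
conflict = satisfying_versions(available, [package_a_needs, package_b_needs])
print(compatible)  # versions acceptable to package A on its own
print(conflict)    # empty list: no single version satisfies both packages
```

An empty result for the combined requirements is the situation in which one of the two packages would break if both were installed side by side.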

6.2.3 Package Managers, Repositories and Channels


Package managers are utilities which can install, upgrade, downgrade or remove
packages. They look for packages in online repositories. Repositories are online
storage locations to which users and organisations can upload the Python packages
they developed.
Repositories tend to have multiple channels. Channels are online directories which
can be specified using URLs and which store selections of packages. Sometimes,
specific packages can be found in multiple channels.
When installing a package the package manager will look through a pre-defined list
of channels until it finds the package to be installed (using the first it finds if multiple
channels supply the package). It subsequently downloads the package along with its
dependencies and installs them. If package managers are pointed to well maintained
repositories then any package dependency issues (Section 6.2.2) will be resolved by
the package manager.
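
The channel lookup described above (first match wins) can be sketched in a few lines of Python. The channel names and their contents are hypothetical; this is an illustration of the search order only, not of how any particular package manager is implemented.

```python
def find_package(package, channels):
    """Return the name of the first channel that supplies the package, else None."""
    for channel_name, packages in channels:
        if package in packages:
            return channel_name
    return None


# Hypothetical channels, listed in order of priority (first = highest).
channels = [
    ("channel-a", {"numpy", "scipy"}),
    ("channel-b", {"numpy", "iris"}),
]

first_numpy = find_package("numpy", channels)   # supplied by both channels
first_iris = find_package("iris", channels)     # supplied by channel-b only
not_found = find_package("unknown", channels)   # supplied by no channel
print(first_numpy, first_iris, not_found)
```

Because the search stops at the first match, the order of the channel list decides which copy of a package is installed when several channels supply it.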
¹⁴https://en.wikipedia.org/wiki/Dependency_hell

The standard package manager for Python is called Pip¹⁵ (recursive acronym for Pip
Installs Packages). By default, Pip installs packages from the Python Package Index
(PyPI¹⁶) repository.
A Python distribution specifically developed for scientific computing is Anaconda¹⁷.
It comes with its own package manager called Conda¹⁸ and by default installs
packages from its standard Anaconda Repository¹⁹.
The use of the package manager Conda is recommended for the purpose of climate
computations as it is a robust, well-supported and versatile application that combines
both package management and a manager for virtual environments (introduced in
Section 6.2.4). The use of Conda to install and manage packages will be discussed in
Section 6.3.3.

6.2.4 Python Virtual Environments


Most climate computations are done on a remote server in order to handle the large
data volumes and make use of the increased processing power. Some default low-level
system packages will have been installed as part of the installation of Python on the
system. These can be used directly without any further installation of packages.
Most of the time, however, users will be working with 3rd party packages from online
repositories (Section 6.2.3). The problem is that different users may require different
packages or package versions. In addition, users would not have the admin rights to
install Python packages themselves directly onto the Linux server.
To get around this problem and to be able to give users full control over the packages
required for their work Python virtual environments are used. Python virtual
environments allow users to create isolated environments that can be configured
to the needs of a specific project. Project specific settings may include Python
version (perhaps Python 2.7 instead of Python 3 to run some legacy code), preferred
repository channels, packages and package versions.
The two main Python virtual environment managers are Virtualenv²⁰ and Conda²¹.
Conda is recommended here as it combines both a Python virtual environment
manager and a Python package manager. How to use Conda to set up and manage
Python virtual environments will be discussed in Section 6.3.1.
¹⁵https://pip.pypa.io/en/stable
¹⁶https://pypi.org
¹⁷https://www.anaconda.com/distribution
¹⁸https://docs.conda.io/en/latest
¹⁹https://repo.anaconda.com/pkgs
²⁰https://virtualenv.pypa.io/en/latest
²¹https://docs.conda.io/en/latest
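
To give a feel for what an isolated environment is on disk, the following minimal sketch creates one with venv, the virtual environment module of the Python standard library. Note that venv is used here only as a stand-in for illustration; it is neither Virtualenv nor Conda, and the environment name myenv and the temporary location are arbitrary choices.

```python
import tempfile
import venv
from pathlib import Path

# Create a throw-away environment in a temporary directory so that nothing
# is left behind once the example has finished.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "myenv"
    venv.create(env_dir, with_pip=False)  # skip installing pip to keep it fast

    # Every environment created by venv contains a pyvenv.cfg file that
    # records, among other things, which Python installation it is based on.
    config_exists = (env_dir / "pyvenv.cfg").is_file()
    print(config_exists)
```

The environment is simply a directory with its own configuration and package location, which is why it can be created and deleted without administrator rights.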

6.3 Conda
Conda integrates both a Python package manager and a manager for Python virtual
environments. Python virtual environments created with Conda are referred to as
Conda environments. The following sections provide a brief introduction to creating
and managing Conda environments as well as installing packages inside a Conda
environment. More details on the usage of Conda can be found in the Conda User
Guide²².

6.3.1 Creating a Conda Environment


The reasons for creating Conda environments have been discussed in Section 6.2.4.
Conda environments can be created from the Unix command line. The following
command creates a Conda environment named myenv.

conda create --name myenv

The above command will create a Conda environment that uses the default Python
version. To find out what the default Python version is the command python --version
can be used. If a Python version different from the default version is required then the
Python version can be specified as part of the Conda environment creation command
as done in the following example.

conda create --name myenv python=3.6

The installation of the Conda environment may take a few minutes. Executing the
above command will produce output in terminal similar to the following.

²²https://conda.io/projects/conda/en/latest/user-guide/index.html

Collecting package metadata (current_repodata.json): done


Solving environment: done

==> WARNING: A newer version of conda exists. <==


current version: 4.7.10
latest version: 4.8.2

Please update conda by running

$ conda update -n base -c defaults conda

## Package Plan ##

environment location: /ouce-home/staff/worc1870/.conda/envs/myenv

added / updated specs:


- python=3.6

The following packages will be downloaded:

package | build
---------------------------|-----------------
_libgcc_mutex-0.1 | conda_forge 3 KB conda-forge
_openmp_mutex-4.5 | 0_gnu 435 KB conda-forge
ca-certificates-2019.11.28 | hecc5488_0 145 KB conda-forge
certifi-2019.11.28 | py36h9f0ad1d_1 149 KB conda-forge
ld_impl_linux-64-2.33.1 | h53a641e_8 589 KB conda-forge
libgcc-ng-9.2.0 | h24d8f2e_2 8.2 MB conda-forge
libgomp-9.2.0 | h24d8f2e_2 816 KB conda-forge
libstdcxx-ng-9.2.0 | hdf63c60_2 4.5 MB conda-forge
pip-20.0.2 | py_2 1.0 MB conda-forge
python-3.6.10 |h9d8adfe_1009_cpython 34.1 MB conda-forge
python_abi-3.6 | 1_cp36m 4 KB conda-forge
setuptools-46.0.0 | py36h9f0ad1d_1 653 KB conda-forge
sqlite-3.30.1 | hcee41ef_0 2.0 MB conda-forge
tk-8.6.10 | hed695b0_0 3.2 MB conda-forge

wheel-0.34.2 | py_1 24 KB conda-forge


zlib-1.2.11 | h516909a_1006 105 KB conda-forge
------------------------------------------------------------
Total: 55.8 MB

The following NEW packages will be INSTALLED:

_libgcc_mutex conda-forge/linux-64::_libgcc_mutex-0.1-conda_forge
_openmp_mutex conda-forge/linux-64::_openmp_mutex-4.5-0_gnu
ca-certificates conda-forge/linux-64::ca-certificates-2019.11.28-hecc5488_0
certifi conda-forge/linux-64::certifi-2019.11.28-py36h9f0ad1d_1
ld_impl_linux-64 conda-forge/linux-64::ld_impl_linux-64-2.33.1-h53a641e_8
libffi conda-forge/linux-64::libffi-3.2.1-he1b5a44_1006
libgcc-ng conda-forge/linux-64::libgcc-ng-9.2.0-h24d8f2e_2
libgomp conda-forge/linux-64::libgomp-9.2.0-h24d8f2e_2
libstdcxx-ng conda-forge/linux-64::libstdcxx-ng-9.2.0-hdf63c60_2
ncurses conda-forge/linux-64::ncurses-6.1-hf484d3e_1002
openssl conda-forge/linux-64::openssl-1.1.1d-h516909a_0
pip conda-forge/noarch::pip-20.0.2-py_2
python conda-forge/linux-64::python-3.6.10-h9d8adfe_1009_cpython
python_abi conda-forge/linux-64::python_abi-3.6-1_cp36m
readline conda-forge/linux-64::readline-8.0-hf8c457e_0
setuptools conda-forge/linux-64::setuptools-46.0.0-py36h9f0ad1d_1
sqlite conda-forge/linux-64::sqlite-3.30.1-hcee41ef_0
tk conda-forge/linux-64::tk-8.6.10-hed695b0_0
wheel conda-forge/noarch::wheel-0.34.2-py_1
xz conda-forge/linux-64::xz-5.2.4-h14c3975_1001
zlib conda-forge/linux-64::zlib-1.2.11-h516909a_1006

Proceed ([y]/n)? y

Downloading and Extracting Packages


libgomp-9.2.0 | 816 KB | ###################################### | 100%
zlib-1.2.11 | 105 KB | ###################################### | 100%
ca-certificates-2019 | 145 KB | ###################################### | 100%
ld_impl_linux-64-2.3 | 589 KB | ###################################### | 100%
sqlite-3.30.1 | 2.0 MB | ###################################### | 100%
python-3.6.10 | 34.1 MB | ###################################### | 100%

wheel-0.34.2 | 24 KB | ###################################### | 100%


_libgcc_mutex-0.1 | 3 KB | ###################################### | 100%
certifi-2019.11.28 | 149 KB | ###################################### | 100%
pip-20.0.2 | 1.0 MB | ###################################### | 100%
libstdcxx-ng-9.2.0 | 4.5 MB | ###################################### | 100%
setuptools-46.0.0 | 653 KB | ###################################### | 100%
python_abi-3.6 | 4 KB | ###################################### | 100%
libgcc-ng-9.2.0 | 8.2 MB | ###################################### | 100%
_openmp_mutex-4.5 | 435 KB | ###################################### | 100%
tk-8.6.10 | 3.2 MB | ###################################### | 100%
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
# $ conda activate myenv
#
# To deactivate an active environment, use
#
# $ conda deactivate

In the warning near the top of the output Conda informs the user that a newer Conda
version is available and lists the command that can be used to update Conda (conda
update -n base -c defaults conda). It is unlikely that Linux users will have
administrator rights on the server to update Conda. Therefore, there is no need to
take action here (perhaps inform the system administrator).
The Package Plan section provides a list of basic default packages that will be
downloaded followed by a list of new packages that will be installed. Note that the
list of packages to be installed is longer than the list of packages to be downloaded.
This is because some of the packages to be installed (here libffi, ncurses, openssl,
readline and xz) are already available in the local Conda package cache and do not
need to be downloaded again.
Once the user confirms the installation at the Proceed ([y]/n)? prompt Conda starts
downloading and installing the packages. The progress for each package can be
followed in the terminal window.

The download and installation of packages may take some time. Be patient.

It is advisable to check the terminal output for any error messages. The Preparing
transaction, Verifying transaction and Executing transaction steps near the end of
the output should all show done if completed successfully.
Finally, the commands to activate and deactivate the Conda environment are shown
in the last lines of the output (covered in Section 6.3.2).

6.3.2 Activating and Deactivating Conda Environments


In order to use a Conda environment it needs to be activated. To activate a Conda
environment named myenv execute the following command on the Unix command
line.

conda activate myenv

Once the above command has been executed the Conda environment name (in this
example myenv) will appear in brackets at the beginning of the Unix command prompt
similar to the following.

(myenv)abcd1234@linux:~$

To deactivate a Conda environment use the following command. The Conda envi-
ronment name does not need to be provided to deactivate the environment.

conda deactivate

Deactivating a Conda environment using the above command will remove the Conda
environment name (in this example myenv) from the beginning of the Unix command
prompt. The Unix command prompt will appear as normal similar to the following.

abcd1234@linux:~$

Activating a Conda environment will not change the behaviour of the Unix
command line. All Unix commands can be used as normal.
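
Whether the intended environment is active can also be checked from inside Python. The following is a minimal sketch; it assumes that Conda sets the environment variable CONDA_DEFAULT_ENV on activation, and the helper name describe_python is made up for this example.

```python
import os
import platform
import sys


def describe_python():
    """Collect basic facts about the Python interpreter that is currently running."""
    return {
        "version": platform.python_version(),
        "executable": sys.executable,
        # Name of the active Conda environment, or None if none is active.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        # True only inside venv/Virtualenv-style environments, not Conda ones.
        "in_venv": sys.prefix != sys.base_prefix,
    }


info = describe_python()
for key, value in info.items():
    print(f"{key}: {value}")
```

Inside an activated Conda environment named myenv the executable path would be expected to point into the environment directory and conda_env to read myenv.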

6.3.3 Installing Python Packages


Once the Conda environment has been activated (Section 6.3.2) Python packages can
be installed inside the environment. As discussed in Section 6.2.3 Python packages
will be sourced from online repository channels. For the purpose of climate
computations it is recommended to point Conda to the community-maintained
conda-forge channel²³. Executing the following command inside an activated Conda
environment will set conda-forge as the default channel.

conda config --add channels conda-forge

When installing Python packages Conda will now by default try to download the
packages and their dependencies from the conda-forge channel first. Setting the
default channel should be done before any packages are installed. Setting the
default channel has to be done only once.
The basic command for installing a Python package is conda install followed by the
package name. A list of Python packages commonly used in climate computations
can be found in Table 6.2.1.1. For instance, to install the Iris package (available from
the conda-forge channel) and its dependencies execute the following command on
the Unix command line inside the activated Conda environment.

conda install iris

Installing a higher-level package such as Iris will also install most of the
recommended packages required for climate computing as dependencies
including Cartopy, Matplotlib, NumPy, netCDF4 and SciPy.

To list all packages currently installed in an activated Conda environment use the
following command.

conda list

Executing the above command inside the Conda environment myenv will generate
output similar to the following.
²³https://conda-forge.org

# packages in environment at /home/staff/abcd1234/.conda/envs/myenv:


#
# Name Version Build Channel
_libgcc_mutex 0.1 main
antlr-python-runtime 4.7.2 py36_1000 conda-forge
appdirs 1.4.3 py_1 conda-forge
asn1crypto 0.24.0 py36_1003 conda-forge
attrs 19.3.0 py_0 conda-forge
backcall 0.1.0 py_0 conda-forge
bleach 3.1.0 py_0 conda-forge
bokeh 1.3.4 py36_0 conda-forge
bzip2 1.0.8 h516909a_0 conda-forge
ca-certificates 2019.9.11 hecc5488_0 conda-forge
cartopy 0.17.0 py36he1be148_1005 conda-forge
certifi 2019.9.11 py36_0 conda-forge
cf-units 2.1.3 py36hc1659b7_0 conda-forge
cffi 1.12.3 py36h8022711_0 conda-forge
cftime 1.0.3.4 py36hd352d35_1001 conda-forge
chardet 3.0.4 py36_1003 conda-forge
cis 1.7.1 py36_0 conda-forge
click 7.0 py_0 conda-forge
cloudpickle 1.2.1 py_0 conda-forge
cryptography 2.7 py36h72c5cf5_0 conda-forge
curl 7.65.3 hf8cf82a_0 conda-forge
...

The list includes package details such as package name, version number, build
string and the channel the package was sourced from. The build string is used to
differentiate builds of packages with otherwise identical names and version numbers.
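
If the package list needs to be processed further, its columns can be split in Python. The sketch below works on a few lines copied from the listing above; the function name parse_conda_list is made up for this example, and it assumes the whitespace-separated layout shown, in which a column (such as the channel) may be empty.

```python
SAMPLE = """\
# packages in environment at /home/staff/abcd1234/.conda/envs/myenv:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
cartopy 0.17.0 py36he1be148_1005 conda-forge
cftime 1.0.3.4 py36hd352d35_1001 conda-forge
"""


def parse_conda_list(text):
    """Split each package line into name, version, build string and channel."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the commented header
        fields = line.split()
        fields += [""] * (4 - len(fields))  # pad columns that are missing
        name, version, build, channel = fields[:4]
        records.append(
            {"name": name, "version": version, "build": build, "channel": channel}
        )
    return records


packages = parse_conda_list(SAMPLE)
versions = {package["name"]: package["version"] for package in packages}
print(versions["cartopy"])
```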

6.3.4 Listing and Deleting Conda Environments


To list all existing Conda environments the following command can be used.

conda info --envs

Executing the above command generates output similar to the following.



# conda environments:
#
base /opt/miniconda
myenv /ouce-home/staff/worc1870/.conda/envs/myenv
ouce /ouce-home/staff/worc1870/.conda/envs/ouce
test /ouce-home/staff/worc1870/.conda/envs/test

The first Conda environment listed is named base. This is the default environment
that was created when Conda was installed on the system. It should not be used
for installing packages or climate data analysis. Instead create additional Conda
environments.
In the above example, three more Conda environments are listed named myenv, ouce
and test. All files associated with these environments are located in the directories
indicated by their respective paths.
To permanently delete the Conda environment named test and all its associated files
the following command can be used.

conda remove --name test --all

6.4 Python Code Development


Code written in the Python programming language is saved in plain text files (see
Section 2.6.1) with file names having the extension .py. In principle, these files are
no different from any other text files. Therefore, any software that allows users to
create and edit plain text files can be used for Python code development. However,
there are a few things to consider and finding the best way of developing, editing
and executing Python code can be challenging. In the following sub-sections some
concepts and options for developing Python code are discussed including Python
code editors (Section 6.4.1), integrated development environments (Section 6.4.2) and
browser-based solutions (Section 6.4.3).

6.4.1 Python Code Editors


Editing text files using text editors has been introduced in Section 3.4.3. If a text editor
is used to develop programming code then it may be referred to as a code editor but
a clear distinction between the two terms is not usually made. Most Python code
editors will come with some basic features such as syntax highlighting and code
formatting. In more advanced code editors additional features can often be enabled
or installed via extensions or plug-ins.
To edit Python code files saved on the server, either a text editor installed on the local
machine or one installed on the remote server can be used. If a stable connection to
the server is available and the home directory on the remote server can be mapped
on the local machine (see Section 3.3.6) then a locally installed editor can be used
(Table 6.4.1.1). Python files can be created or edited on the server using the locally
installed editor by navigating to the file via the mapped network drive.

Table 6.4.1.1: Locally installed text editors commonly used for Python coding (available for all
platforms).

Name                   Users                Advertised as
Atom²⁴                 Beginner, Advanced   A hackable text editor for the 21st Century
Emacs²⁵                Advanced             An extensible, customizable, free/libre text editor and more
Sublime Text²⁶         Beginner, Advanced   A sophisticated text editor for code, markup and prose
Visual Studio Code²⁷   Advanced             Code editing - Redefined - Free - Built on open source - Runs everywhere

If the home directory on the server cannot be mapped on the local machine then
a text editor installed on the server should be used. There are two ways in which a
server-side installed text editor may open. First, it may open a graphical user interface
(GUI) in which case the X Window Manager and X11 forwarding need to be set up
and configured correctly (see Section 3.2.3).
Second, a text editor may open inside the terminal window. Those may be referred
to as screen-based or screen-orientated editors. Editors opening inside the terminal
window are recommended when the internet connection is slow as interactions
with GUIs may be slow and choppy due to the graphical information having to be
transferred between the local display and the server (see Figure 3.2.1 and Section
3.2.3). A list of common server-side installed text editors can be found in Table 6.4.1.2.
²⁴https://atom.io
²⁵https://www.gnu.org/software/emacs
²⁶https://www.sublimetext.com
²⁷https://code.visualstudio.com

In general, it is discouraged to use text editors installed on the remote
server that open a GUI as they often tend to be slow in responding to mouse
interactions.

Table 6.4.1.2: Server-side installed text editors commonly used for Python coding.

Name       Users      Advertised as                                                  GUI   Terminal
Emacs²⁸    Advanced   An extensible, customizable, free/libre text editor and more   x     x
Gedit²⁹    Beginner   The GNOME text editor                                          x
jEdit³⁰    Beginner   Programmer’s text editor                                       x
nano³¹     Beginner   The GNU nano text editor                                             x
vi/vim³²   Advanced   The ubiquitous text editor                                           x

For code editors that open inside the terminal window such as Emacs or
vi/vim it is advisable to take the time to go through one of the many online
tutorials in order to learn the keyboard short-cuts required to use them.
Keyboard short-cut cheat sheets are also helpful for a beginner.

²⁸https://www.gnu.org/software/emacs
²⁹https://wiki.gnome.org/Apps/Gedit
³⁰http://www.jedit.org
³¹https://www.nano-editor.org
³²https://www.vim.org

6.4.2 Python IDEs

Integrated development environment (IDE) is a term given to software that provides
a comprehensive list of features that aid the development of software applications.
Some IDEs will support multiple coding languages whereas others were developed
with a specific coding language in mind. A Python IDE provides an environment
for the development of Python code that is much more comprehensive than a simple
code editor. The following list summarises some of the features commonly found in
Python IDEs.

• Advanced code editor (auto-completion, syntax highlighting, code formatting)
• Python or IPython command line
• Version control
• Linting (analysing code for programming errors)
• Debugging support
• Visualisation window
• Variables window

Python IDEs are usually installed on the local machine. They can be configured to
connect to the remote server. Care should be taken during installation as some IDEs
(e.g., Spyder) will install their own Python executables during the installation process.
Pointing the IDE towards the correct Python executable (e.g., on the server inside a
Conda environment) can be challenging for a beginner.
Due to the complexity and wide range of features IDEs may be slower than simple
but feature-rich code editors. A list of some of the more popular Python IDEs is
provided in Table 6.4.2.1.

Table 6.4.2.1: Popular integrated development environments (IDEs).

Name        Users                Advertised as
PyCharm³³   Advanced             The Python IDE for professional developers
PyDev³⁴     Advanced             Python plugin for Eclipse³⁵ IDE
Spyder³⁶    Beginner, Advanced   The scientific Python development environment
Thonny³⁷    Beginner             Python IDE for beginners
Wing³⁸      Advanced             The intelligent development environment for Python
³³https://www.jetbrains.com/pycharm
³⁴https://www.pydev.org
³⁵https://www.eclipse.org/ide
³⁶https://www.spyder-ide.org
³⁷https://thonny.org

Students new to Python programming may be overwhelmed by the large
number of features included in Python IDEs and configuring IDEs correctly
can be challenging. IDEs are still worth exploring as many features are
incredibly helpful (e.g., integrated version control).

6.4.3 Browser-based Python Code Editing


There are also some browser-based Python code editor and IDE solutions available.
Editing code in a browser has the advantage of being independent of the local
operating system and server-based code can be accessed more easily from anywhere.

6.4.3.1 Jupyter Notebooks

The Jupyter³⁹ project is an open-source non-profit project that develops web
applications for scientific computing. One of its products is Jupyter Notebook
which allows users to develop and execute programming code in a web browser
environment.
In order to use Jupyter Notebooks for Python code development on the remote server
it has to be set up and configured by the system administrator on the remote server
side. Therefore, it is not further discussed here. To find out more about how to use
Jupyter Notebooks consult the documentation⁴⁰.
³⁸https://wingware.com
³⁹https://jupyter.org
⁴⁰https://jupyter-notebook.readthedocs.io/en/stable/

6.4.3.2 VS Code (code-server)

The code-server⁴¹ project makes it possible to run the VS Code editor, which can
serve as an IDE (Section 6.4.2), in a web browser. code-server has to be set up and
configured by the system administrator on the remote server side and is not
discussed further here.
⁴¹https://github.com/cdr/code-server
7. Python - Programming Basics
7.1 Basic Python Programming Building Blocks
There are coding features that can be found in almost all programming languages.
These include, for instance, how to define variables, change variable types, write
loops for batch processing and the use of conditional statements. In the following
sub-sections these more general coding concepts will be introduced for the Python
programming language.
This section is by no means a complete or in-depth coverage of the Python program-
ming language and the main packages. But it covers the most essential parts to get
started with a focus on the use of Python as part of climate data analysis. For a more
comprehensive review of Python it is advisable to get one of the many excellent
Python programming books or work through online resources and video tutorials.

7.1.1 Declaring Variables


A variable can be seen as a container into which a value or object is saved. Declaring
a variable is done by using a variable name on the left of an equal sign (=) and the
value or object to be saved on the right of the equal sign. In the following examples
the variable names a, b and c are assigned the values 7, 5.8 and Climate science is
cool!, respectively.

a = 7
b = 5.8
c = 'Climate science is cool!'

Similarly, all variables could be declared in a single line as follows.



a, b, c = 7, 5.8, 'Climate science is cool!'

Multiple variables can be assigned the same value in a single line as follows.

a = b = c = 100

Variables can be assigned new values at any point in a script but the previous value
will be lost.
Python determines the type of a variable automatically from the value assigned to
it; the type does not need to be declared explicitly when the variable is created (see
following Section 7.1.2 for variable types).
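
The following short sketch pulls the points of this section together: a shared value is given to several names with chained assignment, several values are assigned in one line, and the type of a variable follows the value it currently holds. The variable names x, y and z are arbitrary.

```python
# Chained assignment gives the same value to several variable names.
a = b = c = 100
print(a, b, c)

# Several variables can receive different values in a single line.
x, y, z = 7, 5.8, 'Climate science is cool!'
type_names = [type(x).__name__, type(y).__name__, type(z).__name__]
print(type_names)

# Assigning a new value replaces the old one; the type follows the new value.
x = 'seven'
new_type_name = type(x).__name__
print(new_type_name)
```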

7.1.2 Variable Types and Conversion Between them


Table 7.1.2.1 lists the main variable types used in Python. In order to check the type
of a variable the built-in type() and print() functions can be used as shown below
(<var> is the name of the variable in question). The variable types are discussed in
more detail in the following subsections.

print(type(<var>))

The print() function prints Python objects to the text stream, meaning in
most cases the terminal window.

Table 7.1.2.1: Main Python Variable Types.

Variable Type   Function   Description                                                       Examples
Number                     Number of type integer, float or complex                          -100, 6203547, 0.00076, 4e+37J
String                     Character string                                                  Joe, Hello World, \n (new line)
List                       Elements are ordered, mutable and can be referenced by indexes    [2, 3.2, 'GE', [-1.2, 't', 5]]
Dictionary                 Elements are unordered, mutable and can be referenced by keys     {'a':2, 'b':3.2, 'c':[-1.2, 't']}
Tuple                      Same as list but immutable                                        (3.2, 'hello')
Boolean                    Logical value: 0 (false) and 1 (true)                             True, False

7.1.2.1 Numbers

Numbers are a very common variable type especially in the field of climate sciences.
Python represents numbers as one of three number types: integer, float and complex.
In Python 3 there is no fixed maximum size for integers; the largest possible integer
value is limited only by the available memory. The value 2^63 - 1, which is equal to
9223372036854775807, is sometimes quoted as the maximum on a 64-bit platform, but it
is the value of sys.maxsize (the largest index a Python container can have) rather
than a limit on integer values.

Floats or floating point values are numbers which have decimals. Several float
number notation formats are possible. For example, 0.0, 13.5, -273.15, 300., or
54.921e+10.

Complex numbers are less common in climate sciences.
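
The three number types can be tried out with a minimal sketch such as the following. The variable names are arbitrary, and the value printed for sys.maxsize depends on the platform.

```python
import sys

# Integers are not capped at 2**63 - 1: adding 1 simply gives a larger integer.
big = 2**63 - 1
bigger = big + 1
print(bigger)

# sys.maxsize is the largest container index, not the largest integer value.
print(sys.maxsize)

# Scientific notation creates a float; a trailing j (or J) creates a complex number.
f = 54.921e+10
z = 4e+37j
number_types = [type(bigger).__name__, type(f).__name__, type(z).__name__]
print(number_types)
```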


The built-in functions int(x), float(x) and complex(x) can be used to convert between
the number types wherein x is the value to be converted. x can also be of the type
string as long as the string characters are numeric literals (characters made up of
digits 0 to 9) with a decimal point (.) if it is a float.
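
The conversion functions can be sketched as follows. Two details are worth noting: int() cuts off the decimals of a float rather than rounding, and a string containing a decimal point cannot be passed to int() directly. The variable names are chosen for illustration only.

```python
from_string = int('42')        # string of digits to integer
from_decimal = float('13.5')   # string with a decimal point to float
truncated = int(13.9)          # decimals are cut off, not rounded
as_float = float(7)            # integer to float
as_complex = complex(2)        # integer to complex, imaginary part is zero
print(from_string, from_decimal, truncated, as_float, as_complex)

# A string with a decimal point is not a valid integer literal.
try:
    int('13.5')
    conversion_failed = False
except ValueError as error:
    conversion_failed = True
    print('ValueError:', error)

# Converting to float first and then to integer does work.
two_step = int(float('13.5'))
print(two_step)
```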

In computational climate science many of the data values will be imported
as NumPy arrays rather than as one of Python’s native number formats
(see Section 7.x.x for NumPy arrays).

7.1.2.2 Strings

A string is a sequence of characters within single quotes (') or double
quotes ("). In Python scripts, strings are, for instance, used in plot titles, axis labels
or annotations on a plot.
Sub-strings can be selected using indexing whereby each character is associated with
one index. Indexing starts with 0. The following examples show how indices can be
applied.

s = 'Global Mean Temperature'

print(s)        # Print whole string
print(s[0])     # Print 1st character
print(s[7:11])  # Print 8th to 11th character, inclusive
print(s[7:])    # Print 8th to last character
print(s[:8])    # Print 1st to 8th character
print(s[-1])    # Print last character

Executing the above code will generate the following output.

Global Mean Temperature
G
Mean
Mean Temperature
Global M
e

Strings can be joined by using the plus symbol (+). Also, numerical values can
be converted to a string using the str(x) function wherein x is the number to be
converted. Both concepts are applied in the following example.

print(s+' is: '+str(14.9)+' degC')

The above code will generate the following output. Note that the numerical value
14.9 (float) was converted to a string before being joined with the other strings.

Global Mean Temperature is: 14.9 degC
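
As an alternative to joining strings with the plus symbol, formatted string literals (f-strings, available from Python 3.6) insert values directly into a string and convert them automatically. The following sketch compares both approaches; the variable names are arbitrary.

```python
s = 'Global Mean Temperature'
value = 14.9

# Joining with + requires an explicit conversion of the number to a string.
joined = s + ' is: ' + str(value) + ' degC'

# An f-string converts the value automatically.
formatted = f'{s} is: {value} degC'

# A format specification controls the number of decimals shown.
two_decimals = f'{s} is: {value:.2f} degC'

print(joined)
print(formatted)
print(two_decimals)
```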

7.1.2.3 Lists

Lists can be created using square brackets ([]). The elements in a list are separated
by commas (,). The elements in a list often are but do not need to be of the same
variable type. Lists containing sequences of numbers or names of models over which
to loop are quite common in climate computing. List elements can be referenced
using indexes. The following are some examples of indexing and manipulating lists.

a = [1, 2, 3, 4, 5]

print(a)         # Print whole list
print(a[0])      # Print 1st list element
print(a[2:3])    # Print a list holding only the 3rd element (end index 3 is excluded)
print(a[-1])     # Print last list element
a.append(100)    # Append value 100 at the end of the list
print(a)
a.insert(2, 50)  # Insert value 50 after the 2nd list element
print(a)
a.remove(50)     # Remove list element specified by its value
print(a)
del a[3]         # Remove list element specified by its index
print(a)

The methods append(), insert() and remove() in the code example above are associated
with the list variable (Python object) a. Python object methods and attributes are discussed
in more detail in Section 7.x.x. The above code will generate the following output.

[1, 2, 3, 4, 5]
1
[3]
5
[1, 2, 3, 4, 5, 100]
[1, 2, 50, 3, 4, 5, 100]
[1, 2, 3, 4, 5, 100]
[1, 2, 3, 5, 100]

Lists are mutable which means that they can be changed after they have been created
as shown in the above examples.

A Python object is mutable if it can be changed after it was created. Lists
are mutable. In contrast, tuples (Section 7.1.2.5) are immutable.

7.1.2.4 Dictionaries

A dictionary is in many ways very similar to a list (Section 7.1.2.3). Both store Python
objects (elements), are mutable and can have elements of different data types. The
main difference from lists is the way in which the elements are referenced. While list
elements are referenced by indices, the elements of a dictionary are associated
with keys. The keys are used to reference the dictionary elements.
Dictionaries can be defined in different ways, but most commonly comma-separated
key-value pairs within curly brackets are used. In the following example a dictionary is
created that contains country capitals.

capitals = {'Germany': 'Berlin', 'UK': 'London', 'Canada': 'Ottawa'}

The first element of each pair is the key (e.g., 'Germany'). The second element
is the associated value (e.g., 'Berlin'). A value can be accessed using the associated key
in the dictionary. In the following example the value associated with the key 'UK' is
printed.

print(capitals['UK'])

Executing the above code will generate the following output.

London

While in some situations a key may be a more convenient way to access a specific value than an index,
dictionaries are generally less common in climate computing than lists, and as such
are not explored further in this chapter.
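
For completeness, the following minimal sketch shows how a dictionary can be modified and looped over. The added entry ('France') and the variable name missing are illustrative assumptions; items() and get() are standard dictionary methods.

```python
capitals = {'Germany': 'Berlin', 'UK': 'London', 'Canada': 'Ottawa'}

# Add a new key-value pair (dictionaries are mutable)
capitals['France'] = 'Paris'

# Loop over keys and values together
for country, city in capitals.items():
    print(country, city)

# get() returns a default value instead of raising a KeyError
# when the key does not exist
missing = capitals.get('Spain', 'unknown')
print(missing)
```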

7.1.2.5 Tuples

Tuples are very similar to lists. They store sequences of Python objects which can be
of different data types. The main difference is that tuples are immutable, meaning
that once created they cannot be changed.
A tuple can be created in the same way as a list but using parentheses (()) instead
of square brackets ([]). In the following example a tuple called tup is created with three
elements of different data types.

tup = (2.5, 'tomato', True)

Tuples are not very frequently used in climate computing.
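
The immutability of tuples can be illustrated with a short sketch. The variable names first and changed are illustrative; the try/except construct is used here only to catch the error so that the script keeps running.

```python
tup = (2.5, 'tomato', True)

# Indexing works exactly as for lists
first = tup[0]
print(first)

# Assigning to an element raises a TypeError because tuples are immutable
try:
    tup[0] = 3.5
    changed = True
except TypeError as err:
    changed = False
    print('TypeError:', err)
```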

7.1.2.6 Booleans

A variable of the data type Boolean can only have one of two values: False or True.
Note that boolean values are case-sensitive, meaning that capitalisation of the first
letter is important.
Converting an integer or float number to the boolean data type using the bool()
function will return True for all values different from 0 (including negative values)
and False for 0.
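
The conversion rule described above can be checked with a few illustrative values (the names values and flags are assumptions made for this sketch).

```python
values = [0, 1, -3, 0.0, 2.5]

flags = []
for v in values:
    # bool() returns False for 0 and 0.0, True for everything else
    flags.append(bool(v))

print(flags)
```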
Boolean values are also frequently used with keyword arguments, which control
how a function operates. For instance, in the following code example the keyword
argument sharex is set to True and sharey is set to False to control which
plot axes are shared when multiple subplots are created.

import matplotlib.pyplot as plt
fig, axs = plt.subplots(3, 4, sharex=True, sharey=False)

7.1.2.7 Converting Between Variable Types

Conversions between the variable types integer, float and string are common
practice in Python coding. Some examples and pitfalls are discussed here.
Consider the following sequence of Python commands executed on the Python
command prompt (>>>). Note that a print() statement is not required on the Python
command line. The output of a command will be directly displayed in the terminal
window.

1 >>> a = 5.6
2 >>> type(a)
3 <class 'float'>
4 >>>
5 >>> b = '10.0'
6 >>> type(b)
7 <class 'str'>
8 >>>
9 >>> c = a+b
10 Traceback (most recent call last):
11 File "<stdin>", line 1, in <module>
12 TypeError: unsupported operand type(s) for +: 'float' and 'str'
13 >>>
14 >>> b = float(b)
15 >>> type(b)
16 <class 'float'>
17 >>>
18 >>> c = a+b
19 >>> c
20 15.6
21 >>>
22 >>> d = int(c)
23 >>> type(d)
24 <class 'int'>
25 >>> d
26 15
27 >>>

In line 1 the variable a is assigned the value 5.6. The type(a) command in line 2 returns
float (line 3) as the variable type, which Python has automatically assigned to a.
In line 5, the variable b is assigned '10.0' and Python correctly assigns the variable
type string (lines 6 and 7) because the number 10.0 is inside single quotes.
Trying to add a and b together as attempted in line 9 fails because a float variable
cannot be added to a string variable, resulting in the error message in lines 10 to 12.
To fix this problem the string variable b is converted to float in line 14 by using the
float() function. The successful conversion is confirmed in lines 15 and 16.

Now that both a and b are variables of the type float they can be added together as
done in line 18, returning the correct value of 15.6 for the variable c in line 20.
The float variable c is converted to an integer variable d using the int() function in
line 22 and the variable type is confirmed in lines 23 and 24. Printing d returns 15
(lines 25 and 26). Note that converting a variable of the type float to integer just cuts
off the decimal part. It does not round the float number up or down.

The conversion of a float value to integer using the int() function does not
round the float value. It just removes the decimal part.
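
The difference between truncating and rounding can be seen in the following minimal sketch, which compares int() with the built-in round() function. The variable names are illustrative.

```python
x = 15.6
y = -15.6

truncated_x = int(x)   # decimal part removed
truncated_y = int(y)   # truncation is towards zero, also for negative values
rounded_x = round(x)   # nearest integer
rounded_y = round(y)   # nearest integer

print(truncated_x, truncated_y)   # 15 -15
print(rounded_x, rounded_y)       # 16 -16
```

If rounding is what is needed, round() should be used rather than int().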

Similarly, variables of the type integer and float can be converted to strings as done
in the following sequence of Python commands.

1 >>> a = 'The air temperature is: '
2 >>> a
3 'The air temperature is: '
4 >>>
5 >>> b = 23.6
6 >>> b
7 23.6
8 >>>
9 >>> c = a+b
10 Traceback (most recent call last):
11 File "<stdin>", line 1, in <module>
12 TypeError: must be str, not float
13 >>>
14 >>> c = a+str(b)
15 >>> c
16 'The air temperature is: 23.6'
17 >>>
18 >>> type(b)
19 <class 'float'>
20 >>>
21 >>> print(a, b)
22 The air temperature is: 23.6
23 >>>

The string variable a is declared as 'The air temperature is: ' in line 1 and the
float variable b is declared as 23.6 in line 5. As expected (see discussion above), trying
to add both variables together (line 9) fails (lines 10 to 12).
However, the float variable b can be converted to a string by using the str() function.
Joining string variables a and str(b) now creates a new string saved in the variable
c in line 14 that can be printed (line 15 and 16).

Just to confirm, the variable b is still a float (line 18 and 19) as it was not overwritten
as done in line 14 of the previous command sequence.
If the variables are just displayed in the terminal window then a conversion of b from
float to string is not necessary. The variables can be displayed, for instance, using a
print statement like print(a, b) as done in line 21. Note that an additional space is
added between the two variables (line 22).

7.1.3 Functions
In general, functions are used for repetitive or common tasks. They can be reused
and often make the code layout clearer. Functions generally require some input
and return some output. There are two types of functions, built-in functions and
user-defined functions.

7.1.3.1 Built-in Functions

Built-in functions come with the Python installation. They do not need to be defined
or imported and can be used directly. A list of 60+ Python built-in functions can be
found in the Python documentation¹. Some commonly used built-in functions are
listed in Table 7.1.3.1.1. Some of them have been introduced already in the previous
sections (float(), int(), str() and print()).

¹https://docs.python.org/3.3/library/functions.html

Table 7.1.3.1.1: A small selection of commonly used built-in Python functions.

Function     Description
dir()        Returns a list of object methods and attributes.
enumerate()  Returns an iterable enumerate object.
float()      Returns a floating point number.
int()        Returns an integer number.
len()        Returns the length of an object.
print()      Prints objects to the terminal window.
range()      Returns a sequence of integers (starting at 0 by default).
str()        Returns a string object.
type()       Returns the type of an object.
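
A few of the functions in the table can be tried out with the following minimal sketch. The list of model names and the variable names are illustrative assumptions.

```python
models = ['CCSM4', 'HadCM3', 'Miroc4']

n = len(models)                        # number of list elements
indices = list(range(n))               # range(n) yields 0, 1, ..., n-1
kind = type(models)                    # the type of the object
has_append = 'append' in dir(models)   # dir() lists methods and attributes

print(n)
print(indices)
print(kind)
print(has_append)
```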

7.1.3.2 User-defined Functions

User-defined functions are (as the name says) functions that are defined by the user.
The following example shows a function that converts temperature values given in
Fahrenheit to Celsius.

1 def f2c(t):
2     return (t-32)*5/9
3
4 print(f2c(68))

The function definition starts with def followed by the name of the function f2c (line
1). The variable t given in brackets represents the variable that will be passed to the
function when called. In line 2 (indented) return is followed by the equation which
converts the variable t from Fahrenheit to Celsius.
The function is used in line 4 within a print() statement. The function is given an
input value of 68 degree Fahrenheit. Running the code will return the following
temperature in degree Celsius.

20.0

User-defined functions can be defined anywhere in a script and are available for
use elsewhere in the code. Alternatively, frequently used functions can be saved in a
separate file. Consider the following code saved in a file named t_conversions.py.

# convert from Celsius to Fahrenheit
def c2f(t):
    return (t*9/5.0)+32

# convert from Celsius to Kelvin
def c2k(t):
    return t+273.15

# convert from Fahrenheit to Celsius
def f2c(t):
    return (t-32)*5.0/9

# convert from Fahrenheit to Kelvin
def f2k(t):
    return (t+459.67)*5.0/9

# convert from Kelvin to Celsius
def k2c(t):
    return t-273.15

# convert from Kelvin to Fahrenheit
def k2f(t):
    return (t*9/5.0)-459.67

In order to make use of the functions k2c() and k2f() defined in t_conversions.py they
can be imported in the following way in a script as long as the file t_conversions.py
is located in the same directory.

from t_conversions import k2c, k2f

If the file containing the functions is located in a different directory from the Python
script that uses them then the path to the directory can be added to the system
path at the beginning of the script as done in the following example using the sys
(system) module. In this example the file t_conversions.py is located in the directory
/home/rjones/python/functions.

import sys
sys.path.append('/home/rjones/python/functions')

from t_conversions import k2c, k2f

7.1.4 Methods and Attributes


Every Python construct is an object. Objects have attributes and methods attached
to them. Attributes are basically other variables that describe the object’s properties
or features. Methods are similar to functions (Section 7.x.x) and have the ability to
change the object.
As the attributes and methods are part of the Python object they are self-contained.
Different objects may have different attributes and/or methods (e.g., a float variable
versus a list). They can be accessed via dotted notation, meaning a dot (.) between
the object and the attribute or method (see examples below).
Object methods are frequently used within Python code. The following example
shows how methods can be used to modify lists.

1 >>> mylist = [2, 5, 7, 2, 9, 7, 7]
2 >>> mylist
3 [2, 5, 7, 2, 9, 7, 7]
4 >>>
5 >>> mylist.reverse()
6 >>> mylist
7 [7, 7, 9, 2, 7, 5, 2]
8 >>>
9 >>> mylist.sort()
10 >>> mylist
11 [2, 2, 5, 7, 7, 7, 9]
12 >>>

A list called mylist is created in line 1 and printed to the display in lines 2 and 3.
The reverse() method is applied to the mylist object in line 5. When mylist is printed
again in lines 6 and 7 the order of the list elements has been reversed. Note that no new
variable has been created. The object mylist has been modified by the object’s own
reverse() method.
In the same way the sort() method applied to mylist in line 9 sorts the list elements
in ascending order.

Methods can be accessed using dotted notation. A dot (.) is placed between
the object and the method.

To identify what methods are attached to an object the built-in function dir() can
be used. Executing the command dir(mylist) will return the following output.

['__add__', '__class__', '__contains__', '__delattr__', '__delitem__',
 '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__',
 '__getitem__', '__gt__', '__hash__', '__iadd__', '__imul__', '__init__',
 '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__mul__',
 '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__',
 '__reversed__', '__rmul__', '__setattr__', '__setitem__', '__sizeof__',
 '__str__', '__subclasshook__', 'append', 'clear', 'copy', 'count', 'extend',
 'index', 'insert', 'pop', 'remove', 'reverse', 'sort']

Method names that start and end with double underscores (__) are called magic
methods. They are used mainly internally and can be ignored most of the time.
The methods for specific variable types may also be found in the documentation (e.g.,
for lists²).

While methods and functions have many similarities they are not the same.
The main difference between methods and functions is that methods are
called on an object and may change the object whereas functions stand on
their own and usually return variables or objects.
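
The difference can be illustrated by sorting a list in two ways. This is a minimal sketch using the built-in function sorted() and the list method sort(); the variable names newlist, before and result are illustrative.

```python
mylist = [2, 5, 7, 2, 9, 7, 7]

# Function: returns a new sorted list, mylist itself is unchanged
newlist = sorted(mylist)

# Keep a copy of mylist as it is before the method is called
before = list(mylist)

# Method: sorts mylist in place and returns None
result = mylist.sort()

print(before)
print(newlist)
print(mylist, result)
```

After running the code, before still holds the original order, whereas mylist has been changed by its own sort() method.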

7.1.5 Controlling the Code Flow


All programming languages have ways of controlling which lines of code are
executed, such as when the script might repeat them or skip them. Most often this is
done using loops and conditional statements which are discussed for Python in the
following two subsections.
²https://docs.python.org/3/tutorial/datastructures.html#more-on-lists

7.1.5.1 for-Loops

Loops are an essential part of any programming language because they allow the user
to iterate over sequences of numbers, list elements or files by allowing blocks of code
to be run repeatedly within a given set of constraints. Loops allow batch-processing
making them a powerful tool.
The most common loop is the for-loop which will be discussed here. The general
syntax of the for-loop is as follows.

for <var> in <sequence>:
    <do something>

The loop will iterate over the elements of a sequence (<sequence>). In most cases the
sequence will be a Python list but it can also be a tuple, dictionary, set or string. The
variable (<var>) changes with each iteration to the next element in the sequence. Any
code that is indented in the following lines (<do something>) is inside the loop and will
be executed with each loop iteration. Unlike many other programming languages the Python
for-loop does not have a closing statement. The loop ends where the indentation of the
code ends.
The following are some common for-loop examples. In the first example the built-in
function range() is used to create a sequence of numbers from 0 to 4. The variable
i changes with each iteration. The variable that changes with each iteration can be
given any valid variable name but i is very common for an index.

for i in range(5):
    print(i)

Executing the above code will give the following output.

0
1
2
3
4

Similarly, NumPy’s arange() function can be used to generate a sequence of numbers.
In the following example np.arange(3, 8, 2) generates a sequence of numbers starting at
3 in steps of 2 and stopping before 8 (the end value itself is not included).

import numpy as np
for i in np.arange(3, 8, 2):
    print(i)

Executing the above code will give the following output.

3
5
7

In the next example the loop iterates over a list of model names.

for m in ['CCSM4', 'HadCM3', 'Miroc4']:
    print('Processing:', m)

Executing the above code will give the following output.

Processing: CCSM4
Processing: HadCM3
Processing: Miroc4

The same loop as above may be executed differently by iterating through a sequence
of numbers and using them as an index as done in the following example. The
variable modellist is a list of model names. The built-in len() function is used to
return the number of elements in the list into the variable n (n=3). The variable n is
passed to the np.arange() function which generates a sequence of numbers from 0 to
2 which is used as an index to refer to element position in modellist inside the loop.

import numpy as np
modellist = ['CCSM4', 'HadCM3', 'Miroc4']
n = len(modellist)
for i in np.arange(n):
    print('Processing:', modellist[i])

Executing the above code will generate the same output as the previous loop.

Processing: CCSM4
Processing: HadCM3
Processing: Miroc4

Often it is useful to have both a variable such as a model name and the associated
index available inside the loop. This is where the built-in function enumerate() comes
in handy. In the following example the list of model names (modellist) is passed to the
enumerate() function. The for-loop now has two variables that change with each iteration:
i is the index and m is the name of the associated model.

modellist = ['CCSM4', 'HadCM3', 'Miroc4']
for i, m in enumerate(modellist):
    print(i, m)

Executing the above code will give the following output. The index and the associated
model name are printed with each loop iteration.

0 CCSM4
1 HadCM3
2 Miroc4

Nested for-loops are two or more loops inside one another. For instance, if one wants
to loop through each grid box of a two-dimensional data field to perform an operation
on each grid box then a nested for-loop can be used. The inner loop has to be
indented relative to the outer loop.
The following code, snipped and copied from Code 7.7.3.1 (line 27 to 29), shows
an example of how to loop through each grid box of a global field where lats
holds the latitude values and lons holds the longitude values. The line r, p =
stats.pearsonr(sst, pr[:,y,x]) then performs some processing on each grid box of
the three-dimensional pr field (the last two dimensions are latitude and longitude).

for y in range(len(lats)):
    for x in range(len(lons)):
        r, p = stats.pearsonr(sst, pr[:,y,x])

More loop examples can be found in Section 7.x.x.
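
The order in which a nested loop visits the grid boxes can be made visible with a small self-contained sketch. The 2 x 3 field below is a made-up example stored as nested lists (not a NumPy array), and the variable names field and visited are illustrative.

```python
# Hypothetical field with 2 latitudes (rows) and 3 longitudes (columns)
field = [[280.1, 281.4, 279.8],
         [290.3, 291.0, 289.5]]

nlat = len(field)      # number of rows
nlon = len(field[0])   # number of columns

visited = []
for y in range(nlat):
    for x in range(nlon):
        # the inner block runs once for every grid box
        visited.append((y, x))
        print(y, x, field[y][x])
```

The inner loop over x runs to completion for each value of y in the outer loop, so all longitudes of the first latitude are processed before the second latitude is reached.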



7.1.5.2 Conditional Statement

The if-statement can be found in almost every coding language. It is useful when a
command or code block should only be executed when a certain condition is met.
The if-statement has the following general syntax. Note the indentation of the code
block to be executed if the condition returns True. Same as with the for-loop, the
if-statement does not need to be closed at the end.

if (condition_returns_True):
    (do something)

If the condition returns False then an alternative command or code block can be
offered as follows.

if (condition_returns_True):
    (do something)
else:
    (do something else)

To construct conditions a series of comparison operators is available as listed in Table
7.1.5.2.1.

Table 7.1.5.2.1: Python comparison operators.

Operator  Description
==        equal to
!=        not equal to
>         greater than
<         less than
>=        greater than or equal to
<=        less than or equal to

Consider the following example.



1 a = 3
2
3 if a == 3:
4     print(a, 'is equal to 3.')
5
6 if a > 5:
7     print(a, 'is greater than 5.')
8 else:
9     print(a, 'is less or equal to 5.')

Running the above code will return the following output.

3 is equal to 3.
3 is less or equal to 5.

The variable a is set to 3 in line 1. In the first if-statement in line 3 the comparison
operator == is used to test if the variable a actually is equal to 3. As the test returns
True the indented print() command in line 4 is executed resulting in the first output
of 3 is equal to 3..
In line 6 the > comparison operator is used to test if a is greater than 5. As this test
returns False the print statement in line 7 is not executed. Instead, the print statement
in line 9 is executed resulting in the second output 3 is less or equal to 5..
In addition to single test conditions, multiple test conditions can also be applied. For
those the Python logical operators and, or and not can be used as listed in Table 7.1.5.2.1.
Note that when a condition is defined, the Python object it creates is a boolean (see
Section 7.1.2.6), which the if-statement then evaluates.

Table 7.1.5.2.1: Python logical operators.

Operator  Example (a=10, b=20)       Returns
and       a == 10 and b > 15         True
or        a > 10 or b > 30           False
not       not(a == 10 and b > 15)    False (result reversed)
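
The examples in the table can be reproduced with the following short sketch. The variable names r_and, r_or and r_not are illustrative.

```python
a = 10
b = 20

r_and = a == 10 and b > 15         # both conditions must be True
r_or = a > 10 or b > 30            # at least one condition must be True
r_not = not (a == 10 and b > 15)   # reverses the result

print(r_and, r_or, r_not)

# A combined condition used directly in an if-statement
if a == 10 and b > 15:
    print('Both conditions are met.')
```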

7.2 Applying Python in Climate Data Analysis

7.2.1 Error Messages when Running Code


Python code is executed sequentially (line-by-line) by the Python interpreter
(respecting loops and conditional statements). If a problem is encountered in one of
the lines then the execution of the code terminates right there and returns an error
message. The code has crashed.
No programmer writes extensive code that immediately runs without crashing.
Fixing errors in the code is absolutely normal. For beginners, fixing code errors is an
essential part of the learning process. Over time the mistakes become less frequent.
First of all, when the code exits with an error message, do not panic. Examine the
error message carefully. Most of the time Python will try to explain what went wrong
and where. Consider the following Python code example saved in a file named
7_python_error.py.

1 import numpy as np
2
3 a = np.arange(8)
4 print(a)
5
6 print(a[10])

Running the above code will return the output below.

[0 1 2 3 4 5 6 7]
Traceback (most recent call last):
File "7_python_error.py", line 6, in <module>
print(a[10])
IndexError: index 10 is out of bounds for axis 0 with size 8

The first line of the output is a sequence of numbers from 0 to 7. It comes from the
print statement in line 4 of the code where the variable a holding a one-dimensional
NumPy array is printed.

After that print statement something went wrong when running the code and Python
returns a Traceback with some details. It tells the user that in line 6 of the code in File
"7_python_error.py" something went wrong. It even prints out line 6 in the following
line for reference (print(a[10])).
The last line usually presents some information about what exactly went wrong. In
this case there is an IndexError and the index 10 is out of bounds for axis 0 with
size 8. The variable a is a NumPy array with 8 elements (valid indices 0 to 7) as defined
in line 3 of the code. Trying to print the element with index 10 of that array in line 6
causes the error.

When analysing error messages, first examine the first line and identify
the line in the code where the (first) error occurred. Second, examine the
last line of the error message which provides an indication as to what went
wrong. There may be a lot of error messages between the first and last line
caused by subsequent failures in dependent functions which can be ignored
most of the time.

Frequent print() statements in the code can help to identify problems associated with
variables. These could include the following for a variable named var.

print(var)        # print the variable itself
print(type(var))  # print the variable type
print(var.shape)  # print the dimensions (shape) if var is a NumPy array

While developing code it is also sometimes useful to stop the code at a given point.
This can be done using the Python native exit() function.

Use the exit() function to stop code at a specific point.

7.2.2 Looping Through Input Files


It is a fairly common task when analysing climate data to create loops that iterate
over a (sometimes quite large) number of files. These files may, for instance, represent
output from different climate models (looping over models) or files organised by
certain time criteria such as years, months, days and hours.
The filenames can be created manually or programmatically and some solutions are
discussed in the following sub-sections.

7.2.2.1 Constructing File Names Manually

If the number of files considered is small then the filenames may be constructed
manually as demonstrated in the following example.

 1 # define data directory path and list of model names
 2 datadir = '/home/data/model/cmip5/'
 3 mlist = ['CanESM2', 'CCSM4', 'HadCM3', 'inmcm4', 'Miroc4']
 4
 5 # loop over models
 6 for c, f in enumerate(mlist):
 7     # construct input filename
 8     ifile = datadir+f+'/rcp85/mon/Amon/r1i1p1/tas_'+f+'_rcp85.nc'
 9
10     # do something with this file
11     print(c, ifile)

In line 2 the variable datadir is defined which holds the full path to the root of where
the data are stored. In line 3 a list named mlist is created which holds a number of
model names.
The loop initiated in line 6 loops over each element of mlist. With each loop iteration
the variable f holding the model name as well as the counter c will change. The
enumerate() built-in function is used here in order to have access to both the element
of the list (f) and its associated index (c).
In line 8 the full path to the file is constructed by joining different strings together.
Note that the model name stored in the variable f appears twice in the path. First, as
a directory name and second in the filename (see output below).
To check that the full path to the file was constructed correctly the counter c and the
full path saved in ifile are printed in line 11. The output from the code above is as
follows.

0 /home/data/model/cmip5/CanESM2/rcp85/mon/Amon/r1i1p1/tas_CanESM2_rcp85.nc
1 /home/data/model/cmip5/CCSM4/rcp85/mon/Amon/r1i1p1/tas_CCSM4_rcp85.nc
2 /home/data/model/cmip5/HadCM3/rcp85/mon/Amon/r1i1p1/tas_HadCM3_rcp85.nc
3 /home/data/model/cmip5/inmcm4/rcp85/mon/Amon/r1i1p1/tas_inmcm4_rcp85.nc
4 /home/data/model/cmip5/Miroc4/rcp85/mon/Amon/r1i1p1/tas_Miroc4_rcp85.nc
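
As an alternative to joining strings with the plus symbol, the path can be assembled with os.path.join() from the Python standard library, which inserts the path separators automatically. The following is a minimal sketch using a shortened model list; the variable names fname and ifiles are illustrative.

```python
import os

datadir = '/home/data/model/cmip5'
mlist = ['CanESM2', 'CCSM4']

ifiles = []
for f in mlist:
    # file name, e.g. tas_CanESM2_rcp85.nc
    fname = 'tas_' + f + '_rcp85.nc'

    # join the directory levels and the file name into one path
    ifile = os.path.join(datadir, f, 'rcp85', 'mon', 'Amon', 'r1i1p1', fname)
    ifiles.append(ifile)
    print(ifile)
```

With this approach it does not matter whether datadir ends with a slash or not.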

7.2.2.2 Constructing File Names Using the Unix find Command

The Unix find command is an extremely versatile tool that allows the user to
tailor search patterns in all kinds of ways (see Section 3.6.7).
The following example illustrates how the Unix find command can be used to create
a sorted list of input filenames which is subsequently used in a loop.

1 import subprocess
2
3 # create sorted list of input files
4 cmd = 'find ../data -iname "*.nc"'
5 process = subprocess.Popen([cmd], shell=True, stdout=subprocess.PIPE)
6 output = process.communicate()[0]
7 flist = output.split()
8 flist.sort()
9
10 # loop over input files
11 for counter, f in enumerate(flist):
12     print(counter, f.decode())

In line 1 the subprocess module is imported. The complete find command is saved in
a variable named cmd in line 4. In this example, files ending with .nc are searched for.
It is recommended to test the complete find command on the Unix command line
before running the Python code to make sure the command works as expected.
In line 5 the find command is executed using the subprocess.Popen() function as
described in Section 7.x.x.
The process.communicate() function is used in line 6 to capture the output from the
command and save it in a variable named output.
The output variable contains the filenames as a single string which is why the
variable-internal method output.split() is used to create a list named flist where
each list element corresponds to a single filename.

The find command returns an unsorted list by default. The list is sorted in line 8
using the variable-internal method flist.sort().
The loop in line 11 iterates over each element of the sorted list (flist) allowing
the processing of each file using, for instance, a CDO command as demonstrated
in Section 7.x.x.
In the above example the index counter and the corresponding list element are printed
inside the loop (line 12). Note that the list elements returned are byte strings and they
are converted into normal strings using the decode() method.
The output from the above code may look similar to the following.

0 ../data/ERAI_sh_1997_P.nc
1 ../data/ERAI_sh_1997_potT.nc
2 ../data/ERAI_sh_1997_potVort.nc
3 ../data/ERAI_sh_1997_sigma.nc
4 ../data/HadISST_sst.nc
5 ../data/HadISST_sst_Nov1997.nc
6 ../data/HadISST_sst_Nov1997_anom.nc
7 ../data/HadISST_sst_Nov_ltm.nc
8 ../data/InSalah.SYNOP.wspd10m.nc
9 ../data/InSalah_wpsd10m_seasonal_cycle.nc
10 ../data/SYNOP_InSalah_wpsd10m_9utc_jul_1985_2019.nc
11 ../data/Sahel_JAS_pre.nc
12 ../data/Sahel_JAS_pre_anom.nc
13 ../data/cru_ts4.02.1979.2015.tmp.dat.nc
14 ../data/era5_u_3d_bodele_2018_12.nc
15 ../data/era5_v_3d_bodele_2018_12.nc
16 ../data/era5_z_bodele_20050301_1200.nc
17 ../data/foo.nc
18 ../data/test.nc
19 ../data/tmp_ltm.nc
20 ../data/tmp_timeseries.nc

The Python ‘glob’³ module aims at imitating the Unix find command but
does not match all its functionality (for example, search files by file size).
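
A minimal sketch of the glob approach is shown below. To keep the example self-contained it first creates a temporary directory containing a few empty files; the file names are made up for illustration. Note that, unlike find with the -iname option, glob pattern matching is case-sensitive on Unix systems.

```python
import glob
import os
import tempfile

# Create a temporary directory with a few empty files (illustration only)
tmpdir = tempfile.mkdtemp()
for name in ['b.nc', 'a.nc', 'notes.txt']:
    open(os.path.join(tmpdir, name), 'w').close()

# glob returns the matching paths in arbitrary order, so sort them explicitly
flist = sorted(glob.glob(os.path.join(tmpdir, '*.nc')))

# loop over input files
names = []
for counter, f in enumerate(flist):
    names.append(os.path.basename(f))
    print(counter, os.path.basename(f))
```

In contrast to the subprocess approach, the list elements returned by glob are normal strings and do not need to be decoded.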

³https://docs.python.org/2/library/glob.html

7.2.2.3 Looping over Months, Days, Hours

Consider the following list of TRMM (version 3B42) precipitation data files covering
the period 1 January 2005 to 31 December 2005 with one file every 3 hours. The
directory structure is YYYY/MM/DD/ with each directory containing data for one day (8
files), totalling 2920 files.

2005/01/01/2005010100.trmm.3b42.nc
2005/01/01/2005010103.trmm.3b42.nc
2005/01/01/2005010106.trmm.3b42.nc
2005/01/01/2005010109.trmm.3b42.nc
2005/01/01/2005010112.trmm.3b42.nc
2005/01/01/2005010115.trmm.3b42.nc
2005/01/01/2005010118.trmm.3b42.nc
2005/01/01/2005010121.trmm.3b42.nc
...
2005/12/31/2005123100.trmm.3b42.nc
2005/12/31/2005123103.trmm.3b42.nc
2005/12/31/2005123106.trmm.3b42.nc
2005/12/31/2005123109.trmm.3b42.nc
2005/12/31/2005123112.trmm.3b42.nc
2005/12/31/2005123115.trmm.3b42.nc
2005/12/31/2005123118.trmm.3b42.nc
2005/12/31/2005123121.trmm.3b42.nc

The following things have to be considered when writing a loop that iterates over
the files. The number of days differs between months (28, 30 or 31). With each
loop iteration the month and day parts of the path and the month, day and hour parts
of the filename may change. The month, day and hour details also have to be in the
correct two-character format (MM, DD and hh).
One solution is to create a list of date objects covering the whole period in 3-hourly
timesteps, then iterate over this list and extract the month, day and hour information
in the correct format. The following code example does exactly that.

 1 from datetime import datetime, timedelta
 2
 3 # define a generator function that yields datetime objects
 4 def daterange(date_start, date_end):
 5     while date_start <= date_end:
 6         yield date_start
 7         date_start = date_start + timedelta(hours=3)
 8
 9 # define start and end date
10 date_start = datetime(2005, 1, 1, 0, 0)
11 date_end = datetime(2005, 12, 31, 21, 0)
12
13 for dt in daterange(date_start, date_end):
14     # get month, day and hour details as strings in correct format
15     month = dt.strftime("%m")
16     day = dt.strftime("%d")
17     hour = dt.strftime("%H")
18
19     # construct input filename
20     ifile = '2005/'+month+'/'+day+'/'+'2005'+month+day+hour+'.trmm.3b42.nc'
21
22     # do some processing on this file
23     print(ifile)

In the first line the datetime and timedelta classes are imported from the datetime
module.
A function called daterange is defined in lines 4 to 7 which takes the two arguments
date_start and date_end. Both arguments have to be datetime objects. While
date_start is less than or equal to date_end (line 5) the current value of date_start is
handed back to the caller using yield (line 6) and is then advanced by 3 hours using
timedelta() (line 7).
The daterange function is now being used in a for-loop in line 13 using the date_start
and date_end defined in lines 10 and 11, respectively. With each iteration of the loop
the variable dt changes to the next date. The month, day and hour information is
extracted from the date object dt in the correct two-character string format using
the strftime() method in lines 15, 16 and 17, respectively.
The path and file name is constructed in line 20 and printed in line 23.
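
A quick consistency check of this approach is sketched below. It assumes that the period starts at 00 UTC on 1 January 2005, as in the file listing above, and it builds the complete path with a single strftime() call. The step_hours argument and the variable names dates, first and last are illustrative additions.

```python
from datetime import datetime, timedelta

def daterange(date_start, date_end, step_hours=3):
    # yield datetime objects from date_start to date_end (inclusive)
    while date_start <= date_end:
        yield date_start
        date_start = date_start + timedelta(hours=step_hours)

# collect all time steps of the year 2005 in a list
dates = list(daterange(datetime(2005, 1, 1, 0, 0),
                       datetime(2005, 12, 31, 21, 0)))

# build the path and filename in one go from a single format string
pattern = '%Y/%m/%d/%Y%m%d%H'
first = dates[0].strftime(pattern) + '.trmm.3b42.nc'
last = dates[-1].strftime(pattern) + '.trmm.3b42.nc'

print(len(dates))
print(first)
print(last)
```

The number of time steps (365 days times 8 files per day) matches the 2920 files mentioned above.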

7.2.3 Reading Data Files Into NumPy Variables


Almost all climate data will be dealt with as NumPy arrays. NumPy is the main
Python package for scientific number crunching. It will be introduced in more depth
in Section 7.x. The following three subsections will discuss how to read in climate
data from different sources such as netCDF files, ASCII files and Excel spreadsheets.

7.2.3.1 Reading Data From netCDF Files

To read data from a netCDF file the Dataset class from the netCDF4 module can be
used. Line 1 in the code example below imports Dataset from the netCDF4 module.
In line 3 the netCDF file erai_t2m.nc is opened in read-only mode (r) using
Dataset, creating a file handle f. In lines 4 to 6 the variables longitude,
latitude and t2m are read in, respectively. Check the netCDF input file for the correct
variable names using tools such as ncdump or CDO. In line 7 the units attribute of the
variable t2m is read in by appending the attribute name units to the variable
(see netCDF file headers for variable attributes). In general, it is good practice to
close the file once all variables and units of interest have been read in (line 8).

The netCDF4 module is backward compatible with netCDF3 and can also be
used with netCDF files using the HDF5 library.

After the netCDF file is closed the variables lons, lats, t2m and t2mu can be used in
the remaining part of the script. While the variables lons, lats and t2m are NumPy
arrays the variable t2mu is of the type string.

1 from netCDF4 import Dataset
2
3 f = Dataset('erai_t2m.nc', mode='r')
4 lons = f.variables['longitude'][:]
5 lats = f.variables['latitude'][:]
6 t2m = f.variables['t2m'][:]
7 t2mu = f.variables['t2m'].units
8 f.close()

7.2.3.2 Reading Data From Formatted ASCII Files

Sometimes climate data are made available as formatted ASCII files (see Section
2.5.1). The data values tend to be organised in rows and columns sometimes including
a few lines in the beginning of the file known as file headers. If the values in each
row are separated by commas then they are called comma-separated values (CSV
files) and the standard file extension .csv should have been used (this is not always
done). Other separators are also possible including tabs or white spaces.
The following is an example of a CSV file listing date, time, wind speed and wind
direction information for every hour of the year 2011. The file includes two lines at
the beginning (the file header) providing the station ID and the column headers.

Station ID 65340
date [YYYY/MM/DD], time [hours], wind speed [m/s], wind direction [sector]
2011/01/01, 0, 1.5, N
2011/01/01, 1, 1.8, NE
2011/01/01, 2, 2.1, N
2011/01/01, 3, 2.6, N
2011/01/01, 4, 3.7, NW
2011/01/01, 5, 5.2, W
...
2011/12/31,22, 0.2, W
2011/12/31,23, 0.5, W
2011/12/31,24, 0.3, W

The following Python code reads in the CSV file assuming that the data are saved in
a file named wspd_2011.csv. The numpy module is imported in line 1 and given the alias
np. Line 3 assigns the input file name to the variable ifile. In lines 4 to 7 the actual
data values are read into the variables d, t, wspd and wdir, respectively. The
np.loadtxt function takes several arguments (inside the parentheses) that tell the
function how to read in the data. These arguments are the input filename (ifile)
followed by the data type (dtype), the delimiter (delimiter), the number of rows to
skip at the beginning of the file (skiprows) and the column to read in (usecols).

1 import numpy as np
2
3 ifile = 'long/path/to/file/wspd_2011.csv'
4 d = np.loadtxt(ifile, dtype=str, delimiter=',', skiprows=2, usecols=(0,))
5 t = np.loadtxt(ifile, dtype=int, delimiter=',', skiprows=2, usecols=(1,))
6 wspd = np.loadtxt(ifile, dtype=float, delimiter=',', skiprows=2, usecols=(2,))
7 wdir = np.loadtxt(ifile, dtype=str, delimiter=',', skiprows=2, usecols=(3,))

The data are now available for the remaining part of the code as NumPy arrays d, t,
wspd and wdir.
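
Calling np.loadtxt once per column means the file is read four times. A possible alternative, sketched here under the assumption that every column can first be read as a string, is to load the file once and convert each column afterwards. The in-memory sample built with io.StringIO is only a stand-in for wspd_2011.csv so that the sketch is self-contained, and the variable name raw is hypothetical.

```python
import io
import numpy as np

# stand-in for the first lines of wspd_2011.csv (two header lines, then data)
sample = io.StringIO(
    "Station ID 65340\n"
    "date [YYYY/MM/DD], time [hours], wind speed [m/s], wind direction [sector]\n"
    "2011/01/01, 0, 1.5, N\n"
    "2011/01/01, 1, 1.8, NE\n"
    "2011/01/01, 2, 2.1, N\n"
)

# read all columns in one pass as strings and remove surrounding white space
raw = np.char.strip(np.loadtxt(sample, dtype=str, delimiter=',', skiprows=2))

# convert each column to the data type it needs
d = raw[:, 0]
t = raw[:, 1].astype(int)
wspd = raw[:, 2].astype(float)
wdir = raw[:, 3]

print(d, t, wspd, wdir)
```

For a real file, the sample object would be replaced by the file name stored in ifile.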

7.2.3.3 Read Data From an Excel Spreadsheet

Figure 7.2.3.3.1 shows a worksheet of an Excel file (../data/pibal_data.xlsx) with
data from a pilot balloon (Pibal) track on the island of Tenerife on 23 April 2018
at 8:37am local time. Pibal tracking is a method for measuring vertical profiles of
wind speed and wind direction by means of tracking a helium-filled balloon with a
theodolite (optical instrument). From the known lift of the balloon and the Elevation
and Azimuth measured at regular intervals the wind Speed and Direction at different
heights are calculated by the underlying spreadsheet equations.

Figure 7.2.3.3.1: The PIBAL data entry spreadsheet. Cells with a light green background colour can
be edited, while other cells are calculated automatically.

Reading data from an Excel spreadsheet into Python can be done using the openpyxl⁴
module. In the following code example the method for reading in data from the above
Excel spreadsheet is demonstrated.

⁴https://openpyxl.readthedocs.io/en/stable/

1 import numpy as np
2 from openpyxl import load_workbook
3
4 # open Excel file and select the worksheet
5 wb = load_workbook('../data/pibal_data.xlsx', data_only=True)
6 ws = wb['P01']
7
8 # read in date, time and location
9 d = ws.cell(row=3, column=2).value
10 t = ws.cell(row=4, column=2).value
11 loc = ws.cell(row=5, column=2).value
12
13 # create empty numpy array variables
14 alt = np.array([], dtype='float64')
15 wspd = np.array([], dtype='float64')
16 wdir = np.array([], dtype='float64')
17
18 # iterate over rows 8 to 39; read altitude, wind speed and wind direction
19 for row in range(8, 40):
20     alt = np.append(alt, np.float64(ws.cell(row=row, column=2).value))
21     wspd = np.append(wspd, np.float64(ws.cell(row=row, column=8).value))
22     wdir = np.append(wdir, np.float64(ws.cell(row=row, column=12).value))

In line 1 numpy is imported and in line 2 the load_workbook function is imported from
the openpyxl module. The function is used in line 5 to open the Excel spreadsheet
pibal_data.xlsx and create the handle wb. Setting data_only to True ensures that the
calculated cell values are read in rather than the underlying formulas (by default
the formulas are returned).
An Excel file can have several worksheets. In line 6 a handle ws is created to the
worksheet named P01.
In lines 9 to 11 the values of three specific cells that hold date, time and location
information are read in. The ws.cell() function expects the row and column numbers
associated with the cell of interest to be specified (compare the specified row and
column numbers with Figure 7.2.3.3.1 for clarity). Note that the variables d, t and loc
are of the Python native variable type string.
Unfortunately, the ws.cell() function does not allow a range of cells to be specified
and read in. In order to read in the wind speed and wind direction values the
following approach may be applied. In lines 14 to 16 the empty NumPy variables
alt for altitude, wspd for wind speed and wdir for wind direction are declared. They
are of the NumPy data type float64.
In line 19 a for-loop is initiated which loops over rows 8 to 39 in the Excel spreadsheet
(Figure 7.2.3.3.1). With each iteration of the loop the altitude, wind speed and wind
direction values are read in and appended to the variables alt, wspd and wdir using
the np.append() function (lines 20 to 22).
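The ws.cell() function returns a cell object whose value attribute holds a native
Python type that depends on the cell content (for example a number, a string, or None
for an empty cell). The values are therefore converted explicitly to NumPy floats on
the fly using the np.float64() function inside the np.append() function.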
The NumPy variables alt, wspd and wdir now hold the data from the Excel spreadsheet
and are available for further analysis or plotting. Code examples for plotting the data
from the above spreadsheet example can be found in Section 7.5.1 and Section 7.5.2.

7.2.4 Executing Unix System Commands From Within Python

Sometimes it can be useful to execute a Unix system command from within a Python
script. For instance, processing netCDF files using CDO commands can be done from
within Python. Python can be used as a wrapper to write a loop for processing a large
number of input files using CDO. Another example would be to execute the Unix find
command to create a list of input files to use in the Python script (see Section 7.x.x).
To execute a Unix system command the subprocess module can be used. The
subprocess module allows interactions with the executed process while the process
is running.

The subprocess module is a comprehensive module that aims to replace
some older functions such as os.system, os.spawn* and os.popen*.

The function subprocess.Popen() can be used to execute external commands on
the Unix system. The following example demonstrates how the subprocess module
can be used to execute a CDO command and check that the command completed
successfully.

1 import subprocess
2
3 cmd = 'cdo -b F64 vertsum ../data/ERAI_sh_1997_P.nc ../data/test.nc'
4 process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
5 process.communicate()
6
7 # print return code (0 = success)
8 print('return code:', process.returncode)

In line 1 the subprocess module is imported. Line 3 stores the complete CDO
command to be executed in a variable named cmd.
In line 4, the subprocess.Popen() function is used to execute the command. A handle
named ‘process’ is created. The first argument passed to the subprocess.Popen()
function is the command to be executed (cmd). The shell keyword is set to True
meaning that the command is passed on to the shell (see Section 3.3.2) as is (check
security considerations in the documentation). To capture the command output the
stdout keyword is set to subprocess.PIPE.

The purpose of running the process.communicate() function in line 5 is to make sure
the code waits until the command has completed and the command exit status is
saved (return code).
The return code can then be used to test if the command completed successfully
(return code equal to 0) or whether it failed (return code equal to a non-zero value)
as demonstrated in line 8 by printing the process.returncode value.
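
In Python 3.7 or later, the subprocess.run() function offers a shorter way to start a command, wait for it to finish and collect its output and return code in one step. The following is a minimal sketch rather than a replacement for the CDO example: the child command here is the Python interpreter itself (sys.executable), chosen only so that the sketch runs on any system without CDO being installed. The command is passed as a list of arguments, so no shell is involved.

```python
import subprocess
import sys

# command as a list of arguments; no shell is needed in this form
cmd = [sys.executable, '-c', 'print("hello from a child process")']

# run the command, wait for it to finish and capture its output as text
result = subprocess.run(cmd, capture_output=True, text=True)

# return code (0 = success) and captured standard output
print('return code:', result.returncode)
print('output:', result.stdout.strip())
```

Passing check=True to subprocess.run() would additionally raise an exception when the return code is non-zero, which can be convenient when looping over many input files.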

7.3 Introduction to NumPy


While every Python distribution has the ability to perform simple maths with
numbers, plain Python is not well suited to the large multi-dimensional datasets
that we use in climate science. The Python package NumPy⁵ is just the right tool as it
was specifically developed for scientific computing and facilitates processing of
multi-dimensional number arrays.
In order to make use of NumPy it has to be imported, which is usually done in the
following way.
⁵https://numpy.org

import numpy as np

In the following subsections, the main features of NumPy are very briefly introduced
including how to create number arrays and how to index them. Far more
comprehensive introductions are available as video tutorials, webpages and books
and it is worth spending some time exploring them.

7.3.1 Creating NumPy Arrays


There are many ways in which NumPy arrays can be created manually. Some
frequently used NumPy array creation functions are listed in Table 7.3.1.1.

Table 7.3.1.1: Examples for functions frequently used to create NumPy arrays.

Function                  Description
np.array([1, 5, 87, 3])   Returns a one-dimensional array with the given values.
np.arange(5)              Returns a sequence of numbers from 0 to 4.
np.zeros(5)               Returns a 5-element array filled with zeros.
np.empty([3, 2])          Returns a 3 by 2-element array with uninitialised (arbitrary) values.
np.full((2, 2), 5)        Returns a 2 by 2-element array filled with the value 5.

Most of the time, however, this is not necessary as many Python packages have
integrated NumPy and generate NumPy arrays as output.
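
The array creation functions listed in Table 7.3.1.1 can be tried out directly. The following short sketch prints the shape and data type of each resulting array; the variable names a to e are arbitrary choices for illustration.

```python
import numpy as np

a = np.array([1, 5, 87, 3])   # one-dimensional array with set values
b = np.arange(5)              # integers from 0 to 4
c = np.zeros(5)               # five zeros (float64 by default)
d = np.empty([3, 2])          # 3 by 2 array, contents are uninitialised
e = np.full((2, 2), 5)        # 2 by 2 array filled with the value 5

# print name, shape and data type of each array
for name, arr in [('a', a), ('b', b), ('c', c), ('d', d), ('e', e)]:
    print(name, arr.shape, arr.dtype)
```

Because np.empty() does not initialise its elements, the array d should always be filled with real values before it is used in a calculation.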

7.3.2 Indexing NumPy Arrays


NumPy arrays can have one or more dimensions. Most of the time climate data
fields will have one (e.g., time series), two (e.g., longitude by latitude) or three (e.g.,
longitude by latitude by vertical levels) dimensions. Sometimes, there are more than
three dimensions (e.g., additional time dimension or multiple variables).
In contrast to some other data analysis software such as IDL, the default order
for NumPy arrays is row-major. For a two-dimensional field, this means that the
first (leftmost) dimension represents the rows (y-axis) and the second
dimension represents the columns (x-axis) as shown in the following
example.
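
A minimal sketch of this convention, using a small 2 by 3 array whose name (field) and values are assumptions chosen purely for illustration:

```python
import numpy as np

# 2 rows (y-axis) by 3 columns (x-axis)
field = np.array([[1, 2, 3],
                  [4, 5, 6]])

print(field.shape)                  # (number of rows, number of columns)
print(field[0, :])                  # all columns of the first row
print(field[:, 0])                  # all rows of the first column
print(field[1, 2])                  # element in the second row, third column
print(field.flags['C_CONTIGUOUS'])  # True when stored in row-major order
```

The first index always selects the row and the second index selects the column, so a longitude by latitude field read from a file is typically indexed as [latitude, longitude].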
