Geophysical Data Analysis Using Python
Abstract
A set of routines designed for geophysical data analysis that makes extensive use of the numerical extensions to the
Python language is presented. The routines perform some typical tasks in the multivariate analysis of
geophysical fields, such as principal component analysis and related tasks (truncation rules by means of analytical and
Monte Carlo techniques). Other functions perform the singular value decomposition of covariance matrices and canonical
correlation analysis for the coupled variability of geophysical fields. Other parts of the package allow access to a library of
statistical distribution functions, multivariate digital filters, time-handling routines, kernel-based probability density
function estimation and differential operators over the sphere for gridded data sets. Because they rely on the numerical
extensions to the Python language, the routines are fast for numerical analysis. The programs make the analysis of geophysical
data sets both easier and faster. © 2002 Elsevier Science Ltd. All rights reserved.
Keywords: Principal component analysis; Digital filter; Probability density function; Differential operator; Python
0098-3004/02/$ - see front matter © 2002 Elsevier Science Ltd. All rights reserved.
PII: S0098-3004(01)00086-3
J. Saenz et al. / Computers & Geosciences 28 (2002) 457–465
In this work, we present PyClimate¹, a Python package that performs some of the most typical tasks during data analysis in the field of climate analysis, although the routines can be used for any general-purpose geophysical data analysis work. The package has been developed by the authors making extensive use of the numerical extension to Python. Python is a modern, free-of-charge, multiplatform, interpreted, object-oriented high-level language (Dubois et al., 1996; Lutz, 1996; Lutz and Ascher, 1999). It was developed by Guido van Rossum while working for the Stichting Mathematisch Centrum in Amsterdam and first released in 1991. Python programs are easy to read, write and debug, which shortens the development cycle. The code is easily extensible and embeddable, providing access to external devices, libraries and system calls. The numerical extensions rely on an efficient C implementation of the basic array operations. Thus, even though Python is interpreted, Numerical Python is compiled and performs fast for numerical tasks, with support for several data types (Int8, Int16, Float32, Float64, Complex32, Complex64). It has access to several libraries for scientific computation such as Fast Fourier Transform, special functions, random number generation or linear algebra. Detailed information about Python can be found on the Python Homepage.² It is a no-cost alternative to similar commercial products (MatLab, IDL, Maple, etc.) for those who cannot afford the licensing fees. Its simplicity and lack of cost make it a viable alternative as a classroom language. Another advantage over MatLab and other high-level software is that Python is a general-purpose programming language and allows much more interaction with the operating system.

This paper presents a detailed discussion of the mathematical concepts and data analysis tasks included in our package, PyClimate. A detailed description of the functions and several examples are provided in the documentation included with the distribution. However, to give the reader a deeper perspective on the kind of programs that can be prepared using these tools, some short examples will also be shown here. Prerequisites, compilation for several UNIX flavours, installation and test procedures are also described in the documentation accompanying the distribution, freely available on the Internet.

2. COARDS-compliant netCDF files and time-handling routines

complex models on a UNIX workstation and analyze the results with a visualization program on a laptop running MS Windows or other operating systems. To overcome the problems associated with working in different computing environments, which often support the binary representation of numbers in different ways, there exist some data formats which provide transparent access to binary data. netCDF (network Common Data Format)³ is one of those systems. It allows the use of the same binary data files in "little-end" or "big-end" machines (Cohen, 1981) without the user having to worry about how the data was created or how it is being read, and without the additional computational and storage burden (and loss of precision) associated with the conversion to ASCII. There are some conventions built upon the bare netCDF to be able to communicate between different pieces of software. One of the most popular conventions in oceanographic and atmospheric data sets is the so-called Cooperative Ocean/Atmosphere Research Data Service (COARDS) Conventions.⁴ These conventions are supported by several universities and agencies and are interpreted by some popular visualization software packages (GrADS, Ferret). They enforce a certain order of the dimensions of a variable (time, level, latitude, longitude, etc.) and some attributes, so that visualization programs are able to properly interpret the compliant netCDF files. PyClimate includes a function to replicate the needed parts (dimensions, variables, attributes) of an existing COARDS netCDF file so that the output file from a computation can be created using the same structure with a single call. An example of this is provided in Fig. 1.

One of the problems facing climatologists or oceanographers is the need to position their observations adequately in time. This is especially relevant in the frequent situation of monthly averages, because there is no simple definition of a month. It depends on the year and the month itself (28–31 days). The approach in this package is to use Julian Days and define one month as one-twelfth of one year, considering one tropical year equal to 365.242198781 days. There exists a second class (JDTimeHandler) whose constructor parses the units attribute of the time variable in a COARDS file. Next, making use of an offset and scaling process, the instances are able to recover the fields of a date structure from the time variable expressed as a double. The
Create an output netCDF file structured as the daily data and print the dates:
from pyclimate.ncstruct import nccopystruct
from pyclimate.JDTimeHandler import JDTimeHandler
from Scientific.IO.NetCDF import NetCDFFile
current implementation does not consider the slow drift in the number of days per tropical year, so it is of little use for geological time scales involving several millennia. The package should be used with caution for dates before the introduction of the Gregorian calendar, because it simply extrapolates this calendar without taking into account the date of its introduction, which changed from country to country. An example of the use of this class to read time values in netCDF files is shown in Fig. 1.

3. Cumulative distribution functions

The possibility of extending the Python interpreter with external C libraries is illustrated with an example of the way we provide access to the DCDFLIB.C 1.1 library from Python in the package PyClimate. The DCDFLIB.C library is a free library designed for the direct and inverse computation of parameters corresponding to discrete cumulative density functions, by Barry W. Brown, James Lovato and Kathy Russell,
C structures and functions of pyclimate designed to allow access to DCDFLIB from Python:
typedef struct {
int which;
double p;
double q;
double x;
double df;
int status;
double bound;
} CDFChi;
extern int pycdfchi( CDFChi *sptr);
import pyclimate.pydcdflib
pycdf=pyclimate.pydcdflib
chi2=pycdf.CDFChi() # Create an instance of the Chi**2 object
chi2.which=2 # Kind of conversion, from p,q and dof, get x
chi2.p=0.95 # Assign p (q is automatically set to 1-p)
chi2.df=10 # Degrees of freedom
# This intermediate function calls the original one in DCDFLIB.C
pycdf.pycdfchi(chi2)
# This prints: 18.307038 0 (0 means no error in DCDFLIB.C)
print chi2.x,chi2.status
Fig. 2. Access to DCDFLIB.C from Python by means of a layer of structures and function calls.
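The quantile printed in Fig. 2 can be cross-checked without DCDFLIB. The following is a minimal sketch and not part of PyClimate: it assumes an even number of degrees of freedom, for which the chi-square CDF has a closed form, and inverts that CDF by bisection. The function names `chi2_cdf_even` and `chi2_quantile_even` are hypothetical, and the code targets a current Python 3 interpreter rather than the Python version used in the paper.

```python
import math

def chi2_cdf_even(x, df):
    """Chi-square CDF for even df, using the closed-form finite sum."""
    if df <= 0 or df % 2:
        raise ValueError("this sketch only handles even, positive df")
    if x <= 0.0:
        return 0.0
    half = x / 2.0
    term, total = 1.0, 1.0              # k = 0 term of sum_{k < df/2} half**k / k!
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return 1.0 - math.exp(-half) * total

def chi2_quantile_even(p, df, tol=1e-10):
    """Invert the CDF by bisection: find x such that CDF(x) = p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, df) < p:    # grow the bracket until it contains p
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if chi2_cdf_even(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Same case as Fig. 2: p = 0.95 with 10 degrees of freedom
x95 = chi2_quantile_even(0.95, 10)
print("%.6f" % x95)
```

The result should agree with the value of x reported in Fig. 2 to the printed precision. Bisection is slow compared with the methods in DCDFLIB, but it is easy to verify by hand.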
possibility is the use of Monte Carlo techniques based on temporal subsampling of the input dataset, assessing the stability of the eigenvectors through the congruence coefficient (Cheng et al., 1994; Richman and Lamb, 1985). For geophysical data sets, the number of temporal samples is usually lower than the number of spatial sites. Under these conditions, the covariance matrix is singular and the eigenvalue problem cannot be solved by means of standard linear algebra routines like LAPACK's dsyev. To overcome this limitation, PyClimate uses the singular value decomposition (SVD) of the data matrix to achieve the EOF decomposition (von Storch, 1995; Wunsch, 1997), which makes the problem solvable under all circumstances (Golub and van Loan, 1996). This approach makes a judicious use of the truncation rules previously mentioned essential. A simple example of how the EOF analysis is performed on a dataset is shown in Fig. 3.

One of the simplest algorithms available for the analysis of the linearly coupled variability of multivariate datasets is based on the singular value decomposition of the covariance matrix between both fields (Bretherton et al., 1992). Despite some debate about the method (Cherry, 1996; Hu, 1997; Newman and Sardeshmukh, 1995), it is widely applied. Some applications include, for instance, the sea surface temperature (SST) field and the atmospheric geopotential height field (Peng and Fyfe, 1996), SSTs and tropospheric vertical wind shear (Shapiro and Goldenberg, 1998) or ozone and equatorial zonal winds (Randel and Wu, 1996). PyClimate includes some functions to compute the covariance and squared covariance fractions, the homogeneous and heterogeneous correlation maps and the singular vectors. It is also able to perform a quantitative analysis of the stability of those vectors by means of a Monte Carlo analysis based on temporal subsampling.

The canonical correlation analysis (CCA) is another widely used technique for the analysis of linearly coupled fields (von Storch, 1995). Unlike the SVD, which maximises the covariance of the fields under consideration, the CCA maximises the correlation of the fields. PyClimate also provides a module to carry out this analysis.

5. Kernel-based probability density function estimation

Histograms are simple estimators of probability density functions (PDF) in univariate and multivariate statistical analysis. However, they have some drawbacks, such as their lack of differentiability or their poor
Fig. 3. Use of code to compute EOFs by means of SVD decomposition of data matrix.
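The listing for Fig. 3 is not reproduced here. As a rough illustration of the approach described in the text (EOFs obtained from the SVD of the data matrix instead of the eigenvectors of the covariance matrix), the sketch below uses NumPy, the successor of the Numeric module used in the paper. It does not use the PyClimate API; the function name `eofs_by_svd` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def eofs_by_svd(data):
    """EOF analysis of a (time, space) array through the SVD of the anomalies.

    Returns the principal components (time, mode), the eigenvalues of the
    sample covariance matrix (mode,) and the EOFs (space, mode).
    """
    anomalies = data - data.mean(axis=0)        # remove the time mean at each site
    nt = anomalies.shape[0]
    u, s, vt = np.linalg.svd(anomalies, full_matrices=False)
    eigenvalues = s ** 2 / (nt - 1)             # variance associated with each mode
    pcs = u * s                                 # projections of the data onto the EOFs
    return pcs, eigenvalues, vt.T

# Synthetic field: 20 time samples at 50 sites, so the 50 x 50 covariance
# matrix is singular, the usual situation for geophysical data sets.
rng = np.random.default_rng(0)
field = rng.standard_normal((20, 50))
pcs, lambdas, eofs = eofs_by_svd(field)

explained = lambdas / lambdas.sum()             # fraction of variance per mode
print(explained[:3])
```

Working on the data matrix avoids building the full space-by-space covariance matrix, which is the point made in the text when the number of sites exceeds the number of time samples.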
import Numeric
import LinearAlgebra
import pyclimate.KPDF
from pyclimate.readdat import readdat
from Scientific.IO.NetCDF import NetCDFFile
N=Numeric
LA= LinearAlgebra
pyKPDF=pyclimate.KPDF
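Only the import lines of this listing survive here. As a hedged illustration of the kind of computation this section describes, the sketch below evaluates a Fukunaga-type estimate (Eq. (2)) for d = 2 with a bivariate Epanechnikov kernel using NumPy. It is not the PyClimate C implementation; the function names `epanechnikov_2d` and `fukunaga_pdf`, the synthetic sample and the grid are assumptions made for the example.

```python
import numpy as np

def epanechnikov_2d(r2):
    """Bivariate Epanechnikov kernel as a function of the squared radius r2."""
    return np.where(r2 < 1.0, (2.0 / np.pi) * (1.0 - r2), 0.0)

def fukunaga_pdf(points, sample, h):
    """Evaluate the estimate at `points` (m, 2) from the `sample` (n, 2)."""
    n, d = sample.shape
    cov = np.cov(sample, rowvar=False)                 # sample covariance matrix S
    cov_inv = np.linalg.inv(cov)
    diff = points[:, None, :] - sample[None, :, :]     # (m, n, d) array of x - X_i
    # Quadratic form (x - X_i)^T S^-1 (x - X_i), scaled by h^-2
    r2 = np.einsum("mni,ij,mnj->mn", diff, cov_inv, diff) / h ** 2
    norm = np.linalg.det(cov) ** -0.5 / (n * h ** d)
    return norm * epanechnikov_2d(r2).sum(axis=1)

# Synthetic correlated sample and a regular evaluation grid stored as a
# linear array of points
rng = np.random.default_rng(1)
sample = rng.standard_normal((100, 2)) @ np.array([[1.0, 0.6], [0.0, 0.8]])
axis = np.linspace(-7.0, 7.0, 113)                     # grid spacing 0.125
gx, gy = np.meshgrid(axis, axis)
grid = np.column_stack([gx.ravel(), gy.ravel()])

pdf = fukunaga_pdf(grid, sample, h=1.0)
integral = pdf.sum() * 0.125 ** 2                      # Riemann sum, close to 1
print(integral)
```

The broadcasted `diff` array trades memory for simplicity; a compiled loop, as used in the package, avoids that intermediate array.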
functions, which accomplish the estimation in Eq. (1) using Epanechnikov, Biweight or Triangular kernels. For multivariate data, simple and Fukunaga (Eq. (2)) estimators have been implemented,

$$\hat{f}(x) = \frac{(\det S)^{-1/2}}{n h^{d}} \sum_{i=1}^{n} K\left( h^{-2} (x - X_i)^{T} S^{-1} (x - X_i) \right), \tag{2}$$

where $x$ is the $d$-dimensional point where the PDF is to be evaluated and $X_i$, $i = 1, \ldots, n$ are the observed points in the $d$-dimensional space. The data are previously scaled by means of the inverse of the covariance matrix in the Fukunaga estimator, so $h$ is a one-dimensional bandwidth common to all axes and $S$ is the sample covariance matrix. For simple multivariate estimators, the identity matrix $I$ is used instead of the covariance matrix. Currently, the estimator works only for $d = 2, 3$ and for Epanechnikov and multivariate Gaussian kernels.

Fig. 4 shows a sample PDF computed using $h = 1$ from the observed monthly Cold Tongue Index (CTI) and Arctic Oscillation Index (AO), which measure the El Niño-Southern Oscillation (ENSO) and the strength of the annular mode of the extratropical atmospheric circulation in the Northern Hemisphere (Thompson and Wallace, 1998). Fig. 5 shows the program used to compute the PDF. The module is completely written in C to achieve better performance. There are some auxiliary functions to generate the grids as linear arrays for multidimensional cases. Thus, there is no need to use complicated combinations of other Python and Numerical Python functions such as map, replicate and concatenate to obtain those structures, and the overall performance of the PDF estimation improves.

6. Multivariate digital filters

When analysing geophysical data sets, there are often different scales of motion involved, and they reflect different physical processes. When analysing oceanographic data, it is important to distinguish between internal gravity waves and Rossby waves (Gill, 1982) in the analysis of observed data or to filter spurious aliased energy from model results (Jayne and Tokmakian, 1997). For extratropical atmospheric motions, the variability in the 2–10 day band is due to baroclinic instability of the atmospheric flow. Conversely, the so-called low-frequency variability in the monthly time-scale is usually attributed to tropical diabatic forcing of the atmosphere (Lau, 1997) or non-linear interaction of waves of different scales (Handorf et al., 1999; Hansen et al., 1997). For the analysis of tropical convection, it is a common practice to remove the joint effects of the Madden–Julian Oscillation and the mixed Rossby-gravity waves in the Tropics (Matthews and Kiladis, 1999). The analysis of ENSO is often performed after filtering data in adequate frequency intervals (Tourre and White, 1997). Thus, digital filters are ubiquitous in geophysical data analysis to remove from a broad-band signal those frequency components that are irrelevant to the problem at hand. The filters are usually linear, that is, $Y_t = \sum_{k=-n}^{n} a_k X_{t+k}$. Making use of the array operations of Numerical Python, it is easy to code this kind of filter in an efficient manner. To achieve a good performance, the implementation in PyClimate iterates only once over the whole dataset. First, the filter coefficients $a_k$ are computed. Next, the linear combination of records is calculated by means of matrix operations. There exist several books devoted to the design of filters. The main goal is to achieve sharp edges in the transfer function while reducing the Gibbs oscillations. Currently, PyClimate supports Kolmogorov–Zurbenko (Rao et al., 1997; Eskridge et al., 1997) and Lanczos (Duchon, 1979) filters. Other filters can be added by subclassing the LinearFilter class and rewriting its constructor, which simply defines the coefficients $a_k$.

7. Differential operators on the sphere

There exist several libraries with functions to compute differential operators on spherical coordinates, like SPHEREPACK,⁶ but they are oriented to the development of computer-intensive models using low-level programming languages. However, in geophysics, it is a typical task to use differential operators in spherical coordinates for the analysis of gridded data sets. The gradient of geopotential height in the atmosphere or ocean is needed to compute the geostrophic wind or currents. The divergence of the troposphere-integrated moisture transport gives an estimation of the evaporation less precipitation over the surface, if the time evolution of the vertically integrated moisture content is disregarded. Finally, the curl of the surface wind stress over the ocean divided by the density and the Coriolis parameter allows the computation of the vertical velocity at the base of the Ekman layer, to name a few examples of popular uses of these differential operators. PyClimate includes functions to compute the horizontal component of the gradient,

$$\vec{\nabla}_h \Phi = \left( \frac{1}{a \cos\phi} \frac{\partial \Phi}{\partial \lambda}, \; \frac{1}{a} \frac{\partial \Phi}{\partial \phi} \right), \tag{3}$$

the divergence of a horizontal vector field,

$$\vec{\nabla} \cdot \vec{v}_h = \frac{1}{a \cos\phi} \left( \frac{\partial u}{\partial \lambda} + \frac{\partial (v \cos\phi)}{\partial \phi} \right) \tag{4}$$

⁶ Spherepack: A Model Development Facility. 1998. http://www.scd.ucar.edu/css/software/spherepack/.
Fig. 6. Array-based computation of $\partial (v \cos\phi) / \partial \phi$ by means of second-order centered differences.
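The listing for Fig. 6 is not reproduced here. The sketch below illustrates the idea named in the caption with NumPy slicing; it is not the PyClimate routine. It assumes a (latitude, longitude) array on a regularly spaced latitude axis given in degrees and uses one-sided differences on the two boundary rows, which is a choice made for this example. The function name `dvcos_dphi` is hypothetical.

```python
import numpy as np

def dvcos_dphi(v, lats_deg):
    """Second-order centered difference of v*cos(phi) with respect to phi.

    v        : array (nlat, nlon), meridional component of the vector field
    lats_deg : array (nlat,), regularly spaced latitudes in degrees
    """
    phi = np.deg2rad(lats_deg)
    dphi = phi[1] - phi[0]                       # constant spacing is assumed
    vcos = v * np.cos(phi)[:, np.newaxis]
    out = np.empty_like(vcos)
    # Interior rows: centered difference on whole array slices, no explicit loop
    out[1:-1] = (vcos[2:] - vcos[:-2]) / (2.0 * dphi)
    # Boundary rows: first-order one-sided differences (illustrative choice)
    out[0] = (vcos[1] - vcos[0]) / dphi
    out[-1] = (vcos[-1] - vcos[-2]) / dphi
    return out

# Check against an analytical case: v = sin(phi) gives d(v cos(phi))/dphi = cos(2 phi)
lats = np.linspace(-80.0, 80.0, 161)             # 1-degree latitude grid
lons = np.arange(0.0, 360.0, 2.0)
v = np.sin(np.deg2rad(lats))[:, None] * np.ones(lons.size)
numeric = dvcos_dphi(v, lats)
exact = np.cos(2.0 * np.deg2rad(lats))[:, None] * np.ones(lons.size)
print(np.abs(numeric - exact)[1:-1].max())       # truncation error on interior rows
```

Dividing this term, plus the corresponding longitude derivative of u, by $a \cos\phi$ gives the divergence of Eq. (4).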
Handorf, D., Pethoukhov, V.K., Dethloff, K., Eliseev, A.V., Weisheimer, A., Mokhov, I.I., 1999. Decadal climate variability in a coupled model of moderate complexity. Journal of Geophysical Research 104 (D22), 27253–27275.
Hansen, J., Sato, M., Ruedy, R., Lacis, A., Asamoah, A., Beckford, K., Borenstein, S., Brown, E., Cairns, B., Carlson, et al., 1997. Forcing and chaos in interannual to decadal climate change. Journal of Geophysical Research 102 (D22), 25679–25720.
Hu, Q., 1997. On the uniqueness of the singular value decomposition in meteorological applications. Journal of Climate 10 (7), 1762–1766.
Jackson, J.E., 1991. A User's Guide to Principal Components, 1st edn. Wiley, Chichester, UK, 569pp.
Jayne, S.R., Tokmakian, R., 1997. Forcing and sampling of ocean general circulation models: impact of high-frequency motions. Journal of Physical Oceanography 27 (6), 1173–1179.
Kimoto, M., Ghil, M., 1993. Multiple flow regimes in the Northern Hemisphere winter. Part I: methodology and hemispheric regimes. Journal of the Atmospheric Sciences 50 (16), 2625–2643.
Lau, N.-C., 1997. Interactions between global SST anomalies and the midlatitude atmospheric circulation. Bulletin of the American Meteorological Society 78 (1), 21–33.
Lutz, M., 1996. Programming Python, 1st edn. O'Reilly, Cambridge, MA, 904pp.
Lutz, M., Ascher, D., 1999. Learning Python, 1st edn. O'Reilly, Cambridge, MA, 384pp.
Matthews, A.J., Kiladis, G.N., 1999. The tropical–extratropical interaction between high-frequency transients and the Madden–Julian oscillation. Monthly Weather Review 127 (5), 661–677.
Newman, M., Sardeshmukh, P.D., 1995. A caveat concerning singular value decomposition. Journal of Climate 8 (2), 352–360.
North, G.R., Bell, T.L., Cahalan, R.F., Moeng, F.J., 1982. Sampling errors in the estimation of empirical orthogonal functions. Monthly Weather Review 110 (7), 699–706.
Peng, S., Fyfe, J., 1996. The coupled patterns between sea level pressure and sea surface temperature in the midlatitude North Atlantic. Journal of Climate 9 (8), 1824–1839.
Preisendorfer, R.W., 1988. Principal Component Analysis in Meteorology and Oceanography, 1st edn. Elsevier, Amsterdam, 425pp.
Randel, W.J., Wu, F., 1996. Isolation of the ozone QBO in SAGE II data by singular-value decomposition. Journal of the Atmospheric Sciences 53 (17), 2546–2559.
Rao, S.T., Zurbenko, I.G., Neagu, R., Porter, P.S., Ku, J.Y., Henry, R.F., 1997. Space and time scales in ambient ozone data. Bulletin of the American Meteorological Society 78 (10), 2153–2166.
Richman, M.B., 1986. Rotation of principal components. Journal of Climatology 6 (2), 293–335.
Richman, M.B., Lamb, P.J., 1985. Climatic pattern analysis of three- and seven-day summer rainfall in the central United States: some methodological considerations and a regionalization. Journal of Climate and Applied Meteorology 24 (12), 1325–1343.
Robertson, A.W., Mechoso, C.R., Kim, Y.-J., 2000. The influence of Atlantic sea surface temperature anomalies on the North Atlantic Oscillation. Journal of Climate 13 (1), 122–138.
Shapiro, L.J., Goldenberg, S.B., 1998. Atlantic sea surface temperatures and tropical cyclone formation. Journal of Climate 11 (4), 578–590.
Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis, 1st edn. Chapman and Hall, London, 175pp.
Thompson, D.J.W., Wallace, J.M., 1998. The Arctic oscillation signature in the wintertime geopotential height and temperature fields. Geophysical Research Letters 25 (9), 1297–1300.
Tourre, Y.M., White, W.B., 1997. Evolution of the ENSO signal over the Indo-Pacific domain. Journal of Physical Oceanography 27 (5), 683–696.
von Storch, H., 1995. Spatial patterns: EOFs and CCA. In: von Storch, H., Navarra, A. (Eds.), Analysis of Climate Variability. Applications of Statistical Techniques. Springer, Berlin, pp. 227–258.
von Storch, H., Zwiers, F.W., 1999. Statistical Analysis in Climate Research, 1st edn. Cambridge University Press, Cambridge, 484pp.
Wunsch, C., 1997. The vertical partition of oceanic horizontal kinetic energy. Journal of Physical Oceanography 27 (8), 1770–1794.