Fast Spatio-Temporal Data Mining of Large Geophysical Datasets
Figure 1: CONQUEST system architecture, comprising the query language interface and parser, the IDL-based Visualization Manager, the command dictionary and interpreter, the query optimizer and communication manager, dataflow processors, and the information repositories (file-based datasets, features DBMS, metadata and indices, mass store).
Radar imagery of the earth's surface and of other planets. CONQUEST shares a great deal in common with these systems. Its distinguishing features are 1) that it is designed to address datasets with prominent temporal components in addition to significant high-resolution spatial information, and 2) that it is designed from the beginning to take maximum advantage of parallel and distributed processing power.

The remainder of this paper is laid out as follows. We first introduce our underlying system architecture, followed by an introduction to the datasets that have been used as a testbed for the system. We then focus on the extraction of two important spatio-temporal patterns from these datasets, namely cyclone tracks and blocking features, outline an algorithm to perform hierarchical cluster analysis, and describe the efficient implementation of these procedures on massively parallel supercomputers. These patterns exemplify the types of high-level queries with which CONQUEST will be populated.

System Architecture
The system architecture is outlined in Figure 1. It consists of the following five basic components:

Scientist Workbench
Query Manager (parser and optimizer)
Visualization Manager
Query execution engine
Information Repository

The scientific workbench consists of a graphical user interface enabling the formulation of queries in terms of imagery presented on the screen by the Visualization Manager. Queries formulated on the workbench are parsed and optimized for target architectures by the Query Manager, and then passed on to the execution engines. These can be parallel or serial supercomputers, such as the IBM SP1 and Intel Paragon, single workstations, or workstation farms. The simplest queries consist of the extraction of well-defined features from "raw" data, without reference to any other information. These features are registered with the Information Repository to act as indices for further queries. Salient information extracted by queries can also be displayed via the Visualization Manager. The latter is implemented on top of IDL, and supports static plotting (2D and 3D graphs) of data, analysis of data (e.g., statistics, contours), and animation of datasets. Further details of the system architecture and an outline of typical querying sessions can be found in (Mesrobian et al. 1994).
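To make the flow through these components concrete, the fragment below sketches one way the query lifecycle could be organized. The class and method names (QueryManager, InformationRepository, submit, register) are hypothetical illustrations, not the actual CONQUEST interfaces, and the parser, optimizer, and execution engine are reduced to stubs.

```python
class InformationRepository:
    """Hypothetical stand-in: extracted features are registered here and
    act as indices for subsequent queries."""
    def __init__(self):
        self.indices = {}

    def register(self, name, features):
        self.indices[name] = features


class QueryManager:
    """Hypothetical stand-in for the query manager: parse a query, optimize it
    for a target architecture, and pass it to an execution engine."""
    def __init__(self, engines, repository):
        self.engines = engines
        self.repository = repository

    def submit(self, query_name, dataset, target="workstation"):
        plan = {"feature": query_name, "dataset": dataset}  # placeholder parse step
        # (architecture-specific optimization would rewrite `plan` here)
        features = self.engines[target](plan)                # execution engine
        self.repository.register(query_name, features)       # features become indices
        return features


# Toy usage: a "cyclone" query executed by a stub serial engine.
repo = InformationRepository()
manager = QueryManager({"workstation": lambda plan: ["track-1", "track-2"]}, repo)
tracks = manager.submit("cyclone-tracks", dataset="AGCM-10yr")
print(tracks, list(repo.indices))
```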
Datasets
CONQUEST has been applied in the first instance to datasets obtained from two different sources. The first dataset is output from an Atmospheric Global Circulation Model (AGCM) developed at UCLA, chosen for two principal reasons: (1) it includes a challenging set of spatio-temporal patterns (e.g., cyclones, hurricanes, fronts, and blocking events); and (2) it is generally free of incomplete, noisy, or contradictory information.
Figure 2: A typical cyclone track and a dataflow description of the associated cyclone query: sea level pressure minima and the 700mb wind are read at successive time steps (t1, t2), cyclones are extracted and matched against unfinished tracks, and completed tracks are inserted into the database and visualized.
Hence it serves as an ideal testbed for validating our prototype environment. The prognostic variables of the AGCM are horizontal velocities, potential temperature, water vapor and ozone mixing ratio, surface pressure, ground temperature, and the depth of the planetary boundary layer. Typically, the model's output is written out to the database at 12-hour (simulation time) intervals; however, this frequency can be modified depending on the storage capacity of the database. The model can be run with different spatial resolutions (grid sizes) and temporal resolutions (output frequencies). At the lowest spatial resolution (4° × 5°, 9 levels) with a 12-hour output interval, the AGCM produces approximately 5 Gbytes of data per simulated year, while a 100-year simulation of an AGCM with a 1° × 1.25°, 57-level grid generates approximately 30 terabytes of output.

The second dataset is obtained from ECMWF (European Center for Medium-range Weather Forecasting), and is split into two subgroups based upon simulated data and satellite data respectively. The ECMWF T42L19 and T42L19 VIlp AMIP 10 Year Simulation (1979-1988) dataset contains fields with a grid size of 128 longitudinal points (2.8125°) by 64 Gaussian latitudinal points, by 15 pressure levels. Model variables were output to files at 6-hour (simulation time) intervals. Each 4D variable (e.g., geopotential height) requires 7 Gb of disk storage. The ECMWF TOGA Global Basic Surface and Upper Air Analyses dataset consists of fields which are uninitialized analyses sampled twice a day (0 GMT and 1200 GMT), at 14 or 15 pressure levels, over a 2.5° longitude by 2.5° latitude grid. Upper air variables include geopotential, temperature, vertical velocity, u- and v-components of horizontal wind, and relative humidity, while surface variables include surface pressure, surface temperature, mean sea-level pressure, etc. The dataset requires about 130 Mb/month.

Spatio-temporal Feature Extraction
Our initial goal has been to use CONQUEST to capture heuristic rules for prominent features that have been identified in the literature for some time. We have concentrated on two canonical features to demonstrate the system, namely cyclones and blocking features. These phenomena interact in a manner that is still imperfectly understood, and therefore represent ideal candidates for the implementation of complex queries. A secondary goal is to extend the system to include the detection of emergent spatio-temporal phenomena that are not known a priori, and we outline a cluster analysis technique that represents an initial step in this direction. We describe below the algorithms and preliminary results obtained for cyclones and blocking features, together with some of the issues involved in parallelization of the system on MPPs (Stolorz et al. 1995).

Cyclone detection
Cyclones are some of the most prominent climatic features displayed by Global Circulation Models. There is, however, no single objective definition in the literature of the notion of a cyclone. Several working definitions are based upon the detection of a threshold level of vorticity in quantities such as the atmospheric pressure at sea level. Others are based upon the determination of local minima of the sea level pressure. We chose to implement a scheme based upon the detection of local minima, an approach developed in a reasonably sophisticated form in (Murray & Simmonds 1991). This alternative was chosen because the latter's careful treatment includes the introduction of extra relevant information, such as prevailing wind velocities, in a meaningful way. It therefore allows the construction of compound queries, and offers an excellent testbed for the extensible, sophisticated querying capability of our system. We have created a modified version of this approach.

Cyclones are defined in our work as one-dimensional tracks in a 3-dimensional space consisting of a time axis and the 2 spatial axes of latitude and longitude. Cyclones represent paths of abnormally low sea level pressure in time. A typical cyclone track, in this case over the continental United States, is shown schematically in Figure 2, together with a dataflow description of the associated cyclone query. The track is found by first detecting one or more local minima in the 2-dimensional grid of sea level pressure values representing a single time frame of the GCM.
Figure 3: Cyclopresence density map of cyclones during the northern winter extracted from the ECMWF Analyses
dataset (1985-1994).
A local minimum is found by locating a grid location whose pressure value is lower than that at all the grid points in a neighborhood around the location by some (adjustable) prescribed threshold. This minimum is then refined by interpolation using low-order polynomials such as bi-cubic splines or quadratic bowls. Given a local minimum occurring in a certain GCM frame, the central idea is to locate a cyclone track by detecting in the subsequent GCM frame a new local minimum which is "sufficiently close" to the current one. Two minima are deemed "sufficiently close" to be part of the same cyclone track if they occur within 1/2 a grid spacing of each other. Failing this condition, they are also "sufficiently close" if their relative positions are consistent with the instantaneous wind velocity in the region. A trail of several such points computed from a series of successive frames constitutes a cyclone.
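A minimal serial sketch of this detection-and-linking step is given below, assuming the sea level pressure field arrives as a 2D NumPy array per frame. The function names, the neighborhood radius, the pressure-depth threshold, and the simplified wind-advection test are illustrative assumptions; the CONQUEST operators additionally refine each minimum by interpolation, which is omitted here.

```python
import numpy as np

def detect_minima(slp, depth=2.0, radius=2):
    """Return (i, j) indices whose sea level pressure is lower than every other
    point in a (2*radius+1)^2 neighborhood by at least `depth` (same units as slp)."""
    minima = []
    ni, nj = slp.shape
    for i in range(radius, ni - radius):
        for j in range(radius, nj - radius):
            patch = slp[i - radius:i + radius + 1, j - radius:j + radius + 1].copy()
            centre = patch[radius, radius]
            patch[radius, radius] = np.inf        # exclude the candidate itself
            if centre <= patch.min() - depth:
                minima.append((i, j))
    return minima

def extend_tracks(tracks, minima, u, v, max_sep=0.5):
    """Attach next-frame minima to existing tracks.  A minimum joins a track if it
    lies within `max_sep` grid spacings of the track head, or of the head position
    advected one time step by the local wind (u, v in grid spacings per step)."""
    for track in tracks:
        i0, j0 = track[-1]
        pred = (i0 + v[i0, j0], j0 + u[i0, j0])   # crude advection of the head
        for m in minima:
            if (np.hypot(m[0] - i0, m[1] - j0) <= max_sep or
                    np.hypot(m[0] - pred[0], m[1] - pred[1]) <= max_sep):
                track.append(m)
                break
    return tracks
```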
Figure 3 presents a cyclopresence density map of cyclones during the northern winter extracted from the satellite ECMWF datasets. In the figure, white represents the lowest density value, while black indicates the largest density value. It can be seen that most extratropical cyclones are formed and migrate within a few zonally-elongated regions (i.e., "stormtracks") in the northern Atlantic and Pacific, and around the Antarctic.

Blocking feature extraction
On time scales of one to two weeks the atmosphere occasionally manifests features which have well-defined structures and exist for an extended period of time essentially unchanged in form. Such structures are referred to, in general, as "persistent anomalies". One particular class of persistent anomalies, in which the basic westerly jet stream in mid-latitudes is split into two branches, has traditionally been referred to as "blocking" events. The typical anomalies in surface weather (i.e., temperature and precipitation) associated with blocking events, and their observed frequency, have made predicting their onset and decay a high priority for medium-range (5-15 day) weather forecasters.

While there is no general agreement on how to objectively define blocking events, most definitions require that the following conditions exist: 1) the basic westerly wind flow is split into two branches, 2) a large positive geopotential height anomaly is present downstream of the split, and 3) the pattern persists with recognizable continuity for at least 5 days. We modeled our blocking analysis operators after Nakamura and Wallace (Nakamura & Wallace 1990). Blocking features are determined by measuring the difference between the geopotential height at a given time of year and the climatological mean at that time of year, averaged over the entire time range of the dataset. Before taking this difference, the geopotential height is first passed through a low-pass temporal filter (a 4th-order Butterworth filter with a 6-day cut-off), to ensure that blocking signatures are not contaminated by the signals of migratory cyclones and anticyclones. The filtered field is averaged to obtain the mean year. A Fourier transform of the mean year is then taken, followed by an inverse Fourier transform on the first four Fourier components. This procedure yields smooth time series for seasonal cycles even if the dataset is small (< 100 years).
Figure 4: Density map of blocking events extracted from the UCLA AGCM model data (1985-1989).
The resulting filtered mean year is subsequently compared with the Butterworth-processed geopotential height fields to generate the fundamental anomaly fields. Blocking "events" can then be detected as time periods during which the filtered geopotential anomaly values remain persistently higher than a prescribed threshold.
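The following sketch follows these steps for a single grid point, using SciPy's Butterworth filter and NumPy's FFT. The sampling rate, the threshold value, and the persistence length are placeholder parameters (the text quotes 5 days and 0.5 for Figure 4), and the handling of partial years is simplified.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def blocking_periods(z, steps_per_day=2, years=10,
                     cutoff_days=6.0, threshold=0.5, min_days=5):
    """Flag candidate blocking periods in a geopotential height series `z`
    (one grid point, `steps_per_day` samples per day over `years` years)."""
    steps_per_year = len(z) // years
    n = steps_per_year * years

    # 1. 4th-order Butterworth low-pass filter with a 6-day cut-off.
    b, a = butter(4, (1.0 / cutoff_days) / (steps_per_day / 2.0))
    z_filt = filtfilt(b, a, z)[:n]

    # 2. Average the filtered series over the years to obtain the "mean year".
    mean_year = z_filt.reshape(years, steps_per_year).mean(axis=0)

    # 3. Smooth seasonal cycle: keep only the first four Fourier components.
    coeffs = np.fft.rfft(mean_year)
    coeffs[4:] = 0.0
    smooth_year = np.fft.irfft(coeffs, n=steps_per_year)

    # 4. Anomaly = filtered field minus the repeated smooth seasonal cycle.
    anomaly = z_filt - np.tile(smooth_year, years)

    # 5. Keep runs where the anomaly stays above `threshold` for >= `min_days`.
    min_steps, events, start = min_days * steps_per_day, [], None
    for t, above in enumerate(anomaly > threshold):
        if above and start is None:
            start = t
        elif not above and start is not None:
            if t - start >= min_steps:
                events.append((start, t))
            start = None
    if start is not None and n - start >= min_steps:
        events.append((start, n))
    return events
```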
Figure 4 presents a density plot indicating the global occurrences of blocking events for UCLA AGCM data (1985-1989), extracted using a minimum persistence of 5 days and a threshold value of 0.5. In the figure, white represents the lowest density value, while black indicates the largest density value. Since blocking is by nature an extratropical phenomenon, we have eliminated values in the tropics from the plot.

Hierarchical Cluster Analysis
One method of reducing the dimensionality of the large image datasets, while retaining the structure of important regularities, is to perform some sort of cluster analysis which groups images together according to shared spatial features. We have adapted a hierarchical clustering procedure that has been used in the atmospheric science literature (Cheng & Wallace 1993) to a distributed farm of workstations. Our implementation uses publicly available PVM message-passing software.

The procedure used by Cheng and Wallace seeks to cluster together many frames of a geophysical field of interest generated over time. It builds clusters recursively by starting with each of N frames as an individual cluster. A Euclidean pointwise distance d_pq between every pair of images p and q is first computed. The sum of all such distances is defined as the error of a clustering. At each step two clusters are chosen to be merged, namely the cluster pair which minimally increases the error. A simple mean image is updated at each step for each cluster for use in future computations of the total error. The resulting tree structure can be used to reduce the data dimensionality by identifying "cluster images" containing most of the important information. Many of the insights obtained correspond to similar observations that can be made on the basis of a somewhat more elaborate singular value decomposition analysis (Cheng & Wallace 1993).
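A compact serial sketch of this agglomerative procedure is shown below. It represents each cluster by its member frames, mean image, and size, and estimates the increase in total error from the mean images (a Ward-style shortcut); the exact error bookkeeping and the PVM-based distribution across workstations used in CONQUEST are not reproduced here.

```python
import numpy as np

def hierarchical_cluster(frames, n_clusters=10):
    """Agglomerative clustering of 2-D frames (array of shape n_frames x ny x nx).
    At every step the pair of clusters whose merge increases the total error least
    (estimated from the cluster mean images) is merged."""
    clusters = [{"members": [i], "mean": f.astype(float), "size": 1}
                for i, f in enumerate(frames)]

    def merge_cost(a, b):
        # Increase in total squared error if a and b are merged (Ward-style
        # estimate based on the cluster mean images).
        diff = a["mean"] - b["mean"]
        return a["size"] * b["size"] / (a["size"] + b["size"]) * np.sum(diff ** 2)

    merges = []                                   # record of the resulting tree
    while len(clusters) > n_clusters:
        best = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]]))
        a, b = clusters[best[0]], clusters.pop(best[1])
        merges.append((a["members"][:], b["members"][:]))
        a["mean"] = (a["mean"] * a["size"] + b["mean"] * b["size"]) / (a["size"] + b["size"])
        a["members"] += b["members"]
        a["size"] += b["size"]
    return clusters, merges
```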
Massively Parallel Feature Detection
The algorithms described above for extracting cyclone and blocking features from a 10-year dataset of atmospheric data require several hours to execute on a typical scientific workstation. Since one of our primary considerations is the need to supply an interactive facility, these processing times must obviously be drastically reduced if queries of this type are to supply initial indices to the datasets. Pre-processing and storage of indices by workstations is of course a feasible alternative for heavily used features, but it will not suffice for a more general and wide-ranging querying capability. It is here that massively parallel processors (MPPs) enter the picture. The features described above can be computed quite efficiently on MPPs, bringing the turn-around time for a typical query down to the range of minutes on the medium-scale parallel machines used to date (a 24-node IBM SP1 and a 56-node Intel Paragon). It is expected that near real-time performance will be achieved when the system is ported to larger platforms comprising up to 512 nodes.

The parallel implementation of these queries requires an explicit decomposition of the problem across the various nodes of a parallel machine. This is by no means always a trivial task. In the case of cyclone detection, the optimal decomposition is based upon a division of the problem into separate temporal slices, each of which is assigned to a separate node of the machine. A temporal decomposition such as this proves to be highly efficient on a coarse-grained architecture, provided that cyclone results obtained during a given time slice do not interfere too strongly with those at a later time.

Care must be exercised in such a decomposition, as the temporal dimension does not typically parallelize in a natural way, especially when state information plays an important role in the global result. State information plays a fundamental role in the very definition of cyclones, so care must obviously be taken in the ensuing parallel decomposition. The problem proved tractable in the case of cyclone detection because of the observation that no cyclone lasts longer than 24 frames. This allows the use of a straightforward temporal shadowing procedure, in which each node is assigned a small number of extra temporal frames that overlap with the first few frames assigned to its successor node (Stolorz et al. 1995). In the case of blocking feature detection, a straightforward spatial decomposition which assigns different blocks of grid points to different machine nodes proves to be optimal. This type of decomposition has also been used for efficient parallel implementation of the hierarchical clustering algorithm.
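The bookkeeping behind the temporal shadowing decomposition can be sketched as follows; the function name and slice format are illustrative assumptions, and the merging of partial tracks across slice boundaries, handled in the actual implementation (Stolorz et al. 1995), is omitted.

```python
def temporal_slices(n_frames, n_nodes, shadow=24):
    """Partition a time series of frames across nodes, giving each node `shadow`
    extra frames that overlap the start of its successor's slice so that tracks
    crossing a boundary can be completed locally (no cyclone exceeds 24 frames)."""
    base = n_frames // n_nodes
    slices = []
    for node in range(n_nodes):
        start = node * base
        stop = n_frames if node == n_nodes - 1 else (node + 1) * base + shadow
        slices.append((start, min(stop, n_frames)))
    return slices

# e.g. 10 years of 12-hourly frames on a 24-node machine:
# temporal_slices(7300, 24) -> [(0, 328), (304, 632), ..., (6992, 7300)]
```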
Conclusions
We have outlined the development of an extensible query processing system in which scientists can easily construct content-based queries. It allows important features present in geophysical datasets to be extracted and catalogued efficiently. The utility of the system has been demonstrated by its application to the extraction of cyclone tracks and blocking events from both observational and simulated datasets on the order of gigabytes in size. The system has been implemented on medium-scale parallel platforms and on workstation farms.

The prototype system described here is being extended and generalized in several directions. One is the population of the query set with a wider range of phenomena, including oceanographic as well as atmospheric queries. Another is the application of machine learning methods to extract previously unsuspected patterns of interest. A third issue is the scaling of the system onto massively parallel platforms, a necessary ingredient for the system to cope with the terabyte-size datasets that are becoming available. In this regime, scalable I/O considerations are at least as important as those associated with computation per se, and are an active area of research. A final issue is the development of an appropriate field-model language capable of expressing queries based upon large imagery datasets rapidly and efficiently. Preliminary results (Shek and Muntz, private communication) suggest that the overhead introduced by the proposed language is relatively small (roughly 10%) compared to the execution of standalone code. These results are extremely encouraging, as the ability to formulate and add new queries to the system quickly and easily is of paramount importance to its usefulness as a scientific analysis tool.

Acknowledgements
This research was supported by the NASA HPCC program, under grants #NAG 5-2224 and #NAG 5-2225.

References
Knuth, E. and Wagner, L.M. 1992. Visual Database Systems. North Holland.
Chang, S-K. and Hsu, A. 1992. Image Information Systems: Where do we go from here? IEEE Transactions on Knowledge and Data Engineering 4(5):431-442.
Niblack, W. et al. 1993. The QBIC Project: Querying Images by Content Using Color, Texture, and Shape. IBM Research Division, Research Report #RJ 9203.
Fayyad, U. M.; Smyth, P.; Weir, N.; and Djorgovski, S. 1994. Automated analysis and exploration of large image databases: results, progress, and challenges. Journal of Intelligent Information Systems 4:1-19.
Gupta, A.; Weymouth, T.; and Jain, R. 1991. Semantic Queries with Pictures: The VIMSYS Model. In Proceedings of VLDB, 69-79. Barcelona, Spain.
Guptill, A. and Stonebraker, M. 1992. The Sequoia 2000 Approach to Managing Large Spatial Object Databases. In Proc. 5th Int'l. Symposium on Spatial Data Handling, 642-651. Charleston, S.C.
Mesrobian, E.; Muntz, R. R.; Santos, J. R.; Shek, E. C.; Mechoso, C. R.; Farrara, J. D.; and Stolorz, P. 1994. Extracting Spatio-Temporal Patterns from Geoscience Datasets. In IEEE Workshop on Visualization and Machine Vision. Seattle, Washington: IEEE Computer Society.
Stolorz, P.; Mesrobian, E.; Muntz, R. R.; Santos, J. R.; Shek, E. C.; Mechoso, C. R.; and Farrara, J. D. 1995. Spatio-Temporal Data Mining on MPP's. Submitted to the Science Information and Data Compression Workshop, Greenbelt, MD.
Murray, R.J. and Simmonds, I. 1991. A numerical scheme for tracking cyclone centres from digital data. Part I: development and operation of the scheme. Aust. Met. Mag. 39:155-166.
Nakamura, H. and Wallace, J.M. 1990. Observed Changes in Baroclinic Wave Activity during the Life Cycles of Low-frequency Circulation Anomalies. J. Atmos. Sci. 47:1100-1116.
Cheng, X. and Wallace, J.M. 1993. Cluster Analysis of the Northern Hemisphere Wintertime 500-hPa Height Field: Spatial Patterns. J. Atmos. Sci. 50:2674-2696.