
Appears in: Proceedings of the First International Conference on Knowledge Discovery and Data Mining, AAAI Press, Montreal (1995) 300-305


Fast Spatio-Temporal Data Mining of Large Geophysical Datasets

P. Stolorz, Jet Propulsion Laboratory, California Institute of Technology
H. Nakamura, Dept. of Earth and Planetary Physics, University of Tokyo
E. Mesrobian, R. R. Muntz, E. C. Shek, J. R. Santos, J. Yi, K. Ng, S.-Y. Chien, Computer Science Dept., UCLA
C. R. Mechoso, J. D. Farrara, Atmospheric Sciences Dept., UCLA

Abstract

The important scientific challenge of understanding global climate change is one that clearly requires the application of knowledge discovery and data mining techniques on a massive scale. Advances in parallel supercomputing technology, enabling high-resolution modeling, as well as in sensor technology, allowing data capture on an unprecedented scale, conspire to overwhelm present-day analysis approaches. We present here early experiences with a prototype exploratory data analysis environment, CONQUEST, designed to provide content-based access to such massive scientific datasets. CONQUEST (CONtent-based Querying in Space and Time) employs a combination of workstations and massively parallel processors (MPP's) to mine geophysical datasets possessing a prominent temporal component. It is designed to enable complex multi-modal interactive querying and knowledge discovery, while simultaneously coping with the extraordinary computational demands posed by the scope of the datasets involved. After outlining a working prototype, we concentrate here on the description of several associated feature extraction algorithms implemented on MPP platforms, together with some typical results.

Introduction

Understanding the long-term behavior of the earth's atmosphere and oceans is one of a number of ambitious scientific and technological challenges which have been classified as "Grand Challenge" problems. These problems share in common the need for the application of enormous computational resources. Substantial progress has of course already been made on global climate analysis over the years, due on the one hand to the development of ever more sophisticated sensors and data-collection devices, and on the other to the implementation and analysis of large-scale models on supercomputers. Gigabytes of data can now be generated with relative ease for a variety of important geophysical variables over long time scales. However, this very success has created a new problem: how do we store, manage, access and interpret the vast quantities of information now at our disposal?

The issue of data management and analysis is in itself a Grand Challenge which must be addressed if the production of real and synthetic data on a large scale is to prove truly useful. We present here early experiences with the design, implementation and application of CONQUEST (CONtent-based Querying in Space and Time), a distributed parallel querying and analysis environment developed to address this challenge in a geoscientific setting. The basic idea of CONQUEST is to supply a knowledge discovery environment which allows geophysical scientists to 1) easily formulate queries of interest, especially the generation of content-based indices dependent on both "specified" and "emergent" spatio-temporal patterns, 2) execute these queries rapidly on massive datasets, 3) visualize the results, and 4) rapidly and interactively infer and explore new hypotheses by supporting complex compound queries (in general, these queries depend not only on the different datasets themselves, but also on content-based indices supplied by the answers to previous queries). CONQUEST has been built under the auspices of NASA's High Performance Computing and Communications (HPCC) program. Although it is geared initially to the analysis and exploration of atmospheric and oceanographic datasets, we expect that many of its features will be of use to knowledge discovery systems in several other disciplines.

Content-based access to image databases is a rapidly developing field with applications to a number of different scientific, engineering and financial problems. A sampling may be found in volumes such as (Knuth & Wagner 1992, Chang & Hsu 1992). One example is the QUBIC project (Niblack et al. 1993) illustrating the state-of-the-art in image retrieval by content. Examples of work in the area of geoscience databases include JARTool (Fayyad et al. 1994), VIMSYS (Gupta, Weymouth & Jain 1991) and Sequoia 2000 (Guptill & Stonebraker 1992). Many of these efforts are directed at datasets which contain relatively static high-resolution spatial patterns, such as high-resolution Landsat imagery, and Synthetic Aperture
[Figure 1 (block diagram): the Scientist Workbench (Query Interface, Visualization Manager (IDL-based), Command Interpreter) communicates with the Conquest Query Manager (Query Language Parser, Data Dictionary, Query Optimizer, Communication Manager), which drives the Conquest Query Execution Server (IBM SP1, Sun Workstation Farm); dataflow processors operate over Information Repositories comprising file-based Datasets, a Features & Indices DBMS, Metadata, and a Mass Store.]

Figure 1: System Architecture
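The execution server in Figure 1 evaluates queries as graphs of dataflow processors. Purely as an illustration — the classes, operator names, and query below are hypothetical and are not part of CONQUEST's actual query language or interfaces — such a pipeline can be modeled as a chain of lazily evaluated operators:

```python
# Hypothetical sketch of a dataflow-style query pipeline in the spirit of
# Figure 1: a query is a small graph of operators that a query manager
# could parse and optimize, and that an execution engine then evaluates.

class Operator:
    """Base dataflow node: iterating over it yields its results."""
    def __init__(self, source=None):
        self.source = source

    def __iter__(self):
        raise NotImplementedError


class Scan(Operator):
    """Leaf node: reads 'raw' dataset frames (here, an in-memory list)."""
    def __init__(self, frames):
        super().__init__()
        self.frames = frames

    def __iter__(self):
        yield from self.frames


class Filter(Operator):
    """Keeps only frames satisfying a content-based predicate."""
    def __init__(self, source, predicate):
        super().__init__(source)
        self.predicate = predicate

    def __iter__(self):
        return (f for f in self.source if self.predicate(f))


class Extract(Operator):
    """Maps each surviving frame to a derived feature record."""
    def __init__(self, source, fn):
        super().__init__(source)
        self.fn = fn

    def __iter__(self):
        return (self.fn(f) for f in self.source)


# Usage: report (time, minimum value) for frames whose minimum dips below 5.
frames = [{"t": 0, "grid": [3, 5, 2]}, {"t": 1, "grid": [9, 8, 7]}]
query = Extract(Filter(Scan(frames), lambda f: min(f["grid"]) < 5),
                lambda f: (f["t"], min(f["grid"])))
print(list(query))  # [(0, 2)]
```

Because each operator is lazy, a plan like this can in principle be rewritten by an optimizer or partitioned across nodes before being pulled, which is the property the architecture relies on.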

Radar imagery of the earth's surface and of other planets. CONQUEST shares a great deal in common with these systems. Its distinguishing features are 1) that it is designed to address datasets with prominent temporal components in addition to significant high-resolution spatial information, and 2) that it is designed from the beginning to take maximum advantage of parallel and distributed processing power.

The remainder of this paper is laid out as follows. We first introduce our underlying system architecture, followed by an introduction to the datasets that have been used as a testbed for the system. We then focus on the extraction of two important spatio-temporal patterns from these datasets, namely cyclone tracks and blocking features, outline an algorithm to perform hierarchical cluster analysis, and describe the efficient implementation of these procedures on massively parallel supercomputers. These patterns exemplify the types of high-level queries with which CONQUEST will be populated.

System Architecture

The system architecture is outlined in Figure 1. It consists of the following five basic components:

- Scientist Workbench
- Query Manager (parser and optimizer)
- Visualization Manager
- Query execution engine
- Information Repository

The scientist workbench consists of a graphical user interface enabling the formulation of queries in terms of imagery presented on the screen by the Visualization Manager. Queries formulated on the workbench are parsed and optimized for target architectures by the Query Manager, and then passed on to the execution engines. These can be either parallel or serial supercomputers, such as the IBM SP1 and Intel Paragon, single workstations, or workstation farms. The simplest queries consist of the extraction of well-defined features from "raw" data, without reference to any other information. These features are registered with the Information Repository to act as indices for further queries. Salient information extracted by queries can also be displayed via the Visualization Manager. The latter is implemented on top of IDL, and supports static plotting (2D and 3D graphs) of data, analysis of data (e.g., statistical, contours), and animation of datasets. Further details of the system architecture and an outline of typical querying sessions can be found in (Mesrobian et al. 1994).

Datasets

CONQUEST has been applied in the first instance to datasets obtained from two different sources. The first dataset is output from an Atmospheric Global Circulation Model (AGCM) developed at UCLA, chosen for two principal reasons: (1) it includes a challenging set of spatial-temporal patterns (e.g., cyclones, hurricanes, fronts, and blocking events); and (2) it is generally free of incomplete, noisy, or contradictory information. Hence it
[Figure 2 (dataflow diagram): Sea Level Pressure and 700mb Wind fields feed a stage that reads SLP minima; minima from consecutive frames t1 and t2, combined with the state of unfinished tracks, are used to extract and track cyclones; completed tracks are inserted into the database and visualized.]

Figure 2: Dataflow representation of the cyclone tracking query.
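The tracking dataflow of Figure 2 can be sketched in a few lines of code. This is an illustrative toy, not the CONQUEST implementation: it works on a 1-D pressure grid, uses a plain distance threshold in place of the half-grid-spacing and wind-consistency tests described later in the text, and carries unfinished tracks as state between frames as the figure indicates. The function names are hypothetical.

```python
# Toy sketch of the Figure 2 cyclone-tracking dataflow: detect local
# sea-level-pressure minima in each frame, then link minima across
# consecutive frames into tracks, carrying unfinished tracks as state.

def local_minima(pressure, depth=1.0):
    """Indices whose value undercuts both neighbors by at least `depth`."""
    return [i for i in range(1, len(pressure) - 1)
            if pressure[i] <= pressure[i - 1] - depth
            and pressure[i] <= pressure[i + 1] - depth]

def track_cyclones(frames, max_step=1):
    """Link per-frame minima into tracks; a track closes when no minimum
    in the next frame lies within `max_step` grid points of its head."""
    unfinished, completed = [], []        # tracks are lists of (t, index)
    for t, pressure in enumerate(frames):
        minima = local_minima(pressure)
        still_open = []
        for track in unfinished:
            _, last = track[-1]
            close = [m for m in minima if abs(m - last) <= max_step]
            if close:
                m = min(close, key=lambda m: abs(m - last))
                minima.remove(m)          # each minimum extends one track
                track.append((t, m))
                still_open.append(track)
            else:
                completed.append(track)   # no continuation: track is done
        still_open += [[(t, m)] for m in minima]   # leftover minima start new tracks
        unfinished = still_open
    return completed + unfinished

frames = [
    [10, 4, 10, 10, 10],   # minimum at index 1
    [10, 10, 4, 10, 10],   # drifts to index 2
    [10, 10, 10, 4, 10],   # drifts to index 3
]
print(track_cyclones(frames))  # [[(0, 1), (1, 2), (2, 3)]]
```

Because each frame only needs the unfinished-track state from its predecessor, this formulation also hints at why the temporal decomposition discussed later in the paper works: the state that must cross a node boundary is small.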

serves as an ideal testbed for validating our prototype environment. The prognostic variables of the AGCM are horizontal velocities, potential temperature, water vapor and ozone mixing ratio, surface pressure, ground temperature, and the depth of the planetary boundary layer. Typically, the model's output is written out to the database at 12-hour (simulation time) intervals; however, this frequency can be modified depending on the storage capacity of the database. The model can be run with different spatial resolutions (grid sizes) and temporal resolutions (output frequency). At the lowest spatial resolution (4° × 5°, 9 levels) with a 12-hour output interval, the AGCM produces approximately 5 Gbytes of data per simulated year, while a 100-year simulation of an AGCM with a 1° × 1.25° grid and 57 levels generates approximately 30 terabytes of output.

The second dataset is obtained from ECMWF (European Center for Medium-range Weather Forecasting), and is split into two subgroups based upon simulated data and satellite data respectively. The ECMWF T42L19 and T42L19 VIlp AMIP 10 Year Simulation (1979-1988) dataset contains fields with a grid size of 128 longitudinal points (2.8125°) by 64 Gaussian latitudinal points, by 15 pressure levels. Model variables were output to files at 6-hour (simulation time) intervals. Each 4D variable (e.g., geopotential height) requires 7 Gb of disk storage. The ECMWF TOGA Global Basic Surface and Upper Air Analyses dataset consists of fields which are uninitialized analyses sampled twice a day (0 GMT and 1200 GMT), at 14 or 15 pressure levels, over a 2.5° longitude by 2.5° latitude grid. Upper air variables include geopotential, temperature, vertical velocity, u- and v-components of horizontal wind, and relative humidity, while surface variables include surface pressure, surface temperature, mean sea-level pressure, etc. The dataset requires about 130 Mb/month.

Spatio-temporal Feature Extraction

Our initial goal has been to use CONQUEST to capture heuristic rules for prominent features that have been identified in the literature for some time. We have concentrated on two canonical features to demonstrate the system, namely cyclones and blocking features. These phenomena interact in a manner that is still imperfectly understood, and therefore represent ideal candidates for the implementation of complex queries. A secondary goal is to extend the system to include the detection of emergent spatio-temporal phenomena that are not known a priori, and we outline a cluster analysis technique that represents an initial step in this direction. We describe below the algorithms and preliminary results obtained for cyclones and blocking features, together with some of the issues involved in parallelization of the system on MPP's (Stolorz et al. 1995).

Cyclone detection

Cyclones are some of the most prominent climatic features displayed by Global Circulation Models. There is, however, no single objective definition in the literature of the notion of a cyclone. Several working definitions are based upon the detection of a threshold level of vorticity in quantities such as the atmospheric pressure at sea level. Others are based upon the determination of local minima of the sea level pressure. We chose to implement a scheme based upon the detection of local minima, an approach developed in a reasonably sophisticated form in (Murray & Simmonds 1991). This alternative was chosen because its careful treatment includes the introduction of extra relevant information, such as prevailing wind velocities, in a meaningful way. It therefore allows the construction of compound queries, and offers an excellent testbed for the extensible, sophisticated querying capability of our system. We have created a modified version of this approach.

Cyclones are defined in our work as one-dimensional tracks in a 3-dimensional space consisting of a time axis and the 2 spatial axes of latitude and longitude. Cyclones represent paths of abnormally low sea level pressure in time. A typical cyclone track, in this case over the continental United States, is shown schematically in Figure 2, together with a dataflow description of the associated cyclone query. The track is found by first detecting one or more local minima in the 2-dimensional grid of sea level pressure values representing a single time frame of the GCM. A local minimum is found by locating a grid location whose pressure value is lower than that at all the grid points in a neighborhood around the location by some (adjustable) prescribed threshold. This minimum is then refined by interpolation using low-order polynomials such as bi-cubic splines or quadratic bowls. Given a local minimum occurring in a certain GCM frame, the central idea is to locate a cyclone track by detecting in the subsequent GCM frame a new local minimum which is "sufficiently close" to the current one. Two minima are deemed "sufficiently close" to be part of the same cyclone track if they occur within 1/2 a grid spacing of each other. Failing this condition, they are also "sufficiently close" if their relative positions are consistent with the instantaneous wind velocity in the region. A trail of several such points computed from a series of successive frames constitutes a cyclone.

Figure 3: Cyclopresence density map of cyclones during the northern winter extracted from the ECMWF Analyses dataset (1985-1994).

Figure 3 presents a cyclopresence density map of cyclones during the northern winter extracted from satellite ECMWF datasets. In the figure, white represents the lowest density value, while black indicates the largest density value. It can be seen that most extratropical cyclones are formed and migrate within a few zonally-elongated regions (i.e., "stormtracks") in the northern Atlantic and Pacific, and around the Antarctic.

Blocking Feature Extraction

On time scales of one to two weeks the atmosphere occasionally manifests features which have well-defined structures and exist for an extended period of time essentially unchanged in form. Such structures are referred to, in general, as "persistent anomalies". One particular class of persistent anomalies, in which the basic westerly jet stream in mid-latitudes is split into two branches, has traditionally been referred to as "blocking" events. The typical anomalies in surface weather (i.e., temperature and precipitation) associated with blocking events, and their observed frequency, have made predicting their onset and decay a high priority for medium-range (5-15 day) weather forecasters.

While there is no general agreement on how to objectively define blocking events, most definitions require that the following conditions exist: 1) the basic westerly wind flow is split into two branches, 2) a large positive geopotential height anomaly is present downstream of the split, and 3) the pattern persists with recognizable continuity for at least 5 days. We modeled our blocking analysis operators after Nakamura and Wallace (Nakamura & Wallace 1990). Blocking features are determined by measuring the difference between the geopotential height at a given time of year and the climatological mean at that time of year, averaged over the entire time range of the dataset. Before taking this difference, the geopotential height is first passed through a low-pass temporal filter (a 4th-order Butterworth filter with a 6-day cut-off), to ensure that blocking signatures are not contaminated by the signals of migratory cyclones and anticyclones. The filtered field is averaged to obtain the mean year. A Fourier transform of the mean year is then taken, followed by an inverse Fourier transform on the first four Fourier components. This procedure yields smooth time series for seasonal cycles even if the dataset is small (< 100 years). The resulting filtered mean year is subsequently compared with the Butterworth-processed geopotential height fields to generate the fundamental anomaly fields. Blocking "events" can be detected as time periods Δt during which filtered geopotential anomaly values are persistently higher than a prescribed threshold.

Figure 4: Density map of blocking events extracted from the UCLA AGCM model data (1985-1989).

Figure 4 presents a density plot indicating the global occurrences of blocking events for UCLA AGCM data (1985-1989), extracted using Δt = 5 days and a threshold of 0.5. In the figure, white represents the lowest density value, while black indicates the largest density value. Since blocking is by nature an extratropical phenomenon, we have eliminated values in the tropics from the plot.

Hierarchical Cluster Analysis

One method of reducing the dimensionality of the large image datasets, while retaining the structure of important regularities, is to perform some sort of cluster analysis which groups images together according to shared spatial features. We have adapted a hierarchical clustering procedure that has been used in the atmospheric science literature (Cheng & Wallace 1993) to a distributed farm of workstations. Our implementation uses publicly available PVM message-passing software.

The procedure used by Wallace seeks to cluster together many frames of a geophysical field of interest generated over time. It builds clusters recursively by starting with each of N frames as an individual cluster. A Euclidean pointwise distance d_pq between every pair of images p and q is first computed. The sum of all such distances is then defined as the error of a clustering. At each step two clusters are chosen to be merged, namely that cluster pair which minimally increases the error. A simple mean image is updated at each step for each cluster for use in future computations of the total error. The resulting tree structure can be used to reduce the data dimensionality by identifying "cluster images" containing most of the important information. Many of the insights obtained correspond to similar observations that can be made on the basis of a somewhat more elaborate singular value decomposition analysis (Cheng & Wallace 1993).

Massively Parallel Feature Detection

The algorithms described above for extracting cyclone and blocking features on a 10-year dataset of atmospheric data require several hours to execute on a typical scientific workstation. Since one of our primary considerations is the need to supply an interactive facility, these processing times must obviously be drastically reduced if queries of this type are to supply initial indices to the datasets. Pre-processing and storage of indices by workstations is of course a feasible alternative for heavily used features, but will not suffice for a more general and wide-ranging querying capability. It is here that massively parallel processors (MPP's) enter the picture. The features described above can be computed quite efficiently on MPP's, bringing the turn-around time for a typical query down to the range of minutes on the medium-scale parallel machines that have been used to date (a 24-node IBM SP1 and a 56-node Intel Paragon). It is expected that near real-time performance will be achieved when the system is ported to larger platforms comprising up to 512 nodes.

The parallel implementation of these queries requires an explicit decomposition of the problem across the various nodes of a parallel machine. This is by no means always a trivial task. In the case of cyclone detection, the optimal decomposition is based upon a
division of the problem into separate temporal slices, each of which is assigned to a separate node of the machine. A temporal decomposition such as this proves to be highly efficient on a coarse-grained architecture, provided that cyclone results obtained during a given time zone do not interfere too strongly with those at a later time.

Care must be exercised in such a decomposition, as the temporal dimension does not typically parallelize in a natural way, especially when state information plays an important role in the global result. State information plays a fundamental role in the very definition of cyclones, so care must obviously be taken in the ensuing parallel decomposition. The problem proved tractable in the case of cyclone detection because of the observation that no cyclones last longer than 24 frames. This allows the use of a straightforward temporal shadowing procedure, in which each node is assigned a small number of extra temporal frames that overlap with the first few frames assigned to its successor node (Stolorz et al. 1995). In the case of blocking feature detection, a straightforward spatial decomposition which assigns different blocks of grid points to different machine nodes proves to be optimal. This type of decomposition has also been used for efficient parallel implementation of the hierarchical clustering algorithm.

Conclusions

We have outlined the development of an extensible query processing system in which scientists can easily construct content-based queries. It allows important features present in geophysical datasets to be extracted and catalogued efficiently. The utility of the system has been demonstrated by its application to the extraction of cyclone tracks and blocking events from both observational and simulated datasets on the order of gigabytes in size. The system has been implemented on medium-scale parallel platforms and on workstation farms.

The prototype system described here is being extended and generalized in several directions. One is the population of the query set with a wider range of phenomena, including oceanographic as well as atmospheric queries. Another is the application of machine learning methods to extract previously unsuspected patterns of interest. A third issue is the scaling of system size onto massively parallel platforms, a necessary ingredient for the system to cope with the terabyte-size datasets that are becoming available. In this regime, scalable I/O considerations are at least as important as those associated with computation per se, and are an active area of research. A final issue is the development of an appropriate field-model language capable of expressing queries based upon large imagery datasets rapidly and efficiently. Preliminary results (Shek and Muntz, private communication) suggest that the overhead introduced by the proposed language is relatively small (roughly 10%) compared to the execution of standalone code. These results are extremely encouraging, as the ability to formulate and add new queries to the system quickly and easily is of paramount importance to its usefulness as a scientific analysis tool.

Acknowledgements

This research was supported by the NASA HPCC program, under grants #NAG 5-2224 and #NAG 5-2225.

References

Knuth, E. and Wagner, L.M. 1992. Visual Database Systems. North Holland.

Chang, S-K. and Hsu, A. 1992. Image Information Systems: Where do we go from here? IEEE Transactions on Knowledge and Data Engineering 4(5):431-442.

Niblack, W. et al. 1993. The QUBIC Project: Querying Images by Content Using Color, Texture, and Shape. IBM Research Division, Research Report #RJ 9203.

Fayyad, U. M.; Smyth, P.; Weir, N.; and Djorgovski, S. 1994. Automated analysis and exploration of large image databases: results, progress, and challenges. Journal of Intelligent Information Systems 4:1-19.

Gupta, A.; Weymouth, T.; and Jain, R. 1991. Semantic Queries with Pictures: The VIMSYS Model. In Proceedings of VLDB, 69-79. Barcelona, Spain.

Guptill, A. and Stonebraker, M. 1992. The Sequoia 2000 Approach to Managing Large Spatial Object Databases. In Proc. 5th Int'l. Symposium on Spatial Data Handling, 642-651. Charleston, S.C.

Mesrobian, E.; Muntz, R. R.; Santos, J. R.; Shek, E. C.; Mechoso, C. R.; Farrara, J. D.; and Stolorz, P. 1994. Extracting Spatio-Temporal Patterns from Geoscience Datasets. In IEEE Workshop on Visualization and Machine Vision. Seattle, Washington: IEEE Computer Society.

Stolorz, P.; Mesrobian, E.; Muntz, R. R.; Santos, J. R.; Shek, E. C.; Mechoso, C. R.; and Farrara, J. D. 1995. Spatio-Temporal Data Mining on MPP's. Submitted to Science Information and Data Compression Workshop, Greenbelt, MD.

Murray, R.J. and Simmonds, I. 1991. A numerical scheme for tracking cyclone centres from digital data. Part I: development and operation of the scheme. Aust. Met. Mag. 39:155-166.

Nakamura, H. and Wallace, J.M. 1990. Observed Changes in Baroclinic Wave Activity during the Life Cycles of Low-frequency Circulation Anomalies. J. Atmos. Sci. 47:1100-1116.

Cheng, X. and Wallace, J.M. 1993. Cluster Analysis of the Northern Hemisphere Wintertime 500-hPa Height Field: Spatial Patterns. J. Atmos. Sci. 50:2674-2696.
