AI2002
Workshop
Proceedings
Data Mining
Edited by
Simeon J. Simoff,
Graham J. Williams and
Markus Hegland
The 15th Australian Joint Conference
on Artificial Intelligence 2002
Rydges Canberra, Australia
2 - 6 December 2002
ADM02
Proceedings
Australasian Data Mining Workshop
3rd December, 2002, Canberra, Australia
Edited by
Simeon J. Simoff, Graham J. Williams and
Markus Hegland
in conjunction with
The 15th Australian Joint Conference
on Artificial Intelligence
Canberra – Australia,
2nd - 6th December, 2002
University of Technology Sydney
2002
© Copyright 2002. The copyright of these papers belongs to the paper's authors.
Permission to copy without fee all or part of this material is granted provided that the
copies are not made or distributed for direct commercial advantage.
Proceedings of the 1st Australasian Data Mining Workshop – ADM02, in conjunction
with the 15th Australian Joint Conference on Artificial Intelligence, 2nd - 6th
December, 2002, Canberra, Australia
S. J. Simoff, G. J. Williams and M. Hegland (eds).
Workshop Web Site:
http://datamining.csiro.au/adm02/
Published by the University of Technology Sydney
ISBN 0-9750075-0-5
Foreword
The Australasian Data Mining Workshop is devoted to the art and science of data mining: the
analysis of (usually large) data sets to discover relationships and present the data in novel
ways that are compact, comprehensible and useful for researchers and practitioners. Data
mining projects involve both the utilisation of established algorithms from machine learning,
statistics, and database systems, and the development of new methods and algorithms,
targeted at large data mining problems. Nowadays data mining efforts have gone beyond
crunching databases of credit card usage or stored transaction records. They have been
focusing on data collected in the health care system, art, design, medicine, biology and other
areas of human endeavour.
There has been increasing interest in data mining across Australian industry, academia and
research institutions and centres, evidenced by the growing number of research groups (e.g.
ANU Data Mining Group, CSIRO Enterprise Data Mining, and UTS Smart eBusiness Systems
Lab) and of academic and industry events (e.g. the data mining seminar series organised by
PricewaterhouseCoopers Actuarial Sydney) related to one or another aspect of data mining.
This workshop aims to bring together researchers and industry practitioners from different
data mining groups in Australia and the region, together with overseas researchers and
practitioners, who are working on the development and application of data mining methods,
techniques and technologies. The workshop is expected to become a forum for presenting and
discussing their latest research and developments in the area. The works selected for
presentation at the workshop are expected to facilitate the cross-disciplinary exchange of
ideas and communication between industry and academia in the area of data mining and its
applications. Consequently, the morning part of the workshop (the sessions on “Practical Data
Mining” and “Applications of Data Mining”) addresses data mining practice. The
afternoon part of the workshop includes sessions on “Data Mining Methods and Algorithms”,
“Spatio-Temporal Data Mining”, and “Data Preprocessing and Supporting Technologies”.
The organisers have also reserved a special presentation session for an overview of on-going
projects.
As part of the Australian Joint Conference on Artificial Intelligence, the workshop follows a
rigorous peer-review and paper selection process. We would like to thank all those who
supported this year’s efforts at all stages, from the development and submission of the
workshop proposal to the preparation of the final program and proceedings. We would also like
to thank all those who submitted their work to the workshop. All papers were extensively
reviewed by two to three referees drawn from the program committee. Special thanks go to
the referees, for the final quality of the selected papers depends on their efforts.
Simeon J. Simoff, Graham J. Williams and Markus Hegland
November 2002
Workshop Chairs
Simeon J. Simoff
University of Technology Sydney, Australia
Graham J. Williams
Enterprise Data Mining, CSIRO, Australia
Markus Hegland
Australian National University, Australia
Program Committee
Sergei Ananyan
Megaputer Intelligence, Russia & USA
Rohan Baxter
Enterprise Data Mining, CSIRO, Australia
John Debenham
University of Technology Sydney, Australia
Vladimir Estivill-Castro
Griffith University, Australia
Eibe Frank
University of Waikato, New Zealand
Paul Kennedy
University of Technology Sydney, Australia
Inna Kolyshkina
PricewaterhouseCoopers Actuarial Sydney, Australia
Kevin Korb
Monash University, Australia
Xuemin Lin
University of NSW, Australia
Warwick Graco
Health Insurance Commission, Australia
Ole Nielsen
Australian National University, Australia
Tom Osborn
NUIX Pty Ltd, and The NTF Group, Australia
Chris Rainsford
Enterprise Data Mining, CSIRO, Australia
John Roddick
Flinders University, Australia
David Skillicorn
Queen's University, Canada
Dan Steinberg
Salford Systems, USA
Program for ADM02 Workshop
Tuesday, 3 December, 2002, Canberra, Australia
9:00 - 9:10
Opening and Welcome
9:10 - 10:30 Session 1 – Practical Data Mining
• 09:10 - 10:00 STOCHASTIC GRADIENT BOOSTING: AN INTRODUCTION TO TreeNet™
Dan Steinberg, Mikhail Golovnya and N. Scott Cardell
• 10:00 - 10:20 CASE STUDY: MODELING RISK IN HEALTH INSURANCE - A DATA MINING
APPROACH
Inna Kolyshkina and Richard Brookes
10:20 - 10:35 Coffee break
10:35 - 12:15 Session 2 – Applications of Data Mining
• 10:35 - 11:00 INVESTIGATIVE PROFILE ANALYSIS WITH COMPUTER FORENSIC LOG DATA
USING ATTRIBUTE GENERALISATION
Tamas Abraham, Ryan Kling and Olivier de Vel
• 11:00 - 11:25 MINING ANTARCTIC SCIENTIFIC DATA: A CASE STUDY
Ben Raymond and Eric J. Woehler
• 11:25 - 11:50 COMBINING DATA MINING AND ARTIFICIAL NEURAL NETWORKS
FOR DECISION SUPPORT
Sérgio Viademonte and Frada Burstein
• 11:50 - 12:15 TOWARDS ANYTIME ANYWHERE DATA MINING E-SERVICES
Shonali Krishnaswamy, Seng Wai Loke and Arkady Zaslavsky
12:15 - 13:00 Lunch
13:00 - 14:00 Session 3 – Data Mining Methods and Algorithms
• 13:00 - 13:20 A HEURISTIC LAZY BAYESIAN RULE ALGORITHM
Zhihai Wang and Geoffrey I. Webb
• 13:20 - 13:40 AVERAGED ONE-DEPENDENCE ESTIMATORS: PRELIMINARY RESULTS
Geoffrey I. Webb, Janice Boughton and Zhihai Wang
• 13:40 - 14:00 SEMIDISCRETE DECOMPOSITION: A BUMP HUNTING TECHNIQUE
S. McConnell and David B. Skillicorn
14:00 - 14:40 Session 4 – Spatio-Temporal Data Mining
• 14:00 - 14:20 AN OVERVIEW OF TEMPORAL DATA MINING
Weiqiang Lin, Mehmet A. Orgun and Graham J. Williams
• 14:20 - 14:40 DISTANCES FOR SPATIO-TEMPORAL CLUSTERING
Mirco Nanni and Dino Pedreschi
14:40 - 14:55 Coffee break
14:55 - 16:10 Session 5 – Data Preprocessing and Supporting Technologies
• 14:55 - 15:20 PROBABILISTIC NAME AND ADDRESS CLEANING AND STANDARDISATION
Peter Christen, Tim Churches and Justin Zhu
• 15:20 - 15:45 BUILDING A DATA MINING QUERY OPTIMIZER
Raj P. Gopalan, Tariq Nuruddin and Yudho Giri Sucahyo
• 15:45 - 16:10 HOW FAST IS -FAST? PERFORMANCE ANALYSIS OF KDD APPLICATIONS USING
HARDWARE PERFORMANCE COUNTERS ON ULTRASPARC-III
Adam Czezowski and Peter Christen
16:10 - 17:00 Session 6 – Project Reports, Discussion and Closure
Table of Contents
Stochastic Gradient Boosting: An Introduction to TreeNet™
Dan Steinberg, Mikhail Golovnya and N. Scott Cardell
Case Study: Modeling Risk in Health Insurance - A Data Mining Approach
Inna Kolyshkina and Richard Brookes
Investigative Profile Analysis With Computer Forensic Log Data Using Attribute Generalisation
Tamas Abraham, Ryan Kling and Olivier de Vel
Mining Antarctic Scientific Data: A Case Study
Ben Raymond and Eric J. Woehler
Combining Data Mining And Artificial Neural Networks For Decision Support
Sérgio Viademonte and Frada Burstein
Towards Anytime Anywhere Data Mining e-Services
Shonali Krishnaswamy, Seng Wai Loke and Arkady Zaslavsky
A Heuristic Lazy Bayesian Rule Algorithm
Zhihai Wang and Geoffrey I. Webb
Averaged One-Dependence Estimators: Preliminary Results
Geoffrey I. Webb, Janice Boughton and Zhihai Wang
Semidiscrete Decomposition: A Bump Hunting Technique
Sabine McConnell and David B. Skillicorn
An Overview Of Temporal Data Mining
Weiqiang Lin, Mehmet A. Orgun and Graham J. Williams
Distances For Spatio-Temporal Clustering
Mirco Nanni and Dino Pedreschi
Probabilistic Name and Address Cleaning and Standardisation
Peter Christen, Tim Churches and Justin Zhu
Building A Data Mining Query Optimizer
Raj P. Gopalan, Tariq Nuruddin and Yudho Giri Sucahyo
How Fast Is -Fast? Performance Analysis of KDD Applications Using Hardware Performance Counters on UltraSPARC-III
Adam Czezowski and Peter Christen
Author Index
Stochastic Gradient Boosting: An Introduction to TreeNet™
Dan Steinberg, Mikhail Golovnya, N. Scott Cardell
Salford Systems
http://www.salford-systems.com
dstein@salford-systems.com
© Copyright Salford Systems 2001-2002

Introduction to Stochastic Gradient Boosting
• New approach to machine learning and function approximation developed by Jerome H. Friedman at Stanford University
    • Co-author of CART® with Breiman, Olshen and Stone
    • Author of MARS™, PRIM, Projection Pursuit
• Good for classification and regression problems
• Builds on the notions of committees of experts and boosting but is substantially different in key implementation details

Stochastic Gradient Boosting: Key Innovations -1
• Stagewise function approximation in which each stage models residuals from last step model
    • Conventional boosting models use the original target at each stage
• Each stage uses a very small tree, as small as two nodes and typically in the range of 4-8 nodes
    • Conventional bagging and boosting use full size trees
    • Bagging works best with massively large trees (1 case in each terminal node)
• Each stage learns from a fraction of the available training data, typically less than 50% to start and often falling to 20% or less by the last stage

Benefits of TreeNet
• Built on CART trees and thus
    • immune to outliers
    • handles missing values automatically
    • selects variables
    • results invariant wrt monotone transformations of variables
• Resistant to mislabeled target data
    • In medicine cases are commonly misdiagnosed
    • In business, non-responders are occasionally flagged as “responders”
• Resistant to overtraining – generalizes well
• Can be remarkably accurate with little effort
• Trains rapidly; at least as fast as CART

Stochastic Gradient Boosting: Key Innovations -2
• Each stage learns only a little: severely downweighted contribution of each new tree (learning rate is typically 0.10, even 0.01 or less)
    • How much is learned in each stage compared to a single tree
• In classification, focus is on points near decision boundary; ignores points far away from boundary even if the points are on the wrong side
• If we do very badly on certain observations we ignore them
    • Unlike boosting which would upweight such points
    • Explains why boosting is vulnerable to mislabeled data

The Australasian Data Mining Workshop

Combining Trees into “Committees of Experts”
• The idea that combining good methods could yield promising results was first suggested by researchers a decade ago
• In tree-structured analysis, suggestions made by:
    • Wray Buntine (1991, Bayes style allows cases to go down several tree paths)
    • Kwok and Carter (1990, split nodes several different ways to get alternate trees)
    • Heath, Kasif and Salzberg (1993, split nodes several different ways using different linear combination splitters)
• More recent work introduced concepts of bootstrap aggregation (“bagging”), adaptive resampling and combining (“arcing”) and boosting
    • Breiman (1994, 1996, multiple independent trees via sampling with replacement)
    • Breiman (1996, multiple trees with adaptive reweighting of training data)
    • Freund and Schapire (1996, multiple trees with adaptive reweighting of training data)

Trees Can be Combined By Voting or Averaging
• Trees combined via voting (classification) or averaging (regression)
• Classification trees “vote”
    • Recall that classification trees classify: assign each case to ONE class only
    • With 50 trees, 50 class assignments for each case
    • Winner is the class with the most votes
    • Votes could be weighted – say by accuracy of individual trees
• Regression trees assign a real predicted value for each case
    • Predictions are combined via averaging
    • Results will be much smoother than from a single tree

Bootstrap Resampling Effectively Reweights Training Data (Randomly and Independently)
• Probability of being omitted in a single draw is \( (1 - 1/n) \)
• Probability of being omitted in all \( n \) draws is \( (1 - 1/n)^n \)
• Limit of series as \( n \) increases is \( 1/e = 0.368 \)

Share of original sample                               Share of resample
approximately 36.8% of sample excluded                 0%
36.8% of sample included once                          36.8%
18.4% of sample included twice, thus representing      36.8%
6.1% of sample included three times                    18.4%
1.9% of sample included four or more times             8%
Total                                                  100%

Example: distribution of weights in a 2,000 record resample:

Weight    Cases    Fraction
0         732      0.366
1         749      0.375
2         359      0.179
3         119      0.06
4         32       0.016
5         6        0.003
6         3        0.002

Statlog Data Set Summary

Data Set     # Training    # Variables    # Classes    # Test Set
Letters      15,000        16             26           5,000
Satellite    4,435         36             6            2,000
Shuttle      43,500        9              7            14,500
DNA          2,000         60             3            6,186

Bootstrap Aggregation Performance Gains
Test Set Misclassification Rate (%)

Data Set     1 Tree    Bag      Decrease
Letters      12.6      6.4      49%
Satellite    14.8      10.3     30%
Shuttle      0.062     0.014    77%
DNA          6.2       5.0      19%

Adaptive Resampling and Combining (ARCing, a Variant of Boosting)
• Bagging proceeds by independent, identically-distributed resampling draws
• Adaptive resampling: probability that a case is sampled varies dynamically
    • Starts with all cases having equal probability
    • After first tree is grown, weight is increased on all misclassified cases
    • For regression, weight increases with prediction error for that case
• Idea is to focus tree on those cases most difficult to predict correctly

ARCing reweights the training data
• Similar procedure first introduced by Freund & Schapire (1996)
• Breiman variant (ARC-x4) is easier to understand:
    • Suppose we have already grown K trees: let \( m(j) \) = # times case \( j \) was misclassified \( (0 \le m(j) \le K) \)
    • Define \( w(j) = 1 + m(j)^4 \)
    • \[ \text{Prob(sample inclusion)} = \frac{w(j)}{\sum_{i=1}^{M} w(i)} \]
    • Weight = 1 for cases with zero occurrences of misclassification
    • Weight = \( 1 + K^4 \) for cases with K misclassifications
    • Rapidly becomes large if case is difficult to classify
• Samples will tend to be increasingly dominated by misclassified cases

Problems with Boosting
• Boosting in general is vulnerable to overtraining
    • Much better fit on training than on test data
    • Tendency to perform poorly on future data
• Boosting highly vulnerable to errors in the data
    • Technique designed to obsess over errors
    • Will keep trying to “learn” patterns to predict miscoded data
• Documented in study by Dietterich (1998)
    • An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization
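
The bootstrap figures quoted on the resampling slide can be checked with a short simulation. The following is a minimal sketch using only the Python standard library; the function names and the seed are illustrative assumptions, not part of the original presentation.

```python
import random
from collections import Counter


def omission_probability(n: int) -> float:
    """Probability that one given case is never drawn in n draws with replacement."""
    return (1.0 - 1.0 / n) ** n


def resample_weight_counts(n: int, seed: int = 0) -> Counter:
    """Draw one bootstrap resample of size n and count the cases per weight.

    The weight of a case is the number of times it appears in the resample;
    weight 0 means the case was left out entirely.
    """
    rng = random.Random(seed)
    times_drawn = Counter(rng.randrange(n) for _ in range(n))
    counts = Counter(times_drawn.values())
    counts[0] = n - len(times_drawn)  # cases that were never drawn
    return counts


if __name__ == "__main__":
    print(round(omission_probability(2000), 4))
    for weight, cases in sorted(resample_weight_counts(2000).items()):
        print(weight, cases, cases / 2000)
```

A single simulated resample will not reproduce the slide's counts exactly, but the share of omitted cases should sit close to 1/e.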
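
The ARC-x4 weighting rule can be sketched as follows, assuming the misclassification counts m(j) are already available as a list. The function name is hypothetical; only the formula w(j) = 1 + m(j)^4 and the normalisation come from the slide.

```python
def arc_x4_probabilities(miscounts):
    """Sampling probabilities under the ARC-x4 rule.

    miscounts[j] is m(j), the number of trees grown so far that
    misclassified case j. Each case gets weight 1 + m(j)**4 and the
    weights are normalised to sum to one.
    """
    weights = [1 + m ** 4 for m in miscounts]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Four cases misclassified 0, 1, 2 and 3 times respectively.
    print(arc_x4_probabilities([0, 1, 2, 3]))
```

With these counts the weights are 1, 2, 17 and 82, so the case missed three times is 82 times more likely to be drawn than the case that was never missed. This illustrates how quickly samples become dominated by difficult cases.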

Stochastic Gradient Boosting
• Building on multiple tree ideas and adaptive learning
• Goal of avoiding shortcomings of standard boosting
• Placed in the context of function approximation
• Trees combined by adding them (adding scores)
• Friedman calls it Multiple Additive Regression Trees (MART)
• Salford calls it TreeNet™

Function Approximation By a Series of Trees
• Our approximation to any function can be written as
  \[ F(X) = F_0 + \beta_1 T_1(X) + \beta_2 T_2(X) + \dots + \beta_M T_M(X) \]
• Where \( F_0 \) is the initial guess, usually what we would use in the absence of any model (e.g. mean, median, etc.)
• The approximation is built up stagewise
• Once a stage is added it is never revised or refit
• Each stage added by assessing model and attempting to improve its quality by, for example, reducing residuals
• Each stage is a “weak learner” – a small tree

Function Approximation By a Series of Error Corrections
• Consider Boston Housing data set
• Average neighborhood home value is $22,533
• Start model F(x) with this mean and construct residuals
• Model residuals with two-node tree
    • This is just an error correction based on one dimension of data
    • Model will attempt to separate positive from negative residuals
• Now update model, obtain new residuals and repeat process
• Estimated function will look something like this:
  \[ F(X) = F_0 + \beta_1 T_1(X) + \beta_2 T_2(X) + \dots + \beta_M T_M(X) \]

Function Approximation By a Series of Adjustments
• Function is built up through a series of adjustments or considerations
• Each adjustment adds (or subtracts) something from the current estimate of function value
• When we know nothing our home value prediction is the mean
• Then we take number of rooms into account and adjust upwards for larger houses and downwards for smaller houses
• Then we take socioeconomic status of residents into account and again adjust up or down
• Continue taking further factors into account until an optimal model is built
• Similar to building up a score from a checklist of important factors (get points for certain characteristics, lose points for others)

Two-node adjusting trees create main effects-only models
• Consider again the Boston Housing data set model
• Each tree involves only one variable
• Each contribution of any one tree not dependent on which branch a case terminates in any other tree
• High LSTAT reduces estimated home values by same amount regardless of number of rooms in house

Adjusting Trees Can be Any Size
• Two-node, three-node, and larger trees can be used
• Friedman finds that six-node trees generally work well
• A tree with more than two nodes still adjusts the existing model
• May take several variables into account simultaneously
• Each tree just partitions data into subsets
• Each subset gets a separate adjustment
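
The main-effects-only property of two-node adjusting trees can be illustrated with a small sketch. The starting value 22.5 and the split points 6.8 and 14.3 are taken from the slides' Boston Housing example, but the adjustment amounts below are hypothetical values chosen only for illustration.

```python
F0 = 22.5  # starting model: mean home value (in $1000s), as on the slides


def stump(value, threshold, below, above):
    """A two-node tree: one adjustment below the threshold, another at or above it."""
    return below if value < threshold else above


def predict(rm, lstat):
    """F(X) = F0 + T1(RM) + T2(LSTAT); adjustment amounts are made up."""
    return F0 + stump(rm, 6.8, -2.0, 8.0) + stump(lstat, 14.3, 1.5, -5.0)


if __name__ == "__main__":
    # Effect of moving from low to high LSTAT, for a small and a large house.
    print(predict(6.0, 20.0) - predict(6.0, 10.0))
    print(predict(7.5, 20.0) - predict(7.5, 10.0))
```

Because each tree uses a single variable, the LSTAT adjustment is the same whatever the number of rooms. A tree that split on both variables would not have this property.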

TreeNet Model with Three-Node Trees
[Figure: MV = 22.5 plus three three-node trees. Split conditions shown: RM<6.8, LSTAT<14.3, RM<6.8, CRIM<8.2, LSTAT>5.1, RM<7.4. Terminal-node adjustments shown: +0.4, -8.4, +13.7, +0.2, -0.3, -5.2, +8.4, -4.4, +3.2.]
• Each tree has three terminal nodes, thus partitioning data at each stage into three segments

Rationale for Additive Trees
• Want to provide this style of function approximation with some theoretical justification
• Need to specify many details:
    • How to choose tree size to use
    • How many forward steps to take
    • How to identify optimal model
    • How to interpret model and results
    • How much to adjust at a step
• Need to describe practical performance
    • Comparison with conventional boosting and single trees

Predictive Modeling and Function Approximation
GIVEN
    Y          Output or Response Variable
    X          Inputs or Predictors
    L(Y, F)    Loss Function
ESTIMATE
    \[ F^*(X) = \arg\min_{F(X)} E_{Y,X}\left[ L(Y, F(X)) \right] \]

Classical Function Approximation
• Specify a functional form for F(x), known up to a set of parameters B
• Learn by fitting F*(x) to data, minimizing loss measure L
• Achieved by iterative search procedure in which B is adjusted with reference to gradient \( (\partial L / \partial F)(\partial F / \partial B) \)
• Final result is obtained by adding together a series of parameter changes guided by gradient at an iteration
• Think of this as a gradual form of learning from the data

Nonparametric Function Approximation
• General non-parametric case: F(X) is treated as having a separate parameter for each distinct combination of predictors X
• With infinite data best estimate of F(X) under quadratic loss at any specific data vector \( X_i \) would be
  \[ F^*(X_i) = \frac{1}{N_{X_i}} \sum_{j:\, X_j = X_i} y_j \]
• With plentiful data accurate estimates of F(X) can be obtained for any X
• But we only have finite data so
    • most possible X vectors not represented in the data
    • lack of replicates means inaccurate estimates at any X

General Optimization Strategy for Function Approximation
• Make an initial guess \( \{F_0(X_i)\} \) – for example, assuming that all \( F_0(X_i) \) are the same for all \( X_i \)
• Compute the negative gradient at each observed data point i
  \[ \vec{g} = -\left\{ \frac{\partial \hat{L}}{\partial F(X_i)} \right\}_{i=1}^{N} \]
• The negative gradient gives us the direction of the steepest descent
• Take a step in the steepest descent direction
• Direct optimization in N free parameters will result in a dramatic overfitting
    • will somehow have to limit the total number of free parameters
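
The nonparametric estimate, the average of the responses over all records sharing exactly the same predictor vector, can be sketched as below. This is an illustrative helper with a hypothetical name; predictor vectors are assumed to be hashable tuples.

```python
from collections import defaultdict


def replicate_means(xs, ys):
    """Estimate F*(X_i) as the mean response over records with identical predictors."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return {x: sum(values) / len(values) for x, values in groups.items()}


if __name__ == "__main__":
    xs = [(1,), (1,), (2,)]
    ys = [10.0, 20.0, 7.0]
    print(replicate_means(xs, ys))
```

When a predictor vector occurs only once, as (2,) does here, the estimate simply reproduces that record's response. This is the lack-of-replicates problem the slide refers to.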

Guarding Against Overfit In the Non-parametric Case
• Literal steepest descent is inadvisable as it would allow free adjustment of one parameter for each data point
• Instead, limit the number of free parameters that can be adjusted to a small number, say L
• Can do this by partitioning data into L mutually exclusive groups making a common adjustment within each group
• The challenge is to find a good partitioning of data into L mutually exclusive groups

Identifying Common Gradient Partitions with Regression Trees
• Our goal is to group observations with similar gradients together so that a common adjustment can be made to the model for each group
• Build an L-node regression tree with the target being the negative gradient
• Within each group gradients should be similar

Generic Gradient Boosting Algorithm
For the given estimate of LOSS, and iterations M
1. Choose start value \( \{F(X_i)\} = \{F_0(X_i)\} \) (e.g. mean, for all data)
2. FOR m = 1 TO M
3.     Compute \( g_m \), the derivative of the expected loss with respect to \( F(X_i) \) evaluated at \( F_{m-1}(X_i) \) (e.g. residual, deviance)
4.     Fit an L-node regression tree to the components of the negative gradient: this will partition observations into L mutually exclusive groups
5.     Find the within-node update \( h_m(X_i) \), adjusting each node separately: conventional model updating
6.     Update: \( \{F_m(X_i)\} = \{F_{m-1}(X_i)\} + h_m(X_i) \)
7. END

Gradient Boosting for Least Squares Loss
\[ \hat{L}(\{F(X_i)\}) = \frac{1}{N} \sum_{i=1}^{N} \left( Y_i - F(X_i) \right)^2 \]
1. Initial guess \( \{F_0(X_i)\} = \{\mathrm{ave}(Y_i)\} \)
2. FOR m = 1 TO M
3.     \( g_m \sim \{Y_i - F_{m-1}(X_i)\} = \{\mathrm{Residual}_i\} \)
4.     Fit an L-node regression tree to the current residuals: this will partition observations into L mutually exclusive groups
5.     For each given node: \( h_m(X_i) = \text{node-ave}(\mathrm{Residual}_i) \)
6.     Update: \( \{F_m(X_i)\} = \{F_{m-1}(X_i)\} + h_m(X_i) \)
7. END

Gradient Boosting for Least Absolute Loss
\[ \hat{L}(\{F(X_i)\}) = \frac{1}{N} \sum_{i=1}^{N} \left| Y_i - F(X_i) \right| \]
1. Initial guess \( \{F_0(X_i)\} = \{\mathrm{median}(Y_i)\} \)
2. FOR m = 1 TO M
3.     \( g_m \sim \{\mathrm{sign}(Y_i - F_{m-1}(X_i))\} = \{\mathrm{sign}(\mathrm{Residual}_i)\} \)
4.     Fit an L-node regression tree to the signs of the current residuals (+1, -1): this will partition observations into L mutually exclusive groups
5.     For each given node: \( h_m(X_i) = \text{node-median}(\mathrm{Residual}_i) \)
6.     Update: \( \{F_m(X_i)\} = \{F_{m-1}(X_i)\} + h_m(X_i) \)
7. END

Gradient Boosting for Classification: Binary Response
• In the case of binary response, the negative log-likelihood function is used in place of the loss function
• Friedman codes Y as {+1, -1} with conditional probabilities
  \[ P(y \mid X) = \frac{1}{1 + e^{-y F(X)}}, \quad y \in \{-1, +1\} \]
• Here \( F(X) = \log \frac{P(Y = 1 \mid X)}{P(Y = -1 \mid X)} \) – log-odds ratio at X
• F(X) can range from -infinity to +infinity
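
The least-squares and least-absolute-loss procedures can be sketched in a few lines for the simplest case of a single predictor and two-node trees (L = 2). This is a minimal illustration under those assumptions, not TreeNet's implementation. The function names are hypothetical and the exhaustive threshold search stands in for a real regression tree.

```python
from statistics import mean, median


def fit_stump(x, target):
    """Two-node regression tree on one predictor.

    Returns the threshold t (cases with x < t go left) that minimises the
    squared error of `target` around the two node means.
    """
    best = None
    for t in sorted(set(x))[1:]:
        left = [g for xi, g in zip(x, target) if xi < t]
        right = [g for xi, g in zip(x, target) if xi >= t]
        ml, mr = mean(left), mean(right)
        sse = sum((g - ml) ** 2 for g in left) + sum((g - mr) ** 2 for g in right)
        if best is None or sse < best[0]:
            best = (sse, t)
    return best[1]


def boost(x, y, n_stages, loss="ls"):
    """Gradient boosting with stumps for 'ls' (least squares) or 'lad' loss."""
    centre = mean if loss == "ls" else median
    f0 = centre(y)                      # step 1: initial guess
    pred = [f0] * len(y)
    stages = []
    for _ in range(n_stages):           # step 2: FOR m = 1 TO M
        resid = [yi - pi for yi, pi in zip(y, pred)]
        # step 3: negative gradient (residual, or its sign for LAD)
        grad = resid if loss == "ls" else [(r > 0) - (r < 0) for r in resid]
        t = fit_stump(x, grad)          # step 4: partition the data
        # step 5: within-node update from the residuals in each node
        h_left = centre([r for xi, r in zip(x, resid) if xi < t])
        h_right = centre([r for xi, r in zip(x, resid) if xi >= t])
        # step 6: update the model
        pred = [p + (h_left if xi < t else h_right) for xi, p in zip(x, pred)]
        stages.append((t, h_left, h_right))
    return f0, stages


def predict(model, xi):
    """Evaluate F(x) = F0 plus the sum of the stagewise adjustments."""
    f0, stages = model
    return f0 + sum(hl if xi < t else hr for t, hl, hr in stages)


if __name__ == "__main__":
    model = boost([1, 2, 3, 4], [1.0, 1.0, 5.0, 5.0], n_stages=1)
    print(model, predict(model, 1), predict(model, 4))
```

Note that the tree is fitted to the gradient, while the node adjustments are computed from the residuals in a separate step. For least squares the two coincide.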

TreeNet and Binary Response
\[ L(\{F(X_i)\}) = \sum_{i=1}^{N} \log\left( 1 + e^{-y_i F(X_i)} \right) \]
1. Initial guess \( F_0(X) = \log \frac{1 + \bar{y}}{1 - \bar{y}} \)
2. FOR m = 1 TO M
3.     \( g_m \sim \left\{ \frac{y_i}{1 + e^{y_i F_{m-1}(X_i)}} \right\} = \{\tilde{y}_i\} \)
4.     Fit an L-node regression tree to the “residuals” (see the next slide) computed above: this will partition observations into L mutually exclusive groups
5.     For each node \( h_m(X_i) = \frac{\sum_{\text{node}} \tilde{y}_i}{\sum_{\text{node}} |\tilde{y}_i| (1 - |\tilde{y}_i|)} \)
6.     Update: \( \{F_m(X_i)\} = \{F_{m-1}(X_i)\} + h_m(X_i) \)
7. END FOR

Interpretation
• Put Y=1 in focus and call p the probability that Y=1
• Then \( p_i = \frac{1}{1 + e^{-F(X_i)}} \)
• Initial guess = Log[overall resp. rate / (1 – overall resp. rate)]
• “Residual”: \( \tilde{y}_i = 1 - p_i \) if \( y_i = 1 \), and \( \tilde{y}_i = -p_i \) otherwise
• Update \( h_m(X_i) \) = (Node resp. rate – Ave. node(p)) / Var, where
  \[ \text{Ave. node}(p) = \frac{\sum_{\text{node}} p_i}{N_{\text{node}}}, \qquad \text{Var} = \frac{\sum_{\text{node}} p_i (1 - p_i)}{N_{\text{node}}} \]
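
The quantities on the Interpretation slide translate directly into code. The following is a minimal sketch, assuming the target is coded as +1/-1 and the current scores F(X_i) are available as a list. All function names are illustrative.

```python
import math


def prob(f):
    """p = 1 / (1 + exp(-F)): probability that Y = 1 given score F."""
    return 1.0 / (1.0 + math.exp(-f))


def initial_guess(y):
    """Log-odds of the overall response rate."""
    rate = sum(1 for yi in y if yi == 1) / len(y)
    return math.log(rate / (1.0 - rate))


def residuals(y, f):
    """The 'residual': 1 - p for responders, -p for non-responders."""
    return [(1.0 - prob(fi)) if yi == 1 else -prob(fi) for yi, fi in zip(y, f)]


def node_update(y_node, f_node):
    """(Node response rate - average node p) / Var for one terminal node."""
    p = [prob(fi) for fi in f_node]
    resp_rate = sum(1 for yi in y_node if yi == 1) / len(y_node)
    ave_p = sum(p) / len(p)
    var = sum(pi * (1.0 - pi) for pi in p) / len(p)
    return (resp_rate - ave_p) / var


if __name__ == "__main__":
    y = [1, 1, -1, -1]
    f0 = initial_guess(y)
    print(f0, residuals(y, [f0] * 4), node_update([1, 1], [f0, f0]))
```

With a balanced sample the initial score is zero and every p is 0.5, so a node containing only responders receives an update of (1 - 0.5) / 0.25 = 2.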

A Note on Mechanics
• The tree is grown to group observations into homogeneous subsets
• Once we have the right partition our update quantities for each terminal node are computed in a separate step
• The update is not necessarily taken from the tree predictions
• Important notion: tree is used to define a structure based on the split variables and split points
• What we do with this partition may have nothing to do with the usual predictions generated by the trees

Slowing the learn rate: “Shrinkage”
• Up to this point we have guarded against overfitting by reducing the number of free parameters to be optimized
• It is beneficial to slow down the learning rate by introducing the shrinkage parameter \( 0 < \nu < 1 \) into the update step:
  \[ \{F_m(X_i)\} = \{F_{m-1}(X_i)\} + \nu\, h_m(X_i) \]
• With a group of correlated variables, only one variable in the group might enter the model with \( \nu = 1 \), whereas with \( \nu < 1 \) several variables in the group may enter at the later steps
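
The effect of the shrinkage parameter on the update step can be seen in a deliberately tiny setting. The sketch below assumes a model with a single node (no splits) and a starting value of zero, so that each stage's update is just the mean residual. The function name and data are illustrative.

```python
def shrunken_fit(y, learn_rate, n_stages):
    """Apply F_m = F_{m-1} + learn_rate * h_m, with h_m the mean residual.

    Returns the value of F after each stage.
    """
    f = 0.0
    path = []
    for _ in range(n_stages):
        h = sum(yi - f for yi in y) / len(y)  # least-squares update for one node
        f += learn_rate * h
        path.append(f)
    return path


if __name__ == "__main__":
    print(shrunken_fit([10.0, 20.0], learn_rate=1.0, n_stages=3))
    print(shrunken_fit([10.0, 20.0], learn_rate=0.1, n_stages=3))
```

With a learn rate of 1 the model jumps to the mean in a single stage. With 0.1 each stage closes only a tenth of the remaining gap, leaving room for later stages to contribute.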

TreeNet Three-Node Trees Model: Learn Rate=1
[Figure: MV = 22.5 plus three three-node trees. Split conditions shown: RM<6.8, LSTAT<14.3, RM<6.8, CRIM<8.2, LSTAT>5.1, RM<7.4. Terminal-node adjustments shown: +0.4, -8.4, +13.7, +0.2, -0.3, -5.2, +8.4, -4.4, +3.2.]

TreeNet Three-Node Trees Model: Learn Rate=0.1
[Figure: MV = 22.5 plus three three-node trees. Split conditions shown: RM<6.8, LSTAT<14.3, RM<7.4, RM>6.8, RM<6.8, LSTAT<14.8. Terminal-node adjustments shown: -0.8, +1.4, +0.04, +0.7, +2.1, -0.3, +0.02, -0.8, +1.1.]
• Adjustments are smaller and evolution of model differs

Stochastic Training Data
• A further enhancement in performance is obtained by not allowing the learner to have access to all the training data at any one time
    • JHF recommends 50% random sampling rate at any one iteration
• No a priori limit on the number of iterations so there is always plenty of opportunity to learn from all the data eventually
• By limiting the amount of data at any one iteration we reduce the probability that an erroneous data point will gain influence over the learning process
• In complete contrast to standard boosting in which problem data points are “locked onto” with steadily growing weight and influence

Ignoring data far from the decision boundary in classification problems
• A further reduction in training data actually processed in any update occurs in classification problems
• We ignore data points “too far” from the decision boundary to be usefully considered
    • Correctly classified points are ignored (as in conventional boosting)
    • Badly misclassified data points are also ignored (very different from conventional boosting)
    • The focus is on the cases most difficult to classify correctly: those near the decision boundary

Decision Boundary Diagram
[Figure: 2-dimensional predictor space]
• Red dots represent cases with +1 target
• Green dots represent cases with –1 target
• Black curve represents the decision boundary

A Simple TreeNet Run
• Stop after the first tree
• No shrinkage
• Use 2-node trees only
• Least-Squares LOSS
• BOSTON HOUSING DATA: Target is MV (median neighborhood home value)
• One Predictor: LSTAT (% residents low socio-economic status)

Scatter Plot: MV vs. LSTAT
[Figure: scatter plot of MV (vertical axis, 0 to 50) against LSTAT (horizontal axis, 0 to 40)]

TreeNet Predicted Response
• One-step model
• A regression tree with 2 terminal nodes
[Figure: step function with a cut-off at LSTAT = 9.755. Good Neighborhood: RESP = 29.667. Bad Neighborhood: RESP = 17.465.]
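
The two data-reduction ideas, random subsampling at each iteration and ignoring cases far from the decision boundary, might be combined along the following lines. This is a sketch under stated assumptions: the cut-off `max_abs_score`, the order of the two steps and the function name are all hypothetical and are not taken from TreeNet.

```python
import random


def training_subset(scores, sample_rate=0.5, max_abs_score=2.0, seed=0):
    """Choose the case indices used for one boosting iteration.

    Cases whose current score F(X_i) is far from zero are dropped, whether
    they are classified correctly or badly misclassified. A random fraction
    of the remaining cases is then sampled without replacement.
    """
    rng = random.Random(seed)
    near = [i for i, f in enumerate(scores) if abs(f) <= max_abs_score]
    if not near:
        return []
    k = min(len(near), max(1, round(sample_rate * len(near))))
    return sorted(rng.sample(near, k))


if __name__ == "__main__":
    scores = [-5.0, -1.0, -0.2, 0.3, 1.5, 6.0]
    print(training_subset(scores))
```

Because the score is a log-odds value, a large absolute score means the model is already very sure about the case. A badly mislabeled record therefore drops out of the update instead of attracting ever more weight.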

Identical results from CART Model
• CART run with
    TARGET=MV
    PREDICTORS=LSTAT
    LIMIT DEPTH=1
• Save residuals as RES1
• LSTAT < 9.755: RESP = 29.667
• LSTAT > 9.755: RESP = 17.465

TreeNet Model with two 2-node Trees
• LSTAT < 4.475: RESP = 41.097
• 4.475 < LSTAT < 9.755: RESP = 28.684
• LSTAT > 9.755: RESP = 16.482
• Similar to a regression tree with 3 terminal nodes
• LSTAT is only predictor

Equivalent Two-stage CART Run
• CART run with
    TARGET=RES1 (First Run Residuals)
    PREDICTORS=LSTAT
    LIMIT DEPTH=1
• Save residuals as RES2
• LSTAT < 4.475: RESP = 11.430
• LSTAT > 4.475: RESP = -0.983
• These are within-node adjustments to the 1st run RESPONSE

Computing Response -1
1st CART Run produced:
    IF LSTAT < 9.755 THEN RESP1 = 29.667
    IF LSTAT > 9.755 THEN RESP1 = 17.465
2nd CART Run produced:
    IF LSTAT < 4.475 THEN ADJUST = 11.430
    IF LSTAT > 4.475 THEN ADJUST = -0.983
Combining two CART runs:
    IF LSTAT < 4.475 THEN RESP2 = 29.667 + 11.430 = 41.097
    IF 4.475 < LSTAT < 9.755 THEN RESP2 = 29.667 - 0.983 = 28.684
    IF LSTAT > 9.755 THEN RESP2 = 17.465 - 0.983 = 16.482
This is exactly what was reported by TreeNet
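
The combination of the two CART runs can be written out as code, using the cut-offs and node values reported on the slides. How a case exactly on a cut-off is handled is an assumption here, since the slides only give strict inequalities.

```python
def resp1(lstat):
    """First run: two-node tree on the original target MV."""
    return 29.667 if lstat < 9.755 else 17.465


def adjust(lstat):
    """Second run: two-node tree on the first-run residuals RES1."""
    return 11.430 if lstat < 4.475 else -0.983


def resp2(lstat):
    """Two-stage response: first-run prediction plus second-run adjustment."""
    return resp1(lstat) + adjust(lstat)


if __name__ == "__main__":
    for lstat in (3.0, 7.0, 12.0):
        print(lstat, round(resp2(lstat), 3))
```

The two cut-offs jointly create three regions, which is why the two-tree model resembles a regression tree with three terminal nodes.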

Computing Response -2
• This process can be schematically shown as
[Figure: schematic of the two-stage process]
• Each tree in the sequence can be grown on the entire training data set. Unlike a decision tree we do not lose sample size as the learning progresses

TreeNet Run with 3 Trees
[Figure: predicted response against LSTAT after three trees]
• These cut-offs came from 1st and 2nd trees
• This new cut-off is due to the 3rd tree
• Still just one predictor in model
• Now we obtain 4 regions
Optimal Tree is identified by reference
to test data performance
TreeNet Runs with 4 and 9 Trees
A Treenet model can be evolved indefinitely
All model results refer by default to performance on test
data
© Copyright Salford Systems 2001-2002
Want to be able to pick the “right-sized” model
Although resistant to overfitting the model can overfit drastically
in smaller data sets
Require independent test sample
Cross-validation methods not available (yet)
For practical real time scoring may also want to select an
overly small model
© Copyright Salford Systems 2001-2002
TreeNet Run with 20 Trees
TreeNet Run with 200 Trees
Even though the optimal
model is based on 200
trees, the learning actually
stopped here
Optimal model is
based on 15 trees
© Copyright Salford Systems 2001-2002
© Copyright Salford Systems 2001-2002
First 15 Runs (No Shrinkage)
First 15 Runs (Shrinkage at .2)
Optimal model
after 15 cycles
is too bumpy
Optimal model
after 15 cycles
is smoother
Starting Model
(mean of MV)
Starting Model
(mean of MV)
Interactions and TreeNet
TreeNet Run with 2 Predictors and 2 Trees
At the m-th cycle, the TreeNet model can be represented by the following formula:
$F_m(x) = \sum_{i=1}^{m} h(x; a_i)$
Here $h(x; a_i)$ stands for the individual tree at cycle i.
It now becomes clear that the order of interactions only depends on the complexity of the individual terms in the sum above, therefore:
“Stumps” (each tree has only one split based on a single variable) always result in an additive model
Trees with L terminal nodes may allow up to L-1 interactions
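A small sketch can make the additivity claim concrete. It compares a sum of two stumps with a single tree that splits on a second variable inside one branch. The cut-offs and node values below are made up for illustration only, and the helper names are hypothetical. The check is whether the effect of changing LSTAT depends on the value of RM.

```python
def stump(var, cutoff, below, above):
    """One-split tree on a single variable of the record x (a dict)."""
    return lambda x: below if x[var] < cutoff else above

# Sum of two stumps, each on a different variable (illustrative values).
rm_stump = stump("RM", 6.9, 20.0, 37.0)
lstat_stump = stump("LSTAT", 9.7, 3.0, -5.0)

def additive_model(x):
    return rm_stump(x) + lstat_stump(x)

def interaction_tree(x):
    """A single 3-node tree: LSTAT is only used when RM is small."""
    if x["RM"] < 6.9:
        return 23.0 if x["LSTAT"] < 14.4 else 15.0
    return 37.0

def lstat_effect(model, rm):
    """Change in prediction when LSTAT moves from 5 to 20, holding RM fixed."""
    return model({"RM": rm, "LSTAT": 5.0}) - model({"RM": rm, "LSTAT": 20.0})

for rm in (6.0, 7.5):
    print(rm, lstat_effect(additive_model, rm), lstat_effect(interaction_tree, rm))
```

For the sum of stumps the LSTAT effect is the same at both RM values, which is what additivity means. For the single tree the effect depends on RM, so the two variables interact.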
CART Run with 4 Nodes
Stumps Produces Additive Model
Jointly, 4 different regions are created:
Region                             MV      #OBS
Small houses, bad neighborhood     14.43   163
Small houses, good neighborhood    22.33   256
Large houses, bad neighborhood     30.81   4
Large houses, good neighborhood    38.68   83
First tree uses RM (Number of Rooms)
Second tree uses LSTAT to update residuals
The first split is the same as TreeNet
But these two splits are
different => the model is
no longer additive,
RM and LSTAT interact
Conclusion:
CART model
builds interactions
This is an additive model:
CART Run with 4 Nodes
Using All 13 Available Predictors
Again, 4 different regions are created:
Region                             MV      #OBS
Small houses, bad neighborhood     14.98   174
Small houses, good neighborhood    23.12   245
Large houses, bad neighborhood     30.98   41
Large houses, good neighborhood    41.21   46
The accuracy has increased nearly 2 times
The model is
quite large
Similar conclusions, but model is no longer additive
Completely different counts but the sums within the
first and the second pairs are the same as before
Increase the Base Tree Size
Reduce the Learning Rate to .5
Now using
5-node trees
Smaller Model
Moderate Overfit
Better Accuracy
Dramatic overfit
Same accuracy
Smaller model
Reduce the Learning Rate to .1
Classification Example
Larger Model
Small Overfit
Smooth curves
CELL Phone Data
RESPONSE: YES/NO to subscribe (YES: 126, NO: 704)
PREDICTORS:
COSTBUY: cost of the hand set (4 levels)
COSTUSE: monthly charges (4 levels)
WEIGHT variable is added to account for non-even distribution of responders and non-responders
A Single CART Run
Prediction Success
High price and
high rate poor
response
Overall Accuracy
63.734%
Low price and
low rate good
response
A Simple TreeNet Classification Model
Individual Contributions
Prediction Success
Now some live TreeNet runs
Official version available May, 2002 from Salford
Systems
Send e-mail to request copy to
support@salford-systems.com
Overall Accuracy
64.109%
References
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Breiman, L. (1996). Arcing classifiers (Technical Report). Berkeley: Statistics Department, University
of California.
Buntine, W. (1991). Learning classification trees. In D.J. Hand, ed., Artificial Intelligence Frontiers
in Statistics, Chapman and Hall: London, 182-201.
Dietterich, T. (1998). An experimental comparison of three methods for constructing ensembles of
decision trees: Bagging, Boosting, and Randomization. Machine Learning, 40, 139-158.
Freund, Y. & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In L. Saitta, ed.,
Machine Learning: Proceedings of the Thirteenth National Conference, Morgan Kaufmann, pp. 148-156.
Friedman, J.H. (1999). Stochastic gradient boosting. Stanford: Statistics Department, Stanford
University.
Friedman, J.H. (1999). Greedy function approximation: a gradient boosting machine. Stanford:
Statistics Department, Stanford University.
Heath, D., Kasif, S., and Salzberg, S. (1993). k-dt: A multi-tree learning method. Proceedings of the
Second International Workshop on Multistrategy Learning, 1002-1007, Morgan Kaufmann: Chambery,
France.
Kwok, S., and Carter, C. (1990). Multiple decision trees. In Shachter, R., Levitt, T., Kanal, L., and
Lemmer, J., eds. Uncertainty in Artificial Intelligence 4, North-Holland, 327-335.
Case study: Modelling Risk in Health Insurance - A Data
Mining Approach.
Inna Kolyshkina
PricewaterhouseCoopers
201 Sussex Street
SYDNEY NSW 2000
inna.kolyshkina@au.pwcglobal.com
Richard Brookes
PricewaterhouseCoopers
201 Sussex Street
SYDNEY NSW 2000
richard.brookes@au.pwcglobal.com
ABSTRACT
Interest in data mining techniques has been increasing recently amongst the actuaries and statisticians involved in the analysis of insurance data sets, which typically have a large number of both cases and variables. This paper discusses the main reasons for the increasing attractiveness of using data mining techniques in insurance. A case study is presented showing the application of data mining to a business problem that required modelling risk in health insurance, based on a project recently performed for a large Australian health insurance company by PricewaterhouseCoopers (Sydney). The data mining methods discussed in the case study include Classification and Regression Trees (CART), Multivariate Adaptive Regression Splines (MARS) and hybrid models that combine CART tree models with MARS and logistic regression. The non-commercially sensitive implementation issues are also discussed.
Keywords
Data analysis in insurance, data mining, Classification and Regression Trees (CART), Multivariate Adaptive Regression Splines (MARS), hybrid models.
1. INTRODUCTION
In insurance, as in many other industries (health, telecommunication and banking, to name a few), the size of databases today often reaches terabytes. In a data set like this, with millions of cases and hundreds of variables, finding important information is like finding the proverbial needle in the haystack. However, the need for extraction of such information is very real, and data mining is a technique that can meet that need.
Various data mining methodologies have been used in insurance for risk prediction/assessment, premium setting, fraud detection, health costs prediction, treatment management optimization, investments management optimization, and customer retention and acquisition strategies. In fact, a number of recent publications have examined the use of data mining methods in insurance and actuarial environments (e.g., Francis, 2001; WorkCover NSW News, 2001). The main reasons for the increasing attractiveness of the data mining approach are that it is very fast computationally and that it overcomes some well-known shortcomings of traditional methods, such as generalised linear models, that are often used for data analysis in insurance. This paper gives an example of the application of data mining methodologies to modelling risk in insurance, based on a recent project completed by the PwC Actuarial (Sydney) data mining team for a large insurance company client.
2. DATA MINING VERSUS LINEAR METHODS. MODELLING METHODOLOGIES USED: CART DECISION TREES, MARS AND HYBRID MODELS.
The main reasons for the increasing popularity of data mining methods amongst the actuarial community can be briefly summarised as follows. Data mining relies on the intense use of computing power, which results in an exhaustive search for the important patterns, uncovers hidden structure even in large and complex data sets, and in many cases produces a well-performing model. Also, unlike the more traditional linear methods, it does not assume that the response is distributed according to some specified distribution (an assumption that is often incorrect for real-life insurance data sets). In contrast, traditional methods take longer to develop models, and have particular trouble selecting important predictors and their interactions. Another very attractive feature of many data mining modelling methodologies is automatic "self-testing" of the model. A model is first built on a randomly-selected portion of the data and then tested and further refined on the remaining data. Finally, most data mining methods allow the inclusion in the model of categorical predictors with a large number of categories, which are typically present in insurance data sets (for example, postcode, injury code, occupation code etc). Classical methods cannot deal with such variables effectively, and, as a result, they are either left out of the model, or have to be grouped by hand prior to inclusion.
Each data mining technique has its advantages as well as its drawbacks. These are outside the scope of this paper, but are discussed in detail in the literature (for example, Vapnik (1996) and Hastie et al. (2001)). We were very aware of the importance of selecting the method of analysis that is best suited to a particular problem, and, after an extended study of the available data mining techniques, we selected tree-based models and their hybrids for everyday modelling of insurance data. The reasons for this selection are as follows. Tree-based methods are very fast, require less data preparation than some other techniques, can more easily handle missing values or noisy data, are unaffected by outliers, and are easy to interpret.
A useful feature of the software packages we used (CART® and MARS®) is that they are easy to implement in SAS, which is the main data analysis software package used by us as well as by the majority of our clients.
We provide below brief introductions to the techniques we used, only complete enough for appreciating the outline of the modelling process we describe. A more detailed description of them can be found in the literature as indicated in the individual sections.
2.1 Classification and Regression Trees (CART)
The CART methodology is known as binary recursive partitioning (Breiman et al, 1984). It is binary because the process of modelling involves dividing the data set into exactly two subgroups (or “nodes”) that are more homogeneous with respect to the response variable than the initial data set. It is recursive because the process is repeated for each of the resulting nodes. The resulting model is usually represented visually as a tree diagram. It divides all data into a set of several non-overlapping subgroups or nodes so that the estimate of the response is “close” to the actual value of the response within each node (Lewis et al, 1993). CART then ranks all the variables in order of importance, so that a relatively small number of predictors get a non-zero importance score. This means that it quickly selects the most important predictors out of many possible ones. The model is quickly built, is robust and is easily interpretable. However, like any decision tree, it is coarse in the sense that it predicts only a relatively small number of values and all cases within each node have the same predicted value. It also lacks smoothness: a small change in an independent (predictor) variable can lead to a large change in the predicted value. Another disadvantage of CART is that it is not particularly effective in modelling linear structure, and would build a large model to represent a simple relationship. Further details and discussion of decision trees and CART® can be found in the literature (Breiman et al, 1984; Hastie et al, 2001).
2.2 Multivariate adaptive regression splines (MARS)
MARS is an adaptive procedure for regression, and can be viewed as a generalisation of stepwise linear regression or a generalisation of the recursive partitioning method to improve the latter’s performance in the regression setting (Friedman, 1991; Hastie et al, 2001). The central idea in MARS is to formulate a modified recursive partitioning model as an additive model of functions from overlapping (instead of disjoint, as in recursive partitioning) subregions (Lewis et al, 1993).
The MARS procedure builds flexible regression models by fitting separate splines (or basis functions) to distinct intervals of the predictor variables. Both the variables to use and the end points of the intervals for each variable, referred to as knots, are found via an exhaustive search procedure, using very fast update algorithms and efficient program coding. Variables, knots and interactions are optimized simultaneously by evaluating a "loss of fit" (LOF) criterion. MARS chooses the LOF that most improves the model at each step. In addition to searching variables one by one, MARS also searches for interactions between variables, allowing any degree of interaction to be considered. The "optimal" MARS model is selected in a two-phase process. In the first phase, a model is grown by adding basis functions (new main effects, knots, or interactions) until an overly large model is found. In the second phase, basis functions are deleted in order of least contribution to the model until an optimal balance of bias and variance is found. By allowing for any arbitrary shape for the response function as well as for interactions, and by using the two-phase model selection method, MARS is capable of reliably tracking very complex data structures that often hide in high-dimensional data (Salford Systems, 2002).
2.3 Hybrid Models
The strengths of decision trees and “smooth” modelling techniques can be effectively combined. Steinberg and Cardell (1998a, 1998b) describe a methodology for such combining, in which the output of the CART model, in the form of a terminal node indicator, the predicted values, or the complete set of indicator dummies, is included among the other inputs to the “smooth” model. The resulting model is continuous and gives a unique predicted value for every record in the data. Typically, all strong effects are detected by the tree, and the “smooth” technique picks up the additional weak, in particular linear, effects. Combined, these small effects can very significantly improve the model performance (Steinberg and Cardell 1998a, 1998b).
3. HEALTH INSURER CASE STUDY
3.1 Background
The methodology described above was successfully applied in a recent project completed for a major health insurance company client. It was used for creating a model of overall projected lifetime customer value. The model took into account many aspects influencing customer value, such as premium income, reinsurance, changes in the family situation of a customer (births, marriages, deaths and divorce), probability of a membership lapse, and transitions from one type of product to another. Each of these aspects, as well as hospital claim frequency and cost for the next year and ancillary claim frequency and cost for the next year, was modelled separately, and the resulting models were combined into a complex customer lifetime value model. In this article we discuss one of the sub-models, namely the model for hospital claim cost for the next year.
3.2 Data
3.2.1 Data description
De-identified data was available at a member level over a 3 year period. The model used information available over the first 2 years to fit a model based on outcomes over the last year. We excluded from the data those customers who lapsed prior to the end of the 3 year period or joined the health insurer later than 3 years ago. This latter exclusion allowed us to avoid issues related to waiting periods and enabled us to use two years of data history in the modelling.
The data used for analysis can be grouped as follows. Firstly, there were demographic variables, such as age of the customer, gender and family status (about 30 variables). The second variable group was geographic and socio-economic variables, such as location of the member’s residence and socio-economic indices related to the geographic area of the member’s residence, such as indices of education, occupation, and relative socio-economic advantage and disadvantage (about 80 variables). The third group of variables was related to membership and product details, such as duration of the membership and details of the hospital and ancillary product held at present as well as in the past (about 30 variables). The fourth group of variables was related to claim history (both ancillary and hospital), details of medical diagnosis of the member, number of hospital episodes and other services provided to the member in previous years, number of claims in a particular calendar year etc (about 100 variables). The fifth and last group of variables included such information as distribution channel, most common transaction channel, payment method etc (about 50 variables). Overall there were about 300 variables.
3.2.2 Data preparation, cleaning and enrichment.
The data underwent a rigorous checking and cleaning process. This was performed in close cooperation with the client. Any significant data issues or inconsistencies found were discussed with them. Among other things such as statistical summaries, distribution analysis etc, the checking process involved exploratory analysis using CART, which was applied to identify any aberrant or unusual data groups.
Some of the variables in the original client data set were not directly used in the analysis. For example, instead of the date of joining, we used the derived predictor “duration of membership”. In other cases, if a predictor was described by the client as likely to contain unreliable or incorrect information, it was excluded from the analysis.
A number of variables included in the analysis were derived by us with the purpose of better describing customer behaviour. Examples are duration of membership and an indicator of whether or not the member had a hospital claim in previous years. Many such predictors, for example, indicators of whether the member stayed in hospital for longer than one day and whether the services received were of a surgical or non-surgical nature, were created after consultation with clinical experts.
Other variables were added to the data from various sources such as the Australian Bureau of Statistics. These included a number of socioeconomic indices based on the member’s residence, some related to broader geographic areas such as state, others more closely targeting the member’s location, such as postcode-based indicators.
3.3 Modelling Methodology
First, we built a CART tree model. This served the purposes of exploration, gaining an appreciation of the data structure and selecting the most important predictors, and it provided an easily interpretable diagram. The client found CART diagrams easy to understand and informative. To further refine the model, we then built a hybrid model of CART and MARS using the hybrid modelling methodology (Steinberg & Cardell, 1998a; Steinberg & Cardell, 1998b). This was achieved by including CART output, in the form of a categorical variable that assigned each record to one of the nodes according to the tree model, as one of the input variables into a MARS model. MARS, like CART
In some cases where we wanted to achieve an even higher degree of precision we built a “three-way hybrid model” combining CART®, MARS® and a linear model such as logistic regression or a generalized linear model. This was done by feeding MARS® output (in the form of basis functions created by MARS®) as inputs into a linear model.
3.4 Model diagnostic and evaluation
The main tools we used for model diagnostics were the gains chart and analysis of actual versus predicted values for hospital cost.
The gains chart for the overall hospital claims cost model presented in Figure 1 shows that we are able to predict the high cost claimants with a good degree of accuracy. As a rough guide, the overall claim frequency is 15%. Taking the 15% of members predicted as having the highest cost by the model, we end up with 56% of the total actual cost. Taking the top 30% of members predicted as having the highest cost by the model, we end up with almost 80% of the total actual cost.
[Chart: Gains Chart - Cost ranked by predicted variable. Vertical axis: % Captured; horizontal axis: Top x%. Series: % of actual events captured in the top X%, Random Sample, Theoretical Best.]
Figure 1 Gains chart for total expected hospital claims cost
A further diagnostic of model performance is analysis of actual versus expected values of probability of claim or claim cost. Such analysis can be pictorially represented by a bar chart of averaged actual and predicted values for overall annual hospital cost. This chart is shown in Figure 2. To create this chart, the members were ranked from highest to lowest in terms of predicted cost, and then segmented into 20 equally sized groups. The average predicted and actual values of hospital cost for each group were then calculated and graphed.
[Chart: Total Expected Cost by Percentile (predicted). Series: Actual, Expected Cost.]
Figure 2 The bar chart of averaged actual and predicted values for overall annual hospital cost
The chart suggests that the model fits well. It slightly over-predicts for the lower expected costs, but this was of little business importance for the client.
3.5 Findings and Results
3.5.1 Model Precision
The model achieved a high degree of precision, as demonstrated by the actual versus predicted graph (Figure 2) and the gains chart (Figure 1) above.
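The two diagnostics described in Section 3.4 are simple to compute once each member has an actual and a predicted cost. The sketch below is only an illustration under stated assumptions: the function names `gains` and `grouped_means` are hypothetical, the ten-member data set is invented, and the authors' own implementation (in SAS) is not shown in the paper.

```python
def gains(actual, predicted, top_fraction):
    """Share of total actual cost captured by the top `top_fraction` of
    members when ranked from highest to lowest predicted cost."""
    ranked = sorted(zip(predicted, actual), key=lambda pa: pa[0], reverse=True)
    k = round(top_fraction * len(ranked))
    return sum(a for _, a in ranked[:k]) / sum(actual)

def grouped_means(actual, predicted, n_groups=20):
    """Rank members by predicted cost (highest first), cut them into n_groups
    equally sized groups and return (mean predicted, mean actual) per group.
    Members left over when the count is not divisible are ignored."""
    ranked = sorted(zip(predicted, actual), key=lambda pa: pa[0], reverse=True)
    size = len(ranked) // n_groups
    groups = []
    for g in range(n_groups):
        chunk = ranked[g * size:(g + 1) * size]
        groups.append((sum(p for p, _ in chunk) / size,
                       sum(a for _, a in chunk) / size))
    return groups

# Invented toy data: ten members, most with no hospital cost.
actual = [0, 0, 0, 0, 100, 0, 0, 300, 0, 600]
predicted = [10, 20, 5, 15, 200, 30, 25, 250, 40, 500]

print(gains(actual, predicted, 0.2))
print(grouped_means(actual, predicted, n_groups=5))
```

Plotting `gains` over a range of fractions gives the gains curve. Plotting the pairs from `grouped_means` side by side gives the actual versus predicted bar chart.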
3.5.2 Predictor importance for hospital claims cost
Predictors of the highest importance for overall hospital cost were age of the member, gender, number of hospital episodes and hospital days in the previous years, the type of cover and socio-economic characteristics of the member.
Other important predictors included duration of membership, family status of the member, the type of cover that the member had in the previous year, previous medical history and the number of physiotherapy services received by the member in the previous year. The fact that the number of ancillary services (physiotherapy) affected hospital claims cost was a particularly interesting finding.
Details of the resulting model are commercially sensitive. However, we can state that many of the potential predictors given above were indeed significant to a degree greater than we had expected. For example, while some health insurance specialists argue that the only main risk driver for hospital claim cost is age of the member, our results have demonstrated clearly that although age is among the important predictors of hospital claims cost, a large amount of variation is not explained by age alone. One way of showing this is by means of a graph of predicted cost by age, shown in Figure 3. If age were the most important predictor, with other predictors not adding much value, the graph would show values scattered closely around a single curve. The fact that the values are scattered so widely shows that there are many other factors contributing significantly to predicted cost. Examples of such factors are socioeconomic indicators, type of hospital product and, for some age groups, the supply of hospitals in the location of the member’s residence.
Figure 3. The graph of predicted hospital cost versus age.
We also built models for ancillary claims of various types, including optical, dental and physiotherapy claims. Unsurprisingly, the most important predictor of ancillary claims is the customer’s previous claiming pattern. However, there are strong age related effects (for instance the teenage orthodontic peak for dental claims), socio-economic effects and location effects.
3.6 Implementation issues.
The deliverables for the model included a SAS algorithm which takes the required input data and produces a cost score for a given customer, so the client could easily implement the model directly in the SAS environment.
4. CONCLUSION
The results described above, as well as a number of projects completed by PwC Actuarial (Sydney) for large insurer clients, demonstrate that data mining methodologies can be very useful for analysis of insurance data.
5. ACKNOWLEDGEMENTS
We would like to thank Mr John Walsh (PricewaterhouseCoopers Actuarial, Sydney) for support, advice and thoughtful comments on the analysis.
6. REFERENCES
[1] Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth, Pacific Grove, CA.
[2] Francis, L. (2001). Neural networks demystified. Casualty Actuarial Society Forum, Winter 2001, 252–319.
[3] Haberman, S. and Renshaw, A. E. (1998). Actuarial applications of generalized linear models. In Hand, D. J. and Jacka, S. D. (eds). Statistics in Finance. Arnold, London.
[4] Han, J. and Kamber, M. (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
[5] Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.
[6] Lewis, P.A.W. and Stevens, J.G. (1991). Nonlinear Modeling of Time Series using Multivariate Adaptive Regression Splines. Journal of the American Statistical Association, 86, No. 416, pp. 864-867.
[7] Lewis, P.A.W., Stevens, J., and Ray, B.K. (1993). Modelling Time Series using Multivariate Adaptive Regression Splines (MARS). In Time Series Prediction: Forecasting the Future and Understanding the Past, eds. Weigend, A. and Gershenfeld, N., Santa Fe Institute: Addison-Wesley, pp. 297-318.
[8] McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models (2nd edition). Chapman and Hall, London.
[9] Salford Systems (2000). CART® for Windows User’s Guide. Salford Systems.
[10] Salford Systems (2002). MARS® (Multivariate Adaptive Regression Splines). [On-line] http://www.salfordsystems.com (accessed 08/10/2002).
[11] Smyth, G. (2002). Generalised linear modelling. [On-line] http://www.statsci.org/glm/index.html (accessed 25/09/2002).
[12] Steinberg, D. and Cardell, N. S. (1998a). Improving data mining with new hybrid methods. Presented at DCI Database and Client Server World, Boston, MA.
[13] Steinberg, D. and Cardell, N. S. (1998b). The hybrid CART-Logit model in classification and data mining. Eighth Annual Advanced Research Techniques Forum, American Marketing Association, Keystone, CO.
[14] Steinberg, D. and Colla, P. L. (1995). CART: Tree-Structured Nonparametric Data Analysis. Salford Systems, San Diego, CA.
[15] Vapnik, V. (1996). The Nature of Statistical Learning Theory. Springer-Verlag, New York.
[16] WorkCover NSW News (2001). Technology catches insurance fraud. [On-line] http://www.workcover.nsw.gov.au/pdf/wca46.pdf (accessed 08/10/02).
Investigative Profile Analysis with Computer Forensic Log
Data using Attribute Generalisation
Tamas Abraham
Ryan Kling
Olivier de Vel
Information Networks Division
Defence Science and
Technology Organisation
PO Box 1500, Edinburgh SA
5111, Australia
Information Networks Division
Defence Science and
Technology Organisation
PO Box 1500, Edinburgh SA
5111, Australia
Information Networks Division
Defence Science and
Technology Organisation
PO Box 1500, Edinburgh SA
5111, Australia
tamas.abraham@dsto.defence.gov.au
ryan.kling@dsto.defence.gov.au
olivier.devel@dsto.defence.gov.au
ABSTRACT
1. INTRODUCTION
2. BACKGROUND TO INVESTIGATIVE PROFILING
[Figure: concept hierarchy. Labels: installation, version, date/time stamp, file location, ...; application type: mailer (elm, Outlook, ...), editor (MS Word, Emacs, ...), compiler (Java, C/C++, ...)]
2.1 Profiling with Generalisation
2.2 Concept Hierarchies
3. THE PROFILING PROCESS
[Figure: profiling process. Labels: Forensic Evidence, Data Warehouse, concepts, log file, pre-processing, formatted log file, mining, intra-profile, profile-outlier, reports, profile/outliers]
3.1 Induction Algorithm
3.1.1 Parameters
3.1.2 Unbalanced-tree Ascension
3.1.3 Numerical Data
3.1.4 Vote Propagation
3.1.5 Ascension Algorithm
3.2 Profile Separation
3.2.1 Constant Separators
3.2.2 Using Line =
3.2.3 Euclidean Distances
3.2.4 Rate of Change in Means
3.2.5 Closest Point
3.2.6 K-means Clustering
3.2.7 Regression
3.2.8 Separation in One Dimension
3.3 Element Distance Metrics
3.3.1 Metric Properties
3.3.2 Metric Strategy
4. DATA, EXPERIMENTAL METHODOLOGY AND RESULTS
4.1 Profile-to-outlier Experiments
4.2 Intra-Profile Experiments
5. CONCLUSIONS AND FUTURE DIRECTIONS
6. REFERENCES
Mining Antarctic scientific data: a case study
Ben Raymond and Eric J Woehler
Australian Antarctic Division
Kingston, Tasmania 7050
http://www-aadc.aad.gov.au
ben.raymond@aad.gov.au
ABSTRACT
The Australian Antarctic Data Centre is a web-accessible
repository of freely-available Antarctic scientific data. The
Data Centre seeks to increase the value and utility of its
holdings through data mining analyses and research. We
present and discuss analyses of an extensive spatial/temporal
database of at-sea observations of seabirds and related physical environmental parameters. Mixture-model based clustering identified two communities of seabirds in the Prydz
Bay region of East Antarctica, and characterised their spatial and temporal distributions. The relationships between
observations of three seabird species and environmental parameters were explored using predictive logistic models. The
parameters of these models were estimated using data from
the Prydz Bay region. The generality of the models was
tested by applying them to data from a different region
(that adjacent to Australia’s Casey station). This approach
identified regional differences in the at-sea observations of
seabird species. The results of these analyses complement
those of at-sea studies of seabirds elsewhere around the Antarctic. They also provide insights into possible data errors that
were not readily apparent from direct examination of the
data. These analyses enhanced ecological understanding,
provided feedback on survey strategy, and highlighted the
utility of the repository.
Figure 1: Australian Antarctic research stations (•) and other locations mentioned in the text. Locations labelled on the map: Hobart, Kerguelen Islands (France), Heard Island, Macquarie Island, Prydz Bay, Casey, Mertz Glacier, Mawson and Davis.

1. INTRODUCTION

The Australian Antarctic Data Centre (AADC) was established in 1995 to make scientific observations and results from Antarctica freely available. The free availability of data is one of Australia’s obligations under the Antarctic Treaty (article III). The majority of the data collected in Antarctica, while originally collected for a specific investigation, nevertheless have wide potential relevance to other projects and investigators. Many of the AADC’s holdings are ecological or environmental in nature, and linkages between databases are extensive.

The AADC plays an active role in the analysis of Antarctic scientific data by mining its holdings. The broad aim is to improve the value of these data to the Antarctic community. Several approaches are being taken, including:
• the direct application of mining and exploratory techniques in order to uncover new information from the data. These analyses are additional to those undertaken as a routine part of Antarctic scientific studies and aim to exploit the multi-disciplinary nature of the data held by the AADC;
• the extraction of actionable information from low-level scientific data. This has direct application to conservation, planning, and legislative activities, as well as producing “end-product” data suitable for use by other scientific investigators; and
• the generation of a better understanding of the holdings of the AADC, including the identification of data errors, duplicated data, missing records, linkages between databases, and data acquisition procedures. This information has direct application for data management issues, such as maintaining high data quality and an efficient database structure.

We present an overview of the mining of the “Wildlife-on-Voyage” (WoV) database. This database holds an extensive collection of observations of wildlife (comprising birds, whales, and kelp) made from ships during Antarctic voyages. The information within this collection has wide scientific relevance. However, the data present numerous analytical challenges, including spatial and temporal variation (within and across years), missing values, and a lack of balance in sampling. We begin by describing the data and the methods that were used to collect them, and then present and discuss two investigations using these data. These investigations focused on the identification of communities of seabirds and the relationships of the birds with their environment.
Figure 2: Spatial and temporal distribution of at-sea sightings of seabirds made from Australian Antarctic voyages, 1980-2002. Data from all years have been pooled. Densities are shown in cells of size 1° longitude × 1° latitude. The shade of grey denotes the number of surveys made in the cell (black = more than 30 surveys, white = no data). Panels: Jul-Sep, Oct-Dec, Jan-Mar and Apr-Jun, each spanning 20°S to 70°S and 60°E to 180°W.

2. DATA DESCRIPTION

The seabird component of the WoV database comprises approximately 140 000 observations of 119 species, made on 98 voyages conducted between 1980 and 2002. These voyages were undertaken in the course of Australia’s Antarctic scientific research program. The majority of the voyages were for the transportation of personnel and supplies to Australian Antarctic bases (see Figure 1), with a small number of voyages for scientific surveys. While survey voyages attempted to maintain a balanced sampling strategy, the same was not true of the transportation voyages. Observations on these voyages were incidental, with little or no opportunity for balanced survey design. Figure 2 shows the spatial and temporal distribution of the data. The most densely surveyed areas are clearly those adjacent to the Australian Antarctic stations. The temporal distribution of the observations is heavily biased against the winter months, because the extensive sea ice in the Antarctic during winter makes ship travel virtually impossible.

Observations of wildlife were made in surveys of 10 minutes duration, with generally one survey made per hour of the voyage. Physical environmental data collected at the time of each survey included sea surface temperature (°C), sea state (or wave height, recorded on an ordinal scale), cloud cover (categorised as clear, partial, total, or blowing snow), wind force (Beaufort) and direction, and atmospheric pressure (hPa). Sea ice cover was also estimated but, as discussed below, alternative sea ice data derived from satellite images were used in the analyses.

3. PREPROCESSING

3.1 Data cleaning

Data cleaning and error checking consumed a large proportion of the time spent on this study. Prior to the 1992/1993 season all observations were recorded on paper forms and manually entered into the database. On voyages after this season a laptop-based entry system was used where possible, reducing the likelihood of errors in data transcription.

Rule-based techniques were used to detect violations of physical limitations: for example, sea surface temperature cannot be less than -1.8 °C, the approximate freezing point of sea water. Similarly, the differences in time and position of consecutive observations were used to calculate an apparent ship speed, which was then compared to a maximum possible speed of 25 knots. There were instances in which either the time or position stamp of a data record was in error by one digit, suggesting an error during manual entry of the data. Position and time stamp errors were in general more easily identified using graphical methods, particularly where the errors were small (for example, transcription errors in the tenths-of-degrees digit).
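The two rule-based checks described in section 3.1 (a physical lower bound on sea surface temperature and an upper bound on apparent ship speed) can be sketched as follows. This is a minimal illustration rather than the authors' code: the record layout (`time`, `lat`, `lon`, `sst`), the function names, the toy records and the use of a haversine distance are assumptions made for the example. Only the -1.8 °C and 25 knot thresholds come from the text.

```python
import math
from datetime import datetime

MIN_SST_C = -1.8            # approximate freezing point of sea water (from the text)
MAX_SPEED_KNOTS = 25.0      # maximum possible ship speed (from the text)
EARTH_RADIUS_NM = 3440.065  # mean Earth radius in nautical miles


def distance_nm(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in nautical miles."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_NM * math.asin(math.sqrt(a))


def flag_records(records):
    """Return (index, reason) pairs for records that violate a rule.

    `records` is a time-ordered list of dicts with hypothetical keys
    'time' (datetime), 'lat', 'lon' (degrees) and 'sst' (deg C). A speed
    flag marks the later record of a pair; either record may be at fault.
    """
    flags = []
    for i, rec in enumerate(records):
        if rec["sst"] < MIN_SST_C:
            flags.append((i, "sst below freezing point"))
        if i > 0:
            prev = records[i - 1]
            hours = (rec["time"] - prev["time"]).total_seconds() / 3600.0
            dist = distance_nm(prev["lat"], prev["lon"], rec["lat"], rec["lon"])
            if hours <= 0 or dist / hours > MAX_SPEED_KNOTS:
                flags.append((i, "implausible apparent ship speed"))
    return flags


# Toy records: the second has an impossible temperature, the third a
# latitude that is wrong in the tens digit (56 instead of 66 degrees south).
records = [
    {"time": datetime(1995, 1, 10, 0), "lat": -66.0, "lon": 75.0, "sst": -1.2},
    {"time": datetime(1995, 1, 10, 1), "lat": -66.1, "lon": 75.2, "sst": -2.5},
    {"time": datetime(1995, 1, 10, 2), "lat": -56.2, "lon": 75.4, "sst": -1.0},
]
flags = flag_records(records)
```

Checks of this kind catch gross errors; as the text notes, small position errors are more easily found graphically.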
The species diversities of Antarctic seabird communities are low. Except for very rare species, one could reasonably expect to encounter the same species from year to year in a given region. The identification of species for which there were very few observations in a region therefore proved to be a simple but effective mechanism of finding records that were likely to contain errors in species identification or data entry. For example, we found four observations of Australasian gannets in Prydz Bay (66°S), a species which is not normally found south of 50°S. Other likely errors in species identification were also identified during the community analyses (see section 4, below).
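The rare-species screen described above amounts to counting observations per species within a region and listing those below a small threshold for manual review. A minimal sketch follows. The tuple layout, the region labels, the toy counts and the threshold of five observations are illustrative assumptions; the text gives no numeric cut-off beyond the example of four gannet records.

```python
from collections import Counter


def rare_species(observations, region, max_count=5):
    """List species seen at most `max_count` times in `region`.

    `observations` is an iterable of (species, region) pairs; both the
    layout and the default threshold are hypothetical.
    """
    counts = Counter(sp for sp, reg in observations if reg == region)
    return sorted(sp for sp, n in counts.items() if n <= max_count)


# Toy data: gannets are common in a (hypothetical) northern region but
# appear only four times in Prydz Bay, so they are flagged for review.
observations = (
    [("snow petrel", "Prydz Bay")] * 40
    + [("cape petrel", "Prydz Bay")] * 25
    + [("Australasian gannet", "Prydz Bay")] * 4
    + [("Australasian gannet", "northern region")] * 30
)
suspects = rare_species(observations, "Prydz Bay")
```

A flagged species is only a candidate for error; genuinely rare visitors would also appear in such a list and need expert review.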
Errors were corrected using interpolation from surrounding values where possible, or patched using data from the marine science database (see below). In some cases, there were insufficient data to allow interpolation: such entries were deleted from the data set.

3.2 Database linkages

The physical environmental variables in the WoV database (see section 2, above) provide a natural set of linkages to other databases both within and external to the AADC. Of particular interest is a marine science database that holds data collected from onboard sensors during Antarctic voyages. These data comprise environmental variables, including sea surface temperature, wind speed and direction, and solar radiation, as well as voyage information such as ship speed and position. Marine science data are available only from voyages of the Aurora Australis; the other ships used for Australian Antarctic scientific voyages do not have this real-time data logging system installed.

Other, external, environmental databases are also relevant to this study. For example, the National Snow and Ice Data Centre at the University of Colorado (http://nsidc.org) maintains a database of satellite-derived sea ice concentration data. This database holds daily Antarctic sea ice concentration data from 1978 onwards, on a spatial grid with a cell size of 25 km × 25 km. These sea ice data were used in preference to the directly-observed data, in order to avoid the potential bias of ship tracks to areas of open water (i.e. less sea ice).

4. COMMUNITY ANALYSIS

4.1 Motivation

A community can be defined as a group of species that share a habitat. Community analysis can offer a broad view of an ecosystem and allows species-level information to be abstracted and presented in a compact form. Such analyses are therefore of interest for management and conservation purposes, but may also be used to guide more specific scientific investigations of particular species or areas of interest. The concepts and techniques of community analysis are identical to those of market basket analysis in data mining (used in a transaction database context, for example, to identify products that tend to be purchased together).

4.2 Methods

The study area of interest was Prydz Bay, defined as that area of the Southern Ocean between 60°E and 90°E, and south of 60°S to the Antarctic continent (see Figure 1). Prydz Bay was chosen as it has been the focus of numerous studies of seabirds in their colonies [18]. Prydz Bay is the primary seabird breeding locality in East Antarctica, with breeding populations of nine species [18], comprising approximately 30% of the East Antarctic seabird biomass [16]. Furthermore, the WoV data coverage within Prydz Bay is relatively dense, as two of Australia’s four permanent research stations (Davis and Mawson) are located along this sector of the Antarctic coastline. Observations were pooled into composite records for the analyses. The pooling was limited so that these composite records contained consecutive observations from a single voyage only, and spanned no more than 12 hours and a 50 km change in ship position. These composite records are referred to here as “sites”, which is the usual nomenclature used in the ecological literature. The species composition of each site was compiled in presence/absence format, and the environmental variable values within a composite record were combined using a median (for continuous variables) or mode (for nominal or ordinal variables) operator.

Seabird communities were explored using two complementary cluster analyses. The first examined the clustering of sites based on species composition. The seabird communities were then generated from the species compositions of the resulting site clusters.

The division of ecological data into discrete clusters can be problematic because in many cases the data do not show an inherently grouped structure. Rather, ecological data commonly form a continuum between extremes. The division of such a continuum into distinct entities does not necessarily lead to results that make intuitive sense. Soft clustering algorithms (also known as fuzzy, or probabilistic clustering), which assign to each datum a membership level in each cluster, may therefore be preferable to “hard” clustering algorithms, which allocate each datum exclusively to a single cluster.

We applied a mixture-model approach [4; 13] to the clustering of sites by species composition. This is a soft clustering approach in which the data are modelled by a mixture of probability distributions, with each representing a different cluster. Since the species compositions were in binary (presence/absence) form, the Bernoulli distribution was the natural choice. Mixtures of multivariate Bernoulli distributions have been shown in theory to be non-identifiable [11]; however, in practice, interpretable results can still be obtained [6]. We used maximum-likelihood estimation by expectation-maximisation [9; 20]. Although we do not do so here, the mixture model approach also offers principled methods for the selection of the correct number of clusters [10]. This would be of interest in situations where a large number of cluster analyses were required, with little prior information available to guide the choice of number of clusters. Our choice of number of clusters was based on prior knowledge of the seabird communities along with expert assessment of the properties of the emergent clusters.

The species compositions of the seabird communities were assessed on the basis of the membership of each species to each cluster as well as the constancy of each species within each cluster. The constancy may be calculated as the fraction of sites from a cluster that contain an observation of the species in question. Species with a high membership-constancy product can be considered to be the “indicator” species of an assemblage [8]. Indicator species are useful for characterising the species composition of an assemblage, where such an assemblage contains many species.
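The mixture-model clustering of sites described in section 4.2 can be sketched as expectation-maximisation for a mixture of multivariate Bernoulli distributions. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the deterministic initialisation from evenly spaced sites, the clipping constant `eps`, the fixed iteration count and the toy presence/absence matrix are all choices made for the example.

```python
import math


def fit_bernoulli_mixture(X, k, n_iter=50, eps=1e-6):
    """Fit a k-component multivariate Bernoulli mixture by EM.

    X is a list of equal-length 0/1 lists (sites x species). Returns
    (weights, theta, resp): theta[c][j] estimates P(species j present |
    cluster c) and resp[i][c] is the soft membership of site i in cluster c.
    """
    n, d = len(X), len(X[0])
    weights = [1.0 / k] * k
    # Assumed initialisation: seed each cluster from an evenly spaced site.
    theta = [[0.75 if v else 0.25 for v in X[(c * n) // k]] for c in range(k)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior membership of every site in every cluster.
        resp = []
        for x in X:
            logp = [
                math.log(weights[c])
                + sum(math.log(p if v else 1.0 - p) for p, v in zip(theta[c], x))
                for c in range(k)
            ]
            top = max(logp)
            w = [math.exp(lp - top) for lp in logp]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step: update mixing weights and presence probabilities,
        # clipped away from 0 and 1 so the logarithms stay finite.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            weights[c] = max(nc / n, eps)
            theta[c] = [
                min(max(sum(r[c] * x[j] for r, x in zip(resp, X)) / max(nc, eps), eps),
                    1.0 - eps)
                for j in range(d)
            ]
    return weights, theta, resp


# Toy data: six sites, five species; species index 2 occurs in both groups.
X = [
    [1, 1, 1, 0, 0], [1, 1, 0, 0, 0], [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1], [0, 0, 1, 1, 1], [0, 0, 0, 1, 1],
]
weights, theta, resp = fit_bernoulli_mixture(X, k=2)
```

Each row of `resp` sums to one, so a site can carry non-zero membership in more than one cluster; taking the largest entry per row would recover a hard assignment.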
The second cluster analysis grouped seabirds according to their spatio-temporal ranges. In this approach, species that were observed in the same region of the ocean at the same time are grouped together. This yields the seabird communities directly. Dissimilarities between species ranges were calculated using the TwoStep algorithm [2] and clustering computed using a hierarchical complete-linkage algorithm. A hierarchical clustering is more natural in this case because the number of entities is small (26 species within Prydz Bay) and the hierarchy of the dendrogram is itself of interest. Seabird communities identified using this approach are referred to here as “associations”. The communities identified by the mixture-model clustering described earlier will be referred to as “assemblages” in order to differentiate the two approaches.

4.3 Results and Discussion

The clustering of sites by species composition revealed a two-group structure in the seabird assemblages. The spatial and temporal distributions of these assemblages (all years combined) are shown in Figure 3, and the species composition of the two assemblages is shown in Figure 4. Assemblage 1 contains all nine species that breed in Prydz Bay in addition to sub-Antarctic skuas, arctic and Antarctic terns, and northern giant petrels. This assemblage was observed close to the Antarctic coast during the middle of the breeding season (January-March, Figure 3). Assemblage 2 contains the remaining 12 species, all of which breed in temperate or sub-Antarctic latitudes and forage within Prydz Bay during the southern hemisphere summer. This assemblage was observed during the summer months (December-March), offshore from the Prydz Bay coast. The spatio-temporal ranges of the two assemblages overlap, as can be seen from the mid-grey cells in Figure 3. This overlap is handled transparently by a soft clustering algorithm, because sites which host both assemblages at the same time will have a non-zero membership to both assemblages. In contrast, a hard clustering algorithm would assign such a site exclusively to one of the two assemblages. The overlap is more readily observed using the soft clustering approach. Increasing the number of clusters to three placed this overlap into its own cluster, further highlighting this finding.

Figure 3: Spatial and temporal distribution of two assemblages of seabirds in the Prydz Bay region of Antarctica. The species composition of each assemblage is shown in Figure 4. Panels: 15 Sep - 14 Oct, 15 Oct - 14 Nov, 15 Nov - 14 Dec, 15 Dec - 14 Jan, 15 Jan - 14 Feb, 15 Feb - 14 Mar, 15 Mar - 14 Apr and 15 Apr - 14 May, each spanning 60°S to 70°S and 60°E to 90°E. Legend: Assemblage 1, Assemblage 2.

Figure 4: Membership of 26 seabird species to the two assemblages shown in Figure 3. (R) indicates the species that breed in Antarctic locations. Indicator species (see text) are marked with an asterisk. Axes: Assemblage 1 membership and Assemblage 2 membership, each from 0 to 1. Species, in plot order: Emperor penguin (R), Adelie penguin (R), Southern giant petrel (R), Southern fulmar (R), Cape petrel (R), Antarctic petrel (R), Snow petrel (R), Wilson’s storm petrel (R), South polar skua (R), Subantarctic skua, Antarctic tern, Arctic tern, Antarctic/arctic tern, Northern giant petrel, Black-browed albatross, Grey-headed albatross, Light-mantled sooty albatross, Wandering albatross, White-headed petrel, Mottled petrel, Kerguelen petrel, Blue petrel, Prion spp., White-chinned petrel, Dark shearwaters, Black-bellied storm petrel.

Indicator species are marked on Figure 4 with an asterisk. Two species (cape petrels and Wilson’s storm petrels) were found to be indicator species in both assemblages. This suggests that their at-sea distributions were quite broad, whereas the other breeding species were generally observed only in relative proximity to the Prydz Bay coast (particularly during the middle of the breeding season; see the distribution of assemblage 1 in Figure 3). This difference is a result of the fact that these two species breed both on the Prydz Bay coast and in sub-Antarctic locations such as Heard Island (which lies to the north of Prydz Bay; see Figure 1). Thus, individuals observed offshore from the Prydz Bay coast are probably those breeding on Heard Island. The only other species that breeds both in Prydz Bay and on Heard Island is the southern giant petrel.
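The constancy and the membership-constancy product used to pick indicator species (section 4.2) can be computed as in the sketch below. Two points are assumptions rather than the paper's definitions: sites are given hard cluster labels (for example, the cluster with the largest soft membership), and the membership of a species to a cluster is taken to be that cluster's share of the species' summed constancies. The function name and the toy sites are also hypothetical.

```python
def indicator_scores(sites, labels, n_clusters):
    """Return (constancy, scores), each a list of per-cluster dicts.

    sites  : list of sets of species names observed at each site
    labels : cluster index assigned to each site (assumed hard labels)
    constancy[k][s] is the fraction of sites in cluster k containing s;
    scores[k][s] is an assumed membership times that constancy.
    """
    species = sorted(set().union(*sites))
    counts = [dict.fromkeys(species, 0) for _ in range(n_clusters)]
    sizes = [0] * n_clusters
    for site, k in zip(sites, labels):
        sizes[k] += 1
        for s in site:
            counts[k][s] += 1
    constancy = [
        {s: (counts[k][s] / sizes[k] if sizes[k] else 0.0) for s in species}
        for k in range(n_clusters)
    ]
    scores = []
    for k in range(n_clusters):
        row = {}
        for s in species:
            total = sum(constancy[j][s] for j in range(n_clusters))
            membership = constancy[k][s] / total if total else 0.0
            row[s] = membership * constancy[k][s]
        scores.append(row)
    return constancy, scores


# Toy example: cape petrels occur in both clusters, so they score lower
# in cluster 0 than snow petrels, which occur at every cluster-0 site.
sites = [
    {"snow petrel", "cape petrel"},
    {"snow petrel"},
    {"blue petrel", "cape petrel"},
    {"blue petrel"},
]
labels = [0, 0, 1, 1]
constancy, scores = indicator_scores(sites, labels, n_clusters=2)
```

Ranking the entries of `scores[k]` in descending order gives candidate indicator species for cluster `k` under these assumptions.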
The hierarchical clustering of species by spatio-temporal
range is shown in Figure 5. Cutting the dendrogram at a
relatively high dissimilarity level yields two seabird associations (marked as (a) and (b) on the figure) that are identical
to the two assemblages shown in Figure 4. Association (a)
may be further split into (a1) and (a2). Sub-association
(a1) contains southern giant petrels, cape petrels, Wilson’s
storm petrels and arctic terns: three of these are the species
that breed both on the Prydz Bay coast and on Heard Island. Their at-sea distributions are therefore different from
the distributions of the remainder of the breeding species.
This finding reinforces that obtained from the first cluster
analysis, discussed above.
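Complete-linkage clustering and the effect of cutting the dendrogram at different levels can be illustrated with a short standard-library sketch. The dissimilarity matrix below is invented for the example (the TwoStep dissimilarities of [2] are not reproduced), and the function name and species subset are likewise assumptions.

```python
def complete_linkage(dist, n_clusters):
    """Agglomerative clustering with complete linkage.

    dist is a symmetric matrix (list of lists) of pairwise dissimilarities.
    Merging stops when n_clusters groups remain, which corresponds to
    cutting the dendrogram just below the height of the next merge.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


species = ["snow petrel", "Antarctic petrel", "cape petrel", "blue petrel", "prion spp."]
# Hypothetical dissimilarities between the spatio-temporal ranges above.
dist = [
    [0, 1, 4, 9, 10],
    [1, 0, 5, 9, 9],
    [4, 5, 0, 7, 8],
    [9, 9, 7, 0, 2],
    [10, 9, 8, 2, 0],
]
two_groups = complete_linkage(dist, 2)    # a high cut: two associations
three_groups = complete_linkage(dist, 3)  # a lower cut splits one of them
```

In this toy case the lower cut separates the cape petrel from the first group, in the same way that association (a) splits into (a1) and (a2) in the text.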
As well as providing direct community information, these
analyses yielded additional information relating to issues of
species identification. Antarctic terns, arctic terns, and their
composite (used when specific identification at sea was not possible), are grouped in the same assemblage in Figure 4, and relatively tightly in Figure 5. However, the behaviours of the two species are quite different. Arctic terns breed in the northern hemisphere and migrate to Prydz Bay in the southern hemisphere summer to feed. Antarctic terns breed on sub-Antarctic islands (such as Heard Island) during the summer and migrate north to South Africa during the winter. Antarctic terns breeding on Heard Island feed inshore and do not venture far from land. Thus, our clustering results suggest that at least some of the records of Antarctic terns in Prydz Bay may in fact be arctic terns that have been misidentified. Detailed examination of the distributions of these records would be needed to identify which are likely to be in error. Similarly, northern giant petrels were clustered together with the resident species. Northern giant petrels are a migratory species that are generally found in the northern regions of Prydz Bay [17]. Examination of northern giant petrel records revealed that on one particular voyage, a high number of unlikely northern giant petrel sightings were recorded in the southern part of Prydz Bay. It is possible that these were misidentified southern giant petrels.

Figure 5: Dendrogram of seabird species, clustered according to similarity of spatio-temporal range. (R) indicates the species that breed in Antarctic locations. Groupings within the dendrogram labelled (a), (b), etc. are discussed in the text. Horizontal axis: Dissimilarity (0 to 10). Groupings marked: (a), (a1), (a2), (b). Leaf order: Emperor penguin (R), Adelie penguin (R), Southern fulmar (R), Antarctic petrel (R), Snow petrel (R), South polar skua (R), Subantarctic skua, Antarctic tern, Arctic tern, Southern giant petrel (R), Cape petrel (R), Wilson’s storm petrel (R), Antarctic/arctic tern, Northern giant petrel, Black-browed albatross, Wandering albatross, Dark shearwaters, Grey-headed albatross, White-headed petrel, Kerguelen petrel, Mottled petrel, Blue petrel, Light-mantled sooty albatross, Prion spp., White-chinned petrel, Black-bellied storm petrel.

5. ENVIRONMENTAL RELATIONSHIPS

5.1 Motivation

The community analyses described above provide a foundation for investigating the relationships between seabirds and their environment. A proper understanding of these relationships is vital for an understanding of the seabirds, the region, and for planning, management, and legislative purposes. Characterising the dependence of the birds on their environment is one of the first steps in assessing the likely impact of global climate change on Southern Ocean seabirds. It has been suggested [14] that the initial effects of global climate change may be most pronounced in sub-Antarctic regions.

We examined the use of predictive models as a means of investigating the relationships between seabird observations and environment. Seasonal behaviour and the response to environment are likely to differ among the species within a community, leading to dynamic community compositions. Furthermore, neighbouring communities are not disjoint but rather overlap at the edges of their ranges [19]. The models were therefore built using species-level data rather than community-level data. Given predictions of individual species ranges, it would be a straightforward matter to combine these into community-level predictions if desired.

The ability to successfully predict the at-sea distributions of seabirds from environmental parameters would be extremely valuable. At-sea survey data for much of the world’s oceans are limited due to the logistic difficulties and costs involved. Predictive models that use remotely-sensed environmental data may allow the estimation of seabird distribution in those areas of the ocean not amenable to direct survey.

5.2 Methods

Seabird observations from two different areas were used. Observations from Prydz Bay were used to build and test the models. These models were then applied to data from the Casey station region in order to test the generality of the models. The delineation of Prydz Bay was the same as in section 4, above, except that the northern boundary was extended to 50°S. This extension includes Heard Island (53°5’S, 73°30’E, an important seabird breeding area) in the study. The Casey station region was delimited to the area between 100°E and 120°E, and south of 50°S to the Antarctic continent (see Figure 2). There is no northern land mass equivalent to Heard Island in the Casey station area.

These geographical areas were divided into grids of spatial bins, each spanning 2° longitude by 2° latitude. We assumed that the relationships between bird observations and the physical environment remain constant among years; therefore, data from all years were pooled. However, these relationships do vary with time of year as the bird behaviour is driven by differing processes throughout the season. For each species studied here we have therefore fitted a temporal sequence of models. Each model spanned a 30 day time period and consecutive models overlapped by 15 days.

We present the results for three seabird species: snow petrels (Pagodroma nivea, which breed in Antarctic localities including Prydz Bay and the Casey station coastal regions), cape petrels (Daption capense, which breed in the Antarctic as well as in sub-Antarctic localities), and white-chinned petrels (Procellaria aequinoctialis, which breed on islands in temperate latitudes and forage in Antarctic waters during the southern hemisphere summer). These three species were the most commonly observed in each of the three breeding categories described above.
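The spatial binning and the overlapping model windows described in section 5.2 can be sketched as follows. The 2° bin size, the 30 day window and the 15 day overlap come from the text; the function names, the half-open treatment of window ends and the season dates in the example are assumptions.

```python
import math
from datetime import date, timedelta

BIN_DEG = 2.0     # bin size in degrees of latitude and longitude (from the text)
WINDOW_DAYS = 30  # time span of each model (from the text)
STEP_DAYS = 15    # consecutive models overlap by 15 days (from the text)


def spatial_bin(lat, lon):
    """Return the south-west corner of the 2 x 2 degree bin containing a point."""
    return (math.floor(lat / BIN_DEG) * BIN_DEG,
            math.floor(lon / BIN_DEG) * BIN_DEG)


def model_windows(season_start, season_end):
    """Yield (start, end) dates of overlapping windows within the season.

    Each window is treated as half-open: start <= day < end.
    """
    start = season_start
    while start + timedelta(days=WINDOW_DAYS) <= season_end:
        yield start, start + timedelta(days=WINDOW_DAYS)
        start += timedelta(days=STEP_DAYS)


corner = spatial_bin(-66.3, 75.9)
# Hypothetical 90 day season, used only to show the window layout.
windows = list(model_windows(date(2000, 10, 15), date(2001, 1, 13)))
```

With this layout every day of the season, apart from the first and last 15 days, falls into exactly two windows, so each observation contributes to two consecutive models.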
The species compositions of the bins were again compiled
in presence/absence format. Logistic regressions were used
to relate the distributions of bird observations to four parameters of the physical environment: sea surface temper-
c
Copyright °2002
33
The Australasian Data Mining Workshop
tion to the fact that there are differences in the processes
linking environment with observations of these birds in the
two regions.
ature (◦ C), sea state (ordinal scale), sea ice concentration
(percent), and distance to coast (km). The model accuracies were assessed using the mean square prediction errors
(MSE). For models using the Prydz Bay data, MSE was assessed using cross-validation by voyage: that is, data from
half of all available voyages (chosen at random) were used
to estimate the model parameters. Data from the remaining voyages were used to assess the model accuracy. Crossvalidation is a widely used method of obtaining estimates of
model accuracy, particularly when data are limited [5; 15].
All MSE values are presented with reference to the null error rate. This is the mean square prediction error that is
obtained with a constant model and reflects the prevalence
of the species in question. Any model that fails to predict
more accurately than the null is no better than uninformed
guessing.
The importances of each of the environmental variables in
predicting the observations of these three species are illustrated in Figure 7. From late October until approximately
January, the most important predictor variables for snow
and white-chinned petrels were sea ice concentration and
sea state. Sea ice and sea state are co-variates: heavy sea
ice will prevent high sea states (wave heights). Snow petrels are known to be an ice-associated species and this is
reflected in the positive sign of the model coefficient (marked
on Figure 7). The reverse is true of white-chinned petrels.
The corresponding variable importances for cape petrels are
not relevant because the model was not accurate during this
period.
The standard logistic regression assumes that the data are
independent. When data are spatial in nature, this assumption is often violated because observations from one location are likely to be similar to observations from nearby
locations. This self-similarity is known as spatial autocorrelation [7]. Spatial autocorrelation can often be exploited
to improve the predictive accuracy of models. Accordingly,
we also applied the spatial autologistic model [1; 3]. This
is an extension of the logistic model that explicitly models
the spatial autocorrelation of the observations. The estimation of the parameters of the spatial autologistic model is
problematic and requires approximate maximum likelihood
techniques (see e.g. [12] for a discussion of the estimation of
such models). We used a Markov chain Monte Carlo implementation provided by LeSage [12].
During the latter half of the season the most important predictor variables were sea surface temperature and distance
to coast. The model parameter for distance to coast was
negative for snow and cape petrels, indicating that these
species were observed close to the coast. This matches the
known behaviour of the birds: during this time of the breeding season adult birds are feeding the newly hatched chicks
and thus forage predominantly close to the coastal colonies.
The autologistic model did not provide substantially better
predictive accuracy than the standard logistic model (results not shown). This suggests that the spatial variation
in the observations was adequately modelled by the spatial
variation in the environmental predictor variables. The additional computational demands of the autologistic model
are therefore not justified in this application.
The relative importance of each environmental variable in
predicting the distribution of observations of each species
was assessed. This was achieved by building a model using only a single environmental variable as a predictor. The
cross-validation predictive accuracy of this model was compared to the best predictive accuracy obtained using all four
predictor variables.
Although in this study we relied on direct observations of
sea state and sea surface temperature, these environmental variables may both be estimated using remote sensing
technology. The models developed here could therefore potentially be used to estimate at-sea distributions of seabirds
in other regions of the Antarctic. Regional differences in the
breeding distributions of seabird species would need to be
addressed.
5.3 Results and Discussion
The predictive accuracies of the logistic models are presented in Figure 6. For observations of snow petrels, good
predictive accuracies (MSE significantly less than the null
rate) were obtained for the entire summer breeding season
in both the Prydz Bay and Casey station regions. The model
for cape petrel observations was generally adequate in the
latter half of the season in both regions. The model for
white-chinned petrel observations was better than the null
for the majority of the season in Prydz Bay, but was no better than the null during the latter half of the season in the
Casey station region.
For snow and cape petrels, the model performance in the
Casey station region was similar to that obtained using the
Prydz Bay data (Figure 6). We can therefore conclude that
the processes linking bird observations and physical environment are similar in the two areas. The same was not true of
white-chinned petrels. During the latter half of the season
the model error was less than that of the null in Prydz Bay,
but not the Casey station region. This result draws atten-
6. DISCUSSION AND CONCLUSIONS
The collection of data from polar regions is an expensive
and difficult process. Such data are often noisy or incomplete and analyses using conventional statistical (hypothesis-testing) techniques can be extremely difficult. Data mining
and exploratory techniques may allow insights into trends
and anomalies to be obtained. The relevance of such findings extends beyond intrinsic scientific interest into fields
such as conservation and planning. Polar science plays a
key role in matters of global importance, including species
conservation and global climate change. There are therefore
social, scientific, and economic obligations to make the best
possible use of Antarctic scientific data.
The investigations presented here used data mining techniques to obtain results of ecological relevance, such as the
structures of seabird communities and the relationships of
seabird observations with the physical environment. The
techniques and findings also addressed matters of data management. Errors in data, which are often difficult to detect
through direct inspection, may become apparent in the results of the analyses. This was illustrated by the potential
errors in Antarctic tern and northern giant petrel records
discussed in section 4.
Spatial considerations are often of concern when dealing
with ecological data. In our models of seabird observations
and physical environmental parameters, the predictive ability of spatial autologistic models was found to be no better
than that of ordinary logistic models (in which spatial autocorrelation is ignored). The additional computational cost
of the spatial autologistic model (we used a computationally intensive Markov chain Monte Carlo implementation)
is therefore not justified in this application.
Figure 6: Mean square prediction error (MSE) of logistic
models of at-sea observations of three species of seabird in
two areas of the Antarctic (Prydz Bay and the Casey station region). The solid line is the logistic model error and
the dot-dash line is the null error rate. A filled circle indicates that the logistic MSE is significantly less than the null
error at that time (p<0.05, Wilcoxon paired sample test).
SNPE=snow petrels, CAPE=cape petrels, WCPE=white-chinned petrels.
Acknowledgments
The authors would like to thank L Belbin and M Riddle for
their ongoing support, and all observers who have recorded
at-sea observations over the past 22 years. G Cruickshank,
C Hodges, B Priest, and F Spruzen entered much of the data
in preparation for the analyses. D Watts constructed and
maintains the WoV database. Various freely-available Matlab toolboxes were used: the m_map mapping toolbox (Rich
Pawlowicz, http://www2.ocgy.ubc.ca/~rich/), the Econometrics toolbox (James P. LeSage, http://www.spatialeconometrics.com), and ML estimation of mixtures of multivariate Bernoulli distributions (Miguel Á. Carreira-Perpiñán,
http://cns.georgetown.edu/~miguel/).
[Figure 7 panels. Legend: SEATEMP, SEAICE, SEASTATE, COASTDIST. Horizontal axis: Month (N, D, J, F, M). Signs per time step as extracted:]
SNPE  Pri.  − + + + + − − − − −
SNPE  Sec.  − − − − − − + − + −
CAPE  Pri.  + − + + − − − − − −
CAPE  Sec.  − + − + − − − − − −
WCPE  Pri.  + − − − − + + + + +
WCPE  Sec.  − + + + + + + + + −
Figure 7: The two most important predictor variables (primary, Pri., and secondary, Sec.) for logistic models of at-sea
observations of three species of seabirds. A positive sign
indicates that the association was positive (i.e. observations were more likely with increasing values of the environmental variable). SNPE=snow petrels, CAPE=cape petrels, WCPE=white-chinned petrels; SEATEMP=sea surface temperature, SEAICE=sea ice concentration, SEASTATE=sea state (wave height), COASTDIST=distance to
nearest coast.
7. REFERENCES
[1] N. Augustin, M. Mugglestone, and S. Buckland. An autologistic model for the spatial distribution of wildlife.
J Appl Ecol, 33:339–347, 1996.
[2] M. Austin and L. Belbin. A new approach to the species
classification problem in floristic analysis. Aust J Ecol,
7:75–89, 1982.
[3] J. Besag. Spatial interaction and the statistical analysis
of lattice systems. J Roy Sta B, 36(2):192–236, 1974.
[4] C. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.
[5] K. Burnham and D. Anderson. Model selection and inference. Springer-Verlag, 1998.
[6] M. Carreira-Perpiñán and S. Renals. Practical identifiability of finite mixtures of multivariate Bernoulli distributions. Neural Comp, 12(1):141–152, 2000.
[7] N. Cressie. Statistics for spatial data, revised edition.
Wiley, 1993.
[8] M. Dufrêne and P. Legendre. Species assemblages and
indicator species: the need for a flexible asymmetric
approach. Ecol Monogr, 67:345–366, 1997.
[9] B. Everitt and D. Hand. Finite Mixture Distributions.
Monographs on Statistics and Applied Probability.
Chapman & Hall, 1981.
[10] C. Fraley and A. Raftery. How many clusters? Which
clustering method? Answers via model-based cluster
analysis. Computer J, 41:578–588, 1998.
[11] M. Gyllenberg, T. Koski, E. Reilink, and M. Verlaan.
Non-uniqueness in probabilistic numerical identification of bacteria. J Appl Prob, 31:542–548, 1994.
[12] J. LeSage. Bayesian estimation of limited dependent variable spatial autoregressive models. Geogr Anal, 32(1):19–35, 2000. http://www.spatialeconometrics.com.
[13] G. McLachlan and K. Basford. Mixture models: inference and applications to clustering. Marcel Dekker, Inc.,
New York, USA, 1988.
[14] R. Smith, D. Ainley, K. Baker, E. Domack, S. Emslie, B. Fraser, J. Kennett, A. Leventer, E. Mosley-Thompson, S. Stammerjohn, and M. Vernet. Marine
ecosystem sensitivity to climate change. BioScience,
49(5):393–404, 1999.
[15] M. Stone. Cross-validatory choice and assessment of
statistical predictions (with discussion). Biometrika,
64:29–35, 1974.
[16] E. Woehler. The distribution of seabird biomass in the
Australian Antarctic Territory: implications for conservation. Envir Cons, 17:256–261, 1990.
[17] E. Woehler, C. Hodges, and D. Watts. An atlas of the
pelagic distribution and abundance of seabirds in the
southern Indian Ocean, 1981 to 1990, volume 77 of
ANARE Research Notes. Australian Antarctic Division,
Tasmania, 1990.
[18] E. Woehler and G. Johnstone. Status and conservation
of the seabirds of the Australian Antarctic Territory.
In J. Croxall, editor, Seabird status and conservation:
a supplement, pages 279–308. ICBP Cambridge, 1991.
[19] E. Woehler, B. Raymond, and D. Watts. Decadal-scale
seabird assemblages in Prydz Bay, East Antarctica.
Mar Ecol Prog Ser (submitted), 2002.
[20] J. Wolfe. Pattern clustering by multivariate mixture
analysis. Multiv Be R, 5:329–350, 1970.
Combining Data Mining and Artificial Neural Networks
for Decision Support
Sérgio Viademonte
School of Information Management and Systems
Monash University
PO 197 Caulfield East
3145 Victoria, Australia
sergio.viademonte@sims.monash.edu.au
Frada Burstein
School of Information Management and Systems
Monash University
PO 197 Caulfield East
3145 Victoria, Australia
frada.burstein@sims.monash.edu.au
ABSTRACT
This paper describes an ongoing research project concerned with the application of data mining (DM) in
the context of decision support. Specifically, the project combines data mining and artificial neural
networks (ANN) in a computational model for decision support. Data mining is applied to automatically
induce expert knowledge from historical data and incorporate it into the decision model. The resulting
knowledge is represented as sets of knowledge rule bases. An ANN model is introduced to implement
learning and reasoning within the proposed computational model, which is applied in the domain of
aviation weather forecasting. The paper describes the proposed decision support model, introduces the
data pre-processing activities and the data mining approach, the data models used to generate the
knowledge rule bases, and their integration with the ANN system. It also presents an evaluation of the
performance of the proposed approach and some discussion of further directions in this research.
Keywords: Artificial neural networks, data mining, decision support, forecasting
1. INTRODUCTION
This paper presents a computational model for decision support based on a combination of data mining
and artificial neural network technologies. The proposed computational model has been applied in the
domain of aviation weather forecasting, specifically to identifying fog at airport terminals.
Weather forecasts are based on a collection of weather observations describing the state of the
atmosphere, such as precipitation levels, wind direction and velocity, dew point depression, etc. [1, 10].
Access to past decision situations and the knowledge derived from them can provide a valuable source
of improvement in forecasting rare events, such as fog. The complexity and diversity of the weather
observations and the large variation in the patterns of weather phenomena pose serious problems for
forecasters trying to come up with correlation models. Consequently, the area is a potential candidate
for KDD purposes [21].
Computational tools for decision support usually incorporate the knowledge of domain experts together
with specific explicit domain knowledge, e.g., factual knowledge. Early attempts at building expert
systems revealed the difficulties of capturing, representing and incorporating expert knowledge in those
systems [4, 19]. One possible approach to the knowledge acquisition problem is to automatically induce
expert knowledge directly from raw data [8]. However, this approach brings additional problems as the
amount and diversity of data increase, and demands specific attention. In this research project, data
mining technology is applied to build knowledge from data, specifically inducing domain knowledge
from raw data and also ensuring data quality.
In the context of this project, the preprocessed sets of raw data used as input to the data mining
algorithm are named data models. The knowledge obtained as a result of data mining experiments is
termed knowledge models. An artificial neural network (ANN) system provides an interface for the
decision-makers to test and validate hypotheses about the specific application domain. The ANN system
learns about the problem domain through the knowledge models, which are used as training sets.
Section 2 presents the proposed computational model for decision support; section 3 discusses the
knowledge discovery process, specifically the data mining phase, and the data and knowledge models.
Section 4 presents the applied artificial neural network system. Section 5 discusses the evaluation of
the decision support model and some of the results achieved, and section 6 presents some comments
and conclusions.
2. A MODEL FOR DECISION SUPPORT BASED ON DM AND ANN
The purpose of the proposed computational model is to support decision-making by recalling past facts
and decisions, hence inducing “chunks” of domain knowledge from past information and performing
reasoning upon this knowledge in order to verify hypotheses and reach conclusions in a given situation.
The proposed model creates an interactive software environment that uses data mining technology to
automatically induce domain knowledge from historical raw data, and an ANN based system as a core
for an advisory mechanism (see Figure 1).
The decision support model comprises a database (ideally a data warehouse), case bases and knowledge
rule bases. The database contains raw data from the application domain; in the case of this research
project, historical weather observations. The case base contains selected instances of relevant cases
from the specific problem at hand. In this project, each case represents a past occurrence and consists
of a set of feature/value pairs and the class to which the case belongs. Several case bases were
generated and used as input data to the data mining algorithm; they are named mining data sets in this
research project [20].
Knowledge rule bases are built based on the data mining results; they contain structured knowledge
that corresponds to relevant relations found (mined) in the case bases. Several rule bases
concerned with the occurrence of fog regardless of
whether it is a local fog or not. For this reason all Fog Type
instances with the value “LF” were changed to “F”,
meaning a fog case. All instances of Fog Type with a null
value were assigned “NF”, meaning a not fog case.
were generated according to the different parameters used for data
mining, e.g., distinct confidence factors and rule support
degrees.
The attributes PastWeather and PresentWeather were
transformed from numeric type to non-numeric type. These
attributes are qualitative (categorical) attributes that hold
weather codes.
The Rainfall attribute shows two problematic behaviors for data
mining: sparsity and lack of variability. It has 21.11% null
values in the fog class population and 30.60% null values in the not
fog class population. The rainfall volume is initially measured in
millimeters and presented to the forecasters, who express their
evaluation in codes representing ranges of millimeters. This
procedure makes sense given the nature of the
forecasting task, as it is almost impossible to differentiate
between precise measurements of rainfall such as 0.3 millimeters and 0.2
millimeters. The numerical values were therefore transformed into
categorical codes that express ranges of rainfall.
Instances of null rainfall were classified into code 0, no rain.
To implement this transformation a new text attribute,
Rainfall Range, was inserted, and a procedure was implemented to
calculate its value from the corresponding Rainfall
attribute value.
[Figure 1 components: Data Warehouse, Case Base, Data Mining (automatic discovery, rule evaluation), Knowledge Base, Advisory System (IDSS), DSS user (decision-maker).]
Figure 1 – A computational model for decision
support (as described in [21]).
The ANN mechanism is applied to process the obtained
knowledge (rule bases). The ANN uses the content of the
knowledge bases as learning data source, to build knowledge
about the specific application domain through its learning
algorithm [11, 13]. After the ANN-based learning procedure has
been executed, the advisory system provides an interface
through its consult mode for the user to test and validate
hypotheses about the current decision situation.
Wind Direction is measured by instruments and is
represented numerically in degrees. However, forecasters do
not use detailed numerical measurements when reporting a
forecast bulletin. A categorical
description by compass point is used instead: N for North,
S for South and so on. For example, consider a
reading of 22.1 degrees. In practice the forecasters
assign it to the compass point NNE, which lies at 22.5 degrees,
and the wind direction is said to be NNE instead of 22.1,
as for forecasting purposes the distinction among 22.1, 22.2 and
22.5 degrees is not significant; again a tolerance for imprecision
can be observed. It is assumed that each value in degrees
belongs to the closest compass point; the midpoint
between two adjacent compass points was therefore chosen as the boundary
between them, with the midpoint itself belonging to
the next upper compass point. To implement the transformation
of the Wind Direction attribute from degrees to compass points, a
new text attribute, WindCompas, was inserted into the data set.
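The boundary rule for the compass transformation can be sketched as below. The sketch assumes a 16-point compass, which is what a point NNE at 22.5 degrees implies; the function name wind_compass is hypothetical and is not the procedure used in the project.

```python
COMPASS_POINTS = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
                  "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]
SECTOR = 360.0 / len(COMPASS_POINTS)  # 22.5 degrees between adjacent points


def wind_compass(degrees):
    """Map a wind direction in degrees to the closest compass point.

    A reading exactly halfway between two points is assigned to the
    next upper point, following the rule stated in the text.
    """
    shifted = (degrees % 360.0) + SECTOR / 2.0
    return COMPASS_POINTS[int(shifted // SECTOR) % len(COMPASS_POINTS)]
```

With this rule a reading of 22.1 degrees maps to NNE, and the midpoint 11.25 degrees (halfway between N and NNE) also maps to NNE.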
3. INDUCING KNOWLEDGE THROUGH
DATA MINING
The database of weather observations used for automatic
induction of domain knowledge was generated from the Australian
Data Archive for Meteorology (ADAM) data repository. It
contains weather observations from Tullamarine Airport,
Melbourne, Australia, from July 1970 until June 2000, and has
49,901 records and 17 attributes.
3.1 Data preprocessing
The initial database had many data quality problems,
such as a significant number of empty and missing
values, sparse variables and problems with variability.
Extensive pre-processing was required to make the data
appropriate for the KDD process; see [20] for a detailed discussion
of this subject.
The not fog class data set initially had 48,963 instances. After all
data transformations and after the instances with null values were removed,
the resulting not fog class data set has 47,995 instances, and the
fog class data set has 938 instances. This database was used to
select the relevant cases for data mining.
The first database used in this project had 75 attributes, some of
them related to data quality control and codes describing
observations. For example, the dew point observation had two
associated attributes, one named DWPT and the other named
DWPT_QUAL; the latter holds data quality
control information, which was not relevant for our data mining
purposes. Many other observations (attributes in the database),
such as wind speed, wind direction and visibility, presented the same
problem, and the associated quality control attributes had to be removed. The Year and Day attributes
were not necessary for mining purposes; only the Month
attribute was kept. A derived attribute, previous afternoon dew point, was
calculated from the date, hour and dew point and inserted
in the table. The forecasters recommended this information as
very important for fog prognosis.
3.2 Generating the data models
The next step was to verify the data dimensionality and
class distribution in the database. As we are interested in
forecasting fog, the population was divided into two classes:
one class representing fog cases (named fog class), and a
second class representing cases where fog was not observed
(named not fog class). The observation database shows a low-prevalence classification: far fewer cases of the fog
class are present than of the not fog class.
The dataset has 938 instances of fog class and 47,995 instances
of not fog class. Figure 2, below, shows the fog classes
distribution in the entire weather observations database
(population) after data preprocessing. Fog class represents
1.92% of the population, and not fog class represents 98.08% of
the population.
Several data transformations were performed on the Bureau's
original data set. For example, the Fog Type attribute has 3 possible
values: “F” when it refers to a fog event, “LF” when it refers to
“Local Fog” and null when the event is not fog. This study is
randomly generated: a mining data set (used by the data mining
algorithm), an evaluation data set (for comparison purposes)
and a test data set (used as test data by the neural network
system). These data sets were randomly sampled from their
original data models in 60% and 80% proportions for the mining
sets and 10% proportions for the evaluation and test sets.
The final data models are obtained by joining fog data sets with
not fog data sets. Four mining data sets were obtained in this
way, named by corresponding sampling proportions: Mining
Model 1-60, Mining Model 1-80, Mining Model 2-60, and
Mining Model 10-60.
For example, Mining Model 1-60 means that this model was
obtained by a sample of 10% out of the overall not fog stratum,
and 60% of this sample was selected for data mining purposes.
The other names follow the same structure.
Figure 2 – Fog class distribution in the population (F = fog, NF = not fog; complete enumeration)
3.3 Applying data mining
This significant difference in class distribution required the
development of a specific sampling strategy in order to have a
more homogeneous class distribution in the training set [7, 15,
18]. The sampling approach used in this research project can be
classified as stratified multi-stage sampling [7, 23]. The original
population was divided into two strata: the fog stratum and the not fog
stratum. Sampling was conducted separately, in stages, within
each stratum. A random sampling approach was used for the fog
stratum, which was randomly split without replacement into
85% for the mining data set and 7% each for testing and evaluation.
This research project uses an associative rule generator
algorithm for data mining, based on the AIS algorithm [2]. An
association rule is an expression X → Y, where X and Y are sets
of predicates; X is the precondition of the rule, in disjunctive
normal form, and Y is the target post condition. Hence, the
outputs of the data mining experiments are associative rules. We
chose associative rules to represent the induced knowledge
because they are a clear and natural form of knowledge
representation that is easy for people to understand, and
because they fit our neural network system well. Section 4
addresses the integration issues between the knowledge models
and the ANN system.
The not fog stratum was sampled in a different fashion: data sets of increasing
size were selected from the whole stratum in 10%,
20% and 100% proportions. The sample comprising 10% of the
whole stratum was named Model1 and has 4,763 instances. The
second sample, Model2, comprises 20% of the stratum and has 9,572
instances. For comparison purposes the whole stratum was also
considered; we call it Model10, meaning 100% of the stratum.
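The sampling of the two strata can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the record identifiers, the fixed seed and the helper split_without_replacement are hypothetical, and the sample sizes simply follow from the stated percentages, so they need not match the instance counts reported for the data models.

```python
import random


def split_without_replacement(items, fractions, rng):
    """Shuffle a copy of items and cut it into consecutive, disjoint parts.

    fractions are proportions of the whole; any remainder is left unused.
    """
    shuffled = list(items)
    rng.shuffle(shuffled)
    parts, start = [], 0
    for fraction in fractions:
        size = int(len(shuffled) * fraction)
        parts.append(shuffled[start:start + size])
        start += size
    return parts


rng = random.Random(2002)  # fixed seed so the sketch is repeatable

# Hypothetical record identifiers standing in for the two strata.
fog_stratum = [("F", i) for i in range(938)]
not_fog_stratum = [("NF", i) for i in range(47995)]

# Fog stratum: mining, testing and evaluation parts in the stated proportions.
fog_mining, fog_test, fog_eval = split_without_replacement(
    fog_stratum, [0.85, 0.07, 0.07], rng)

# Not fog stratum: a 10% random sample, in the manner of Model1.
model1 = rng.sample(not_fog_stratum, int(len(not_fog_stratum) * 0.10))
```

Because each part is a slice of one shuffled list, the mining, testing and evaluation parts cannot share a record, which is what sampling without replacement requires.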
This section discusses some procedures that had to be
performed for data mining: feature selection, selection of
target attributes and attribute values in the database,
discretization or clustering of attributes, and the selection of
mining parameters. Although a detailed description of these
procedures is outside the scope of this paper, we believe that it is
important to mention them at least briefly. Nearly every data
mining project executes these procedures at
some level. If performed incorrectly, they can
compromise the success of the entire data mining
project. For those reasons we briefly discuss
the procedures we faced in our research
project.
The 10%, 20% and 100% percentages were selected arbitrarily,
based on the sizes of the not fog and fog strata; the aim is to build
data models without a significant difference between the
numbers of instances from each class. Therefore small
percentages were chosen from the not fog stratum. In addition,
the literature provides useful insights into incremental sampling:
according to Weiss and Indurkhya [23], typical subset
percentages for incremental sampling might be 10%, 20%, 33%,
50%, 67% and 100%. Using 50% or higher percentages would
keep the difference between the numbers of not fog and fog cases too large.
The 100% subset was
selected to verify the mining algorithm's performance with
a significantly different class distribution; the assumption was
that this subset would produce very few or no rules for fog
cases.
Selection of a target attribute requires choosing
an attribute from the case bases that discriminates the class under
study. In our project this is the attribute that indicates whether a particular
case corresponds to a fog observation or not; it is
named FogType. FogType was discretised into two
values, “F” and “NF”, representing whether fog
was or was not observed, respectively.
Feature selection means selecting the attributes that form the
antecedent part of the rules. In our research almost all attributes
were selected for data mining; the exceptions were the attributes
indicating Year and Day. In addition, there is an attribute in our
database that indicates the visibility over the airport runway.
Experiments with and without the Visibility attribute were
performed to check whether Visibility might be
considered a synonym for fog. The aim was to
compare the number of generated rules in both cases and the
prevalence of the Visibility attribute.
Table 1, below, shows the generated not fog stratum
models:
Table 1. Sample models for not fog class
Data Model    Sample size    Percentage of the whole stratum
Model1        4,763          10%
Model2        9,572          20%
Model10       47,995         100%
Selection of attributes values is an important procedure, which
addresses dimensionality reduction, together with feature and
cases selection. It could happen that some sets of attribute
values are not relevant to the survey variable, or have a small
frequency of occurrence in the database. In both cases such
These generated models are called data models in the context of
this research project, e.g. Model1, Model2 and Model10 are
data models. From each data model, three data sets were
The number of rules obtained, specifically for the fog class, was
considered too small for a good descriptive model. Hence, it was
decided to execute the experiments again with more flexible
parameters. Table 3 illustrates this; it summarises the
number of associative rules obtained from the mining sets
Model1_6 and Model1_8. Using a 70% rule confidence
degree, a minimum rule support of 8% and a maximum rule order of
7, we obtained 240 associative rules, 54 of them in the fog class, for
Model1_6, and 245 associative rules, 35 of them in the fog class, for
Model1_8. Keeping the 70% rule confidence degree and
changing the minimum rule support to 6% and the maximum rule
order to 10, we obtained 405 associative rules, 104 of them in the fog
class, for Model1_6 and 358 associative rules, 67 of them in the fog
class, for Model1_8. This is an increase of 50 rules in the
fog class for Model1_6 and 32 rules in the fog class for Model1_8.
The minimum number of cases, 50, remained constant in all
experiments, because it was considered a satisfactory number of
cases: not very restrictive but large enough for good coverage.
attribute values do not add any valuable information in data
mining and may be removed.
In our experiment some values or categories of the Hour, Rainfall
and Month attributes were excluded from the data mining
experiments, either because they had a low frequency in the
database or because they had a high frequency in both
classes, fog and not fog, which means high sensitivity but low
specificity. The sensitivity of a finding F in relation to
a class C measures its frequency in that class. The specificity
of a finding F in relation to a class C, on the other hand, is
inversely proportional to the frequency with which the finding F
appears in classes other than C.
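The two measures can be illustrated with a small sketch. The text only says that specificity is inversely proportional to the frequency of the finding in other classes; the sketch assumes the common form of one minus that frequency, and the toy cases and the Hour value are hypothetical.

```python
def sensitivity(cases, finding, target_class):
    """Frequency of the finding among the cases of target_class."""
    in_class = [c for c in cases if c["class"] == target_class]
    return sum(1 for c in in_class if finding(c)) / len(in_class)


def specificity(cases, finding, target_class):
    """One minus the frequency of the finding among cases of other classes
    (an assumed concrete form of "inversely proportional")."""
    others = [c for c in cases if c["class"] != target_class]
    return 1.0 - sum(1 for c in others if finding(c)) / len(others)


def is_hour_6(case):
    return case["Hour"] == 6


# Toy data: Hour = 6 is frequent in both classes.
cases = (
    [{"class": "F", "Hour": 6}] * 4
    + [{"class": "NF", "Hour": 6}] * 5
    + [{"class": "NF", "Hour": 15}]
)
sens = sensitivity(cases, is_hour_6, "F")
spec = specificity(cases, is_hour_6, "F")
```

In the toy data the finding occurs in every fog case and in five of the six not fog cases, so it is highly sensitive but barely specific, which is the situation in which an attribute value was excluded.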
Configuration of mining parameters includes selection of the
minimum desired level of rule confidence, the minimum support degree and
the maximum rule order. The support degree is the proportion of
records in the database that support a particular rule.
The maximum rule order parameter sets the maximum number
of antecedents of the rules. For example, in a rule like:
Discretization of numerical attributes is used to determine the
granularity of a certain variable. It can be used in general to
simplify the data mining problem. Also, most data mining tools
and algorithms, mainly those used in classification problems,
require discrete values rather than a range of values [9].
If DRYBULB <= 8.5 And TOTALCLO > 7 And TOTALLOW > 6 And
WINDSPEE <= 1.5
Then FOGTYPE = F, Confidence: 88.24%, Support: 9.29%
The rule order is 4, represented by the attributes dry bulb
temperature (Drybulb), amount of cloud over the airport
runway (Totalclo), amount of low cloud over the airport
runway (Totallow) and the wind speed (Windspee) at the
airport runway.
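How support, confidence and rule order relate to such a rule can be sketched as follows. The sketch assumes the usual definitions (support is the share of all cases that satisfy both antecedent and consequent, confidence is the share of antecedent matches that also satisfy the consequent); the mining tool used in the project may define them differently, and the five cases below are invented for illustration.

```python
def rule_metrics(cases, antecedent, consequent):
    """Return (support, confidence, order) of the rule antecedent -> consequent.

    antecedent is a list of predicates over a case; consequent is one predicate.
    """
    matches_if = [c for c in cases if all(test(c) for test in antecedent)]
    matches_both = [c for c in matches_if if consequent(c)]
    support = len(matches_both) / len(cases)
    confidence = len(matches_both) / len(matches_if) if matches_if else 0.0
    return support, confidence, len(antecedent)


# Antecedent of the example rule in the text (order 4).
antecedent = [
    lambda c: c["DRYBULB"] <= 8.5,
    lambda c: c["TOTALCLO"] > 7,
    lambda c: c["TOTALLOW"] > 6,
    lambda c: c["WINDSPEE"] <= 1.5,
]


def is_fog(case):
    return case["FOGTYPE"] == "F"


# Hypothetical cases, not taken from the weather database.
cases = [
    {"DRYBULB": 7.0, "TOTALCLO": 8, "TOTALLOW": 7, "WINDSPEE": 1.0, "FOGTYPE": "F"},
    {"DRYBULB": 8.0, "TOTALCLO": 8, "TOTALLOW": 7, "WINDSPEE": 1.5, "FOGTYPE": "F"},
    {"DRYBULB": 8.5, "TOTALCLO": 8, "TOTALLOW": 8, "WINDSPEE": 0.5, "FOGTYPE": "NF"},
    {"DRYBULB": 12.0, "TOTALCLO": 8, "TOTALLOW": 7, "WINDSPEE": 1.0, "FOGTYPE": "NF"},
    {"DRYBULB": 6.0, "TOTALCLO": 3, "TOTALLOW": 2, "WINDSPEE": 1.0, "FOGTYPE": "NF"},
]
support, confidence, order = rule_metrics(cases, antecedent, is_fog)
```

Three of the five toy cases satisfy the antecedent and two of those are fog cases, so the rule would pass a minimum confidence threshold of 50% but not one of 70%.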
In most data mining applications users are only
interested in rules with support and confidence above some
minimum threshold, so these parameters must be set
with care. Table 2 shows the mining parameters selected in our
experiments:
Table 2. Selected mining parameters
Mining Parameter           Value
Confidence Degree          50%, 70%, 80%, 90%
Minimum Support Degree     8%, 6%
Minimum Number of Cases    50
Maximum Rule Order         7, 10
Table 3. Mining Model1-6 and Model1-8 with different
mining settings
Mining set    Number of rules    Confidence degree    Rule support    Maximum rule order
Model1_6      240                70%                  8%              7
Model1_6      405                70%                  6%              10
Model1_8      215                70%                  8%              7
Model1_8      358                70%                  6%              10
Table 4 illustrates the discretization of three attributes in our
experiment. It shows the categories assigned to them
in the Data Mining Model1-6. Each attribute has
been assigned the same categories in all data models, e.g. Low,
Med, High for Dry Bulb, but the value ranges differ
between data models.
The data mining experiments generated rules with 70%, 80%
and 90% confidence degrees. As it was impossible to know
beforehand how many rules would be generated for a
specific confidence degree, it was decided to use the
percentages most frequently used in data mining applications [16, 22]. Our
goal is to verify whether there is a significant difference in
performance between different combinations of
parameters (confidence degree, minimum support degree and
maximum rule order) and, if so, which combinations of
these parameters are most appropriate when applying data
mining to problems similar to the one we are addressing in this
project. As the performance measure we consider the number
of rules obtained in each class, together with the number of item
sets in each rule. In general the descriptive capability of a rule is
associated with its number of item sets.
Two sets of data mining experiments were performed: one
using a minimum rule support of 8% and a
maximum rule order of 7, and a second using a
minimum rule support of 6% and a maximum rule order of 10.
The first set resulted in more restrictive models.
Table 4. Discretisation of numerical attributes (Mining Model 1-6)
Attribute                       Range                 Category
Dry Bulb (Celsius degrees)      <= 8.5                Low
                                > 8.5 and <= 12       Med
                                > 12                  High
Total Cloud Amount (Eighths)    <= 4                  Min
                                > 4 and <= 7          Med
                                > 7                   Max
Wind Speed (meters/second)      <= 1.5                Light
                                > 1.5 and <= 3.6      Lmode
                                > 3.6 and <= 6.2      Mode
                                > 6.2                 Fmode
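Applying the ranges of Table 4 to a numerical value can be sketched as below. The bounds and category labels come from the table; the attribute keys and the function name discretise are hypothetical, and the sketch does not reproduce how the mining tool derives the ranges in the first place.

```python
from bisect import bisect_left

# Inclusive upper bounds and category labels from Table 4 (Mining Model 1-6).
BINS = {
    "DryBulb": ([8.5, 12.0], ["Low", "Med", "High"]),
    "TotalCloud": ([4.0, 7.0], ["Min", "Med", "Max"]),
    "WindSpeed": ([1.5, 3.6, 6.2], ["Light", "Lmode", "Mode", "Fmode"]),
}


def discretise(attribute, value):
    """Return the category whose range contains value (upper bounds inclusive)."""
    bounds, labels = BINS[attribute]
    return labels[bisect_left(bounds, value)]
```

Using bisect_left keeps each upper bound inside its own range, so a dry bulb temperature of exactly 8.5 falls in Low and a wind speed of exactly 3.6 falls in Lmode, as in the table.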
build descriptive models as comprehensive as possible for our
application domain. The ultimate goal of such a knowledge
modeling process is to achieve a good predictive performance of
the decision support model. It includes a performance evaluation
of the decision support model that will demonstrate how
efficient is the descriptive model for this particular case.
The discretization of a particular attribute is determined in
proportion to the total number of cases in the database and the
frequency of occurrence of each attribute value. Categorical
attributes already express discrete values, whereas numerical
attributes must be discretised into ranges. The data mining
tool used automatically discretizes the numerical attributes based on
their frequency of occurrence and the number of their
categories.
The above definitions are important for a better understanding of
how we generated knowledge models and what they are. The
approach we used in this research project to generate
knowledge is based on the data models, a data mining algorithm
(our descriptive method) and the choice of data mining settings
(rule confidence degree, rule support and maximum rule order).
For each original data model and each combination of mining
parameters, we obtained a distinct set of associative rules. Not
only the number of rules differs, but also the rules'
itemsets. Each of these distinct sets of associative rules is
identified as a knowledge model, or a knowledge base.
3.4 Generating knowledge models
Knowledge discovery in databases constitute an interactive and
iterative process, having many steps and interrelated fields. We
consider knowledge modeling as an important part of the
knowledge discovery process. In our research project we
distinguish domain modeling, data modeling and knowledge
modeling from each other.
We understand domain modeling in the same way as it has been
widely used by decision support, expert systems and artificial
intelligence community in general. Basically, it is concerned with
building a model of a particular domain under investigation for
any particular purpose. Data modeling in the context of our
project relates to all the activities that transform raw data into
the data used for data mining. Such data modeling includes data
pre-processing, features selection, reduction and transformation,
and data sampling. Knowledge modeling in our context includes
the activities related to extracting knowledge from data. This
includes the interactive process of mining data, testing and
tuning different data mining parameters and data models, e.g.,
adding or eliminating data features, and even cases. In fact, it is
effectively an interactive and iterative process, where we try to
To illustrate our approach, let us consider the data mining
Model1-6 with a 70% confidence degree, minimum rule support
of 6% and maximum rule order of 10. Mining this data set
generated a particular knowledge base. Similarly, the data
mining Model1-6 with an 80% confidence degree, minimum rule
support of 6% and maximum rule order of 10 generated a
different knowledge base. The process continues in this fashion
until all data mining sets (case bases) have been mined with the
selected mining parameters, shown in Table 2, subsection 3.3.
Table 5 below shows the generated knowledge models for each
data model, according to their respective levels of confidence
degree.
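The procedure described above amounts to running the mining step once per combination of data model and confidence degree. A minimal sketch of that enumeration follows; the MiningRun class and enumerate_runs function are hypothetical names, the data model names are those listed in Table 5, and the mining step itself is deliberately left out because the tool's interface is not described here.

```python
from dataclasses import dataclass
from itertools import product
from typing import List


@dataclass(frozen=True)
class MiningRun:
    """One mining configuration; each run yields one knowledge model."""
    data_model: str
    confidence: float
    min_support: float
    max_order: int


def enumerate_runs(data_models: List[str], confidences: List[float],
                   min_support: float, max_order: int) -> List[MiningRun]:
    """Pair every data model with every confidence degree."""
    return [MiningRun(model, conf, min_support, max_order)
            for model, conf in product(data_models, confidences)]


# Second set of experiments: minimum rule support 6%, maximum rule order 10.
runs = enumerate_runs(
    ["Model1-6", "Model1-8", "Model2-6", "Model10-6"],
    [0.70, 0.80, 0.90],
    min_support=0.06,
    max_order=10,
)
```

With four data models and three confidence degrees this gives twelve runs, one for each F/NF/Total group reported in Table 5.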
Table 5. Generated knowledge models: generated rules by rule
confidence degree

Mining Data Models         70%                80%                90%
                       F    NF  Total     F    NF  Total     F    NF  Total
Mining Model1-6V3     104   301   405     37   291   328     16   204   220
Mining Model1-8V3      67   291   358     23   291   314     12   228   240
Mining Model2-6V3      45   283   328     20   283   303     12   274   286
Mining Model10-6V3     10   279   289      9   279   288      9   279   288
This table refers to the experiment using minimum rule support
of 6% and maximum rule order of 10; this setting is identified
by the suffix "V3" in each mining data model. In Table 5, "F"
relates to the fog class and "NF" to the not fog class. Rules with
a 50% confidence degree were also generated using the mining
Model10-6; due to space limitations they are not included in
Table 5, but this does not compromise the understandability of
the proposed approach.

For the confidence level of 90%, too few rules were obtained for
the fog class: at most 16 rules when using data model Model1-6
and 12 rules when using data model Model2-6. These numbers
of rules are unlikely to be enough for a satisfactory description
of the fog phenomenon. The performance evaluation will show
how this affects the predictive capacity of our decision support
model.

4. THE ARTIFICIAL NEURAL NETWORKS SYSTEM
We applied an ANN system as the interface of our decision
model. The ANN system learns about the problem domain
through the knowledge models, which are used as training sets.
Besides implementing learning capability in our decision support
model, the ANN system provides an interface for the
decision-makers to test and validate hypotheses about the
specific application domain.

For the ANN interface we use the Components for Artificial
Neural Networks (CANN) framework [4]. The CANN
framework is a research project that allows neural networks to
be constructed on a component basis. The CANN project
addresses the design and implementation of a framework
architecture for decision support systems that rely on artificial
neural network technology [17].
The CANN components are designed in an object-oriented way.
The framework implements a class hierarchy to represent a
particular application domain, the domain evidences, classes and
the relationships among evidences. Figure 3 presents the screen
of the CANN system that represents the evidences (attributes)
used to identify the fog phenomenon. At the left in Figure 3, a
list of evidences (attributes) about weather forecasting is
presented. We selected the evidence Dewpoint to show its
properties; for example, it is a string attribute and it is divided
into four categories: Low, when the temperature is less than or
equal to 4 degrees Celsius; Med, between 4 and 6 degrees
Celsius; High, between 6 and 9 degrees Celsius; and Max, when
the temperature is higher than 9 degrees Celsius.

Figure 3: Weather evidences modelled into CANN

The CANN system implements a mechanism that associates a
data set with a particular ANN model, for example, the
Combinatorial Neural Model (CNM) [13]. Through the ANN
learning algorithm, CANN implements a learning mechanism.
Figure 4 illustrates the outcome of the learning process executed
by CANN in the meteorological domain.

Figure 4: Learning about meteorological domain.

CANN functionality is well suited to the purpose of our project,
as it offers flexible domain representation, learning and
consulting functionalities. A decision-maker interacts with the
CANN consulting mechanism in two ways: case consult and
case base consult. A case consult presents to the decision-maker
a selection of evidences and their respective evaluation of
relevance to the situation at hand. CANN sets up a set of
hypotheses based on the presented input data. It evaluates the
selected evidences and calculates a confidence degree for each
hypothesis. The inference mechanism appoints the hypothesis
with the highest confidence degree as the most suitable solution
(class) to the problem.

A case base consult is similar to the case consult; however,
instead of presenting one single case (or one set of evidences)
each time, several cases are simultaneously presented to the
ANN system. It evaluates the set of cases in the same way it
does for a single case. Figure 5 illustrates a case base consult
session.

Figure 5: A case base consult session.

For example, case 119 in Figure 5 is indicated as a Fog case with
a confidence degree of 0.952, supported by the evidences Total
Cloud Amount Max and Wind Speed Light. Case 120 is indicated
as a Not Fog case with a confidence degree of 0.909, supported
by the evidences Drybulb High and Total Cloud Amount Med.

A detailed discussion of the CANN class hierarchy is outside the
scope of this paper, as are software engineering design issues.
Readers interested in these subjects should refer to [3, 4]. The
Combinatorial Neural Model and its learning, pruning and
consulting algorithms are presented in [12, 13].

4.1 Mapping associative rules into the ANN topology
The CANN knowledge representation schema reflects the
knowledge model structure and content. The rules are directly
mapped onto the ANN topology, and simultaneously
represented through a symbolic mechanism [14]. Rules
describing relations in the weather forecasting domain are
represented by neurons and synapses. Figure 6 exemplifies this
property. The rule I3 & I4 & In => F corresponds to the
strengthened connections among the input nodes I3, I4 and In,
the combinatorial node C3, and the output node F of the ANN.
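The mapping from rules to network topology, together with the consult step that appoints the hypothesis with the highest confidence degree, can be sketched as follows. This is a simplified illustration and not the CANN implementation: it assumes that a combinatorial node acts as a fuzzy AND (minimum) over its input evidences and that an output node acts as a fuzzy OR (maximum) over its combinatorial nodes. The weights reuse the confidence degrees quoted for cases 119 and 120 purely as illustrative values, and the function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# One rule = one combinatorial node: (input evidences, hypothesis, weight).
Rule = Tuple[Tuple[str, ...], str, float]

rules: List[Rule] = [
    (("Total Cloud Amount=Max", "Wind Speed=Light"), "F", 0.952),
    (("Dry Bulb=High", "Total Cloud Amount=Med"), "NF", 0.909),
]


def consult(rules: List[Rule], case: Dict[str, float]) -> Dict[str, float]:
    """Confidence degree per hypothesis for one case.

    `case` maps each observed evidence to a degree in [0, 1];
    evidences that are absent count as 0.
    """
    scores: Dict[str, float] = {}
    for evidences, hypothesis, weight in rules:
        # Combinatorial node: fires only as strongly as its weakest input.
        activation = min(case.get(e, 0.0) for e in evidences)
        # Output node: keeps the strongest weighted signal it receives.
        scores[hypothesis] = max(scores.get(hypothesis, 0.0), activation * weight)
    return scores


def decide(scores: Dict[str, float]) -> Optional[str]:
    """Hypothesis with the highest confidence, or None if nothing fired."""
    if not scores or max(scores.values()) <= 0.0:
        return None  # corresponds to a "no conclusion" outcome
    return max(scores, key=scores.get)


case_119 = {"Total Cloud Amount=Max": 1.0, "Wind Speed=Light": 1.0}
scores_119 = consult(rules, case_119)
```

In this sketch case_119 activates only the fog rule, so decide(scores_119) selects F; a case that matches no rule produces no conclusion.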
[Network diagram: input nodes I1, I2, I3, I4, ..., In;
combinatorial nodes C1, C2, C3, C4, ...; output nodes F and NF]
Figure 6. Incorporating rules into ANN topology.

For example, consider the following rule:
Rule 1 for Fog Class:
If Total Cloud Amount = Max
And Wind Speed = Light
And Wind Direction = SE
Then Fog Type = F.

The above rule is mapped into the ANN topology by
representing I3 as Total Cloud Amount Max, I4 as Wind Speed
Light and In as Wind Direction SE, and by considering the
hypothesis NF as a not fog case and F as a fog case. Additional
information, such as the rule confidence degree, is represented
in the ANN topology as a confidence level associated with a
particular evidence.

5. VALIDATION
The validation of the discovered knowledge is based on the
ability of the model to correctly identify meteorological
observations, specifically a fog case or a not fog case. The
performance of the model relies not only on the applied
computation technologies (data mining and ANN), but also on
the strategy we applied to obtain the data and knowledge
models, e.g., sampling strategy, pre-processing, and mining
parameters. Due to space limitations we cannot discuss all
these issues in this paper, but it is important to understand that
all of them have an implication for the performance of our
decision support model.

We selected data model Model1-6, with 70%, 80% and 90%
rule confidence degrees, 6% minimum rule support and
maximum rule order 10. We identify each case set by appending
the rule confidence degree to the data model name; therefore
Model1-6-70 corresponds to the data model generated when
selecting a 70% rule confidence degree, Model1-6-80 to an 80%
rule confidence degree and Model1-6-90 to a 90% rule
confidence degree.

Table 5, in section 3.4, gives the number of rules for each of
these data models. They were used as the ANN learning bases.
For testing we used a subset of the test set generated for data
model Model1-6. The test set consists of 120 randomly selected
cases: 60 cases of not fog and 60 cases of fog.

The results of this experiment are presented in Table 6. They
point to the efficiency and applicability of the combined
approach, data mining and ANN, considering an average of
67.5% of correct classifications.

Table 6. Test Data Model1_6 with different rule confidence
degrees

Learning Set   Correct       Misclassified   No conclusion   Total Cases
Model1-6-70    82 (68.30%)   28 (23.3%)      10 (8.3%)       120
Model1-6-80    81 (67.50%)   27 (22.5%)      12 (10.0%)      120
Model1-6-90    80 (66.67%)   25 (20.8%)      15 (12.50%)     120

The ANN system correctly classified 68.3% of the cases when
trained with rules obtained with a 70% confidence degree.
The performance decreased slightly, to 67.5%, when training
with rules obtained with an 80% confidence degree, and
decreased to 66.67% when training with rules obtained with a
90% confidence degree. Those results are not a surprise, as
increasing the rule confidence degree restricts the number of
rules obtained; therefore a less descriptive model is expected.
The worst performance is observed when applying Model1-6-90.
This happens because this data model has only 16 rules
describing fog, which is not enough coverage to describe the fog
phenomenon. Even so, 66.67% of correct classifications can be
considered a surprisingly good result, considering there are only
16 rules about fog in the rule base.

An average of 67.5% of the cases were correctly classified, which
indicates the applicability of the proposed model for decision
support in classification problems.

Table 7. Data Model1_6 performance discriminating fog and
not fog classes

Learning Set   Correct Fog   Correct Not Fog
Model1-6-70    42 (70.0%)    40 (66.67%)
Model1-6-80    39 (65.0%)    42 (70%)
Model1-6-90    39 (65.0%)    41 (68.33%)

Analyzing the performance in each class individually also
indicates that the 70% rule confidence degree generates the best
set of rules, achieving the highest performance of 70.0% of
fog cases correctly classified. What basically differentiates each
of the training models in our experiment is the number of rules
representing fog cases.

The change in the number of rules representing not fog cases
does not produce a significant change in performance,
specifically 70.0% in the best case and 66.67% in the worst
case. This is because there are enough rules describing not fog
cases. The same cannot be said of the fog class; a decrease in
fog rules in Model1-6-90 caused a significant loss in predictive
performance, with correct classification dropping from 70.0% in
the best case to 65.0%.

Our experiment so far indicates that the 70% rule confidence
degree seems to be the best value for this parameter, even when
faced with the problem of low prevalence classification.
However, 70.0% may not be considered a satisfactory
performance in many applications. Additional experiments can
be carried out to improve the system performance, for example
applying different sampling proportions to obtain a more
homogeneous class distribution, or applying different data
mining parameters, such as relaxing the minimum rule support
to obtain a higher number of rules or even increasing the
maximum rule order to generate rules with larger itemsets and
therefore better descriptive capabilities.
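The percentages reported in Table 6 follow directly from the case counts, and can be cross-checked with a few lines of arithmetic. The sketch below only recomputes the published figures; the dictionary layout and function name are hypothetical.

```python
from typing import Dict

# Case counts copied from Table 6 (120 test cases per learning set).
TABLE_6 = {
    "Model1-6-70": {"correct": 82, "misclassified": 28, "no_conclusion": 10},
    "Model1-6-80": {"correct": 81, "misclassified": 27, "no_conclusion": 12},
    "Model1-6-90": {"correct": 80, "misclassified": 25, "no_conclusion": 15},
}


def rates(counts: Dict[str, int]) -> Dict[str, float]:
    """Percentage of test cases falling in each outcome."""
    total = sum(counts.values())
    return {outcome: round(100 * n / total, 2) for outcome, n in counts.items()}


accuracy = {name: rates(counts)["correct"] for name, counts in TABLE_6.items()}

# Average over the three learning sets, computed from the raw counts.
total_correct = sum(c["correct"] for c in TABLE_6.values())
total_cases = sum(sum(c.values()) for c in TABLE_6.values())
mean_accuracy = 100 * total_correct / total_cases
```

The recomputed rates are 68.33%, 67.5% and 66.67%, and the pooled average is 67.5%, in line with the figures quoted in the text.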
Further experiments are necessary for more accurate
conclusions; however, the results obtained so far indicate the
potential applicability of our approach to automatically induce
domain knowledge, to handle the problem of low prevalence
classification in databases, and to incorporate the domain
knowledge and implement learning capabilities in the proposed
model for decision support.

6. CONCLUSION AND COMMENTS
This paper presents a decision support model and its application
to a real world problem. We proposed a decision support model
combining data mining and neural networks. Data mining is
chosen to automatically induce domain knowledge from raw
data, and ANN because of its adaptive capabilities, which are
important for providing the means for implementing inductive
and deductive learning capabilities [6, 19]. Besides that, this
project came up with an efficient sampling strategy to handle
problems of dimensionality and class distribution, mainly the
low prevalence classification problem, and conducted an
in-depth investigation of the pre-processing stage to ensure
data quality for data mining.

The results obtained so far demonstrate the applicability of the
proposed decision support model in aviation weather
forecasting, specifically to correctly identify the fog phenomenon.

The system performance can be further improved through some
additional procedures. For example, in our experiments we used
a neural network topology with a maximum order of three. This
means that the neural network combinatorial layer associates at
most three input neurons. Using a higher combinatorial order
will add more evidences to the neural network learning and
evaluation procedures. Considering more evidences in the case
analysis can potentially improve the system performance.
Additionally, considering a higher number of antecedent itemsets
during data mining and relaxing the learning and pruning
threshold parameters in the ANN learning algorithm may also
improve performance.

In addition, issues concerning system integration may be
assessed. Currently case and knowledge bases are stored as
relational tables; different technologies are under evaluation for
storing the knowledge bases, for example XML document
formats and PMML (http://www.dmg.org), in order to facilitate
their integration with the Java-based ANN system.

7. ACKNOWLEDGEMENTS
This research is partly funded by the Australian Research
Council and Monash University grants. We would like to thank
the Regional Forecasting Centre of the Australian Bureau of
Meteorology, Victorian Regional Office, for providing
meteorological data and support. We also thank Dr Robert
Dahni and Mr. Scott Williams from the Regional Forecasting
Centre for their help in validating results in relation to aviation
weather forecasting.

8. REFERENCES
[1] Auer, A. H. J. (1992). Guidelines for Forecasting Fog.
Part 1: Theoretical Aspects: Meteorological Service of
New Zealand.
[2] Agrawal, R., Imielinski, T., & Swami, A. (1993, May).
Mining association rules between sets of items in
large databases. In Proceedings of Conference on
Management of Data, (pp. 207-216). Washington, DC.
[3] Beckenkamp, F. a., & Pree, W. (1999). Neural Network
Framework Components. In S. D. C. a. J. R. Fayad M.
(Ed.), Object-Oriented Application Framework:
Applications and Experiences. (1 ed.): John Wiley.
[4] Beckenkamp, F. a., & Pree, W. (2000, May).
Building Neural Networks Components. In Proceedings of
Neural Computation 2000 - NC'2000, Berlin, Germany.
[5] Buchanan, B., & Feigenbaum, E. (1978). DENDRAL and
META-DENDRAL: Their applications dimensions.
Artificial Intelligence, 1, 5-24.
[6] Carbonell, J. G. (1989, September). Introduction:
Paradigms for Machine Learning. Artificial Intelligence,
40, 1-9.
[7] Catlett, J. (1991). Megainduction: Machine learning on
very large databases. University of Technology, Sydney,
Australia.
[8] Fayyad, U. M., Mannila, H., & Ramakrishnan, R. (1997).
Data Mining and Knowledge Discovery. (Vol. 3). Boston:
Kluwer Academic Publishers.
[9] Howard, C. M., & Rayward-Smith, V. J. (1998).
Discovering Knowledge from low-quality meteorological
databases. Knowledge Discovery and Data Mining.
(pp. 180-202).
[10] Keith, R. (1991). Results And Recommendations Arising
From An Investigation Into Forecasting Problems At
Melbourne Airport. (Meteorological Note 195).
Townsville: Bureau of Meteorology, Meteorological
Office.
[11] Machado, R. J., Barbosa, V. C., & Neves, P. A. (1998).
Learning in the Combinatorial Neural Model. IEEE
Transactions on Neural Networks, 9, September 1998.
[12] Machado, R. J., & Rocha, A. F. (1989). Handling
Knowledge in High Order Neural Networks: the
Combinatorial Neural Model. (Technical Report
CCR076). Rio de Janeiro, Brazil: IBM Rio Scientific
Center.
[13] Machado, R. J., & Rocha, A. F. (1990). The
combinatorial neural network: a connectionist model for
knowledge based systems. In B. B. Bouchon-Meunier,
Yager, R. R. & Zadeh, L. A. (Ed.), Uncertainty in
knowledge bases. Berlin: Springer Verlag.
[14] Medsker, L. R. (1995). Hybrid Intelligent Systems. (Vol.
1). Boston, USA: Kluwer Academic Publishers.
[15] Mohammed, J. Z., Parthasarathy, S., L. W., & Ogihara,
M. (1996). Evaluation of Sampling for Data Mining of
Association Rules. (Technical Report 617). Rochester,
New York: The University of Rochester, Computer
Science Dept.
[16] Piatetsky-Shapiro, G., & Frawley, W. (1991). Knowledge
Discovery in Database. MIT Press.
[17] Pree, W., Beckenkamp, F. a., & Rosa, S. I. V. (1997,
June 17-20). Object-Oriented Design &
Implementation of a Flexible Software Architecture for
Decision Support Systems. In Proceedings of 9th
International Conference on Software Engineering &
Knowledge Engineering - SEKE'97, (pp. 382-388).
Madrid, Spain.
[18] Provost, F., Jensen, D. & Oates, T. (2001). Progressive
Sampling. In H. L. a. H. Motoda (Ed.), Instance Selection
and Construction for Data Mining (Vol. 1, pp. 151-170).
Norwell, Massachusetts, USA: Kluwer Academic
Publishers.
[19] Tecuci, G. a., & Kodratoff, Y. (1995). Machine Learning
and Knowledge Acquisition: Integrated Approaches.
London, UK: Academic Press.
[20] Viademonte, S., Burstein, F., Dahni, R. & Williams, S.
(2001). Discovering Knowledge from Meteorological
Databases: A Meteorological Aviation Forecast Study. In
Proceedings of Data Warehousing and Knowledge
Discovery, Third International Conference - DaWaK
2001, (pp. 61-70). Munich, Germany: Springer-Verlag.
[21] Viademonte, S. B. & Burstein, F. (2001). An Intelligent
Decision Support Model for Aviation Weather
Forecasting. In Proceedings of Advances in Intelligent Data
Analysis: 4th International Conference, IDA 2001, (pp.
278-288). Cascais, Portugal: Springer-Verlag.
[22] Weiss, S. M., Galen, R. S. a., & Tadepalli, P. V. (1990).
Maximizing the predictive value of production rules.
Artificial Intelligence, 47-71.
[23] Weiss, S. M., & Indurkhya, N. (1998). Predictive Data
Mining: A Practical Guide. (Vol. 1). San Francisco, CA:
Morgan Kaufmann Publishers, Inc.

About the authors:
Sérgio Viademonte is a Doctoral candidate at the School of
Information Management and Systems at Monash University.
His research is supported by ORSP and Monash Graduate
Scholarships. Sérgio has been working on hybrid architectures
for expert systems since 1995, when he obtained a Master in
Administration, Information Systems Area (by Research) from
the Federal University of Rio Grande do Sul (UFRGS), RS, Brazil.

Dr Frada Burstein is Associate Professor and Knowledge
Management Academic Program Director at the School of
Information Management and Systems at Monash University.
She is a Chief Investigator for an ARC funded industry
collaborative project with the Bureau of Meteorology titled
"Improving Meteorological Forecasting Practice with
Knowledge Management Systems". The results reported in this
paper address a component of this project.
SemiDiscrete Decomposition: A Bump Hunting Technique
S. McConnell
School of Computing, Queen’s University,
Kingston, Canada.
mcconnell@cs.queensu.ca
D.B. Skillicorn
School of Computing, Queen’s University,
Kingston, Canada, and
Faculty of Information Technology, University of
Technology, Sydney.
skill@cs.queensu.ca
ABSTRACT
1. INTRODUCTION
2. WHAT SDD IS DOING
3. AN APPLICATION
4. RELATED WORK
[Figure 4: scatter plot of points labelled 1 to 33; labelled axes U1 and U3]
[Figure 5: scatter plot of points labelled 1 to 33; axes U1, U2 and U3]
5. CONCLUSION
6. REFERENCES
An Overview of Temporal Data Mining
Weiqiang Lin
Department of Computing
I.C.S., Macquarie University
Sydney, NSW 2109, Australia
wlin@ics.mq.edu.au
Mehmet A. Orgun
Department of Computing
I.C.S., Macquarie University
Sydney, NSW 2109, Australia
mehmet@ics.mq.edu.au
Graham J. Williams
CSIRO Data Mining
GPO Box 664
Canberra ACT 2601, Australia
Graham.Williams@csiro.au
ABSTRACT
Temporal Data Mining is a rapidly evolving area of research that is at the intersection of several disciplines, including statistics, temporal pattern recognition, temporal
databases, optimisation, visualisation, high-performance computing, and parallel computing. This paper is intended
to serve as an overview of temporal data mining in research and applications.
1. INTRODUCTION
Temporal Data Mining is a rapidly evolving area of research that is at the intersection of several disciplines, including statistics (e.g., time series analysis), temporal pattern recognition, temporal databases, optimisation, visualisation, high-performance computing, and parallel computing. This paper is intended to serve as an overview of
temporal data mining in research and applications. In addition to providing a general overview, we motivate the importance of temporal data mining problems within Knowledge
Discovery in Temporal Databases (KDTD), which include
formulations of the basic categories of temporal data mining
methods, models, techniques and some other related areas.
The paper is structured as follows. Section 2 discusses the
definitions and tasks of temporal data mining. Section 3 discusses the issues in temporal data mining techniques. Section 4 discusses two major problems of temporal data mining, those of similarity and periodicity. Section 5 provides
an overview of time series temporal data mining. Section 6
moves on to a discussion of several important challenges in
temporal data mining and outlines our general distribution
theory for answering some of those challenges. The last section
concludes the paper with a brief summary.
2. DEFINITION AND TASKS OF TEMPORAL DATA MINING
The temporal data mining component of the KDTD process
is concerned with the algorithmic means by which temporal patterns are extracted and enumerated from temporal
data. Some problems for temporal data mining in temporal
databases include questions such as: How can we provide
access to temporal data when the user does not know how
to describe the goal in terms of a specific query? How can
we find all the time related information and understand a
large temporal data set? And so on.
2.1 Definition and Aims
In this section, we first give basic definitions and aims of
Temporal Data Mining. The definition of Temporal Data
Mining is as follows:
Definition 1. Temporal Data Mining is a single step in
the process of Knowledge Discovery in Temporal Databases
that enumerates structures (temporal patterns or models)
over the temporal data, and any algorithm that enumerates
temporal patterns from, or fits models to, temporal data is a
Temporal Data Mining Algorithm.
Basically, temporal data mining is concerned with the analysis of temporal data and with finding temporal patterns
and regularities in sets of temporal data. Also, temporal
data mining techniques allow for the possibility of computer-driven, automatic exploration of the data. Temporal data
mining has led to a new way of interacting with a temporal
database: specifying queries at a much more abstract level
than, say, Temporal Structured Query Language (TSQL)
permits (e.g., [17], [16]). It also facilitates data exploration
for problems that, due to their multiplicity and multi-dimensionality,
would otherwise be very difficult for humans to explore, regardless of the use of, or efficiency issues with, TSQL.
Temporal data mining tends to work from the data up, and
the best known techniques are those developed with an orientation towards large volumes of time related data, making
use of as much of the collected temporal data as possible to
arrive at reliable conclusions. The analysis process starts
with a set of temporal data and uses a methodology to develop
an optimal representation of the structure of the data, during
which knowledge is acquired. Once temporal knowledge has been acquired, this process can be extended to a
larger set of the data, working on the assumption that the
larger data set has a structure similar to the sample data.
2.2 Temporal Data Mining Tasks
A relevant and important question is how to apply data mining techniques to a temporal database. According to techniques of data mining and the theory of statistical time series
analysis, the theory of temporal data mining may involve
the following areas of investigation, since a general theory
for this purpose is yet to be developed:
1. Temporal data mining tasks include:
• Temporal data characterization and comparison,
• Temporal clustering analysis,
• Temporal classification,
• Temporal association rules,
• Temporal pattern analysis, and
• Temporal prediction and trend analysis.
2. A new temporal data model (supporting time granularity and time-hierarchies) may need to be developed
based on:
• Temporal data structures, and
• Temporal semantics.
3. A new temporal data mining concept may need to be
developed based on the following issues:
• the task of temporal data mining can be seen as
a problem of extracting an interesting part of the
logical theory of a model, and
• the theory of a model may be formulated in a logical formalism able to express quantitative knowledge and approximate truth.
In addition, temporal data mining needs to include an investigation of tightly related issues such as temporal data
warehousing, temporal OLAP, computing temporal measurements, and so on.
3.2 Temporal Cluster Analysis
Temporal clustering according to similarity is a concept which
appears in many disciplines, and there are two basic approaches
to analysing it. One is the temporal similarity measure approach and the other is the temporal optimal partition
approach.
In temporal data analysis, many temporal data mining applications make use of clustering according to similarity and
optimization of temporal set functions. If the number of
clusters is given, then clustering techniques can be divided
into three classes: (1) metric-distance based techniques, (2)
model-based techniques and (3) partition-based techniques.
These techniques can occasionally be used in combination,
such as probability-based vs. distance-based clustering analysis. If the number of clusters is not given, then we can use
Non-Hierarchical Clustering Algorithms to find the number of clusters k.
In recent years, temporal clustering techniques have been
developed for temporal data mining, e.g., [23]. Some studies
have used the EM algorithm and the Monte-Carlo
cross validation approach (e.g., [12; 22; 13]).
3.3 Induction
A temporal database is a store of temporally related information, but more important is the information which can be
inferred from it ([3; 4]). There are two main inference techniques: temporal deduction and temporal induction.
1. Temporal deduction is a technique (e.g., in [24]) to infer
the information that is a temporal logical consequence
of the information in the temporal database.
3. TEMPORAL DATA MINING TECHNIQUES
2. Temporal induction can be described as a technique
(e.g., in [25]) to infer temporal information that is generalised from the temporal database. Induction has
been used in the following ways within data mining:
1) Decision Trees and 2) Rule Induction.
A common form of a temporal data mining technique is rule
(or functions) discovery. Various types of temporal functions
can be learnt, depending upon the application domain. Also,
temporal functions (or rules) can be constructed in various
ways. They are commonly derived by one of the two basic
approaches, bottom-up or top-down induction.
3.1
4. TWO FUNDAMENTAL TEMPORAL DATA
MINING PROBLEMS
Classification in Temporal Data Mining
The basic goal of temporal classification is to predict temporally related fields in a temporal database based on other
fields. The problem in general is cast as determining the
most likely value of the temporal variable being predicted
given the other fields, the training data in which the target variable is given for each observation, and a set of assumptions representing one’s prior knowledge of the problem. Temporal classification techniques are also related to
the difficult problem of density estimation.
In recent years, a lot of the work has been done in nontemporal classification areas by using “Statistical Approaches
to Predictive Modelling”. Some techniques have been established for estimating a categorical variable, e.g., [26; 5; 20]:
kernel density estimators [20; 11] and K-nearest-neighbor
method [20]. These techniques are based upon the theory
of statistics. Some other techniques such as in [7; 8; 6] are
based upon the theory of databases. Temporal classification
techniques have not been paid much attention so far. In
recent years, the main idea in temporal classification is the
straightforward use of sampling techniques within time series methods (distribution) to build up a model for temporal
sequences.
The Australasian Data Mining Workshop
84
In recent years, two kinds of fundamental problems have
been studied in temporal data mining area. One is the Similarity Problem which is to find a time sequence (or TDB)
similar to a given sequence (or query) or to find all pairs
of similar sequences. The other is the Periodical Problem
which is to find periodic patterns in TDB.
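The two problems defined above can be made concrete with a minimal sketch. It assumes, for illustration only, that similarity means Euclidean distance after z-normalisation (which removes differences in baseline and scaling) and that a period is the lag with the highest sample autocorrelation. Neither choice, nor any of the function names, comes from the paper.

```python
import math


def z_normalise(seq):
    """Remove the baseline (mean) and scaling (standard deviation) of a sequence."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / n)
    if std == 0.0:
        return [0.0] * n
    return [(x - mean) / std for x in seq]


def distance(a, b):
    """Euclidean distance between two equal-length sequences after z-normalisation."""
    za, zb = z_normalise(a), z_normalise(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(za, zb)))


def most_similar(query, database):
    """Similarity problem: index of the database sequence closest to the query."""
    return min(range(len(database)), key=lambda i: distance(query, database[i]))


def autocorrelation(seq, lag):
    """Sample autocorrelation of a sequence at the given lag."""
    n = len(seq)
    mean = sum(seq) / n
    var = sum((x - mean) ** 2 for x in seq)
    if var == 0.0:
        return 0.0
    cov = sum((seq[t] - mean) * (seq[t + lag] - mean) for t in range(n - lag))
    return cov / var


def estimate_period(seq, max_lag):
    """Periodical problem: lag in [1, max_lag] with the highest autocorrelation.

    A simple heuristic; smooth series may need a minimum lag, which this
    sketch does not handle.
    """
    return max(range(1, max_lag + 1), key=lambda lag: autocorrelation(seq, lag))


# Hypothetical data: the second sequence is the query scaled by 10.
query = [1, 2, 3, 2, 1]
database = [[5, 5, 5, 5, 5], [10, 20, 30, 20, 10], [3, 2, 1, 2, 3]]
print("most similar index:", most_similar(query, database))

# Hypothetical data: a pattern of length 4 repeated ten times.
periodic = [1, 5, 2, 8] * 10
print("estimated period:", estimate_period(periodic, 10))
```

A linear scan over the database is used here for clarity; it says nothing about how an efficient search over a large TDB would be organised.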
4.1 Similarity Problems
In temporal data mining applications, it is often necessary to search within a temporal
sequence database (e.g., a TDB) for those sequences that are similar to a given query sequence.
Such problems are often called Similarity Search Problems. This kind of problem involves
searching multiple and multidimensional time series sets in TDBs to find out how many series
are similar to one another. It is one of the most important and growing problems in temporal
data mining. We still lack a standard definition and a standard theory for similarity problems
in TDBs.
Temporal data mining techniques can be applied to similarity problems. The main steps for
solving the similarity problem are as follows:
• define similarity: allows us to find similarities between sequences with different scaling
factors and baseline values.
• choose a query sequence: allows us to find what we want to know from large sequences (TDB)
(e.g., character, classification).
• processing algorithm for TDB: allows us to apply some statistical methods (e.g.,
transformation, wavelet analysis) to the TDB (e.g., to remove noisy data or interpolate missing
data).
• processing an approximate algorithm: allows us to build up a classification scheme for the
TDB according to the definition of similarity by using some data mining techniques (e.g.,
visualisation).
The result of the Similarity Problem search in a TDB can be used for temporal association,
prediction, etc.
4.2 Periodical Problems
The periodicity problem is the problem of finding periodic patterns, or cyclicity, occurring in
time-related databases (TDB). The problem is related to two concepts: pattern and interval. In
any selected sequence of a TDB, we are interested in finding patterns which repeat over time
and their recurring intervals (periods), or finding the repeating patterns of a sequence (or
TDB) as well as the interval which corresponds to the pattern period. For solving a Periodical
Problem in a TDB, the main steps are as follows:
• determining some definitions of the concept of a period under some assumptions: this step
allows us to know what kind of periodicity search we want to perform on the TDB.
• building up a set of algorithms: this step allows us to use properties of periodic time
series for finding periodic patterns from a subset of the TDB by using algorithms.
• processing simulation algorithms: this step allows us to find patterns from the whole TDB by
the algorithms.
Many techniques have been applied to these kinds of problems using pure mathematical analysis,
such as function analysis, data distribution analysis and so on, e.g., [9].
4.3 Discussion
In a time-series TDB, similarity and periodical search problems are sometimes difficult:
although many methods exist, most of them are either inapplicable or prohibitively expensive.
There is also another difficult problem: how can we combine multiple-level similarity or
periodical search in a multiple-level model? With the reference cube structure, such difficult
problems can be solved by extending the methods mentioned in previous subsections, but the
problem of combining multiple-level similarity and periodicity in a multiple-level model is
still unsolved. Also, more sophisticated techniques need to be developed to reduce memory
work-space.
In fact, similarity and periodical search problems can be combined into the problem of finding
interesting sequential patterns in TDBs. In recent years, some new algorithms have been
developed for “fast mining of sequential patterns in large TDBs”:
• generalized sequential pattern (GSP) algorithm: it essentially performs a level-wise or
breadth-first search of the sequence lattice spanned by the subsequence relation,
• sequential pattern discovery using equivalence classes (SPADE) algorithm: it decomposes the
original problem into smaller sub-problems using equivalence classes on frequent sequences
[15].
With any new algorithm, there is one important question that has often been asked: How can we
implement the new algorithm directly on top of a time-series TDB?
5. TIME SERIES TEMPORAL DATA MINING
Statistics has been an important tool for data analysis for a long time. For example, Bayesian
inference is the most extensively studied statistical method for knowledge discovery (e.g.,
[2], [10], [18]), and the Markov Model and Hidden Markov Model (e.g., [14; ?]) have also made
their way into the temporal knowledge discovery process.
A time series is a record of the values of any fluctuating quantity measured at different
points of time. One characteristic feature which distinguishes time series data from other
types of data is that, in general, the values of the series at different time instants will be
correlated (time series analysis theory can be found in any standard textbook of time series
analysis, e.g., [1]). The application of time series analysis techniques in temporal data
mining is often called Time Series Data Mining. A great deal of work has gone into identifying,
gathering, cleaning, and labeling the data, into specifying the questions to be asked of it,
and into finding the right way to view it to discover useful temporal patterns.
Time series analysis methods have been applied in the following major categories in temporal
data mining:
1. Representation of Temporal Sequence: This refers to the representation of data before actual
temporal data mining techniques take place. There are two major methods:
• General representation of data: representation of data as time series data in either
continuous or discontinuous, linear/non-linear models, stationary/non-stationary models and
distribution models (e.g., Time domain representation and Time series model representation).
• General transformation of representation of data: representation of data as time series data
in either continuous or discontinuous transformation (e.g., Fourier transformation, Wavelet
transformation and Discretization transformation).
2. Measure of Temporal Sequence: measuring temporal characteristic elements in given
definitions of similarity and/or periodicity within a temporal sequence (or between two
subsequences of a temporal sequence) or between temporal sequences. There are two methods:
• Characteristic distance measuring in time domain: measuring distance between temporal
characteristics in either continuous or discontinuous time domain (e.g., Euclidean squared
distance function).
• Characteristic distance measuring in other than time domain: measuring distance between
temporal characteristics in either continuous or discontinuous domain other than time (e.g.,
distance function between two distributions).
3. Prediction of Temporal Sequence: the main goal of prediction is to predict some fields in a
database based on the time domain. The techniques can be classified into two models.
• Temporal classification models: the basic goal is to predict the most likely state of a
categorical variable (the class) in the time domain.
• Temporal regression models: the basic goal is to predict a numeric variable in a set by using
different transformations (e.g., linear or non-linear) on databases to find temporal
information (or patterns) of the different (or the same) categorical data sets (classes).
There are various results to date on discovering temporal information, which have offered
forums to explore the progress of temporal data mining and future work concerning it. But a
general theory and a general method of temporal data analysis for discovering temporal patterns
in temporal sequence data are not well known.
6. CHALLENGES AND RESEARCH DIRECTIONS
Recent advances in data collection and storage technologies have made it possible for
companies, administrative agencies and scientific laboratories to keep vast amounts of temporal
data relating to their activities. Data mining refers to the activity that makes the automatic
extraction of different levels of knowledge from such data feasible. One of the main unresolved
problems that arise during the data mining process, often called the General Analysis Method of
Temporal Data Mining, is treating data that contains temporal information.
6.1 Challenge Questions
Data mining is a step in knowledge discovery in databases. Although successful data mining
applications continue to appear, the fundamental problems are still as difficult as they have
been for the past decade. One such difficult and fundamental problem is the development of a
general data mining analysis theory. Temporal data mining researchers have paid some attention
to this problem, but results still remain in their infancy. One of the important roots of data
mining analysis is statistical analysis theory. The general temporal data mining analysis
theory includes two important analysis methods:
• Data structural temporal knowledge analysis method: This method involves the discovery of
data prior temporal knowledge, and the exploitation of the knowledge into a data analysis model
to establish the link between the present temporal knowledge and the future temporal knowledge.
• Data temporal measure analysis method: This method involves the transformation of the initial
temporal domain (or space) of the data into another domain (or space), then the use of this new
domain to represent the original temporal data.
The techniques involved in the above two methods can be divided into the following classes:
1. Temporal data clustering: temporal clustering targets separating the temporal data into
subsets whose members are similar to each other. There are two fundamental problems of temporal
clustering:
• To define a meaningful similarity measure, and,
• To choose the number of temporal clusters (if we do not know the cluster numbers).
2. Temporal data prediction: the goal of temporal prediction is to predict some fields based on
other temporal fields. Temporal data prediction also involves using prior temporal patterns (or
models, knowledge) for finding the data attributes relevant to the attribute of interest.
3. Temporal data summarization: the purpose of temporal data summarization is to describe a
subset of temporal data by representing extracted temporal information in a model, in rules or
in patterns. It provides a compact description for a temporal dataset. It could also involve a
logic language such as temporal logic, fuzzy logic and so on.
4. Temporal data dependency: temporal dependency modelling describes time dependencies among
data and/or temporal attributes of data. There are two dependency models: qualitative and
quantitative. The qualitative dependency models specify temporal variables (e.g., time gap)
that are locally dependent on a given state-space S. The quantitative dependency models specify
the value dependencies (e.g., using a numerical scale) in a statistical space P.
6.2 Some Answers to the Challenge Questions
During the past few years, we have proposed a formal framework for the definitions and general
hidden distribution theory of temporal data mining. We have also investigated applications in
temporal clustering, temporal classification and temporal feature selection for temporal data
mining. The major work we have done in answering the temporal data mining challenge questions
is as follows:
• We have established a General Hidden Distribution-based Analysis Theory for temporal data
mining. The general mining analysis theory is based on the statistical analysis method, but the
traditional statistical assumptions come only from the data itself. There are two important
concepts in the theory: 1) the data qualitative set and the data quantitative set, and 2) the
data hidden conditional distribution function. The data qualitative set is the set that decides
the data moving structure, such as data periodicity and similarity. In other words, the data
qualitative set is a base of the data. The data quantitative set is the set that decides the
numerical range of the data moving structure. The data hidden conditional distribution function
is built on the characteristics of the data qualitative and data quantitative sets. Another
feature of the general mining analysis method is that we can use (extensions of) all existing
statistical analysis methods and techniques for mining temporal patterns.
• We have proposed an algorithm called the Additive Distributional Recursion Algorithm (ADRA)
in the General Hidden Distribution-based Analysis Theory for building up temporal data models.
The algorithm uses the sieve method (an important method in number theory) to discover temporal
distribution functions (models, patterns).
• We have extended a normal measure method to a new Temporal Measure Method, called the
Time-gap Measure Method. The new measure method brings the “time length” (between temporal
events) or the “time interval” (within a temporal event) into a time point (or time value)
variable. After a temporal sequence is transformed, it can be measured in both state-space S
and probability space P. The time-gap is used as a temporal variable in the time distribution
function f(t_v) or in temporal variable functional equations embedded in temporal models of the
sequence.
• We have extended and built up a new application of fundamental mathematical techniques for
dealing with large temporal datasets, massive temporal datasets and distributed temporal
datasets. The new application is called Temporal Sequence Set-Chains (a special case of
Temporal Set-Chains is Markov Set-Chains). The key issue in temporal sequence set-chains is the
use of stochastic matrices of samples to build up a moving kernel distribution. Temporal
sequence set-chains can be used for mining large temporal sequences, massive temporal sequences
and distributed temporal sequences such as Web temporal data sequences (e.g., Web content
sequences, Web usage sequences and Web structural sequences).
• We have proposed a framework of Temporal Clustering for discovering temporal patterns. In our
temporal clustering method, there are three stages of temporal data mining in temporal
clustering analysis: 1) the input stage: what measure of similarity is appropriate to use, 2)
the algorithm stage: what types of algorithms to use, and 3) the output stage: assessing and
interpreting the results of cluster analysis. In the second stage, we also proposed a framework
of a Distribution-based Temporal Clustering Algorithm. The algorithm is based on our general
analysis method.
• We have proposed a framework of Temporal Classification. This temporal classification is
generated by our Temporal Clustering method. According to our general analysis theory for
Temporal Sequence Mining and its application in temporal clustering, there are also the
following three steps for constructing a temporal classification: 1) provide a definition of
temporal classification, 2) define a distribution distance function and 3) provide the
weighting of temporal objects for changing their class membership. For large numbers of
classifications, we proposed discriminant coordinates of the time-gap distribution to deal with
such kinds of problems.
• We have proposed a framework of Temporal Feature Selection for discovering temporal patterns.
There are three steps for feature selection in the temporal sequence. The first step of the
framework employs a distance measure function on time-gap distributions between temporal events
for discovering structural temporal features. In this step, only rough shapes of patterns are
decided. The temporal features are grouped into temporal classifications by employing a
distribution distance measure. In the second step, the degree of similarity and periodicity
between the extracted features is measured based on the data value distribution models. The
third step of the framework consists of a hybrid model for selecting global features based on
the results of the first two steps.
• We have established the main steps of applying our general temporal data mining theory to
real world datasets with different methods and models. There are three steps in applying our
general analysis to discover knowledge from a temporal sequence: 1) preprocessing data
analysis, including solving data problems and transforming data from its original form into its
quantitative set and qualitative set, 2) temporal pattern searching, including
qualitative-based pattern searching, quantitative-based pattern searching and discovering
global temporal patterns (models), and 3) the interpretation of the global temporal patterns
(models) and future prediction.
6.3 Future Research Directions
As we mentioned earlier, temporal data mining and knowledge discovery have emerged as
fundamental research areas with important applications in science, medicine and business. In
this section, we describe some of the major directions of research arising from the recent
general analysis theory of temporal data mining:
1. An extension of this temporal sequence measure method to general temporal points (e.g., a
temporal interval-based gap function) allowing an arbitrary interval between temporal points
may lead to a very powerful temporal sequence transformation method.
2. An extension of the notion of Temporal Sequence Set-Chains to different temporal variables,
or different components of a temporal variable, can be applied to deal with the following
problems of temporal data mining:
• the number of temporally related attributes of each observation increases,
• the number of temporally related observations increases, and
• the number of temporally related distribution functions increases.
3. An important extension of the general temporal mining theory is the development of
distributed temporal data mining algorithms.
4. In applications of temporal data mining, all new temporal data mining theories, methods and
techniques should be developed on/with privacy and security models and protocols appropriate
for temporal data mining.
5. In general data mining theory, we may need to develop fundamental mathematical techniques of
fuzzy methods for mining purposes (e.g., temporal fuzzy clustering and algorithms, temporal
fuzzy association rules and new types of temporal databases).
7. CONCLUDING REMARKS
Temporal data mining is a very fast expanding field with many new research results reported and
many new temporal data mining analysis methods or prototypes developed recently. Some overview
articles on temporal data mining have discussed different frameworks for covering research and
applications in temporal data mining. In [19], for example, Roddick and Spiliopoulou have
presented a comprehensive overview of techniques for the mining of temporal data.
In this report we have provided an overview of the temporal data mining process and some
background to Temporal Data Mining. We have also discussed a difficult and fundamental problem,
a general analysis theory of temporal data mining, and provided some answers to the problem.
This leads into a discussion on why there was a need for Temporal Data Mining in industry,
which has been a major factor in the efforts that have gone into building the present
generation of Temporal Data Mining Systems. We have presented a number of areas which are
related to Temporal Data Mining in their objectives and compared and contrasted these
technologies with Temporal Data Mining.
Acknowledgements
This research has been supported in part by an Australian Research Council (ARC) grant and a
Macquarie University Research Grant (MURG).
8. REFERENCES
[1] D. Brillinger, editor. Time Series: Data Analysis and Theory. Holt, Rinehart and Winston,
New York, 1975.
[2] P. Cheeseman and J. Stutz. Bayesian classification (AUTOCLASS): Theory and results. In
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge
Discovery and Data Mining. AAAI Press / MIT Press, 1995.
[3] T. Fulton, S. Salzberg, S. Kasif, and D. Waltz. Local induction of decision trees: Towards
interactive data mining. In Simoudis et al. [21], page 14.
[4] B. R. Gaines and P. Compton. Induction of meta-knowledge about knowledge discovery. IEEE
Trans. on Knowledge and Data Engineering, 5:990–992, 1993.
[5] C. Glymour, D. Madigan, D. Pregibon, and P. Smyth. Statistical inference and data mining.
Communications of the ACM, 39(11):35–41, Nov. 1996.
[6] F. H. Grupe and M. M. Owrang. Data-base mining discovering new knowledge and competitive
advantage. Information Systems Management, 12:26–31, 1995.
[7] J. Han, Y. Cai, and N. Cercone. Knowledge discovery in databases: An attribute-oriented
approach. In Proceedings of the 18th VLDB Conference, pages 547–559, Vancouver, British
Columbia, Canada, Aug. 1992.
[8] J. W. Han, Y. D. Cai, and N. Cercone. Data-driven discovery of quantitative rules in
relational databases. IEEE Trans. on Knowledge and Data Engineering, 5:29–40, February 1993.
[9] J. W. Han, Y. Yin, and G. Dong. Efficient mining of partial periodic patterns in time
series database. IEEE Trans. on Knowledge and Data Engineering, 1998.
[10] D. Heckerman, H. Mannila, D. Pregibon, and R. Uthurusamy, editors. Learning Bayesian
networks: the combination of knowledge and statistical data. AAAI Press, 1994.
[11] D. Heckerman, H. Mannila, D. Pregibon, and R. Uthurusamy, editors. Proceedings of the
Third International Conference on Knowledge Discovery and Data Mining (KDD-97). AAAI Press,
1997.
[12] E. Keogh and P. Smyth. A probabilistic approach to fast pattern matching in time series
databases. page 126.
[13] A. Ketterlin. Clustering sequences of complex objects. In Heckerman et al. [11], page 215.
[14] C. Li and G. Biswas. Temporal pattern generation using hidden Markov model based
unsupervised classification. In Proc. of IDA-99, pages 245–256, 1999.
[15] M. J. Zaki. Fast mining of sequential patterns in very large databases. Uni. of Rochester
Technical report, 1997.
[16] S. a. O.Etzion, editor. Temporal databases: Research and Practice. Springer-Verlag,
LNCS 1399, 1998.
[17] B. Padmanabhan and A. Tuzhilin. Pattern discovery in temporal databases: A temporal logic
approach. In Simoudis et al. [21], page 351.
[18] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search.
Springer-Verlag, 1993.
[19] J. Roddick and M. Spiliopoulou. A survey of temporal knowledge discovery paradigms and
methods. IEEE Transactions on Knowledge and Data Engineering, 2002.
[20] R. O. Duda and P. Hart. Pattern classification and scene analysis. John Wiley and Sons,
1973.
[21] E. Simoudis, J. W. Han, and U. Fayyad, editors. Proceedings of the Second International
Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press, 1996.
[22] P. Smyth. Clustering using Monte Carlo cross-validation. In Simoudis et al. [21], page
126.
[23] T. Oates. Identifying distinctive subsequences in multivariate time series by clustering.
In 5th International Conference on Knowledge Discovery and Data Mining, pages 322–326, 1999.
[24] J. D. Ullman and C. Zaniolo. Deductive databases: achievements and future directions.
SIGMOD Record (ACM Special Interest Group on Management of Data), 19(4):75–82, Dec. 1990.
[25] D. Urpani, X. Wu, and J. Sykes. RITIO - rule induction two in one. In Simoudis et al.
[21], page 339.
[26] P. Usama Fayyad and O. L. Mangasarian. Data mining: Overview and optimization
opportunities. INFORMS, Special issue on Data Mining, 1998.
Distances for Spatio-temporal clustering
Mirco Nanni
ISTI - Institute of CNR
Via Moruzzi 1 – Loc. S. Cataldo, 56124 Pisa, Italy
nanni@guest.cnuce.cnr.it
Dino Pedreschi
Dipartimento di Informatica, Università di Pisa
Via F. Buonarroti 2, 56127 Pisa, Italy
pedre@di.unipi.it
ABSTRACT
Keywords
1. INTRODUCTION
1.1 Aim of the paper
2. RELATED WORK
3. A DATA MODEL FOR TRAJECTORIES
4. A FAMILY OF DISSIMILARITY MEASURES
4.1 General definition and example instances
4.2 Mathematical properties
4.3 Computational properties
5. EFFECTS ON SOME CLUSTERING ALGORITHMS
5.1 Dissimilarity Matrix-based
5.2 K-means
[Figure: D(o,c’)]
5.3 Optimisations
6. EXPERIMENTATIONS
6.1 Synthesised dataset: The “leader” model
6.2 Effects of Optimisation
[Figure: Time vs. Dataset size for Naive k-means and Optimised]
7. CONCLUSIONS
Author Index
Tamas Abraham …… 17
Janice Boughton …… 65
Richard Brookes …… 13
Frada Burstein …… 37
N. Scott Cardell …… 1
Peter Christen …… 99, 117
Tim Churches …… 99
Adam Czezowski …… 117
Olivier de Vel …… 17
Mikhail Golovnya …… 1
Raj P. Gopalan …… 109
Ryan Kling …… 17
Inna Kolyshkina …… 13
Shonali Krishnaswamy …… 47
Weiqiang Lin …… 83
Seng Wai Loke …… 47
Sabine McConnell …… 75
Mirco Nanni …… 91
Tariq Nuruddin …… 109
Mehmet A. Orgun …… 83
Dino Pedreschi …… 91
Ben Raymond …… 29
David B. Skillicorn …… 75
Dan Steinberg …… 1
Yudho Giri Sucahyo …… 109
Sérgio Viademonte …… 37
Zhihai Wang …… 57, 65
Geoffrey I. Webb …… 57, 65
Graham J. Williams …… 83
Eric J. Woehler …… 29
Arkady Zaslavsky …… 47
Justin Zhu …… 99