AI2002 Workshop Proceedings: Data Mining
Edited by Simeon J. Simoff, Graham J. Williams and Markus Hegland
The 15th Australian Joint Conference on Artificial Intelligence 2002, Rydges Canberra, Australia, 2 - 6 December 2002

ADM02 Proceedings: Australasian Data Mining Workshop, 3rd December 2002, Canberra, Australia
Edited by Simeon J. Simoff, Graham J. Williams and Markus Hegland, in conjunction with The 15th Australian Joint Conference on Artificial Intelligence, Canberra, Australia, 2nd - 6th December 2002
University of Technology Sydney, 2002

© Copyright 2002. The copyright of these papers belongs to the papers' authors. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage.

Proceedings of the 1st Australasian Data Mining Workshop (ADM02), in conjunction with the 15th Australian Joint Conference on Artificial Intelligence, 2nd - 6th December 2002, Canberra, Australia. S. J. Simoff, G. J. Williams and M. Hegland (eds). Workshop Web Site: http://datamining.csiro.au/adm02/ Published by the University of Technology Sydney. ISBN 0-9750075-0-5

Foreword

The Australasian Data Mining Workshop is devoted to the art and science of data mining: the analysis of (usually large) data sets to discover relationships and present the data in novel ways that are compact, comprehensible and useful for researchers and practitioners. Data mining projects involve both the utilisation of established algorithms from machine learning, statistics and database systems, and the development of new methods and algorithms targeted at large data mining problems. Nowadays data mining efforts have gone beyond crunching databases of credit card usage or stored transaction records; they focus on data collected in the health care system, art, design, medicine, biology and other areas of human endeavour. There has been increasing interest in data mining across Australian industry, academia, research institutions and centres, evidenced by the growing number of research groups (e.g. the ANU Data Mining Group, CSIRO Enterprise Data Mining, and the UTS Smart eBusiness Systems Lab) and academic and industry events (e.g. the data mining seminar series organised by PricewaterhouseCoopers Actuarial Sydney) related to one or another aspect of data mining.

The workshop aims to bring together researchers and industry practitioners from data mining groups in Australia and the region, as well as overseas researchers and practitioners, who are working in the development and application of data mining methods, techniques and technologies. The workshop is expected to become a forum for presenting and discussing the latest research and developments in the area. The works selected for presentation are expected to facilitate the cross-disciplinary exchange of ideas and communication between industry and academia in the area of data mining and its applications. Consequently, the morning part of the workshop (the sessions on “Practical Data Mining” and “Applications of Data Mining”) addresses data mining practice. The afternoon part of the workshop includes sessions on “Data Mining Methods and Algorithms”, “Spatio-Temporal Data Mining”, and “Data Preprocessing and Supporting Technologies”.
The organisers have also reserved a special presentation session for an overview of on-going projects. As part of the Australian Joint Conference on Artificial Intelligence, the workshop follows a rigorous peer-review and paper selection process. Once again, we would like to thank all those who supported this year's efforts at all stages, from the development and submission of the workshop proposal to the preparation of the final program and proceedings. We would like to thank all those who submitted their work to the workshop. All papers were extensively reviewed by two to three referees drawn from the program committee. Special thanks go to them, for the final quality of the selected papers depends on their efforts.

Simeon J. Simoff, Graham J. Williams and Markus Hegland
November 2002

Workshop Chairs
Simeon J. Simoff, University of Technology Sydney, Australia
Graham J. Williams, Enterprise Data Mining, CSIRO, Australia
Markus Hegland, Australian National University, Australia

Program Committee
Sergei Ananyan, Megaputer Intelligence, Russia & USA
Rohan Baxter, Enterprise Data Mining, CSIRO, Australia
John Debenham, University of Technology Sydney, Australia
Vladimir Estivill-Castro, Griffith University, Australia
Eibe Frank, University of Waikato, New Zealand
Paul Kennedy, University of Technology Sydney, Australia
Inna Kolyshkina, PricewaterhouseCoopers Actuarial Sydney, Australia
Kevin Korb, Monash University, Australia
Xuemin Lin, University of NSW, Australia
Warwick Graco, Health Insurance Commission, Australia
Ole Nielsen, Australian National University, Australia
Tom Osborn, NUIX Pty Ltd, and The NTF Group, Australia
Chris Rainsford, Enterprise Data Mining, CSIRO, Australia
John Roddick, Flinders University, Australia
David Skillicorn, Queen's University, Canada
Dan Steinberg, Salford Systems, USA

Program for ADM02 Workshop
Tuesday, 3 December 2002, Canberra, Australia

9:00 - 9:10 Opening and Welcome
9:10 - 10:30 Session 1 – Practical Data Mining
• 09:10 - 10:00 STOCHASTIC GRADIENT BOOSTING: AN INTRODUCTION TO TreeNet™ Dan Steinberg, Mikhail Golovnya and N. Scott Cardell
• 10:00 - 10:20 CASE STUDY: MODELING RISK IN HEALTH INSURANCE - A DATA MINING APPROACH Inna Kolyshkina and Richard Brookes
10:20 - 10:35 Coffee break
10:35 - 12:15 Session 2 – Applications of Data Mining
• 10:35 - 11:00 INVESTIGATIVE PROFILE ANALYSIS WITH COMPUTER FORENSIC LOG DATA USING ATTRIBUTE GENERALISATION Tamas Abraham, Ryan Kling and Olivier de Vel
• 11:00 - 11:25 MINING ANTARCTIC SCIENTIFIC DATA: A CASE STUDY Ben Raymond and Eric J. Woehler
• 11:25 - 11:50 COMBINING DATA MINING AND ARTIFICIAL NEURAL NETWORKS FOR DECISION SUPPORT Sérgio Viademonte and Frada Burstein
• 11:50 - 12:15 TOWARDS ANYTIME ANYWHERE DATA MINING E-SERVICES Shonali Krishnaswamy, Seng Wai Loke and Arkady Zaslavsky
12:15 - 13:00 Lunch
13:00 - 14:00 Session 3 – Data Mining Methods and Algorithms
• 13:00 - 13:20 A HEURISTIC LAZY BAYESIAN RULE ALGORITHM Zhihai Wang and Geoffrey I. Webb
• 13:20 - 13:40 AVERAGED ONE-DEPENDENCE ESTIMATORS: PRELIMINARY RESULTS Geoffrey I. Webb, Janice Boughton and Zhihai Wang
• 13:40 - 14:00 SEMIDISCRETE DECOMPOSITION: A BUMP HUNTING TECHNIQUE S. McConnell and David B. Skillicorn
14:00 - 14:40 Session 4 – Spatio-Temporal Data Mining
• 14:00 - 14:20 AN OVERVIEW OF TEMPORAL DATA MINING Weiqiang Lin, Mehmet A. Orgun and Graham J.
Williams • 14:20 - 14:40 DISTANCES FOR SPATIO-TEMPORAL CLUSTERING Mirco Nanni and Dino Pedreschi 14:40 - 14:55 Coffee break 14:55 - 16:10 Session 5 – Data Preprocessing and Supporting Technologies • 14:55 - 15:20 PROBABILISTIC NAME AND ADDRESS CLEANING AND STANDARDISATION Peter Christen, Tim Churches and Justin Zhu • 15:20 - 15:45 BUILDING A DATA MINING QUERY OPTIMIZER Raj P. Gopalan, Tariq Nuruddin and Yudho Giri Sucahyo • 15:45 - 16:10 HOW FAST IS -FAST? PERFORMANCE ANALYSIS OF KDD APPLICATIONS USING HARDWARE PERFORMANCE COUNTERS ON ULTRASPARC-III Adam Czezowski and Peter Christen 16:10 - 17:00 Session 6 – Project Reports, Discussion and Closure iv Table of Contents Stochastic Gradient Boosting: An Introduction to TreeNet™ Dan Steinberg, Mikhail Golovnya and N. Scott Cardell ……………………………………………001 Case Study: Modeling Risk in Health Insurance - A Data Mining Approach Inna Kolyshkina and Richard Brookes ………………………………………………………… 013 Investigative Profile Analysis With Computer Forensic Log Data Using Attribute Generalisation Tamas Abraham, Ryan Kling and Olivier de Vel ………………………………………………… 017 Mining Antarctic Scientific Data: A Case Study Ben Raymond and Eric J. Woehler …………………………………………………………… 029 Combining Data Mining And Artificial Neural Networks For Decision Support Sérgio Viademonte and Frada Burstein ………………………………………………………… 037 Towards Anytime Anywhere Data Mining e-Services .…………………………………… 047 Shonali Krishnaswamy, Seng Wai Loke and Arkady Zaslavsky A Heuristic Lazy Bayesian Rule Algorithm Zhihai Wang and Geoffrey I. Webb …………………………………………………………… 057 Averaged One-Dependence Estimators: Preliminary Results …………………………………………… 065 Geoffrey I. Webb, Janice Boughton and Zhihai Wang Semidiscrete Decomposition: A Bump Hunting Technique Sabine McConnell and David B. Skillicorn ……………………………………………………… 075 An Overview Of Temporal Data Mining Weiqiang Lin, Mehmet A. Orgun and Graham J. Williams ………………………………………… 083 Distances For Spatio-Temporal Clustering ……………………………………………………………… 091 Mirco Nanni and Dino Pedreschi Probabilistic Name and Address Cleaning and Standardisation Peter Christen, Tim Churches and Justin Zhu …………………………………………………… 099 Building A Data Mining Query Optimizer Raj P. Gopalan, Tariq Nuruddin and Yudho Giri Sucahyo …..………………………………… 109 How Fast Is -Fast? Performance Analysis of KDD Applications Using Hardware Performance Counters on UltraSPARC-III Adam Czezowski and Peter Christen ………………………………………………………… 117 Author Index …………………………………………………………………………… 131 v vi Stochastic Gradient Boosting: An Introduction to TreeNet Dan Steinberg, Mikhail Golovnya, N. Scott Cardell Salford Systems Stochastic Gradient Boosting Introduction to Stochastic Gradient Boosting  An introduction to TreeNet™ New approach to machine learning and function approximation developed by Jerome H. Friedman at Stanford University   Co-author of CART® with Breiman, Olshen and Stone Author of MARS™, PRIM, Projection Pursuit Good for classification and regression problems  Builds on the notions of committees of experts and boosting but is substantially different in key implementation details  Salford Systems http://www.salford-systems.com dstein@salford-systems.com Dan Steinberg, Mikhail Golovnya, N. 
Scott Cardell © Copyright Salford Systems 2001-2002 © Copyright Salford Systems 2001-2002 Stochastic Gradient Boosting: Key Innovations -1 Benefits of TreeNet  Built on CART trees and thus         Stagewise function approximation in which each stage models residuals from last step model  Conventional boosting models use the original target at each stage  Each stage uses a very small tree, as small as two nodes and typically in the range of 4-8 nodes  Conventional bagging and boosting use full size trees  Bagging works best with massively large trees (1 case in each terminal node)  Each stage learns from a fraction of the available training data, typically less than 50% to start and often falling to 20% or less by the last stage Resistant to mislabeled target data    immune to outliers handles missing values automatically selects variables, results invariant wrt monotone transformations of variables In medicine cases are commonly misdiagnosed In business, non-responders are occasionally flagged as “responders” Resistant to overtraining – generalizes well Can be remarkably accurate with little effort Trains rapidly; at least as fast as CART © Copyright Salford Systems 2001-2002 © Copyright Salford Systems 2001-2002 Combining Trees into “Committees of Experts” Stochastic Gradient Boosting: Key Innovations -2     Each stage learns only a little: severely downweighted contribution of each new tree (learning rate is typically 0.10, even 0.01 or less) How much is learned in each stage compared to a single tree In classification, focus is on points near decision boundary; ignores points far away from boundary even if the points are on the wrong side    If we do very badly on certain observations we ignore them Unlike boosting which would upweight such points  Explains why boosting is vulnerable to mislabeled data © Copyright Salford Systems 2001-2002 © Copyright Salford Systems 2001-2002 1 The Australasian Data Mining Workshop Idea that combining good methods could yield promising results first was suggested by researchers a decade ago  In tree-structured analysis, suggestions made by:  Wray Buntine (1991, Bayes style allows cases to go down several tree paths)  Kwok and Carter (1990, split nodes several different ways to get alternate trees)  Heath, Kasif and Salzberg (1993, split nodes several different ways using different linear combination splitters) More recent work introduced concepts of bootstrap aggregation (“bagging”), adaptive resampling and combining (“arcing”) and boosting  Breiman (1994, 1996, multiple independent trees via sampling with replacement)  Breiman (1996, multiple trees with adaptive reweighting of training data)  Freund and Schapire (1996, multiple trees with adaptive reweighting of training data) The Australasian Data Mining Workshop Bootstrap Resampling Effectively Reweights Training Data (Randomly and Independently) Trees Can be Combined By Voting or Averaging   Trees combined via voting (classification) or averaging (regression) Classification trees “vote”    Recall that classification trees classify  Probability of being omitted in a single draw is (1 - 1/n) Probability of being omitted in all n draws is (1 - 1/n)n Limit of series as n increases is (1/e) = 0.368  assign each case to ONE class only With 50 trees, 50 class assignments for each case Winner is the class with the most votes Votes could be weighted – say by accuracy of individual trees  Regression trees assign a real predicted value for each case           Predictions are combined via averaging Results 
will be much smoother than from a single tree  approximately 36.8% sample excluded 0 % of resample 36.8% sample included once 36.8 % of resample 18.4% sample included twice thus represent ... 36.8 % of resample 6.1% sample included three times ... 18.4 % of resample 1.9% sample included four or more times ... 8 % of resample 100 % Example: distribution of weights in a 2,000 record resample: 0 732 0.366 © Copyright Salford Systems 2001-2002     Test Set Misclassification Rate (%) Decrease 49% 30% 77% 19%   Problems with Boosting  Similar procedure first introduced by Freund & Schapire (1996) Breiman variant (ARC-x4) is easier to understand:  Suppose we have already grown K trees: let m(j) = # times case j was misclassified (0 <= m(j) <= K) Define w(j) = (1 + m(j)4)  Prob (sample inclusion) = w( j ) Boosting in general is vulnerable to overtraining     6 3 0.002 © Copyright Salford Systems 2001-2002 ARCing reweights the training data  5 6 0.003 Bagging proceeds by independent, identically-distributed resampling draws Adaptive resampling: probability that a case is sampled varies dynamically Starts with all cases having equal probability After first tree is grown, weight is increased on all misclassified cases For regression, weight increases with prediction error for that case Idea is to focus tree on those cases most difficult to predict correctly © Copyright Salford Systems 2001-2002  4 32 0.016 (ARCing, a Variant of Boosting) Statlog Data Set Summary Bag 6.4 10.3 0.014 5.0 3 119 0.06 Adaptive Resampling and Combining Data Set # Training # Variables # Classes # Test Set Letters 15,000 16 26 5,000 Satellite 4,435 36 6 2,000 Shuttle 43,500 9 7 14,500 DNA 2,000 60 3 6,186 1 Tree 12.6 14.8 0.062 6.2 2 359 0.179 © Copyright Salford Systems 2001-2002 Bootstrap Aggregation Performance Gains Data Set Letters Satellite Shuttle DNA 1 749 0.375 Boosting highly vulnerable to errors in the data  M  ∑ w(i ) Much better fit on training than on test data Tendency to perform poorly on future data Technique designed to obsess over errors Will keep trying to “learn” patterns to predict miscoded data i =1   Weight = 1 for cases with zero occurrences of misclassification  Weight = 1+ K4 for cases with K misclassifications  Samples will tend to be increasingly dominated by misclassified cases  Documented in study by Dietterich (1998)  An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization Rapidly becomes large if case is difficult to classify © Copyright Salford Systems 2001-2002 © Copyright Salford Systems 2001-2002 2 The Australasian Data Mining Workshop The Australasian Data Mining Workshop Stochastic Gradient Boosting  Building on multiple tree ideas and adaptive learning  Goal of avoiding shortcomings of standard boosting  Placed in the context of function approximation  Trees combined by adding them (adding scores)  Friedman calls it Multiple Additive Regressive Trees (MART)  Salford calls it TreeNet TM Function Approximation By a Series of Error Corrections  Our approximation to any function can be written as F ( X ) = F0 + β1T1 ( X ) + β2T2 ( X ) + ... + βM TM ( X ) Where F0 is the initial guess, usually what we would use in the absence of any model (e.g. mean, median, etc.)  
The approximation is built up stagewise     © Copyright Salford Systems 2001-2002     Average neighborhood home value is $22,533 Start model F(x) with this mean and construct residuals Model residuals with two-node tree   Function Approximation By a Series of Adjustments Consider Boston Housing data set   Function is built up through a series of adjustments or considerations  Each adjustment adds (or subtracts) something from the current estimate of function value  When we know nothing our home value prediction is the mean This is just an error correction based on one dimension of data Model will attempt to separate positive from negative residuals Now update model, obtain new residuals and repeat process Estimated function will look something like this:  © Copyright Salford Systems 2001-2002    Then we take number of rooms into account and adjust upwards for larger houses and downwards for smaller houses  Then we take socioeconomic status of residents into account and again adjust up or down  Continue taking further factors into account until an optimal model is built Similar to building up a score from a checklist of important factors (get points for certain characteristics, lose points for others) © Copyright Salford Systems 2001-2002 Two-node adjusting trees create main effects-only models Adjusting Trees Can be Any Size  Each stage is a “weak learner” – a small tree © Copyright Salford Systems 2001-2002 Function Approximation By a Series of Trees  Once a stage is added it is never revised or refit Each stage added by assessing model and attempting to improve its quality by, for example, reducing residuals Two-node, three-node, and larger trees can be used  Consider again the Boston Housing data set model Friedman finds that six-node trees generally work well  Each tree involves only one variable A tree with more than two nodes still adjusts the existing model  Each contribution of any one tree not dependent on which branch a case terminates in any other tree  High LSTAT reduces estimated home values by same amount regardless of number of rooms in house  May take several variables into account simultaneously  Each tree just partitions data into subsets  Each subset gets a separate adjustment F ( X ) = F0 + β1T1 ( X ) + β 2T2 ( X ) + ... 
+ β M TM ( X ) © Copyright Salford Systems 2001-2002 © Copyright Salford Systems 2001-2002 3 The Australasian Data Mining Workshop The Australasian Data Mining Workshop TreeNet Model with Three-Node Trees Rationale for Additive Trees Want to provide this style of function approximation with some theoretical justification  Need to specify many details:  +0.4 yes LSTAT<14.3 yes –8.4 no + MV = 22.5 + RM<6.8 +13.7 no   +0.2 yes yes RM<6.8 yes  –0.3  CRIM<8.2 yes –5.2 no no + LSTAT>5.1 no +8.4 RM<7.4 no –4.4   +3.2 How to choose tree size to use How many forward steps to take How to identify optimal model How to interpret model and results How much to adjust at a step Need to describe practical performance  Comparison with conventional boosting and single trees Each tree has three terminal nodes, thus partitioning data at each stage into three segments © Copyright Salford Systems 2001-2002 © Copyright Salford Systems 2001-2002 Classical Function Approximation Predictive Modeling and Function Approximation  Specify a functional form for F(x), known up to a set of parameters B  Learn by fitting F*(x) to data, minimizing loss measure L  Achieved by iterative search procedure in which B is adjusted with reference to gradient (∂L/ ∂F)( ∂F/ ∂B)  Final result is obtained by adding together a series of parameter changes guided by gradient at an iteration  Think of this as a gradual form of learning from the data GIVEN Y X L(Y, F)  Output or Response Variable  Inputs or Predictors  Loss Function ESTIMATE F*(X) = arg minF(X) EY,X[L(Y,F(X))] © Copyright Salford Systems 2001-2002 © Copyright Salford Systems 2001-2002 Nonparametric Function Approximation   General non-parametric case: F(X) is treated as having a separate parameter for each distinct combination of predictors X With infinite data best estimate of F(X) under quadratic loss at any specific data vector Xi would be 1 F* X i = yj N X i j: X j = X i ( ) With plentiful data accurate estimates of F(X) can be obtained for any X  But we only have finite data so     ∑   General Optimization Strategy for Function Approximation Make an initial guess {Fo(Xi)} – for example, assuming that all Fo(Xi) are the same for all Xi Compute the negative gradient at each observed data point i N  ∂Lˆ  r g = −   ∂F ( X i )i =1  most possible X vectors not represented in the data lack of replicates means inaccurate estimates at any X  The negative gradient gives us the direction of the steepest descent Take a step in the steepest descent direction Direct optimization in N free parameters will result in a dramatic overfitting  will somehow have to limit the total number of free parameters © Copyright Salford Systems 2001-2002 © Copyright Salford Systems 2001-2002 4 The Australasian Data Mining Workshop The Australasian Data Mining Workshop Identifying Common Gradient Partitions with Regression Trees Guarding Against Overfit In the Non-parametric Case     Literal steepest descent is inadvisable as it would allow free adjustment of one parameter for each data point Instead, limit the number of free parameters that can be adjusted to a small number, say L. 
This can be done by partitioning the data into L mutually exclusive groups and making a common adjustment within each group. The challenge is to find a good partitioning of the data into L mutually exclusive groups. The goal is to group observations with similar gradients together, so that a common adjustment can be made to the model for each group: build an L-node regression tree with the target being the negative gradient; within each group the gradients should then be similar.

Generic Gradient Boosting Algorithm
For a given loss estimate L̂ and number of iterations M:
1. Choose a start value {F(Xi)} = {F0(Xi)} (e.g. the mean, for all data)
2. FOR m = 1 TO M
3. Compute gm, the derivative of the expected loss with respect to F(Xi), evaluated at Fm−1(Xi) (e.g. residual, deviance)
4. Fit an L-node regression tree to the components of the negative gradient; this will partition the observations into L mutually exclusive groups
5. Find the within-node update hm(Xi), adjusting each node separately (conventional model updating)
6. Update: {Fm(Xi)} = {Fm−1(Xi)} + hm(Xi)
7. END

Gradient Boosting for Least Squares Loss
L̂({F(Xi)}) = (1/N) Σ_{i=1}^{N} (Yi − F(Xi))²
1. Initial guess {F0(Xi)} = {ave(Yi)}
2. FOR m = 1 TO M
3. gm ~ {Yi − Fm−1(Xi)} = {Residual_i}
4. Fit an L-node regression tree to the current residuals; this will partition the observations into L mutually exclusive groups
5. For each node: hm(Xi) = node-ave(Residual_i)
6. Update: {Fm(Xi)} = {Fm−1(Xi)} + hm(Xi)
7. END

Gradient Boosting for Least Absolute Loss
L̂({F(Xi)}) = (1/N) Σ_{i=1}^{N} |Yi − F(Xi)|
1. Initial guess {F0(Xi)} = {median(Yi)}
2. FOR m = 1 TO M
3. gm ~ {sign(Yi − Fm−1(Xi))} = {sign(Residual_i)}
4. Fit an L-node regression tree to the signs of the current residuals (+1, −1); this will partition the observations into L mutually exclusive groups
5. For each node: hm(Xi) = node-median(Residual_i)
6. Update: {Fm(Xi)} = {Fm−1(Xi)} + hm(Xi)
7. END

Gradient Boosting for Classification: Binary Response
In the case of a binary response, the negative log-likelihood is used in place of the loss function. Friedman codes Y as {+1, −1} with conditional probabilities
P(y | X) = 1 / (1 + e^(−y F(X))), y ∈ {−1, +1}
Here F(X) = log[ P(Y = +1 | X) / P(Y = −1 | X) ] is the log-odds ratio at X, so F(X) can range from −∞ to +∞.

TreeNet and Binary Response
L̂({F(Xi)}) = Σ_{i=1}^{N} log(1 + e^(−yi F(Xi)))
1. Initial guess F0(X) = log[(1 + ȳ) / (1 − ȳ)]
2. FOR m = 1 TO M
3. gm ~ { yi / (1 + e^(yi Fm−1(Xi))) } = {ỹi}
4. Fit an L-node regression tree to the "residuals" ỹi computed above (see the interpretation below); this will partition the observations into L mutually exclusive groups
5. For each node: hm(Xi) = Σ_node ỹi / Σ_node |ỹi| (1 − |ỹi|)
6. Update: {Fm(Xi)} = {Fm−1(Xi)} + hm(Xi)

Interpretation
Put Y = +1 in focus and call p the probability that Y = +1; then pi = 1 / (1 + e^(−F(Xi))), and:
• Initial guess = log[overall response rate / (1 − overall response rate)]
• "Residual": ỹi = 1 − pi if yi = +1, and ỹi = −pi otherwise
• Update: hm(Xi) = (node response rate − Ave_node(p)) / Var, where Ave_node(p) = Σ_node pi / N_node and Var = Σ_node pi (1 − pi) / N_node
7.
END FOR © Copyright Salford Systems 2001-2002 node © Copyright Salford Systems 2001-2002 A Note on Mechanics Slowing the learn rate: “Shrinkage” The tree is grown to group observations into homogenous subsets  Once we have the right partition our update quantities for each terminal node are computed in a separate step  The update is not necessarily taken from the tree predictions  Important notion: tree is used to define a structure based on the split variables and split points  What we do with this partition may have nothing to do with the usual predictions generated by the trees   Up to this point we have guarded against overfitting by reducing the number of free parameters to be optimized  It is beneficial to slow down the learning rate by introducing the shrinkage parameter 0<ν<1 into the update step: {Fm(Xi)} = {Fm-1(Xi)} + ν hm(Xi) }  With a group of correlated variables, only one variable in the group might enter the model with ν=1, whereas with ν<1 several variables in the group may enter at the later steps. © Copyright Salford Systems 2001-2002 © Copyright Salford Systems 2001-2002 TreeNet Three-Node Trees Model: TreeNet Three-Node Trees Model: Learn Rate=1 Learn Rate= 0.1 +0.4 yes yes LSTAT<14.3 yes + +0.2 yes no –0.3 yes CRIM<8.2 no –5.2 yes + LSTAT>5.1 no +8.4 –0.8 + +1.4 no yes RM<6.8 no MV = 22.5 + RM<6.8 +13.7 no RM<7.4 yes –8.4 no MV = 22.5 + RM<6.8 yes +0.04 LSTAT<14.3 no +0.7 RM<7.4 yes –4.4 RM>6.8 +3.2 no yes no yes +2.1 + RM<6.8 no –0.3 no Adjustments are smaller and evolution of model differs © Copyright Salford Systems 2001-2002 © Copyright Salford Systems 2001-2002 6 The Australasian Data Mining Workshop +.02 LSTAT<14.8 –0.8 +1.1 The Australasian Data Mining Workshop Ignoring data far from the decision boundary in classification problems Stochastic Training Data  A further enhancement in performance is obtained by not allowing the learner to have access to all the training data at any one time  No a priori limit on the number of iterations so there is always plenty of opportunity to learn from all the data eventually By limiting the amount of data at any one iteration we reduce the probability that an erroneous data point will gain influence over the learning process In complete contrast to standard boosting in which problem data points are “locked onto” with steadily growing weight and influence    A further reduction in training data actually processed in any update occurs in classification problems  We ignore data points “too far” from the decision boundary to be usefully considered  JHF recommends 50% random sampling rate at any one iteration Correctly classified points are ignored (as in conventional boosting)  Badly misclassified data points are also ignored (very different from conventional boosting)  The focus is on the cases most difficult to classify correctly: those near the decision boundary  © Copyright Salford Systems 2001-2002 © Copyright Salford Systems 2001-2002 Decision Boundary Diagram A Simple TreeNet Run 2-dimensional predictor space  Red dots represent cases with +1 target  Green dots represent cases with –1 target  Black curve represents the decision boundary  Stop after the first tree No shrinkage Use 2-node trees only Least-Squares LOSS   BOSTON HOUSING DATA: Target is MV (median neighborhood home value) One Predictor: LSTAT (% residents low socio-economic status) © Copyright Salford Systems 2001-2002 © Copyright Salford Systems 2001-2002 Scatter Plot: MV vs. 
LSTAT TreeNet Predicted Response One-step model  A regression tree with 2 terminal nodes  50.00 Good Neighborhood RESP = 29.667 40.00 30.00 20.00 10.00 Bad Neighborhood RESP = 17.465 .00 10.00 20.00 LSTAT 30.00 40.00 LSTAT = 9.755 © Copyright Salford Systems 2001-2002 © Copyright Salford Systems 2001-2002 7 The Australasian Data Mining Workshop The Australasian Data Mining Workshop Identical results from CART Model   TreeNet Model with two 2-node Trees  CART run with TARGET=MV PREDICTORS=LSTAT LIMIT DEPTH=1 Save residuals as RES1 LSTAT < 4.475 RESP = 41.097  4.475 < LSTAT < 9.755 RESP = 28.684 LSTAT > 9.755 RESP = 17.465 Similar to a regression tree with 3 terminal nodes LSTAT is only predictor LSTAT > 9.755 RESP = 16.482 LSTAT < 9.755 RESP = 29.667 4.475 © Copyright Salford Systems 2001-2002 © Copyright Salford Systems 2001-2002 Equivalent Two-stage CART Run   Computing RESPONSE -1 CART run with First Run TARGET=RES1 Residuals PREDICTORS=LSTAT LIMIT DEPTH=1 Save residuals as RES2   LSTAT > 4.475 RESP = -0.983 LSTAT < 4.475 RESP = 11.430  These are within-node adjustments to the 1st run RESPONSE  © Copyright Salford Systems 2001-2002 1st CART Run produced:  IF LSTAT < 9.755 THEN RESP1 = 29.667  IF LSTAT > 9.755 THEN RESP1 = 17.465 2nd CART Run produced:  IF LSTAT < 4.475 THEN ADJUST = 11.430  IF LSTAT > 4.475 THEN ADJUST = -0.983 Combining two CART runs:  IF LSTAT < 4.475 THEN RESP2 = 29.667+11.430 = 41.097  IF 4.475< LSTAT< 9.755 THEN RESP2=29.667-0.983 = 28.684  IF LSTAT > 9.755 THEN RESP2 = 17.465 - 0.983 = 16.482 This is exactly what was reported by TreeNet © Copyright Salford Systems 2001-2002 Computing Response -2  9.755 TreeNet Run with 3 Trees This process can be schematically shown as  These cut-offs came from 1st and 2nd trees  Each tree in the sequence can be grown on the entire training data set. 
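The two-stage CART arithmetic above is exactly a stagewise least-squares boosting loop: fit a small tree to the residuals of the current model, add its node adjustments (optionally shrunk by a learning rate) to the running prediction, and repeat. The following is only an illustrative sketch of that loop using two-node stumps and numpy; it is not Salford's TreeNet implementation, which differs in loss functions, subsampling and node updates, and the toy LSTAT/MV values are invented.

import numpy as np

def fit_stump(x, residual):
    # Two-node regression "tree": choose the split on x that minimises the
    # sum of squared residuals; assumes x has at least two distinct values.
    best = None
    for t in np.unique(x)[:-1]:                      # candidate thresholds
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1], best[2], best[3]                 # threshold, left adj, right adj

def predict_stump(stump, x):
    t, left, right = stump
    return np.where(x <= t, left, right)

def boost(x, y, n_trees=3, learn_rate=1.0):
    # Stagewise function approximation: F0 = mean(y); each later stage models
    # the residuals of the model built so far and is never revisited.
    pred = np.full_like(y, y.mean(), dtype=float)
    stumps = []
    for _ in range(n_trees):
        stump = fit_stump(x, y - pred)
        pred += learn_rate * predict_stump(stump, x)
        stumps.append(stump)
    return y.mean(), stumps

# Toy usage loosely echoing the MV-versus-LSTAT example (values made up):
lstat = np.array([3.0, 4.0, 5.0, 9.0, 10.0, 15.0, 20.0, 30.0])
mv = np.array([44.0, 40.0, 33.0, 28.0, 23.0, 18.0, 15.0, 10.0])
f0, stumps = boost(lstat, mv, n_trees=3)

With learn_rate=1.0 the three stumps reproduce the kind of step-wise regions shown in the slides; a smaller learning rate shrinks each adjustment, as discussed under "Shrinkage".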
Unlike a decision tree we do not lose sample size as the learning progresses This new cut-off is due to the 3rd tree © Copyright Salford Systems 2001-2002 © Copyright Salford Systems 2001-2002 8 The Australasian Data Mining Workshop  Still just one predictor in model Now we obtain 4 regions The Australasian Data Mining Workshop Optimal Tree is identified by reference to test data performance TreeNet Runs with 4 and 9 Trees  A Treenet model can be evolved indefinitely    All model results refer by default to performance on test data    © Copyright Salford Systems 2001-2002 Want to be able to pick the “right-sized” model Although resistant to overfitting the model can overfit drastically in smaller data sets Require independent test sample Cross-validation methods not available (yet) For practical real time scoring may also want to select an overly small model © Copyright Salford Systems 2001-2002 TreeNet Run with 20 Trees TreeNet Run with 200 Trees Even though the optimal model is based on 200 trees, the learning actually stopped here Optimal model is based on 15 trees © Copyright Salford Systems 2001-2002 © Copyright Salford Systems 2001-2002 First 15 Runs (No Shrinkage) First 15 Runs (Shrinkage at .2) Optimal model after 15 cycles is too bumpy Optimal model after 15 cycles is smoother Starting Model (mean of MV) Starting Model (mean of MV) © Copyright Salford Systems 2001-2002 © Copyright Salford Systems 2001-2002 9 The Australasian Data Mining Workshop The Australasian Data Mining Workshop Interactions and TreeNet  TreeNet Run with 2 Predictors and 2 Trees At m-th cycle, TreeNet model can be represented by the following formula: m Fm (x ) = ∑ h(x;ai ) i =1 Here h(x; ai ) stands for individual tree at cycle i.  It now becomes clear that the order of interactions only depends on the complexity of individual terms in the sum above, therefore:  “Stumps” (each tree has only one split based on a single variable) always result an additive model  Trees with L terminal nodes may allow up to L-1 interactions    © Copyright Salford Systems 2001-2002 CART Run with 4 Nodes Stumps Produces Additive Model Jointly, 4 different regions are created: MV     14.43 22.33 30.81 38.68 First tree uses RM (Number of Rooms) Second tree uses LSTAT to update residuals © Copyright Salford Systems 2001-2002 The first split is the same as TreeNet #OBS 163 256 4 83 Small houses, bad neighborhood Small houses, good neighborhood Large houses, bad neighborhood Large houses, good neighborhood But these two splits are different => the model is no longer additive, RM and LSTAT interact Conclusion: CART model builds interactions This is an additive model: © Copyright Salford Systems 2001-2002 © Copyright Salford Systems 2001-2002 CART Run with 4 Nodes Using All 13 Available Predictors Again, 4 different regions are created: MV       14.98 23.12 30.98 41.21 The accuracy has increased nearly 2 times #OBS 174 245 41 46 Small houses, bad neighborhood Small houses, good neighborhood Large houses, bad neighborhood Large houses, good neighborhood The model is quite large Similar conclusions, but model is no longer additive Completely different counts but the sums within the first and the second pairs are the same as before © Copyright Salford Systems 2001-2002 © Copyright Salford Systems 2001-2002 10 The Australasian Data Mining Workshop The Australasian Data Mining Workshop Increase the Base Tree Size Reduce the Learning Rate to .5 Now using 5-node trees Smaller Model Moderate Overfit Better Accuracy Dramatic overfit Same accuracy 
Smaller model

Reduce the Learning Rate to .1
Larger Model, Small Overfit, Smooth curves

Classification Example
CELL Phone Data. RESPONSE: YES/NO to subscribe (YES 126, NO 704). PREDICTORS: COSTBUY, cost of the hand set (4 levels); COSTUSE, monthly charges (4 levels). A WEIGHT variable is added to account for the uneven distribution of responders and non-responders.

A Single CART Run
High price and high rate: poor response. Low price and low rate: good response.
Prediction Success: Overall Accuracy 63.734%

A Simple TreeNet Classification Model; Individual Contributions
Prediction Success: Overall Accuracy 64.109%

Now some live TreeNet runs. Official version available May 2002 from Salford Systems; send e-mail to support@salford-systems.com to request a copy.

References
• Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
• Breiman, L. (1996). Arcing classifiers (Technical Report). Berkeley: Statistics Department, University of California.
• Buntine, W. (1991). Learning classification trees. In D.J. Hand, ed., Artificial Intelligence Frontiers in Statistics, Chapman and Hall: London, 182-201.
• Dietterich, T. (1998). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40, 139-158.
• Freund, Y. & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In L. Saitta, ed., Machine Learning: Proceedings of the Thirteenth International Conference, Morgan Kaufmann, pp. 148-156.
• Friedman, J.H. (1999). Stochastic gradient boosting. Stanford: Statistics Department, Stanford University.
• Friedman, J.H. (1999). Greedy function approximation: a gradient boosting machine. Stanford: Statistics Department, Stanford University.
• Heath, D., Kasif, S., and Salzberg, S. (1993). k-dt: A multi-tree learning method. Proceedings of the Second International Workshop on Multistrategy Learning, 1002-1007, Morgan Kaufman: Chambery, France.
• Kwok, S., and Carter, C. (1990). Multiple decision trees. In Shachter, R., Levitt, T., Kanal, L., and Lemmer, J., eds., Uncertainty in Artificial Intelligence 4, North-Holland, 327-335.

Case study: Modelling Risk in Health Insurance - A Data Mining Approach
Inna Kolyshkina and Richard Brookes
PricewaterhouseCoopers, 201 Sussex Street, Sydney NSW 2000
inna.kolyshkina@au.pwcglobal.com, richard.brookes@au.pwcglobal.com

ABSTRACT
Interest in data mining techniques has been increasing recently amongst the actuaries and statisticians involved in the analysis of insurance data sets, which typically have a large number of both cases and variables. This paper discusses the main reasons for the increasing attractiveness of using data mining techniques in insurance. A case study is presented showing the application of data mining to a business problem that required modelling risk in health insurance, based on a project recently performed for a large Australian health insurance company by PricewaterhouseCoopers (Sydney). The data mining methods discussed in the case study include Classification and Regression Trees (CART), Multivariate Adaptive Regression Splines (MARS) and hybrid models that combined CART tree models with MARS and logistic regression. The non-commercially sensitive implementation issues are also discussed.

2. DATA MINING VERSUS LINEAR METHODS. MODELLING METHODOLOGIES USED: CART DECISION TREES, MARS AND HYBRID MODELS
The main reasons for the increasing popularity of data mining methods amongst the actuarial community can be briefly summarised as follows. Data mining relies on the intense use of computing power, which results in an exhaustive search for the important patterns, uncovering hidden structure even in large and complex data sets and, in many cases, a well-performing model. Also, unlike the more traditional linear methods, it does not assume that the response is distributed according to some specified distribution (an assumption which is often incorrect for real-life insurance data sets). In contrast, traditional methods take longer to develop models, and have particular trouble selecting important predictors and their interactions. Another very attractive feature of many data mining modelling methodologies is automatic "self-testing" of the model: a model is first built on a randomly selected portion of the data and then tested and further refined on the remaining data. Finally, most data mining methods allow the inclusion in the model of categorical predictors with a large number of categories, which are typically present in insurance data sets (for example, postcode, injury code, occupation code etc). Classical methods cannot deal with such variables effectively and, as a result, these variables are either left out of the model or have to be grouped by hand prior to inclusion.

Keywords: Data analysis in insurance, data mining, Classification and Regression Trees (CART), Multivariate Adaptive Regression Splines (MARS), hybrid models.

1. INTRODUCTION
In insurance, as in many other industries (health, telecommunications and banking, to name a few), the size of databases today often reaches terabytes. In a dataset like this, with millions of cases and hundreds of variables, finding important information is like finding the proverbial needle in a haystack. However, the need to extract such information is very real, and data mining is definitely a technique that can meet that need.
Each data mining technique has its advantages as well as its drawbacks. These are outside the scope of this paper, but are discussed in detail in the literature (for example, Vapnik (1996) and Hastie et al. (2001)). We were very aware of the importance of selecting the method of analysis that is best suited to a particular problem and, after an extended study of the available data mining techniques, we selected tree-based models and their hybrids for everyday modelling of insurance data. The reasons for this selection are as follows. Tree-based methods are very fast, require less data preparation than some other techniques, can more easily handle missing values or noisy data, are unaffected by outliers, and are easy to interpret.
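The "self-testing" idea mentioned above (build the model on a randomly selected portion of the data, then test it on the remainder) is simple to sketch. The data and the model below are invented purely for illustration; the model is just a group-mean predictor.

import numpy as np

rng = np.random.default_rng(0)

# Invented data: a categorical risk factor and an annual claim cost.
risk_band = rng.integers(0, 4, size=1000)
cost = rng.gamma(shape=2.0, scale=500.0, size=1000) * (1 + risk_band)

# Randomly hold out 30% of records for testing.
idx = rng.permutation(len(cost))
cut = int(0.7 * len(cost))
train, test = idx[:cut], idx[cut:]

# "Model": predicted cost = mean training cost within each risk band.
band_means = {b: cost[train][risk_band[train] == b].mean() for b in range(4)}
pred_test = np.array([band_means[b] for b in risk_band[test]])

# Evaluate on the held-out data only.
mae = np.abs(pred_test - cost[test]).mean()
print(f"hold-out mean absolute error: {mae:.1f}")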
Various data mining methodologies have been used in insurance for risk prediction and assessment, premium setting, fraud detection, health costs prediction, treatment management optimisation, investment management optimisation, and customer retention and acquisition strategies. In fact, a number of recent publications have examined the use of data mining methods in the insurance and actuarial environment (e.g. Francis, 2001; WorkCover NSW News, 2001). The main reasons for the increasing attractiveness of the data mining approach are that it is very fast computationally and that it overcomes some well-known shortcomings of traditional methods, such as generalised linear models, that are often used for data analysis in insurance. This paper gives an example of the application of data mining methodologies to modelling risk in insurance, based on a recent project completed by the PwC Actuarial (Sydney) data mining team for a large insurance company client. A useful feature of the software packages we used (CART® and MARS®) is that they are easy to implement in SAS, which is the main data analysis software package used by us as well as by the majority of our clients. We provide below brief introductions to the techniques we used, only complete enough to appreciate the outline of the modelling process we describe. A more detailed description of them can be found in the literature, as indicated in the individual sections.

2.3 Hybrid Models
The strengths of decision trees and "smooth" modelling techniques can be effectively combined. Steinberg and Cardell (1998a, 1998b) describe the methodology of such a combination, where the output of the CART model (in the form of a terminal node indicator, the predicted values, or the complete set of indicator dummies) is included among the other inputs to the "smooth" model. The resulting model is continuous and gives a unique predicted value for every record in the data. Typically, all strong effects are detected by the tree, and the "smooth" technique picks up the additional weak, in particular linear, effects. Combined, these small effects can very significantly improve the model performance (Steinberg and Cardell 1998a, 1998b).

2.1 Classification and Regression Trees (CART®)
The CART methodology is known as binary recursive partitioning (Breiman et al, 1984). It is binary because the process of modelling involves dividing the data set into exactly two subgroups (or "nodes") that are more homogeneous with respect to the response variable than the initial data set. It is recursive because the process is repeated for each of the resulting nodes. The resulting model is usually represented visually as a tree diagram. It divides all data into a set of several non-overlapping subgroups or nodes so that the estimate of the response is "close" to the actual value of the response within each node (Lewis et al, 1993). CART then ranks all the variables in order of importance, so that a relatively small number of predictors get a non-zero importance score. This means that it quickly selects the most important predictors out of many possible ones. The model is quickly built, is robust and is easily interpretable. However, like any decision tree, it is coarse in the sense that it predicts only a relatively small number of values and all cases within each node have the same predicted value. (A minimal sketch of this recursive splitting appears below.)
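The following sketch illustrates binary recursive partitioning for a numeric response: each split chooses the predictor and threshold that make the two child nodes most homogeneous (smallest total squared error around their means), and the process recurses on each child. It is an illustration only, not the CART® implementation, and it omits pruning, surrogate splits and categorical handling.

import numpy as np

def best_split(X, y):
    # Return (sse, feature, threshold) for the split minimising the
    # total within-node sum of squared errors.
    best = (np.inf, None, None)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[:-1]:
            mask = X[:, j] <= t
            left, right = y[mask], y[~mask]
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, t)
    return best

def grow(X, y, depth=0, max_depth=2, min_size=5):
    # Recursively partition; a node holds either a constant prediction
    # (the node mean) or a split plus two child nodes.
    if depth >= max_depth or len(y) < 2 * min_size or np.unique(y).size == 1:
        return {"predict": y.mean()}
    _, j, t = best_split(X, y)
    if j is None:
        return {"predict": y.mean()}
    mask = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left": grow(X[mask], y[mask], depth + 1, max_depth, min_size),
            "right": grow(X[~mask], y[~mask], depth + 1, max_depth, min_size)}

def predict_one(node, x):
    while "predict" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["predict"]

Every case falling into the same terminal node receives the same predicted value, which is exactly the coarseness discussed next.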
It also lacks smoothness: a small change in an input variable can lead to a large change in the predicted value. Another disadvantage of CART is that it is not particularly effective in modelling linear structure, and would build a large model to represent a simple relationship. Further details and discussion of decision trees and CART® can be found in the literature (Breiman et al, 1984; Hastie et al, 2001).

3. HEALTH INSURER CASE STUDY
3.1 Background
The methodology described above was successfully applied in a recent project completed for a major health insurance company client. It was used for creating a model of overall projected lifetime customer value. The model took into account many aspects influencing customer value, such as premium income, reinsurance, changes in the family situation of a customer (births, marriages, deaths and divorce), the probability of a membership lapse and transitions from one type of product to another. Each of these aspects, as well as hospital claim frequency and cost for the next year and ancillary claim frequency and cost for the next year, was modelled separately and the resulting models were combined into a complex customer lifetime value model. In this article we will discuss one of the sub-models, namely the model for hospital claim cost for the next year.

2.2 Multivariate Adaptive Regression Splines (MARS)
MARS is an adaptive procedure for regression, and can be viewed as a generalisation of stepwise linear regression or a generalisation of the recursive partitioning method that improves the latter's performance in the regression setting (Friedman, 1991; Hastie et al, 2001). The central idea in MARS is to formulate a modified recursive partitioning model as an additive model of functions from overlapping (instead of disjoint, as in recursive partitioning) subregions (Lewis et al, 1993).
The MARS procedure builds flexible regression models by fitting separate splines (or basis functions) to distinct intervals of the predictor variables. Both the variables to use and the end points of the intervals for each variable, referred to as knots, are found via an exhaustive search procedure, using very fast update algorithms and efficient program coding. Variables, knots and interactions are optimised simultaneously by evaluating a "loss of fit" (LOF) criterion; MARS chooses the LOF that most improves the model at each step. In addition to searching variables one by one, MARS also searches for interactions between variables, allowing any degree of interaction to be considered. The "optimal" MARS model is selected in a two-phase process. In the first phase, a model is grown by adding basis functions (new main effects, knots, or interactions) until an overly large model is found. In the second phase, basis functions are deleted in order of least contribution to the model until an optimal balance of bias and variance is found. (A rough sketch of the forward phase appears below.)
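The sketch below illustrates the forward phase just described, assuming simple mirrored hinge basis functions max(0, x - t) and max(0, t - x) and an exhaustive search over variables and candidate knots. Interactions and the backward deletion phase are omitted, and this is not the MARS® search or its fast update algorithms.

import numpy as np

def hinge_pair(x, knot):
    # Mirrored hinge basis functions for one variable and one knot.
    return np.maximum(0.0, x - knot), np.maximum(0.0, knot - x)

def forward_pass(X, y, max_terms=6):
    # Greedily add the hinge pair (variable, knot) that most reduces the
    # residual sum of squares of a least-squares fit.
    n = len(y)
    basis = [np.ones(n)]                         # intercept
    for _ in range(max_terms // 2):
        best = None
        for j in range(X.shape[1]):
            for knot in np.unique(X[:, j])[1:-1]:    # interior candidate knots
                h1, h2 = hinge_pair(X[:, j], knot)
                B = np.column_stack(basis + [h1, h2])
                coef, *_ = np.linalg.lstsq(B, y, rcond=None)
                rss = ((y - B @ coef) ** 2).sum()
                if best is None or rss < best[0]:
                    best = (rss, j, knot)
        if best is None:                         # no usable knots left
            break
        _, j, knot = best
        basis.extend(hinge_pair(X[:, j], knot))
    B = np.column_stack(basis)
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return basis, coef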
By allowing for any arbitrary shape for the response function as well as for interactions, and by using the two-phase model selection method, MARS is capable of reliably tracking very complex data structures that often hide in high-dimensional data (Salford Systems, 2002).

3.2 Data
3.2.1 Data description
De-identified data was available at a member level over a 3 year period. The model used information available over the first 2 years to fit a model based on outcomes over the last year. We excluded from the data those customers who lapsed prior to the end of the 3 year period or who joined the health insurer later than 3 years ago. This latter exclusion allowed us to avoid issues related to waiting periods and enabled us to use two years of data history in the modelling.
The data used for analysis can be grouped as follows. Firstly, there were demographic variables, such as age of the customer, gender and family status (about 30 variables). The second variable group was geographic and socio-economic variables, such as the location of the member's residence and socio-economic indices related to the geographic area of the member's residence, such as indices of education, occupation, and relative socio-economic advantage and disadvantage (about 80 variables). The third group of variables was related to membership and product details, such as duration of the membership and details of the hospital and ancillary product held at present as well as in the past (about 30 variables). The fourth group of variables was related to claim history (both ancillary and hospital), details of the medical diagnosis of the member, the number of hospital episodes and other services provided to the member in previous years, the number of claims in a particular calendar year, etc (about 100 variables). The fifth and last group of variables included such information as distribution channel, most common transaction channel, payment method, etc (about 50 variables). Overall there were about 300 variables.

3.2.2 Data preparation, cleaning and enrichment
The data underwent a rigorous checking and cleaning process. This was performed in close cooperation with the client, and any significant data issues or inconsistencies found were discussed with them. Among other things, such as statistical summaries and distribution analysis, the checking process involved exploratory analysis using CART, which was applied to identify any aberrant or unusual data groups.
Some of the variables in the original client data set were not directly used in the analysis. For example, instead of the date of joining we used the derived predictor "duration of membership". In other cases, if a predictor was described by the client as likely to contain unreliable or incorrect information, it was excluded from the analysis.
A number of variables included in the analysis were derived by us with the purpose of better describing customer behaviour. Examples are duration of membership and an indicator of whether or not the member had a hospital claim in previous years. Many such predictors, for example indicators of whether the member stayed in hospital for longer than one day and whether or not the services received were of a surgical or non-surgical nature, were created after consultation with clinical experts.
Other variables were added to the data from various sources such as the Australian Bureau of Statistics. These included a number of socioeconomic indices based on the member's residence, some related to broader geographic areas such as state, others more closely targeting the member's location, such as postcode-based indicators.

Figure 1. Gains chart for total expected hospital claims cost (cost ranked by predicted value): percentage of actual events captured in the top x% of members, compared with a random sample and the theoretical best.
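A gains chart such as Figure 1 can be computed directly from predicted and actual costs: rank members from highest to lowest predicted cost, then plot the cumulative share of actual cost captured against the fraction of members taken from the top. The arrays below are invented stand-ins for the model's hold-out predictions and actual costs; this is a minimal sketch, not the analysis reported in the paper.

import numpy as np

def gains_curve(actual, predicted):
    # Fraction of members taken from the top (by predicted cost) versus the
    # cumulative share of actual cost captured.
    order = np.argsort(-predicted)
    captured = np.cumsum(actual[order]) / actual.sum()
    frac = np.arange(1, len(actual) + 1) / len(actual)
    return frac, captured

# Toy usage with invented numbers:
rng = np.random.default_rng(1)
predicted = rng.gamma(2.0, 400.0, size=10_000)
actual = predicted * rng.lognormal(0.0, 0.8, size=10_000)   # noisy "true" costs

frac, captured = gains_curve(actual, predicted)
for top in (0.15, 0.30):
    k = int(top * len(actual)) - 1
    print(f"top {top:.0%} of members by predicted cost capture "
          f"{captured[k]:.0%} of total actual cost")

The "random sample" baseline in such a chart is the diagonal (the top x% of members captures x% of cost), and the "theoretical best" orders members by their actual cost.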
3.3 Modelling Methodology
First, we built a CART tree model. This served the purposes of exploration, gaining an appreciation of the data structure and selecting the most important predictors, and provided an easily interpretable diagram. The client found CART diagrams easy to understand and informative. To further refine the model, we then built a hybrid model of CART and MARS using the hybrid modelling methodology (Steinberg & Cardell, 1998a; Steinberg & Cardell, 1998b). This was achieved by including CART output, in the form of a categorical variable that assigned each record to one of the nodes according to the tree model, as one of the input variables into a MARS model. MARS, like CART
In some cases where we wanted to achieve an even higher degree of precision we built a "three-way hybrid model" combining CART®, MARS® and a linear model such as logistic regression or a generalised linear model. This was done by feeding MARS® output (in the form of basis functions created by MARS®) as inputs into a linear model.

3.4 Model diagnostic and evaluation
The main tools we used for model diagnostics were the gains chart and the analysis of actual versus predicted values for hospital cost. A further diagnostic of model performance is analysis of actual versus expected values of the probability of claim or claim cost. Such analysis can be represented pictorially by a bar chart of averaged actual and predicted values for overall annual hospital cost. This chart is shown in Figure 2. To create this chart, the members were ranked from highest to lowest in terms of predicted cost and then segmented into 20 equally sized groups. The average predicted and actual values of hospital cost for each group were then calculated and graphed.

Figure 2. The bar chart of averaged actual and predicted values for overall annual hospital cost, by percentile of predicted cost.

The chart suggests that the model fits well; however, it slightly over-predicts for the lower expected costs, but this was of little business importance for the client.

3.5 Findings and Results
3.5.1 Model Precision
The model achieved a high degree of precision, as is demonstrated by the actual versus predicted graph (Figure 2) and the gains chart (Figure 1) above. The gains chart for the overall hospital claims cost model presented in Figure 1 shows that we are able to predict the high cost claimants with a good degree of accuracy. As a rough guide, the overall claim frequency is 15%. Taking the 15% of members predicted as having the highest cost by the model, we end up with 56% of the total actual cost. Taking the top 30% of members predicted as having the highest cost by the model, we end up with almost 80% of the total actual cost.
3.5.2 Predictor importance for hospital claims cost
Predictors of the highest importance for overall hospital cost were age of the member, gender, number of hospital episodes and hospital days in the previous years, the type of cover and socio-economic characteristics of the member. Other important predictors included duration of membership, family status of the member, the type of cover that the member had in the previous year, previous medical history and the number of physiotherapy services received by the member in the previous year. The fact that the number of ancillary services (physiotherapy) affected hospital claims cost was a particularly interesting finding.
Details of the resulting model are commercially sensitive. However, we can state that many of the potential predictors given above were indeed significant to a degree greater than we had expected. For example, while some health insurance specialists argue that the only main risk driver for hospital claim cost is the age of the member, our results have demonstrated clearly that although age is among the important predictors of hospital claims cost, a large amount of variation is not explained by age alone. One way of showing this is by means of a graph of predicted cost by age, shown in Figure 3. If age were the most important predictor, with other predictors not adding much value, the graph would show values scattered closely around a single curve. The fact that it is scattered so widely shows that there are many other factors contributing significantly to predicted cost. Examples of such factors are socioeconomic indicators, type of hospital product and, for some age groups, the supply of hospitals in the location of the member's residence.

Figure 3. The graph of predicted hospital cost versus age.

We also built models for ancillary claims of various types, including optical, dental and physiotherapy claims. Unsurprisingly, the most important predictor of ancillary claims is the customer's previous claiming pattern. However, there are strong age-related effects (for instance the teenage orthodontic peak for dental claims), socio-economic effects and location effects.
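Section 3.3 above describes the hybrid construction: the CART terminal-node membership is turned into indicator (dummy) variables and fed, alongside other inputs, into a "smooth" model such as MARS or a logistic/linear regression. The following is only a sketch of that wiring, not the PwC model: it assumes a leaf_id array of terminal-node labels from an already-fitted tree (for example, the small tree sketched after Section 2.1) and uses a plain least-squares linear stage.

import numpy as np

def node_dummies(leaf_id):
    # One-hot indicator columns, one per terminal node of the fitted tree.
    nodes = np.unique(leaf_id)
    return (leaf_id[:, None] == nodes[None, :]).astype(float), nodes

def hybrid_fit(leaf_id, X_other, y):
    # Linear stage of a tree + linear hybrid: regress the target on the
    # terminal-node dummies plus the remaining (weak, e.g. linear) predictors.
    dummies, nodes = node_dummies(leaf_id)
    design = np.column_stack([dummies, X_other])   # dummies span the intercept
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, nodes

def hybrid_predict(coef, nodes, leaf_id, X_other):
    dummies = (leaf_id[:, None] == nodes[None, :]).astype(float)
    return np.column_stack([dummies, X_other]) @ coef

The division of labour matches Section 2.3: the tree contributes the strong, coarse effects through the node dummies, while the linear stage picks up the additional weak effects.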
The fact that the number of ancillary services (physiotherapy) affected hospital claims cost was a particularly interesting finding. 5. ACKNOWLEDGEMENTS Details of the resulting model are commercially sensitive. However, we can state that many of the potential predictors given above were indeed significant to a degree greater than we had expected. For example, while some health insurance specialists argue that the only main risk driver for hospital claim cost is age of the member, our results have demonstrated clearly that although age is among important predictors of hospital claims cost, a large amount of variation is not explained by age alone. One way of showing this is by means of a graph of predicted cost by age shown in Figure 3. If age were the most important predictor with other predictors not adding much value, the graph would show values scattered closely to a single curve. The fact that it is scattered so widely, shows that there are many other factors contributing significantly to predicted cost. Examples of such factors are socioeconomic indicators, type of hospital product and, for some age groups, the supply of hospitals in the location of the member’s residence. 6. REFERENCES We would like to thank Mr John Walsh (PricewaterhouseCoopers Actuarial, Sydney) for support, advice and thoughtful comments on the analysis. [1] Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth, Pacific Grove, CA. [2] Francis, L. (2001). Neural networks demystified. Casualty Actuarial Society Forum, Winter 2001, 252–319. [3] Haberman, S. and Renshaw, A. E. (1998). Actuarial applications of generalized linear models. In Hand, D. J. and Jacka, S. D. (eds). Statistics in Finance. Arnold, London. [4] Han , J., and Camber M. (2001) Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. [5] Hastie, T., Tibshirani R. and Friedman, J. (2001). The elements of statistical learning: Data Mining, Inference and prediction. Springer-Verlag, New York. [6] Lewis, P.A.W. and Stevens, J.G., “Nonlinear Modeling of Time Series using Multivariate Adaptive Regression Splines,” Journal of the American Statistical Association, 86, No. 416, 1991, pp. 864-867. [7] Lewis, P.A.W., Stevens, J., and Ray, B.K., “Modelling Time Series using Multivariate Adaptive Regression Splines (MARS),” in Time Series Prediction: Forecasting the Future and Understanding the Past, eds. Weigend, A. and Gershenfeld, N., Santa Fe Institute: Addison-Wesley, 1993, pp. 297-318. [8] McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models (2nd edition). Chapman and Hall, London. [9] Salford Systems (2000). CART® for Windows User’s Guide. Salford Systems [10]Salford Systems (2002). MARS® (Multivariate Adaptive Regression Splines) [On-line] http://www.salfordsystems.com, (accessed 08/10/2002). Figure 3. The graph of predicted hospital cost versus age. We also build models for ancillary claims of various types, including optical, dental and physiotherapy claims. Unsurprisingly, the most important predictor of ancillary claims is the customer’s previous claiming pattern. However, there are strong age related effects (for instance the teenage orthodontic peak for dental claims), socio-economic effects and location effects. [11]Smyth, G. (2002). Generalised linear modelling. [On-line] http://www.statsci.org/glm/index.html, (accessed 25/09/2002). [12]Steinberg, D. and Cardell, N. S. (1998a). Improving data mining with new hybrid methods. 
Investigative Profile Analysis with Computer Forensic Log Data using Attribute Generalisation

Tamas Abraham, Ryan Kling and Olivier de Vel
Information Networks Division, Defence Science and Technology Organisation, PO Box 1500, Edinburgh SA 5111, Australia
tamas.abraham@dsto.defence.gov.au, ryan.kling@dsto.defence.gov.au, olivier.devel@dsto.defence.gov.au

[The body of this paper is illegible in the source extraction; only its section structure is recoverable.]

ABSTRACT
1. INTRODUCTION
2. BACKGROUND TO INVESTIGATIVE PROFILING
2.1 Profiling with Generalisation
2.2 Concept Hierarchies
3. THE PROFILING PROCESS
3.1 Induction Algorithm
3.1.1 Parameters
3.1.2 Unbalanced-tree Ascension
3.1.3 Numerical Data
3.1.4 Vote Propagation
3.1.5 Ascension Algorithm
3.2 Profile Separation
3.2.1 Constant Separators
3.2.2 Using Line
3.2.3 Euclidean Distances
3.2.4 Rate of Change in Means
3.2.5 Closest Point
3.2.6 K-means Clustering
3.2.7 Regression
3.2.8 Separation in One Dimension
3.3 Element Distance Metrics
3.3.1 Metric Properties
3.3.2 Metric Strategy
4. DATA, EXPERIMENTAL METHODOLOGY AND RESULTS
4.1 Profile-to-outlier Experiments
4.2 Intra-Profile Experiments
5. CONCLUSIONS AND FUTURE DIRECTIONS
6. REFERENCES
Mining Antarctic scientific data: a case study

Ben Raymond and Eric J Woehler
Australian Antarctic Division, Kingston, Tasmania 7050
http://www-aadc.aad.gov.au
ben.raymond@aad.gov.au

ABSTRACT

The Australian Antarctic Data Centre is a web-accessible repository of freely-available Antarctic scientific data. The Data Centre seeks to increase the value and utility of its holdings through data mining analyses and research. We present and discuss analyses of an extensive spatial/temporal database of at-sea observations of seabirds and related physical environmental parameters. Mixture-model based clustering identified two communities of seabirds in the Prydz Bay region of East Antarctica, and characterised their spatial and temporal distributions. The relationships between observations of three seabird species and environmental parameters were explored using predictive logistic models. The parameters of these models were estimated using data from the Prydz Bay region. The generality of the models was tested by applying them to data from a different region (that adjacent to Australia's Casey station). This approach identified regional differences in the at-sea observations of seabird species. The results of these analyses complement those of at-sea studies of seabirds elsewhere around the Antarctic. They also provide insights into possible data errors that were not readily apparent from direct examination of the data. These analyses enhanced ecological understanding, provided feedback on survey strategy, and highlighted the utility of the repository.

Figure 1: Australian Antarctic research stations (•) and other locations mentioned in the text.

1. INTRODUCTION

The Australian Antarctic Data Centre (AADC) was established in 1995 to make scientific observations and results from Antarctica freely available. The free availability of data is one of Australia's obligations under the Antarctic Treaty (article III). The majority of the data collected in Antarctica, while originally collected for a specific investigation, nevertheless have wide potential relevance to other projects and investigators.
Many of the AADC's holdings are ecological or environmental in nature, and linkages between databases are extensive. We present an overview of the mining of the "Wildlife-on-Voyage" (WoV) database. This database holds an extensive collection of observations of wildlife (comprising birds, whales, and kelp) made from ships during Antarctic voyages. The information within this collection has wide scientific relevance. However, the data present numerous analytical challenges, including spatial and temporal variation (within and across years), missing values, and a lack of balance in sampling. We begin by describing the data and the methods that were used to collect them, and then present and discuss two investigations using these data. These investigations focused on the identification of communities of seabirds and the relationships of the birds with their environment.

The AADC plays an active role in the analysis of Antarctic scientific data by mining its holdings. The broad aim is to improve the value of these data to the Antarctic community. Several approaches are being taken, including:
• the direct application of mining and exploratory techniques in order to uncover new information from the data. These analyses are additional to those undertaken as a routine part of Antarctic scientific studies and aim to exploit the multi-disciplinary nature of the data held by the AADC;
• the extraction of actionable information from low-level scientific data. This has direct application to conservation, planning, and legislative activities, as well as producing "end-product" data suitable for use by other scientific investigators; and
• to generate a better understanding of the holdings of the AADC, including the identification of data errors, duplicated data, missing records, linkages between databases, and data acquisition procedures. This information has direct application for data management issues, such as maintaining high data quality and an efficient database structure.

2. DATA DESCRIPTION

The seabird component of the WoV database comprises approximately 140 000 observations of 119 species, made on 98 voyages conducted between 1980 and 2002. These voyages were undertaken in the course of Australia's Antarctic scientific research program. The majority of the voyages were for the transportation of personnel and supplies to Australian Antarctic bases (see Figure 1), with a small number of voyages for scientific surveys. While survey voyages attempted to maintain a balanced sampling strategy, the same was not true of the transportation voyages. Observations on these voyages were incidental, with little or no opportunity for balanced survey design.

Observations of wildlife were made in surveys of 10 minutes duration, with generally one survey made per hour of the voyage. Physical environmental data collected at the time of each survey included sea surface temperature (°C), sea state (or wave height, recorded on an ordinal scale), cloud cover (categorised as clear, partial, total, or blowing snow), wind force (Beaufort) and direction, and atmospheric pressure (hPa). Sea ice cover was also estimated but, as discussed below, alternative sea ice data derived from satellite images were used in the analyses.

Figure 2 shows the spatial and temporal distribution of the data. The most densely surveyed areas are clearly those adjacent to the Australian Antarctic stations. The temporal distribution of the observations is heavily biased against the winter months, because the extensive sea ice in the Antarctic during winter makes ship travel virtually impossible.

Figure 2: Spatial and temporal distribution of at-sea sightings of seabirds made from Australian Antarctic voyages, 1980-2002. Data from all years have been pooled. Densities are shown in cells of size 1° longitude × 1° latitude. The shade of grey denotes the number of surveys made in the cell (black = more than 30 surveys, white = no data).

3. PREPROCESSING

3.1 Data cleaning

Data cleaning and error checking consumed a large proportion of the time spent on this study. Prior to the 1992/1993 season all observations were recorded on paper forms and manually entered into the database. On voyages after this season a laptop-based entry system was used where possible, reducing the likelihood of errors in data transcription.
Rule-based techniques were used to detect violations of physical limitations: for example, sea surface temperature cannot be less than -1.8°C, the approximate freezing point of sea water. Similarly, the differences in time and position of consecutive observations were used to calculate an apparent ship speed, which was then compared to a maximum possible speed of 25 knots. There were instances in which either the time or position stamp of a data record was in error by one digit, suggesting an error during manual entry of the data. Position and time stamp errors were in general more easily identified using graphical methods, particularly where the errors were small (for example, transcription errors in the tenths-of-degrees digit).

The species diversities of Antarctic seabird communities are low. Except for very rare species, one could reasonably expect to encounter the same species from year to year in a given region. The identification of species for which there were very few observations in a region therefore proved to be a simple but effective mechanism of finding records that were likely to contain errors in species identification or data entry. For example, we found four observations of Australasian gannets in Prydz Bay (66°S), a species which is not normally found south of 50°S. Other likely errors in species identification were also identified during the community analyses (see section 4, below).

Observations were pooled into composite records for the analyses. The pooling was limited so that these composite records contained consecutive observations from a single voyage only, and spanned no more than 12 hours and a 50 km change in ship position. These composite records are referred to here as "sites", which is the usual nomenclature used in the ecological literature. The species composition of each site was compiled in presence/absence format, and the environmental variable values within a composite record were combined using a median (for continuous variables) or mode (for nominal or ordinal variables) operator. Errors were corrected using interpolation from surrounding values where possible, or patched using data from the marine science database (see below). In some cases there were insufficient data to allow interpolation: such entries were deleted from the data set.
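These rule-based checks can be sketched directly. The snippet below is a minimal illustration, assuming a pandas DataFrame with hypothetical column names (obs_time, lat, lon, sea_surface_temp); the thresholds are the ones quoted above.

```python
# Minimal sketch of the two rule-based checks: a physical-limit test on sea
# surface temperature, and an apparent-ship-speed test between consecutive
# observations. Column names are hypothetical.
import numpy as np
import pandas as pd

FREEZING_POINT_C = -1.8     # approximate freezing point of sea water
MAX_SPEED_KNOTS = 25.0      # maximum plausible ship speed

def flag_suspect_records(obs: pd.DataFrame) -> pd.DataFrame:
    obs = obs.sort_values("obs_time").reset_index(drop=True)

    # Physical-limit rule: sea surface temperature below freezing is an error.
    obs["sst_error"] = obs["sea_surface_temp"] < FREEZING_POINT_C

    # Apparent ship speed between consecutive observations (haversine distance).
    lat1, lat2 = np.radians(obs["lat"].shift()), np.radians(obs["lat"])
    dlat = lat2 - lat1
    dlon = np.radians(obs["lon"] - obs["lon"].shift())
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    dist_nm = 2 * 3440.1 * np.arcsin(np.sqrt(a))           # nautical miles
    hours = obs["obs_time"].diff().dt.total_seconds() / 3600.0
    obs["speed_error"] = (dist_nm / hours) > MAX_SPEED_KNOTS

    return obs

# Flagged rows would then be inspected graphically, corrected by interpolation,
# or patched from the marine science database as described in the text.
```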
3.2 Database linkages

The physical environmental variables in the WoV database (see section 2, above) provide a natural set of linkages to other databases both within and external to the AADC. Of particular interest is a marine science database that holds data collected from onboard sensors during Antarctic voyages. These data include various environmental variables including sea surface temperature, wind speed and direction, and solar radiation, as well as voyage information such as ship speed and position. Marine science data are available only from voyages of the Aurora Australis; the other ships used for Australian Antarctic scientific voyages do not have this real-time data logging system installed.

Other, external, environmental databases are also relevant to this study. For example, the National Snow and Ice Data Centre at the University of Colorado (http://nsidc.org) maintains a database of satellite-derived sea ice concentration data. This database holds daily Antarctic sea ice concentration data from 1978 onwards, on a spatial grid with a cell size of 25 km × 25 km. These sea ice data were used in preference to the directly-observed data, in order to avoid the potential bias of ship tracks to areas of open water (i.e. less sea ice).

4. COMMUNITY ANALYSIS

4.1 Motivation

A community can be defined as a group of species that share a habitat. Community analysis can offer a broad view of an ecosystem and allows species-level information to be abstracted and presented in a compact form. Such analyses are therefore of interest for management and conservation purposes, but may also be used to guide more specific scientific investigations of particular species or areas of interest. The concepts and techniques of community analysis are identical to those of market basket analysis in data mining (used in a transaction database context, for example, to identify products that tend to be purchased together).

4.2 Methods

The study area of interest was Prydz Bay, defined as that area of the Southern Ocean between 60°E and 90°E, and south of 60°S to the Antarctic continent (see Figure 1). Prydz Bay was chosen as it has been the focus of numerous studies of seabirds in their colonies [18]. Prydz Bay is the primary seabird breeding locality in East Antarctica, with breeding populations of nine species [18], comprising approximately 30% of the East Antarctic seabird biomass [16]. Furthermore, the WoV data coverage within Prydz Bay is relatively dense, as two of Australia's four permanent research stations (Davis and Mawson) are located along this sector of the Antarctic coastline.

Seabird communities were explored using two complementary cluster analyses. The first examined the clustering of sites based on species composition. The seabird communities were then generated from the species compositions of the resulting site clusters. The division of ecological data into discrete clusters can be problematic because in many cases the data do not show an inherently grouped structure. Rather, ecological data commonly form a continuum between extremes. The division of such a continuum into distinct entities does not necessarily lead to results that make intuitive sense. Soft clustering algorithms (also known as fuzzy, or probabilistic, clustering), which assign to each datum a membership level in each cluster, may therefore be preferable to "hard" clustering algorithms, which allocate each datum exclusively to a single cluster.

We applied a mixture-model approach [4; 13] to the clustering of sites by species composition. This is a soft clustering approach in which the data are modelled by a mixture of probability distributions, with each representing a different cluster. Since the species compositions were in binary (presence/absence) form, the Bernoulli distribution was the natural choice. Mixtures of multivariate Bernoulli distributions have been shown in theory to be non-identifiable [11]; however, in practice, interpretable results can still be obtained [6]. We used maximum-likelihood estimation by expectation-maximisation [9; 20]. Although we do not do so here, the mixture model approach also offers principled methods for the selection of the correct number of clusters [10]. This would be of interest in situations where a large number of cluster analyses were required, with little prior information available to guide the choice of number of clusters. Our choice of number of clusters was based on prior knowledge of the seabird communities along with expert assessment of the properties of the emergent clusters.
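A compact sketch of the expectation-maximisation fit for such a mixture of multivariate Bernoulli distributions is given below. It is an illustration of the approach only (the authors used an existing Matlab implementation, acknowledged at the end of the paper), and the binary site-by-species matrix here is random placeholder data.

```python
# EM sketch for a mixture of multivariate Bernoulli distributions, fitted to a
# binary sites-by-species matrix X (rows = sites, columns = species).
import numpy as np

def bernoulli_mixture_em(X, n_clusters=2, n_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(n_clusters, 1.0 / n_clusters)              # mixing weights
    theta = rng.uniform(0.25, 0.75, size=(n_clusters, d))   # P(species present | cluster)
    for _ in range(n_iter):
        # E-step: (log) responsibilities of each cluster for each site
        log_p = (X @ np.log(theta).T
                 + (1 - X) @ np.log(1 - theta).T
                 + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update mixing weights and Bernoulli parameters
        nk = resp.sum(axis=0)
        pi = nk / n
        theta = np.clip((resp.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
    return pi, theta, resp   # resp holds the soft cluster memberships of the sites

# Example with random presence/absence data standing in for the WoV sites:
X = (np.random.default_rng(1).random((500, 26)) < 0.3).astype(float)
pi, theta, membership = bernoulli_mixture_em(X, n_clusters=2)
```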
The species compositions of the seabird communities were assessed on the basis of the membership of each species to each cluster as well as the constancy of each species within each cluster. The constancy may be calculated as the fraction of sites from a cluster that contain an observation of the species in question. Species with a high membership-constancy product can be considered to be the "indicator" species of an assemblage [8]. Indicator species are useful for characterising the species composition of an assemblage where such an assemblage contains many species.
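The membership-constancy score can be computed directly from the clustering output. The sketch below assumes the site-by-species matrix X and the soft memberships from the mixture-model sketch above; since the text does not spell out how a species' membership is derived, the average site responsibility among sites containing the species is used here as a stand-in.

```python
# Sketch of the membership-constancy calculation for indicator species, using
# hypothetical inputs: X (sites x species, 0/1) and membership (sites x clusters).
import numpy as np

def indicator_scores(X, membership):
    n_clusters = membership.shape[1]
    hard = membership.argmax(axis=1)            # each site assigned to its dominant cluster
    scores = np.zeros((n_clusters, X.shape[1]))
    for k in range(n_clusters):
        sites_k = X[hard == k]
        constancy = sites_k.mean(axis=0)        # fraction of cluster-k sites containing the species
        # Stand-in for species membership: mean responsibility of sites containing the species.
        weights = membership[:, k][:, None] * X
        species_membership = weights.sum(axis=0) / np.maximum(X.sum(axis=0), 1)
        scores[k] = species_membership * constancy
    return scores                                # high values flag candidate indicator species

scores = indicator_scores(X, membership)
```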
The second cluster analysis grouped seabirds according to their spatio-temporal ranges. In this approach, species that were observed in the same region of the ocean at the same time are grouped together. This yields the seabird communities directly. Dissimilarities between species ranges were calculated using the TwoStep algorithm [2] and the clustering was computed using a hierarchical complete-linkage algorithm. A hierarchical clustering is more natural in this case because the number of entities is small (26 species within Prydz Bay) and the hierarchy of the dendrogram is itself of interest. Seabird communities identified using this approach are referred to here as "associations". The communities identified by the mixture-model clustering described earlier will be referred to as "assemblages" in order to differentiate the two approaches.

4.3 Results and Discussion

The clustering of sites by species composition revealed a two-group structure in the seabird assemblages. The spatial and temporal distributions of these assemblages (all years combined) are shown in Figure 3, and the species composition of the two assemblages is shown in Figure 4. Assemblage 1 contains all nine species that breed in Prydz Bay in addition to sub-Antarctic skuas, arctic and Antarctic terns, and northern giant petrels. This assemblage was observed close to the Antarctic coast during the middle of the breeding season (January-March, Figure 3). Assemblage 2 contains the remaining 12 species, all of which breed in temperate or sub-Antarctic latitudes and forage within Prydz Bay during the southern hemisphere summer. This assemblage was observed during the summer months (December-March), offshore from the Prydz Bay coast. The spatio-temporal ranges of the two assemblages overlap, as can be seen from the mid-grey cells in Figure 3. This overlap is handled transparently by a soft clustering algorithm, because sites which host both assemblages at the same time will have a non-zero membership to both assemblages. In contrast, a hard clustering algorithm would assign such a site exclusively to one of the two assemblages. The overlap is more readily observed using the soft clustering approach. Increasing the number of clusters to three placed this overlap into its own cluster, further highlighting this finding.

Figure 3: Spatial and temporal distribution of two assemblages of seabirds in the Prydz Bay region of Antarctica. The species composition of each assemblage is shown in Figure 4.

Figure 4: Membership of 26 seabird species to the two assemblages shown in Figure 3. (R) indicates the species that breed in Antarctic locations. Indicator species (see text) are marked with an asterisk.

Indicator species are marked on Figure 4 with an asterisk. Two species (cape petrels and Wilson's storm petrels) were found to be indicator species in both assemblages. This suggests that their at-sea distributions were quite broad, whereas the other breeding species were generally observed only in relative proximity to the Prydz Bay coast (particularly during the middle of the breeding season; see the distribution of assemblage 1 in Figure 3). This difference is a result of the fact that these two species breed both on the Prydz Bay coast as well as at sub-Antarctic locations such as Heard Island (which lies to the north of Prydz Bay; see Figure 1). Thus, individuals observed offshore from the Prydz Bay coast are probably those breeding on Heard Island. The only other species that breeds both in Prydz Bay and on Heard Island is the southern giant petrel.

The hierarchical clustering of species by spatio-temporal range is shown in Figure 5. Cutting the dendrogram at a relatively high dissimilarity level yields two seabird associations (marked as (a) and (b) on the figure) that are identical to the two assemblages shown in Figure 4. Association (a) may be further split into (a1) and (a2). Sub-association (a1) contains southern giant petrels, cape petrels, Wilson's storm petrels and arctic terns: three of these are the species that breed both on the Prydz Bay coast and on Heard Island. Their at-sea distributions are therefore different from the distributions of the remainder of the breeding species. This finding reinforces that obtained from the first cluster analysis, discussed above.

Figure 5: Dendrogram of seabird species, clustered according to similarity of spatio-temporal range. (R) indicates species that breed in Antarctic locations. Groupings within the dendrogram labelled (a), (b), etc. are discussed in the text.

As well as providing direct community information, these analyses yielded additional information relating to issues of species identification. Antarctic terns, arctic terns, and their composite (used when specific identification at sea was not possible) are grouped in the same assemblage in Figure 4, and relatively tightly in Figure 5. However, the behaviours of the two species are quite different. Arctic terns breed in the northern hemisphere and migrate to Prydz Bay in the southern hemisphere summer to feed. Antarctic terns breed on sub-Antarctic islands (such as Heard Island) during the summer and migrate north to South Africa during the winter. Antarctic terns breeding on Heard Island feed inshore and do not venture far from land. Thus, our clustering results suggest that at least some of the records of Antarctic terns in Prydz Bay may in fact be arctic terns that have been misidentified. Detailed examination of the distributions of these records would be needed to identify which are likely to be in error. Similarly, northern giant petrels were clustered together with the resident species. Northern giant petrels are a migratory species that are generally found in the northern regions of Prydz Bay [17]. Examination of northern giant petrel records revealed that on one particular voyage, a high number of unlikely northern giant petrel sightings were recorded in the southern part of Prydz Bay. It is possible that these were misidentified southern giant petrels.
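The second, hierarchical analysis lends itself to a generic sketch using SciPy. A Jaccard distance between each species' presence pattern across space-time cells is used here as a stand-in for the TwoStep dissimilarity of [2], and the presence matrix is random placeholder data; only the complete-linkage step mirrors the paper.

```python
# Sketch of clustering species by the similarity of their spatio-temporal ranges.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
# rows = species, columns = (region x month) cells, entries = presence/absence
species_by_cell = (rng.random((26, 120)) < 0.2).astype(bool)

dist = pdist(species_by_cell, metric="jaccard")       # pairwise range dissimilarities
tree = linkage(dist, method="complete")               # complete-linkage clustering
associations = fcluster(tree, t=2, criterion="maxclust")   # cut the tree into two associations
# scipy.cluster.hierarchy.dendrogram(tree) would produce a figure analogous to Figure 5.
```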
5. ENVIRONMENTAL RELATIONSHIPS

5.1 Motivation

The community analyses described above provide a foundation for investigating the relationships between seabirds and their environment. A proper understanding of these relationships is vital for an understanding of the seabirds and the region, and for planning, management, and legislative purposes. Characterising the dependence of the birds on their environment is one of the first steps in assessing the likely impact of global climate change on southern ocean seabirds. It has been suggested [14] that the initial effects of global climate change may be most pronounced in sub-Antarctic regions.

We investigated the use of predictive models as a means of investigating the relationships between seabird observations and environment. Seasonal behaviour and the response to environment are likely to differ among the species within a community, leading to dynamic community compositions. Furthermore, neighbouring communities are not disjoint but rather overlap at the edges of their ranges [19]. The models were therefore built using species-level data rather than community-level data. Given predictions of individual species ranges, it would be a straightforward matter to combine these into community-level predictions if desired.

The ability to successfully predict the at-sea distributions of seabirds from environmental parameters would be extremely valuable. At-sea survey data for much of the world's oceans are limited due to the logistic difficulties and costs involved. Predictive models that use remotely-sensed environmental data may allow the estimation of seabird distribution in those areas of the ocean not amenable to direct survey.

5.2 Methods

Seabird observations from two different areas were used. Observations from Prydz Bay were used to build and test the models. These models were then applied to data from the Casey station region in order to test the generality of the models. The delineation of Prydz Bay was the same as in section 4, above, except that the northern boundary was extended to 50°S. This extension includes Heard Island (53°5'S, 73°30'E, an important seabird breeding area) in the study. The Casey station region was delimited to the area between 100°E and 120°E, and south of 50°S to the Antarctic continent (see Figure 2). There is no northern land mass equivalent to Heard Island in the Casey station area.

These geographical areas were divided into grids of spatial bins, each spanning 2° longitude by 2° latitude. We assumed that the relationships between bird observations and the physical environment remain constant among years; therefore, data from all years were pooled. However, these relationships do vary with time of year, as the bird behaviour is driven by differing processes throughout the season. For each species studied here we have therefore fitted a temporal sequence of models. Each model spanned a 30 day time period and consecutive models overlapped by 15 days.

We present the results of three seabird species: snow petrels (Pagodroma nivea, which breed in Antarctic localities including Prydz Bay and the Casey station coastal regions), cape petrels (Daption capense, which breed in the Antarctic as well as in sub-Antarctic localities), and white-chinned petrels (Procellaria aequinoctialis, which breed on islands in temperate latitudes and forage in Antarctic waters during the southern hemisphere summer). These three species were the three most commonly observed in each of the three breeding categories described above.
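The spatial and temporal binning can be sketched as follows; the DataFrame and its column names are hypothetical, while the cell size, window length and overlap are those stated above.

```python
# Sketch of the 2-degree x 2-degree spatial bins and the sequence of 30-day
# model windows overlapping by 15 days.
import pandas as pd

def add_spatial_bins(obs: pd.DataFrame, cell_deg: float = 2.0) -> pd.DataFrame:
    obs = obs.copy()
    obs["lon_bin"] = (obs["lon"] // cell_deg) * cell_deg
    obs["lat_bin"] = (obs["lat"] // cell_deg) * cell_deg
    return obs

def season_windows(start: str, end: str, length_days: int = 30, step_days: int = 15):
    """Yield (window_start, window_end) pairs covering the season."""
    for s in pd.date_range(start, end, freq=f"{step_days}D"):
        yield s, s + pd.Timedelta(days=length_days)

# for w_start, w_end in season_windows("2000-10-01", "2001-03-31"):
#     window = obs[(obs["obs_time"] >= w_start) & (obs["obs_time"] < w_end)]
#     ...fit one logistic model per species on this window's binned data
```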
The species compositions of the bins were again compiled in presence/absence format. Logistic regressions were used to relate the distributions of bird observations to four parameters of the physical environment: sea surface temperature (°C), sea state (ordinal scale), sea ice concentration (percent), and distance to coast (km). The model accuracies were assessed using the mean square prediction errors (MSE). For models using the Prydz Bay data, MSE was assessed using cross-validation by voyage: that is, data from half of all available voyages (chosen at random) were used to estimate the model parameters. Data from the remaining voyages were used to assess the model accuracy. Cross-validation is a widely used method of obtaining estimates of model accuracy, particularly when data are limited [5; 15]. All MSE values are presented with reference to the null error rate. This is the mean square prediction error that is obtained with a constant model and reflects the prevalence of the species in question. Any model that fails to predict more accurately than the null is no better than uninformed guessing.
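A minimal sketch of this evaluation scheme is shown below: a logistic regression on the four environmental predictors, scored by MSE on voyages held out of the fit, and compared with the null (constant) model. The DataFrame `bins` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupShuffleSplit

ENV_VARS = ["sea_surface_temp", "sea_state", "sea_ice_conc", "coast_dist_km"]

def voyage_cv_mse(bins: pd.DataFrame, species_col: str,
                  predictors=ENV_VARS, seed: int = 0):
    X, y, voyages = bins[list(predictors)], bins[species_col], bins["voyage_id"]
    # Half of the voyages (chosen at random) train the model; the rest score it.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=seed)
    train, test = next(splitter.split(X, y, groups=voyages))

    model = LogisticRegression(max_iter=1000).fit(X.iloc[train], y.iloc[train])
    p = model.predict_proba(X.iloc[test])[:, 1]
    mse = np.mean((y.iloc[test] - p) ** 2)

    # Null model: predict the training prevalence of the species everywhere.
    null_mse = np.mean((y.iloc[test] - y.iloc[train].mean()) ** 2)
    return mse, null_mse        # a useful model needs mse < null_mse
```

Repeating the same fit with a single predictor at a time, and comparing the resulting MSE with that of the full four-variable model, gives the variable-importance assessment described below.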
The standard logistic regression assumes that the data are independent. When data are spatial in nature, this assumption is often violated because observations from one location are likely to be similar to observations from nearby locations. This self-similarity is known as spatial autocorrelation [7]. Spatial autocorrelation can often be exploited to improve the predictive accuracy of models. Accordingly, we also applied the spatial autologistic model [1; 3]. This is an extension of the logistic model that explicitly models the spatial autocorrelation of the observations. The estimation of the parameters of the spatial autologistic model is problematic and requires approximate maximum likelihood techniques (see e.g. [12] for a discussion of the estimation of such models). We used a Markov chain Monte Carlo implementation provided by LeSage [12].

The relative importance of each environmental variable in predicting the distribution of observations of each species was assessed. This was achieved by building a model using only a single environmental variable as a predictor. The cross-validation predictive accuracy of this model was compared to the best predictive accuracy obtained using all four predictor variables.
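The essential ingredient of the autologistic extension is an autocovariate: a summary of the response in neighbouring cells that enters the model as an additional predictor. The sketch below shows one common construction of that term on a gridded presence/absence layer; it is a simplified stand-in and not the MCMC estimation of [12] used in the paper.

```python
# Sketch of an autocovariate: the mean presence in the 8 surrounding grid cells,
# to be appended to the environmental predictors of each cell's logistic model.
import numpy as np
from scipy.ndimage import uniform_filter

rng = np.random.default_rng(3)
presence = (rng.random((20, 30)) < 0.3).astype(float)   # presence/absence on a lat-lon grid

# Mean over the 3x3 neighbourhood, excluding the central cell itself
# (cells outside the grid are treated as absences via zero padding).
neigh_sum = uniform_filter(presence, size=3, mode="constant") * 9 - presence
autocovariate = neigh_sum / 8.0

# autocovariate.ravel() would then be added as an extra column of predictors.
```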
5.3 Results and Discussion

The predictive accuracies of the logistic models are presented in Figure 6. For observations of snow petrels, good predictive accuracies (MSE significantly less than the null rate) were obtained for the entire summer breeding season in both the Prydz Bay and Casey station regions. The model for cape petrel observations was generally adequate in the latter half of the season in both regions. The model for white-chinned petrel observations was better than the null for the majority of the season in Prydz Bay, but was no better than the null during the latter half of the season in the Casey station region.

Figure 6: Mean square prediction error (MSE) of logistic models of at-sea observations of three species of seabird in two areas of the Antarctic (Prydz Bay and the Casey station region). The solid line is the logistic model error and the dot-dash line is the null error rate. A filled circle indicates that the logistic MSE is significantly less than the null error at that time (p<0.05, Wilcoxon paired sample test). SNPE=snow petrels, CAPE=cape petrels, WCPE=white-chinned petrels.

For snow and cape petrels, the model performance in the Casey station region was similar to that obtained using the Prydz Bay data (Figure 6). We can therefore conclude that the processes linking bird observations and physical environment are similar in the two areas. The same was not true of white-chinned petrels. During the latter half of the season the model error was less than that of the null in Prydz Bay, but not in the Casey station region. This result draws attention to the fact that there are differences in the processes linking environment with observations of these birds in the two regions.

The importances of each of the environmental variables in predicting the observations of these three species are illustrated in Figure 7. From late October until approximately January, the most important predictor variables for snow and white-chinned petrels were sea ice concentration and sea state. Sea ice and sea state are co-variates: heavy sea ice will prevent high sea states (wave heights). Snow petrels are known to be an ice-associated species and this is reflected in the positive sign of the model coefficient (marked on Figure 7). The reverse is true of white-chinned petrels. The corresponding variable importances for cape petrels are not relevant because the model was not accurate during this period.

During the latter half of the season the most important predictor variables were sea surface temperature and distance to coast. The model parameter for distance to coast was negative for snow and cape petrels, indicating that these species were observed close to the coast. This matches the known behaviour of the birds: during this time of the breeding season adult birds are feeding the newly hatched chicks and thus forage predominantly close to the coastal colonies.

Figure 7: The two most important predictor variables (primary, Pri., and secondary, Sec.) for logistic models of at-sea observations of three species of seabirds. A positive sign indicates that the association was positive (i.e. observations were more likely with increasing values of the environmental variable). SNPE=snow petrels, CAPE=cape petrels, WCPE=white-chinned petrels; SEATEMP=sea surface temperature, SEAICE=sea ice concentration, SEASTATE=sea state (wave height), COASTDIST=distance to nearest coast.

The autologistic model did not provide substantially better predictive accuracy than the standard logistic model (results not shown). This suggests that the spatial variation in the observations was adequately modelled by the spatial variation in the environmental predictor variables. The additional computational demands of the autologistic model are therefore not justified in this application.

Although in this study we relied on direct observations of sea state and sea surface temperature, these environmental variables may both be estimated using remote sensing technology. The models developed here could therefore potentially be used to estimate at-sea distributions of seabirds in other regions of the Antarctic. Regional differences in the breeding distributions of seabird species would need to be addressed.

6. DISCUSSION AND CONCLUSIONS

The collection of data from polar regions is an expensive and difficult process. Such data are often noisy or incomplete, and analyses using conventional statistical (hypothesis-testing) techniques can be extremely difficult. Data mining and exploratory techniques may allow insights into trends and anomalies to be obtained. The relevance of such findings extends beyond intrinsic scientific interest into fields such as conservation and planning. Polar science plays a key role in matters of global importance, including species conservation and global climate change. There are therefore social, scientific, and economic obligations to make the best possible use of Antarctic scientific data.

The investigations presented here used data mining techniques to obtain results of ecological relevance, such as the structures of seabird communities and the relationships of seabird observations with the physical environment. The techniques and findings also addressed matters of data management. Errors in data, which are often difficult to detect through direct inspection, may become apparent in the results of the analyses. This was illustrated by the potential errors in Antarctic tern and northern giant petrel records discussed in section 4.

Spatial considerations are often of concern when dealing with ecological data. In our models of seabird observations and physical environmental parameters, the predictive ability of spatial autologistic models was found to be no better than that of ordinary logistic models (in which spatial autocorrelation is ignored). The additional computational cost of the spatial autologistic model (we used a computationally intensive Markov chain Monte Carlo implementation) is therefore not justified in this application.
Acknowledgments

The authors would like to thank L Belbin and M Riddle for their ongoing support, and all observers who have recorded at-sea observations over the past 22 years. G Cruickshank, C Hodges, B Priest, and F Spruzen entered much of the data in preparation for the analyses. D Watts constructed and maintains the WoV database. Various freely-available Matlab toolboxes were used: the m_map mapping toolbox (Rich Pawlowicz, http://www2.ocgy.ubc.ca/~rich/), the Econometrics toolbox (James P. LeSage, http://www.spatialeconometrics.com), and ML estimation of mixtures of multivariate Bernoulli distributions (Miguel Á. Carreira-Perpiñán, http://cns.georgetown.edu/~miguel/).

7. REFERENCES

[1] N. Augustin, M. Mugglestone, and S. Buckland. An autologistic model for the spatial distribution of wildlife. J Appl Ecol, 33:339–347, 1996.
[2] M. Austin and L. Belbin. A new approach to the species classification problem in floristic analysis. Aust J Ecol, 7:75–89, 1982.
[3] J. Besag. Spatial interaction and the statistical analysis of lattice systems. J Roy Sta B, 36(2):192–236, 1974.
[4] C. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.
[5] K. Burnham and D. Anderson. Model selection and inference. Springer-Verlag, 1998.
[6] M. Carreira-Perpiñán and S. Renals. Practical identifiability of finite mixtures of multivariate Bernoulli distributions. Neural Comp, 12(1):141–152, 2000.
[7] N. Cressie. Statistics for spatial data, revised edition. Wiley, 1993.
[8] M. Dufrêne and P. Legendre. Species assemblages and indicator species: the need for a flexible asymmetric approach. Ecol Monogr, 67:345–366, 1997.
[9] B. Everitt and D. Hand. Finite Mixture Distributions. Monographs on Statistics and Applied Probability. Chapman & Hall, 1981.
[10] C. Fraley and A. Raftery. How many clusters? Which clustering method? Answers via model-based cluster analysis. Computer J, 41:578–588, 1998.
[11] M. Gyllenberg, T. Koski, E. Reilink, and M. Verlaan. Non-uniqueness in probabilistic numerical identification of bacteria. J Appl Prob, 31:542–548, 1994.
[12] J. LeSage. Bayesian estimation of limited dependent variable spatial autoregressive models. Geogr Anal, 32(1):19–35, 2000. http://www.spatialeconometrics.com.
[13] G. McLachlan and K. Basford. Mixture models: inference and applications to clustering. Marcel Dekker, Inc., New York, USA, 1988.
[14] R. Smith, D. Ainley, K. Baker, E. Domack, S. Emslie, B. Fraser, J. Kennett, A. Leveter, E. Mosley-Thompson, S. Stammerjohn, and M. Vernet. Marine ecosystem sensitivity to climate change. BioScience, 49(5):393–404, 1999.
[15] M. Stone. Cross-validatory choice and assessment of statistical predictions (with discussion). Biometrika, 64:29–35, 1974.
[16] E. Woehler. The distribution of seabird biomass in the Australian Antarctic Territory: implications for conservation. Envir Cons, 17:256–261, 1990.
[17] E. Woehler, C. Hodges, and D. Watts. An atlas of the pelagic distribution and abundance of seabirds in the southern Indian Ocean, 1981 to 1990, volume 77 of ANARE Research Notes. Australian Antarctic Division, Tasmania, 1990.
[18] E. Woehler and G. Johnstone. Status and conservation of the seabirds of the Australian Antarctic Territory. In J. Croxall, editor, Seabird status and conservation: a supplement, pages 279–308. ICBP, Cambridge, 1991.
[19] E. Woehler, B. Raymond, and D. Watts. Decadal-scale seabird assemblages in Prydz Bay, East Antarctica. Mar Ecol Prog Ser (submitted), 2002.
[20] J. Wolfe. Pattern clustering by multivariate mixture analysis. Multiv Be R, 5:329–350, 1970.
Woehler, C. Hodges, and D. Watts. An atlas of the pelagic distribution and abundance of seabirds in the southern Indian Ocean, 1981 to 1990, volume 77 of ANARE Research Notes. Australian Antarctic Division, Tasmania, 1990. [18] E. Woehler and G. Johnstone. Status and conservation of the seabirds of the Australian Antarctic Territory. In J. Croxall, editor, Seabird status and conservation: a supplement, pages 279–308. ICBP Cambridge, 1991. [19] E. Woehler, B. Raymond, and D. Watts. Decadal-scale seabird assemblages in Prydz Bay, East Antarctica. Mar Ecol Prog Ser (submitted), 2002. [20] J. Wolfe. Pattern clustering by multivariate mixture analysis. Multiv Be R, 5:329–350, 1970. The Australasian Data Mining Workshop c Copyright °2002 36 Combining Data Mining and Artificial Neural Networks for Decision Support Sérgio Viademonte Frada Burstein School of Information Management and Systems Monash University PO 197 Caulfield East 3145 Victoria, Australia School of Information Management and Systems Monash University PO 197 Caulfield East 3145 Victoria, Australia sergio.viademonte@sims.monash.edu.au frada.burstein@sims.monash.edu.au systems [4, 19]. One possible approach for the knowledge acquisition problem is to automatically induce expert knowledge directly from raw data [8]. However, this approach brings additional problems as the amount and diversity of data increases and demands specific attention. In this research project, data mining technology is applied to build knowledge from data, specifically inducing domain knowledge from raw data and also ensuring data quality. ABSTRACT This paper describes an ongoing research project concerned with the application of data mining (DM) in the context of Decision Support. Specifically, this project combines data mining and artificial neural networks (ANN) in a computational model for decision support. Data mining is applied to automatically induce expert knowledge from the historical data and incorporate it into the decision model. The resulting knowledge is represented as sets of knowledge rule bases. An ANN model is introduced to implement learning and reasoning within the proposed computational model. The proposed computational model is applied in the domain of aviation weather forecasting. The paper describes the proposed decision support model, introduces the data pre-processing activities and the data mining approach, the data models used to generate the knowledge rule bases, and their integration with the ANN system. The paper presents evaluation of the performance of the proposed approach and some discussion of further directions in this research. In the context of this project, the preprocessed sets of raw data used as input in the data mining algorithm are named data models. The knowledge obtained as a result of data mining experiments is termed knowledge models. An artificial neural network (ANN) system provides an interface for the user decision-makers to test and validate hypotheses about the specific application domain. The ANN system learns about the problem domain through the knowledge models, used as training sets. Section 2 presents the proposed computational model for decision support; section 3 discusses the knowledge discovery process, specifically the data mining phase, and the data and knowledge models. Section 4 presents the applied artificial neural network system. Section 5 discusses the decision support model evaluation and some achieved results, section 6 presents some comments and conclusions. Keywords: 2. 
A MODEL FOR DECISION SUPPORT BASED ON DM AND ANN Artificial neural networks, data mining, decision support, forecasting The purpose of the proposed computational model is to support decision-making by recalling past facts and decisions, hence inducing “chunks” of domain knowledge from past information and performing reasoning upon this knowledge in order to verify hypotheses and reach conclusions in a given situation. The proposed model creates an interactive software environment that uses data mining technology to automatically induce domain knowledge from historical raw data, and an ANN based system as a core for an advisory mechanism (see Figure 1). 1. INTRODUCTION This paper presents a computational model for decision support based on a combination of data mining and artificial neural network technologies. The proposed computational model has been applied in the domain of aviation weather forecasting, specifically, identifying fog phenomenon at airport terminals. Weather forecasts are based on a collection of weather observations describing the state of the atmosphere, such as precipitation levels, wind direction and velocity, dew point depression, etc [1, 10]. Access to the past decision situations and knowledge derived from them can provide valuable source of improvement in forecasting rare events, such as fog. Complexity and diversity of the weather observations and large variation in the patterns of weather phenomenon occurrences implies serious problems for forecasters trying to come up with correlation models. Consequently, the area is a potential candidate for KDD purposes [21]. The decision support model comprises a database (ideally a datawarehouse), case bases and knowledge rule bases. The database contains raw data from the application domain, in the case of this research project, historical weather observations. The case base contains selected instances of relevant cases from the specific problem at hand. In this project, each case represents a past occurrence and consists of a set of feature/value pairs and a class in which the case belongs. In this research project several case bases were generated and used as input data in the data mining algorithm. The case bases are named mining data sets in this research project [20]. Computational tools for decision support usually incorporate expert knowledge of domain experts together with specific explicit domain knowledge, e.g., factual knowledge. Early attempts in building expert systems revealed the difficulties of capture, represent and incorporate expert knowledge in those Knowledge rule bases are built based on the data mining results; they contain structured knowledge that corresponds to relevant relations found (mined) in the case bases. Several rule bases The Australasian Data Mining Workshop Copyright  2002 37 The Australasian Data Mining Workshop concerned with occurrence of fog phenomenon regardless of whether it is a local fog or not. For this reason all the Fog Type instances with “LF” value were transformed in “F” value, meaning a fog case. All the instances of Fog Type with null value were assigned “NF” values, meaning not fog case. were generated; according to different parameters used for data mining, e.g., distinct confidence factors and rule support degrees. Rule Evaluation Automatic discovery The attributes PastWeather and PresentWeather were transformed from numeric type to non-numeric type. These attributes are qualitative (categorical) attributes; which indicate weather codes. 
The Rainfall attribute shows two problematic behaviors for data mining: sparsity and lack of variability. It has 21.11 % of null values in fog class population and 30.60 % of null values in not fog class population. The rainfall volume is initially measured in millimeters and presented to the forecasters; who express their evaluation in codes expressing ranges of millimeters. This procedure makes sense according to the nature of the forecasting task, as it is almost impossible to differentiate precise measurements of rainfall, like 0.3 millimeters and 0.2 millimeters. The numerical values were transformed into categorical codes, which express ranges of rainfall. The instances of null rainfall will be classified into code 0, no rain. To implement this transformation a new attribute was inserted, Rainfall Range text attribute. A procedure was implemented to calculate the Rainfall Range attribute corresponding to Rainfall attribute values. Data Mining Case Base Case Base Data Warehouse Advisory System Knowledge KnowledgeBase Base DSS User – decisionmaker IDSS System Figure 1 – A computational model for decision support (as described in [21]). The ANN mechanism is applied to process the obtained knowledge (rule bases). The ANN uses the content of the knowledge bases as learning data source, to build knowledge about the specific application domain through its learning algorithm [11, 13]. After the ANN-based learning procedure has been executed, the advisory system provides an interface through its consult mode for the user to test and validate hypotheses about the current decision situation. The Wind Direction is a measure taken by instruments and it is numerically represented in degrees. However, the forecasters do not use detailed numerical measurements when reporting a forecast bulletin but a categorical representation. A categorical description of compass point is used instead, being N for North, S for South and so on. For example, lets consider the compass point 22,1 degrees. Practically, in that case the forecasters assume the compass point NNE, which is the point 22.5 degree. In that case the wind direction is said to be NNE instead of 22.1, as for forecasting purposes the distinction among 22.1, 22.2 and 22.5 degrees is not significant; again a tolerance for imprecision can be observed. It is assumed that each value in degrees belongs to the closest compass point, therefore the middle point between each two compass points was chosen as the boundary point between them, being the middle point itself belonging to the next upper compass point. To implement the transformation of Wind Direction attribute from degrees to compass points a new attribute was inserted into the data set, the WindCompas, text attribute. 3. INDUCING KNOWLEDGE THROUGH DATA MINING The database of weather observations used for automatic induction of domain knowledge was generated from Australian Data Archive for Meteorology (ADAM) data repository. It contains weather observations from Tullemarine Airport, Melbourne, Australia, from July 1970 until June 2000, and has 49,901 records and 17 attributes. 3.1 Data preprocessing The initial database had many problems concerning data quality issues, such as the significant amount of empty and missing values, sparse variables and problems with variability. An extensive pre-processing work was required to make the data appropriate for KDD process, see [20] for detailed discussion about this subject. The not fog class data set initially had 48.963 instances. 
After all data transformations and after the nulls instances were removed, the resulted not fog class data set has 47,995 instances, and the fog class data set has 938 instances. This database was used to select the relevant cases for data mining. The first database used in this project had 75 attributes, some of them related with data quality control and codes describing observations. For example, the dew point observation had two associated attributes, one named DWPT and other named DWPT_QUAL, this last attributes indicates a data quality control information, which was not relevant for our data mining purposes. Many other observations (attributes in the database), like wind speed, wind direction and visibility presented the same problem and had to be removed. The Year and Day attributes were not necessary for mining purposes, just the Month attribute. A derived attribute previous afternoon dew point was calculated based in the date, hour and the dew point and inserted in the table. The forecasters recommended this information as very important for fog prognosis. 3.2 Generating the data models The next step was devoted to verify the data dimensionality and class distribution in the database. As we are interested in forecasting fog, the population was discretised into two classes. One class representing fog cases (named fog class), and a second class representing cases where fog was not observed (named not fog class). The observation database shows a lowprevalence classification, it means that far fewer cases of fog class were present comparing to not fog class. The dataset has 938 instances of fog class and 47,995 instances of not fog class. Figure 2, below, shows the fog classes distribution in the entire weather observations database (population) after data preprocessing. Fog class represents 1.92% of the population, and not fog class represents 98.08% of the population. Several data transformations were performed with Bureau original data set. For example, Fog Type attribute has 3 possible values assigned: “F” when it refers to a fog event, “LF” when is “Local Fog” and null when the event is not fog. This study is The Australasian Data Mining Workshop Copyright  2002 38 The Australasian Data Mining Workshop randomly generated; a mining data set (used by the data mining algorithm), an evaluation data set (for comparison purposes) and a test data set (used as test data by the neural network system). Those data sets were randomly sampled from their original data models in 60% and 80% proportions for mining sets and 10% proportions for evaluation and test. Fog and Not Fog Class Distribution. Complete Enumeration. F The final data models are obtained by joining fog data sets with not fog data sets. Four mining data sets were obtained in this way, named by corresponding sampling proportions: Mining Model 1-60, Mining Model 1-80, Mining Model 2-60, and Mining Model 10-60. For example, Mining Model 1-60 means that this model was obtained by a sample of 10% out of the overall not fog stratum, and 60% of this sample was selected for data mining purposes. The other names follow the same structure. NF Figure 2 – Fog class distribution in the population 3.3 Applying data mining This significant difference in class distribution required the development of a specific sampling strategy in order to have a more homogeneous class distribution in the training set [7, 15, 18]. The sampling approach used in this research project can be classified as stratified multi-stage sampling [7, 23]. 
The original population was divided in two strata: fog stratum and not fog stratum. Sampling was separately conducted in stages within each stratum. A random sampling approach was used for fog stratum. Fog stratum was randomly split without replacement in 85% for mining data set and 7% for testing and evaluation, respectively. This research project uses an associative rules generator algorithm for data mining, based on AIS algorithm [2]. An association rule is an expression X Y, where X and Y are sets of predicates; where X is a precondition of the rule in disjunctive normal form and Y - the target post condition. Hence, the outputs of the data mining experiments are associative rules. We chose associative rules to represent the induced knowledge because it is a clear and natural way of knowledge representation that is easy for people to understand; and also because it fits well our neural network system. Section 4 addresses the integration issues between the knowledge models and the ANN system. → Not fog stratum was sampled in a different fashion; increased sizes data sets were selected from the whole stratum in 10%, 20% and 100% proportions. The sample being 10% of the whole stratum was named Model1, with 4,763 instances. The second sample named Model2, 20% of the stratum with 9,572 instances. For comparison purposes the whole stratum was also considered, we call it Model10, meaning 100% of the stratum. This section discusses some procedures that had to be performed for data mining, e.g.: features selection, selection of target attributes and attributes’ values in the database, discretization or clustering of attributes, and the selection of mining parameters. Although a detailed description of these procedures is out of the scope of this paper, we believe that it is important to mention them at least briefly. Nearly every data mining project includes the execution of these procedures at a certain level. If incorrectly performed, these procedures can potentially compromise the success of the entire data mining project. For those reasons we decided briefly discuss in this paper some of those procedures we faced in our research project. The 10%, 20% and 100% percentages were arbitrarily selected based in the size of not fog and fog strata; the aim here is build data models without a significant difference between the numbers of instances from each class. Therefore small percentages were chosen from not fog stratum. In addition, literature review provide useful insights in incremental sampling, according to Weiss and Indurkhya [23] typical subset percentages for incremental sampling might be 10%, 20%, 33%, 50%, 67% and 100%. Using 50% and higher percentages will keep the difference between not fog cases and fog case too big, therefore small percentages were chosen. The 100% subset was selected to verify the mining algorithm performance when using a significantly difference class distribution, the assumption was that this subset will produce very few or either none fog cases rules. Selection of a target attribute procedure requires the selection of an attribute from the case bases that discriminates the class in study, in our project is the attributes that indicates if a particular case corresponds to a fog observation or not; this attribute is named FogType. FogType attribute was discretised into two values, “F” and “NF”; this represents whether fog phenomenon was or not observed, respectively. Features selection, means selecting the attributes that form the antecedent part of the rules. 
In our research almost all attributes were selected for data mining, exceptions were the attributes indicating Year and Day. Besides that, there is an attribute in our database that indicates the visibility over the airport runaway. Experiments with and without Visibility attribute were performed to check whether the Visibility attribute might be considered as a synonymous for fog. The assumption was to verify the amount of generated rules in both cases and the prevalence of the Visibility attribute. Table 1, below, shows the generated not fog stratum models: Table 1. Sample models for not fog class Data Model Sample size Percentage of the whole stratum Model1 4.763 10% Model2 9.572 20% Model10 47.995 100% Selection of attributes values is an important procedure, which addresses dimensionality reduction, together with feature and cases selection. It could happen that some sets of attribute values are not relevant to the survey variable, or have a small frequency of occurrence in the database. In both cases such These generated models are called data models in the context of this research project, e.g. Model1, Model2 and Model10 are data models. From each data model, three data sets were The Australasian Data Mining Workshop Copyright  2002 39 The Australasian Data Mining Workshop The obtained amount of rules, specifically for fog class, was considered small for a good descriptive model. Hence, it was decided to execute the experiments again with more flexible parameters. Table 3 illustrates this fact; it summarises the amount of associative rules obtained from the mining set Model1-6 and Model1-8. When using 70% rule confidence degree, minimum rule support of 8% and maximum rule order of 7 was obtained 240 associative rules, being 54 in fog class for Model1_6, and 245 associative rules, being 35 in fog class for Model1_8. Keeping the 70% rule confidence degree and changing the minimum rule support to 6% and maximum rule order to 10 we obtained 405 associative rules, being 104 in fog class for Model1_6 and 358 associative rules with 67 in fog class for Model1_8. An increase of 50 rules can be observed in fog class for Model1_6 and 32 rules in fog class for Model1_8. The minimum number of cases, 50, remained constant in all experiments, because it was considered a satisfactory amount of cases, not very restrictive but big enough for a good coverage. attribute values do not add any valuable information in data mining and may be removed. In our experiment some values or categories of Hour, Rainfall and Month attributes were excluded from the data mining experiments because they had either a small frequency in the database, or because they had a high frequency for either classes, fog and not fog. It means high sensitivity but low specificity. Sensitivity degree of a finding is defined in relation to a class measures its frequency for that class. Specificity degree of a finding F in relation to a class C, on the other hand, is inversely proportional to the frequency with which the finding F appears in classes other than C Configuration of mining parameters includes selection of the minimum desired level of rule confidence, support degree and the maximum rule order. It means to choose the ratio of the number of records in the database that support a particular rule. The maximum rule order parameter sets the maximum number of antecedents of the rules. For example, in a rule like: Discretization of numerical attributes is used to determine the granularity of a certain variable. 
It can be used in general to simplify the data mining problem. Also, most data mining tools and algorithms, mainly those used in classification problems, require discrete values rather than a range of values [9]. If DRYBULB <= 8.5 And TOTALCLO > 7 And TOTALLOW > 6 And WINDSPEE <= 1.5 Then FOGTYPE = F, Confidence: 88.24%, Support: 9.29% The rule order is 4, represented by the attributes dry bulb temperature (Drybulb), amount of clouds over the airport runaway (Totalclo), amount of low clouds over the airport runaway (Totallow) and the wind speed (Windspee) at the airport runaway. Table 3. Mining Model1-6 and Model1-8 with different mining settings Mining set Number of rules Confidence degree Rule support Maximum rule order In most of the data mining applications the users are usually only interested in rules with support and confidence above some minimum threshold. Thus these parameters are important to be set. Table 2 shows the selected mining parameters in our experiments: Model16 240 70% 8% 7 Model16 405 70% 6% 10 Model18 215 70% 8% 7 Model18 358 70% 6% 10 Table 2. Selected mining parameters Mining Parameter Confidence Degree Value 50%, 70%, 80%, 90% Minimum Support Degree 8%, 6% Minimum Number of Cases 50 Maximum Rule Order Table 4 illustrates three attributes discretization in our experiment. It shows their respective assigned categorical classification in the Data Mining Model1-6. Each attribute has been assigned the same categories in all data models, e.g. Low, Med, High for Dry Bulb. But distinct value ranges occurred in different data models. 7, 10 The data mining experiments generated rules with 70%, 80% and 90% confidence degree. As it was impossible to know beforehand the amount of generated rules accordingly with a specific confidence degree, it was decided to use the most frequent percentages in data mining applications [16, 22]. Our goal here is to verify if there is a significant difference in performance accordingly with different combinations of parameters (confidence degree, minimum support degree and maximum order degree). And if so, which combination(s) of these parameters is (are) most appropriate when applying data mining in problems similar to the one we are addressing in this project. Here, we consider as performance measure the amount of rules obtained in each class, together with the amount of item sets in each rule. In general the descriptive capability of a rule is associate with its amount of item sets. Table 4. Discretisation of numerical attributes Attribute Mining Model 1-6 Categories Ranges Two sets of data mining experiments were performed. One set of experiments using minimum rule support of 8% and maximum rule order of 7. A second set of experiments, using minimum rule support of 6% and maximum rule order of 10. The first experiments resulted in more restrictive models. Dry Bulb < = 8.5 Low (Celsius degrees) > 8.5 and <= 12 Med > 12 High Total Cloud Amount <=4 Min (Eighths) > 4 and < = 7 Med >7 Max Wind Speed < = 1.5 Light (meters/second) > 1.5 and < = 3.6 Lmode > 3.6 and < = 6.2 Mode > 6.2 Fmode The Australasian Data Mining Workshop Copyright  2002 40 The Australasian Data Mining Workshop build descriptive models as comprehensive as possible for our application domain. The ultimate goal of such a knowledge modeling process is to achieve a good predictive performance of the decision support model. It includes a performance evaluation of the decision support model that will demonstrate how efficient is the descriptive model for this particular case. 
The discretization of a particular attribute is measure proportional on the total amount of cases in the database and the frequency of occurrence of each attribute value. Categorical attributes already express a discrete value, however numerical attributes must be discretised in ranges. The used data mining tool automatically discretizes the numerical attributes based on their frequency of occurrence and the amount of their categories. The above definitions are important for better understanding of how we generated knowledge models and what they are. The approach we used in this research project to generate knowledge is based on the data models, a data mining algorithm (our descriptive method) and the choice of data mining settings (rule confidence degree, rule support and maximum rule order). For each original data models, and combinations of mining parameters, we obtained a distinct set of associative rules. Not only the amount of rules are different, but also the rules itemsets. Each of these distinct sets of associative rules is identified as a knowledge model, or a knowledge base. 3.4 Generating knowledge models Knowledge discovery in databases constitute an interactive and iterative process, having many steps and interrelated fields. We consider knowledge modeling as an important part of the knowledge discovery process. In our research project we distinguish domain modeling, data modeling and knowledge modeling from each other. We understand domain modeling in the same way as it has been widely used by decision support, expert systems and artificial intelligence community in general. Basically, it is concerned with building a model of a particular domain under investigation for any particular purpose. Data modeling in the context of our project relates to all the activities that transform raw data into the data used for data mining. Such data modeling includes data pre-processing, features selection, reduction and transformation, and data sampling. Knowledge modeling in our context includes the activities related to extracting knowledge from data. This includes the interactive process of mining data, testing and tuning different data mining parameters and data models, e.g., adding or eliminating data features, and even cases. In fact, it is effectively an interactive and iterative process, where we try to To illustrate our approach, let us consider the data mining Model1-6 with 70% confidence degree, minimum rule support of 6% and maximum rule order of 10. After mining this data set, it generated a particular knowledge base. Similarly, the data mining Model1-6, with 80% confidence degree, minimum rule support of 6% and maximum rule order of 10 generated a different knowledge base. The process follows in this fashion until we execute all data mining sets (case bases) with the selected mining parameters; showed in table 2, sub section 3.3. Table 5 below shows the generated knowledge models, for each data model accordingly with their respective levels of confidence degree. Table 5. Generated knowledge models Knowledge Models Mining Data Models Generated Rules by Rule Confidence Degree 70 % 80% F NF Total F NF Mining Model1-6V3 104 301 405 37 Mining Model1-8V3 67 291 358 Mining Model2-6V3 45 283 Mining Model10-6V3 10 279 90% Total F NF Total 291 328 16 204 220 23 291 314 12 228 240 328 20 283 303 12 274 286 289 9 279 288 9 279 288 4. 
THE ARTIFICIAL NEURAL NETWORKS SYSTEM This table refers to the experiment using minimum rule support of 6% and maximum rule order of 10; this information is identified by the prefix “V3” in each mining data model. In Table 5, ‘F” relates to fog class and ‘NF” to not fog class. Rules with 50% confidence degree were also generated using the mining Model10-6, due to space limitations it is not included in table 5, but this does not compromise the understandability of the proposed approach. We applied an ANN system as the interface of our decision model. The ANN system learns about the problem domain through the knowledge models, used as training sets. Besides implementing learning capability in our decision support model, the ANN system provides an interface for the decision-makers to test and validate hypotheses about the specific application domain. For the confidence level of 90% too few rules were obtained for fog class; with a maximum amount of 16 rules when using data model Model1-6 and 12 rules when using data Model2-6. These amounts of rules are unlikely to be enough for a satisfactory description of the fog phenomenon. The performance evaluation will show how this affects the predictive capacity of our decision support model. For the ANN interface we use the Components for Artificial Neural Networks (CANN) framework [4]. The CANN framework is a research project that allows neural networks to be constructed on a component basis. The CANN project relates to the design and implementation aspects of framework architecture for decision support systems that rely on artificial neural network technology [17]. The Australasian Data Mining Workshop Copyright  2002 41 The Australasian Data Mining Workshop We selected the evidence Dewpoint to show its properties, for example, it is a string attribute and it is categorized in four categories: Low, when the temperature is smaller or equal 4 Celsius degrees; Med between 4 and 6 Celsius; High between 6 and 9 Celsius and Max, temperature higher than 9 Celsius. The CANN components are designed in an object-oriented way. It implements a class hierarchy to represent a particular application domain, the domain evidences, classes and the relationships among evidences. Figure 3 presents the screen of the CANN system that represent the evidences (attributes) used to identify fog phenomenon. At the left in figure 3 a list of evidences (attributes) about weather forecasting is presented. Figure 3: Weather evidences modelled into CANN does for a single case. Figure 5 illustrates a case base consult session. CANN system implements a mechanism that associates a data set with a particular ANN model, for example, the Combinatorial Neural Model (CNM) [13]. Through the ANN learning algorithm, CANN implements a learning mechanism. Figure 4 illustrates the outcome of the learning process executed by CANN in the meteorological domain. Figure 5: A case base consult session. For example, case 119 in figure 5 is indicated as a Fog case with confidence degree of 0.952, supported by the evidences Total cloud amount Max and Wind Speed Light. Case 120 is indicated as a Not Fog case with confidence degree 0.909, supported by the evidences Drybulb High and Total Cloud Amount Med. Figure 4: Learning about meteorological domain. CANN functionality is well suited to the purpose of our project as it is capable of a flexible domain representation, learning and consulting functionalities. 
A decision-maker interacts with CANN consulting mechanism, in two ways: case consult and case base consult. A case consult presents to the decision maker a selection of evidences, and their respective evaluation of relevance to the situation at hand. CANN sets up a set of hypotheses based on the presented input data. It evaluates the selected evidences and calculates a confidence degree for each hypothesis. The inference mechanism appoints the hypothesis with the higher confidence degree as the most suitable solution (class) to the problem. A detailed discussion of the CANN class hierarchy is outside the purposes of this paper as is software engineering design issues. Readers interested in these subjects should refer to [3, 4]. The Combinatorial Neural Model, its algorithm and its learning, pruning and consulting algorithms are presented in [12, 13]. 4.1 Mapping associative rules into the ANN topology The CANN knowledge representation schema reflects the knowledge model structure and content. The rules are directly mapped onto the ANN topology, and simultaneously represented through a symbolic mechanism [14]. Rules describing relations in the weather forecasting domain are represented by neurons and synapses. Figure 6 exemplifies this property. The rule: I3 & I4 & In => F, corresponds to the A case base consult is similar to the case consult, however, instead of presenting one single case (or one set of evidences) each time, several cases are simultaneously presented to the ANN system. It evaluates the set of cases in the same way it The Australasian Data Mining Workshop Copyright  2002 42 The Australasian Data Mining Workshop strengthened connections among the input nodes I3, I4 and In, the combinatorial node C3, and the output node F of the ANN. NF C2 C1 F C3 obtained with 90% of confidence degree. Those results are not a surprise, as increasing the rule confidence degree restricts the amount of obtained rules; therefore a less descriptive model is expected. The worse performance is verified when applying Model1-6-90, it happens because this data model has only 16 rules describing fog what does not represent a enough coverage to describe fog phenomena. Even though, 66.67% of correct classifications can be considered a surprisingly good result considering there are only 16 rules about fog in the rule base. … C4 … … I1 I2 I3 I4 Table 6. Test Data Model1_6 with different rule confidence degrees In Figure 6. Incorporating rules into ANN topology. For example, consider the following rule: Rule 1 for Fog Class: If Total Cloud Amount = Max And Wind Speed = Light And Wind Direction = SE Then Fog Type = F. Learning Set Correct Misclassifie d No conclusion Total Cases Model1-670 82 28 (23.3%) 10 120 Model1-680 Model1-690 The above rule is mapped into the ANN topology by representing I3 as total cloud amount max, I4 as wind speed light and In wind direction SE, and also considering the hypothesis NF as not fog case and F as fog case. Additional information as rule confidence degree will be represented in the ANN topology as confidence level associated with a particular evidence. (68.30%) 81 (8.3%) 27 (22.5%) (67.50%) 80 12 120 (10.0%) 25 (20.8%) (66.67%) 15 (12.50%) 120 An average of 67.5% of the cases were correctly classified, what indicates the applicability of the proposed model for decision support in classificatory problems. Table 7. Data Model1_6 performace discriminating fog and not fog classes 5. 
VALIDATION The validation of the discovered knowledge is based on the ability of the model to correctly identify meteorological observations, specifically a fog case or a not fog case. The performance of the model relies not only on the applied computation technologies (data mining and ANN), but also on the strategy we applied to obtain the data and knowledge models, e.g., sampling strategy, pre-processing, and mining parameters. Due to the space limitations we cannot discuss all these issues in this paper, but it is important to understand that all those issues have an implication on the performance of our decision support model. Learning Set Correct Fog Correct Not Fog Model1-6-70 42 (70.0%) 40 (66.67%) Model1-6-80 39 (65.0%) 42 (70%) Model1-6-90 39 (65.0%) 41 (68.33%) Analyzing individually the performance in each class also indicates that the 70% rule confidence degree generates the best set of rules, achieving the highest performance of 70.0% of correct fog cases classified. What basically differentiates each of the training models in our experiment is the number of rules representing fog cases. We selected data model Model1-6, with 70%, 80% and 90% rule confidence degrees, 6% of minimum rule support and maximum rule order 10. We identify each case set by adding the rule confidence degree in the data model name, therefore Model1-6-70 corresponds to the data model generated when selecting 70% rule confidence degree; Model1-6-80 using 80% rule confidence degree and Model1-6-90 when using 90% rule confidence degree. The change in the number of rules representing not fog cases does not represent a significant change in performance, specifically 70.0% in the best case and 66.67% in the worse case. It is because there are enough rules describing not fog cases. The same comments cannot be extended to fog class; a decrease in fog rules in Model1-6-90 caused a significant lost in predictive performance, with 70.0% of correct classification in the best performance dropping to 65.0%. Table 5, in section 3.4, describes the amount of rules for each of these data models. They were used as the ANN learning bases. For testing we used a subset of the test set generated for data model Model1-6. The test set has randomly selected 120 cases, being 60 cases of not fog and 60 cases of fog. Our experiment so far indicates that the 70% rule confidence degree seems to be the best value for this parameter, even when faced with the problem of low prevalence classification. However 70.0% may not be considered a satisfactory performance in many applications. Additional experiments can be carried on to improve the system performance, for example applying different sampling proportions to obtain a more homogeneous class distribution, or applying different data mining parameters. Such as relaxing the minimum rule support to obtain a higher number of rules or even increasing the maximum rule order to generate rules with higher itemsets, therefore better descriptive capabilities. The results of this experiment are presented in Table 6. They appoint to the efficiency and applicability of the combined approach, data mining and ANN, considering an average of 67.5% of correct classifications. The ANN system correctly classified 68.3% of the cases when training with rules obtained with 70% of confidence degree. 
The performance decreased a little, to 67.5% when training with rules obtained with 80% of confidence degree; and the performance decreased to 66.67% when training with rules The Australasian Data Mining Workshop Copyright  2002 43 The Australasian Data Mining Workshop (Ed.), Object-Oriented Application Framework: Applications and Experiences. (1 ed.): John Wiley. Further experiments are necessary for more accurate conclusions; however, the results obtained so far indicate the potential applicability of our approach to automatically induce domain knowledge, to handle the problem of low prevalence classification in databases, to incorporate the domain knowledge and implement learning capabilities in the proposed model for decision support. [4] Beckenkamp, F. a., & Pree, W. (2000, May, 2000.). Building Neural Networks Components. In Proceedings of Neural Computation 2000 - NC'2000, Berlin, Germany. [5] Buchanan, B., & Feigenbaum, E. (1978). DENDRAL and META-DENDRAL: Their applications dimensions. Artificial Intelligence, 1, 5 - 24. [6] Carbonell, J. G. (1989, September). Introduction: Paradigms for Machine Learning. Artificial Intelligence, 40, 1-9. [7] Catlett, J. (1991). Megainduction: Machine learning on very large databases. University of Technology, Sydney, Australia. [8] Fayyad, U. M., Mannila, H., & Ramakrishman, R. (1997). Data Mining and Knowledge Discovery. (Vol. 3). Boston: Kluwer Academic Publishers. [9] Howard, C. M., & Rayward-Smith, V. J. (1998). Discovering Knowledge from low-quality meteorological databases. Knowledge Discovery and Data Mining. (Pages: 180-202.). 6. CONCLUSION AND COMMENTS This paper presents a decision support model and its application to a real world problem. We proposed a decision support model combining data mining and neural networks. Data mining is chosen to automatically induce domain knowledge from raw data and ANN because of its adaptive capabilities, which is important for providing the means for implementation of inductive and deductive learning capabilities [6, 19]. Besides that, this project came up with an efficient sampling strategy to handle problems of dimensionality and class distribution, mainly the low prevalence classification problem, as well as conducted an in-depth investigation of the pre-processing stage to ensure data quality for data mining. The results obtained so far demonstrate the applicability of the proposed decision support model in aviation weather forecasting, specifically to correct identify fog phenomenon. [10] Keith, R. (1991). Results And Recommendations Arising From An Investigation Into Forecasting Problems At Melbourne Airport. (Meteorological Note 195). Townsville: Bureau of Meteorology, Meteorological Office. The system performance can be further improved through some additional procedures. For example, in our experiments we used neural network topology with maximum order of three. It means that the neural network combinatorial layer associates at maximum three input neurons. Using higher combinatorial order will add more evidences in the neural network learning and evaluation procedures. Considering more evidences for the cases analysis can potentially improve the system performance. Additionally, considering higher number of antecedent itemsets during data mining and relaxing the learning and pruning threshold parameters in the ANN learning algorithm may also potentially improve performance. [11] Machado, R. J., Barbosa, V. C., & Neves, P. A. (1998). Learning in the Combinatorial Neural Model. 
IEEE Transactions on Neural Networks, 9. September, 1998 [12] Machado, R. J., & Rocha, A., F. (1989). Handling Knowledge in High Order Neural Networks: the Combinatorial Neural Model. (Technical Report CCR076). Rio de Janeiro, Brazil.: IBM Rio Scientific Center. [13] Machado, R. J., & Rocha, A., F. (1990). The combinatorial neural network: a connectionist model for knowledge based systems. In B. B. Bouchon-Meunier, Yager, R. R. & Zadeh, L. A. (Ed.), Uncertainty in knowledge bases. Berlin, Springer Verlag. In addition, issues concerning system integration may be assessed. Currently case and knowledge bases are stored as relational tables; different technologies are under evaluation for storing the knowledge bases, for example using XML document formats and PMML (http://www.dmg.org), in order to facilitate its integration with the ANN system, based on Java implementation. [14] Medsker, L. R. (1995). Hybrid Intelligent Systems. (Vol. 1). Boston, USA: Kluwer Academic Publishers. [15] Mohammed, J. Z., Parthasarathy S., &, L. W., & Ogihara, M. (1996.). Evaluation of Sampling for Data Mining of Association Rules. (Technical Report 617). Rochester, New York. The University of Rochester, Computer Science Dept. 7. ACKNOWLEDGEMENTS This research is partly funded by the Australian Research Council and Monash University grants. We would like to thank the Regional Forecasting Centre from Australian Bureau of Meteorology, Victorian Regional Office for providing meteorological data and support. We also thank Dr Robert Dahni and Mr. Scott Williams from the Regional Forecasting Centre for their help in validation results in relation to aviation weather forecast. [16] Piatetsky-Shapiro, G., & Frawley, W. (1991). Knowledge Discovery in Database. MIT Press. [17] Pree, W., Beckenkamp, F. a., & Rosa, S. I. V. (1997, June, 17 - 20, 1997). Object-Oriented Design & Implementation of a Flexible Software Architecture for Decision Support Systems. In Proceedings of 9th. International Conference on Software Engineering & Knowledge Engineering - SEKE'97, (pp. 382 - 388). Madrid, Spain. 8. REFERENCES [1] Auer, A. H. J. (1992). Guidelines for Forecasting Fog. Part 1: Theoretical Aspects: Meteorological Service of New Zealand. [2] Agrawal, R., Imielinski, T., & Swami, A. (1993, May, 1993.). Mining association rules between sets of items in large databases. In Proceedings of Conference on Management of Data., (pp. 207-216). Washington, DC. [3] [18] Provost, F., Jensen, D. & Oates, T. (2001). Progressive Sampling. In H. L. a. H. Motoda (Ed.), Instance Selection and Construction for Data Mining (Vol. 1, pp. 151 - 170). Norwell, Massachusetts, USA: Kluwer Academic Publishers. [19] Tecuci, G. a., & Kodratoff, Y. (1995). Machine Learning and Knowledge Acquisition: Integrated Approaches. London, UK.: Academic Press. Beckenkamp, F. a., & Pree, W. (1999). Neural Network Framework Components. In S. D. C. a. J. R. Fayad M. The Australasian Data Mining Workshop Copyright  2002 44 The Australasian Data Mining Workshop About the authors: [20] Viademonte, S., Burstein, F., Dahni, R. & Williams, S. (2001). Discovering Knowledge from Meteorological Databases: A Meteorological Aviation Forecast Study. In Proceedings of Data Warehousing and Knowledge Discovery, Third International Conference - DaWaK 2001, (pp. 61-70). Munich, Germany: Springer-Verlag. Sérgio Viademonte is a Doctoral candidate at the School of Information Management and Systems at Monash University. His research is supported by ORSP and Monash Graduate Scholarships. 
Sergio has been working on hybrid architectures for expert systems since 1995 when he obtained a Master in Administration, Information Systems Area (by Research) from Federal University of Rio Grande do Sul (UFRGS), RS, Brazil. [21] Viademonte, S. B. & Burstein F.. (2001). An Intelligent Decision Support Model for Aviation Weather Forecasting. In Proceedings of Advances in intelligent data analysis: 4 th international conference / IDA 2001, (pp. 278 - 288). Cascais, Portugal.: Springer-Verlag. Dr Frada Burstein is Associate Professor and Knowledge Management Academic Program Director at the School of Information Management and Systems at Monash University. She is a Chief Investigator for an ARC funded industry collaborative project with Bureau of Meteorology titled ”Improving Meteorological Forecasting Practice with Knowledge Management Systems”. The results reported in this paper address a component of this project. [22] Weiss, S. M., Galen, R. S. a., & Tadepalli, P. V. (1990). Maximizing the predictive value of production rules. Artificial Intelligence, 47 - 71. [23] Weiss, S. M., & Indurkhya, N. (1998). Predictive Data Mining: A Practical Guide. (Vol. 1). San Francisco, CA: Morgan Kaufmann Publishers, Inc. The Australasian Data Mining Workshop Copyright  2002 45 46 47 The Australasian Data Mining Workshop 48 The Australasian Data Mining Workshop 49 The Australasian Data Mining Workshop 50 The Australasian Data Mining Workshop 51 The Australasian Data Mining Workshop 52 The Australasian Data Mining Workshop 53 The Australasian Data Mining Workshop 54 The Australasian Data Mining Workshop 55 The Australasian Data Mining Workshop 56 57 The Australasian Data Mining Workshop 58 The Australasian Data Mining Workshop 59 The Australasian Data Mining Workshop 60 The Australasian Data Mining Workshop 61 The Australasian Data Mining Workshop 62 The Australasian Data Mining Workshop 63 64 65 The Australasian Data Mining Workshop 66 The Australasian Data Mining Workshop 67 The Australasian Data Mining Workshop 68 The Australasian Data Mining Workshop 69 The Australasian Data Mining Workshop 70 The Australasian Data Mining Workshop 71 The Australasian Data Mining Workshop 72 The Australasian Data Mining Workshop 73 74 SemiDiscrete Decomposition: A Bump Hunting Technique S. McConnell D.B. Skillicorn School of Computing, Queen’s University, Kingston, Canada. School of Computing, Queen’s University, Kingston, Canada, and Faculty of Information Technology, University of Technology, Sydney. mcconnell@cs.queensu.ca skill@cs.queensu.ca ABSTRACT              Æ                                                 !    "          #    !              !                        1. INTRODUCTION $                                     %  &                       $                           '  !  &                                           '  !                          (                             )*         +  ,-℄                                 /                                +                        +                        !     0              !                      ! (             !                        +                 ,) 1℄        ! 2                                     3   !               4  5     5                   5     !                 ½  ¾                 6                                                                       $     !            !     "          !  5    "      6          ,)1 )7℄                           !  ,8 )9 ))℄ 2            +      !                                     ) 9 ) 6       ! 
   :   2  !   ; ¿     !                6       !           6  "   +                                   1  3  !                       4                ) 9 )        !                      <                          6                             < ¼    !  :  6         0 )        !   ½         !   1 '           :      ¾ 6 $  =     /  1991 75    The Australasian Data Mining Workshop     )     )  6             0  /                                           >      "         ? >     4  =   /                 6          &       "     !     (                   2    &          ) 9  )   "             6                &              6               &               &               @                   A                6                           &            6      "                           (             !                                      "   6       Æ          !    2. WHAT SDD IS DOING          !             !               4       6 " * ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )         4       6 $ ) ) ) ) ) ) ) ) 9 9 ) 9 9 9 ) 9 ) ) 1 ) ) ) 1 ) ) ) ) ) ) ) ) ) ) ) 1 ) ) ) 1 ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )            !     ) ) ) ) ) ) ) ) 9 9 ) 9 9 9 ) 9 ) ) ) ) ) ) ) )  =                /  ' )0 $      !             &     " *        !     4       6   ) ) ) ) ) ) ) ) 9 9 9 ) 9 ) 9 9 ) ) ) ) ) ) ) ) 9 9 9 ) 9 ) 9 9 ) ) ) ) ) ) ) )    4   )971* 98?-* 99*B*8C 99*B*8C 999?771)                  6         !     ' ) D              2            A              "          E   # /     !                  F  "    !      ½ ½ ½   ½   "     ½   "     6     ½   ½   B B !  )#   ½  )971*   !          B B !        )971* 2      !               ' 1                      !            !   F       !             ¾   ¾                 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 ) 9 9 9 ) 9 9 9 9 9 9 9 9 9 9 9 ) 9 9 9 ) 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9            2                     @       ¾  98*-* 1991 76 The Australasian Data Mining Workshop       ' 10 $             "      !       6           !     !     "           !           "      !             2             ' 1                       :                     A      "                           6                           6    "                         6    1    &      ! :               :         6          "             G        2        "       6 !        !      6        +          A                     6                        :                 3          "       !       G        :        6      :                G                              :     " '  !       !0      4         ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) 8 ) ) ) 8 ) !   9 )  9 )  ) )   9 )  4  9 )   9 )  ) ) 9 ) ) ) ) ) ) ) ) ) ) ) 8 ) ) ) 8 ) ) ) ) ) ) ) ) ) 9 9 ) 9 9 9 ) 9 ) ) ) ) ) ) ) ) 9 9 ) 9 9 9 ) 9 9 9 9 ) 9 ) 9 9 ) ) ) ) ) ) ) ) 9 9 9 ) 9 ) 9 9       4       9 9 9 ) 9 ) 9 9 ) ) ) ) ) ) ) )       4     8 98?-* 98?-* 99*B*8C 99*B*8C 999?771) ) ) ) ) ) ) ) )                                         2          "     G           2                   A 6 $  =     /  1991 77 The Australasian Data Mining Workshop   G       "           1 4 )1BH7C      !            - $          Æ        !  +   !                                  +           0 +                                    6     ' ?          !      A   :            (       ?        +               2         +            (        6                                      6 !              +       !      !                      !                             
:           A             +     6                    6                  6         "              !      (      "             2                        "      $                  0 ) '              :            1                              "   ? '                   6                       "                     6                           !                  >               !                     # !   6                             2    !        8        "   (    !      6 $  =     /  1991 78 0      4     ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) 8 ) ) ) 8 ) ) ) ) ) ) ) ) ) ) ) B ) ) ) 8 ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) )                    !                 4   B-* 98?-* )7B-* 9*)*71 9*)*71                  2                       !               6            3. AN APPLICATION                         @  @  6    )7-9                  ??   ,*℄ 6      :               ,C℄ =                                    (                      89                           2                                                                         !                          F                                    &  '       @>$/                         6                   0 $   : ?         / ( <  6   I A                        $   : 1         /  /   @ A   "                     $             /    A        The Australasian Data Mining Workshop 0.15 0.1 0.05 0 −0.05 −0.1 −0.15 −0.2 −0.25 0.2 0.1 0.25 0 ' ?0 J         !       0.2 0.15   +      $   : B         / J   I A           : $   : ?                   6            "   2 !                      $                      "             !      ' C        +        6                         "                                        2      )*   1<   K         * )1   )C / ( < (  ' *              6      C /    )?   19 L   >                       M /  / % ' 3     I      /           6   L   >          6    G          M   3                !   6    I          =     /  0.05 0 −0.05 −0.1 −0.15 −0.2  /           I   !       /       G               "      "   +               2 !   !          !     +   "   &        9-* A   "    &                                     6           6 $ 0.1 1991 79 4. RELATED WORK 6            0 ) 1        0 N2    #          #     O 6       '   '# J>2= ,7℄     J>2=          "                                2  "          !   !      F&           H         0 N2                    #     O 6  !            )          ,)C )? B℄            3   Æ                E #   3 "   E #     F&            3  The Australasian Data Mining Workshop 0.4 0.3 11 20 0.2 16 13 26 0.1 27 15 19 24 33 2 U3 22 0 3 28 1 10 8 32 31 9 17 21 30 6 −0.1 18 7 4 −0.2 25 29 −0.3 23 12 5 14 0.5 0 −0.5 0.25 0.2 0.15 0.1 0.05 0 −0.05 −0.1 0.3 0.35 U1 ' C0 J      + 0.4 0.3 11 0.2 16 0.1 U3 22 0 −0.1 6 8 10 28 2 3 1 20 13 24 26 27 15 19 9 17 32 33 31 21 30 18 −0.2 7 −0.3 4 0.4 29 0.2 23 25 5 12 14 0 −0.2 U2 −0.4 −0.1 −0.05 0 0.1 0.05 0.2 0.15 0.25 0.3 0.35 U1 ' *0 J      +      6 $  =     /  1991 80  4 ) 9 P 4 9 9 4 ) ) 4 ) 9 The Australasian Data Mining Workshop ?      0 N2                    #    O 2             &           &                 ,)*℄ '  !                     &        6                 &  F                         F&                  F                                                          '  !            &                                                       %            >         "            "                  !              F                  6         +               M  #             J J ,?℄                 "    !                           
"        J J          0  ) ) ) ) ) ) ) 9 9 9 9 9  ) ) ) ) ) ) 9 9 9 9 9 9     ) ) ) ) ) ) ) 9 9 9 9 9     ) ) ) ) ) ) ) 9 9 9 9 9    ) ) ) ) ) ) ) ) ) 9 9 9     ) ) ) ) ) ) ) ) ) ) ) 9  4  ) 9 ) ) ) ) ) ) ) ) ) )   9 9 9 9 ) ) ) ) ) ) ) )     9 9 9 9 ) ) ) ) ) ) ) )     9 9 9 9 9 ) ) ) ) ) ) )    9 9 9 9 9 ) ) ) ) ) ) ) 9 9 9 9 9 9 ) ) ) ) ) ) 6      J J     ' 7 6 "           ' - 6                        J J 5. CONCLUSION  =     /    1991 81   ' 70 6            J J           ' -0 6              "                    !    2                      E #           6 $                 The Australasian Data Mining Workshop                 6                                              "                 6                                                +                    6 !                  !                                       6. ,-℄ 3 3    /  <     $    Q   (   D  J ?   )887 ,B℄ J (  M    < 6     J $ :            &        2 %   8C7A8*1 1999 ,8℄ 3 L     F#< $   !           !       $ &     '   )70?11A?C7 )88B ,)9℄ 6 L     F#< /          !     $ &    '  '  %   )888 ,))℄ 6 L     F#< (                    )9-               -?AB9   + )888   REFERENCES ,)℄ = M     3 F#M  D                  ?- C0*-?A*8* )88* ,)1℄ ,1℄ =  M   >  '  <                                ? C0?9)A?1- )887 ,)?℄ M    Q J Q 6  $     >   %             6    >  88B- =    >  )888 ,?℄ ,)C℄  M  J                  1 C0?1*A?CC )88B  F#<    J          !    """ &    $     ?)0CC)ACCC )8B?  6! ) $  $  *   J  6    D   Q  199) ,C℄ $ /    /   $                        @      ! "  # "    #     )0))8A )?C 199) ,)*℄ 3  > M! ( (  (    < 3 $                2 %   '  +  """     $ '     ,$-+. = / Q     1991 ,*℄ ,)7℄  I  $ 3    :      !               6          /      J D  1999  /   @ >   3    (   3        @%   @  3      @      @ >   $ /   )88* ,7℄ Q '   @ ' M              $    )88- 6 $  =     /  1991 82 An Overview of Temporal Data Mining Weiqiang Lin Mehmet A. Orgun Graham J. Williams Department of Computing I.C.S., Macquarie University Sydney, NSW 2109, Australia Department of Computing I.C.S., Macquarie University Sydney, NSW 2109, Australia CSIRO Data Mining GPO Box 664 Canberra ACT 2601, Australia wlin@ics.mq.edu.au mehmet@ics.mq.edu.au Graham.Williams@csiro.au ABSTRACT 2.1 Temporal Data Mining is a rapidly evolving area of research that is at the intersection of several disciplines, including statistics, temporal pattern recognition, temporal databases, optimisation, visualisation, high-performance computing, and parallel computing. This paper is first intended to serve as an overview of the temporal data mining in research and applications. In this section, we first give basic definitions and aims of Temporal Data Mining. The definition of Temporal Data Mining is as follows: 1. INTRODUCTION Temporal Data Mining is a rapidly evolving area of research that is at the intersection of several disciplines, including statistics (e.g., time series analysis), temporal pattern recognition, temporal databases, optimisation, visualisation, high-performance computing, and parallel computing. This paper is intended to serve as an overview of the temporal data mining in research and applications. In addition to providing a general overview, we motivate the importance of temporal data mining problems within Knowledge Discovery in Temporal Databases (KDTD) which include formulations of the basic categories of temporal data mining methods, models, techniques and some other related areas. The paper is structured as follows. 
Section 2 discusses the definitions and tasks of temporal data mining. Section 3 discusses the issues in temporal data mining techniques. Section 4 discusses two major problems of temporal data mining, those of similarity and periodicity. Section 5 provides an overview of time series temporal data mining. Section 6 moves on to a discussion of several important challenges in temporal data mining and outlines our general distribution theory for answering some of those challenges. The last section concludes the paper with a brief summary.

2. DEFINITION AND TASKS OF TEMPORAL DATA MINING
The temporal data mining component of the KDTD process is concerned with the algorithmic means by which temporal patterns are extracted and enumerated from temporal data. Some problems for temporal data mining in temporal databases include questions such as: How can we provide access to temporal data when the user does not know how to describe the goal in terms of a specific query? How can we find all the time-related information in, and understand, a large temporal data set? And so on.

2.1 Definition and Aims
In this section, we first give basic definitions and aims of Temporal Data Mining. The definition of Temporal Data Mining is as follows:

Definition 1. Temporal Data Mining is a single step in the process of Knowledge Discovery in Temporal Databases that enumerates structures (temporal patterns or models) over the temporal data, and any algorithm that enumerates temporal patterns from, or fits models to, temporal data is a Temporal Data Mining Algorithm.

Basically, temporal data mining is concerned with the analysis of temporal data and with finding temporal patterns and regularities in sets of temporal data. Temporal data mining techniques also allow for the possibility of computer-driven, automatic exploration of the data. Temporal data mining has led to a new way of interacting with a temporal database: specifying queries at a much more abstract level than, say, Temporal Structured Query Language (TSQL) permits (e.g., [17], [16]). It also facilitates data exploration for problems that, due to their multiplicity and multi-dimensionality, would otherwise be very difficult for humans to explore, regardless of the use of, or efficiency issues with, TSQL. Temporal data mining tends to work from the data up, and the best known techniques are those developed with an orientation towards large volumes of time-related data, making use of as much of the collected temporal data as possible to arrive at reliable conclusions. The analysis process starts with a set of temporal data and uses a methodology to develop an optimal representation of the structure of the data, during which time knowledge is acquired. Once temporal knowledge has been acquired, this process can be extended to a larger set of the data, working on the assumption that the larger data set has a structure similar to the sample data.

2.2 Temporal Data Mining Tasks
A relevant and important question is how to apply data mining techniques to a temporal database. According to the techniques of data mining and the theory of statistical time series analysis, the theory of temporal data mining may involve the following areas of investigation, since a general theory for this purpose is yet to be developed:

1. Temporal data mining tasks include:
• Temporal data characterization and comparison,
• Temporal clustering analysis,
• Temporal classification,
• Temporal association rules,
• Temporal pattern analysis, and
• Temporal prediction and trend analysis.

2. A new temporal data model (supporting time granularity and time-hierarchies) may need to be developed based on:
• Temporal data structures, and
• Temporal semantics.

3. A new temporal data mining concept may need to be developed based on the following issues:
• the task of temporal data mining can be seen as a problem of extracting an interesting part of the logical theory of a model, and
• the theory of a model may be formulated in a logical formalism able to express quantitative knowledge and approximate truth.

In addition, temporal data mining needs to include an investigation of tightly related issues such as temporal data warehousing, temporal OLAP, computing temporal measurements, and so on.
3. TEMPORAL DATA MINING TECHNIQUES
A common form of a temporal data mining technique is rule (or function) discovery. Various types of temporal functions can be learnt, depending upon the application domain. Also, temporal functions (or rules) can be constructed in various ways. They are commonly derived by one of the two basic approaches, bottom-up or top-down induction.

3.1 Classification in Temporal Data Mining
The basic goal of temporal classification is to predict temporally related fields in a temporal database based on other fields. The problem in general is cast as determining the most likely value of the temporal variable being predicted, given the other fields, the training data in which the target variable is given for each observation, and a set of assumptions representing one's prior knowledge of the problem. Temporal classification techniques are also related to the difficult problem of density estimation. In recent years, a lot of the work has been done in non-temporal classification areas by using "Statistical Approaches to Predictive Modelling". Some techniques have been established for estimating a categorical variable, e.g., [26; 5; 20]: kernel density estimators [20; 11] and the K-nearest-neighbour method [20]. These techniques are based upon the theory of statistics. Some other techniques, such as those in [7; 8; 6], are based upon the theory of databases. Temporal classification techniques have not been paid much attention so far. In recent years, the main idea in temporal classification has been the straightforward use of sampling techniques within time series methods (distribution) to build up a model for temporal sequences.
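As a concrete illustration of the K-nearest-neighbour method cited above [20], the following minimal sketch classifies a whole fixed-length sequence by majority vote among its k nearest labelled training sequences under Euclidean distance. The sequences, labels and the value of k are invented for the example; this is not the temporal classification framework the authors develop later in the paper.

# K-nearest-neighbour classification of whole, fixed-length sequences
# (majority vote among the k nearest training sequences).
# Illustrative only: data, labels and the distance choice are invented.
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, training, k=1):
    """Predict a label for `query` from labelled `training` sequences.

    `training` is a list of (sequence, label) pairs; all sequences are
    assumed to have the same length as `query`.
    """
    neighbours = sorted(training, key=lambda sl: euclidean(query, sl[0]))[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)   # majority vote

if __name__ == "__main__":
    train = [([0, 1, 2, 3], "rising"), ([3, 2, 1, 0], "falling"),
             ([0, 2, 4, 6], "rising"), ([5, 3, 1, -1], "falling")]
    print(knn_classify([1, 2, 3, 4], train, k=3))   # -> "rising"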
3.2 Temporal Cluster Analysis
Temporal clustering according to similarity is a concept which appears in many disciplines, so there are two basic approaches to analysing it. One is the measure-of-temporal-similarity approach and the other is called the temporal optimal partition approach. In temporal data analysis, many temporal data mining applications make use of clustering according to similarity and optimization of temporal set functions. If the number of clusters is given, then clustering techniques can be divided into three classes: (1) metric-distance based techniques, (2) model-based techniques and (3) partition-based techniques. These techniques can occasionally be used in combination, such as probability-based versus distance-based clustering analysis. If the number of clusters is not given, then we can use non-hierarchical clustering algorithms to find their k. In recent years, temporal clustering techniques have been developed for temporal data mining, e.g., [23]. Some studies have been done by using the EM algorithm and the Monte-Carlo cross validation approach (e.g., [12; 22; 13]).

3.3 Induction
A temporal database is a store of temporally related information, but more important is the information which can be inferred from it ([3; 4]). There are two main inference techniques: temporal deduction and temporal induction.

1. Temporal deduction is a technique (e.g., in [24]) to infer information that is a temporal logical consequence of the information in the temporal database.

2. Temporal induction can be described as a technique (e.g., in [25]) to infer temporal information that is generalised from the temporal database. Induction has been used in the following ways within data mining: 1) Decision Trees and 2) Rule Induction.
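To make the metric-distance based clustering of Section 3.2 concrete, here is a minimal sketch of k-means over fixed-length sequences, with the number of clusters k given and squared Euclidean distance as the metric. The data, k, iteration count and seed are arbitrary choices for the illustration, and this is not the distribution-based temporal clustering algorithm proposed by the authors in Section 6.

# Sketch of metric-distance based clustering of fixed-length sequences:
# plain k-means with (squared) Euclidean distance, number of clusters given.
# Data and parameters are invented for illustration.
import random

def kmeans_sequences(seqs, k, iters=20, seed=0):
    rng = random.Random(seed)
    centres = [list(s) for s in rng.sample(seqs, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for s in seqs:
            # assign each sequence to the nearest centre
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(s, centres[c])))
            clusters[j].append(s)
        for j, members in enumerate(clusters):
            if members:  # recompute the centre as the pointwise mean
                centres[j] = [sum(vals) / len(members) for vals in zip(*members)]
    return centres, clusters

if __name__ == "__main__":
    data = [[0, 0, 1, 1], [0, 1, 1, 2], [5, 5, 6, 6], [6, 5, 6, 7]]
    centres, clusters = kmeans_sequences(data, k=2)
    print(centres)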
4. TWO FUNDAMENTAL TEMPORAL DATA MINING PROBLEMS
In recent years, two kinds of fundamental problems have been studied in the temporal data mining area. One is the Similarity Problem, which is to find a time sequence (or TDB) similar to a given sequence (or query), or to find all pairs of similar sequences. The other is the Periodical Problem, which is to find periodic patterns in a TDB.

4.1 Similarity Problems
In temporal data mining applications, it is often necessary to search within a temporal sequence database (e.g., a TDB) for those sequences that are similar to a given query sequence. Such problems are often called Similarity Search Problems. This kind of problem involves searching multiple and multidimensional time series sets in TDBs to find out how many series are similar to one another. It is one of the most important and growing problems in Temporal Data Mining. In recent years, we still lack a standard definition and standard theory for similarity problems in TDBs. Temporal data mining techniques can be applied in similarity problems. The main steps for solving the similarity problem are as follows:

• define similarity: allows us to find similarities between sequences with different scaling factors and baseline values.
• choose a query sequence: allows us to find what we want to know from large sequences (TDB) (e.g., character, classification).
• processing algorithm for TDB: allows us to apply some statistical methods (e.g., transformation, wavelet analysis) to the TDB (e.g., remove the noisy data, interpolate the missing data).
• processing an approximate algorithm: allows us to build up a classification scheme for the TDB according to the definition of similarity by using some data mining techniques (e.g., visualisation).

The result of the Similarity Problem search in a TDB can be used for temporal association, prediction, etc.
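The first step listed above ("define similarity" across different scaling factors and baseline values) is commonly handled by z-normalising each sequence before comparing; the sketch below then ranks the sequences of a small, invented TDB by Euclidean distance to the query. It is only an illustration of the listed steps, not a scalable similarity-search algorithm.

# Sketch of the similarity-search steps: z-normalise each sequence (to
# discount baseline and scaling differences), then rank the database
# sequences by Euclidean distance to the query. Data and names invented.
import math

def znorm(seq):
    mean = sum(seq) / len(seq)
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / len(seq)) or 1.0
    return [(x - mean) / std for x in seq]

def similarity_search(query, tdb, top=3):
    """Return the `top` sequences in `tdb` most similar to `query`."""
    q = znorm(query)
    scored = []
    for name, seq in tdb.items():
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, znorm(seq))))
        scored.append((d, name))
    return sorted(scored)[:top]

if __name__ == "__main__":
    tdb = {"s1": [1, 2, 3, 4], "s2": [10, 20, 30, 40], "s3": [4, 3, 2, 1]}
    # s1 and s2 match the query after normalisation, despite different scales
    print(similarity_search([2, 4, 6, 8], tdb, top=2))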
4.2 Periodical Problems
The periodicity problem is the problem of finding periodic patterns, or cyclicity, occurring in time-related databases (TDBs). The problem is related to two concepts: pattern and interval. In any selected sequence of a TDB, we are interested in finding patterns which repeat over time and their recurring intervals (periods), or finding the repeating patterns of a sequence (or TDB) as well as the interval which corresponds to the pattern period. For solving a Periodical Problem in a TDB, the main steps are as follows:

• determining some definitions of the concept of a period under some assumptions: this step allows us to know what kind of periodicity search we want to perform on the TDB.
• building up a set of algorithms: this step allows us to use properties of periodic time series for finding periodic patterns from a subset of the TDB by using algorithms.
• processing simulation algorithms: this step allows us to find patterns from the whole TDB by the algorithms.

A lot of techniques have been involved in these kinds of problems by using pure mathematical analysis such as function analysis, data distribution analysis and so on, e.g., [9].
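As a hedged illustration of the "building up a set of algorithms" step, the sketch below estimates a candidate period of a numeric sequence as the lag with the highest autocorrelation. This is only one simple way to detect periodicity; the partial periodic pattern mining of [9], for instance, works on symbolic patterns rather than raw autocorrelation.

# Simple period detection: pick the lag (within a range) at which the
# autocorrelation of the sequence is highest. Illustrative only; the
# example series and lag range are invented.
def autocorrelation(x, lag):
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    if var == 0:
        return 0.0
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var

def estimate_period(x, min_lag=2, max_lag=None):
    max_lag = max_lag or len(x) // 2
    return max(range(min_lag, max_lag + 1), key=lambda lag: autocorrelation(x, lag))

if __name__ == "__main__":
    series = [0, 1, 2, 1, 0, 1, 2, 1, 0, 1, 2, 1, 0, 1, 2, 1]
    print(estimate_period(series))   # dominant period of this toy series: 4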
4.3 Discussion
In a time-series TDB, similarity and periodical search problems are sometimes difficult even when there are many existing methods, because most of the methods are either inapplicable or prohibitively expensive. There is also another difficult problem: how can we combine multiple-level similarity or periodical search in a multiple-level model? With the reference cube structure, such difficult problems can be solved by extending the methods mentioned in the previous subsections, but the problem of combining multiple-level similarity and periodicity in a multiple-level model is still unsolved. Also, more sophisticated techniques need to be developed to reduce memory work-space. In fact, similarity and periodical search problems can be combined into the problem of finding interesting sequential patterns in TDBs. In recent years, some new algorithms have been developed for "fast mining of sequential patterns in large TDBs":

• generalized sequential pattern (GSP) algorithm: it essentially performs a level-wise or breadth-first search of the sequence lattice spanned by the subsequence relation,
• sequential pattern discovery using equivalence classes (SPADE) algorithm: it decomposes the original problem into smaller sub-problems using equivalence classes on frequent sequences [15].

With any new algorithm, there is one important question that has often been asked: how can we implement the new algorithm directly on top of a time-series TDB?
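The following is a much-simplified sketch of the level-wise (breadth-first) search that GSP performs: records are treated as sequences of single items, length-(k+1) candidates are generated by extending frequent length-k sequences with frequent items, and a candidate is supported by a record if it occurs in it as a not necessarily contiguous subsequence. The real GSP and SPADE algorithms [15] additionally handle itemset elements, time constraints and much larger databases than this naive counting could; the example database and minimum support are invented.

# Much-simplified, level-wise frequent-subsequence mining in the spirit
# of GSP: single-item elements, no time constraints, naive support counts.
def is_subsequence(pattern, seq):
    it = iter(seq)
    return all(item in it for item in pattern)

def support(pattern, db):
    return sum(1 for seq in db if is_subsequence(pattern, seq))

def levelwise_frequent_sequences(db, min_support):
    items = sorted({x for seq in db for x in seq})
    freq_items = [x for x in items if support((x,), db) >= min_support]
    frequent = [(x,) for x in freq_items]
    result = list(frequent)
    while frequent:
        # extend each frequent length-k sequence by one frequent item
        candidates = sorted({p + (x,) for p in frequent for x in freq_items})
        frequent = [c for c in candidates if support(c, db) >= min_support]
        result.extend(frequent)
    return result

if __name__ == "__main__":
    db = [("a", "b", "c"), ("a", "c"), ("a", "b", "c", "d"), ("b", "c")]
    print(levelwise_frequent_sequences(db, min_support=3))
    # -> [('a',), ('b',), ('c',), ('a', 'c'), ('b', 'c')]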
5. TIME SERIES TEMPORAL DATA MINING
Statistics has been an important tool for data analysis for a long time. For example, Bayesian inference is the most extensively studied statistical method for knowledge discovery (e.g., [2], [10], [18]), and Markov Models and Hidden Markov Models (e.g., [14]) have also made their way into the temporal knowledge discovery process. A time series is a record of the values of any fluctuating quantity measured at different points of time. One characteristic feature which distinguishes time series data from other types of data is that, in general, the values of the series at different time instants will be correlated (time series analysis theory can be found in any standard textbook of time series analysis, e.g., [1]). The application of time series analysis techniques in temporal data mining is often called Time Series Data Mining. A great deal of work has gone into identifying, gathering, cleaning, and labeling the data, into specifying the questions to be asked of it, and into finding the right way to view it to discover useful temporal patterns. Time series analysis methods have been applied to the following major categories in temporal data mining:

1. Representation of Temporal Sequence: this refers to the representation of data before actual temporal data mining techniques take place. There are two major methods:
• General representation of data: representation of data as time series data in either continuous or discontinuous, linear/non-linear models, stationary/non-stationary models and distribution models (e.g., time domain representation and time series model representation).
• General transformation of representation of data: representation of data as time series data under either continuous or discontinuous transformations (e.g., Fourier transformation, Wavelet transformation and Discretization transformation).

2. Measure of Temporal Sequence: measuring a temporal characteristic element in given definitions of similarity and/or periodicity in a temporal sequence (or two subsequences of a temporal sequence) or between temporal sequences. There are two methods:
• Characteristic distance measuring in the time domain: measuring distance between temporal characteristics in either a continuous or discontinuous time domain (e.g., the Euclidean squared distance function).
• Characteristic distance measuring in a domain other than time: measuring distance between temporal characteristics in either a continuous or discontinuous domain other than time (e.g., a distance function between two distributions).

3. Prediction of Temporal Sequence: the main goal of prediction is to predict some fields in a database based on the time domain. The techniques can be classified into two models:
• Temporal classification models: the basic goal is to predict the most likely state of a categorical variable (the class) in the time domain.
• Temporal regression models: the basic goal is to predict a numeric variable in a set by using different transformations (e.g., linear or non-linear) on databases to find temporal information (or patterns) of the different (or the same) categorical data sets (classes).

The techniques involved in the above two methods can be divided into the following classes:

1. Temporal data clustering: temporal clustering targets separating the temporal data into subsets that are similar to each other. There are two fundamental problems of temporal clustering:
• to define a meaningful similarity measure, and
• to choose the number of temporal clusters (if we do not know the cluster numbers).

2. Temporal data prediction: the goal of temporal prediction is to predict some fields based on other temporal fields. Temporal data prediction also involves using prior temporal patterns (or models, knowledge) for finding the data attributes relevant to the attribute of interest.

3. Temporal data summarization: the purpose of temporal data summarization is to describe a subset of temporal data by representing extracted temporal information in a model, in rules or in patterns. It provides a compact description for a temporal dataset. It could also involve a logic language such as temporal logic, fuzzy logic and so on.

4. Temporal data dependency: temporal dependency modelling describes time dependencies among data and/or temporal attributes of data. There are two dependency models: qualitative and quantitative. The qualitative dependency models specify temporal variables (e.g., time gap) that are locally dependent on a given state-space S. The quantitative dependency models specify the value dependencies (e.g., using a numerical scale) in a statistical space P.

Recently, there have been various results on discovering temporal information which have offered forums to explore temporal data mining progress and future work concerning temporal data mining. But the general theory and general method of temporal data analysis for discovering temporal patterns in temporal sequence data analysis are not well known.
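The "general transformation of representation of data" item above mentions the Fourier transformation; one common use is to keep only the first few discrete Fourier coefficients of a series as a compact representation on which distances can then be computed. In the sketch below, the number of coefficients kept and the example series are arbitrary choices for illustration.

# Fourier-based representation: keep only the first few discrete Fourier
# coefficients of a series as a compact description. The number of
# coefficients kept (4) and the example series are arbitrary.
import cmath

def dft_coefficients(x, n_keep=4):
    n = len(x)
    coeffs = []
    for k in range(min(n_keep, n)):
        c = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)) / n
        coeffs.append(c)
    return coeffs

if __name__ == "__main__":
    series = [2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0, 1.0]
    for k, c in enumerate(dft_coefficients(series)):
        print(f"coefficient {k}: magnitude {abs(c):.3f}")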
6. CHALLENGES AND RESEARCH DIRECTIONS
Recent advances in data collection and storage technologies have made it possible for companies, administrative agencies and scientific laboratories to keep vast amounts of temporal data relating to their activities. Data mining refers to such an activity to make the automatic extraction of different levels of knowledge from data feasible. One of the main unresolved problems that arise during the data mining process, often called the General Analysis Method of Temporal Data Mining, is treating data that contains temporal information.

6.1 Challenge Questions
Data mining is a step in knowledge discovery in databases; although successful data mining applications continue to appear, the fundamental problems are still as difficult as they have been for the past decade. One such difficult and fundamental problem is the development of a general data mining analysis theory. Temporal data mining researchers have paid some attention to this problem, but results still remain in their infancy. One of the important roots of data mining analysis is statistical analysis theory. The general temporal data mining analysis theory includes two important analysis methods:

• Data structural temporal knowledge analysis method: this method involves the discovery of the data's prior temporal knowledge, and the exploitation of that knowledge in a data analysis model to establish the link between the present temporal knowledge and the future temporal knowledge.
• Data temporal measure analysis method: this method involves the transformation of the initial data temporal domain (or space) into another domain (or space), and then the use of this new domain to represent the original temporal data.

6.2 Some Answers to the Challenge Questions
During the past few years, we have proposed a formal framework for the definitions and general hidden distribution theory of temporal data mining. We have also investigated applications in temporal clustering, temporal classification and temporal feature selection for temporal data mining. The major work we have done in answering the temporal data mining challenge questions is:

• We have established a General Hidden Distribution-based Analysis Theory for temporal data mining. The general mining analysis theory is based on the statistical analysis method, but the traditional statistical assumptions only come from the data itself. There are two important concepts in the theory: 1) the data qualitative set and data quantitative set, and 2) the data hidden conditional distribution function. The data qualitative set is the set that decides the data moving structure, such as data periodicity and similarity. In other words, the data qualitative set is a base of the data. The data quantitative set is the set that decides the numerical range of the data moving structure. The data hidden conditional distribution function is built on the characteristics of the data qualitative and data quantitative sets. Another feature of the general mining analysis method is that we can use (extensions of) all existing statistical analysis methods and techniques for mining temporal patterns.

• We have proposed an algorithm called The Additive Distributional Recursion Algorithm (ADRA) in the General Hidden Distribution-based Analysis Theory for building up temporal data models. The algorithm uses the sieve method (an important method in number theory) to discover temporal distribution functions (models, patterns).

• We have extended a normal measure method to a new Temporal Measure Method, which is called the Time-gap Measure Method. The new measure method brings "time length" (which is between temporal events) or "time interval" (which is within a temporal event) into a time point (or time value) variable. After a temporal sequence is transformed, it can be measured in both the state-space S and the probability space P. The time-gap is used as a temporal variable in the time distribution function f(t_v) or in temporal variable functional equations embedded in temporal models of the sequence.

• We have extended and built up a new application of fundamental mathematics techniques for dealing with large temporal datasets, massive temporal datasets and distributed temporal datasets. The new application is called Temporal Sequence Set-Chains (a special case of the Temporal Set-Chains is a Markov Set-Chain). The key issue in temporal sequence set-chains is the use of stochastic matrices of samples to build up a moving kernel distribution. The temporal sequence set-chains can be used for mining a large temporal sequence, a massive temporal sequence and a distributed temporal sequence such as a Web temporal data sequence (e.g., Web content sequences, Web usage sequences and Web structural sequences).

• We have proposed a framework of a Temporal Clustering method for discovering temporal patterns. In our temporal clustering method, there are three stages of temporal data mining in temporal clustering analysis: 1) the input stage: what appropriate measure of similarity to use, 2) the algorithm stage: what types of algorithms to use, and 3) the output stage: assessing and interpreting the results of the cluster analysis. In the second stage, we have also proposed a framework of a Distribution-based Temporal Clustering Algorithm. The algorithm is based on our general analysis method.

• We have proposed a framework of Temporal Classification. This temporal classification is generated by our Temporal Clustering method. According to our general analysis theory for Temporal Sequence Mining and its application in temporal clustering, there are also the following three steps for constructing a temporal classification: 1) provide a definition of temporal classification, 2) define a distribution distance function, and 3) provide the weighting of temporal objects for changing their class membership. For large numbers of classes, we have proposed discriminant coordinates of the time gap distribution to deal with such kinds of problems.

• We have proposed a framework of Temporal Feature Selection for discovering temporal patterns. There are three steps for feature selection in the temporal sequence. The first step of the framework employs a distance measure function on time-gap distributions between temporal events for discovering structural temporal features. In this step, only rough shapes of patterns are decided. The temporal features are grouped into temporal classifications by employing a distribution distance measure. In the second step, the degree of similarity and periodicity between the extracted features is measured based on the data value distribution models. The third step of the framework consists of a hybrid model for selecting global features based on the results of the first two steps.

• We have established the main steps of applying our general temporal data mining theory to real world datasets with different methods and models. There are three steps in the application of our general analysis for discovering knowledge from a temporal sequence: 1) preprocessing data analysis, including solving data problems and transforming data from its original form into its quantitative set and qualitative set, 2) temporal pattern searching, including qualitative-based pattern searching, quantitative-based pattern searching and discovering global temporal patterns (models), and 3) the interpretation of the global temporal patterns (models) and future prediction.
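To make the idea behind the Time-gap Measure Method more concrete, the sketch below turns two invented event-time sequences into their inter-event gap distributions and compares them with a simple L1 histogram distance. The binning and the L1 choice are illustrative assumptions, not the distribution distance function defined by the authors.

# Illustration of working with time gaps: convert event timestamps into
# inter-event gaps, build empirical gap distributions, and compare two
# sequences by an L1 distance between the distributions. The binning and
# the L1 choice are illustrative, not the authors' distance function.
from collections import Counter

def time_gaps(timestamps):
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]

def gap_distribution(gaps, bin_width=1.0):
    bins = Counter(int(g // bin_width) for g in gaps)
    total = sum(bins.values()) or 1
    return {b: count / total for b, count in bins.items()}

def l1_distance(p, q):
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

if __name__ == "__main__":
    events_a = [0, 2, 4, 6, 8, 10]     # regular events, gap 2
    events_b = [0, 1, 5, 6, 10, 11]    # alternating gaps of 1 and 4
    da = gap_distribution(time_gaps(events_a))
    db = gap_distribution(time_gaps(events_b))
    print(l1_distance(da, db))         # 2.0: the gap distributions do not overlap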
6.3 Future Research Directions
As we mentioned earlier, temporal data mining and knowledge discovery have emerged as fundamental research areas with important applications in science, medicine and business. In this section, we describe some of the major directions of research arising from the recent general analysis theory of temporal data mining:

1. An extension of this temporal sequence measure method to general temporal points (e.g., a temporal interval-based gap function) allowing an arbitrary interval between temporal points may lead to a very powerful temporal sequence transformation method.

2. An extension of the notion of Temporal Sequence Set-Chains to different temporal variables, or different components of a temporal variable, can be applied to deal with the following problems of temporal data mining:
• the number of temporally related attributes of each observation increases,
• the number of temporally related observations increases, and
• the number of temporally related distribution functions increases.

3. An important extension of the general temporal mining theory is the development of distributed temporal data mining algorithms.

4. In applications of temporal data mining, all new temporal data mining theories, methods and techniques should be developed on/with privacy and security models and protocols appropriate for temporal data mining.

5. In general data mining theory, we may need to develop fundamental mathematical techniques of fuzzy methods for mining purposes (e.g., temporal fuzzy clustering and algorithms, temporal fuzzy association rules and new types of temporal databases).

7. CONCLUDING REMARKS
Temporal data mining is a very fast expanding field, with many new research results reported and many new temporal data mining analysis methods or prototypes developed recently. Some overview articles on temporal data mining have discussed different frameworks covering research and application in temporal data mining. In [19], for example, Roddick and Spiliopoulou have presented a comprehensive overview of techniques for the mining of temporal data. In this report we have provided an overview of the temporal data mining process and some background to Temporal Data Mining. We also discussed a difficult and fundamental problem, a general analysis theory of temporal data mining, and provided some answers to the problem. This led into a discussion of why there has been a need for Temporal Data Mining in industry, which has been a major factor in the efforts that have gone into building the present generation of Temporal Data Mining systems. We have presented a number of areas which are related to Temporal Data Mining in their objectives, and compared and contrasted these technologies with Temporal Data Mining.

Acknowledgements
This research has been supported in part by an Australian Research Council (ARC) grant and a Macquarie University Research Grant (MURG).

8. REFERENCES
[1] D. Brillinger, editor. Time Series: Data Analysis and Theory. Holt, Rinehart and Winston, New York, 1975.
[2] P. Cheeseman and J. Stutz. Bayesian classification (AUTOCLASS): Theory and results. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining. AAAI Press / MIT Press, 1995.
[3] T. Fulton, S. Salzberg, S. Kasif, and D. Waltz. Local induction of decision trees: Towards interactive data mining. In Simoudis et al. [21], page 14.
[4] B. R. Gaines and P. Compton. Induction of meta-knowledge about knowledge discovery. IEEE Trans. on Knowledge and Data Engineering, 5:990–992, 1993.
[5] C. Glymour, D. Madigan, D. Pregibon, and P. Smyth. Statistical inference and data mining. Communications of the ACM, 39(11):35–41, Nov. 1996.
[6] F. H. Grupe and M. M. Owrang. Data-base mining: discovering new knowledge and competitive advantage. Information Systems Management, 12:26–31, 1995.
[7] J. Han, Y. Cai, and N. Cercone. Knowledge discovery in databases: An attribute-oriented approach. In Proceedings of the 18th VLDB Conference, pages 547–559, Vancouver, British Columbia, Canada, Aug. 1992.
[8] J. W. Han, Y. D. Cai, and N. Cercone. Data-driven discovery of quantitative rules in relational databases. IEEE Trans. on Knowledge and Data Engineering, 5:29–40, February 1993.
[9] J. W. Han, Y. Yin, and G. Dong. Efficient mining of partial periodic patterns in time series database. IEEE Trans. on Knowledge and Data Engineering, 1998.
[10] D. Heckerman, H. Mannila, D. Pregibon, and R. Uthurusamy, editors. Learning Bayesian networks: the combination of knowledge and statistical data. AAAI Press, 1994.
[11] D. Heckerman, H. Mannila, D. Pregibon, and R. Uthurusamy, editors. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97). AAAI Press, 1997.
[12] E. Keogh and P. Smyth. A probabilistic approach to fast pattern matching in time series databases. Page 126.
[13] A. Ketterlin. Clustering sequences of complex objects. In Heckerman et al. [11], page 215.
[14] C. Li and G. Biswas. Temporal pattern generation using hidden Markov model based unsupervised classification. In Proc. of IDA-99, pages 245–256, 1999.
[15] M. J. Zaki. Fast mining of sequential patterns in very large databases. University of Rochester Technical Report, 1997.
[16] O. Etzion et al., editors. Temporal Databases: Research and Practice. Springer-Verlag, LNCS 1399, 1998.
[17] B. Padmanabhan and A. Tuzhilin. Pattern discovery in temporal databases: A temporal logic approach. In Simoudis et al. [21], page 351.
[18] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction and Search. Springer-Verlag, 1993.
[19] J. Roddick and M. Spiliopoulou. A survey of temporal knowledge discovery paradigms and methods. IEEE Transactions on Knowledge and Data Engineering, 2002.
[20] R. O. Duda and P. Hart. Pattern Classification and Scene Analysis. John Wiley and Sons, 1973.
[21] E. Simoudis, J. W. Han, and U. Fayyad, editors. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press, 1996.
[22] P. Smyth. Clustering using Monte Carlo cross-validation. In Simoudis et al. [21], page 126.
[23] T. Oates. Identifying distinctive subsequences in multivariate time series by clustering. In 5th International Conference on Knowledge Discovery and Data Mining, pages 322–326, 1999.
[24] J. D. Ullman and C. Zaniolo. Deductive databases: achievements and future directions. SIGMOD Record (ACM Special Interest Group on Management of Data), 19(4):75–82, Dec. 1990.
[25] D. Urpani, X. Wu, and J. Sykes. RITIO - rule induction two in one. In Simoudis et al. [21], page 339.
[26] U. M. Fayyad and O. L. Mangasarian. Data mining: Overview and optimization opportunities. INFORMS, Special issue on Data Mining, 1998.
Distances for Spatio-temporal clustering

Mirco Nanni, ISTI - Institute of CNR, Via Moruzzi 1 – Loc. S. Cataldo, 56124 Pisa, Italy (nanni@guest.cnuce.cnr.it)
Dino Pedreschi, Dipartimento di Informatica, Università di Pisa, Via F. Buonarroti 2, 56127 Pisa, Italy (pedre@di.unipi.it)

[Pages 91–97: the body of this paper did not survive text extraction. The recoverable structure of this portion comprises the abstract, a keywords list, Section 1 ("INTRODUCTION", with Subsection 1.1 "Aim of the paper") and Section 2 ("RELATED WORK"); their text cannot be reconstructed.]
[Sections 3 ("A DATA MODEL FOR TRAJECTORIES"), 4 ("A FAMILY OF DISSIMILARITY MEASURES", with Subsections 4.1 "General definition and example instances", 4.2 "Mathematical properties" and 4.3 "Computational properties") and the opening of Section 5 ("EFFECTS ON SOME CLUSTERING ALGORITHMS", with Subsections 5.1 "Dissimilarity Matrix-based" and 5.2 "K-means") are not recoverable from the source beyond their headings.]
[The remainder of Section 5 (Subsection 5.3 "Optimisations") and Section 6 ("EXPERIMENTATIONS", with Subsections 6.1 "Synthesised dataset: The “leader” model" and 6.2 "Effects of Optimisation") are not recoverable from the source beyond their headings and the residue of a performance plot comparing a naive k-means with the optimised variant (running time against dataset size).]
[Section 7 ("CONCLUSIONS") and this paper's reference list (ten entries) are not recoverable from the source.]

Author Index

Tamas Abraham …… 17
Janice Boughton …… 65
Richard Brookes …… 13
Frada Burstein …… 37
N. Scott Cardell …… 1
Peter Christen …… 99, 117
Tim Churches …… 99
Adam Czezowski …… 117
Olivier de Vel …… 17
Mikhail Golovnya …… 1
Raj P. Gopalan …… 109
Ryan Kling …… 17
Inna Kolyshkina …… 13
Shonali Krishnaswamy …… 47
Weiqiang Lin …… 83
Seng Wai Loke …… 47
Sabine McConnell …… 75
Mirco Nanni …… 91
Tariq Nuruddin …… 109
Mehmet A. Orgun …… 83
Dino Pedreschi …… 91
Ben Raymond …… 29
David B. Skillicorn …… 75
Dan Steinberg …… 1
Yudho Giri Sucahyo …… 109
Sérgio Viademonte …… 37
Zhihai Wang …… 57, 65
Geoffrey I. Webb …… 57, 65
Graham J. Williams …… 83
Eric J. Woehler …… 29
Arkady Zaslavsky …… 47
Justin Zhu …… 99