
Sonification of time dependent data

2002, VDM@ ECML/ …


Foreword

Visual data mining is a collection of interactive methods for knowledge discovery from data, which integrates the human perceptual capability to spot patterns, trends, relationships and exceptions with the capabilities of modern digital computing to characterise data structures and display data. The underlying technology builds on visual and analytical processes developed in various disciplines, including data mining, information visualisation and statistical learning from data, with algorithmic extensions that handle very large, multidimensional, multivariate data sets. The growing research and development in visual data mining offers the machine learning and data mining communities complementary means of analysis that can assist in uncovering patterns and trends likely to be missed with other, non-visual methods. Consequently, the machine learning and data mining communities have recognised the significance of this area. The first edition of the workshop took place at the ECML/PKDD conference in Freiburg, following a similar workshop at the ACM KDD conference in San Francisco.

The first edition of the workshop offered ECML/PKDD 2001 participants a mixture of presentations on state-of-the-art methods and techniques, together with controversial research issues and applications. A report about this workshop has been published in SIGKDD Explorations 3 (2), pp. 78-81. The presentations were grouped in three streams: Methodologies for Visual Data Mining; Applications of Visual Data Mining; and Support for Visual Data Mining. The workshop also included two invited presentations: one from Erik Granum, head of the Laboratory of Computer Vision and Media Technology, Aalborg University (on behalf of the 3DVDM group), who presented an overview of the unique interdisciplinary 3DVDM group, its current projects and the research opportunities there; and one from Monique Noirhomme-Fraiture, who demonstrated 2D and 3D visualisation support for visual data mining of symbolic data. The workshop brought together a number of cross-disciplinary researchers, who were pleased with the event, and there was consensus about the necessity of turning it into an annual meeting, where researchers from both academia and industry can exchange and compare both relatively mature and "greenhouse" theories, methodologies, algorithms and frameworks in the emerging field of visual data mining.

Introduction

Clustering is the process of grouping physical or abstract data based on their similarity. It is an important analysis method in data mining, which helps people to better understand and observe the natural classification or structure of the data. Clustering algorithms are used to automatically classify data items into meaningful clusters: after clustering, the items within a cluster are highly similar to each other, while the items in different clusters are dissimilar. The following factors are commonly considered when evaluating a clustering algorithm.

• Scalability: a good clustering algorithm can deal with large datasets containing up to millions of data items.
• Discovery of clusters with arbitrary shape: a cluster may have an arbitrary shape, so a clustering algorithm should not apply only to regular clusters.
• Minimum input parameters: requiring the user to supply important parameters is a heavy burden and makes it harder to obtain good-quality clustering.
• Insensitivity to the order of input records: inputting the data in a different order should not lead to different results.
• High dimensionality: a dataset may include many attributes, and clustering data in high-dimensional space is in high demand in many applications.

Current clustering algorithms can be broadly classified into two categories: hierarchical and partitional. Hierarchical algorithms, such as BIRCH [7], CURE [8] and CHAMELEON [9], decompose a dataset into several levels of nested partitions. They start by placing each object in its own cluster and then merge these atomic clusters into larger and larger clusters until all objects are in a single cluster, or they reverse the process by starting with all objects in one cluster and subdividing it into smaller pieces. Partitional algorithms, such as CLARA [5] and CLARANS [6], partition the objects based on a clustering criterion. The popular methods K-means and K-medoid use the cluster average, or the object closest to the cluster center, to represent a cluster.
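The agglomerative (bottom-up) process described above can be sketched in a few lines; this is only an illustration using SciPy's hierarchical clustering on a synthetic two-group data set, not an implementation of any of the cited algorithms.

```python
# Minimal sketch of the agglomerative process described above, using SciPy's
# hierarchical clustering; the data set and parameters are illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# two loose groups of 2-d points
data = np.vstack([rng.normal(0, 0.5, (20, 2)),
                  rng.normal(3, 0.5, (20, 2))])

# each point starts in its own cluster; pairs of clusters are merged repeatedly
merge_tree = linkage(data, method="average")

# cut the merge tree to recover a flat partition, here into 2 clusters
labels = fcluster(merge_tree, t=2, criterion="maxclust")
print(labels)
```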

New clustering algorithms have been proposed in recent research [4]. For example, DBSCAN, OPTICS and CLIQUE are based on density; STING and WAVE CLUSTER are based on grids; and COBWEB and SOM are based on models.

Knowledge Discovery in Databases (KDD) can be defined [1] as the non-trivial process of identifying patterns in data that are valid, novel, potentially useful and understandable. In most existing data mining tools, visualization is used only during two particular steps of the data mining process: in the first step to view the original data, and in the last step to view the final results. Between these two steps, an automatic algorithm performs the data-mining task. The user only has to tune some parameters before running the algorithm and then wait for its results.

Some new methods have recently appeared [2], [3], [4], trying to involve the user more significantly in the data mining process and to use visualization more intensively [5], [6]; this new kind of approach is called visual data mining. In this paper we present some tools we have developed, which integrate automatic algorithms, interactive algorithms and visualization tools. These tools are two interactive classification algorithms and a visualization tool created to show the results of an automatic algorithm. The classification algorithms use both human pattern recognition facilities and computer processing power to perform an efficient user-centered classification; the second algorithm is derived from the first one and performs clustering. Section 4 concludes the paper and lists some future work.

Visual data mining is a part of the KDD process [1] which places an emphasis on visualisation techniques and human cognition to identify patterns in a data set. [1] identified three different scenarios for visual data mining, two of which are actually connected with the visualisation of final or intermediate results, while one operates directly on a visual representation of the data. The design of data visualisation techniques, in a broad sense, is the formal definition of the rules for translating data into graphics. Generally, the term 'information visualisation' has been related to the visualisation of large volumes of abstract data. The basic assumption is that large and normally incomprehensible amounts of data can be reduced to a form that can be understood and interpreted by a human through the use of visualisation techniques. The process of finding the appropriate visualisation is not a trivial one. A number of works offer results that can be applied as guiding heuristics. For example, [2] defined the Proximity Compatibility Principles (PCP) for various visualization methods in terms of tasks, data and displays: if a task requires the integration of multiple data variables, they should be bundled in proximity in an integrated display. Based on this principle, the authors concluded that 3D graphs do not have an advantage over 2D graphs for scientific visualisation (which may not necessarily hold for visual data mining).

Visual data mining relies heavily on the human visual processing channel and utilises human cognition overall. The visual data mining cycles are shown in Fig. 1. In most systems, visualisation is used to represent the output of conventional data mining algorithms (the path shown in Fig. 1a). Fig. 2 shows an example of visualisation of the output of an association rule mining algorithm. In this case, visualisation assists in comprehending the output of the data mining algorithms. Fig. 1b shows the visual data mining cycle when visualisation is applied to the original or pre-processed data. In this case, the discovery of patterns and dependencies is left to the capacity of the human visual reasoning system. The success of the exercise depends on the metaphor selected to visualise the input data [3].

Figure 1

A Parallel Bar Chart.

Figure 2

Example of visualisation of the output of an association rule miner.

Combining visual data mining and sonification

Related Work

The main idea of density-based approaches is to find regions of high density and low density, with high-density regions being separated from low-density regions. These approaches make it easy to discover clusters of arbitrary shape. A common way is to divide the high-dimensional space into density-based grid units: units containing relatively high densities are the cluster centers, and the boundaries between clusters fall in the regions of low-density units. For example, the CLIQUE [1] algorithm processes dense units level by level. It first determines 1-dimensional dense units by making a pass over the data. Having determined (k-1)-dimensional dense units, the candidates for k-dimensional units are determined. A cluster is a maximal set of connected dense units in k dimensions. This algorithm automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces, but the accuracy of the clustering result may be degraded for the sake of the simplicity of the method.

The alternative way is to calculate the 'directly density-reachable' or 'reachability-distance' parameter of each object. For example, DBSCAN [3] aims at discovering clusters of arbitrary shape based on the formal notion of density-reachability for k-dimensional points. OPTICS [2] solves the problem of DBSCAN, which only computes a single-level clustering. OPTICS produces a special order of the database with respect to its density-based clustering structure, and according to the order of the reachability-distance of each object, it can quickly reach the high-density regions. OPTICS is good for both automatic and interactive cluster analysis, including finding the intrinsic clustering structure, and is not limited to one global parameter setting. But it is infeasible to apply it in its current form to a database containing several million high-dimensional objects. In this paper, we propose an integrative clustering method which partitions the data space using density- and grid-based techniques and visualizes the clustered result.

Clustering by Ordering Dense Units

Basic statement

The density-based technique is adopted in our approach: the data space is partitioned into non-overlapping rectangular units, and the density distribution is deduced by counting the number of data items in each rectangular unit.

Suppose D is an n-dimensional dataset with m items: D = {X_1, X_2, ..., X_i, ..., X_m}, where X_i = (X_i1, X_i2, ..., X_ij, ..., X_in), i ≤ m, and X_ij is the value of the j-th attribute of X_i. If each dimension of the dataset is divided equally into t parts, then all the items in the dataset fall into k units: U = {U_1, U_2, ..., U_i, ..., U_k} (k ≤ t^n), where U_i = (U_i1, U_i2, ..., U_ij, ..., U_in) is the vector of interval indices of the equally divided attributes. Two units U_1 and U_2 are defined to be adjacent only when exactly one attribute index differs by one and all the others coincide: |U_1j − U_2j| = 1 and U_1s = U_2s for all s ≠ j (j, s ≤ n). We define density peaks as those units whose densities are larger than those of all adjacent units; similarly, we define density valleys as the units whose densities are lower than those of the adjacent ones.

Algorithm

When the high-dimensional space is divided into k equal subspaces (units), the density peaks are regarded as the cluster centers, so the key step is finding the density peaks. In our approach, CODBU (Clustering by Ordering Density-Based Units), the units with densities greater than a threshold are ranked by density value. The change from density peaks to density valleys is expressed by a hierarchical level: the density peak is positioned in the first level of the cluster, its adjacent units are in the second level, and the density valley is positioned in the last level of the cluster. First, we calculate the density value of each unit and rank the units whose densities are greater than the threshold value. The unit with the largest density is analysed first. Each unit is compared in density with its adjacent units (neighbours): if its density is greater than that of every adjacent unit, it is considered a density peak, is assigned to the first level, and a new cluster emerges. If its density is less than that of one adjacent unit, the unit is grouped into the cluster of that adjacent unit; if its density is less than those of several adjacent units, the unit is grouped into the cluster of the adjacent unit with the lowest level. Fig. 1 describes the clustering process, in which only 2 basic parameters are required: the number of subdivisions for each dimension and the density threshold value. These two values are entered manually according to the size of the dataset and the required accuracy of the result. Figure 2 shows a simple 2-dimensional data set. The sequence number of each unit is shown in the unit, and '*' marks the data points scattered within it. Sorting all the squares whose density values are larger than 3 gives (sorted by density): 4 (11), 5 (9), 8 (8), 10 (8), 1 (7), 9 (7), 11 (7), 3 (6), 6 (5), 12 (5), 7 (4), 2 (3). Based on this result, we can find 2 clusters (sorted by level):

C1={4 (1), 5 (2), 1 (2), 10 (2), 3 (2), 6 (3), 11 (3)}; C2={8 (1), 9 (2), 12 (2), 7 (2), 2 (2)}.

Fig.2 Two clusters based on the density values of units

We can see that, although the density of unit 5 is larger than that of unit 8, as it is adjacent to unit 4 which has higher density, unit 5 is still grouped into the first cluster. Since the density value of unit 8 is greater than that of any adjacent unit, it forms a new cluster. The density of unit 7 is lower than those of unit 6 and 8, but the unit 8 has a lower level than unit 6, so unit 7 is grouped into the second cluster.
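A hedged sketch of the CODBU idea just described is given below: grid the space, count the points per unit, sort the dense units by density, then grow clusters from density peaks towards density valleys. The parameter values (t, threshold) and helper names are illustrative, not taken from the paper.

```python
# Hedged sketch of the CODBU idea described above. Parameter values are illustrative.
import numpy as np

def codbu(data, t=5, threshold=1):
    n = data.shape[1]
    # map every point to its unit (vector of bin indices, one per attribute)
    mins, maxs = data.min(axis=0), data.max(axis=0)
    widths = (maxs - mins) / t
    widths[widths == 0] = 1.0
    bins = np.minimum(((data - mins) / widths).astype(int), t - 1)

    # unit densities
    density = {}
    for b in map(tuple, bins):
        density[b] = density.get(b, 0) + 1

    # units above the threshold, sorted by decreasing density
    dense = sorted((u for u, d in density.items() if d >= threshold),
                   key=lambda u: -density[u])

    def neighbours(u):
        # adjacent units differ by 1 in exactly one attribute index
        for j in range(n):
            for step in (-1, 1):
                v = list(u)
                v[j] += step
                yield tuple(v)

    cluster, level = {}, {}
    next_id = 0
    for u in dense:                       # highest density first
        cand = [v for v in neighbours(u) if v in cluster]
        if not cand:                      # density peak: start a new cluster
            cluster[u], level[u] = next_id, 1
            next_id += 1
        else:                             # join the lowest-level dense neighbour
            best = min(cand, key=lambda v: level[v])
            cluster[u], level[u] = cluster[best], level[best] + 1
    # label points by the cluster of their unit (-1 for sparse units)
    return np.array([cluster.get(tuple(b), -1) for b in bins])

# usage on a random 3-dimensional data set
print(codbu(np.random.rand(500, 3), t=5, threshold=2)[:20])
```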

Algorithm analysis

The dataset is scanned only once in our approach. Suppose k is the number of subspaces whose densities are greater than the threshold value; the time complexity of applying quicksort is O(k log k). By building a search tree, the time complexity of comparing the density value of each unit with those of its adjacent units is O(nk), where n is the number of dimensions. Since the total number of units is much smaller than the number of data items in the dataset, the time complexity is decreased dramatically. Furthermore, analysing the data space in density order reflects the various clusters better than the traditional approach in which data items are simply grouped together whenever the densities are greater than a set threshold. Therefore our approach can cluster a high-dimensional data space more quickly and accurately. In addition, the quality of the clustered result is not influenced by the shape of the high-dimensional clusters or the order of the data input, and the parameters are easy to set up and modify, so all the criteria mentioned in Section 1 have been met.

Visualization

Clustering high-dimensional datasets helps users to better understand the data structure and relationships. Visualization techniques play a very important role in displaying data and clustered results, making them clearer and more reliable. The visualization techniques for high-dimensional datasets [10] [11] can be divided into two types. One type, such as "parallel coordinates" [12], divides the 2-d plane into several parts, each part representing an attribute. The other type reduces the dimensions, which is implemented by giving weights to the attributes of the n-dimensional data according to their relative importance and then combining them linearly. We present a new method to project high-dimensional data, which uses the stimulation spectrum to project high-dimensional data onto a 3-d space.

A natural color is the summation of the energy distribution at each wavelength in the range of the visible spectrum. This energy distribution is called the stimulation spectrum Φ(λ). Every stimulation spectrum can be transferred to a point in the RGB color space. The quantitative relationship between the stimulation spectrum Φ(λ) and the RGB color coordinates is given by the following formula:

R = k ∫ Φ(λ) r(λ) dλ,  G = k ∫ Φ(λ) g(λ) dλ,  B = k ∫ Φ(λ) b(λ) dλ,  λ over the visible spectrum    (4.2)

In this formula, k is a proportionality coefficient and λ is the wavelength of the visible spectrum, ranging from 400 nm to 700 nm. r(λ), g(λ) and b(λ) stand for the spectrum tristimulus functions of red, green and blue; their values, tabulated every 5 nm, have been measured by the CIE and are already known. Fig. 3 shows the spectrum tristimulus functions of the CIE 1931 standard colorimetric system.

Figure 3

Viewpoint movements: A) Low; B) Rotate; C) Far.

Fig. 3 Spectrum tristimulus functions graph

Each data item X_i = (X_i1, X_i2, ..., X_ij, ..., X_in) in the high-dimensional space can be viewed as a stimulation spectrum that spreads equally over the visible range from 400 nm to 700 nm. Here X_i can be regarded as a function of λ, and the change of the attribute values corresponds to that of the spectrum tristimulus: for example, X_i(400) = X_i1, ..., X_i(700) = X_in. We can work out the 3-d coordinates of the data item X_i in the projection space according to formula (4.2), and by adjusting the value of k the projection space can be made not only the RGB color space but any 3-d space.

We can see from the above that projecting the data items as stimulation spectrums can also be viewed as a kind of weighted linear combination of the n-dimensional data through the spectrum tristimulus functions r(λ), g(λ) and b(λ). Using the spectrum tristimulus functions to convert high-dimensional data projects the data completely into the projection space. This is because the fundamental purpose of r(λ), g(λ) and b(λ) is to project stimulation spectrums into the color space, so they reflect the features of the original data well. From Fig. 3 we can see that the n attributes can be divided into 3 parts: b(λ) corresponds mainly to the attributes in the first part of the data, while g(λ) and r(λ) correspond mainly to the middle and the last parts of the data attributes. In this way, the projection of a data item is described by all 3 coordinate values. On the other hand, because the spectrum tristimulus functions cover equal areas, the ranges of the coordinate axes in the projection space are equal, and the data will not be over-concentrated around some coordinate axes.
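The projection idea can be sketched as follows. This is only an illustration: each n-dimensional item is read as a spectrum sampled evenly between 400 and 700 nm and summed against three curves; Gaussian bumps stand in for the CIE r(λ), g(λ), b(λ) tables, which the method would normally take from the CIE 1931 data.

```python
# Illustrative sketch of the spectrum-projection idea; the tristimulus curves
# used here are stand-ins, not the real CIE 1931 tables.
import numpy as np

def project_to_rgb(data, k=1.0):
    m, n = data.shape
    lam = np.linspace(400, 700, n)              # one wavelength per attribute
    def bump(center, width=40):                 # stand-in tristimulus curve
        f = np.exp(-0.5 * ((lam - center) / width) ** 2)
        return f / f.sum()                      # equal area for the three curves
    b, g, r = bump(450), bump(550), bump(600)
    # weighted linear combination of the attributes, one coordinate per curve
    return k * data @ np.column_stack([r, g, b])

# usage: project a random 6-dimensional data set onto 3-d "RGB" coordinates
pts = project_to_rgb(np.random.rand(400, 6))
print(pts.shape)   # (400, 3)
```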

Experiments

A 6-dimensional dataset containing 400 points was used in our CODBU testing experiment (it is a car dataset from http://stat.cmu.edu/datasets/). The attributes in the data set are: fuel economy in miles per U.S. gallon, number of cylinders in the engine, engine displacement in cubic inches, output of the engine in horsepower, 0 to 60 mph acceleration, and vehicle weight in U.S. pounds. Each attribute was divided into 5 parts, so there were 5^6 = 15,625 units, and the 400 points were scattered over 53 units. The threshold was set to 1, which means all the points were processed. 7 clusters were obtained by linking the associated units.

Two visualization techniques were explored to display the result: parallel coordinates and the spectrum tristimulus functions projection. Fig. 4 shows the result using parallel coordinates, in which different clusters are represented by different colors, but the characteristics of the clusters are not obvious. In Fig. 5(a), the values of the densities are reflected by color: the darker the color, the higher the density; 3 clusters are found. Fig. 5(b) displays the result of clustering using our algorithm, with clusters represented by different colors. 7 clusters are obtained based on the previous 3 big clusters, because 7 density peaks were discovered. The clusters shown in white are removed because they include only one data point. Fig. 5(c) uses symbols (such as *, +, o, ^, etc.) instead of colors to display the clusters of the data.

Figure 4

Example of a surface that can be sonified

Figure 5

Part C.b: Application test: four sequences are presented to the user. Each time precise questions are asked. Question 1: Annual sheep population in England and Wales between 1867 and 1939 (see Fig. 5). − Were there more sheep in 1867 than in 1939? − In your opinion, in which year did the sheep population reach its minimum?

Conclusions

The paper presents a new clustering approach based on sorting density-based units. The basic idea is to rank the units in the high-dimensional data space according to their density values and to start from the highest-density unit in the search for density peaks, in order to discover clusters. The experimental results are very promising. Clusters extend from the density peaks to the density valleys, and this is not affected by the shape of the clusters or the order of the data input.


Abstract. The visual data-mining strategy lies in tightly coupling the visualizations and analytical processes into one data-mining tool that takes advantage of the strengths of multiple sources. This paper presents concrete cooperation between automatic algorithms, interactive algorithms and visualization tools. The first kind of cooperation is an interactive decision tree algorithm called CIAD+. It allows the user to be helped by an automatic algorithm based on a support vector machine (SVM) to optimize the interactive split performed at the current tree node or to compute the best split in an automatic mode. This algorithm is then modified to perform an unsupervised task; the resulting clustering algorithm has the same kind of help mechanism, based on another automatic algorithm (k-means). The last effective cooperation is a visualization algorithm used to explain the results of the SVM algorithm. This visualization tool is also used to view the successive planes computed by the incremental SVM algorithm.


Interactive decision tree construction

This paper is organized as follows. In section 2 we briefly describe some existing interactive decision tree algorithms and then present our new interactive algorithms: the first one is an interactive decision tree algorithm called CIAD+ (Interactive Decision Tree Construction) using a support vector machine (SVM), and the second performs unsupervised classification (clustering). In section 3 we present a graphical tool used to explain the results of support vector machine algorithms. These algorithms are known to be efficient but they are used as "black boxes": there is no explanation of their results. Our visualization tool graphically explains the results of the SVM algorithm. The implemented SVM algorithm can modify an existing linear classifier by both retiring old data and adding new data; we visualize the successive separating planes computed by this algorithm.

Some new user-centered manual (i.e. interactive, non-automatic) algorithms inducing decision trees have appeared recently: Perception Based Classification (PBC) [7], Decision Tree Visualization (DTViz) [8], [9] or CIAD [10]. All of them try to involve the user more intensively in the data-mining process. They are intended to be used by a domain expert, not the usual statistician or data-analysis expert. This new kind of approach has the following advantages:
- the quality of the results is improved by the use of human pattern recognition capabilities,
- using the domain knowledge during the whole process (and not only for the interpretation of the results) allows a guided search for patterns,
- the confidence in the results is improved: the KDD process is not just a "black box" giving more or less comprehensible results.

The technical parts of these algorithms are somewhat different. PBC and DTViz build a univariate decision tree by choosing split points on numeric attributes in an interactive visualization. They use a bar visualization of the data: within a bar, the attribute values are sorted and mapped to pixels in a line-by-line fashion according to their order. Each attribute is visualized in an independent bar (cf. fig. 1). The first step is to sort the pairs (attr_i, class) according to the attribute values, and then to map them to lines colored according to the class values. When the number of items in the data set is too large, each pair (attr_i, class) is represented with a pixel instead of a line. Once all the bars have been created, the interactive algorithm can start. The classification algorithm performs univariate splits and allows binary splits as well as n-ary splits. CIAD is described in the next section and, in section 2.2, we present a new version of CIAD (called CIAD+) with a help tool added to the interactive algorithm, allowing the user to perform an automatic computation of the best bivariate split. [9] uses a two-dimensional polygon or, more precisely, an open-sided polygon (i.e. a polyline) in a two-dimensional matrix. It is interactively drawn in the matrix. The display is made of one 2D matrix and one-dimensional bar graphs (like in PBC).
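The bar visualization described above can be sketched as follows; the synthetic data, figure layout and colour map are illustrative choices, not taken from PBC or DTViz.

```python
# Rough sketch of the bar visualization described above: for each attribute, the
# (attribute value, class) pairs are sorted by value and the class labels are
# laid out as coloured pixels, one bar per attribute. Data here is synthetic.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
X = rng.random((300, 4))                      # 300 items, 4 attributes
y = (X[:, 0] + X[:, 2] > 1).astype(int)       # 2 classes

fig, axes = plt.subplots(1, X.shape[1], figsize=(8, 3))
for j, ax in enumerate(axes):
    order = np.argsort(X[:, j])               # sort pairs (attr_j, class) by value
    bar = y[order].reshape(-1, 1)             # one pixel row per item
    ax.imshow(bar, aspect="auto", cmap="coolwarm")
    ax.set_title(f"Attr.{j + 1}")
    ax.set_xticks([]); ax.set_yticks([])
plt.show()
```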

Only PBC provides the user with an automatic algorithm to help him choose the best split in a given tree node. The other algorithms can only be run in a 100% manual, interactive way.


CIAD

CIAD builds a bivariate decision tree using line drawing in a set of two-dimensional matrices (like scatter plot matrices [11]). The first step of the algorithm is the creation of a set of (n-1)^2/2 two-dimensional matrices (n being the number of attributes). These matrices are the two-dimensional projections of all possible pairs of attributes; the color of each point corresponds to its class value. This is a very effective way to graphically discover relationships between two quantitative attributes. One particular matrix can be selected and displayed in a larger size at the bottom right of the view (as shown in Figure 2 using the Segment data set from the UCI repository [12], which is made of 19 continuous attributes, 7 classes and 2310 instances). The user can then start the interactive decision tree construction by drawing a line in the selected matrix, thus performing a binary, univariate or bivariate split in the current node of the tree. The strategy used to find the best split is the following: we try to find a split giving the largest pure partition; the splitting line (parallel to an axis or oblique) is interactively drawn on the screen with the mouse. The pure partition is then removed from all the projections. If a single split is not enough to obtain a pure partition, each half-space created by the first split is treated alternately in a recursive way (the alternate half-space is hidden during the current one's treatment).

At each step of the classification, additional information can be provided to the user, such as the size of the resulting nodes, the quality of the split (purity of the resulting partition) or the overall purity. Some other interactions are available to help the user: it is possible to hide, show or highlight one class, one element or a group of elements.
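The split-quality feedback mentioned above can be sketched as follows: for a line drawn in one 2-d projection, report the size and purity of the two resulting nodes. The line parameters, data and function name are illustrative assumptions.

```python
# Small sketch of the node size / purity feedback described above. The line is
# given by a normal vector w and offset b chosen purely for illustration.
import numpy as np

def split_info(xy, labels, w, b):
    """xy: (m, 2) projected points; labels: class per point; split: w.x + b >= 0."""
    side = xy @ w + b >= 0
    out = {}
    for name, mask in (("left", ~side), ("right", side)):
        cls, counts = np.unique(labels[mask], return_counts=True)
        purity = counts.max() / counts.sum() if counts.size else 1.0
        out[name] = {"size": int(mask.sum()), "purity": float(purity)}
    return out

xy = np.random.rand(100, 2)
labels = (xy[:, 0] > xy[:, 1]).astype(int)
print(split_info(xy, labels, w=np.array([1.0, -1.0]), b=0.0))
```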

Fig. 2. The Segment data set displayed with CIAD

CIAD+

The first version of the CIAD algorithm was purely interactive. No help was available for the user, and it was sometimes difficult to find the best pure partition in the set of two-dimensional matrices. We have decided to provide such help. Our first intention was to use a modified OC1 (Oblique Classifier 1) algorithm [13]: OC1 performs real oblique cuts (a real n-dimensional hyperplane for n-dimensional data), whereas in CIAD the cuts are only "oblique" in two dimensions, the plane coefficients being null in all the other dimensions. We have made another choice: we use a support vector machine (SVM). This algorithm is equivalent to OC1 in its simplest use, and will allow us to benefit from all its other possibilities for further developments.

The SVM algorithms

SVM algorithms are kernel-based classification methods. They can be seen as a geometric problem: find the best separating plane for a two-class data set. Many methods can be used to find this best plane, and many algorithms have been published; a review of the different algorithms can be found in [14]. They are used in a wide range of real-world applications such as text categorization, hand-written character recognition, image classification and bioinformatics. We briefly describe here the basis of the algorithm from the geometrical point of view. The aim of the SVM algorithm is to find the best separating plane between the n-dimensional elements of two classes. There are two different cases according to the nature of the data: the classes are linearly separable or they are not.

In the first case, the data are linearly separable, i.e. there exists a plane that correctly classifies all the points in the two sets. But there are infinitely many separating planes, as shown in Fig. 3. Geometrically, the best plane is defined as being furthest from both classes (i.e. small perturbations of any point would not introduce misclassification errors). The problem is to construct such a plane. It has been shown in [15] that this problem is equivalent to finding the convex hull (i.e. the smallest convex set containing the points) of each class, then finding the nearest two points (one from each convex hull); the best plane bisects these closest points.

Fig. 4. The best separating plane bisects the closest points

In the second case, the data are not linearly separable (i.e. the intersection of the two convex hulls is not empty). There is no clear definition of the "best" plane. The solution is to allow misclassification errors and to try to minimize them.

SVM in CIAD+

We use two different SVM algorithms in CIAD+. The first one is the geometric version described in section 2.2.1 for the linearly separable case. The convex hulls are computed in two dimensions with the quick hull algorithm [16], then the two closest points of the convex hulls are computed with the rotating calipers algorithm [17].
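The paper obtains this plane geometrically (quickhull for the convex hulls, then the rotating calipers for the closest pair of points). As an illustrative stand-in rather than that geometric route, a hard-margin linear SVM from a standard library computes the same maximum-margin line for linearly separable 2-d data; the toy data and the large-C trick are assumptions made for this sketch.

```python
# Stand-in sketch: a hard-margin linear SVM yields the maximum-margin line
# ("furthest from both classes") for linearly separable 2-d toy data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
A = rng.normal([0, 0], 0.3, (30, 2))          # class A
B = rng.normal([2, 2], 0.3, (30, 2))          # class B
X = np.vstack([A, B])
y = np.array([0] * 30 + [1] * 30)

svm = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates a hard margin
w, gamma = svm.coef_[0], -svm.intercept_[0]
print("separating line: %.2f*x1 + %.2f*x2 = %.2f" % (w[0], w[1], gamma))
```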

For the linearly inseparable case, many solutions have been developed and compared; we have chosen one of these algorithms, the RLP (Robust Linear Programming) algorithm [18], because it is the best one when the data are not linearly separable.

The RLP algorithm computes the separating plane wx = γ that minimizes the average violations

(1/|A|) Σ_{x∈A} max(0, γ + 1 − wx) + (1/|B|) Σ_{x∈B} max(0, wx − γ + 1)    (1)

of points of A lying on the wrong side of the plane wx = γ+1, and of points of B lying on the wrong side of the plane wx = γ−1, as shown in Fig. 5.
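A hedged sketch of this objective as a linear program is given below, using slack variables for the violations and solved with SciPy; the toy data and the 2-d restriction (as used in CIAD+) are illustrative, and the formulation is written directly from the description above rather than taken from [18].

```python
# Sketch: the average-violation objective (1) as a linear program with slacks.
import numpy as np
from scipy.optimize import linprog

def rlp_plane(A, B):
    m, p = len(A), len(B)
    # variables: w (2), gamma (1), y (m slacks for A), z (p slacks for B)
    c = np.concatenate([[0, 0, 0], np.full(m, 1 / m), np.full(p, 1 / p)])
    # A points: a.w >= gamma + 1 - y   ->   -a.w + gamma - y <= -1
    rows_A = np.hstack([-A, np.ones((m, 1)), -np.eye(m), np.zeros((m, p))])
    # B points: b.w <= gamma - 1 + z   ->    b.w - gamma - z <= -1
    rows_B = np.hstack([B, -np.ones((p, 1)), np.zeros((p, m)), -np.eye(p)])
    res = linprog(c,
                  A_ub=np.vstack([rows_A, rows_B]),
                  b_ub=-np.ones(m + p),
                  bounds=[(None, None)] * 3 + [(0, None)] * (m + p))
    w, gamma = res.x[:2], res.x[2]
    return w, gamma

A = np.random.normal([0, 0], 0.8, (40, 2))
B = np.random.normal([2, 2], 0.8, (40, 2))    # overlapping classes
print(rlp_plane(A, B))
```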

This algorithm computes an n-dimensional hyperplane; we have modified it to compute the best two-dimensional plane. SVMs in CIAD+ are used to help the user. The first kind of help applies when the user interactively draws the separating line on the screen with a pure partition on one side: the optional help optimizes the line to reach the best position (furthest from both groups) by computing the closest points of the convex hulls. The second kind of help covers the same case except that there is no pure partition; the best line position is computed with the RLP algorithm in the two dimensions corresponding to the selected matrix. The last kind of help is used when the user cannot find a separating plane: the help algorithm computes the best separating plane among all those corresponding to the projections along pairs of attributes, so we compute all the separating lines in the 2D projections and keep the best one.

This help mechanism can be turned on and off by the user. It slightly improves the accuracy of the results on the training sets (this result may be more significant depending on the kind of user), improves it more on the test set, considerably reduces the time needed to perform the classification, and increases the ease of use. An example is shown in Figure 6: the left part is the original line drawn interactively by the user on the screen, and the right part shows the transformed line (the best separating plane computed with the convex hulls).

Figure 6

Data for vertical travelling

Clustering

The interactive algorithm described in the previous section can also be used for unsupervised classification. The convex hulls and the nearest points can be computed with or without the class information. The proposed algorithm can therefore perform either usual decision tree construction (supervised classification) or clustering (unsupervised classification). This kind of approach allows the user to cluster the dataset easily using his pattern recognition capabilities, avoiding the usual complexity of the other algorithms. The same kind of help as for decision tree construction is provided to the user: the separating line drawn in a 2D projection can be optimized to be furthest from the two clusters.

There is one difference with the decision tree construction algorithm: when the user does not clearly perceive a separating line, this line can be computed automatically (with a modified SVM algorithm). This SVM algorithm cannot be used without the class information, so we have chosen a k-means algorithm. All possible partitions into two clusters are searched for in each matrix and the best one is kept. We then compute the convex hulls and nearest points to find the best separating line, in the same way as for decision tree construction.
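A sketch of this help mechanism is given below: a 2-cluster k-means is run in every 2-d projection and the best projection is kept. The selection criterion used here (silhouette score) is an assumption for illustration, since the paper does not state how the "best" partition is chosen.

```python
# Sketch: 2-cluster k-means in every 2-d projection, keeping the best one.
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_projection_split(X):
    best = None
    for i, j in combinations(range(X.shape[1]), 2):
        proj = X[:, [i, j]]
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(proj)
        score = silhouette_score(proj, labels)   # assumed selection criterion
        if best is None or score > best[0]:
            best = (score, (i, j), labels)
    return best   # (score, attribute pair, cluster labels)

X = np.random.rand(200, 5)
score, pair, labels = best_projection_split(X)
print(pair, round(score, 3))
```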

As shown in [19], this kind of algorithm may not be very efficient for high-dimensional data sets, because axis-parallel projections may lead to an important loss of information. This restriction exists anyway because of the graphical representation we use (we cannot display a very large number of scatter plot matrices). On the other hand, the results are more comprehensible because we use only one attribute on each axis and not a linear combination of a varying number of attributes.

Some results of interactive algorithms

The characteristics of the different algorithms are summarized in Table 1. Some results of interactive algorithms compared to automatic ones have been presented by their authors. To summarize them, we can say that their results concerning efficiency are generally at least as good as those of automatic decision tree algorithms such as CART (Classification And Regression Trees) [20], C4.5 [21], OC1 [13], SLIQ (Supervised Learning In Quest) [22] or LTree (Linear Tree) [23]. The main difference between the two kinds of algorithms is the tree size: most of the time, interactive algorithms produce smaller trees. This tree size reduction can significantly increase the comprehensibility of the results. Furthermore, as the user is involved in the tree construction, his confidence in the model is increased too. This may be a little less significant for the LTree algorithm because of the open-sided polygon used in the classification: the polygons are easy to understand during the visual construction step of the tree, but without this information the resulting equations may not be so easy to understand. If we compare the interactive algorithms, PBC and DTViz are particularly interesting for large data sets (because of the pixelisation technique used), but their pixelisation technique introduces some bias in the data: for example, two classes very far from each other have the same representation as the same two classes very near to each other, so we lose the distance information in this kind of representation. Ware's algorithm and CIAD+ will produce smaller trees because they can use bivariate splits, but their kind of visualization tool (scatter plot matrices) is not at all suitable for the display of two qualitative attributes (many points have the same projection). Only two algorithms, PBC and CIAD+, provide the user with help; this is a significant advantage, because during decision tree construction there is often at least one particular step where the best split is difficult to detect visually.

The kind of cooperation between automatic and manual algorithms in PBC and CIAD+ shows the interest of mixing the human pattern recognition facilities and the computer processing power. The human pattern recognition facilities reduce the cost of the computation of the best separating plane, and the computer processing power can be used at low cost (for a single step, instead of the whole process) when the human pattern recognition fails.

Visualization of SVM results

Another kind of cooperation is between automatic algorithms and the visualization tools used to show their results. As described in the previous section, SVMs are widely used today because they give high-quality results, but they are used as a "black box": there is no explanation of these results.

One paper [24] addresses SVM results visualization: the authors use a projection-based tour method [25] to visualize the results. They use a visualization of the distribution of the predicted class of the data (by way of histograms), a visualization of the data and the support vectors in 2D projections, and they examine the weights of the plane coordinates to find the most important attributes for the classification.

The authors recognize that their approach is very "labor intensive for the analyst. It cannot be automated because it relies heavily on the analyst's visual skills and patience for watching rotations and manually adjusting projection coefficients."

Visualization of the SVM separating plane

Our approach is to visualize all the intersections of the 2D planes (of the scatter plot matrices) with the separating plane computed by the SVM algorithm. We have chosen to use the incremental SVM algorithm from [26]. This algorithm gives the coefficient values of the separating hyperplane and the accuracy of the classifier. We visualize the intersection of this hyperplane with the 2D scatter plot matrices, i.e. a line in each matrix (as shown in Figure 7 with the diabetes data set from the UCI repository).
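The trace of the n-dimensional plane w·x = γ in the (i, j) matrix can be sketched as follows. Fixing the remaining coordinates at their means is an assumption made here for illustration; the paper does not specify how its intersection is taken.

```python
# Sketch: the line drawn in the (i, j) scatter plot for a plane w.x = gamma,
# with the other coordinates fixed at their means (an assumption).
import numpy as np

def plane_trace(w, gamma, X, i, j):
    """Return (slope, intercept) of the line x_j = slope * x_i + intercept."""
    others = [k for k in range(len(w)) if k not in (i, j)]
    rest = gamma - np.dot(w[others], X[:, others].mean(axis=0))
    if w[j] == 0:
        return None                    # vertical line x_i = rest / w[i]
    return -w[i] / w[j], rest / w[j]

w = np.array([0.8, -0.5, 0.3, 0.1])
X = np.random.rand(500, 4)
print(plane_trace(w, gamma=0.4, X=X, i=0, j=1))
```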

Figure 7


As we can see in Figure 7, the resulting lines do not necessarily separate the two classes: the hyperplane does separate the data (the accuracy of the incremental SVM is 77.8% on this dataset), but its "2D projections" do not. It is only an approximate interpretation of the results.

Any 2D representation of an n-dimensional feature will lose part of the information, like the lines we obtain here or the support vectors displayed in [24]. This kind of representation seems more comprehensible than the support vectors, but it can only be used with a linear kernel function.

For large data sets, it is possible to display only the separating plane and not the data. The SVM algorithm used is able to classify 1 billion points, and such a quantity of points cannot be displayed in a reasonable time (furthermore, this kind of representation is not at all suitable for such a data set size).


For other kinds of kernel functions (not linear), this method cannot be used.

Visualization of the evolution of SVM separating planes

A very interesting feature of the incremental support vector machine we have used is its capability to modify an existing linear classifier by both withdrawing old data and adding new data. We use our visualization tool to show the successive modifications of the separating plane projections. The first plane is calculated (and projected on the 2D matrices) and then blocks of data are successively added and withdrawn. The modification of the plane is computed and the corresponding projections are displayed in the matrices.

This kind of visualization tool is very powerful for examining the variations of the separating plane as the data evolve. Even if the projections used lose some amount of information, we can see which attributes are involved in the modification of the separating plane. The evolution of the n-dimensional plane is very difficult to show in any other way. The authors of the paper describing the algorithm measure the difference between planes by calculating the angle between their normals; these angle values are then displayed as circle radii.

Conclusion and future work

Before concluding, a few words about the implementation. All these tools have been developed using C/C++ and three open source libraries: OpenGL, Open-Motif and Open-Inventor. OpenGL is used to easily manipulate 3D objects, Open-Motif for the graphical user interface (menus, dialogs, buttons, etc.) and Open-Inventor to manage the 3D scene. These tools are included in a 3D environment, described in [27], where each tool can be linked to other tools and be added or removed as needed. Figure 8 shows an example with a set of 2D scatter plot matrices, a 3D matrix and parallel coordinates. The element selected in the 3D matrix appears selected too in the set of scatter plot matrices and in the parallel coordinates. The software can be run on any platform using X-Window; it only needs to be compiled with a standard C++ compiler. Currently, the software is developed on SGI O2 workstations and PCs running Linux.

Figure 8

Analyzing systolic blood pressures.

In this paper we have presented two new interactive classification tools and a visualization tool to explain the results of an automatic SVM algorithm. The classification tools are intended to involve the user in the whole classification task in order to:

- take into account the domain knowledge,
- improve the comprehensibility of the results, and the confidence in the results (because the user has taken part in the model construction),
- exploit human capabilities in graphical analysis and pattern recognition.

The visualization tool was created to help the user understand the results of an automatic SVM algorithm. These SVM algorithms are used more and more frequently and give efficient results in various applications, but they are used as "black boxes". Our tool gives an approximate but informative graphical interpretation of these results.

A forthcoming improvement will be another kind of cooperation between SVM and visualization tools: an interactive visualization tool will be used to improve the SVM results when we have to classify more than two classes.

Defining Like-minded Agents with the Aid of Visualization

Can we use this notion of spaces (or surfaces) for software agents to use when meeting or seeking other agents? In this case the visualization concept is being used to assist agent-orientated computation. Can this help in the pursuit of the use of visualization techniques for reasoning (diagrammatic reasoning [6])?

Our earlier work looked at proximity data and multivariate data in the agent domain and indicated possible metric choices for visualization [11,13]. From the agent point of view, the question is how can we apply this and how can we assist in the problem of metric choice for profiling and classification in the agent domain. How can we meaningfully identify like-minded agents and then put this to use?

The agent paradigm is considered by some to be a valuable way of looking at problems and therefore of general application. Our work in the agent and visualization fields seeks to use visualization to serve the agent community, but, from the agent-orientated computation point of view, suggests other uses of agent ideas within visualization.

The paper first briefly surveys visualization possibilities for different types of data matrices and presents definitions of profiles, considering also the desirability of comparing profiles without revealing them. It then looks at metric choice and suggests strategies to improve that choice using visualization techniques and the design of a classification system for evaluating the metrics. The idea of using the visualization position coordinates as a profile is introduced and an example given. The nature of the examples in this paper is illustrative, and the intention is to show the merging process, where visualization is not necessarily the end product, as well as to present the two specific applications.

Defining Profiles

In general an agent's profile is considered to be a vector of interests and behaviours (a feature list), or a similarity measure, or sets and/or combinations of these [11,13]. The purpose of our work in visualization was to find layouts (in 2D or 3D) which would satisfy (usually approximately) these data, either by using mathematical transformations (effective reductions via, e.g., Principal Component Analysis (PCA), or distance metrics followed by Principal Coordinates Analysis (PCoA), spring embedding [3] or the Self-organizing Map (SOM) [16]) or novel representations (e.g. colour maps, hierarchical axes, 'Daisy', parallel coordinates [1,2,15]).

PCA, SOM and PCoA are described briefly here as they are used in the discussion that follows:

- Principal Components Analysis: PCA is a means by which a multivariate data table is transformed into a table of factor values, the factors being ordered by importance. The two or three most important factors, the principal components, can then be displayed in 2D or 3D space.
- Self-Organizing Map: the SOM algorithm [16] is an unsupervised neural net that can be used as a method of dimension reduction for visualization. It automatically organizes entities onto a two-dimensional grid so that related entities appear close to each other.

- Principal Coordinates Analysis: PCoA is used for proximity data, finding first a multivariate matrix which satisfies the distances, then transforming this into its principal components so that the two or three most important factors can be displayed in a similar fashion to PCA.
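A minimal sketch of PCoA (classical scaling) as described in the last item is given below: the squared distance matrix is double-centred and the leading eigenvectors give the 2-d or 3-d layout. The random proximity data is purely illustrative.

```python
# Minimal sketch of Principal Coordinates Analysis on a distance matrix.
import numpy as np

def pcoa(D, dims=2):
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n         # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                 # double-centred squared distances
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:dims]       # largest eigenvalues first
    coords = vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))
    explained = vals[order].sum() / vals[vals > 0].sum()
    return coords, explained                    # layout and variance accounted for

pts = np.random.rand(7, 7)                      # e.g. 7 agents, 7 topics
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
coords, explained = pcoa(D)
print(coords.shape, round(float(explained), 2))
```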

These visual representations may provide meaningful clusters or reveal patterns from which knowledge can be gained. A key problem in this area is that different methods produce different clusters (and cluster shapes). The determination of an appropriate metric is a difficult problem for which general solutions are not evident. We propose the use of constructed data in a process called signature exploration [8] to assist in this area. This process uses specially constructed data sets to increase the user's understanding of the behaviour of visualization algorithms applied to high-dimensional data.

Two developments suggest themselves from aspects of our previous visualization work: a tool for metric choice; the use of layout coordinates as a profile.

- Tool for metric choice: agents need to compare profiles, i.e. when they meet they need to be able to compare themselves (or their tasks) and get a measure of similarity which they can interpret. Assuming (for the moment) that they are carrying their profile with them, they will need to apply an algorithm to calculate a similarity measure by both submitting their profiles to the algorithm, either independently of each other or via an intermediary. In designing a specific application, a decision needs to be made (by the designer) about what similarity measure is appropriate. The tool for metric choice, developed in application of the principle of signature exploration, provides an interactive interface which can help the designer to choose the metric.
- Use of the layout coordinates as a profile: for layout on the screen, the data transformation or set of similarity measures results in x/y (or x/y/z) coordinates for each entity. For complex data this usually involves a significant error (it is normally not possible to find a layout which will satisfy the similarity measurements, on the one hand, and matrix transformations and truncations to 2 or 3 attributes rely on a sharp fall-off of the relevant eigenvalues, which is unusual for complex data sets, on the other hand). Nevertheless such algorithms are commonly used, and thus the approximations involved are often adequate. The relevant point here is that the end result is an x/y (or x/y/z) coordinate pair associated with each entity, which locates its interest position within the current space of possibilities. This suggests the possibility of agents carrying a much more lightweight position profile with them, which also means they can compare positions without revealing profiles. Using x/y or x/y/z coordinates as the profile avoids revealing the profile, but the implication is that there must be a central entity which does the calculation (and to which one therefore needs to reveal one's profile) and then gives the agent its coordinates and the bounds of the space (so that it can judge relative similarity). This also does not deal easily with dynamic situations (i.e. reflecting changing profiles), as it would require a periodic return to base for profile updating. A possible alternative is to calculate one's own coordinates with respect to a number of reference points, i.e. calculate one's proximity to the reference points and then find a position in space that satisfies this reduced set of distances. For instance, for a feature list of length 5, consisting of a set of five possible agent interest areas and interest values in the range 0 to 1 (say), the following is an indication of the bounds of the space.

        A  B  C  D  E
agent1  1  0  0  0  0
agent2  0  1  0  0  0
agent3  0  0  1  0  0
agent4  0  0  0  1  0
agent5  0  0  0  0  1

It may be unwise to base the position on a computation that satisfies the similarity measures to all of these vectors (since this increases the inaccuracy of the layout), but the agent could carry the set of coordinates for certain bounds (or other reference vectors) and its profile position, having the calculations made back at base. These ideas are illustrated below.
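A small sketch of the reference-point idea just described follows: the agent knows its distances to a few reference profiles (the bounds in the table) and recovers an x/y position from them by least squares. The reference layout coordinates, the agent profile and the use of a least-squares fit are illustrative assumptions.

```python
# Sketch: recover a lightweight x/y "position profile" from distances to a few
# reference profiles. Reference positions are assumed to come from a base layout.
import numpy as np
from scipy.optimize import least_squares

refs_xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0],
                    [1.5, 1.0], [0.0, 1.5]])          # layout of 5 reference vectors
agent_profile = np.array([0.9, 0.1, 0.4, 0.0, 0.2])   # interests in areas A..E
ref_profiles = np.eye(5)                              # the bounds from the table
d = np.linalg.norm(ref_profiles - agent_profile, axis=1)   # distances to references

# find an x/y position whose distances to the reference positions best match d
res = least_squares(lambda p: np.linalg.norm(refs_xy - p, axis=1) - d,
                    x0=refs_xy.mean(axis=0))
print(res.x)            # position profile the agent could carry
```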

Choosing a Metric - Possibilities

Different metrics are used for specific applications; this means that often an application area uses only one metric. Measures may be chosen because of time complexity issues rather than because they provide the most accurate or appropriate measure. There is also a link between the creation of the feature list and the metric choice (i.e. the formulation of the feature list affects which metric provides the most appropriate clustering), which is a further complication. In general terms the choice of metric and the creation of the feature list should correspond to the required classification, but in many situations the starting point is an unknown set of data and clustering indications are sought. There is no training set and no classification. It is likely that different classifications exist. In fact there are hidden classifications; that is to say, the user has a set of things they are interested in and they would like to have the entities (other users, documents, ...) classified according to these groupings. One of the purposes of the signature exploration process being developed is to explore the mapping to clusters (via various metrics) of features of interest to the user. Originating as part of work to increase comprehension and choice of algorithms for the visualization of complex data, it does not focus on feature list construction but on metric choice for a given feature set. In the process of constructing data sets for evaluation of the different options, the user creates an ad hoc classification system for assessment purposes (demonstrated below).

How to choose - metrics, feature selection and weighting

The first issue is to specify the variables to be used in describing the profile and the ways in which pairwise similarities can be derived from the matrix formed by the set of profiles.

Mixed variables. For the profiling of objects the variables will sometimes be of different types: for example, a person can be described in terms of their gender (binary variable), their age (quantitative variable), their amount of interest in a subject (ordinal variable, if a sectioned quantitative variable is used) and their personality classification (nominal variable). A general measure is the general similarity coefficient, where s_ijk (and correspondingly d_ijk) denotes the contribution to the measure of similarity provided by the kth variable; the values of s_ijk and d_ijk can take definitions appropriate to the variable type.
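One common way to instantiate such a coefficient is a weighted average of per-variable similarity contributions (a Gower-style combination); the concrete per-type definitions, ranges and example profiles below are illustrative assumptions, not the paper's definitions.

```python
# Sketch of a general similarity coefficient for mixed variable types: a weighted
# average of per-variable contributions. Per-type definitions are assumptions.
def general_similarity(a, b, types, ranges):
    """a, b: two profiles; types/ranges: per-variable type and numeric range."""
    num, den = 0.0, 0.0
    for ak, bk, t, r in zip(a, b, types, ranges):
        if ak is None or bk is None:          # variable not comparable: weight 0
            continue
        if t in ("quantitative", "ordinal"):  # 1 - normalised absolute difference
            s_k = 1.0 - abs(ak - bk) / r
        else:                                 # binary or nominal: match / mismatch
            s_k = 1.0 if ak == bk else 0.0
        num += s_k                            # weight 1 for comparable variables
        den += 1.0
    return num / den

person1 = [1, 34, 3, "introvert"]
person2 = [0, 29, 4, "introvert"]
print(general_similarity(person1, person2,
                         types=["binary", "quantitative", "ordinal", "nominal"],
                         ranges=[None, 60, 4, None]))
```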

Selection (feature extraction) and standardization (normalization) Sometimes it is clear what variables should be used to describe objects. In our case, with the profiling of queries, documents, specifications and personal profiles, it is likely that variables have to be selected from many possibilities, so the process is not straightforward. The pattern recognition literature describes the appropriate specification of variables as feature extraction. It is tempting to include a large number of variables to avoid excluding anything relevant, but the addition of irrelevant variables can mask underlying structure. Whilst the choice of relevant variables is important, there is also the possibility (particularly here, given the multidimensional nature of the profiles themselves) that there is more than one relevant classification, based on different, but possibly overlapping, sets of variables.

Having determined appropriate variables, there is then the question of standardizing and/or differentially weighting them, followed by the construction of measures of similarity.

One aspect of standardization is that two variables can have very different variability across the data set; it may or may not be desirable to retain this variability. Standardization may also be with respect to the data set under consideration or with respect to a population from which the samples are drawn. In the case of quantitative variables, standardization can be carried out by dividing by the standard deviation or by the range of the values taken in the data set. The idea of standardization lies within the larger problem of the differential weighting of variables.
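For instance, a minimal sketch of the two standardization options mentioned above (division by the standard deviation or by the range), assuming quantitative variables arranged column-wise in a matrix:

    import numpy as np

    def standardize(X, method="std"):
        """Scale quantitative variables column-wise by standard deviation or range."""
        X = np.asarray(X, dtype=float)
        if method == "std":
            scale = X.std(axis=0, ddof=1)
        elif method == "range":
            scale = X.max(axis=0) - X.min(axis=0)
        else:
            raise ValueError("unknown method")
        scale = np.where(scale == 0, 1.0, scale)   # leave constant columns unchanged
        return X / scale

    profiles = [[23, 0.9, 5], [45, 0.1, 2], [31, 0.4, 9]]   # illustrative quantitative profiles
    print(standardize(profiles, method="range"))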

Choosing a Metric -Visual Exploration

To assist the process of metric choice, signature exploration proposes the use of specially constructed data sets in an exploration of the algorithm behaviours. Thus, by examining known data we gain a concrete idea of the behaviour of the various possible metrics. We have suggested a number of possible constructed data types [8]: generic (provided by the application to illustrate the behaviour of the particular algorithm); constructed (determined by the user to illustrate the behaviours in the data that are of interest to them; for evaluation purposes this represents an ad hoc classification); query (by visualization or SQL-type query, based on an unknown dataset, to examine the clustering behaviour of a metric in practice); landmark (to provide marker entities in the visualization); feedback (the means to enable the user to enter their assessed similarities and find or modify the appropriate metric). This paper limits itself to the first two, generic and constructed, to illustrate the approach.

Using generic data sets

Generic data sets are those considered to illustrate the behaviour of the visualization algorithms; simple data sets do not always give an intuitive placement after such transformations. In this examination a small matrix of 7 agents was created, each agent being given a randomly assigned level of interest (from 1 to 10) in 7 topics. Subsequently three other agents were added to illustrate (a) interests identical to agent1 but scaled, (b) agent1 with the same level of interest in each topic plus three other agents as in (a), and (c) two of the agents showing the reverse behaviours of another two. These data sets were visualized with various distance measures (using the tool SpaceExplorer [11,13,12]) and comments noted. The results illustrate the similarity in behaviour of the metrics, whilst indicating the differences obtained with the two basic types, Euclidean and Angular Separation. The measures used were Minkowski (order 3), City block, Euclidean and Angular Separation (equations 1-4). These were followed by Principal Coordinates Analysis (PCoA) to find points in 2D space that satisfied the distances. Note that the accuracy of such layouts for visualization is an issue, since it is often very low. In the case of PCoA the eigenvalues can be examined: if the sum of the first three eigenvalues (for a 3D layout) is above 70% of the sum of all the eigenvalues, the layout accounts for more than 70% of the variance in the data, which is an encouraging indicator.
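By way of illustration (a sketch, not the SpaceExplorer implementation), the distance measures and the PCoA layout step can be reproduced as follows; the Angular Separation is written here as one minus the cosine similarity so that it behaves as a dissimilarity, and the random data mirror the seven-agent, seven-topic example:

    import numpy as np

    def minkowski(a, b, p=3):
        return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

    def city_block(a, b):
        return np.sum(np.abs(a - b))

    def euclidean(a, b):
        return np.sqrt(np.sum((a - b) ** 2))

    def angular_separation(a, b):
        # one minus the cosine of the angle, so identical directions give distance 0
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def distance_matrix(X, metric):
        n = len(X)
        D = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                D[i, j] = D[j, i] = metric(X[i], X[j])
        return D

    def pcoa(D, dims=2):
        """Classical Principal Coordinates Analysis (Torgerson scaling)."""
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
        B = -0.5 * J @ (D ** 2) @ J                  # double-centred squared distances
        eigvals, eigvecs = np.linalg.eigh(B)
        order = np.argsort(eigvals)[::-1]            # largest eigenvalues first
        eigvals, eigvecs = eigvals[order], eigvecs[:, order]
        coords = eigvecs[:, :dims] * np.sqrt(np.maximum(eigvals[:dims], 0))
        explained = eigvals[:dims].sum() / eigvals[eigvals > 0].sum()
        return coords, explained

    # 7 agents with random interest levels (1..10) in 7 topics, as in the text
    rng = np.random.default_rng(0)
    agents = rng.integers(1, 11, size=(7, 7)).astype(float)

    D = distance_matrix(agents, city_block)
    coords, explained = pcoa(D, dims=2)
    print("share of variance captured by the 2D layout: %.0f%%" % (100 * explained))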

As an illustration of this process, figure 1 shows three shots of the City, Euclidean and Angular Separation layouts, with agents a, b and c having scaled versions of agent1's interest distribution.

User-constructed data sets

Here the user constructs data sets specific to their application, explicitly or implicitly creating a classification system with which to measure the performance of the metrics in clustering their feature(s) of interest. This may provide a distance matrix for comparison, or such a matrix may be obtained by an informal assessment. This could be followed by feedback analysis to obtain weightings of the feature list, but here the focus is on metric choice rather than modification.

Step 1 -decide features of interest The first thing to consider is which features in the data one is interested in. We suppose that the aspects are: overlap of interest; intensity of interest; joint disinterest; similar pattern of interest (irrespective of subject). These elements provide a classification system with which we can construct a scheme giving numerical values to the differences between a pair of agents' interests. These differences can then be used to give a comparison measure for the behaviours of the various metrics, with statistical measures indicating the closeness of the match. The metrics may not correspond to the classification, even approximately. It could be that it is useful to use the classification system as the similarity measure itself and dispense with the metrics. However, in general, we are looking for a similarity metric that is not just a simple query, but something more subtle, something that reflects the multidimensional nature of the profiling data available. This corresponds to the scope that lies between the two questions: "Are you interested in sport?" and "Are you like me?"

Also, if you are interested in sport, it may be valuable to know whether you are a specialist or a generalist and, in general terms, what level your interest is at. Thus other similarity measures act as discriminators in this situation. The final choice of overall similarity measure may consist of additions of different similarity metrics (which may include the results of specific queries) and can be arrived at in the manner of equation 5. The use of visualizations of data for pairs of agents can assist in the specification of features of interest. Simple diagrams such as bar and pie charts are helpful in designing a measure with which to make an informal assessment of the similarity between two agents. Figures 2 and 3 are illustrative of this process and identify the features interest level and interest intensity that are used in step 2.

Step 2 -create a measure for the features to generate test data sets Suppose that certain types of agent similarity are chosen for examination, e.g. overlaps of three or more interests of high intensity, or large overlaps irrespective of intensity combined with high common disinterest. Data is created for a representative member and an edge member of the desired clusters, so that representative pairs of data (or groups if required) can be created to examine the metric behaviours both visually and by comparing distance matrices. To illustrate, a data set was created to produce examples covering the range of possibilities of overlap extent and intensity with respect to a reference agent's interests (as suggested by the visual explorations of step 1). The metrics were then examined to see how they clustered the group of similarities with three or more overlapping subjects and a high intensity of overlap. One would expect the metrics to perform badly against this criterion, which is an example where a simple query would perform better (for instance, in the Yenta system, the matching between profiles is done simply on the basis of matching a single granule, which corresponds to a single interest; the metric is used in deriving the interest categories. If one simply wants to exchange information on a subject that is fine, but aspects such as level of expertise are relevant, and finding like minds needs greater subtlety). For metric discrimination, the distance comparison should be made by also evaluating the test criteria for a number of other features (such as joint disinterest and large overlaps irrespective of intensity) and combining the similarities.

Step 3 -evaluate visually and numerically The visual evaluation consists of visualizing the constructed data set and observing how well clustered the group of interest is. However, since the layout of such visualizations is an approximation (made in order to satisfy the distances), and the observations are not themselves measurements, evaluation by visualization is inexact. Numerical evaluation, on the other hand, based on measuring the differences between the estimated similarities and those arrived at by the metric under consideration, is precise, but relies on the ability of the designer to define or estimate similarities between the data entities. For the example above this was done by awarding points according to the number and intensity of topics of joint interest. Figure 4 shows the PCoA layout of the City, Euclidean and Angular Separation differences; the reference agent is circled and the agents in the group of interest (according to the criteria in step 2) are indicated. That there is little difference between City and Euclidean indicates that it would be adequate to use City where time complexity is an issue. The three outlines traced by the points in the City and Euclidean plots correspond closely to the classification system, and the group of interest is well clustered in visual terms. The Angular Separation plot does not cluster so well, misplacing three agents. The layout of the Angular Separation distances is actually a screenshot of a 3D representation, as the layout was particularly inaccurate and needed the extra dimension to improve it (the first two eigenvalues accounted for only 38% of the variance in the data and the first three for only 48%). The inaccuracy of this layout highlights the difficulty of using visualization to assess similarity.
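A possible sketch of this numerical evaluation follows; the point scheme for the ad hoc classification and the comparison statistic are illustrative assumptions (and SciPy is assumed to be available), not the scheme used in the paper:

    import numpy as np
    from scipy.stats import spearmanr

    def joint_interest_score(a, b, interest_threshold=3):
        """Points for topics both agents are interested in, weighted by intensity."""
        joint = (a >= interest_threshold) & (b >= interest_threshold)
        return joint.sum() + 0.1 * np.minimum(a, b)[joint].sum()

    def pairwise_matrix(X, fn):
        n = len(X)
        M = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                M[i, j] = M[j, i] = fn(X[i], X[j])
        return M

    rng = np.random.default_rng(1)
    agents = rng.integers(1, 11, size=(10, 7)).astype(float)

    S = pairwise_matrix(agents, joint_interest_score)                  # ad hoc similarity
    D_city = pairwise_matrix(agents, lambda a, b: np.abs(a - b).sum())  # City distance

    # High ad hoc similarity should correspond to small metric distance, so a strong
    # negative rank correlation suggests the metric matches the ad hoc classification.
    iu = np.triu_indices(len(agents), k=1)
    rho, _ = spearmanr(S[iu], D_city[iu])
    print("Spearman rank correlation:", round(rho, 2))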

The Use of Position Instead of Vector for Profile

The pictures of information spaces as maps or terrains derived from multivariate data using self-organizing maps [16] provide us with a compelling image of the profile or topic space we are exploring. The metrics discussed above generate similar conceptual spaces when visualized. Yet this is a misleading image, since the data are high dimensional and it is impossible to represent their similarities accurately in 2D or 3D space (direct mapping methods for multivariate data, such as colour maps and parallel coordinate plots, are not included in this comment). Nevertheless, as an approximation and as a representation, an overview perhaps, of a large body of entities, it is being found useful (see e.g. [20]). Suppose we assume the validity of the layout and propose that the agent carries their xy (or xyz) coordinates with them and uses these as their profile. When meeting a fellow agent, they can ask for that agent's xy coordinates and compute the Euclidean distance to calculate their similarity. This would be more efficient than carrying a potentially long profile vector and would enable them to use their profile without revealing details or requiring encryption. Two different ways of using this idea suggest themselves: calculating back at base and calculating on the fly.

By base calculation

The agents both have the calculations done at a base point and periodically return for updates. Here the error will be that of the layout itself, and the agent would be able to have details of the mean error and variance supplied with its coordinates, so that it can take this into account. Figure 5 shows the layout, after City distance and PCoA, of the seven agents with randomly generated data from above. Thus, if Agent1 meets Agent2 they can compare coordinates, ((-12.30, -5.20), (23.27, -8.44)), and calculate the Euclidean distance, which gives them the distance they are apart in this map.
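For example, a minimal sketch of this base-calculated comparison, using the coordinates quoted above:

    import math

    # Base-calculated profiles: each agent simply carries its 2D layout coordinates.
    agent1 = (-12.30, -5.20)
    agent2 = (23.27, -8.44)

    dist = math.dist(agent1, agent2)   # Euclidean distance in the layout
    print(round(dist, 2))

    # In the comparison table further on, distances are reported after normalising by
    # the layout distance between the two reference agents (agents 5 and 6), so the raw
    # value above would be divided by that reference distance.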

By calculation on the fly

Here the agent calculates its position with respect to a number of reference vectors (either dynamically or at an earlier point in time) and then compares it with another agent's position calculated similarly. Using the seven-agent random data again, the reference vectors are chosen to be agents 5, 6 and 7, illustrated in the generic data section. Three reference agents are the minimum, since only two would create two possible arrangements when agents 1 and 2 overlay their positions. Agents 1 and 2 separately calculate their City distances to the three reference vectors and subsequently lay out these distances with PCoA, as shown in figure 6.

They now have xy coordinates, but in order to compare them these must be scaled (the Euclidean distance between agents 5 and 6 is used here), centered (here agent 5 is placed at (0,0)) and finally rotated to bring agents 5, 6 and 7 into position. The coordinates of the agent's position are then in a form that can be used for comparisons. The results of the base calculation and the on-the-fly calculation of the difference between agents 1 and 2 are given in the table below (since these are normalized with respect to the distance between agents 5 and 6, a value of 1 would indicate that the two agents are the same distance apart as agents 5 and 6):

            original City dist    base dist    on-the-fly dist
distance          1.64              1.77             1.57
error             exact             8% err          -4% err
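The following sketch (an illustrative reconstruction, not the authors' code) shows one way an agent could carry out the on-the-fly calculation: it computes City distances to the three shared reference agents, lays the four points out with PCoA, and then centres, scales and rotates the result into a common frame before comparing positions. The random data and the orientation convention (reference agent 7 on the positive-y side) are assumptions:

    import numpy as np

    def city(a, b):
        return np.abs(a - b).sum()

    def pcoa_2d(D):
        n = D.shape[0]
        J = np.eye(n) - np.ones((n, n)) / n
        B = -0.5 * J @ (D ** 2) @ J
        vals, vecs = np.linalg.eigh(B)
        order = np.argsort(vals)[::-1]
        vals, vecs = vals[order], vecs[:, order]
        return vecs[:, :2] * np.sqrt(np.maximum(vals[:2], 0))

    def aligned_position(profile, refs):
        """2D position of `profile` in the frame defined by the reference profiles."""
        pts = np.vstack([profile, refs])                 # self + 3 references
        D = np.array([[city(p, q) for q in pts] for p in pts])
        C = pcoa_2d(D)
        C = C - C[1]                                     # centre: reference agent 5 at origin
        C = C / np.linalg.norm(C[2])                     # scale: |agent5 - agent6| = 1
        angle = np.arctan2(C[2, 1], C[2, 0])
        R = np.array([[np.cos(-angle), -np.sin(-angle)],
                      [np.sin(-angle),  np.cos(-angle)]])
        C = C @ R.T                                      # rotate: agent 6 onto the +x axis
        if C[3, 1] < 0:                                  # fix the reflection using agent 7
            C[:, 1] = -C[:, 1]
        return C[0]

    rng = np.random.default_rng(0)
    data = rng.integers(1, 11, size=(7, 7)).astype(float)  # 7 agents, 7 topics
    refs = data[4:7]                                        # agents 5, 6 and 7 as references

    p1 = aligned_position(data[0], refs)   # computed independently by agent 1
    p2 = aligned_position(data[1], refs)   # computed independently by agent 2
    print("on-the-fly distance (in units of the 5-6 distance):",
          round(float(np.linalg.norm(p1 - p2)), 2))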

Conclusions and Future Work

Visual data mining seeks to increase the integration of visualization with specific data mining techniques. This paper presents two applications with this in mind. Appropriate clusterings of data are sought, whilst at the same time layouts are required to present overviews. The user needs an understanding of the layout algorithm to appreciate the implications of the overview; the implication of arriving at different clusterings with different algorithms needs to be understood by those seeking valid clusterings and classifications. These two purposes concern the same process, but are subtly different in their focus. The first application described in this paper, illustrating the use of signature exploration in making the behaviour of the similarity metrics more concrete and assisting in similarity metric choice, is an example of the merging of understanding the overview and determining an appropriate metric. Visualization of pairs of data helped in the creation of an ad hoc, user-specific classification with which to assess the overview and thus also the metrics. An obvious next step is to use feedback to select and modify metrics and features, and this is another part of the signature exploration process. Continuing work lies in further developing the interface for exploration and the data construction engine, and in conducting usability tests.

Visualizations sometimes suggest the idea of a topic or similarity space: looking at a 2D or 3D scatterplot, the closeness of entities is intuitively understood as similarity. Where dimension reduction is involved, considerable approximation or abstraction is required. If this is a valid procedure (despite the considerable error sometimes incurred), and such diagrams are widely used without warnings given, then the idea of using location as a form of privacy protection (the transformation is a one-way function) must also hold on some level. The simple example of using position as a profile demonstrated in this paper, a potentially most useful mechanism, is encouraging; evaluation on many different data sets is now required to test its robustness, in terms of whether the original profile is fully protected and of the tolerance of approximation in the locations of the entities.

Evaluation of the position-as-profile concept points to one of our most pressing problems in visualization: how valid are our visualizations when dealing with complex data and involving approximation or abstraction? How can the level of approximation be indicated to the viewer? Correspondingly, how can a measure of confidence in the agent's location in the interest space be given to the agent? The investigation of the position-as-profile idea is the same investigation as that of the validity of layout. Thus, we begin to think in terms of transferring our picture as a viewer to the agent, so that the two can become one, a kind of viewer/agent entity. The agent may thus be a software agent or a human agent. The question now becomes: how can the boundaries or parameters of the validity be described to the viewer/agent? How can they be encoded visually and in software terms? We interchange the viewer with the agent and must express what the user sees (or finds useful) in a form that the software agent can work with. Via the agent paradigm we may thus be helped toward creating programs that can use graphical elements to mimic our visual thinking.

In this paper, we described the main features of IPBC (Interactive Parallel Bar Charts), a VDM system devoted to the interactive analysis of collections of time-series, and showed its application to a real clinical database of hemodialytic data.

We are currently carrying out a field evaluation of IPBC with the clinical staff of the hemodialysis center at the Hospital of Mede, PV, Italy. One of the major advantages of IPBC that is emerging is that the visualization and its interactive features are very quickly learned and remembered by clinicians; the major disadvantage is that usage of screen space becomes difficult if a clinician tries to relate more than 3 collections of time-series simultaneously (Section 4.3 dealt with the analysis of 3 collections). This early feedback received from the field evaluation is helping us in identifying new research directions. Besides facing the problem of analyzing more than 3 collections in a convenient way, we aim to face another problem (considered very relevant by clinicians), i.e. dealing with time-series at different abstraction levels, allowing for both a fine exploration of time-series (e.g., to detect specific unusual values) and their coarse exploration (to focus on more abstract derived information). In both cases, we are working on the integration of parallel bar charts with other visualizations that can provide a synthetic view of data (e.g., the medical literature proposes some computation methods to derive quality indexes of the hemodialytic session from the time-series of that session). In particular, we are experimenting with Parallel Coordinate Plots: e.g. a trajectory in a plot could connect the quality indexes (typically, 5-7 values) of a session, and this high-level perspective would be linked to the much more detailed perspective of the parallel bar chart.

Acknowledgements

This work is supported by the EPSRC and British Telecom (CASE studentship -award number 99803052).

Visual Data Mining of Clinical Databases: an Application to the Hemodialytic Treatment based on 3D Interactive Bar Charts

Luca Chittaro 1, Carlo Combi 2, Giampaolo Trapasso 1

Hemodialysis is based on an extra-corporeal circuit where metabolites (e.g., urea) are eliminated, the acid-base equilibrium is re-established, and water in excess is removed. In general, hemodialysis patients are treated 3 times a week and each session lasts about 4 hours. Hemodialysis treatment is very costly and extremely demanding, both from an organizational viewpoint [8] and from the point of view of the patient's quality of life. A medium-size hemodialysis center can manage up to 60 patients per day, i.e. more than 19000 hemodialytic sessions per year. Unfortunately, the number of patients that need hemodialysis is constantly increasing [12]. In this context, it is very important to be able to evaluate the quality of (i) each single hemodialysis session, (ii) all the sessions concerning the same patient, and (iii) sets of sessions concerning a specific hemodialyzer device or a specific day, for the early detection of problems in the quality of the hemodialytic treatment.

Modern hemodialyzers are able to acquire up to 50 different parameters from the patient (e.g., heart rate, blood pressure, weight loss due to lost liquids, …) and from the process (e.g., pressures in the extra-corporeal circuit, incoming blood flow, …), with a configurable sampling time whose lower bound is 1 sec. As an average example, considering only 25 parameters with a sampling time of 30 seconds, 12000 values (4 hours × 120 samples per hour × 25 parameters) are collected in each session, and a medium-sized center collects more than 228 million values per year (considering 19000 provided treatments).

While the daily accumulation of huge amounts of data prompts the need for suitable techniques to detect and understand relevant patterns, hemodialysis software is more concerned with acquiring and storing data than with visualizing and analyzing it. Data mining applications can thus play a crucial role in this context. More specifically, visual data mining applications are of particular interest for three main reasons.

First, clinicians' abilities in recognizing interesting patterns are used suboptimally or not used at all in the current context. Visual mining of hemodialytic data would allow clinicians to make decisions affecting different important aspects such as therapy (personalizing the individual treatment of specific patients), management (assessing and improving the quality of care delivered by the whole hemodialysis centre), and medical research (discovering relations and testing hypotheses in nephrology research).

Second, since data mining on the considered database is (at least, at initial stages) intrinsically vague for clinicians, the adoption of VDM techniques can be more promising than fully automatic techniques, because it supports clinicians in discovering structures and finding patterns by freely exploring the datasets as they see fit.

Third, the clinical context is characterized by a need for user interfaces that require minimal technical sophistication and expertise of the users, while supporting a wide variety of information-intensive tasks. A proper exploitation of visual aspects and interactive techniques can greatly increase the ease of use of the provided solutions.

In summary, a clinical VDM system has to achieve two possibly conflicting goals: (i) offering powerful data analysis capabilities, while (ii) minimizing the number of concepts and functions to be learned by clinicians. In the following, we illustrate how our system attempts to achieve these two goals.

The Proposed Approach

The system we have built, called IPBC (Interactive Parallel Bar Charts), connects to the hemodialysis clinical database, produces a visualization that replaces tens of separate screens used in traditional hemodialysis systems, and extends them with a set of interactive tools that will be described in detail in this section.

Each hemodialysis session returns a time-series for each recorded clinical parameter. In IPBC, we visually represent each time-series in a bar chart format where the X axis is associated with time and the Y axis with the value (height of a bar) of the series at that time. Then, we lay out the obtained bar charts side by side, using an additional axis to identify the single time-series, and we draw them in a 3D space, using an orthogonal view. It must be noted that the additional axis also typically has a temporal dimension, e.g. it is important to order the series by date of the hemodialysis session to analyze the evolution of a patient. An example is shown in Fig. 1, which illustrates a visualization of 50 time-series of 50 values each, resulting in a total of 2500 values (the axis on the right is the time axis for single sessions, while the axis on the left identifies the different time-series, ordered by date). Hereinafter, we refer to this representation as a parallel bar chart.
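As a rough sketch of the representation (not the IPBC implementation), a parallel bar chart of synthetic time-series can be drawn as follows, assuming matplotlib is available; the data are randomly generated:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    series = rng.normal(120, 20, size=(50, 50))     # 50 sessions x 50 samples each

    n_series, n_samples = series.shape
    t, s = np.meshgrid(np.arange(n_samples), np.arange(n_series))

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.bar3d(t.ravel(), s.ravel(), 0,               # bar positions (time, session)
             0.8, 0.8, series.ravel())              # bar footprint and height
    ax.set_xlabel("time within session")
    ax.set_ylabel("session (ordered by date)")
    ax.set_zlabel("parameter value")
    plt.show()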

The RoundToolbar widget

In designing how the different interactive functions of IPBC should be invoked by the user, we wanted to face two different problems:

• First, one well-known limitation of many 3D visualizations is the possible waste of screen space towards the corners of the screen;

• Second, the traditional menu bar approach would require long mouse movements from the visualization to the menu bar and vice versa.

To address these issues, we designed a specific round-shaped pop-up menu (see Fig. 2), called RoundToolbar (RT), that appears where the user clicks with the right mouse button. The RT can be easily positioned in the unused screen corners, thus allowing a better usage of the screen space (e.g., see Fig. 1) and a reduction of the distance between the visualization and the menu. Moreover, to further improve the selection time of functions with respect to a traditional menu, the organization of modes in the toolbar is inspired by Pie Menus [3]: in particular, the main modes are on the perimeter of the RT, and when a mode is selected, the center of the RT contains the corresponding tools (which are immediately reachable by the user, who can also quickly switch back from the tools to a different mode).

Changing Viewpoint

It is well-known that free navigation in a 3D space is difficult for the average user, because (s)he has to control 6 different degrees of freedom and can follow any possible trajectory. To make 3D navigation easier, when the Viewpoint mode is selected in the RT (as in Fig. 2), the proposed controls for viewpoint movement (Rotate, High-Low and Near-Far) cause movement along limited pre-defined trajectories which can be useful to examine the visualization: in particular, Fig. 3 shows how viewpoint movement is constrained. The remaining Vertical scale control in the Viewpoint mode is used to scale the bars on the Y axis. Vertical scaling has been included in the Viewpoint mode because it has been observed that, when users scaled the bars, they typically changed the viewpoint as the next operation.

Dynamic Queries

IPBC uses color to classify time-series values into different ranges. In particular, at the beginning of a session, the user can define units of measure and her general range of interest for the values, specifying its lowest and highest value. These will be taken as the lower and upper bounds for an IPBC dynamic query control in the RT (as shown in Fig. 4) that allows the user to interactively partition the specified range into subranges of interest. Different colors are associated with the subranges and, when the user moves the slider elements, the colors of the affected bars in the IPBC change in real time. Bars with values outside the specified general range of interest are highlighted with a single dedicated color (a small sketch of this classification is given after the list below). For example, Fig. 1 shows a partition that includes the three subranges corresponding to the colors shown by the slider in Fig. 4, and also some bars which are outside the user's predefined range. The color coding scheme can be personalized by the user with the Colors mode in the RT. The dynamic query control allows the user to:

• move the two slider elements independently (to change the relative size of adjacent subranges). For example, in Fig. 4, one has been set to 130 mmHg and the other to 180 mmHg. This can be done either by dragging the edges or (more easily) the tooltips, which indicate the precise value. Plus and minus signs in the tooltips also allow a fine tuning of the value.

• move the two slider elements together, by clicking and dragging the area between the two bounds. This can be particularly useful (especially when the other areas are associated with the same color), because it results in a "spotlight" effect on the visualization: as we move the area, our attention is immediately focused on its corresponding set of bars, highlighted in the visualization.
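A minimal sketch of the classification performed by the dynamic query control: the two slider values (here the 130 and 180 mmHg of the example above) split the range of interest into three subranges, while the bounds of the overall range are hypothetical:

    import numpy as np

    def classify(values, low, high, range_min, range_max):
        """Return a colour label for every bar value."""
        values = np.asarray(values, dtype=float)
        labels = np.empty(values.shape, dtype=object)
        labels[(values < range_min) | (values > range_max)] = "out-of-range"
        inside = (values >= range_min) & (values <= range_max)
        labels[inside & (values < low)] = "low"
        labels[inside & (values >= low) & (values <= high)] = "normal"
        labels[inside & (values > high)] = "high"
        return labels

    # e.g. systolic pressure, sliders at 130 and 180 mmHg; range bounds are assumptions
    print(classify([95, 125, 150, 185, 260], low=130, high=180,
                   range_min=60, range_max=220))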

Comparing data with (time-varying) thresholds

A frequent need in VDM is to quickly perceive how many and which values are below or above a given threshold. This can be easily done with the previously described dynamic queries when the threshold is constant. However, the required threshold is often time-varying, e.g. one can be interested in knowing how many and which values are not consistent with an increasing or decreasing trend. For this need, IPBC offers a mode based on a tide metaphor. As can be seen in Fig. 5, the Tide mode adds a semitransparent solid to the visualization: the solid metaphorically represents a mass of water that floods the bar chart, highlighting those bars which are above the level of the water. The slope of the top side of the solid can be set by moving two tooltips shown in the RT (which specify the initial and final values for the solid height), thus determining the desired linearly increasing or decreasing trend. The height of the solid can also be changed without affecting the slope by clicking and dragging the blue area in the RT. An opaque/transparent control allows the user to choose how much the solid should hide what is below the threshold. When the Tide mode is activated, all the bars in the user's range of interest are turned to a single color to allow the user to more easily perceive which bars are above or below the threshold; if multiple colors were maintained, the task would be more difficult, also because the chromatic interaction between the semitransparent surface and the parts of bars inside it adds new colors to the visualization.

The Tide mode can be also used to help compare sizes of bars by selecting a zero slope and changing the height of the solid (in this special case, Tide becomes analogous to the "water level" function of other visualization systems). Fig. 5 illustrates this latter case, while Fig. 9 shows a positive slope case.


Implementing a non-linear Tide would be relatively straightforward (only linear trends are anyway used by clinicians in the considered hemodialysis domain).
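A sketch of the comparison performed by the Tide mode, with a linearly varying threshold defined by two levels (the start and end levels and the sample series are illustrative choices):

    import numpy as np

    def tide_mask(series, start_level, end_level):
        """True for the bars that emerge above a linearly varying 'water level'."""
        series = np.asarray(series, dtype=float)
        t = np.linspace(0.0, 1.0, series.shape[-1])
        threshold = start_level + (end_level - start_level) * t   # linear trend
        return series > threshold

    session = np.array([55, 60, 68, 70, 74, 71, 80, 83, 88, 90], dtype=float)
    above = tide_mask(session, start_level=60, end_level=92)
    print(above.astype(int))   # 1 = bar above the tide, 0 = flooded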

Managing Occlusions

Like any 3D visualization, IPBC can suffer from occlusion problems. To address them, the approach offers two possible solutions.

First, by clicking on the 2D/3D label on the RT, the user can transform the parallel bar chart into a matrix format and vice versa. For example, Fig. 6 shows the same data as Fig. 1 in the matrix format. The transformation is simply obtained by automatically moving the viewpoint over the 3D visualization (and taking it back to the previous position when the user deselects the matrix format). This can solve any occlusion problem (and the dynamic query control can still be used to affect the color of the matrix cells), but the information given by the height of the bars is lost. Transitions to the matrix format and back are animated to avoid disorienting the user and to allow her to keep her attention on the part of the visualization (s)he was focusing on.

Second, by directly clicking on any time-series in the 3D visualization, only the time-series which can possibly occlude the chosen one collapse into a flat representation analogous to the matrix one, as illustrated in Fig. 7.

Pattern Matching

When the user notices an interesting sequence of values in one of the time-series, IPBC offers her the opportunity to automatically search for and highlight occurrences of a similar pattern in all the visualization (a detailed example will be described in Section 4.4).

The user selects her desired sequence of values in a time-series by simply dragging the mouse over it; then (s)he can specify how precise the search should be by indicating two tolerance values in the RT: (i) how much a single value can differ, in percentage, from the corresponding one in the given pattern, and (ii) the maximum number (possibly zero) of values in a pattern that can violate the given percentage.
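A sketch of this tolerance-based search (the series, pattern and tolerance values below are illustrative, not taken from the clinical database):

    import numpy as np

    def find_similar(series, pattern, pct_tolerance=10.0, max_violations=0):
        """Indices where a window matches the pattern within the given tolerances."""
        series, pattern = np.asarray(series, float), np.asarray(pattern, float)
        m, hits = len(pattern), []
        for start in range(len(series) - m + 1):
            window = series[start:start + m]
            # percentage difference relative to the pattern values
            pct_diff = np.abs(window - pattern) / np.maximum(np.abs(pattern), 1e-9) * 100
            if np.sum(pct_diff > pct_tolerance) <= max_violations:
                hits.append(start)
        return hits

    qb = [250, 250, 200, 205, 210, 300, 300, 200, 210, 205, 300, 300]
    pattern = [200, 205, 210, 300]          # e.g. QB rising to the prescribed value
    print(find_similar(qb, pattern, pct_tolerance=5, max_violations=0))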

Mining Multidimensional Data

If multiple variables are associated to the considered time-series, IPBC can organize the screen into multiple windows, each one displaying a parallel bar chart for one of the variables. The visualizations in all the windows are linked together, e.g. if one selects a single time-series in one of the windows (or a specific value in a time-series), that time-series (or the corresponding value) is automatically highlighted in every other window. This (as some other features of IPBC) will be shown in more detail in the next section.

Mining Hemodialytic data

In the following, we will show how IPBC can be used during real clinical tasks, to help physicians evaluate the quality of the hemodialytic treatments given to single patients, on the basis of the clinical parameters acquired during the sessions. Each hemodialysis session returns a time-series for each parameter; different time-series are displayed side by side in the parallel bar chart according to date (in this case, the axis on the left chronologically orders the sessions).

The following examples are ordered according to the complexity of the related task: in particular, the first two tasks are relatively simple and are taken from the daily activity of clinicians, while the last two tasks are more complex and are performed by clinicians only on specific occasions (in the two considered examples, they are related to a detailed evaluation of the quality of care provided by nurses).

Mining patient signs data

A first task consists in analyzing patient signs, such as the systolic and diastolic blood pressures and the heart rate; indeed, these parameters are important both for the health status of the patient and for the management of device settings during the hemodialytic session.

Let us consider, for example, the task of analyzing all the systolic pressures of a given patient: Fig. 8 shows a parallel bar chart (containing more than 5000 bars), representing the systolic pressure measurements (about 50 per session) during more than 100 hemodialytic sessions. In this figure, we can observe that the presence of out-of-scale values, usually related to measurement errors (e.g., the patient was moving; the measurement device was not properly operating), has been highlighted by specifying a proper range of interest (which highlights them in a suitable color) and hiding their height. In the specific situation represented in the figure, the presence of several out-of-scale values at the beginning of each session is due to the fact that nurses activate the measurement of the patient's blood pressure with some delay with respect to the beginning of the session.

In the figure, the user is focusing on a specific session, avoiding occlusion problems (as described in Section 3.5). At the same time, with a dynamic query, (s)he is able to distinguish low, normal, and high blood pressures. In this case, the clinician can observe that the systolic pressure in the chosen session, after a period of low values (yellow bars), was in the range of normal values (orange bars). While the values for the chosen session correspond to a normal state, it is easy to observe that several sessions in the more recent half of the collection contain several high values (in red) for the systolic blood pressure. Thus, the clinician can conclude that in those sessions the patient had some hypertension, i.e. a clinically undesired situation.

Mining blood volume data

Another task is related to observing the percentage of reduction of the blood volume during hemodialysis, mainly due to the removal of the water in excess. This reduction is sometimes slowed down to avoid situations in which the patient has too low blood pressures. In this case, VDM can benefit from the usage of the Tide mode. Fig. 9 shows an IPBC with more than 9000 bars, representing 36 hemodialytic sessions containing about 250 values each. In this case, since the percentage of reduction of the blood volume increases during a session, Tide allows the physician to distinguish those (parts of) sessions characterized by a percentage of reduction above or below the desired trend. In the figure, for example, the selected session has a first part emerging from the tide, while the last part is below. At the same time, it is possible to observe that one of the last sessions has the percentage of reduction above the tide during almost the entire session. The clinician can thus easily identify those (parts of) sessions with a satisfying reduction of the blood volume as the emerging (parts of) sessions.

Fig. 9. Visualizing the time-varying reduction of the blood volume in the Tide mode.

Mining related clinical parameters

The next task we consider is related to the analysis of three related parameters: the systolic and diastolic blood pressures (measured on the patient) and the blood flow (QB) entering the hemodialyzer. QB is initially set by the hemodialyzer, but it can be manually set (reduced) by nurses when the patient's blood pressures are considered too low by the medical staff. It is thus interesting to visually relate QB and blood pressures, to check whether suboptimal QBs are related to low pressures. Otherwise, suboptimal values of QB would be due to human errors during the manual setting of the hemodialyzer. Fig. 10 shows the coordinated visualization of three clinical parameters for the same patient: the diastolic blood pressure (small window in the upper left part), the systolic blood pressure (small window in the lower left part), and QB (right window). The user can freely organize the visualization, switching the different charts from the smaller to the larger windows (by clicking on the arrow in the upper right part of the smaller windows). In the figure, the clinician is focusing on a session where the QB was below the prescribed value during the first two hours of hemodialysis (yellow color for QB) and (s)he has selected a specific value (the system highlights that value and the corresponding values in the other windows with black arrows). It is easy to notice that the suboptimal QB was related to low blood pressures (yellow bars in the corresponding time-series in the two small windows); QB was then set to the correct value by nurses (see the black arrow in the right window) only after the blood pressures reached normal values (orange color in the corresponding charts). In this case, the physician can conclude that the suboptimal QB was correctly set by nurses because of the patient's hypotension.

Figure 10. Coordinated analysis of blood pressures and incoming blood flow.

Mining for similar patterns

Finally, let us consider a task concerning the analysis of QB. As previously mentioned, the value of QB can be manually set by nurses and it may happen that this value is below the optimal one, due to hypotensive episodes. Fig. 11 shows a visualization where the clinician noticed a change of QB from a lower value to the correct one in a session: this means that, after a period of suboptimal treatment, the proper setting had been entered. Therefore, the clinician asks IPBC to identify QB patterns similar to the one (s)he noticed, by indicating it with the mouse, and setting the tolerance parameters (see Section 3.6). Fig. 11 shows the selected pattern (see the area near the black arrow) and the similar patterns automatically found by IPBC (two are in the lower right part of the figure, one in the upper left part): these patterns are identified by a line of a suitable color, which highlights the contours of the first and last bar of the pattern and intersects the inner bars. To avoid possible occlusion problems in visually detecting the patterns, the physician can move the viewpoint or switch to the matrix representation, where each pattern can be easily observed.

Figure 11. Automatic Pattern Matching.

Sonification of time dependent data

Monique Noirhomme-Fraiture 1 , Olivier Schöller 1 , Christophe Demoulin 1 , Simeon Simoff 2

Fig. 1. Visualisation and visual data mining

Although the human visual processing system remains a powerful 'tool' that can be used in data mining, there are other perceptual channels that seem to be underused. Our capability to distinguish harmonies in audio sequences (not necessarily musical ones) is one possibility to complement the visual channel. Such an approach can be summarised as 'What You Hear Is What You See'. The idea of combining the visual and audio channels is illustrated in Fig. 3. The conversion of data into a sound signal is known as sonification. Similar to the application of visualisation techniques in Fig. 1b, sonification can be used for representing the input and/or the output of the data mining algorithms. In visual data mining, sonification should be synchronised with the visualisation technique. Further in this paper we discuss the issues connected with designing such data mining techniques and present an example of a practical implementation of a combined technique. We briefly discuss the characteristics of sound that are suitable for such an approach, the actual sonification process, the design of the overall combined technique, and the results of the experiments conducted with the proposed technique.

2 Characteristics of sound for time dependent data representation

Several researchers have investigated the use of sound as a means of data representation [4][5][6][7][8][9][10][11]. In this context, the important feature of sound is that it has a sequential nature, having a particular duration and evolving as a function of time. A sound sequence has to be heard in a given order, i.e. it is not possible to hear the end before the beginning. Similarly, a time series depends on time and has the same sequential characteristics. Consequently, sound provides a good means of representing time series.

Sonification

The easiest way to transform time dependent data into sound is to map the data to frequencies by using linear as well as chromatic scale mappings. We call this process a pitch-based mapping. We compute the minimum and maximum data values from the chosen series and map this data interval into a frequency range, chosen in advance. Each value of the series is then mapped into a frequency. To avoid too large, non-realistic intervals, we first discard outliers (see below).
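As an illustration (a sketch in Python, not the Java/MIDI prototype described later), a pitch-based mapping onto a MIDI note range could look as follows; the note range and the example values are assumptions:

    import numpy as np

    def pitch_map(series, low_note=48, high_note=84):
        """Map data values linearly onto MIDI note numbers (low_note..high_note)."""
        series = np.asarray(series, dtype=float)
        lo, hi = series.min(), series.max()
        scaled = (series - lo) / (hi - lo) if hi > lo else np.zeros_like(series)
        return np.round(low_note + scaled * (high_note - low_note)).astype(int)

    values = [10.2, 11.5, 13.0, 12.1, 15.8, 14.9]
    print(pitch_map(values))    # one MIDI note per data point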

Another pre-treatment is the smoothing of the series. In fact, if we map all the points of a series into sound, we will hear rather inconsistent sounds. A first treatment consists in smoothing the series by standard means, for example the moving-average method. After that, we map the smoothed curve into pitch. Beat drums can be used to enhance the shape of the curve (see below).
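A sketch of this smoothing pre-treatment, using a moving average with an illustrative window length:

    import numpy as np

    def moving_average(series, window=5):
        series = np.asarray(series, dtype=float)
        kernel = np.ones(window) / window
        return np.convolve(series, kernel, mode="valid")

    raw = np.sin(np.linspace(0, 6, 60)) + np.random.default_rng(0).normal(0, 0.3, 60)
    smooth = moving_average(raw, window=5)   # the smoothed curve is then mapped to pitch
    print(smooth[:5].round(2))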

Detection of outliers

To detect the outlying values statistically, a confidence interval is computed at each time t, based on the normal distribution. Once a data value is detected outside the confidence interval, the corresponding time value is stored and sonified in the experiment phase.
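A simplified sketch of this step (here a single global confidence interval at the 95% level is used, whereas the procedure described above computes the interval at each time t; the data and confidence level are illustrative):

    import numpy as np

    def outlier_times(series, z=1.96):
        """Indices of values outside mean +/- z standard deviations."""
        series = np.asarray(series, dtype=float)
        mu, sigma = series.mean(), series.std(ddof=1)
        outside = np.abs(series - mu) > z * sigma
        return np.flatnonzero(outside)       # time indices to be sonified separately

    data = [10, 11, 10, 12, 11, 35, 10, 11, 12, 10]
    print(outlier_times(data))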

Beat drums mapping

The rhythm of a beat drum increases with respect to the rate of growth of the curve (i.e. the first derivative).
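A sketch of this mapping, with a hypothetical base tempo and gain:

    import numpy as np

    def drum_rate(smooth_series, base_bpm=60, gain=40):
        """Drum tempo per segment, growing with the rate of growth of the curve."""
        growth = np.maximum(np.diff(np.asarray(smooth_series, dtype=float)), 0.0)
        peak = max(growth.max(), 1e-9)
        return base_bpm + gain * growth / peak     # beats per minute per segment

    print(drum_rate([1.0, 1.1, 1.5, 2.4, 2.5, 2.5]).round(1))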

Stereo panning

Variation of the stereo acoustics is introduced, for example, an increase of the volume of the right speaker and decrease of the volume of the left speaker.

3D Curve

When the time series is given at each time not by a single value but by a function of values, we choose to "hear" the function at each discrete time. We can also choose to cut the surface at a certain level and to hear "continuously" the obtained curve as a function of time. We call these transformations horizontal and vertical travelling, respectively. An example of a 3D data surface for sonification is shown in Fig. 4.

Prototype implementation

The prototype has been implemented in the Java programming language, using the MIDI package of the Java Sound API (Application Programming Interface) [12]. The MIDI sequence is constructed before the actual playback: when the designer starts the sonification, the whole sequence is computed and then sent to the MIDI sequencer for playback.

Experimentation

The purpose of the experimentation is to determine how the sonification of two- and three-dimensional graphs can complement or be an alternative to visually displayed graphs. An Internet Web site has been created, where sound sequences are presented and can be evaluated by visitors. The site contains a questionnaire that has to be filled in by the visitors performing the test. The structure of the questionnaire is the following:

A. Identification of the user: name, age, gender, title/position, e-mail address. These data are used to identify the subject and to validate the answer.

B. Ability: field of activity, musical experience (instrument played, practicing period), self-evaluation of musical level (from 'no experience' to 'expert level').

C. 2D evaluation: 2D evaluation is divided into three subtasks:

Part C.a: Explanation of the four sonification techniques used: pitch-based only, beat drums, stereo and extreme value detection. Each of them is briefly described and at least one example is given. This question aims to evaluate whether the subject can perceive a global trend in the series and whether the relation with the time scale is grasped. For each sequence, beat drums and stereo mapping are added to enhance the pitch-based sonification.

Question 2 aims to identify whether extreme values are detected. Question 3 aims to identify whether a seasonal trend can be detected. Question 4 is focused on trend.

Results

Below we present the results of the experiments.

The sample

23 visitors answered the questions for the case of 2D visualisation and 18 visitors for the case of 3D visualisation. A large part of the sample (9) consists of people working in the computer science area, who have limited musical experience or no experience at all. To see whether people with musical experience obtain a better score, we have compared the average score in both groups. The influence of the musical and the computer science background on the results is presented in Table 1 and Table 2, respectively. The score obtained for each question is the number of good answers, normalised. Representative questions from the questionnaire include:

3.3 Is the evolution of electricity production in Australia characterised by a seasonal trend?
Qu4 Monthly Minneapolis public drunkenness intakes between January 1966 and July 1978 (151 months)
4.1 Were there more intakes in 1966 than in 1978?
4.2 Is the evolution of public drunkenness intakes linear?


The results are summarised in Table 3.

Table 3. Summary of the results for 2D