Proceedings of the 6th WSEAS International Conference on Simulation, Modelling and Optimization, Lisbon, Portugal, September 22-24, 2006
A Method for Mining Quantitative Association Rules
MARÍA N. MORENO, SADDYS SEGRERA, VIVIAN F. LÓPEZ AND M. JOSÉ POLO
Department of Computing and Automatic
University of Salamanca
Plaza Merced s/n, 37008 Salamanca
SPAIN
mmg@usal.es http://web.usal.es/~mmg
Abstract: Association rule mining is a significant research topic in the knowledge discovery area. In the last
years a great number of algorithms have been proposed with the objective of solving diverse drawbacks
presented in the generation of association rules. One of the main problems is to obtain interesting rules from
continuous numeric attributes. In this paper, a method for mining quantitative association rules is proposed. It
deals with the problem of discretizing continuous data in order to discover a manageable number of high
confident association rules, which cover a high percentage of examples in the data set. The method was validated
by applying it to data from software project management metrics.
Key-Words: Association rules, discretization, clustering
1 Introduction
Association analysis is a useful data mining technique
exploited in multiple application domains. One of the
best known is the business field where the
discovering of purchase patterns or associations
between products that clients tend to buy together is
used for developing an effective marketing. The
attributes used in this domain are mainly categorical
data, which simplifies the procedure of mining the
rules. In the last years the application areas involving
other types of attributes have increased significantly.
Some examples of recent applications are finding
patterns in biological databases, extraction of
knowledge from software engineering metrics [14] or
obtaining user's profiles for web system
personalization [15] [16].
Associative models have been even used in
classification problems as the base of some efficient
classifiers [11] [16].
Numerous methods for association rule mining
have been proposed, however many of them discover
too many rules, which represent weak associations
and uninteresting patterns. The improvement of
association rules algorithms is the subject of many
works in the literature. Most of the research efforts
have been oriented to simplify the rule set, to
generate strong and interesting patterns as well as to
improve the algorithm performance. When attributes
used for inducing the rules take continuous values,
these three objectives can be achieved by means of an
efficient data discretization procedure such as the
proposed in this paper.
The strength of an association rule in the form “If
X then Y” is mainly quantified by the following
factors:
• Confidence or predictability. A rule has confidence
c if c% of the transactions in D that contain X also
contain Y. A rule is said to hold on a dataset D if
the confidence of the rule is greater than a userspecified threshold.
• Support or prevalence. The rule has support s in D
if s% of the transactions in D contain both X and
Y.
The interestingness issue refers to finding rules
that are interesting and useful to users [12]. It can be
assessed by means of objective measures such as
support (statistical significance) and confidence
(goodness), defined before, but subjective measures
are also needed. Liu et al. [12] suggest the following
ones:
• Unexpectednes: Rules are interesting if they are
unknown to the user or contradict the user’s
existing knowledge.
• Actionability: Rules are interesting if users can do
something with them to their advantage.
Actionable rules are either expected or unexpected,
but the last ones are the most interesting rules due to
they are unknown for the user and lead to more
valuable decisions.
Most of the approaches for finding interesting rules
in a subjective way require the user participation to
articulate his knowledge or to express what rules are
interesting for him. Unfortunately these subjective
factors cannot be easily obtained in some application
areas such as project management, especially when a
large number of quantitative attributes are involved
173
Proceedings of the 6th WSEAS International Conference on Simulation, Modelling and Optimization, Lisbon, Portugal, September 22-24, 2006
and, so, it is very difficult to acquire domain
knowledge. These applications present additional
problems such as the discretization of continuous
quantitative attributes, which can take a wide range of
values. In order to reduce the number of rules
generated it is necessary to split the range of values
into a manageable number of intervals.
In this paper a multivariate discretization based
method is proposed. The procedure was applied in the
discovery of association rules from a project
management data base, yielding a reduced number of
strong association rules which cover a large
percentage of examples.
The following section contains a shallow
description of some works in the literature concerning
the improvement of association rule algorithms.
Section 3 is dedicated to the proposed discretization
procedure for rule mining. The experimental study
and the analysis of results are presented in sections 4
and 5 respectively. Finally, we present the
conclusions.
2 Related work
The concept of association between items [1] [2] was
first introduced by Agrawal and col. Since they
proposed the popular Apriori algorithm [3], the
improvement of the algorithms for mining association
rules have been the target of numerous studies. Many
other authors have studied better ways for obtaining
association rules from transactional databases. Most
of the efforts have been oriented to simplify the rule
set and improve the algorithm performance.
The best known algorithms, such as Apriori, which
reduce the search space, proceed basically by
breadth-first traversal of the lattice, starting with the
single attributes. They perform repeated passes of the
database, on each of which a candidate set of attribute
sets is examined. First, single attributes which have
low support are discarded, after that, low frequent
combination of two attributes are eliminated and so
forth. Cohen et al. [4] proposed efficient algorithms
for finding rules that have extremely high confidence
but for which there is no or extremely weak support.
Generalization is an alternative way of reducing
the number of association rules. Instead of
specializing the relationships between antecedent and
consequent parts and restricting rules to support
values, in [10] and [9] new aggregates and other
restrictions on market basket items are considered.
Imielinski et al. [8] have proposed a generalization
method named cubegrades, were a hypercube in the
multidimensional space defined by its member
attributes is used to evaluate how changes in the
attributes of the cube affect some measures of
interest. Huang and Wu [7] have developed the
GMAR (Generalized Mining Association) algorithm
which combines several pruning techniques for
generalizing rules. The numerous candidate sets are
pruned by using minimal confidence. In [22] a new
approach for mining association rules based on the
concept of frequent closed transactions is proposed.
The topic of knowledge refinement is used in some
methods in order to obtain a reduced number of
consistent and interesting patterns. In [17] and [18]
the concept of unexpectedness is introduced in an
iterative process for refining association rules. It uses
prior domain knowledge to reconcile unexpected
patterns and to obtain stronger association rules.
Domain knowledge is fed with the experience of the
managers. This is a drawback for the use of the
method in many application domains where the rules
are numeric correlations between project attributes
and they are influenced by many factors. It is very
difficult to acquire experience in this class of
problems. We have developed a refinement method
[14] which does not need use managerial experience.
It is also based on the discovery of unexpected
patterns, but it uses the best attributes for
classification in a progressive process for rules
refinement. It is an effective procedure for
classification problems that is very suitable for
applications that manage quantitative attributes where
domain knowledge cannot be easily obtained. The
aim is to provide managers with a convenient number
of good association rules for prediction, which help
them to make right decisions about the software
project. However, in many cases an efficient
discretization of project data can be more effective
than complex methods.
Extracting all association rules from a database
requires counting all possible combination of
attributes. Support and confidence factors can be used
for obtaining interesting rules which have values for
these factors grater than a threshold value. In most of
the methods the confidence is determined once the
relevant support for the rules is computed.
Nevertheless, when the number of attributes is large,
computational time increases exponentially. For a
database of m records of n attributes, assuming binary
encoding of attributes in a record, the enumeration of
subset of attributes requires m x 2n computational
steps. For small values of n, traditional algorithms are
simple and efficient, but for large values of n the
computational analysis is unfeasible. When
continuous attributes are involved in the rules, the
discretization process is critical in order to reduce the
value of n and to obtain high confident rules at the
same time.
174
Proceedings of the 6th WSEAS International Conference on Simulation, Modelling and Optimization, Lisbon, Portugal, September 22-24, 2006
Among the great variety of existent discretization
algorithms, two simple techniques commonly used
are equal-width and equal-frequency, which consist
on creating a specified number of intervals with the
same size or with the same number of records
respectively. The purpose of the discretized data and
the statistical characteristics of the sample to be
treated should be kept in mind when an algorithm is
selected. Discretization can be univariate or
multivariate. Univariate discretization quantifies one
continuous attribute at a time while multivariate
discretization considers simultaneously multiple
attributes. Attribute discretization methods for mining
association rules have been treated in the literature.
Nearly everyone take the support factor of the rules
as the main feature for splitting the attribute values
into intervals, that is, they consider the weight of the
records in the interval in relation to the total number
of records [21]. Recently, several partition methods
based on the fuzzy set theory have been proposed [6].
The mined rules are expressed in linguistic terms,
which are more natural and understandable. In these
works either both the antecedent and consequent parts
of the rules are formed by a single item or the
consequent part is not fixed. In our case the
consequent part must be fixed because there are input
and output attributes and both consequent and
antecedent parts are itemsets. So, it is more suitable a
multivariate discretization that consider all the
attributes.
3 Discretization process
All the attributes that are used in this work to
generate association rules are continuous, that is, they
can take a wide range of values. In order to reduce
the number of rules generated it is necessary to
discretize the attributes by splitting the range of
values into a manageable number of intervals.
A clustering technique was applied for discretizing
multiple attributes simultaneously. Clusters of similar
records were built by using the iterative k-means
algorithm with a Euclidean distance metric [5]. This
distance D(p,q) between two points p and q in a space
of n dimensions is:
n
[D(p,q)]2 = || p – q ||2 = Σ
(pi - qi) 2 (1)
i=1
where pi and qi are the coordinates of the points p
and q respectively. In our case, the points are the
records to be compared, and the coordinates are the n
attributes of each record.
The iterative k-means algorithm takes as input the
minimum and maximum number of clusters (k). The
selected values in this work were 1 and 10
respectively. This clustering method groups the
records in a way that the overall dispersion within
each cluster is minimized. The procedure is the
following:
1. The value of the minimum number of clusters is
assigned to k.
2. The k cluster centers are situated in random
positions in the space of n dimensions.
3. Each record in the data is assigned to the cluster
whose center is closest to it.
4. The cluster centers are recalculated based on the
new data in each cluster.
5. If there are records which are closer to the center of
a different cluster than the cluster that they belong
to, then, these records are moved to the closer
cluster.
Steps 4 and 5 are repeated until no further
improvement can be made or the maximum number
of clusters is reached.
The distribution of attribute values in the clusters
was used for making the discretization according to
the following procedure:
1. The number of intervals for each attribute is the
same of the number of clusters. If m is the mean
value of the attribute in the cluster and σ is the
standard deviation, the initial interval boundaries
are (m - σ) and (m + σ).
2. When two adjacent intervals overlap, the cut point
(superior boundary of the first and inferior
boundary of the next) is placed in the middle point
of the overlapping region. These intervals are
merged into a unique interval if one of them
includes the mean value of the other or is very
close to it.
3. When two adjacent intervals are separated, the cut
point is placed in the middle point of the separation
region.
This procedure was applied for creating intervals
of values for every one of the attributes in order to
generate the association rules.
4 Experimental study
The data used in this study come from a dynamic
simulation environment developed by Ramos et al.
[20] [21]. This environment manages data from real
projects developed in local companies and simulates
different scenarios. It works with more than 20 input
parameters and more than 10 output variables. The
number of records generated for this work is 300 and
the variables used for the data mining study are those
related with time restrictions, quality and technician
hiring.
175
Proceedings of the 6th WSEAS International Conference on Simulation, Modelling and Optimization, Lisbon, Portugal, September 22-24, 2006
The aim of the work is to obtain an associative
model that allows studying the influence of the input
variables related to the project management policy on
the output variables related to the software product
and the software process.
The clusters were created with a weight for the
output variables three times greater than for input
attributes. This is a supervised way of producing the
most suitable clusters for the prediction of the output
variables, which appear in the consequent part of the
rules. In this study the clustering algorithm produced
three clusters.
Rules representing the impact of project
management policies on software quality,
development time and effort were generated and
visualized by using Mineset, a Silicon Graphics tool
[13]. Figure 1 is a graphical representation of the
rules on a grid landscape with left-hand side (LHS)
items on one axis, and right-hand side (RHS) items
on the other. A rule (LHS → RHS) displayed at the
junction of its LHS and RHS itemset relates the
itemset containing the input attributes with the
itemset formed for the output attributes. The display
includes bars, disk and colors whose meaning is
given in the graph. Rules generator does not report
rules in which the predictability (confidence) is lesser
than the expected predictability (frequency of
occurrence of the item RHS), that is, the result of
dividing predictability by expected predictability
(pred_div_expect) should be greater than one. Good
rules are those with high values of pred_div_expect.
We have also specified a minimum predictability
threshold of 60%.
Under the exposed conditions, eleven rules were
generated. Ttheir confidence and support factors are
presented in the table 1.
Fig. 1. Association rules
Rule
1
2
3
4
5
6
7
8
9
10
11
%Confidence
100
86.67
69.77
69.23
88.24
100
100
100
100
100
100
SUM
AVERAGE
%Support
1.14
4.94
11.41
3.42
5.70
16.35
7.60
7.60
10.27
4.18
2.66
75.27
92.18
Table 1. Support and confidence factors for the association rules
5 Analysis of results
The proposed procedure generated twelve association
rules. Table 1 shows their support and confidence
factors, which capture the statistical strength of the
patterns. In our study domain, the more confident a
rule is, the more reliable it will be when it will be
used to take project management decisions. Seven
discovered rules have the maximum confidence value
(100%) and the remaining rules have high values of
this factor, yielding an average value of 92.18%,
therefore they are good for taking decision in future
projects. On the other hand, the induced associative
model is useful if it is constituted by a manageable
number of rules and the rule set covers a large
percentage of examples (records). The coverage
measure is provided by de total support of the rules,
that is, the sum of individual supports. In our case
study the proposed method gives a model that covers
the 75% of the examples with just eleven association
rules (see table 1).
In the study carried out, a reduced number of
strong rules have been generated. In addition, the rule
induction process was very fast, due to the
association rule algorithm works with a reduced
number of intervals of values of the attributes, which
are provided by the discretization method. Then, the
obtained
associative
model,
which
relates
management policy factors with quality, time and
effort, provides managers with a useful tool for taking
decisions about current or future projects.
6 Conclusions
The paper deal with the problem of finding useful
association rules from software project management
data. The main drawbacks in this application field are
176
Proceedings of the 6th WSEAS International Conference on Simulation, Modelling and Optimization, Lisbon, Portugal, September 22-24, 2006
the treatment of continuous attributes and the
difficulty to obtain domain knowledge in order to
evaluate the interestingness of the association rules.
We have proposed an association rule mining
algorithm for building a model that relates
management policy attributes with the output
attributes quality, time and effort. The success of the
algorithm is mainly due to the supervised multivariate
procedure used for discretizing the continuous
attributes in order to generate the rules. The result is
an association model constituted by a manageable
number of high confident rules representing relevant
patterns between project attributes. This allows
estimating the influence of the combination of some
variables related to management policies on the
software quality, the project duration and the
development effort simultaneously.
In addition, the proposed method avoids three of
the main drawbacks presented by the rule mining
algorithms: production of a high number of rules,
discovery of uninteresting patterns and low
performance.
References:
[1] Agrawal, R., Imielinski, T., Swami, A. Database
Mining: A performance Perspective. IEEE Trans.
Knowledge and Data Engineering, vol. 5, 6,
1993a, pp. 914-925.
[2] Agrawal, R., Imielinski, T., Swami, A. Mining
associations between sets of items in large
databases. Proc. of ACM SIGMOD Int.
Conference on Management of Data, Washinton
D.C., 1993b, pp. 207-216.
[3] Agrawal, R., Srikant, R. Fast Algorithms for
mining association rules in large databases. Proc.
of 20th Int. Conference on Very Large Databases,
Santiago de Chile, 1994, pp. 487-489.
[4] Coenen, F., G. Goulbourne and P. Leng. Tree
Structures for Mining Association Rules. Data
Mining and Knowledge Discovery, 8, 2004, pp.
25-51.
[5] Grabmeier, J. and Rudolph, A. Techniques of
Cluster Algorithms in Data Mining , Data Mining
and Knowledge Discovery, 6, 2002, pp. 303-360.
[6] Hong, T.P., Kuo, C.S. and Chi, S.C. Mining
association rules from quantitative data,
Intelligent Data Analisys (1999) 363-376.
[7] Huang, Y.F., Wu, C.M. Mining Generalized
Association Rules Using Pruning Techniques.
Proceedings of the IEEE International
Conference on Data Mining (ICDM'02), Japan,
2002, pp. 227-234.
[8] Imielinski, T., A. Virmani and A. Abdulghani.
DataMine, Application Programming Interface
and Query Language for Database Mining.
Proceedings ACM Int’l Conference Knowledge
Discovery & Data Mining, ACM Press, 1996, pp.
256-261.
[9] Lackshmanan, L.V.S., Ng, R., Han, J. and Pang,
A. Optimization of constrained frequent set
queries with 2-variable constraints. Proc. of ACM
SIGMOD Conf., 1999, pp. 158-168.
[10] Lackshmanan, L.V.S, Ng, R., Han, J. and Pang,
A. Exploratory mining and pruning optimizations
of constrained association rules. Proc. of ACM
SIGMOD Conf., 1998, pp. 13-24.
[11] Liu, B., Hsu, W. Ma, Y. Integration
Classification and Association Rule Mining.
Proc. 4th Int. Conference on Knowledge
Discovery and Data Mining, 1998, pp. 80-86.
[12] Liu, B., Hsu, W., Chen, S., Ma, Y. Analyzing the
subjective Interestingness of Association Rules.
IEEE Intelligent Systems, september/October,
2000, pp. 47-55.
[13] Mineset user’s guide, v. 007-3214-004, 5/98.,
Silicon Graphics, 1998.
[14] Moreno, M.N., Miguel, L.A., García, F.J., Polo,
M.J., Building knowledge discovery-driven
models for decision support in project
management. Decisión Support Systems, 38,
2004, pp. 305-317.
[15] Moreno, M.N., García, F.J. and Polo, M.J. “An
Architecture for Personalized Systems Based on
Web Mining Agents”. Lectures Notes in
Computer Science, LNCS 3140, 2004, pp. 563567.
[16] Moreno, M.N., García, F.J., Polo, M.J. and
López, V. “Using Association Analysis of Web
Data in Recommender Systems”, Lectures Notes
in Computer Science, LNCS 3182, 2004, pp. 1120.
[17] Padmanabhan, B., Tuzhilin, A., Knowledge
refinement based on the discovery of unexpected
patterns in data mining. Decision Support
Systems, 27, 1999, pp. 303-318.
[18] Padmanabhan, B., Tuzhilin, A., Unexpectedness
as a measure of interestingness in knowledge
discovery. Decision Support Systems, 33, 2002,
pp. 309- 321.
[19] Ramos, I., Riquelme, J. and Aroba, J.C.
Improvements in the Decision Making in
Software Projects: Application of Data Mining
Techniques, IC-AI`2001, 2001.
[20] Ruiz, M.; Ramos, I. and Toro, M., A Simplified
Model of Software Project dynamics, The
Journal of Systems and Software, 59 (2001),
2001, pp. 299-309.
177
Proceedings of the 6th WSEAS International Conference on Simulation, Modelling and Optimization, Lisbon, Portugal, September 22-24, 2006
Srikant, R. and Agrawal, R. Mining quantitative
association rules in large relational tables. Proc.
of ACM SIGMOD Conf., 1996, pp. 1-12.
[22] Zaki, M.J. Mining Non-Redundant Association
Rules. Data Mining and Knowledge Discovery, 9,
2004, pp. 223-248.
[21]
View publication stats
178