1. Introduction
Clustering has become crucial in several areas of scientific research, and it has a significant impact on theoretical development and applied research in several scientific disciplines [1,2,3]. Despite the very large number of clustering methods, swarm intelligence algorithms have become increasingly relevant for this task [3,4,5,6]. However, given the ample diversity of fields, topics, and problems studied, a particular phenomenon may well be described by both numerical and categorical variables (mixed or hybrid data). In addition, missing values have become increasingly common in data measurement and sampling processes (missing data). Several factors can cause missing values; among the most important are the impossibility of performing some measurements, the loss of already-taken samples, and the non-existence of information about the data being described.
Data described by hybrid or mixed features (or simply mixed data) represent a challenge to most automatic learning algorithms. This is due to the generally accepted assumption that data are described by features of the same kind. Consequently, most pattern recognition algorithms are designed to tackle problems whose attributes are all of the same kind. Thus, some methods assume the existence of a metric space (e.g., the k-Means algorithm [7]), while other clustering algorithms require the instances to be described only by categorical features (e.g., Partitioning Around Medoids, PAM [8]).
Unlike mixed data, missing or incomplete data do not depend on the features describing the phenomenon of interest, but rather appear due to the lack of a value for a particular attribute of a specific instance. A particular value of a specific object may be unknown for numerous reasons. Among the most frequent are the inability to measure the desired variable; a lack of knowledge regarding the phenomenon of interest [9] (commonly occurring in the social sciences, where the data represent people under study); and the loss of a previously acquired value [10] (e.g., due to data sampling or storage equipment failure, contamination, or data loss during transmission).
It is commonly agreed upon by the pattern recognition community that the presence of missing values in a dataset poses an additional challenge for automatic learning algorithms, since such methods are predominantly designed to handle complete datasets, that is, datasets with no missing information.
On the other hand, most of the proposed clustering algorithms operate only over numerical data [2,3,5,6,11,12,13,14]. As an unfortunate consequence, several applications that are relevant to human activities, and whose data are not described solely by numeric attributes, cannot be solved effectively.
Considering the previously described scenario, methods and algorithms for clustering over mixed and incomplete data are an evident minority in the scientific literature [15]. The above considerations show the need to carry out thoughtful and systematic scientific research into the creation, design, implementation, and application of methods and algorithms to perform intelligent clustering over mixed and incomplete data. The proposal of this paper clearly aligns with this need.
In this paper, we apply swarm intelligence algorithms to the clustering of mixed and incomplete data in a unified and satisfactory approach. In addition, we perform a large number of thoughtful and systematic numerical experiments. It is well known among swarm intelligence researchers that these algorithms are highly sensitive to parameter configuration; however, despite this being a highly relevant issue, to date we cannot find systematic studies analyzing the sensitivity to parameter configuration of swarm-intelligence-based clustering. We also address this issue in this paper.
The obtained results allow us to determine experimentally an adequate parameter configuration for obtaining high-quality clusters over mixed and incomplete data. Moreover, the comparative studies performed allow us to state that the results obtained in this research are superior to others reported in the literature.
The rest of the paper is organized as follows: Section 2 reviews previous work on mixed and incomplete data clustering. Section 3 explains the proposed generic framework for swarm intelligence algorithms for clustering mixed and incomplete data. Section 4 explores a case study, and Section 5 addresses the experimental outline, presents the obtained results, and discusses them. The article ends with conclusions and future work.
2. Background
Handling patterns described by mixed data, or patterns that include missing values, represents an additional challenge for automatic learning algorithms. The problem of mixed data (also known as hybrid data) can be defined as follows:
Let $X = \{x_1, \dots, x_m\}$ be a dataset (i.e., a set of objects, instances, or samples) in a universe $U$, where each object is described by a set of features or attributes $A = \{A_1, \dots, A_n\}$; each feature $A_i$ has an associated definition domain $D_i$, which in turn may be Boolean, k-valued, integer, real, or of another kind [16].
Usually, the lack of information is denoted by the symbol “?” in many datasets available in public repositories, such as the UCI Machine Learning Repository [17] or the KEEL Project Repository [18,19]. Following this criterion, the value “?” is included in the definition domain of the dataset features; thus, the description of an incomplete object has the value “?” in place of the missing value. On the other hand, it is entirely possible for a dataset to be described by hybrid features while also presenting missing values for some patterns. In this case, the challenge faced by classification algorithms is even greater, since they need to handle both problems.
The solutions developed to manage such situations can be grouped into two general strategies: modifying the dataset before presenting it to the classification method, or giving a particular algorithm the mechanisms to work with hybrid data, missing values, or both. The current section discusses some of the most well-known solutions to one or both problems.
2.1. Data Level Solution for Clustering Mixed and Incomplete Data
Many automatic learning algorithms are designed to work only with numerical data, only with categorical data, or only with complete data (i.e., no missing values). Thus, data-level solutions to these problems by means of pre-processing techniques [20] focus primarily on one or more of the following tasks: coding categorical data into a numerical representation, discretizing numerical data into a categorical representation, or resolving missing values by eliminating or completing the corresponding patterns.
Despite the numerous advances in this regard, every data-level solution has an unavoidable impact on the dataset, since it inevitably transforms the data and may even alter the intrinsic semantic relationships associated with the original representation. This, in turn, has given rise to a spreading opinion among researchers and practitioners that a more adequate solution to the problems of mixed data and missing values lies in developing automatic learning algorithms able to internally handle both aspects.
2.2. Algorithm Level Solution for Clustering Mixed and Incomplete Data
In this approach, the responsibility of handling mixed and incomplete data representations falls on the learning algorithm, which must include mechanisms aimed at such kinds of data. A recent review of clustering algorithms for mixed data can be found in [15].
Most algorithms for clustering mixed data rely on dissimilarity functions able to deal with such data [21]. This approach dates back to 1997, when Huang proposed the k-Prototypes algorithm [22], an extension of the k-Means algorithm [23]. The extension is based on a dissimilarity function to compare mixed instances, and on a new strategy to compute the cluster centroids.
Such mixed dissimilarity-based algorithms, like k-Prototypes, do not consider attribute dependences, because they analyze the numerical and categorical features separately, and some of them construct the cluster centroids separately as well. Finally, the use of an arbitrary dissimilarity function may be inadequate for some domains.
Another strategy for mixed data clustering is to analyze the features separately by projecting the instances. That is, the mixed dataset is divided into several subsets according to the different types of attributes, each subset is clustered by some algorithm, and the results are then combined into a new categorical dataset, which is clustered in turn [24]. This idea has an obvious impact on the dataset, since the data are transformed twice, which may alter the intrinsic semantic relationships associated with the original representation.
A combination of the above alternatives is also present in the literature: some algorithms both divide the data and use mixed dissimilarity functions, thereby retaining the disadvantages of both strategies. An example is the HyDaP algorithm [25], which has two steps: the first identifies the data structure formed by continuous variables and recognizes the features relevant for clustering; the second uses a dissimilarity measure for mixed data to obtain the clustering results.
In our opinion, a better solution can be found by applying bio-inspired algorithms. To the best of our knowledge, the AGKA algorithm by Roy and Sharma [4] is the first algorithm using bio-inspired strategies for clustering mixed data. It uses a genetic algorithm to obtain the clusters. The representation used is a string whose length equals the instance count, where the i-th element contains the index of the cluster to which the i-th instance is assigned. To obtain the cluster centers, AGKA uses the centroid representation strategy and the dissimilarity function proposed in [21].
Despite the advantage of its evolutionary approach to clustering, the use of a pre-defined dissimilarity function is a drawback of the AGKA algorithm. In addition, the centroid description used may be inadequate when few clusters are sought. Furthermore, the representation used by AGKA makes it difficult to apply traditional operators to evolve the individuals, and datasets with thousands of instances imply individuals described by strings of thousands of elements, which makes AGKA inadequate for medium and large datasets.
To overcome these drawbacks, we propose a generic framework for dealing with mixed and incomplete data clustering in a more suitable way.
3. Generic Framework for Bio-Inspired Clustering of Mixed and Incomplete Data
Swarm intelligence is directly related to the intelligent behavior of collections of individuals or agents (birds, ants, fish, bats, bees, etc.) that move in an apparently unorganized way. In this context, this branch of scientific research aims to develop processes and algorithms that model and simulate the actions these individuals perform to search for food [3,4].
There is an impressive range of possibilities for scientists to choose the types of agents that exhibit cooperative and self-organizing behavior. In addition to insects, birds, and fish, it is also possible to consider the growth of bacteria or even swarms of robots that make up cellular robotic systems [5].
Typically, the individuals in a swarm are simple entities, yet active and with their own movements. When incorporated into the swarm, these individual movements generate a cooperative movement that results in collective search strategies, which significantly improve on random search. This is precisely what is called swarm intelligence, whose rules are so simple that it is a cause for wonder how efficiently the searches are performed [6,11].
In the following, we supply a generic framework for mixed data clustering using bio-inspired algorithms. We propose a unified representation of the solutions and a strategy to update the clustering solutions in the optimization process. Our approach is quite generic, and we believe it can be easily applied to several bio-inspired algorithms. It should be noted that we focus only on iterative improvement metaheuristics, not constructive ones. Constructive algorithms build the desired solution by parts; that is, there is no solution to the problem until the algorithm finishes. Improvement heuristics, on the other hand, start with an entire solution or set of solutions (usually random) and refine them iteratively. This characteristic gives the user at least one solution to the problem at every iteration of the algorithm.
Our rationale is that using a unified strategy to model the clustering problem as an optimization problem will lead to good clustering results, regardless of the swarm intelligence algorithm used. We bet on selecting a suitable representation, a useful updating strategy, and an adequate optimization function. Our hypothesis is that if we manage to model mixed and incomplete data clustering as an optimization problem, we can obtain competitive results and a clustering that fits the natural structure of the data.
3.1. Representation
The representation of candidate solutions is one of the key aspects of metaheuristic algorithms. In fact, the representation used defines the operators applicable to the candidate solutions. For example, in real-valued representations, it is possible to obtain a new solution by increasing the value of the current one by an epsilon value. In binary representations, the only modification allowed is bit flipping. In order-related representations, a solution is changed or updated by swapping some of its elements (e.g., 2-opt operators [26]).
For clustering numerical data, the usual representation has been a matrix of size $k \times n$, where $k$ is the number of clusters and $n$ is the number of feature values [27]. However, as we are dealing with mixed values, with some absences of information, we adopted a simpler approach: we consider a candidate solution as an array of cluster centers.
Instead of creating an artificial center for a cluster $C_i$, with $i \in \{1, \dots, k\}$, which is a clear disadvantage of previous proposals, we use the notion of prototype, in which the cluster center $c_i$ is defined as the instance that minimizes the dissimilarity with respect to the rest of the instances in the cluster. It is worth mentioning that any dissimilarity function can be used in the algorithms; by doing this, we also overcome the drawback of using predefined functions. Figure 1 shows an example of computing the center of a cluster using the Euclidean distance.
Formally, let $d(x, y)$ be the dissimilarity between instances $x$ and $y$ belonging to a cluster $C_i$. The cluster center $c_i$ is selected as follows:

$$c_i = \operatorname*{arg\,min}_{x \in C_i} \sum_{y \in C_i} d(x, y)$$
If the minimum dissimilarity value is obtained with more than one instance, we can select as cluster center any of the instances that minimize the dissimilarity.
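To make the prototype selection concrete, the following minimal Python sketch (the function names are ours, for illustration only) picks the center of a cluster under this definition; any dissimilarity function can be plugged in, here the Euclidean distance as in Figure 1.

```python
import math

def euclidean(a, b):
    # Plain Euclidean distance between two numeric instances.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def cluster_center(cluster, dissim):
    # The prototype (center) is the instance with minimum total
    # dissimilarity to the rest of the instances in the cluster.
    return min(cluster, key=lambda x: sum(dissim(x, y) for y in cluster))

cluster = [(1.0, 2.0), (1.5, 1.8), (5.0, 8.0)]
print(cluster_center(cluster, euclidean))  # -> (1.5, 1.8)
```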
Thus, the proposed representation is an array formed by the selected center $c_i$ of each cluster $C_i$. Each individual is represented as $x = [c_1, c_2, \dots, c_k]$. Considering this representation, an individual will have its current location (array of cluster centers) as well as other algorithm-dependent parameters. The swarm knows the set of instances to be clustered and the optimization functions associated with the clustering problem.
From an implementation perspective, we can easily simplify this representation by using the indexes of the selected cluster centers instead of the information of the cluster centers themselves.
Figure 2 shows the implementation of the proposed representation of a candidate solution, using the indices instead of the feature values of the cluster centers.
This representation is simple, suitable, and easy to update, fulfilling the desirable characteristics of candidate solution representations in optimization problems. We consider that this proposal simplifies the optimization process and is computationally effective, since it consumes less storage memory (just $O(k)$) than the usual matrix representation. It is important to note that some problems have a large number of instances (i.e., the $m$ parameter), on the order of thousands and even hundreds of thousands, which makes the traditional representation computationally impractical. Our proposed representation contributes to computational efficiency by diminishing the storage cost of a solution from $O(k \times n)$ to just $O(k)$.
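A minimal sketch of this index-based representation, under the assumption that instances are addressed by their position in the dataset:

```python
import random

def random_solution(m, k):
    # A candidate solution: k distinct instance indices acting as the
    # cluster centers, i.e., O(k) storage instead of a k x n matrix.
    return random.sample(range(m), k)

solution = random_solution(m=150, k=3)
# e.g., [17, 94, 3]: instances 17, 94, and 3 are the current centers
```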
3.2. Updating Strategy
All iterative improvement metaheuristic algorithms have in common that they update the current solution (or solutions) during the iterative cycle. The updating of the solutions usually takes into consideration the current solution, another solution in the population, the best solution in the population, and perhaps some randomly generated numbers. (Most programming languages include the capability of generating random numbers, usually following a uniform distribution; unless specified otherwise, all random numbers referred to in this paper follow a uniform distribution.) In this paper, we propose a unified updating strategy for dealing with the mixed clustering problem. Our proposal consists of changing a cluster center in the current solution: we generate a random number representing the cluster whose center will be changed, and then select from this cluster a random instance to replace the cluster center (Figure 3).
This strategy, although very simple, has two main advantages: it maintains a balance between exploration (by considering each instance in a cluster as the potential new center) and exploitation (by considering only the instances already in the cluster); and it directly handles attribute dependences, by selecting existing instances instead of independently modifying the feature values of the centers.
Considering this updating strategy, it is possible to integrate it into iterative improvement metaheuristic algorithms while preserving the major characteristics of each original algorithm. In addition, this strategy is computationally simple, since it consists only of updating the cluster centers.
In our implementation, we only need to generate a random number in the $[1, k]$ interval, which is very fast. In addition, because a solution has size $k$ instead of being a $k \times n$ matrix, our procedure is inexpensive. The pseudocode of the proposed updating procedure is detailed in Algorithm 1. This simple procedure is used in the three analyzed swarm intelligence algorithms.
Algorithm 1. Pseudocode of the updating procedure in the proposed framework for clustering mixed data.

Updating procedure
Inputs: $x$: individual; $U$: dataset of $m$ instances
Output: $x'$: modified individual
Steps:
1. Assign each instance of $U$ to its nearest cluster, considering the centers of $x$, and let $x' = x$.
2. Generate a random number $i$ in the $[1, k]$ interval; this number represents the index of the cluster center to be changed.
3. Let $C_i$ be the array of instances of $U$ assigned to the i-th cluster.
4. Generate a random number $j$ in the $[1, |C_i|]$ interval; this number represents the index of the instance replacing the cluster center, which becomes $c_i = C_i[j]$ in $x'$.
5. Return $x'$.
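As a companion to Algorithm 1, the following Python sketch implements the updating procedure over the index-based representation; the helper names and the data layout (a list of instances, with centers stored as instance indices) are illustrative assumptions, not code from the paper.

```python
import random

def assign_to_clusters(data, centers, dissim):
    # Group instance indices by their nearest center (centers are
    # themselves instance indices, per the proposed representation).
    clusters = {c: [] for c in centers}
    for i, x in enumerate(data):
        nearest = min(centers, key=lambda c: dissim(x, data[c]))
        clusters[nearest].append(i)
    return clusters

def updating(solution, data, dissim):
    # Algorithm 1: replace one randomly chosen cluster center with a
    # random instance currently assigned to that cluster.
    clusters = assign_to_clusters(data, solution, dissim)
    pos = random.randrange(len(solution))          # center to change
    # The center is normally assigned to its own cluster; fall back to
    # the center itself if its cluster happens to be empty.
    members = clusters[solution[pos]] or [solution[pos]]
    new_solution = list(solution)
    new_solution[pos] = random.choice(members)
    return new_solution
```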
3.3. Fitness Functions and Dissimilarities
It has been assumed so far that there is a dissimilarity measure to compare instances in the clustering process. For experimental purposes, in this research we used the Heterogeneous Euclidean-Overlap Metric (HEOM) [28] to compare instances. The HEOM dissimilarity was introduced by Wilson and Martínez [28] as an attempt to compare objects having mixed numerical and categorical descriptions. For two instances $x$ and $y$, the HEOM dissimilarity is defined as:

$$\mathrm{HEOM}(x, y) = \sqrt{\sum_{a=1}^{n} d_a(x_a, y_a)^2}, \qquad d_a(x_a, y_a) = \begin{cases} 1 & \text{if } x_a \text{ or } y_a \text{ is missing} \\ overlap(x_a, y_a) & \text{if } A_a \text{ is categorical} \\ \dfrac{|x_a - y_a|}{\max_a - \min_a} & \text{if } A_a \text{ is numeric} \end{cases}$$

where $overlap(x_a, y_a)$ is 0 if $x_a = y_a$ and 1 otherwise, and $\max_a$ and $\min_a$ are the maximum and minimum observed values of the numeric attribute $A_a$.
By using the HEOM dissimilarity, we are able to directly compare mixed and incomplete instances, to carry out the clustering process, and to evaluate the resulting clustering with the selected optimization functions. In addition, due to its simplicity and low computational cost, the HEOM dissimilarity is a feasible choice for use inside the optimization functions of evolutionary algorithms.
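A direct Python sketch of the HEOM dissimilarity, following the definition above (parameter names and the missing-value marker are our conventions):

```python
import math

def heom(x, y, is_numeric, ranges, missing="?"):
    # x, y       : instance descriptions (sequences of feature values)
    # is_numeric : per-feature flags (True -> numeric, False -> categorical)
    # ranges     : per-feature value max_a - min_a for numeric features
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if xa == missing or ya == missing:
            d = 1.0                        # missing values: maximal difference
        elif is_numeric[a]:
            d = abs(xa - ya) / ranges[a]   # range-normalized difference
        else:
            d = 0.0 if xa == ya else 1.0   # overlap for categorical features
        total += d * d
    return math.sqrt(total)

# Two mixed, incomplete instances:
x = (3.2, "red", "?")
y = (1.2, "blue", "A")
print(heom(x, y, is_numeric=(True, False, False), ranges=(10.0, None, None)))
```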
In addition, we have assumed that there is a fitness or optimization function to guide the search process of the metaheuristic algorithms. In this research, we explore the use of three different optimization functions, all of them validity indexes that have been widely used to determine clustering quality [29]. Two of them are maximization functions (the greater the value, the better the clustering), and the last one is a minimization function.
The silhouette is the average, over all clusters, of the silhouette width of the instances belonging to each cluster [29]. Let $x_i$ be an instance of the cluster $C_j$. Its silhouette width is defined as

$$s(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}}$$

where $a(x_i)$ is the average dissimilarity between $x_i$ and the other instances in its cluster, calculated as $a(x_i) = \frac{1}{|C_j| - 1} \sum_{y \in C_j,\, y \neq x_i} d(x_i, y)$, and $b(x_i)$ is the minimum of the average dissimilarities between $x_i$ and the instances belonging to every other cluster, defined as $b(x_i) = \min_{l \neq j} \frac{1}{|C_l|} \sum_{y \in C_l} d(x_i, y)$. The silhouette index of a set of clusters $C = \{C_1, \dots, C_k\}$ is defined by:

$$Sil(C) = \frac{1}{k} \sum_{j=1}^{k} \frac{1}{|C_j|} \sum_{x_i \in C_j} s(x_i)$$

For an instance $x_i$, the silhouette width lies in the $[-1, 1]$ interval. The greater the silhouette, the more compact and separated the clusters are.
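For illustration, a Python sketch of the silhouette computation with an arbitrary dissimilarity, following the definitions above (representing clusters as lists of instance indices is our assumption):

```python
def silhouette_width(i, clusters, data, dissim):
    # s(x_i) = (b - a) / max(a, b); `clusters` is a list of lists of
    # instance indices, and `i` is an instance index.
    own = next(c for c in clusters if i in c)
    a = (sum(dissim(data[i], data[j]) for j in own if j != i)
         / max(len(own) - 1, 1))
    b = min(sum(dissim(data[i], data[j]) for j in c) / len(c)
            for c in clusters if c is not own)
    return (b - a) / max(a, b) if max(a, b) > 0 else 0.0

def silhouette_index(clusters, data, dissim):
    # Average, over clusters, of the mean silhouette width of their instances.
    means = [sum(silhouette_width(i, clusters, data, dissim) for i in c)
             / len(c) for c in clusters]
    return sum(means) / len(means)
```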
Dunn's index considers the ratio between the minimum distance between two clusters and the size of the biggest cluster. As with the silhouette index, greater values correspond to better clusterings [29].
The Davies–Bouldin index is a well-known unsupervised cluster validity index [30]. It considers the dispersion $s_i$ of the instances in a cluster $C_i$ and the dissimilarity $d_{ij}$ between clusters $C_i$ and $C_j$. The Davies–Bouldin index measures the average similarity between each cluster and its most similar cluster; the lower its value, the more compact and separated the clusters are. Formally, the Davies–Bouldin index is defined as:

$$DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} R_{ij}$$

where $R_{ij}$ is usually defined as $R_{ij} = \frac{s_i + s_j}{d_{ij}}$.
Several inter-cluster dissimilarities have been defined, as well as several measures of cluster size. In this research, we used the centroid dissimilarity as the inter-cluster measure and the centroid-based measure for the cluster size.
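Under these choices, a Python sketch of the Davies–Bouldin computation could look as follows, reusing the prototypes of Section 3.1 as centers (the names and data layout are again illustrative):

```python
def davies_bouldin(clusters, centers, data, dissim):
    # clusters: list of lists of instance indices; centers: one instance
    # index per cluster (the prototypes of Section 3.1).
    k = len(clusters)
    # Dispersion s_i: average dissimilarity of a cluster's instances
    # to its center.
    s = [sum(dissim(data[i], data[centers[ci]]) for i in c) / len(c)
         for ci, c in enumerate(clusters)]
    total = 0.0
    for i in range(k):
        # R_ij with the most similar cluster; d_ij is the centroid
        # (prototype-to-prototype) dissimilarity.
        total += max((s[i] + s[j]) / dissim(data[centers[i]], data[centers[j]])
                     for j in range(k) if j != i)
    return total / k
```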
4. Case Study
As a case study in the clustering of mixed and incomplete data, we analyze three bio-inspired algorithms: the Artificial Bee Colony (ABC) algorithm [31], the Firefly Algorithm (FA) [32], and the Novel Bat Algorithm (NBA) [33], a recent development of the Bat Algorithm (BA) [34]. We selected these metaheuristic algorithms for four main reasons:
All of them are metaheuristic algorithms with recent successful applications to several optimization problems [35,36,37].
They mimic different social behaviors, resulting in different approaches to explore the solution space.
They have different approaches to exploiting the best solutions while avoiding getting trapped in local optima.
The ABC, FA, and BA algorithms have been successfully applied to clustering numerical data [5,11,12].
There are other bio-inspired algorithms fulfilling the above-mentioned reasons, such as the Whale Optimization Algorithm [17] and the Dragonfly Algorithm [19], among others. However, we selected just three algorithms as a case study, because of the very large number of experiments needed to assess their capabilities for clustering mixed and incomplete data. We presume that the framework introduced in Section 3 is applicable to other bio-inspired algorithms as well.
In what follows, we explain the selected metaheuristic algorithms and analyze the common elements among them.
The Artificial Bee Colony (ABC) algorithm was introduced in 2007 [31], and it mimics the foraging behavior of honey bees. The algorithm considers three kinds of bees: employed, onlooker, and scout bees. The solutions of the optimization problem are called food sources and have a position in the n-dimensional search space. Each food source has a nectar amount, corresponding to the fitness function of the optimization problem.
The ABC algorithm (Algorithm 2) starts by sending the scout bees into the search space to randomly generate the initial food sources. Once these are obtained, the employed bees are assigned to the food sources. Then, each employed bee searches for a new food source near the one at which it is employed. This search can be viewed as an updating mechanism in which the current solution, a randomly selected solution, and random numbers intervene. If the new food source is better than the previous one, the employed bee discards the previous one and adopts the new solution.
Algorithm 2. Pseudocode of the ABC algorithm.

ABC optimization algorithm
Inputs: $f$: fitness function of $D$ dimensions; $it_{max}$: number of iterations; $P$: population size; $limit$: limit of food sources
Output: best solution
Steps:
1. Send the scout bees to randomly generate the initial food sources, and assign the employed bees to them.
2. While the iteration limit is not reached: each employed bee searches near its food source and keeps the better source (greedy selection); the onlooker bees probabilistically visit good sources and do the same; the scout bees replace exhausted sources with random ones.
3. Return the best food source found.
Based on the nectar amounts of the food sources retained in the previous step, the ABC algorithm probabilistically determines which solutions the onlooker bees will visit. The onlooker bees visit those solutions and then fly around them in search of nearby food sources, after which the algorithm applies a greedy selection process (the same as the one carried out by the employed bees). Finally, the scout bees search for exhausted food sources (not improved in $limit$ iterations) and replace them with randomly generated new ones. After the pre-defined number of iterations, ABC returns the best food source found.
The FA (Algorithm 3) was introduced in 2009 [32]. It mimics the coordinated flashing of fireflies. Although the real purpose of the bioluminescent behavior of fireflies is a current research topic for biologists, it is believed to be related to finding mates, protecting against predators, and attracting potential prey. The FA considers some simplified rules about firefly behavior: all fireflies are unisex; the attractiveness of a firefly is proportional to its brightness (less attractive fireflies move toward more attractive ones); and the brightness of the fireflies is associated with a certain fitness function.
The attractiveness of fireflies is proportional to the intensity of the flashing seen by adjacent fireflies. The movement of a firefly $x_i$ attracted by a brighter firefly $x_j$ is determined by

$$x_i = x_i + \beta_0 e^{-\gamma r_{ij}^2}(x_j - x_i) + \alpha\,\epsilon_i$$

where the second term of the expression is due to the attractiveness $\beta = \beta_0 e^{-\gamma r_{ij}^2}$, and the third term is a vector of random variables $\epsilon_i$ drawn from a Gaussian distribution and scaled by $\alpha$ [32]. The best firefly moves randomly, that is, $x_{best} = x_{best} + \alpha\,\epsilon$.
Algorithm 3. Pseudocode of the FA algorithm.

FA optimization algorithm
Inputs: $f$: fitness function of $D$ dimensions; $it_{max}$: number of iterations; $P$: population size
Output: best solution
Steps:
1. Randomly generate the initial fireflies.
2. While the iteration limit is not reached: move each firefly toward the brighter fireflies according to the movement equation, and move the best firefly randomly.
3. Return the best (brightest) firefly.
In 2009, Lukasik and Zak [38] experimentally obtained optimal values for the parameters of the FA algorithm, recommending fixed settings for the attractiveness and light-absorption coefficients and a population size varying between 15 and 50 fireflies. These conclusions allow the movement of a firefly to be simplified: the movement of a firefly $x_i$ attracted by a firefly $x_j$ reduces to a simple modification (or update) of the position of $x_i$.
The Novel Bat Algorithm (NBA) was proposed in 2015 [33] as an extension of the Bat Algorithm [34]. The NBA incorporates the bats' compensation for the Doppler effect in echoes. In addition, in the NBA the bats can forage in different habitats [33]. The NBA uses a probability-based approach to determine the habitat selection of the bats, who use either a quantum-based or a mechanical approach. In the quantum approach, the bats consider their own positions, the position of the best bat, and the mean position of the swarm to obtain new solutions. In the mechanical approach, by contrast, the bats compensate for the Doppler effect and use this information in the updating process. Finally, the bats perform a local search based on the position of the best bat. The parameters of the NBA are updated at each iteration. Algorithm 4 shows the main steps of the NBA.
Although different, the analyzed algorithms have two things in common: they start by randomly generating the solutions, and they then modify the current solutions by considering either a single solution (as in FA) or a combination of the current solution with another solution (as in ABC and NBA). In either case, a random variation is introduced. Thus, the analyzed bio-inspired algorithms can be considered to operate in three stages: initialization, updating, and returning. The updating stage, although different in every algorithm, includes a mechanism to modify or update the current solutions. We used this common behavior to provide a unified representation of the solutions and a strategy to update the clustering solutions in the optimization process.
We consider that the proposed framework for solution representation and updating is applicable to several other bio-inspired algorithms.
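To illustrate the three-stage structure, the following simplified Python skeleton shows how a swarm algorithm could be instantiated on top of the updating procedure of Section 3.2 (the `updating` sketch above); it deliberately omits the algorithm-specific dynamics of ABC, FA, and NBA, which decide which solutions are updated and how candidates are accepted, and replaces them with a plain greedy loop. The fitness function is assumed to be maximized (e.g., the silhouette index).

```python
import random

def swarm_clustering(data, k, fitness, dissim, pop_size=10, iters=50):
    # Stage 1: initialization with random index-based solutions.
    population = [random.sample(range(len(data)), k) for _ in range(pop_size)]
    best = max(population, key=lambda s: fitness(s, data, dissim))
    # Stage 2: iterative updating (greedy placeholder for the real
    # ABC/FA/NBA acceptance rules).
    for _ in range(iters):
        for idx, sol in enumerate(population):
            candidate = updating(sol, data, dissim)  # Algorithm 1
            if fitness(candidate, data, dissim) > fitness(sol, data, dissim):
                population[idx] = candidate
        best = max(population + [best], key=lambda s: fitness(s, data, dissim))
    # Stage 3: returning the best array of cluster-center indices.
    return best
```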
In the following, we explain the proposed integration into the three analyzed swarm intelligence algorithms. We introduce three clustering algorithms: clustering based on the Artificial Bee Colony (CABC), clustering based on the Firefly Algorithm (CFA), and clustering based on the Novel Bat Algorithm (CNBA). For the CABC algorithm, the pseudocode of the clustering approach is given in Algorithm 5. Note that the original ABC idea is preserved but contextualized to the mixed data clustering problem. For the CFA algorithm, we include the updating strategy as shown in Algorithm 6 to obtain the desired clustering.
Finally, for the CNBA algorithm, which has a more complex structure, we include the updating strategy as in Algorithm 7. In each case, we considered the elements that intervene in the updating of a given solution and integrated them into the proposed modification strategy, by considering the best solution so far, the current solution, and a random process. Our approach, although quite simple, is very effective in modeling mixed data clustering as an optimization problem. In addition, we consider this approach useful and effortlessly applicable to other swarm intelligence algorithms.
Algorithm 4. Pseudocode of the NBA algorithm.

NBA optimization algorithm
Inputs: $f$: fitness function of $D$ dimensions; $it_{max}$: number of iterations; $P$: population size; $prob$: probability for habitat selection; $r_i$: pulse emission rate of the i-th bat; $w$: inertia weight; $Dopp$: compensation rates for the Doppler effect in echoes; $h$: contraction–expansion coefficient; $G$: frequency of updating the loudness and pulse emission rate; the remaining parameters of the basic Bat Algorithm
Output: best solution
Steps:
1. Randomly generate the initial bats.
2. While the iteration limit is not reached: each bat selects a habitat with probability $prob$ and updates its position using either the quantum-based or the mechanical approach; a local search is performed around the best bat; the loudness and pulse emission rates are updated.
3. Return the best bat.
Algorithm 5. Pseudocode of the CABC algorithm for clustering mixed data.

CABC mixed clustering algorithm
Inputs: $U$: dataset of $m$ instances; $k$: cluster number; $it_{max}$: number of iterations; $P$: population size; $limit$: limit of food sources
Output: $Cl$: clustering of the instances
Steps:
1. Send a scout bee to randomly generate the $P$ food sources.
2. it = 1
3. While $it \le it_{max}$:
   (a) For each employed bee assigned to a food source $x$: generate a new food source $v$ closer to the current one, as $v$ = Updating($x$, $U$); if $v$ is better than $x$, then $x \leftarrow v$; else increase the limit counter of $x$.
   (b) For each onlooker bee: fly to a food source $x$ with a good nectar amount; generate a new food source $v$ as $v$ = Updating($x$, $U$); if $v$ is better than $x$, then $x \leftarrow v$; else increase the limit counter of $x$.
   (c) Send the scout bee to find the food sources that have reached the limit $limit$, and replace them with randomly generated food sources.
   (d) it = it + 1
4. Create the clustering $Cl$ by assigning the instances in $U$ to their closest centers, considering the centers of the best food source.
5. Return $Cl$.
Algorithm 6. Pseudocode of the CFA algorithm for clustering mixed data.

CFA mixed clustering algorithm
Inputs: $U$: dataset of $m$ instances; $k$: cluster number; $it_{max}$: number of iterations; $P$: population size
Output: $Cl$: clustering of the instances
Steps:
1. Randomly generate the $P$ fireflies.
2. it = 1
3. While $it \le it_{max}$: move each firefly toward the brighter fireflies by generating new positions with the Updating procedure; move the best firefly randomly, also via Updating; it = it + 1.
4. Create the clustering $Cl$ by assigning the instances in $U$ to their closest centers, considering the centers of the best firefly.
5. Return $Cl$.
Algorithm 7. Pseudocode of the CNBA algorithm for clustering mixed data.

CNBA mixed clustering algorithm
Inputs: $U$: dataset of $m$ instances; $k$: cluster number; $it_{max}$: number of iterations; $P$: population size; $prob$: probability for habitat selection; $r_i$: pulse emission rate of the i-th bat
Output: $Cl$: clustering of the instances
Steps:
1. Randomly generate the $P$ bats and select the best one.
2. it = 1
3. While $it \le it_{max}$: for each bat, select a habitat (quantum or mechanical behavior) with probability $prob$ and generate a candidate solution with the Updating procedure; perform a local search around the best bat; update the algorithm parameters; it = it + 1.
4. Create the clustering $Cl$ by assigning the instances in $U$ to their closest centers, considering the centers of the best bat.
5. Return $Cl$.
6. Conclusions
We have used several swarm intelligence algorithms to solve mixed and incomplete data clustering. Our proposal emerges as a valuable contribution to the field: most of the proposed clustering algorithms operate only over numerical data, and are therefore unable to solve problems that are relevant to human activities and whose data are described not only by numeric attributes, but by mixed and incomplete data. Our proposal responds to the fact that methods and algorithms for clustering over mixed and incomplete data are an evident minority in the scientific literature, in addition to being often ineffective.
To achieve an effective proposal that solves these types of problems, we introduced a generic modification to three swarm intelligence algorithms (Artificial Bee Colony, Firefly Algorithm, and Novel Bat Algorithm). We also provide an unbiased comparison among several metaheuristic-based clustering algorithms, concluding that the clusters obtained by our proposals are highly representative of the "natural structure" of the data.
Our proposal significantly outperformed the k-Prototypes, AGKA, and AD2011 clustering algorithms according to cluster error, and AGKA and AD2011 according to the adjusted Rand index. The numerical experiments allow us to conclude that the proposed strategy for clustering mixed and incomplete data using metaheuristic algorithms is highly competitive and leads to better results than other clustering algorithms. In addition, the experiments showed that the proposed algorithms are robust and do not depend heavily on the optimization function used. However, for CABC, we found that using Dunn's index as the fitness function led to better results according to the adjusted Rand index.
In addition, after a thorough analysis and a very large number of experiments, the statistical tests showed several useful results about adequate parameter configurations for metaheuristic algorithms applied to clustering mixed and incomplete data. According to the Wilcoxon test, we cannot reject the hypothesis of equal performance between 10 individuals and greater numbers of individuals (20, 40, and 50); likewise, the test cannot reject the hypothesis of equal performance between 50 iterations and greater numbers of iterations (100, 500, and 1000) in clustering mixed and incomplete data.