
International Journal of Pure and Applied Mathematics
Volume 118 No. 7 2018, 547-556
ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu
Special Issue

Clustering Algorithms for Mixed Datasets: A Review


K. Balaji* and K. Lavanya
School of Computer Science and Engineering
VIT University, Vellore, India

Abstract—Clustering is an essential technique in Data Mining which has been applied effectively from numerous perspectives. However, most of the clustering algorithms developed so far have focused either on numeric or on categorical datasets, but rarely on both. Clustering algorithms for mixed datasets provide distance measures for handling both numeric and categorical data attributes. In this review, we present a critical analysis of the most effective algorithms described in the field of clustering algorithms for mixed datasets. Moreover, we present a comparison of these algorithms regarding a set of features which are desirable from a practical point of view. Finally, some research lines that need to be further developed in the context of clustering mixed datasets are discussed.

Keywords—Data Mining; Clustering; Mixed Data; Dissimilarity Measure

I. INTRODUCTION

Clustering is an essential method in Data Mining and Pattern Recognition. This method aims at organizing a collection of objects into classes or clusters, such that objects belonging to the same cluster are similar enough to infer they are of the same type, and objects belonging to different clusters are dissimilar enough to infer they are of distinct types. There are several areas in which clustering algorithms have been successfully applied, for instance: privacy preserving [1], information retrieval [2], text analysis [3], image processing, customer segmentation, and gene expression analysis [4].

Clustering algorithms can be classified according to different criteria, for example: a) clustering numeric data attributes, b) clustering categorical data attributes, and c) clustering mixed numeric and categorical data attributes. This work focuses its analysis on the last criterion. With regard to this criterion, a clustering algorithm is classified as one that builds the clusters based on the similarity relations of both numeric and categorical data attributes. Most of the clustering algorithms developed so far are based on similarity relations over either numeric or categorical data attributes and leave out clustering of mixed datasets; these algorithms may not be suitable for applications such as heart disease analysis [5].

The aim of this work is to present a critical review of the most influential clustering algorithms for mixed datasets reported in the literature, describing their main features and highlighting their limitations from a practical point of view. The main objectives of this review paper can be summarized as follows:

• To review the existing algorithms for discovering how concept formation is made.
• To identify the significance of mixed datasets clustering algorithms.
• To identify the recent advances in this domain.
• To present the impacts of mixed datasets clustering algorithms in real-time applications.

The rest of this paper is organized as follows: In Section II the most important clustering algorithms for mixed datasets in the state of the art are described. The comparison characteristics of the clustering algorithms are described in Section III. Section IV summarizes some research gaps that need to be explored, and finally, Section V concludes the paper.

*Corresponding Author:
E-mail address: balaji.2016@vitstudent.ac.in

II. CLUSTERING ALGORITHMS FOR MIXED DATASETS

Let X be a collection of objects. Each object is described by a set of attributes which can be numeric, categorical, or both. The main aim of the clustering algorithm is to cluster the collection of objects with their mixed attributes. The existing clustering algorithms generally follow one of the following approaches, contrasted in the sketch after this list:

• Conversion of categorical data attributes into numeric ones, followed by numeric data clustering.
• Conversion of numeric data attributes into categorical ones, followed by categorical data clustering.
• Directly handling mixed data clustering.
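To make the first and third approaches concrete, the following Python sketch contrasts them on a toy record: one-hot encoding the categorical attribute so a purely numeric distance applies, versus a mixed distance that handles each attribute type natively. The encoding, attribute names, and weight gamma are illustrative assumptions, not taken from any specific algorithm reviewed here.

import numpy as np

record_a = {"age": 30.0, "color": "red"}
record_b = {"age": 40.0, "color": "blue"}

# Approach 1: convert categorical to numeric (one-hot), then apply a
# numeric distance to the concatenated vector.
colors = ["red", "blue"]
def one_hot(rec):
    return np.array([rec["age"]] + [1.0 if rec["color"] == c else 0.0
                                    for c in colors])
print(np.linalg.norm(one_hot(record_a) - one_hot(record_b)))

# Approach 3: handle the mixed record directly with a distance that
# treats numeric and categorical attributes separately.
def mixed_distance(r1, r2, gamma=1.0):
    numeric = abs(r1["age"] - r2["age"])
    categorical = 0.0 if r1["color"] == r2["color"] else 1.0
    return numeric + gamma * categorical
print(mixed_distance(record_a, record_b))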


A. Partition-based Algorithms

The algorithms described in this subsection build the clusters based on partitions. In partition-based clustering, the centroid of the data points becomes the center of the corresponding cluster. Objects are divided and clusters are updated depending on the partition. These methods are beneficial for applications such as bioinformatics.

The K-Prototypes algorithm [6] is a partition-based clustering algorithm which combines both K-means [7] and K-modes [8] for handling mixed datasets. The algorithm builds on the concept of the K-means algorithm and, furthermore, removes the numeric-data-only constraint of that algorithm. The data objects are grouped against K prototypes, and the prototypes are changed dynamically so as to maximize the within-cluster similarity of objects. Initially, K objects are selected for K clusters from the given dataset. Each object is assigned to the cluster whose prototype is most similar according to the dissimilarity measure. After the allocation of each object, the cluster prototype is updated. Recalculation of object similarity in each cluster must then be performed: if an object is more similar to another cluster, it is moved to that cluster. The calculation of object similarities is repeated until no object changes its cluster.
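As a concrete illustration, the following Python sketch shows the kind of mixed dissimilarity K-Prototypes is built around: squared Euclidean distance on the numeric attributes plus a weight gamma times the number of categorical mismatches. The attribute split and the gamma value are assumptions for the example, not values taken from [6].

import numpy as np

def kprototypes_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=0.5):
    """Mixed dissimilarity in the style of K-Prototypes: squared Euclidean
    distance on numeric attributes plus gamma times the number of
    categorical mismatches."""
    numeric_part = np.sum((np.asarray(x_num) - np.asarray(proto_num)) ** 2)
    categorical_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric_part + gamma * categorical_part

# Example: one object compared against one prototype (illustrative values).
d = kprototypes_dissimilarity([1.2, 0.7], ["red", "small"],
                              [1.0, 0.9], ["red", "large"], gamma=0.5)
print(d)  # 0.08 + 0.5 * 1 = 0.58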
The K-Means Clustering for Mixed Datasets (KMCMD) algorithm [9] is a partition-based clustering algorithm which removes the numeric data constraint of the K-means algorithm and overcomes the complexity of the K-Prototypes algorithm. A novel distance measure and cost function are proposed, based on the co-occurrence of data attribute values. The algorithm randomly assigns a cluster number to every object. After that, the cluster centers are calculated and each object is assigned to the nearest cluster. The cluster center is recalculated every time a new object is included. The assignment of objects and the recalculation of cluster centers are repeated until the objects no longer change their clusters. Unlike the K-Prototypes algorithm, the importance of an attribute is computed by discretizing the numeric attributes. The significance of a numeric or categorical attribute is derived from the distance measure rather than defined by the user. The algorithm works not only for mixed datasets but also for pure categorical and pure numeric datasets.
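The flavor of the co-occurrence idea can be sketched as follows: the distance between two values a and b of one categorical attribute is measured by how differently they co-occur with the values of another attribute. This is a simplified illustration (using the total variation distance between the two conditional distributions), not the exact formulation of [9].

from collections import Counter

def cooccurrence_distance(rows, attr, a, b, other):
    """Distance between values a and b of attribute `attr`, measured by how
    differently they co-occur with the values of attribute `other`; computed
    as the total variation distance between P(other | attr=a) and
    P(other | attr=b)."""
    dist_a = Counter(r[other] for r in rows if r[attr] == a)
    dist_b = Counter(r[other] for r in rows if r[attr] == b)
    na, nb = sum(dist_a.values()), sum(dist_b.values())
    support = set(dist_a) | set(dist_b)
    return 0.5 * sum(abs(dist_a[w] / na - dist_b[w] / nb) for w in support)

rows = [{"color": "red", "size": "S"}, {"color": "red", "size": "S"},
        {"color": "blue", "size": "L"}, {"color": "blue", "size": "S"}]
print(cooccurrence_distance(rows, "color", "red", "blue", "size"))  # 0.5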
The K-Centers algorithm [10] is a partition-based algorithm which uses the concept of the K-Prototypes algorithm for handling mixed datasets. The K-Centers algorithm considers the different frequencies of the attribute values during the update of cluster centers. The main disadvantage of the K-Prototypes [6] and K-Modes [8] algorithms is that both modify the cluster centers depending only on the maximum-frequency attribute values; a cluster center which ignores the significant values of the other attributes degrades the accuracy of the clustering outcome. A novel measure is therefore proposed which considers the different frequencies of attribute values in the cluster centers. The K-Centers algorithm is implemented in two ways: if an object belongs to only one cluster, it is called Hard K-Centers clustering; if an object can belong to several clusters, it is called Fuzzy K-Centers clustering. First, the algorithm initializes the cluster centers. Next, it calculates the membership matrix. The algorithm updates the membership matrix and minimizes the cost function in order to obtain new cluster centers; the new cluster centers are computed repeatedly until the cost function cannot be minimized further. The algorithm is able to handle convex or spherical cluster shapes. It requires a user-defined parameter and does not deal with outliers effectively. If the fuzzy parameter is too large, the membership matrix fails to cluster the objects, and the algorithm cannot guarantee a globally optimal solution.
The Improved K-Prototypes algorithm [11] is a partition-based algorithm which presents an effective treatment of the categorical data attributes in mixed datasets while also considering the significance of the different attributes in the clustering process. The algorithm introduces the idea of a distribution centroid over the categorical attributes for calculating the center of a cluster; the basic concept of the fuzzy centroid [12] is used to represent the categorical data attributes in a hard clustering. The algorithm evaluates the significance of attributes with the help of Huang's approach [13]. The maximum number of iterations, as well as the number of clusters, is initialized at the beginning. The objective function is minimized by dividing the problem into sub-problems, which are solved for the numeric as well as the categorical attributes to obtain the solution of the main problem. The algorithm performs well not only on mixed datasets but also on pure numeric or categorical ones. In real-world applications, clustering often needs a fuzzy scenario to get better results, whereas this algorithm obtains accurate results in a hard scenario.

The K-Harmonic Means type Clustering algorithm for Mixed Datasets (KHMCMD) [14] is an extension of the K-Harmonic Means (KHM) algorithm [15]. The KMCMD algorithm suffers from the initialization of the cluster centroids: the clustering result depends strongly on the initial selection of centroids. Random initialization of the centroids is the standard method; however, the results of the clustering algorithm are then not stable across different initializations. This issue is solved by the KHM algorithm, but for numeric data attributes only. The KHMCMD algorithm solves the centroid initialization issue for mixed datasets. Initially, the numeric data attributes are discretized into categorical attributes. The dissimilarity measure proposed in [9] and the cost function of the KHM algorithm in a hard scenario are used for defining the cluster centers, and the new dissimilarity measure computes the cost function in a fuzzy scenario for mixed datasets. Each object is allocated to a cluster until no data point changes its cluster membership or a maximum number of iterations is reached.
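For reference, KHM's insensitivity to initialization comes from its objective: instead of summing each point's distance to its nearest center (as K-means does), it sums a harmonic mean of the distances to all centers. A minimal sketch of that objective follows, with Euclidean distance standing in for the mixed dissimilarity of [9] (an assumption for the example).

import numpy as np

def khm_objective(points, centers, p=2.0, eps=1e-12):
    """K-Harmonic Means objective: for each point, K divided by the sum of
    reciprocal p-th-power distances to all K centers, summed over the
    dataset. Euclidean distance is a stand-in for a mixed dissimilarity."""
    total = 0.0
    k = len(centers)
    for x in points:
        dists = [np.linalg.norm(np.asarray(x) - np.asarray(c)) ** p + eps
                 for c in centers]
        total += k / sum(1.0 / d for d in dists)
    return total

points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
centers = [[0.0, 0.0], [5.0, 5.0]]
print(khm_objective(points, centers))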


B. Hierarchical Clustering Algorithms

Hierarchical clustering builds a hierarchical structure that combines or divides the data objects into clusters; a tree is used to represent this hierarchy of clusters. Hierarchical clustering algorithms are divided into agglomerative clustering (bottom-up approach) and divisive clustering (top-down approach). The bottom-up approach begins with each object in its own cluster and iteratively combines the most related clusters. The top-down approach begins with a single cluster containing all objects and iteratively divides it into suitable sub-clusters. This process continues until a stopping criterion is met.

The Distance Hierarchy (DH) algorithm [16] is a hierarchical clustering algorithm which computes a similarity measure for the categorical attributes and combines the result with the numerical attributes. The DH clustering algorithm combines several conventional distance calculation schemes, such as the simple matching method and the binary encoding technique, which transform categorical data attributes into numeric data attributes. The distance hierarchy is based on a concept hierarchy [17-18], where the new distance measure is calculated by means of edge costs: the similarity of two categorical values is computed from the total edge cost of the path between their two nodes. The DH algorithm extends the concept hierarchy so that each edge carries a cost, which simplifies the calculation of the distance measure. The concept hierarchy structure consists of vertices and edges; top-level vertices denote general concepts, whereas bottom-level vertices denote detailed concepts, and the distance between two nodes is the total edge cost between them. An adjacency matrix is given as input to the DH algorithm, and this matrix is used in the subsequent clustering process. The DH algorithm is incorporated into an agglomerative hierarchical approach, so that data analysts can reflect their domain knowledge about the similarity of data objects through the construction of distance hierarchies.
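A small sketch of the edge-cost idea: represent the concept hierarchy as a parent map with edge costs, and measure the distance between two categorical values as the total cost of the path joining them through their lowest common ancestor. The toy hierarchy below is an assumption for illustration, not data from [16].

# Toy concept hierarchy: each node maps to (parent, cost of edge to parent).
HIERARCHY = {
    "drink":  (None, 0.0),
    "coffee": ("drink", 1.0), "tea": ("drink", 1.0),
    "latte":  ("coffee", 0.5), "espresso": ("coffee", 0.5),
}

def path_to_root(node):
    """Return the list of (node, cost_to_parent) pairs up to the root."""
    path = []
    while node is not None:
        parent, cost = HIERARCHY[node]
        path.append((node, cost))
        node = parent
    return path

def hierarchy_distance(a, b):
    """Total edge cost between two values through their common ancestor."""
    ancestors_a = {n for n, _ in path_to_root(a)}
    cost = 0.0
    node = b
    while node not in ancestors_a:       # climb from b until we meet a's path
        parent, edge = HIERARCHY[node]
        cost += edge
        node = parent
    for n, edge in path_to_root(a):      # add a's edges below the meet point
        if n == node:
            break
        cost += edge
    return cost

print(hierarchy_distance("latte", "espresso"))  # 1.0 (0.5 + 0.5)
print(hierarchy_distance("latte", "tea"))       # 2.5 (0.5 + 1.0 + 1.0)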
network learning technique. Category-I ART deals with
The Similarity-Based Agglomerative Clustering (SBAC) algorithm [19] is a hierarchical clustering algorithm. The algorithm uses a standard measure, the Goodall similarity [20], which processes numeric and categorical data in a common framework, and combines it with an agglomerative approach that builds a hierarchy. The SBAC approach is based on the Unweighted Pair Group Method with Arithmetic average (UPGMA) [20]. The algorithm begins the clustering process with a pairwise distance matrix for the collection of data objects, where the distance between a pair of data objects is the counterpart of their similarity value. At each step, the pair of clusters with the lowest pairwise dissimilarity is combined into a single group, and the distance between the new cluster and the old clusters is defined as the average distance between them. The computation of the dissimilarity measure is repeated until all the objects are combined in a single cluster. The clustering process terminates in a dendrogram (or tree) where the leaf vertices specify the individual data objects and the root vertex specifies the group which contains all objects.
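The UPGMA merge rule mentioned above is easy to state in code: repeatedly merge the two closest clusters, and define the distance from the merged cluster to any other cluster as the size-weighted average of the two old distances. A compact sketch under those (standard UPGMA) conventions:

def upgma(dist):
    """UPGMA agglomerative clustering on a symmetric distance matrix.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples."""
    n = len(dist)
    active = {i: 1 for i in range(n)}               # cluster id -> size
    d = {frozenset((i, j)): float(dist[i][j])
         for i in range(n) for j in range(i)}
    merges, next_id = [], n
    while len(active) > 1:
        pair = min(d, key=d.get)                    # closest pair of clusters
        a, b = tuple(pair)
        merges.append((a, b, d[pair]))
        # Distance from the new cluster to each remaining one is the
        # size-weighted average of the old distances (the UPGMA rule).
        for c in active:
            if c not in (a, b):
                dac = d.pop(frozenset((a, c)))
                dbc = d.pop(frozenset((b, c)))
                d[frozenset((next_id, c))] = (
                    (active[a] * dac + active[b] * dbc)
                    / (active[a] + active[b]))
        del d[pair]
        active[next_id] = active.pop(a) + active.pop(b)
        next_id += 1
    return merges

dist = [[0, 2, 6], [2, 0, 5], [6, 5, 0]]
print(upgma(dist))  # merges clusters 0 and 1 first, then joins 2 at 5.5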
The Two-step Method for Clustering Mixed numeric and Categorical data (TMCM) algorithm [21] is a hierarchical clustering algorithm. The algorithm discovers the relationships between categorical data attribute values based on their co-occurrence. All categorical data attributes are transformed into numeric data attributes, so that all data objects contain only a numeric representation; it is very easy to apply an existing algorithm to the clustering process when all data objects are purely numeric. In the first step, a Hierarchical Agglomerative Clustering [22] method is applied to group the initial data into subgroups. The newly formed subgroups, with their additional features, become the input of the K-means clustering [7] of the next step. Instead of choosing individual data points, every subgroup with its added features becomes a primary group for the K-means clustering; with the help of these primary groups, the K-means clustering process is enhanced. This characteristic is an added advantage of the algorithm, since it reduces the influence of outliers. The quality of the clustering process is evaluated using the entropy.


C. Incremental Clustering Algorithms

Non-incremental clustering algorithms store and process the whole input data pattern matrix in memory. These algorithms generally need the complete input data to be loaded into memory, and as a result the memory space requirements become high. In an incremental clustering algorithm, there is no need to load and process the whole input data in memory, so the required amount of memory space is smaller. Incremental clustering algorithms consider the input data patterns one at a time, and it is easy to add new input data patterns into the existing clusters. Incremental clustering algorithms are appropriate for run-time environments as well as for very large pattern databases.

The Modified Adaptive Resonance Theory (M-ART) algorithm [23] is an incremental clustering algorithm. The algorithm uses an M-ART network and a concept hierarchy structure for handling mixed datasets. The ART network [24] is a very standard incremental clustering algorithm with an unsupervised neural network learning technique. Category-I ART deals with binary numeric data; Category-II ART deals with general numeric data [25]. Many data systems collect mixed data attributes, but the Category-I and Category-II ART network methodologies do not deal with mixed datasets, and converting the categorical data attributes into binary information does not replicate the original information, which impacts the quality of the clusters. The M-ART network has two layers. The input layer consists of the training datasets, which comprise the distance hierarchy groups, the threshold value, and the stopping criterion. Initially, input records are assigned to input vectors. If the similarity of the output vectors surpasses the threshold value, those neurons are grouped into clusters; otherwise the output neuron is added as a new neuron. The process is repeated until the input records are exhausted or the stopping criterion is met. Based on the concept hierarchy, each data attribute is associated with a distance hierarchy whose link costs represent the distance between two attribute values; the implementation of the distance hierarchy simplifies the distance calculation.

The Clustering Algorithm based on the methods of Variance and Entropy (CAVE) [26] is an incremental clustering algorithm. The algorithm calculates the similarity measure for numerical data attributes by variance and for categorical data attributes by entropy. The number of clusters is predefined. Initially, the dissimilarity of two objects is computed and the objects are placed into two different clusters. The dissimilarity of each remaining record is then calculated and the record is put into the appropriate cluster. The process is repeated until no records remain. The algorithm can stop processing at any time and produce the output, and it incrementally updates the clusters whenever a new data record arrives.
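As a rough illustration of the variance-and-entropy idea (a simplified reading of [26], not its exact formulation), one can score how well a record fits a cluster by how much it would increase the cluster's numeric variance and categorical entropy:

import numpy as np
from collections import Counter
from math import log

def entropy(values):
    """Shannon entropy of a list of categorical values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log(c / n, 2) for c in counts.values())

def cave_style_cost(cluster_num, cluster_cat, x_num, x_cat):
    """Increase in numeric variance plus increase in categorical entropy if
    (x_num, x_cat) were added to the cluster -- a simplified stand-in for
    the CAVE dissimilarity."""
    before = np.var(cluster_num, axis=0).sum() + entropy(cluster_cat)
    after = (np.var(np.vstack([cluster_num, x_num]), axis=0).sum()
             + entropy(cluster_cat + [x_cat]))
    return after - before

cluster_num = np.array([[1.0, 2.0], [1.2, 1.8]])
cluster_cat = ["red", "red"]
print(cave_style_cost(cluster_num, cluster_cat, [1.1, 2.1], "red"))   # small
print(cave_style_cost(cluster_num, cluster_cat, [9.0, 9.0], "blue"))  # large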
The Mixed Self-Organizing Incremental Neural Network (MSOINN) algorithm [27] is an incremental clustering algorithm which automatically determines the number of clusters. The MSOINN algorithm is based on the Adjusted Self-Organizing Incremental Neural Network (ASOINN) algorithm [28]. A novel distance measure estimates the categorical distance in two learning settings, supervised and unsupervised. In the supervised setting, if the values of two data attributes are related to each other, the distance measure returns 0, and otherwise 1. If supervised information is not available, the novel distance measure is computed in an unsupervised manner: the number of distinct values of each categorical data attribute and their frequencies of occurrence in the dataset are considered. If the domain size of a categorical attribute is 2, the dissimilarity of two unrelated values will be larger than for an attribute with a domain size of 20; data objects are not placed in the same cluster if the dissimilarity measure of the categorical attributes is large. The value of each data attribute is weighted by the entropy, and co-occurrences within a common classification are the beneficial features in the supervised setting. Initially, the clusters are built by the neural network; in every iteration, a cluster is deleted if it does not win during the learning process. The algorithm runs an offline phase to produce an appropriate number of clusters for a given dataset, using the recommended dissimilarity measure and the modified update rules. The label of a new instance is determined by finding the nearest clusters using the network, and concurrent modifications of the clusters are also performed.

D. Model-based Clustering Algorithms

A model-based clustering algorithm chooses a specific model for every cluster and discovers the best-fitting model. Model-based clustering is divided into two categories: the neural network method and the statistical learning method. The model requires user-defined parameters, and these may change during the clustering process.
The BI-Level Clustering of Mixed categorical and numerical data types (BILCOM) Empirical Bayesian algorithm [29] is a model-based clustering algorithm which uses categorical data attribute clustering as a guide for the numerical data attribute clustering. The method performs a pseudo-Bayesian approach with the categorical data attributes as the prior. In biological applications to genes, Gene Ontology annotations are the categorical data attributes and gene expression data are the numerical data attributes. The model-based clustering algorithm discovers gene expression clusters of arbitrary shape by incorporating related information [30-33]. The data for which BILCOM clustering is especially helpful exist within the area of medicine: the categorical data attributes denote the features or symptoms of patients, and the numerical data attributes denote the outcomes of medical investigations on the patients; the medical results of patients are reflected in the clustering of the medical datasets by using this algorithm. Another essential application of this clustering algorithm is microarray gene expression data, which contain categorical data attributes indicating known gene functions [34-36] and numerical data attributes indicating gene expression across tissues [37-39]. The BILCOM clustering algorithm implements clustering in two stages. The data attributes are taken from biomedical datasets; the categorical attributes present semantic data on the objects, while the numeric attributes present experimental outcomes. Under the Bayesian view, it makes sense to use the categorical attributes in the first stage and the numerical data attributes in the second stage. Similarity measures for the categorical data attributes are calculated first, and for the numerical attributes next. The result of the first stage is given as input to the second stage, and the output of the second stage is the output of the clustering algorithm.

The AUTOCLASS algorithm [40] is a model-based clustering algorithm used to define the allocation of clusters to suitable classes; it is inherited from the concepts of the Bayesian approach. The algorithm discovers the most probable classification of the data objects, which depends on the prior allocation of each data attribute to the cluster and reflects the prior view of the user. In the first stage, the user chooses a probability distribution for each data attribute in the dataset: categorical data attributes are modeled with a Bernoulli distribution, whereas numerical data attributes are modeled with a Gaussian distribution. The algorithm repeatedly modifies the classification of the objects into clusters: each object is allocated to the cluster whose attribute probability distributions, through their mean and variance, give the maximum chance of generating the object's values. Furthermore, the algorithm iteratively examines different numbers of clusters, which are not user specified. In each cluster, the values of the mean and variance are modified by the algorithm, which iterates until the clusters and the probability distributions of the data values reach a steady state.
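The scoring step can be sketched directly: with a Gaussian model per numeric attribute and a categorical probability table per categorical attribute, an object is assigned to the cluster with the highest log-likelihood. The cluster names and parameter values below are illustrative assumptions, not AUTOCLASS output:

import math

def cluster_log_likelihood(x_num, x_cat, model):
    """Log-likelihood of one object under a cluster model with independent
    Gaussian numeric attributes and categorical probability tables."""
    ll = 0.0
    for x, (mu, var) in zip(x_num, model["gaussians"]):
        ll += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    for x, table in zip(x_cat, model["categorical"]):
        ll += math.log(table.get(x, 1e-9))  # tiny floor for unseen values
    return ll

clusters = {
    "c1": {"gaussians": [(0.0, 1.0)], "categorical": [{"yes": 0.9, "no": 0.1}]},
    "c2": {"gaussians": [(5.0, 1.0)], "categorical": [{"yes": 0.2, "no": 0.8}]},
}
obj = ([0.3], ["yes"])
best = max(clusters, key=lambda c: cluster_log_likelihood(*obj, clusters[c]))
print(best)  # c1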


The Support Vector Machine (SVM) clustering algorithm [41] is a model-based algorithm used to group data without any prior information about the input classes. The algorithm is initialized by running an SVM classifier against the data attributes, with each input vector in the dataset arbitrarily labeled; the steps are repeated until an initial convergence occurs. After completion of the initialization step, the parameters of the SVM trained on the data attributes can be accessed. The worst mislabeled data points are reassigned to the label of the other class. The algorithm is then run again on the dataset and is assured to converge in this condition, since it converged formerly and now has a smaller number of data points with mislabeling drawbacks. The method improves on its uncertainly convergent result by retraining the SVM after every relabeling of the mislabeled input vectors. The repetition of the above procedure improves the clustering accuracy up to the degree of separability at which misclassification occurs. The SVM clustering algorithm affords a very effective mechanism to build a separating hyperplane with the widest margin, using the training dataset. In spite of the supervised nature of SVMs, they have been applied to categorical data attributes to find groups in an unsupervised way. The method involves arbitrarily allocating the objects to a pair of groups and re-calculating the separating hyperplane until the object allocation and the hyperplane converge. Earlier, SVM-Internal Clustering (typically stated as a one-class SVM) used the internal features of the SVM to discover a group as the smallest encompassing sphere in a set of data. The internal approach to SVM clustering lacks robustness and is biased towards groups with a spherical form in feature space; the SVM-Internal Clustering algorithm may only perceive rather small cluster centers in most real-world applications. To overcome this problem, an External-SVM Clustering algorithm was presented that clusters data attributes with no preceding information about each data object's classification. Primarily, each data object in the dataset is arbitrarily labeled and the SVM classifier is trained. The sensitivity and specificity scores are low at first and approach 1 once initial convergence is achieved. The algorithm then improves the outcome by iteratively relabeling the worst misclassified data vectors.
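A compact sketch of the external, relabel-and-retrain loop described above, using scikit-learn's SVC. The margin-based choice of the "worst" points, the number of points flipped per round, and the iteration cap are illustrative assumptions, not the exact procedure of [41]:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
y = rng.integers(0, 2, len(X))               # arbitrary initial labels

for _ in range(10):                          # relabel-and-retrain loop
    clf = SVC(kernel="linear").fit(X, y)
    signed = clf.decision_function(X) * (2 * y - 1)  # negative => mislabeled
    worst = np.argsort(signed)[:2]
    worst = worst[signed[worst] < 0]         # flip only mislabeled points
    if worst.size == 0:                      # nothing mislabeled: converged
        break
    y[worst] = 1 - y[worst]

print(np.unique(y[:20]), np.unique(y[20:]))  # ideally one label per blob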
number of clusters, maximum iterations, and the threshold
E. Fuzzy Clustering Algorithms

Fuzzy clustering algorithms relax the discrete membership values {0, 1} into continuous values in the interval [0, 1]. Fuzzy clustering thereby describes the relationships among data objects more accurately.
grouped into a single cluster. The clustering process is
The General Fuzzy C-Means (GFCM) algorithm [42] is a fuzzy clustering algorithm based on the concept of the Fuzzy C-Means (FCM) algorithm [43]. Frequency-based cluster models [44] are used to group the categorical data attributes, based on the simple matching method. In the FCM algorithm, only numeric data attributes are partitioned into clusters, whereas in GFCM both numerical and categorical data attributes are partitioned. The fuzzy p-mode model is characterized as an array of p labels which have larger frequencies than the other labels in the cluster. A conventional algorithm with a single-feature model and a simple matching model leads to inaccurate clusters, but GFCM keeps multiple labels for the categorical data attributes and produces accurate clustering results. Initially, the membership degrees are chosen. The membership objective function is then minimized and the dissimilarity measures are calculated. The clustering process is repeated until the stopping criterion is met.
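For orientation, the membership update shared by FCM-style algorithms (including mixed-data variants) gives each object a weight per cluster that is inversely related to relative distance; m is the fuzzifier. This is the standard FCM update, shown with Euclidean distance as a placeholder for a mixed dissimilarity:

import numpy as np

def fcm_memberships(X, centers, m=2.0, eps=1e-12):
    """Standard FCM membership update: u[i, j] is the degree to which object
    i belongs to cluster j, in [0, 1], with each row summing to 1."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
u = fcm_memberships(X, centers)
print(np.round(u, 3))  # first two rows lean to cluster 0, last to cluster 1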
self-organizing map model to perform visualized analysis of
The Kullback-Leibler Fuzzy C-Means Gaussian Mixture Models (KL-FCM-GM) algorithm [45] is a fuzzy clustering algorithm based on the Gath-Geva algorithm [46] for handling mixed datasets effectively. In existing approaches, fuzzy clustering of such data is carried out by the fuzzy k-prototypes algorithm, which uses a different variance. In contrast, an innovative fuzzy c-means algorithm is proposed that makes use of a fully probabilistic dissimilarity functional for mixed datasets. The proposed algorithm uses a fuzzy objective function regularized by Kullback-Leibler divergence information, formulated on the basis of a set of likelihood assumptions concerning the form of the underlying clusters. The algorithm is iterative: the given objective function is optimized over the fuzzy membership functions, the parameters, and the weights of the clusters.
categorical data attributes is computed by distance hierarchy.
The Fuzzy K-means type algorithm [47] is a fuzzy clustering algorithm which computes the impact of the attributes, both numeric and categorical, using a probabilistic dissimilarity measure. In the FCM algorithm, it is not possible to calculate the mean of categorical data attributes; instead of the mean, the mode is calculated for the categorical data attributes, which does not reflect the original information. A novel dissimilarity measure together with a definition of the cluster centroid is therefore proposed. Initially, the fuzzy partition matrix and the threshold values are set. After that, the dissimilarity measures for the numerical and categorical attributes are computed. The clusters are updated until the stopping criterion is met.

The Fuzzy K-Prototypes algorithm [48] is a fuzzy clustering algorithm which represents the cluster prototype by combining the mean and the fuzzy centroid. The K-Prototypes algorithm implements a hard partition, which results in poor clustering of the data objects near the cluster boundaries; the Fuzzy K-Prototypes algorithm improves on this hard clustering. The data objects are grouped into different clusters with different degrees of membership. Initially, the number of clusters, the maximum number of iterations, and the threshold values are set. After that, the cluster prototype is divided into two parts: the first part uses the mean for the numerical attributes, and the second part uses the fuzzy centroid for the categorical attributes. The dissimilarity measure between two objects is calculated and similar data objects are grouped into a single cluster. The clustering process is repeated until the maximum number of iterations or the stopping criterion is reached.
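The two-part prototype is easy to sketch: a membership-weighted mean for the numeric attributes, and a membership-weighted value distribution (a fuzzy centroid) for each categorical attribute. The membership values below are illustrative assumptions:

import numpy as np
from collections import defaultdict

def mixed_prototype(X_num, X_cat, u):
    """Cluster prototype for one cluster: membership-weighted mean for the
    numeric part, membership-weighted value frequencies (fuzzy centroid)
    for each categorical attribute."""
    u = np.asarray(u, dtype=float)
    mean = (u[:, None] * np.asarray(X_num)).sum(axis=0) / u.sum()
    centroids = []
    for col in zip(*X_cat):                    # one categorical attribute
        freq = defaultdict(float)
        for value, weight in zip(col, u):
            freq[value] += weight
        total = sum(freq.values())
        centroids.append({v: w / total for v, w in freq.items()})
    return mean, centroids

X_num = [[1.0, 2.0], [1.4, 1.6], [8.0, 9.0]]
X_cat = [["red"], ["red"], ["blue"]]
mean, cats = mixed_prototype(X_num, X_cat, u=[0.9, 0.8, 0.1])
print(mean, cats)  # numeric mean pulled toward the first two objects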


F. Artificial Neural Networks Clustering Algorithms

Artificial neural network clustering is based on the idea of competitive learning. It is divided into two categories: the hard competitive learning method and the soft competitive learning method. In hard competitive learning, only the winning neuron is permitted to learn, whereas in soft competitive learning, all neurons in the network get a chance to learn. The hard competitive learning method is a winner-take-all learning method, whereas the soft competitive learning method is a winner-take-most learning method.

The Mixed-type Self-Organizing Map (MixSOM) algorithm [49] is an artificial neural network clustering algorithm which extends the self-organizing map model to perform visualized analysis of mixed datasets. The prototype combines the features of the Generalized SOM (GSOM) [50] and the Visualized SOM (ViSOM) [51-52]. MixSOM modifies the distance hierarchy representation of GSOM into a more convenient representation of numeric and categorical data attributes. The algorithm takes data in a high-dimensional space and projects them into a two-dimensional space. The distance hierarchy method considers meaningful characteristics of the categorical attributes during the training process, and the algorithm preserves the association between model distance and SOM map distance during the adaptation process. The distances between neighboring neurons are constrained along predetermined attributes, enabling the visualization of mixed data attributes. The structure of the clusters is controlled by user-defined inputs. The dissimilarity measure of the categorical data attributes is computed by the distance hierarchy: each input attribute and each item in the model is mapped to its related distance hierarchy, and the distinct distance hierarchies are computed and then combined.

The Growing Mixed-type SOM (GMixSOM) algorithm [53] is an artificial neural network clustering algorithm. The algorithm uses the distance hierarchy representation and improves the quality of the projection map. The self-organizing map grows from a few primary neurons to a large number of neurons during the training process; while the map is active, it deals with a convenient arrangement of neurons instead of a fixed size. However, Growing SOMs were only proposed in the framework of handling numeric attributes: the categorical attributes are converted into numeric ones, which does not reflect the original information of the categorical attributes. Initially, the training dataset, a set of distance hierarchies, the spreading factor, and the number of training steps are given as input to the algorithm. The input neurons are initialized with randomly selected weights. After that, a threshold value is defined according to the spreading factor. During the growing stage, the size of the neighborhood and the learning rate of the neurons are set, and each neuron's error is reset in the training process. After that, the best matching unit of an input vector is identified and its neighbors are updated, and the error of the best matching unit is computed. The steps are repeated until all the neurons are trained.
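To make the "best matching unit" step concrete, here is a minimal SOM-style update for mixed data, using Euclidean distance on the numeric part and simple mismatch on the categorical part. The blending weight, learning rate, and categorical update rule are illustrative assumptions, not parameters from [53]:

import numpy as np

def mixed_distance(x_num, x_cat, w_num, w_cat, gamma=0.5):
    """Euclidean distance on numeric parts plus gamma per categorical mismatch."""
    num = np.linalg.norm(np.asarray(x_num) - np.asarray(w_num))
    cat = sum(a != b for a, b in zip(x_cat, w_cat))
    return num + gamma * cat

def som_step(neurons, x_num, x_cat, lr=0.3):
    """One SOM-style step: find the best matching unit (BMU) and move its
    numeric weights toward the input; categorical weights adopt the input
    value (a crude stand-in for distance-hierarchy updates)."""
    bmu = min(range(len(neurons)),
              key=lambda i: mixed_distance(x_num, x_cat, *neurons[i]))
    w_num, w_cat = neurons[bmu]
    neurons[bmu] = (w_num + lr * (np.asarray(x_num) - w_num), list(x_cat))
    return bmu

neurons = [(np.array([0.0, 0.0]), ["red"]), (np.array([5.0, 5.0]), ["blue"])]
print(som_step(neurons, [0.2, 0.1], ["red"]))  # BMU 0; its weights move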

TABLE I. COMPARISON CHARACTERISTICS OF CONCEPTUAL CLUSTERING ALGORITHMS FOR MIXED DATASETS

Algorithm            | Scalability | Shape of Cluster | Sensitivity to Noise/Outliers | High-Dimensional Data | Input Data in Any Order | Depends on Prior Knowledge and User-Defined Parameters | Interpretation of Results
K-prototypes         | Yes         | Convex           | Yes | Yes | Yes | Yes | Yes
KMCMD                | Yes         | Convex           | No  | Yes | Yes | Yes | Yes
K-centers            | Yes         | Convex           | Yes | Yes | Yes | Yes | Yes
Improved K-prototype | Yes         | Convex           | No  | Yes | Yes | Yes | Yes
KHMCMD               | Yes         | Convex           | No  | Yes | Yes | Yes | Yes
DH                   | Yes         | Arbitrary        | No  | Yes | Yes | Yes | Yes
SBAC                 | No          | Arbitrary        | No  | No  | Yes | Yes | Yes
TMCM                 | No          | Arbitrary        | Yes | No  | Yes | Yes | Yes
M-ART                | Yes         | Arbitrary        | No  | No  | Yes | Yes | Yes
CAVE                 | No          | Arbitrary        | No  | No  | Yes | Yes | Yes
MSOINN               | No          | Arbitrary        | No  | Yes | Yes | Yes | Yes
BILCOM               | No          | Arbitrary        | Yes | No  | No  | Yes | Yes
AUTOCLASS            | No          | Arbitrary        | No  | No  | Yes | Yes | Yes
SVM Clustering       | Yes         | Arbitrary        | Yes | Yes | Yes | Yes | Yes
GFCM                 | No          | Convex           | Yes | No  | Yes | Yes | Yes
KL-FCM-GM            | Yes         | Arbitrary        | No  | Yes | Yes | Yes | Yes
Fuzzy K-means        | Yes         | Arbitrary        | Yes | Yes | Yes | Yes | Yes
Fuzzy K-prototype    | Yes         | Arbitrary        | No  | Yes | Yes | Yes | Yes
MixSOM               | Yes         | Arbitrary        | No  | Yes | Yes | Yes | Yes
GMixSOM              | Yes         | Arbitrary        | No  | Yes | Yes | Yes | Yes
FMSOM                | Yes         | Arbitrary        | No  | Yes | Yes | Yes | Yes
UFLA                 | Yes         | Arbitrary        | No  | Yes | Yes | Yes | Yes

The Frequency neuron Mixed Self-Organizing Map (FMSOM) algorithm [54] is an artificial neural network clustering algorithm. The algorithm processes categorical data attributes directly, without any conversion procedure. The algorithm is designed to address the issues present in the existing algorithms GSOM [49], MixSOM [50], CPrSOM [55], and NCSOM [56]. FMSOM is based on the concept of NCSOM, but includes the likelihood tables from CPrSOM. FMSOM has the ability to train the neurons in an effective and precise manner, and, unlike NCSOM, the algorithm converges after a finite number of steps. FMSOM constructs a novel prototype to handle categorical data attributes or mixed datasets. Initially, the dataset is given as input, and the number of iterations, the size of the map, the radius, and the neighborhood degeneration are initialized. The SOM topology is created and the reference vectors are initialized to random values. The FMSOM algorithm consists of three phases. In the competitive phase, the dissimilarity of the numerical attributes is calculated with the classic SOM measure, the Euclidean distance, and the dissimilarity of the categorical attributes is calculated with a probability-based measure. In the second phase, the cooperative process begins, depending on the computation of the winning neuron: the Gaussian neighborhood of the winning neuron is computed and updated. In the third phase, the adaptation process takes place: the updates of the neurons' weight vectors are calculated for the mixed data attributes. The clustering process terminates after a finite number of iterations.


The Unsupervised Feature Learning with Fuzzy ART (UFLA) algorithm [57] is an artificial neural network clustering algorithm. The numerical and categorical features are represented in the form of sparse features. Initially, data pre-processing is performed for missing values, interval data, and multi-value data: binary values are assigned to the categorical data and the numeric data are normalized. The number of clusters is predetermined, and weights are defined for each cluster. The algorithm clusters the entire dataset to produce the model, which is passed to the feature encoder; the feature encoder encodes the mixed data types into a sparse representation. After that, the clustering is performed by a traditional clustering algorithm. The distance measure is calculated and the clusters are updated until the stopping criterion is met.

TABLE I. (CONTINUED)

Algorithm            | Type of Cluster | Data Structure | Representation of Cluster | Time Complexity
K-prototypes         | Set             | Dynamic        | Disjoint | O((s+1)cn)
KMCMD                | Set             | Dynamic        | Disjoint | O(a^2n + a^2C^3 + sn(ct_n + ct_cC))
K-centers            | Set             | Incremental    | Fuzzy    | O(n)
Improved K-prototype | Set             | Dynamic        | Disjoint | O(c(t + t_n + St - St_n)ns)
KHMCMD               | Set             | Incremental    | Fuzzy    | O(t^2n + m^2C^3 + sn(ct_n + st_cP))
DH                   | Hierarchical    | Incremental    | Disjoint | O(n^2)
SBAC                 | Hierarchical    | Dynamic        | Disjoint | O(n^2)
TMCM                 | Hierarchical    | Dynamic        | Disjoint | O(n^2)
M-ART                | Hierarchical    | Incremental    | Disjoint | O(R * D * O * DI)
CAVE                 | Hierarchical    | Incremental    | Disjoint | O(N^2)
MSOINN               | Hierarchical    | Incremental    | Disjoint | O(N)
BILCOM               | Set             | Static         | Disjoint | O(n^2)
AUTOCLASS            | Set             | Static         | Disjoint | O(cd^2ns)
SVM Clustering       | Set             | Static         | Disjoint | O(n)
GFCM                 | Hierarchical    | Dynamic        | Fuzzy    | O(n)
KL-FCM-GM            | Hierarchical    | Dynamic        | Fuzzy    | O(n)
Fuzzy K-means        | Hierarchical    | Dynamic        | Fuzzy    | O(s(t_n + S^2cs + nc + nct_n + nct_cS))
Fuzzy K-prototype    | Hierarchical    | Dynamic        | Fuzzy    | O(t^2n + t^2S^3 + c(t + s + St - St_n)ns)
MixSOM               | Hierarchical    | Dynamic        | Fuzzy    | Type + Layer
GMixSOM              | Hierarchical    | Dynamic        | Fuzzy    | Type + Layer
FMSOM                | Hierarchical    | Dynamic        | Fuzzy    | Type + Layer
UFLA                 | Hierarchical    | Dynamic        | Fuzzy    | Type + Layer

s – no. of iterations; c – no. of clusters; n – no. of objects; a – total no. of attributes; C – average no. of distinct categorical values; P – no. of attribute values of categorical attributes; t – total no. of attributes; t_n – no. of numerical attributes; t_c – no. of categorical attributes; S – maximal no. of values for categorical attributes; R – training rounds; D – training data records; DI – data dimension; O – output neuron number; N – dataset size; d – dimensionality.


III. COMPARISON CHARACTERISTICS OF CLUSTERING ALGORITHMS FOR MIXED DATASETS

Table I shows the comparison characteristics of the clustering algorithms described in Section II.
IV. DISCUSSION: GAPS

Clustering algorithms have obtained considerable attention from researchers. Even though several realizations have been accomplished, there are still gaps that need to be filled. We review the gaps in conceptual clustering algorithms as follows:

• Many clustering algorithms are not able to handle mixed data attributes directly, which is necessary for today's real-time applications. Many algorithms transform one type of attribute into the other, which produces a loss of information. We need to develop efficient and accurate clustering algorithms that are able to handle both mixed attributes and missing data.

• Most of the algorithms produce disjoint clusters, but algorithms that produce fuzzy clusters with different membership degrees provide a better interpretation of the clusters. We need to develop research on building fuzzy clusters and generating concept descriptions that can accurately produce results.

• The hierarchical representation of clusters is computationally expensive, but the hierarchical structure provides necessary information for the user. We need to develop efficient clustering algorithms able to build hierarchies of clusters and concepts.

• Most of the algorithms result in poor clustering due to the addition, deletion, and modification of objects at run time. We need to develop efficient and accurate clustering algorithms that are able to process additions, deletions, and modifications of objects.

• Most of the fuzzy clustering algorithms provide good clustering. We need to develop efficient fuzzy clustering algorithms with characteristics such as dynamic behavior and easy interpretation of clusters.
V. CONCLUSION

One of the most essential properties of a clustering algorithm is to handle mixed datasets effectively. In this review paper, clustering algorithms for mixed datasets were discussed along with their limitations. After that, the comparison characteristics of all the algorithms were presented. Finally, some research gaps that need to be explored further were discussed.


REFERENCES

[1] M. Z. Islam, L. Brankovic, Privacy preserving data mining: a noise addition framework using a novel clustering technique. Knowledge-Based Systems, 2011. 24(8): p. 1214-1223. http://dx.doi.org/10.1016/j.knosys.2011.05.011
[2] G. Bordogna, G. Pasi, A quality driven Hierarchical Data Driven Soft Clustering for information retrieval. Knowledge-Based Systems, 2012. 26(1): p. 9-19. http://dx.doi.org/10.1016/j.knosys.2011.06.012
[3] W. Zhang, T. Yoshida, X. J. Tang, Q. Wang, Text clustering using frequent itemsets. Knowledge-Based Systems, 2011. 23(5): p. 379-388. http://dx.doi.org/10.1016/j.knosys.2010.01.011
[4] W. Chen, G. Feng, Spectral clustering: a semi-supervised approach. Neurocomputing, 2012. 77(1): p. 229-242.
[5] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2006.
[6] Z. Huang, Clustering large data sets with mixed numeric and categorical values. In: Proceedings of the First Pacific-Asia Knowledge Discovery and Data Mining Conference, World Scientific, Singapore, 1997.
[7] MacQueen J, Some methods for classification and analysis of multivariate observations. Proc. Fifth Berkeley Symp. Math. Stat. Probab., 1967. 1: p. 281-297.
[8] Z. X. Huang, Extensions to the K-means algorithm for clustering large datasets with categorical values. Data Mining and Knowledge Discovery, 1998. 2(3): p. 283-304.
[9] A. Ahmad, L. Dey, A K-mean clustering algorithm for mixed numeric and categorical data. Data and Knowledge Engineering, 2007. 63(2): p. 503-527.
[10] Wei-Dong Zhao, Wei-Hui Dai, and Chun-Bin Tang, K-Centers algorithm for clustering mixed type data. PAKDD, Springer-Verlag Berlin Heidelberg, 2007.
[11] J. Ji, T. Bai, C. Zhou, et al., An improved k-prototypes clustering algorithm for mixed numeric and categorical data. Neurocomputing, 2013. 120: p. 590-596.
[12] W. Kim, K. H. Lee, D. Lee, Fuzzy clustering of categorical data using fuzzy centroids. Pattern Recognition Letters, 2004. 25(11): p. 1263-1271.
[13] Z. X. Huang, M. K. Ng, H. Q. Rong, et al., Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005. 27(5): p. 657-668.
[14] Amir Ahmad, Sarosh Hashmi, K-Harmonic means type clustering algorithm for mixed datasets. Applied Soft Computing, 2016. http://dx.doi.org/10.1016/j.asoc.2016.06.019
[15] B. Zhang, Generalized K-Harmonic Means. Hewlett-Packard Laboratories Technical Report, 2000.
[16] Chung-Chian Hsu, Chin-Long Chen, Yu-Wei Su, Hierarchical clustering of mixed data based on distance hierarchy. Information Sciences, 2007. 177: p. 4474-4492. http://dx.doi.org/10.1016/j.ins.2007.05.003
[17] J. Han, Y. Fu, Dynamic generation and refinement of concept hierarchies for knowledge discovery in databases. In: Proceedings of the AAAI'94 Workshop on Knowledge Discovery in Databases (KDD'94), Seattle, 1994.
[18] J. Han, Y. Cai, N. Cercone, Data-driven discovery of quantitative rules in relational databases. IEEE Transactions on Knowledge and Data Engineering, 1993. 5: p. 29-40.
[19] C. Li, G. Biswas, Unsupervised learning with mixed numeric and nominal data. IEEE Transactions on Knowledge and Data Engineering, 2002. 14(4): p. 673-690.
[20] D. W. Goodall, A new similarity index based on probability. Biometrics, 1966. 22: p. 882-907.
[21] Ming-Yi Shih, Jar-Wen Jheng and Lien-Fu Lai, A two-step method for clustering mixed categorical and numeric data. Tamkang Journal of Science and Engineering, 2010. 13(1): p. 11-19.
[22] C. Hsu, Y. P. Huang, Incremental clustering of mixed data based on distance hierarchy. Expert Systems with Applications, 2008. 35(3): p. 1177-1185.
[23] Carpenter G, Grossberg S, Rosen D B, Fuzzy ART: fast stable learning and categorization of analog patterns by an adaptive resonance system. Neural Networks, 1991. 4: p. 759-771.
[24] Carpenter G, Grossberg S, ART 2: self-organization of stable category recognition codes for analog input patterns. Applied Optics: Special Issue on Neural Networks, 1987. 26: p. 4919-4930.
[25] C. C. Hsu, Y. C. Chen, Mining of mixed data with application to catalog marketing. Expert Systems with Applications, 2007. 32(1): p. 12-23.
[26] Fakhroddin Noorbehbahani, Sayyed Rasoul Mousavi, Abdolreza Mirzaei, An incremental mixed data clustering method using a new distance measure. Soft Computing, 2015. 19: p. 731-743.
[27] Shen F, Hasegawa O, A fast nearest neighbor classifier based on self-organizing incremental neural network. Neural Networks, 2008. 21(10): p. 1537-1547.
[28] B. Andreopoulos, A. An and X. Wang, Bi-level clustering of mixed categorical and numerical biomedical data. International Journal of Data Mining and Bioinformatics, 2006. 1(1): p. 19-56.
[29] B. Adryan and R. Schuh, Gene-Ontology-based clustering of gene expression data. Bioinformatics, 2004. 20(16): p. 2851-2852.
[30] R. Bellazzi and B. Zupan, Towards knowledge-based gene expression data mining. Journal of Biomedical Informatics, 2007. 40(6): p. 787-802.
[31] M. Brown, W. Grundy, D. Lin, et al., Knowledge-based analysis of microarray gene expression data by using support vector machines. PNAS, 2000. 97(1): p. 262-267.
[32] C. Pasquier, F. Girardot, K. Jevardat de Fombelle, et al., THEA: ontology-driven analysis of microarray data. Bioinformatics, 2004. 20: p. 2636-2643.
[33] Dwight S S, Harris M A, Dolinski K, et al., Saccharomyces Genome Database provides secondary gene annotation using the Gene Ontology. Nucleic Acids Research, 1999. 30: p. 69-72.
[34] Gene Ontology Consortium, Creating the Gene Ontology resource: design and implementation. Genome Research, 2001. 11: p. 1425-1433.
[35] Lord P W, Stevens R D, Brass A, et al., Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics, 2003. 19: p. 1275-1283.
[36] Eisen M B and Brown P O, DNA arrays for analysis of gene expression. Methods in Enzymology, 1999. 303: p. 179-205.
[37] Eisen M B, Spellman P T, Brown P O, et al., Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 1998. 95(25): p. 14863-14868.
[38] Slonim D K, Tamayo P, Mesirov J P, et al., Class prediction and discovery using gene expression data. In: Proceedings of the 4th International Conference on Computational Molecular Biology, Tokyo, Japan, 2000.
[39] Stutz J and Cheeseman P, Bayesian classification (AUTOCLASS): theory and results. Advances in Knowledge Discovery and Data Mining, 1995.
[40] S. Winters-Hilt and S. Merat, SVM clustering. BMC Bioinformatics, 2007.
[41] Mahnhoon Lee, Witold Pedrycz, The fuzzy C-means algorithm with fuzzy P-mode prototypes for clustering objects having mixed features. Fuzzy Sets and Systems, 2009.
[42] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York, 1981.
[43] D. W. Kim, K. H. Lee, D. Lee, Fuzzy clustering of categorical data using fuzzy centroids. Pattern Recognition Letters, 2004. 25: p. 1263-1271.
[44] S. P. Chatzis, A fuzzy c-means-type algorithm for clustering of data with mixed numeric and categorical attributes employing a probabilistic dissimilarity functional. Expert Systems with Applications, 2011. 38(7): p. 8684-8689.
[45] Gath I, Geva A B, Unsupervised optimal fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1989. 11(7): p. 773-781.
[46] El Sonbaty, Yasser, M. A. Ismail, Fuzzy clustering for symbolic data. IEEE Transactions on Fuzzy Systems, 1998.
[47] J. Ji, W. Pang, C. Zhou, et al., A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data. Knowledge-Based Systems, 2012. 30: p. 129-135.
[48] C. C. Hsu, S. H. Lin, Visualized analysis of mixed numeric and categorical data via extended self-organizing map. IEEE Transactions on Neural Networks and Learning Systems, 2012. 23: p. 72-86.
[49] C. C. Hsu, Generalizing self-organizing map for categorical data. IEEE Transactions on Neural Networks, 2006. 17(2): p. 294-304.
[50] H. Yin, ViSOM - a novel method for multivariate data projection and structure visualization. IEEE Transactions on Neural Networks, 2002. 13(1): p. 237-243.
[51] H. Yin, Data visualization and manifold mapping using the ViSOM. Neural Networks, 2002. 15(9): p. 1005-1016.
[52] Wei-Shen Tai, Chung-Chian Hsu, Growing Self-Organizing Map with cross insert for mixed-type data clustering. Applied Soft Computing, 2012. 12: p. 2856-2866.
[53] Carmelo del Coso, Diego Fustes, Carlos Dafonte, et al., Mixing numerical and categorical data in a Self-Organizing Map by means of frequency neurons. Applied Soft Computing, 2015. 36: p. 246-254.
[54] M. Lebbah, K. Benabdeslem, Visualization and clustering of categorical data with probabilistic self-organizing map. Neural Computing and Applications, 2010. 19: p. 393-404.
[55] N. Chen, N. C. Marques, An extension of self-organizing maps to categorical data. In: Proceedings of the 12th Portuguese Conference on Progress in Artificial Intelligence, EPIA, 2005.
[56] Dao Lam, Mingzhen Wei and Donald Wunsch, Clustering data of mixed categorical and numerical type with unsupervised feature learning. IEEE Access, 2015.

