Clustering High-Dimensional Data Derived From Feature Selection Algorithm.
Abstract:
Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to
many thousands of dimensions. Such high-dimensional data spaces are often encountered in areas such as
medicine, where DNA microarray technology can produce a large number of measurements at once, and in the
clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size
of the vocabulary. Feature selection is the process of identifying a subset of the most useful features that produces
results comparable to those obtained with the original entire set of features. A feature selection algorithm may be
evaluated from both the efficiency and effectiveness points of view: efficiency concerns the time required to find a
subset of features, while effectiveness relates to the quality of that subset. Based on these criteria, a fast
clustering-based feature selection algorithm (FAST) is proposed and experimentally evaluated. Because features in
different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of
producing a subset of useful and independent features.
Keywords:
high-dimensional data spaces, filter method, feature clustering, graph-based clustering, feature selection algorithm (FAST).
1. INTRODUCTION
2. RELATED WORKS
Zheng Zhao and Huan Liu, in Searching for Interacting Features, propose handling feature interaction explicitly in order to achieve efficient feature selection [13]. S. Swetha and A. Harpika, in A Novel Feature Subset Algorithm for High Dimensional Data, develop an algorithm that can efficiently deal with both irrelevant and redundant features and obtain a superior feature subset [14]. T. Jaga Priya Vathana, C. Saravanabhavan, and Dr. J. Vellingiri, in A Survey on Feature Selection Algorithm for High Dimensional Data Using Fuzzy Logic, propose a fuzzy-logic approach that focuses on minimizing redundancy in the data set and improving the accuracy of the feature subset [15]. Manoranjan Dash and Huan Liu, in Consistency-based Search in Feature Selection, focus on an inconsistency measure according to which a feature subset is inconsistent if there exist at least two instances with the same feature values but different class labels; they compare the inconsistency measure with other measures and study different search strategies, such as exhaustive, complete, heuristic, and random search, that can be applied to this measure [16]. Mr. M. Senthil Kumar, Ms. V.
3. CLUSTERING HIGH-DIMENSIONAL DATA
Several problems must be overcome when clustering in high-dimensional data [1]: the data are hard to visualize; clusters might exist in arbitrarily oriented subspaces; and, due to the growth of the number of possible subspaces with each added dimension, a complete enumeration of all subspaces becomes intractable.
4. FEATURE SELECTION
To remove irrelevant features and redundant features, the FAST [14] algorithm has two connected
components. Removing irrelevant features is straightforward once the right relevance measure is
defined or selected, while eliminating redundant features is somewhat more sophisticated. The
proposed FAST algorithm involves 1) the construction of a minimum spanning tree (MST) from a
weighted complete graph; 2) the partitioning of the MST into a forest, with each tree representing a
cluster; and 3) the selection of representative features from the clusters.
C. T-Relevance and F-Correlation Computation
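The detailed definitions from this subsection are not reproduced above. In the FAST literature, the T-Relevance of a feature Fi (its correlation with the target class C) and the F-Correlation between two features Fi and Fj are both commonly measured by symmetric uncertainty, SU(X, Y) = 2*I(X;Y)/(H(X)+H(Y)). The sketch below is one possible implementation under that assumption; it presumes discrete-valued features (continuous features would first be discretized).

    import numpy as np
    from collections import Counter

    def entropy(values):
        """Shannon entropy H(X) of a sequence of discrete values."""
        counts = np.array(list(Counter(values).values()), dtype=float)
        probs = counts / counts.sum()
        return -np.sum(probs * np.log2(probs))

    def symmetric_uncertainty(x, y):
        """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), normalised to [0, 1]."""
        h_x, h_y = entropy(x), entropy(y)
        h_xy = entropy(list(zip(x, y)))   # joint entropy H(X, Y)
        mutual_info = h_x + h_y - h_xy    # I(X; Y)
        denom = h_x + h_y
        return 2.0 * mutual_info / denom if denom > 0 else 0.0

Under this reading, T-Relevance is symmetric_uncertainty(Fi, C) and F-Correlation is symmetric_uncertainty(Fi, Fj).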
D. MST Construction
With the F-Correlation values computed above, the MST is constructed. An MST [12] is a sub-graph of a
weighted, connected, undirected graph: it is acyclic, connects all the nodes in the graph, and the sum of
the weights of all of its edges is minimal. That is, there is no other spanning tree, or sub-graph connecting
all the nodes, with a smaller sum of edge weights. If the weights of all the edges are unique, then the MST
is unique. Here the nodes of the graph represent the target-relevant features, and the edge weights are the
F-Correlation values between them.
The complete graph G reflects the correlations among all the target-relevant features. Unfortunately,
graph G has k vertices and k(k-1)/2 edges. For high-dimensional data it is very dense, and the edges with
different weights are strongly interwoven. Moreover, the decomposition of a complete graph is NP-hard.
Thus, for graph G, the minimum spanning tree is built instead of working with the complete graph directly.
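As a concrete illustration of this step, the following sketch builds the weighted complete graph over the target-relevant features, extracts its minimum spanning tree with SciPy, and cuts the tree into a forest whose connected components become the feature clusters. The function name cluster_features, the use of scipy.sparse.csgraph, and the edge-removal rule (dropping an edge whose F-Correlation is smaller than the T-Relevance of both endpoint features) are illustrative assumptions, since the subsection describing tree partitioning is not reproduced above.

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

    def cluster_features(X, y, relevant, su):
        """Group the target-relevant features into clusters via an MST.
        X: (n_samples, n_features) array; y: class labels;
        relevant: indices of the target-relevant features;
        su: function returning the symmetric uncertainty of two value sequences."""
        k = len(relevant)

        # Weighted complete graph: edge weight = F-Correlation SU(Fi, Fj).
        weights = np.zeros((k, k))
        for a in range(k):
            for b in range(a + 1, k):
                w = su(X[:, relevant[a]], X[:, relevant[b]])
                weights[a, b] = weights[b, a] = w

        # Minimum spanning tree of the complete graph: acyclic, spans all
        # k vertices, and has the smallest possible total edge weight.
        mst = minimum_spanning_tree(weights).toarray()

        # Partition the MST into a forest (assumed rule: drop an edge whose
        # F-Correlation is smaller than the T-Relevance of both endpoints).
        t_rel = np.array([su(X[:, i], y) for i in relevant])
        forest = mst.copy()
        for a, b in zip(*np.nonzero(forest)):
            if forest[a, b] < t_rel[a] and forest[a, b] < t_rel[b]:
                forest[a, b] = 0.0

        # Each connected component of the forest is one feature cluster.
        n_comp, labels = connected_components(forest, directed=False)
        return [[relevant[i] for i in range(k) if labels[i] == c]
                for c in range(n_comp)]

A representative feature, for instance the one with the highest T-Relevance, would then be chosen from each cluster to form the final subset.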
F. Classification
After the feature subset has been selected, the selected subset is classified using a probability-based
Naive Bayes classifier, which applies Bayes' theorem.
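A minimal sketch of this classification step is given below, using scikit-learn's Gaussian Naive Bayes; the choice of the Gaussian variant and of cross-validated accuracy as the evaluation measure are assumptions made for illustration.

    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score

    def evaluate_subset(X, y, selected_features):
        """Cross-validated accuracy of Naive Bayes on the selected feature columns."""
        X_subset = X[:, selected_features]   # keep only the chosen features
        scores = cross_val_score(GaussianNB(), X_subset, y, cv=5, scoring="accuracy")
        return scores.mean()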
5. CONCLUSION
In this paper, we have proposed an efficient FAST clustering-based feature subset selection algorithm
for high-dimensional data that improves the efficiency of finding a subset of features. The algorithm
involves 1) removing irrelevant features, 2) constructing a minimum spanning tree from the relevant
ones, and 3) partitioning the MST and selecting representative features. In the proposed algorithm, a
cluster consists of features; each cluster is treated as a single feature, so dimensionality is drastically
reduced and classification accuracy is improved.
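Putting the earlier sketches together (symmetric_uncertainty, cluster_features, and evaluate_subset from the preceding sections), a minimal end-to-end run of this pipeline might look as follows; the relevance threshold value is an illustrative assumption.

    def run_fast(X, y, relevance_threshold=0.05):
        # 1) Irrelevant-feature removal: keep features whose T-Relevance to the
        #    class exceeds a threshold (the threshold value is an assumption).
        relevant = [i for i in range(X.shape[1])
                    if symmetric_uncertainty(X[:, i], y) > relevance_threshold]

        # 2) + 3) Build the MST over the relevant features, partition it into
        #         clusters, and keep the most class-relevant feature per cluster.
        clusters = cluster_features(X, y, relevant, symmetric_uncertainty)
        selected = [max(c, key=lambda i: symmetric_uncertainty(X[:, i], y))
                    for c in clusters]

        # Evaluate the reduced representation with the Naive Bayes classifier.
        return selected, evaluate_subset(X, y, selected)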
REFERENCES
[1] Almuallim H. and Dietterich T.G., Algorithms for Identifying Relevant Features, In Proceedings of the 9th Canadian Conference on AI, pp 38-45, 1992.
[2] Almuallim H. and Dietterich T.G., Learning Boolean Concepts in the Presence of Many Irrelevant Features, Artificial Intelligence, 69(1-2), pp 279-305, 1994.
[3] Arauzo-Azofra A., Benitez J.M. and Castro J.L., A Feature Set Measure Based on Relief, In Proceedings of the Fifth International Conference on Recent Advances in Soft Computing, pp 104-109, 2004.
[4] Baker L.D. and McCallum A.K., Distributional Clustering of Words for Text Classification, In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp 96-103, 1998.
[5] Battiti R., Using Mutual Information for Selecting Features in Supervised Neural Net Learning, IEEE Transactions on Neural Networks, 5(4), pp 537-550, 1994.
[6] Bell D.A. and Wang H., A Formalism for Relevance and Its Application in Feature Subset Selection.