Resolving Gene Expression Data Using Multiobjective Optimization Approach
Toran Verma
Department of Computer Science & Engineering
Chhattisgarsh Swami Vivekananda University
Abstract
Data mining, also known as knowledge discovery in databases, has been recognized as a promising new area for database research. Studying the patterns hidden in gene expression data helps to understand the functionality of genes. In general, clustering techniques are widely used to identify partitionings of gene expression data. The work proposed in this paper optimizes the data with the fuzzy C-means clustering algorithm and a multi-objective optimization method, namely the Non-dominated Sorting Genetic Algorithm-2 (NSGA-2). In the first phase, clustering is used to reduce the number of comparisons: the fuzzy C-means algorithm is invoked on the data sets, and labelled information is extracted based on the highest membership values of the data points with respect to the different clusters. In each case, the class labels of only 10% of the data points are randomly selected to act as supervised information. In the second phase, a multi-objective genetic algorithm is applied to find the fitness function.
Keywords: ARI index, Coefficient Entropy index, Fuzzy C-means, Genetic Algorithm, Multiobjective Optimization, Partitioning Clustering, Non-dominated Sorting
_______________________________________________________________________________________________________
I. INTRODUCTION
A DNA segment that constitutes a gene is transcribed into a single-stranded sequence of RNA, called messenger RNA (mRNA). The mRNA is then translated into a sequence of amino acids, which finally becomes a protein after some modifications. In a biological experiment, gene expression values are measured at different time points. With the discovery of DNA microarray technology [1], it became possible to measure the expression levels of thousands of genes at a time. The application areas of microarray technology include gene expression profiling, medical diagnosis and biomedicine. Common problems regarding gene expression data are finding the data, data preprocessing and missing value estimation for DNA microarrays.
The multiobjective optimization method is used to optimize the given gene expression data. The approach has a large number of potential applications, such as pattern recognition, document classification and information retrieval, and is of great use in the medical and biomedical fields. A position represents an individual cluster, which contains several sub-clusters. Four objective functions, reflecting different measures of cluster quality, are then optimized simultaneously using the NSGA-2 genetic approach. Unsupervised information is captured by the first two objective functions, and the last is based on some supervised information. First, the popular fuzzy C-means (FCM) clustering method is applied to each data set. The two steps of FCM, evaluation of the fuzzy memberships and re-computation of the cluster centres, are executed repeatedly until there is no change in the cluster centres. The final cluster assignments are obtained by considering, for each data point, the cluster with the highest membership value.
The aim of data mining is to automatically or semi-automatically discover hidden knowledge, unexpected patterns and new rules from data. A variety of technologies are involved in the process of data mining, such as statistical analysis, modelling techniques and database technology. During the last ten years, data mining has undergone very fast development in both techniques and applications. Its typical applications include market segmentation, customer profiling, fraud detection, (electricity) load forecasting, credit risk analysis and so on. In the current post-genome age, understanding the floods of data in molecular biology brings great opportunities and big challenges to data mining researchers. Success stories from this new application will greatly benefit both the computer science and biology communities. We would like to call this discovering biological knowledge in silico by data mining.
II. SURVEY ON RELATED WORK
In semi-supervised clustering, a problem arises when some unsupervised information is used in the data set. It is overcome with the help of a distance measure such as the Euclidean or symmetry-based distance. In clustering, the compactness of the clusters is determined in terms of Euclidean distance, and cluster validity indices are used as internal and external measures. Initially, the cluster seeds at different points are put into different clusters using a distance function. A variety of clustering methods and validity techniques are present in the literature. To bring out the characteristics of gene expression, optimization has to be performed properly. Some works based on clustering techniques and multi-objective optimization that have already been applied to gene expression data sets are discussed in this section.
In "Use of Semi-supervised Clustering and Feature Selection Techniques for Gene-Expression Data" [2015], Sriparna Saha, Abhay Kumar Alok and Asif Ekbal proposed a semi-supervised clustering technique, Semi-FeaClustMOO, which is demonstrated on five gene expression data sets. A modern simulated annealing based multiobjective optimization technique, namely AMOSA, is utilized. The features and cluster centres are encoded in the form of a string, and the assignment is based on symmetry distance. Feature selection is the process used to reduce the dimensionality. Feature selection and clustering are optimized together by the proposed MOO technique, Semi-FeaClustMOO. Encoding of strings and initialization of the archive: each AMOSA string comprises two parts, (a) a set of real numbers and (b) a set of binary numbers. Here six objective functions are used and are simultaneously optimized by AMOSA. The first four, XB-index, FCM-index, I-index and Sym-index, are internal cluster validity indices; they depend on the Euclidean and point-symmetry distances. After application of any MOO based technique, the final Pareto optimal front contains a large collection of non-dominated solutions.
In "Mining for optimised data using clustering along with fuzzy association rules and genetic algorithms" [2014], G.V.S.N.R.V. Prasad, Y. Dhanalakshmi, V. Vihaya Kumar and I. Ramesh Babu proposed optimizing the data with clustering and fuzzy association rules using multi-objective genetic algorithms. The algorithm is implemented in two phases. In the first phase, it optimizes the data to reduce the number of comparisons using clustering. In the second phase, multi-objective genetic algorithms are used to find the optimum number of fuzzy association rules using a threshold value and a fitness function. The degree of membership of each value of ik in any of the fuzzy sets specified for ik is obtained directly by evaluating the membership function of that fuzzy set with the value of ik as input. A fuzzy association rule is expressed as: If Q = {u1, u2, ..., up} is F1 = {f1, f2, ..., fp} then R = {v1, v2, ..., vq} is F2 = {g1, g2, ..., gq}, where Q and R are disjoint sets of attributes called item sets. For a rule to be interesting, it should have sufficient support and a high confidence value, both larger than user-specified thresholds.
...SUPPORT...
[Figure: Pareto-optimal (non-dominated) front, with regions labelled A (indifferent), B (worse), C (indifferent) and D (better), and the dominant solution marked.]
...CONFIDENCE...
comparison, a quality metric is defined, which is replaced with the number of non-dominated solutions (N). However, the performance of AMOSA is not satisfactory in terms of the spacing metric in comparison with the constraint method.
[Figure: Flow of the AMOSA-based approach. Blocks: gene data sets (Yeast Sporulation, Arabidopsis Thaliana and Yeast Cell); clustering (C1, C2, C3, ..., CK); validity indices (ARI index, Xie-Beni index); objective functions (Sym-index, I-index, FCM-index); optimization (AMOSA algorithm).]
Under the multiobjective framework, a new semi-supervised clustering technique, Semi-gene ClustMOO, has been developed. To obtain the true partitioning, five objective functions are optimized simultaneously. The first four objective functions are internal cluster validity indices: Sym-index, I-index, XB-index and FCM-index. The fifth, ARI, is an external cluster validity index based on supervised information. Here, prior information is taken for only 10% of the whole data set. The data set can be represented as a matrix of G × I dimensions, where G is the number of genes and I is the number of individuals. The aim is to cluster together genes whose expression is highly correlated across all the individuals. The final result is a statistically based grouping of genes into clusters from which the individual gene IDs can be recovered.
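The extraction of supervised information can be illustrated with a minimal Python sketch. It assumes that fuzzy memberships are stored as one row per gene and one column per cluster; the function names `hard_labels` and `sample_supervised`, the `seed` parameter and the rounding rule are illustrative assumptions, not part of the original method.

```python
import random


def hard_labels(membership):
    """Assign each gene to the cluster in which it has the highest membership.

    membership[j][i] is the (assumed) membership of gene j in cluster i.
    """
    return [row.index(max(row)) for row in membership]


def sample_supervised(labels, fraction=0.10, seed=0):
    """Randomly pick a fraction of the genes and return {gene index: label}.

    The default fraction of 0.10 mirrors the 10% of labelled data mentioned
    in the text; the seed only makes this sketch reproducible.
    """
    rng = random.Random(seed)
    count = max(1, round(fraction * len(labels)))
    chosen = rng.sample(range(len(labels)), count)
    return {j: labels[j] for j in sorted(chosen)}
```

For a data set of 20 genes, `sample_supervised` would keep the labels of 2 randomly chosen genes; the remaining genes are treated as unlabelled.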
Finding co-regulated genes is therefore quite a difficult task, and straightforward clustering such as k-means is, according to the base papers, not very productive on expression data. K-means is by far the most commonly used clustering method. Its drawback when clustering large amounts of gene expression data is the lack of associated statistics and of statistically informed decision making, for example when choosing the number of clusters for the partitioning.
Cluster analysis, also called segmentation analysis or taxonomy analysis, creates groups, or clusters, of data. Clusters are
formed in such a way that objects in the same cluster are very similar and objects in different clusters are very distinct. Measures
of similarity depend on the application. K-means clustering is a partitioning method. The function k-means partitions data into k
mutually exclusive clusters, and returns the index of the cluster to which it has assigned each observation. Unlike hierarchical
clustering, k-means clustering operates on actual observations (rather than the larger set of dissimilarity measures), and creates a
single level of clusters. These distinctions mean that k-means clustering is often more suitable than hierarchical clustering for large
amounts of data. From the silhouette plot [14], one can see that most points in the second cluster have a large silhouette value,
greater than 0.6, indicating that the cluster is somewhat separated from neighbouring clusters. However, the third cluster contains
many points with low silhouette values, and the first contains a few points with negative values, indicating that those two clusters
are not well separated.
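The silhouette values discussed above can be computed as in the following minimal sketch. It assumes Euclidean distance, at least two clusters, and points given as coordinate tuples; the function name `silhouette_values` is hypothetical and this is not the MATLAB routine used to produce the plot.

```python
import math


def silhouette_values(points, labels):
    """Return the silhouette value of every point (assumes >= 2 clusters)."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    values = []
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            # Convention: a point alone in its cluster gets silhouette 0.
            values.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, points[j]) for j in members) / len(members)
            for label, members in clusters.items()
            if label != labels[i]
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values
```

Values close to 1 indicate a point that sits well inside its cluster, while negative values indicate a point that is on average closer to another cluster.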
[Figure: Flow of the proposed approach. Blocks: clustering (C1, C2, C3, ..., CK); validity indices used as objective functions (ARI index, Xie-Beni index, P index, Coefficient Entropy index); optimization (NSGA-2 algorithm).]
3) Clustering (Fuzzy C-Means clustering algorithm): Clustering and gene selection are good examples of meaningful mining strategies needed for the analysis of any kind of information. Clustering is often described as an unsupervised learning process. It groups the items into clusters C1, C2, ..., CK on the basis of similar and dissimilar patterns. Cluster validity is the process of assessing the clusters obtained. Many clustering algorithms have been proposed in the literature. In clustering, the compactness of the clusters is determined in terms of Euclidean distance, and cluster validity indices are used as internal and external measures. Initially, the cluster seeds at different points are put into different clusters using a distance function. We now discuss briefly the FCM algorithm used in this work; it is available as a built-in function in MATLAB and is easy to understand. Fuzzy C-Means clustering (FCM) is based on a single objective function, which should be minimized, as follows:
$$C = \sum_{c=1}^{n} \sum_{i=1}^{I} X_{ic}^{Y} \, d^{2}(V_c, U_i) \tag{1}$$
where C is the compactness of the fuzzy partition, determined using the Euclidean distance; n is the total number of genes, which depends on the size of the gene expression data (474, 384 and 138, respectively, for the data sets); I is the total number of clusters; X is the fuzzy membership and Y is the fuzzy exponent; V_c represents the cth gene and U_i is the centre of the ith cluster; and d(V_c, U_i) is the distance between V_c and U_i. First, the FCM algorithm randomly selects the initial cluster centres. It then calculates the fuzzy membership value of each gene. Finally, the two cluster centres found by the fcm function are plotted. The large characters in the plot indicate the cluster centres:
Fig. 7. Each time it is run, the fcm function starts from different initial conditions. This behaviour can change the order in which the cluster centres are computed and plotted.
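The two alternating FCM steps (membership evaluation and re-computation of the cluster centres) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not MATLAB's fcm function: the names `memberships_for` and `fcm`, the default fuzzy exponent of 2, the stopping tolerance and the seeded random initialization are all assumptions made for the example.

```python
import math
import random


def memberships_for(points, centres, exponent):
    """Fuzzy membership of every point in every cluster (each row sums to 1)."""
    power = 2.0 / (exponent - 1.0)
    rows = []
    for p in points:
        dists = [math.dist(p, c) for c in centres]
        if min(dists) == 0.0:
            # The point coincides with one or more centres: share membership among them.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            rows.append([h / sum(hits) for h in hits])
        else:
            rows.append(
                [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]
            )
    return rows


def fcm(points, n_clusters, exponent=2.0, max_iter=100, tol=1e-6, seed=0):
    """Alternate membership and centre updates until the centres stop moving."""
    rng = random.Random(seed)
    # Assumed initialization: pick distinct data points as the starting centres.
    centres = [list(p) for p in rng.sample(points, n_clusters)]
    dim = len(points[0])
    for _ in range(max_iter):
        u = memberships_for(points, centres, exponent)
        new_centres = []
        for i in range(n_clusters):
            weights = [row[i] ** exponent for row in u]
            total = sum(weights)
            new_centres.append(
                [sum(w * p[k] for w, p in zip(weights, points)) / total for k in range(dim)]
            )
        shift = max(math.dist(a, b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift < tol:
            break
    return centres, memberships_for(points, centres, exponent)
```

Because the starting centres are drawn at random, a different seed may return the same centres in a different order, which matches the behaviour noted for the plot.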
4) Xie-Beni Index (XB)
The XB index is used to determine the compactness and separability of clusters. Equation (2) gives the ratio of cluster compactness to cluster separation. Criterion: the compactness term should be minimal and the separation term maximal, so a lower XB value indicates a better partitioning.
$$XB = \frac{\sum_{i=1}^{I} \sum_{j=1}^{n} u_{ij}^{2} \, \lVert x_j - C_i \rVert^{2}}{n \cdot \min_{i \neq k} \lVert C_i - C_k \rVert^{2}} \tag{2}$$
where I = total number of clusters, n = total number of data points, u_ij = 1 if the jth data point belongs to the ith cluster and 0 if it does not, x_j = jth data point, and C_i = centre of the ith cluster.
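The XB index with crisp (0/1) memberships, as defined above, can be evaluated with the following minimal sketch. The function name `xie_beni` and the representation of the partition as a list of cluster indices are assumptions made for illustration.

```python
import math


def xie_beni(points, centres, labels):
    """Xie-Beni index for a crisp partition; lower values are better.

    labels[j] is the index of the centre to which point j is assigned.
    Requires at least two distinct centres.
    """
    n = len(points)
    # Compactness: summed squared distance of each point to its own centre.
    compactness = sum(
        math.dist(p, centres[label]) ** 2 for p, label in zip(points, labels)
    )
    # Separation: smallest squared distance between any two centres.
    separation = min(
        math.dist(centres[i], centres[k]) ** 2
        for i in range(len(centres))
        for k in range(i + 1, len(centres))
    )
    return compactness / (n * separation)
```

As a worked example, four points at unit distance from their centres, with the two centres 10 units apart, give XB = 4 / (4 × 100) = 0.01.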
5) Adjusted Rand Index (ARI)
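The ARI takes values in the range [-1, +1] and is computed as follows:
$$ARI = \frac{2(ad - bc)}{(a + b)(b + d) + (a + c)(c + d)} \tag{3}$$
where, over all pairs of data points, a is the number of pairs placed in the same cluster in both the true and the obtained solution, b is the number of pairs placed in the same cluster in the true solution only, c is the number of pairs placed in the same cluster in the obtained solution only, and d is the number of pairs placed in different clusters in both solutions.
It shows the agreement between the obtained cluster solution and the true solution based on supervised information.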
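A minimal sketch of the ARI in its pair-counting form is given below. It enumerates all pairs of points, which is quadratic in the number of points and is meant only as an illustration; the function name `adjusted_rand_index` and the convention of returning 1.0 when the denominator vanishes (both partitions trivially identical) are assumptions.

```python
from itertools import combinations


def adjusted_rand_index(true_labels, obtained_labels):
    """ARI from pair counts: a (same/same), b (same/diff), c (diff/same), d (diff/diff)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_obtained = obtained_labels[i] == obtained_labels[j]
        if same_true and same_obtained:
            a += 1
        elif same_true:
            b += 1
        elif same_obtained:
            c += 1
        else:
            d += 1
    denom = (a + b) * (b + d) + (a + c) * (c + d)
    # Assumed convention: a zero denominator only occurs for identical trivial partitions.
    return 2.0 * (a * d - b * c) / denom if denom else 1.0
```

The index depends only on which points are grouped together, so relabelling the clusters (for example swapping labels 0 and 1) leaves the value unchanged.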
6) Dunn Index
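The Dunn index is a metric for evaluating clustering algorithms. It belongs to a group of validity indices that also includes the Davies-Bouldin index and the Silhouette index. It is an internal evaluation scheme: the outcome is based on the clustered data itself.
$$D = \frac{\min_{i \neq j} \delta(C_i, C_j)}{\max_{k} \Delta(C_k)} \tag{4}$$
where δ(C_i, C_j) is the distance between clusters C_i and C_j and Δ(C_k) is the diameter of cluster C_k.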
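The Dunn index can be sketched as follows. Several definitions of inter-cluster distance and cluster diameter exist; this sketch assumes the smallest pairwise Euclidean distance between points of different clusters and the largest pairwise distance within a cluster. The function name `dunn_index` is hypothetical.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Dunn index: smallest between-cluster distance over largest cluster diameter.

    Assumes at least two clusters and at least one cluster with two or more
    distinct points; higher values indicate compact, well separated clusters.
    """
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)

    min_between = min(
        math.dist(p, q)
        for members_a, members_b in combinations(clusters.values(), 2)
        for p in members_a
        for q in members_b
    )
    max_diameter = max(
        math.dist(p, q)
        for members in clusters.values()
        for p, q in combinations(members, 2)
    )
    return min_between / max_diameter
```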
The proposed method compares favourably with other clustering methods essentially because it is a multiobjective clustering method. The simultaneous optimization of several cluster validity measures helps to cope with different characteristics of the data and leads to high-quality solutions for data with different properties. Secondly, the power of supervised learning has been integrated effectively with multiobjective clustering. Moreover, each solution in the non-dominated set carries some information about the clustering structure of the data set. Finally, the incorporation of fuzziness makes the proposed technique better equipped to handle overlapping clusters.
Genetic operators: in selection, members of the population are chosen, and the selected individuals are combined with each other to form offspring. Fitness-proportionate selection can be carried out via the roulette wheel algorithm. In rank selection [9], the individuals are sorted according to their fitness.
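The two selection schemes mentioned above can be sketched as follows. The sketch assumes non-negative fitness values with a positive sum, and that a larger fitness is better; the function names `roulette_wheel_select` and `rank_order` are illustrative and are not taken from the original implementation.

```python
import random


def roulette_wheel_select(fitness, rng):
    """Return the index of one individual, chosen with probability proportional to fitness."""
    total = sum(fitness)
    pick = rng.random() * total  # uniform in [0, total)
    running = 0.0
    for index, value in enumerate(fitness):
        running += value
        if pick < running:
            return index
    return len(fitness) - 1  # guard against floating-point round-off


def rank_order(fitness):
    """Return the indices of the individuals sorted from best to worst fitness."""
    return sorted(range(len(fitness)), key=lambda i: fitness[i], reverse=True)
```

A typical use would create one generator with `rng = random.Random(0)` and call `roulette_wheel_select(fitness, rng)` once per parent to be selected.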
[Figure: Flow of the proposed NSGA-2 based approach. Blocks: data sets; clustering (FCM); evaluation of objective functions; finding the fitness function; ranking of the population (rank 1, rank 2, rank 3, ...); selection, crossover and mutation of parents to produce children; combination of child and parent populations; elitism (selection of N individuals from the combined, ranked population); optimal centroid.]
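The ranking of a population into non-dominated fronts (rank 1, rank 2, ...) can be sketched as below. This is a simple, quadratic-per-front version written for clarity, not the bookkeeping-based fast procedure of NSGA-II; it assumes that all objectives are to be minimized, and the names `dominates` and `non_dominated_sort` are illustrative.

```python
def dominates(u, v):
    """True if u is no worse than v in every objective and strictly better in at least one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def non_dominated_sort(objectives):
    """Split solution indices into fronts; fronts[0] is the non-dominated (rank 1) set."""
    remaining = set(range(len(objectives)))
    fronts = []
    while remaining:
        # A solution belongs to the current front if no remaining solution dominates it.
        front = sorted(
            i
            for i in remaining
            if not any(
                dominates(objectives[j], objectives[i]) for j in remaining if j != i
            )
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts
```

In an elitist scheme, the next population of N individuals would be filled front by front from the combined parent and child population, starting with rank 1.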
Fig. 6: (a) Objective functions and crossover probability vs. number of iterations for the Yeast Sporulation data set
Fig. 7: Graphs (a) and (b) show the saturation level after some iterations. The total number of iterations is i = 100. In (a), optimized with only two objective functions, the curve saturates after 29 iterations, whereas in (b), optimized with four objective functions, it saturates after 90 iterations.
V. CONCLUSIONS
In this paper, a multiobjective optimization technique is proposed for obtaining a final solution from the set of non-dominated solutions produced by an NSGA-II based real-coded multiobjective fuzzy clustering scheme that optimizes the Xie-Beni (XB) index and Jm simultaneously. To meet new requirements, researchers in the field of bioinformatics are working on the development of new algorithms (mathematical formulas, statistical methods, etc.) and software tools for assessing relationships among large stored data sets. Three objective functions have been optimized simultaneously. To generate the labelled data, the Fuzzy C-means (FCM) clustering technique is utilized. The proposed MOO technique is then applied to solve the gene expression data clustering problem. Subsequently, results obtained on three real-life standard gene expression data sets have been explained. Lastly, the quality of the results is demonstrated in terms of fitness value.
REFERENCES
[1] M. B. Eisen, P. T. Spellman, P. O. Brown, D. Botstein, Cluster analysis and display of genome-wide expression patterns, In Proc. Nat. Academy of Sciences, USA, pp. 14863-14868, 1998.
[2] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York, NY: John Wiley and Sons, Inc., 1973.
[3] J. C. Bezdek, Pattern recognition with fuzzy objective function algorithm, New York, 1981.
[4] K. Deb, A. Pratap, S. Agrawal, and T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput., vol. 6, pp. 182-197, 2002.
[5] R. Agrawal and R. Srikant, Fast algorithms for mining association rules, in Proc. 20th VLDB Conf., Sept. 1994, pp. 478-499.
[6] R. Srikant and R. Agrawal, Mining quantitative association rules in large relational tables, in Proc. ACM SIGMOD Int. Conf. Management Data, 1996, pp. 1-12.
[7] J. Handl, J. Knowles, On semi-supervised clustering via multiobjective optimization, in: Genetic and Evolutionary Computation Conference, pp. 1465-1472, 2006.
[8] A. Ben-Dor et al., Clustering gene expression patterns, J. Comput. Biol., vol. 6, pp. 281-297, 1999.
[9] Nada M. A. Al Salami, Ant Colony Optimization Algorithm, UbiCC Journal, Volume 4, Number 3, August 2009.
[10] S. Kirkpatrick, C. Gelatt, and M. Vecchi, Optimization by simulated annealing, Science, (22), pp. 671-680, 1983.
[11] K. Deb, A. Pratap, S. Agrawal, and T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput., vol. 6, pp. 182-197, 2002.
[12] S. Bandyopadhyay, S. Saha, U. Maulik and K. Deb, A simulated annealing based multi-objective optimization algorithm: AMOSA, IEEE Trans. Evol. Comput., 12, pp. 269, 2008.
[13] U. Maulik, S. Bandyopadhyay, Fuzzy partitioning using a real coded variable length genetic algorithm for pixel classification, IEEE Transaction on Geoscience and Remote Sensing, 41(5), pp. 1075-1081, 2003.
[14] P. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, J. Comp. App. Math., (20), pp. 53-65, 1987.
[15] S. Bandyopadhyay, A. Mukhopadhyay, U. Maulik, An improved algorithm for clustering gene expression data, Bioinformatics, 23(21), pp. 2859-2865, 2007.
[16] C. Zhang, X. Lu, and X. Zhang, Significance of gene ranking for classification of microarray samples, IEEE/ACM Trans. on Com. Bio. and Bioinf., vol. 3, no. 3, pp. 312-320, July-September 2006.