IJIRST International Journal for Innovative Research in Science & Technology| Volume 3 | Issue 01 | June 2016

ISSN (online): 2349-6010

Resolving Gene Expression Data using Multiobjective Optimization Approach

Sushmita Chakraborty
Department of Computer Science & Engineering
Chhattisgarsh Swami Vivekananda University

Toran Verma
Department of Computer Science & Engineering
Chhattisgarsh Swami Vivekananda University

Abstract
Data mining, also known as knowledge discovery in databases, has been recognized as a promising new area of database research. Studying the patterns hidden in gene expression data helps in understanding the functionality of genes. In general, clustering techniques are widely used to identify partitions in gene expression data. The work proposed in this paper optimizes the data with the fuzzy C-means (FCM) clustering algorithm and a multi-objective optimization method, the Non-dominated Sorting Genetic Algorithm II (NSGA-II). In the first phase, clustering is used to reduce the number of comparisons. The fuzzy C-means algorithm is run on each data set, and labelled information is extracted from the data points with the highest membership values with respect to the different clusters. In each case, only 10% of the data points are randomly selected, together with their class labels, to act as supervised information. In the second phase, a multi-objective genetic algorithm is applied to find the fitness function.
Keywords: Adjusted Rand Index (ARI), Coefficient Entropy Index, Fuzzy C-means, Genetic Algorithm, Multiobjective Optimization, Partitioning Clustering, Non-dominated Sorting
I. INTRODUCTION

A DNA segment that constitutes a gene is transcribed into a single-stranded sequence of RNA called messenger RNA (mRNA). The mRNA is then translated into a sequence of amino acids, which finally becomes a protein after some modifications. In a biological experiment, gene expression values are measured at different time points. With the advent of DNA microarray technology [1], it became possible to inspect the expression levels of thousands of genes at a time. The application areas of microarray technology include gene expression profiling, medical diagnosis and biomedicine. Common problems regarding gene expression data are locating the data, data preprocessing, and missing value estimation for DNA microarrays.

The multiobjective optimization method is used to optimize the given gene expression data. The technique has a large number of potential applications, such as pattern recognition, document classification and information retrieval, as well as in the medical and biomedical fields. Each position represents an individual cluster, which may contain several sub-clusters. Four objective functions, each reflecting a different cluster validity measure, are optimized simultaneously using the NSGA-2 genetic approach. The first two objective functions capture unsupervised information, and the last one uses some supervised information. First, the popular fuzzy c-means (FCM) clustering method is applied to each individual data set. The two steps of FCM, evaluation of the fuzzy memberships and re-computation of the cluster centres, are executed repeatedly until there is no change in the cluster centres. The final assignment is obtained by placing each data point in a cluster based on its membership values.

Fig. 1: A scheme of mRNA in gene transcription and protein translation

All rights reserved by www.ijirst.org

205

Resolving Gene Expression Data using Multi-objective Optimization Approach

(IJIRST/ Volume 3 / Issue 01/ 034)

The aim of data mining is to automatically or semi-automatically discover hidden knowledge, unexpected patterns and new rules from data. A variety of technologies are involved in the process of data mining, such as statistical analysis, modelling techniques and database technology. Over the last ten years, data mining has undergone very fast development in both techniques and applications. Its typical applications include market segmentation, customer profiling, fraud detection, (electricity) load forecasting, credit risk analysis and so on. In the current post-genome age, understanding the floods of data in molecular biology brings great opportunities and big challenges to data mining researchers. Success stories from this new application will greatly benefit both the computer science and biology communities. We would like to call this discovering biological knowledge in silico by data mining.
II. SURVEY ON RELATED WORK
In semi-supervised clustering, a problem arises when some unsupervised information is used in the data set. A distance measure such as the Euclidean or symmetry-based distance helps to overcome this problem. In clustering, the compactness of the clusters is measured in terms of Euclidean distance, and both internal and external cluster validity indices are used. Initially, the cluster seeds at different points are put into different clusters using a distance function. A variety of clustering methods and validity techniques are present in the literature. To bring out the characteristics of gene expression, optimization has to be performed properly. This section discusses some of the literature on clustering techniques and multi-objective optimization approaches that have already been applied to gene expression data sets.

Use of Semi-supervised Clustering and Feature Selection Techniques for Gene-Expression Data [2015]: Sriparna Saha, Abhay Kumar Alok and Asif Ekbal proposed a semi-supervised clustering technique, Semi-FeaClustMOO, and demonstrated it on five gene expression data sets. A modern simulated annealing based multiobjective optimization technique, AMOSA, is utilized. The features and cluster centres are encoded in the form of a string, and the assignment of points is based on symmetry distance. Feature selection is the process of reducing dimensionality. Feature selection and clustering are optimized together by the proposed Semi-FeaClustMOO. For string encoding and archive initialization, AMOSA works with two kinds of items: a) a set of real numbers and b) a set of binary numbers. Six objective functions are used and optimized simultaneously by AMOSA. The first four, XB-index, FCM-index, I-index and Sym-index, are internal cluster validity indices that depend on the Euclidean and point-symmetry distances. After the application of any MOO-based technique, the final Pareto-optimal front contains a large collection of non-dominated solutions.

Mining for optimised data using clustering along with fuzzy association rules and genetic algorithms [2014]: G.V.S.N.R.V Prasad, Y. Dhanalakshmi, V. Vihaya Kumar and I. Ramesh Babu proposed optimizing the data with clustering and fuzzy association rules using multi-objective genetic algorithms. The algorithm is implemented in two phases. In the first phase, it optimizes the data to reduce the number of comparisons using clustering. In the second phase, multi-objective genetic algorithms are used to find the optimum number of fuzzy association rules using a threshold value and a fitness function. The degree of membership of each value of an attribute ik in any of the fuzzy sets specified for ik is obtained directly by evaluating the membership function of that fuzzy set with the given value of ik as input. A fuzzy association rule is expressed as: If Q = {u1, u2, ..., up} is F1 = {f1, f2, ..., fp} then R = {v1, v2, ..., vq} is F2 = {g1, g2, ..., gq}, where Q and R are disjoint sets of attributes called item sets. For a rule to be interesting, it should have enough support and a high confidence value, both larger than user-specified thresholds.

[Figure: plot with axes Support and Confidence. Labels: Pareto-optimal = non-dominant, Dominant, A (indifferent), B (worse), C (indifferent), D (better)]

Fig. 2: The concept of Pareto optimality
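The dominance relation behind Fig. 2 can be sketched in a few lines of Python. This is a minimal illustration, assuming that both objectives (here support and confidence) are to be maximized; the function names `dominates` and `pareto_front` and the sample values are hypothetical and are not taken from the surveyed papers.

```python
def dominates(a, b):
    """Return True if solution a dominates b (every objective is maximized)."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]


# Hypothetical (support, confidence) pairs for four candidate rules.
rules = [(0.2, 0.9), (0.5, 0.6), (0.4, 0.5), (0.8, 0.3)]
front = pareto_front(rules)
```

Here the rule (0.4, 0.5) is dominated by (0.5, 0.6), while the other three are mutually indifferent: each is better in one objective and worse in the other, so all three remain on the front.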

III. EXISTING SYSTEM

AMOSA [12] uses the concept of an archive, in which all the non-dominated solutions encountered during the execution are stored. Two limits are associated with the archive: a hard or strict limit, denoted by HL, and a soft limit, denoted by SL. For better comparison, a quality metric is defined, which replaces the metric based on the number of non-dominated solutions (N). However, the performance of AMOSA is not satisfactory in terms of the spacing metric when compared with the constraint method.
[Figure: flowchart. Gene data sets (Yeast Sporulation, Arabidopsis Thaliana and Yeast Cell Cycle) -> Clustering (Fuzzy C-Means algorithm, in order to generate labelled data) -> 10% of the data points are selected and used as the supervised information in the proposed semi-supervised clustering technique -> Validity indices / objective functions (ARI index, Xie-Beni index, Sym-index, I-index, FCM index) -> Optimization (AMOSA algorithm)]

Fig. 3: Existing Methodology

Under the multiobjective framework, a new semi-supervised clustering technique, Semi-gene ClustMOO, has been developed. To obtain the true partitioning, five objective functions are optimized simultaneously. The first four objective functions are internal cluster validity indices: Sym-index, I-index, XB-index and FCM index. The fifth, ARI, is an external cluster validity index based on supervised information. Here, the prior information covers only 10% of the whole data set. The data set can be represented as a matrix of G x I dimensions, where G is the number of genes and I is the number of individuals. The aim is to cluster together genes whose expression is highly correlated across all the individuals. The final result is a statistically based grouping of genes into clusters from which the individual gene IDs can be recovered.
Finding co-regulated genes is quite a difficult task, and according to the base papers, straightforward clustering such as k-means is not very productive on expression data. K-means is by far the most commonly used clustering method. Its drawbacks when clustering a huge amount of gene expression data are the lack of associated statistics and of statistically informed decision making, for example when picking the number of clusters for the partitioning.
Cluster analysis, also called segmentation analysis or taxonomy analysis, creates groups, or clusters, of data. Clusters are formed in such a way that objects in the same cluster are very similar and objects in different clusters are very distinct. Measures of similarity depend on the application. K-means clustering is a partitioning method. The k-means function partitions data into k mutually exclusive clusters and returns the index of the cluster to which it has assigned each observation. Unlike hierarchical clustering, k-means clustering operates on actual observations (rather than the larger set of dissimilarity measures) and creates a single level of clusters. These distinctions mean that k-means clustering is often more suitable than hierarchical clustering for large amounts of data. From the silhouette plot [14], it can be seen that most points in the second cluster have a large silhouette value, greater than 0.6, indicating that the cluster is somewhat separated from neighbouring clusters. However, the third cluster contains many points with low silhouette values, and the first contains a few points with negative values, indicating that those two clusters are not well separated.
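To make the silhouette values discussed above concrete, the following Python sketch computes the silhouette value s(i) = (b - a) / max(a, b) of every point, where a is the mean distance to the other points of its own cluster and b is the smallest mean distance to the points of any other cluster [14]. It is an illustration only: it assumes Euclidean distance and at least two clusters, and the function name and sample points are hypothetical.

```python
import math


def silhouette_values(points, labels):
    """Silhouette value s(i) = (b - a) / max(a, b) for every point."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # singleton cluster: s(i) is taken as 0
            values.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Hypothetical data: two compact, well separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = [0, 0, 1, 1]
values = silhouette_values(points, labels)
```

Values close to 1 indicate points that sit well inside their cluster, values near 0 indicate points on a boundary, and negative values indicate points that are probably assigned to the wrong cluster.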


IV. PROPOSED SYSTEM

A. Introduction:
This section presents a novel active clustering technique based on the NSGA-2 genetic algorithm, which is suitable for clustering gene expression data. The different components, such as the fuzzy c-means algorithm, gene expression data and the genetic algorithm, are also discussed in this section. Four objective functions are used as validity indices for gene expression data. In general, a clustering algorithm is used to discover natural patterns and to gather similar items from a given collection of gene expression data into one group. Multiobjective optimization is used to address the problem faced in semi-supervised classification. Consequently, we modelled the semi-supervised problem as a multiobjective optimization problem. There are several techniques to solve this problem, such as semi-cluster-MOO, Gene-cluster-MOO and AMOSA. The performance of the multiobjective sorting algorithm based semi-supervised clustering technique has been demonstrated on three publicly available gene expression data sets: Yeast Sporulation, Arabidopsis Thaliana and Yeast Cell Cycle. To assess its merit, the proposed technique is compared with MOO-based MOGA clustering, the FCM algorithm and a single-objective genetic algorithm (SGA). The goal is to obtain a minimum value of compactness and a maximum value of separability.
B. Proposed Approach:
The proposed work optimizes the data with FCM clustering using multi-objective genetic algorithms. The overall methodology takes place in two phases. In the first phase, the data set is processed using the fuzzy c-means clustering algorithm. In the second phase, a multi-objective optimization genetic algorithm, the non-dominated sorting genetic algorithm 2 (NSGA-2), is applied. This is very useful for finding the minimum threshold value and the fitness function. The block diagram of the whole working process is given in Fig. 4.
1) Gene Expression Data Set - The data sets were downloaded from the site http://anirbanmukhopadhyay.50webs.com. The data sets used in this project are Yeast Sporulation, Yeast Cell Cycle and Arabidopsis Thaliana. The data are log-transformed and can be easily downloaded from the above website.
2) Clustering Algorithms - Gene expression data can be analysed in two ways. For gene-based clustering, genes are treated as data objects, while samples are considered as features. Conversely, for sample-based clustering, samples serve as data objects to be clustered, while genes play the role of features. A third category of cluster analysis applied to gene expression data, subspace clustering, treats genes and samples symmetrically, so that either genes or samples can be regarded as objects or features. Gene-based, sample-based and subspace clustering face very different challenges, and different computational strategies are adopted for each situation.
[Figure: flowchart. Gene data sets (Yeast Sporulation, Arabidopsis Thaliana and Yeast Cell Cycle) -> Clustering (Fuzzy C-Means algorithm) -> from each cluster, the top 10% of data points having the highest membership values -> Validity indices / objective functions (ARI index, Xie-Beni index, P index, Coefficient Entropy index) -> Optimization (NSGA-2 algorithm)]

Fig. 4: Block diagram of proposed methodology
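The labelling step shown in Fig. 4 (keeping, for each cluster, the data points with the highest membership values) might look like the following Python sketch. It is an assumption-laden illustration rather than the authors' implementation: the function name `select_labelled_points`, the rounding rule (nearest integer, at least one point per cluster) and the sample membership matrix are all hypothetical.

```python
def select_labelled_points(memberships, fraction=0.10):
    """Return {point_index: cluster_label} for the top `fraction` of each cluster.

    memberships[j][i] is the fuzzy membership of point j in cluster i.
    """
    by_cluster = {}
    for idx, row in enumerate(memberships):
        # Hard label: the cluster in which the point has its highest membership.
        label = max(range(len(row)), key=lambda i: row[i])
        by_cluster.setdefault(label, []).append((row[label], idx))

    labelled = {}
    for label, members in by_cluster.items():
        members.sort(reverse=True)  # highest membership first
        # Assumed rounding rule: nearest integer, but never fewer than one point.
        keep = max(1, round(fraction * len(members)))
        for _, idx in members[:keep]:
            labelled[idx] = label
    return labelled


# Hypothetical membership matrix for five points and two clusters.
u = [[0.90, 0.10], [0.60, 0.40], [0.80, 0.20], [0.30, 0.70], [0.05, 0.95]]
labelled = select_labelled_points(u, fraction=0.5)
```

With `fraction=0.5`, the two most confident members of cluster 0 (points 0 and 2) and the most confident member of cluster 1 (point 4) are kept as supervised information; the paper's setting corresponds to `fraction=0.10`.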

3) Clustering (Fuzzy C-Means clustering algorithm) - Clustering and gene selection are good examples of meaningful mining strategies needed for the analysis of any kind of information. Clustering is often described as an unsupervised learning process. It groups the items into K clusters C1, C2, ..., CK on the basis of similar and dissimilar patterns. Cluster validity is the process of evaluating the obtained clusters. Many clustering algorithms have been proposed in the literature. In clustering, the compactness of the clusters is measured in terms of Euclidean distance, and both internal and external cluster validity indices are used. Initially, the cluster seeds at different points are put into different clusters using a distance function. We now briefly discuss the FCM algorithm used in this project; it is available as a built-in function in MATLAB and is easy to understand. Fuzzy C-Means clustering (FCM) is based on a single objective function, which should be minimized, as follows:

$$C = \sum_{c=1}^{N} \sum_{i=1}^{I} X_{ci}^{Y} \, D^{2}(V_c, U_i) \qquad (1)$$

where $C$ is the fuzzy compactness determined using the Euclidean distance; $N$ is the total number of genes, which depends on the volume of the gene expression data (474, 384 and 138 for the Yeast Sporulation, Yeast Cell Cycle and Arabidopsis Thaliana data sets, respectively); $I$ is the total number of clusters; $X_{ci}$ is the fuzzy membership of the cth gene in the ith cluster; $Y$ is the fuzzy exponent; $V_c$ represents the cth gene; $U_i$ is the centre of the ith cluster; and $D(V_c, U_i)$ is the distance between $V_c$ and $U_i$. First, the FCM algorithm randomly selects the initial cluster centres. It then calculates the fuzzy membership value of each gene. Finally, the cluster centres found by the fcm function are plotted, with large characters in the plot indicating the cluster centres. Every time it is run, the fcm function starts from different initial conditions, which can change the order in which the cluster centres are computed and plotted.
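The two alternating FCM steps (membership evaluation and re-computation of the cluster centres) can be sketched in plain Python as follows. This is a minimal illustration of the standard fuzzy c-means updates, not the MATLAB `fcm` routine used in the paper; the function names, the default fuzzy exponent `m=2.0`, the stopping tolerance, the fixed random seed and the toy data are all assumptions.

```python
import math
import random


def memberships(data, centres, m):
    """Fuzzy membership of every point in every cluster (each row sums to 1)."""
    power = 2.0 / (m - 1.0)
    table = []
    for x in data:
        # Clamp distances so a point lying exactly on a centre does not divide by zero.
        d = [max(math.dist(x, c), 1e-12) for c in centres]
        table.append([1.0 / sum((di / dk) ** power for dk in d) for di in d])
    return table


def update_centres(data, u, m):
    """New centre of each cluster: membership-weighted mean of the data."""
    dim = len(data[0])
    centres = []
    for i in range(len(u[0])):
        weights = [row[i] ** m for row in u]
        total = sum(weights)
        centres.append([sum(w * x[k] for w, x in zip(weights, data)) / total
                        for k in range(dim)])
    return centres


def fuzzy_c_means(data, n_clusters, m=2.0, max_iter=100, tol=1e-6, seed=0):
    rng = random.Random(seed)
    # Random initial centres, drawn from the data points themselves.
    centres = [list(p) for p in rng.sample(data, n_clusters)]
    for _ in range(max_iter):
        u = memberships(data, centres, m)
        new_centres = update_centres(data, u, m)
        shift = max(math.dist(a, b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift < tol:  # stop once the centres no longer move
            break
    return centres, memberships(data, centres, m)


# Hypothetical two-dimensional expression profiles forming two groups.
genes = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
centres, u = fuzzy_c_means(genes, n_clusters=2)
```

Because the initial centres are random, a different `seed` can return the same clusters in a different order, which matches the behaviour described above for the fcm function.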
4) Xie-Beni Index (XB)
The XB index is used to determine the compactness and separability of clusters. Equation (2) gives the ratio of cluster compactness to cluster separation. Criterion: the cluster compactness should be minimal and the cluster separability maximal, so a lower XB value indicates a better partitioning.

$$XB = \frac{\sum_{i=1}^{K} \sum_{j=1}^{n} \mu_{ij}^{2} \, \lVert x_j - C_i \rVert^{2}}{n \cdot \min_{i \neq k} \lVert C_i - C_k \rVert^{2}} \qquad (2)$$

where $K$ is the total number of clusters, $n$ is the total number of data points, $\mu_{ij} = 1$ if the jth data point $x_j$ belongs to the ith cluster and 0 otherwise, and $C_i$ is the centre of the ith cluster.
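As an illustration, the XB index can be computed as the membership-weighted sum of squared point-to-centre distances divided by the number of points times the smallest squared distance between two centres. The Python sketch below assumes Euclidean distance; the function name, argument layout (`memberships[j][i]` for point j and cluster i) and sample values are hypothetical.

```python
import math


def xie_beni(data, centres, memberships):
    """Xie-Beni index: compactness / (n * minimum squared centre separation)."""
    n = len(data)
    k = len(centres)
    # Compactness: membership-weighted squared distances of points to centres.
    compactness = sum(
        memberships[j][i] ** 2 * math.dist(data[j], centres[i]) ** 2
        for j in range(n) for i in range(k)
    )
    # Separation: squared distance between the two closest centres.
    separation = min(
        math.dist(centres[i], centres[h]) ** 2
        for i in range(k) for h in range(i + 1, k)
    )
    return compactness / (n * separation)


# Hypothetical crisp partition of four points into two clusters.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centres = [(0.0, 1.0), (10.0, 1.0)]
memberships = [[1, 0], [1, 0], [0, 1], [0, 1]]
xb = xie_beni(data, centres, memberships)
```

Each point lies at distance 1 from its own centre and the centres are 10 apart, so the compactness is 4 and the index is 4 / (4 x 100) = 0.01; tighter or better separated clusters would give an even smaller value.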
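5) Adjusted Rand Index (ARI)
The ARI takes values in the range [-1, +1] and is computed as follows:

$$ARI = \frac{2(ad - bc)}{(a + b)(b + d) + (a + c)(c + d)} \qquad (3)$$

where, counting over all pairs of data points, $a$ is the number of pairs placed in the same cluster in both the true solution and the obtained solution, $b$ is the number of pairs in the same cluster in the true solution but in different clusters in the obtained solution, $c$ is the number of pairs in different clusters in the true solution but in the same cluster in the obtained solution, and $d$ is the number of pairs in different clusters in both. The ARI shows the agreement between the obtained cluster solution and the true solution based on supervised information.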
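The ARI can be illustrated with the pair-counting form of the index, ARI = 2(ad - bc) / ((a + b)(b + d) + (a + c)(c + d)), where a, b, c and d count the pairs of points that are respectively together in both partitions, together only in the true partition, together only in the obtained partition, and separated in both. The Python sketch below is a minimal illustration; the function name is hypothetical, and returning 1.0 when the denominator is zero is an assumed convention for degenerate partitions.

```python
from itertools import combinations


def adjusted_rand_index(true_labels, pred_labels):
    """Pair-counting form of the Adjusted Rand Index."""
    a = b = c = d = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            a += 1          # together in both partitions
        elif same_true:
            b += 1          # together only in the true partition
        elif same_pred:
            c += 1          # together only in the obtained partition
        else:
            d += 1          # separated in both partitions
    denom = (a + b) * (b + d) + (a + c) * (c + d)
    # Assumed convention: degenerate partitions count as perfect agreement.
    return 1.0 if denom == 0 else 2.0 * (a * d - b * c) / denom


# Same grouping under different label names: perfect agreement.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Every true pair is split by the obtained partition.
poor = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

In the semi-supervised setting described in this paper, such a function would be evaluated only on the data points whose class labels are known.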
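6) Dunn Index
The Dunn index is a metric for evaluating clustering algorithms. It belongs to a group of validity indices that includes the Davies-Bouldin index and the Silhouette index. It is an internal evaluation method: the outcome is based on the clustered data itself.

$$DI = \frac{\min_{i \neq j} \delta(i, j)}{\max_{k} \Delta(k)} \qquad (4)$$

where $\delta(i, j)$ is the distance between clusters i and j, and $\Delta(k)$ is the diameter of cluster k. A higher value indicates compact, well-separated clusters.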
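A small Python sketch of the Dunn index follows. Several definitions of inter-cluster distance and cluster diameter are in use; this illustration assumes the smallest Euclidean distance between points of two different clusters as the separation and the largest distance between two points of the same cluster as the diameter. The function name and sample points are hypothetical.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Dunn index: minimum inter-cluster distance / maximum cluster diameter."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    # Largest distance between two points of the same cluster.
    max_diameter = max(
        (math.dist(p, q)
         for members in clusters.values()
         for p, q in combinations(members, 2)),
        default=0.0,
    )
    # Smallest distance between two points of different clusters.
    min_separation = min(
        math.dist(p, q)
        for (_, first), (_, second) in combinations(clusters.items(), 2)
        for p in first for q in second
    )
    # Raises ZeroDivisionError if every cluster is a single point.
    return min_separation / max_diameter


# Hypothetical data: two clusters of diameter 1 whose closest points are 5 apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
di = dunn_index(points, labels)
```

Larger values are better: the index grows when clusters move apart or become tighter.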

The proposed method compares favourably with other clustering methods mainly because it is a multiobjective clustering method. First, the simultaneous optimization of several cluster validity measures helps to deal with different characteristics of the data and leads to high-quality solutions for data with different properties. Secondly, the power of supervised learning has been integrated effectively with multiobjective clustering. Thirdly, each solution in the non-dominated set carries some information about the clustering structure of the data set. Finally, the incorporation of fuzziness makes the proposed technique better equipped to handle overlapping clusters.
The genetic operators are as follows. The selection method chooses the population members that will be combined with each other to form offspring. Fitness-proportionate selection is carried out via the roulette wheel algorithm. In rank selection [9], all individuals are sorted according to their fitness.
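The two selection schemes mentioned above can be sketched as follows. This is an illustrative Python sketch, assuming non-negative fitness values where higher is better; the function names and sample population are hypothetical. Rank selection is implemented here as a roulette wheel spun over the ranks instead of the raw fitness values, which is one common variant.

```python
import random


def roulette_wheel_select(population, fitness, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitness)
    pick = rng.uniform(0.0, total)
    running = 0.0
    for individual, fit in zip(population, fitness):
        running += fit
        if pick < running:
            return individual
    return population[-1]  # guard against floating-point round-off


def rank_select(population, fitness, rng):
    """Sort by fitness and select proportionally to rank (worst = 1, best = n)."""
    order = sorted(range(len(population)), key=lambda i: fitness[i])
    ranks = [0] * len(population)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return roulette_wheel_select(population, ranks, rng)


rng = random.Random(1)
population = ["a", "b", "c"]
# Only "c" has non-zero fitness, so the wheel always lands on it.
winner = roulette_wheel_select(population, [0.0, 0.0, 1.0], rng)
parent = rank_select(population, [0.1, 0.5, 0.2], rng)
```

Rank selection reduces the influence of a single very fit individual, because selection pressure depends only on the ordering of the fitness values and not on their magnitudes.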


[Figure: flowchart with the blocks Data sets, Clustering (FCM), Rank Population, Selection, Crossover, Mutation, Evaluate Objective Functions, Finding Fitness Function, Combine child & parent, Combined Population, Elitism (population sorted into rank 1, rank 2, rank 3, ...; select N individuals), Optimal Centroid and Final optimal clusters]

Fig. 5: Detailed proposed methodology and elitism process
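The elitism step of Fig. 5 (combine parents and children, rank the combined population, select N individuals) can be sketched in Python as follows. This is a simplified illustration in the style of NSGA-II [4], assuming that all objectives are minimized; the function names and objective values are hypothetical, and the ranking uses a straightforward repeated scan rather than the faster bookkeeping of the original algorithm.

```python
def dominates(a, b):
    """True if objective vector a dominates b (all objectives minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def non_dominated_sort(objs):
    """Split indices into fronts: rank 1 first, then rank 2, and so on."""
    remaining = set(range(len(objs)))
    fronts = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


def crowding_distance(objs, front):
    """Crowding distance of each index in a front (boundary points get infinity)."""
    dist = {i: 0.0 for i in front}
    for k in range(len(objs[front[0]])):
        ordered = sorted(front, key=lambda i: objs[i][k])
        lo, hi = objs[ordered[0]][k], objs[ordered[-1]][k]
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")
        if hi == lo:
            continue
        for prev, cur, nxt in zip(ordered, ordered[1:], ordered[2:]):
            dist[cur] += (objs[nxt][k] - objs[prev][k]) / (hi - lo)
    return dist


def select_next_generation(objs, n):
    """Elitist selection of n individuals from the combined population."""
    chosen = []
    for front in non_dominated_sort(objs):
        if len(chosen) + len(front) <= n:
            chosen.extend(front)  # the whole front fits
        else:
            # Fill the remaining slots with the least crowded individuals.
            dist = crowding_distance(objs, front)
            front = sorted(front, key=lambda i: dist[i], reverse=True)
            chosen.extend(front[: n - len(chosen)])
            break
    return chosen


# Hypothetical objective vectors of a combined parent + child population.
objs = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5), (2, 6)]
fronts = non_dominated_sort(objs)
survivors = select_next_generation(objs, 3)
```

For this toy population the first front contains individuals 0, 1 and 2, the second front contains 3 and 5, and individual 4 forms the third front, so selecting three survivors keeps exactly the first front.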

Fig. 6: (a) Objective functions and crossover probability versus number of iterations for the Yeast Sporulation data set


Fig. 7: Graphs (a) and (b) show the saturation level reached after a number of iterations. The total number of iterations is 100. In (a), optimization with only two objective functions saturates after 29 iterations, whereas in (b), optimization with four objective functions saturates after 90 iterations.

V. CONCLUSIONS
In this paper, a new multiobjective optimization technique is proposed for obtaining a final solution from the set of non-dominated solutions produced by an NSGA-II based real-coded multiobjective fuzzy clustering scheme that optimizes the Xie-Beni (XB) index and Jm simultaneously. To meet new requirements, researchers in the field of bioinformatics are working on the development of new algorithms (mathematical formulas, statistical methods, etc.) and software tools designed for assessing relationships among large stored data sets. Three objective functions have been optimized simultaneously. To generate the labelled data, the Fuzzy C-means (FCM) clustering technique is utilized. After that, the proposed MOO technique is applied to solve the gene expression data clustering problem. Subsequently, the results obtained on three real-life standard gene expression data sets have been explained. Lastly, the worth of the approach is demonstrated in terms of fitness value.
REFERENCES
[1] M. B. Eisen, P. T. Spellman, P. O. Brown, D. Botstein, Cluster analysis and display of genome-wide expression patterns, in Proc. Nat. Academy of Sciences, USA, pp. 14863-14868, 1998.
[2] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York, NY: John Wiley and Sons, Inc., 1973.
[3] J. C. Bezdek, Pattern recognition with fuzzy objective function algorithm, New York, 1981.
[4] K. Deb, A. Pratap, S. Agrawal, and T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput., vol. 6, pp. 182-197, 2002.
[5] R. Agrawal and R. Srikant, Fast algorithms for mining association rules, in Proc. 20th VLDB Conf., Sept. 1994, pp. 478-499.
[6] R. Srikant and R. Agrawal, Mining quantitative association rules in large relational tables, in Proc. ACM SIGMOD Int. Conf. Management of Data, 1996, pp. 1-12.
[7] J. Handl, J. Knowles, On semi-supervised clustering via multiobjective optimization, in: Genetic and Evolutionary Computation Conference, pp. 1465-1472, 2006.
[8] A. Ben-Dor et al., Clustering gene expression patterns, J. Comput. Biol., vol. 6, pp. 281-297, 1999.
[9] Nada M. A. Al Salami, Ant Colony Optimization Algorithm, UbiCC Journal, vol. 4, no. 3, August 2009.
[10] S. Kirkpatrick, C. Gelatt, and M. Vecchi, Optimization by simulated annealing, Science, (22), pp. 671-680, 1983.
[11] K. Deb, A. Pratap, S. Agrawal, and T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput., vol. 6, pp. 182-197, 2002.
[12] S. Bandyopadhyay, S. Saha, U. Maulik and K. Deb, A simulated annealing based multi-objective optimization algorithm: AMOSA, IEEE Trans. Evol. Comput., 12, pp. 269, 2008.
[13] U. Maulik, S. Bandyopadhyay, Fuzzy partitioning using a real coded variable length genetic algorithm for pixel classification, IEEE Transactions on Geoscience and Remote Sensing, 41(5), pp. 1075-1081, 2003.
[14] P. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, J. Comp. App. Math., (20), pp. 53-65, 1987.
[15] S. Bandyopadhyay, A. Mukhopadhyay, U. Maulik, An improved algorithm for clustering gene expression data, Bioinformatics, 23(21), pp. 2859-2865, 2007.
[16] C. Zhang, X. Lu, and X. Zhang, Significance of gene ranking for classification of microarray samples, IEEE/ACM Trans. on Com. Bio. and Bioinf., vol. 3, no. 3, pp. 312-320, July-September 2006.