The Genopolis Microarray database

“Design of a community microarray database”
Andrea Splendiani, ca 2004-2007

Design of a community microarray database
• About (concept)

• Introduction by examples

• Design

• Information modeling

• Annotation process

• Implementation

• Data access

About
• Development of a microarray database for the Genopolis consortium
(Milan, Italy), within the University of Milano-Bicocca.

• The Genopolis Consortium acts as a service provider (Affymetrix GeneChip)

• Supports a scientific community studying the behavior of immune cells in host response interaction at the gene expression level

• Supports several research networks

• Integrated with ArrayExpress (EBI)

About:: Desiderata & peculiarities
• Data storage

• Data query/analysis

• “Integration” with other databases

• Support for a heterogeneous community of users

• Limited to Affymetrix GeneChip expression data

• Users tend to have a homogeneous scientific focus

• Different roles of users: service
provider, ‘customers’,...

• Neither public nor private data
(depending on agreements and
publication status)

About:: community database concept
This is reflected in:

• Information modeling

• Annotation process

• Implementation

• Data access

Introduction by example (user describes experiment)

Introduction by example (checking experiment annotation)

Introduction by example (data input by service provider)

Introduction by example (Administrator/Service p.)

Introduction by example (Supervisor manages CVs)

Design:: information modeling
[Figure: gene expression data structure, relating gene expression values to genes and to experiment conditions (stimuli).]
The importance of characterizing experimental conditions (especially in public repositories) is well understood, with results such as MIAME, MAGE, the MGED Ontology and ArrayExpress.
Annotation of genes concerns both the characterization of the measurement technology and genes' ‘properties’ (such as Gene Ontology codes). The latter is not strictly part of the microarray database domain.
Gene expression data can be thought of as a “matrix” representing a relation between the dimension of “stimuli”, or experimental conditions, and the dimension of genes.
The Genopolis Microarray data model is related to MAGE, with two main differences: array description is ‘not relevant’ (standard technology, can be ‘imported’ from the provider), and experiment description is simplified. (The relation between stimuli and samples is also redesigned.)
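The “matrix” view of gene expression data can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the gene names, the condition names and the values are invented, and the real database stores this relation in SQL tables rather than in Python lists.

```python
# Hypothetical toy data: gene names, condition names and values are invented.
genes = ["geneA", "geneB", "geneC"]
conditions = ["control", "LPS_2h", "LPS_6h"]

# One row per gene, one column per experimental condition (stimulus).
matrix = [
    [1.0, 2.5, 4.0],
    [3.0, 3.1, 2.9],
    [0.5, 0.4, 6.2],
]


def expression(gene, condition):
    """Return the value relating one gene to one condition."""
    return matrix[genes.index(gene)][conditions.index(condition)]


def profile(gene):
    """Return the gene's values across all conditions, keyed by condition."""
    row = matrix[genes.index(gene)]
    return dict(zip(conditions, row))
```

Annotation of the two dimensions (gene properties on one side, experiment description on the other) is what lets a user pick meaningful rows and columns.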

Design:: information modeling (experiment)
• Objects represent entities relevant for experiment annotation

• Description organized as a tree

• ‘Sample centric’ (object centric)

[Figure: tree with nodes Experiment, Source (×2), Sample, Hybridization (×2), Measure (×2). Callouts: an experiment is a ‘container’; each sample has associated with it the list of all stimuli affecting it; supports different measures.]


[Figure, second build: the same tree, with its parts labelled ‘Experiment description’ and ‘Measurements’.]


[Figure, third build: the same tree, marking replicates and noting that samples with no stimuli are controls.]
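The sample-centric tree can be sketched as nested objects. This is a hedged illustration, not the Genopolis schema: the class and attribute names are hypothetical, and treating samples that share the same stimulus set as replicates is an assumption about how replicates are computed.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Measure:
    kind: str      # e.g. "Signal"; illustrative name
    value: float


@dataclass
class Hybridization:
    name: str
    measures: list = field(default_factory=list)


@dataclass
class Sample:
    name: str
    stimuli: frozenset = frozenset()   # all stimuli affecting this sample
    hybridizations: list = field(default_factory=list)

    def is_control(self):
        # From the slide: no stimuli -> control.
        return not self.stimuli


@dataclass
class Source:
    name: str
    samples: list = field(default_factory=list)


@dataclass
class Experiment:
    name: str                          # an experiment is just a container
    sources: list = field(default_factory=list)

    def samples(self):
        for source in self.sources:
            yield from source.samples


def replicate_groups(experiment):
    """Group sample names by stimulus set; groups larger than one are replicates.

    Assumption: 'same stimuli' is the replicate criterion.
    """
    groups = defaultdict(list)
    for sample in experiment.samples():
        groups[sample.stimuli].append(sample.name)
    return {stimuli: names for stimuli, names in groups.items() if len(names) > 1}
```

Keeping the full stimulus list on each sample means controls and replicates fall out of a simple walk over the tree, with no separate bookkeeping.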

Design:: information modeling (data)
• Which data should be stored in the database?

• The principle is to store the basic information needed by any ‘interpretation technology’ (like raw scanned images) and actual expression values that can be used ‘live’ (like Signal, evidence code...). Some other useful intermediate data is stored as well.

Design:: annotation process
• Annotation process by database users

• Users with different views of the experiment can input different
types of information (experiment description, measurement, array
features...)

• When describing terms, users make use of controlled vocabularies generated by the community within this process (ontologies)

• Checking of the coherence of the database content (data and annotation) is both automatic and carried out by supervisors: ‘draft’ and ‘certified’ information.

• Annotation process at large

• Information, once public, can be sent to a public repository (via
MAGE-ML).

Design:: implementation
• Web application (PHP/MySQL)

• Object based. Objects represent entities of the domain and are containers of objects representing fields (Display/Set/Store/Check methods).

• Two key concepts:

• Approximate relations among objects as a tree (stimuli are leaves). Use tree traversal for: completeness/correctness checking, computation (replicates), administration (more later...)

• Use two distinct databases: one for draft and one for complete information. This can be used to improve efficiency (indexing, deployment on a cluster).
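The “objects as containers of field objects” idea can be sketched as follows. The original is a PHP application; this Python version only mirrors the four method names from the slide (Display/Set/Store/Check). The class names `Field` and `Entity`, the field names, and the use of a plain list as a stand-in for a database table are all assumptions.

```python
class Field:
    """One field of an entity; holds a value and knows how to check it."""

    def __init__(self, name, required=False):
        self.name = name
        self.required = required
        self.value = None

    def set(self, value):
        self.value = value

    def display(self):
        shown = self.value if self.value is not None else ""
        return f"{self.name}: {shown}"

    def check(self):
        """Return a list of problems (empty when the field is fine)."""
        if self.required and self.value in (None, ""):
            return [f"{self.name} is required"]
        return []

    def store(self, record):
        record[self.name] = self.value


class Entity:
    """A domain object that delegates each operation to its fields."""

    def __init__(self, kind, fields):
        self.kind = kind
        self.fields = {f.name: f for f in fields}

    def set(self, **values):
        for name, value in values.items():
            self.fields[name].set(value)

    def display(self):
        return "\n".join(f.display() for f in self.fields.values())

    def check(self):
        return [p for f in self.fields.values() for p in f.check()]

    def store(self, table):
        record = {}
        for f in self.fields.values():
            f.store(record)
        table.append(record)   # a plain list stands in for a database table
```

With this split, adding a new kind of entity is mostly a matter of listing its fields; display, validation and storage come from the field objects.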

Design:: implementation (object types)
• BaseObject: objects that just know about the system (ex. MailManager)

• DBObject: objects that know of underlying databases, can make queries (ex. DBQuery)

• DAOedDBObject: objects that can handle a web representation (ex. Protocol)

• TreeDAOedObject: objects that are organized in a tree, allow iteration over the tree

• Specific Objects: objects that represent entities, with specific properties.
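The object types listed here can be read as a chain of increasingly capable base classes. The sketch below assumes each type extends the previous one, which the slide suggests but does not state; the method names (`render`, `add`), the attributes, and the example `Sample` class are hypothetical.

```python
class BaseObject:
    """Knows only about the system (the slide's example: MailManager)."""


class DBObject(BaseObject):
    """Knows about an underlying database; a dict stands in for it here."""

    def __init__(self, db=None):
        self.db = db if db is not None else {}


class DAOedDBObject(DBObject):
    """Can also produce a web representation of itself."""

    def __init__(self, label, db=None):
        super().__init__(db)
        self.label = label

    def render(self):
        return f"<li>{self.label}</li>"


class TreeDAOedObject(DAOedDBObject):
    """Organized in a tree and iterable over its own subtree."""

    def __init__(self, label, db=None):
        super().__init__(label, db)
        self.children = []

    def add(self, child):
        self.children.append(child)
        return child

    def __iter__(self):
        # Depth-first, pre-order: the node itself, then each child's subtree.
        yield self
        for child in self.children:
            yield from child


class Sample(TreeDAOedObject):
    """A 'specific object': an entity with its own properties."""

    def __init__(self, label, organism, db=None):
        super().__init__(label, db)
        self.organism = organism
```

Because iteration lives in the tree base class, checks, replicate computation and administration tasks can all be written as a loop over `for node in experiment`.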

Design:: implementation (annotation process)
• Two databases

• TDB (temporary)

• SDB (‘standard’): read only; can be duplicated on nodes of a cluster

• Terms for controlled vocabularies come from SDB

• New terms proposed are stored in SDB

[Figure: TDB and SDB; users supply description + data (files).]

• Supervisor accepts new terms proposed by users
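The propose-then-accept cycle for controlled vocabulary terms can be sketched as a small state holder. The class and method names are hypothetical, and the sketch deliberately says nothing about which database holds proposed terms; it only shows that a term becomes usable as an accepted term after a supervisor acts.

```python
class ControlledVocabulary:
    def __init__(self, accepted=()):
        self.accepted = set(accepted)
        self.proposed = {}          # term -> user who proposed it

    def propose(self, term, user):
        """Record a new term suggested by a user, unless already accepted."""
        if term not in self.accepted:
            self.proposed.setdefault(term, user)

    def accept(self, term):
        """Supervisor action: promote a proposed term to the accepted set."""
        if term not in self.proposed:
            raise KeyError(f"{term!r} was not proposed")
        del self.proposed[term]
        self.accepted.add(term)

    def is_accepted(self, term):
        return term in self.accepted
```

Letting users propose terms while they annotate, rather than asking a curator first, is what allows the vocabulary to grow out of the community's own usage.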

• System checks for:

• completeness of data (required fields)

• common errors

• accepted terms

• Generates and sends reports to the people responsible
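The automatic checks and the reports they produce can be sketched as below. The required fields, the accepted terms and the `owner` key are hypothetical; the slide's “common errors” check is not modelled because its rules are not given.

```python
REQUIRED = {"name", "organism"}                      # hypothetical required fields
ACCEPTED_TERMS = {"Mus musculus", "Homo sapiens"}    # hypothetical vocabulary


def check_record(record):
    """Return a list of human-readable problems for one annotation record."""
    problems = []
    for field_name in sorted(REQUIRED):
        if not record.get(field_name):
            problems.append(f"missing required field '{field_name}'")
    organism = record.get("organism")
    if organism and organism not in ACCEPTED_TERMS:
        problems.append(f"term '{organism}' is not an accepted term")
    return problems


def build_reports(records):
    """Group problems by the person responsible for each record."""
    reports = {}
    for record in records:
        problems = check_record(record)
        if problems:
            owner = record.get("owner", "unknown")
            name = record.get("name", "<unnamed>")
            reports.setdefault(owner, []).extend(f"{name}: {p}" for p in problems)
    return reports
```

Each entry of the returned dictionary corresponds to one report that would be sent to the person responsible.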

• System publishes data

• off-line operation (batch)

• possible performance optimization

• data files are parsed in this phase (file -> SQL tables)
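The publish step, where uploaded data files become SQL tables, can be sketched with Python's built-in `sqlite3` module standing in for MySQL. The table layout, the tab-separated format and the column names `probe` and `signal` are assumptions; real Affymetrix output files have a different structure.

```python
import csv
import io
import sqlite3


def publish(conn, sample_name, data_file):
    """Parse one tab-separated data file and load its rows into an SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS expression "
        "(sample TEXT, probe TEXT, signal REAL)"
    )
    reader = csv.DictReader(data_file, delimiter="\t")
    # Parse everything first, so a malformed file raises before any insert.
    rows = [(sample_name, r["probe"], float(r["signal"])) for r in reader]
    with conn:  # one transaction per file
        conn.executemany("INSERT INTO expression VALUES (?, ?, ?)", rows)
    return len(rows)


# Demo with an in-memory database and an in-memory "file".
conn = sqlite3.connect(":memory:")
demo_file = io.StringIO("probe\tsignal\np1\t10.5\np2\t3.0\n")
loaded = publish(conn, "s1", demo_file)
```

Doing the parsing off-line keeps the interactive annotation step fast, since users only upload files and the expensive load happens later in a batch.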

• Un-publishing for revisions: the system can “un-publish” data

Design:: data access
• Who can access

• Users belong to groups with a role. Experiments (data + description) belong to groups. Depending on their role in a group, users can edit, query, view... the experiments’ information.

• How to access data

• Several interfaces. Some related to data inspection (related to the
structure), some oriented to data analysis.

• It is always possible to export a subset of data as a table (for
analysis tools...)

• MAGE-ML

• Examples shown:

• Tree view

• Interactive “context based” browsing
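The “who can access” rule (users have a role in a group, experiments belong to groups) can be sketched as a lookup. The role names and their permissions are hypothetical, since the slides do not list them, and the sketch assumes each experiment belongs to exactly one group.

```python
# Hypothetical role names and permissions; the slides do not list them.
PERMISSIONS = {
    "owner": {"view", "query", "edit"},
    "member": {"view", "query"},
    "guest": {"view"},
}


class AccessControl:
    def __init__(self):
        self.roles = {}         # (user, group) -> role
        self.experiments = {}   # experiment -> group that owns it

    def add_member(self, user, group, role):
        if role not in PERMISSIONS:
            raise ValueError(f"unknown role: {role}")
        self.roles[(user, group)] = role

    def assign(self, experiment, group):
        self.experiments[experiment] = group

    def can(self, user, action, experiment):
        """True if the user's role in the experiment's group allows the action."""
        group = self.experiments.get(experiment)
        role = self.roles.get((user, group))
        return role is not None and action in PERMISSIONS[role]
```

Attaching permissions to the role within a group, rather than to the user, fits data that is “neither public nor private”: the same person can edit one group's experiments and only view another's.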

Design:: data access (tree view)

Design:: data access (interactive browsing)
• Gene expression data as a matrix (Genes x Samples).

• For each sub-matrix, “data” is the connection between a selected subset of samples and genes

• The idea is to provide a way to navigate between sub-matrices, based on genes’ annotation, samples’ features or data.

• An example follows...

• Extensions to this interface include pluggable search/view modules and sharing of gene lists among groups.
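Navigation between sub-matrices can be sketched as three kinds of selection: by gene annotation, by sample features, and by the data itself. All names and values below are invented for illustration, and the functions are a hypothetical reading of the interface, not its actual implementation. The last function shows the export of a selected subset as a table.

```python
# Hypothetical annotations and values, for illustration only.
gene_annotation = {
    "g1": {"immune response"},
    "g2": {"metabolism"},
    "g3": {"immune response"},
}
sample_stimuli = {"s1": set(), "s2": {"LPS"}, "s3": {"LPS"}}
values = {
    ("g1", "s1"): 1.0, ("g1", "s2"): 5.0, ("g1", "s3"): 6.0,
    ("g2", "s1"): 2.0, ("g2", "s2"): 2.1, ("g2", "s3"): 1.9,
    ("g3", "s1"): 0.5, ("g3", "s2"): 0.6, ("g3", "s3"): 7.0,
}


def genes_with(term):
    """Select genes by annotation."""
    return sorted(g for g, terms in gene_annotation.items() if term in terms)


def samples_with(stimulus):
    """Select samples by one of their features (here, a stimulus)."""
    return sorted(s for s, stimuli in sample_stimuli.items() if stimulus in stimuli)


def genes_above(threshold, genes, samples):
    """Select genes by data: value above threshold in every selected sample."""
    return [g for g in genes if all(values[(g, s)] > threshold for s in samples)]


def as_table(genes, samples):
    """Export the selected sub-matrix as rows, header first."""
    header = ["gene"] + list(samples)
    rows = [[g] + [values[(g, s)] for s in samples] for g in genes]
    return [header] + rows
```

Each selection narrows either the rows or the columns, so a browsing session is a sequence of such steps, and the result at any point can be handed to an analysis tool as a table.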
