“Design of a community
microarray database”
Andrea Splendiani, ca 2004-2007
Design of a community microarray database
• About (concept)

• Introduction by
examples

• Design

• Information modeling

• Annotation process

• Implementation

• Data access
About
• Development of a microarray database for the Genopolis consortium
(Milan, Italy), within the University of Milano-Bicocca.

• The Genopolis Consortium acts as a service provider (Affymetrix
GeneChip)

• Supports a scientific community studying the behavior of immune
cells in host response interaction at the gene expression level

• Supports several research networks

• Integrated with ArrayExpress (EBI)
About:: Desiderata & peculiarities
• Data storage

• Data query/analysis

• “Integration” with other databases
• Support for a heterogeneous
community of users
• Limited to Affymetrix GeneChip
expression data

• Users tend to have a homogeneous
scientific focus

• Different roles of users: service
provider, ‘customers’,...

• Data is neither fully public nor
fully private (depending on agreements
and publication status)
About:: community database concept
This is reflected in:

• Information modeling

• Annotation process

• Implementation

• Data access
Introduction by example (user describes experiment)
Introduction by example (checking experiment annotation)
Introduction by example (data input by service provider)
Introduction by example (Administrator/Service provider)
Introduction by example (Administrator/Service provider)
Introduction by example (Supervisor manages CVs)
Design:: information modeling
[Diagram: gene expression data structure. Gene expression values, indexed by genes and by experiment conditions (stimuli)]
The importance of characterizing experimental conditions (especially in
public repositories) is well understood, with results such as
MIAME, MAGE, the MGED Ontology and ArrayExpress.
Annotation of genes concerns both the characterization of the
measurement technology and genes’ ‘properties’ (such as Gene
Ontology codes). The latter is not strictly part of the microarray
database domain.
Gene expression data can be thought of as a “matrix” representing
a relation between the dimension of “stimuli” (experimental
conditions) and the dimension of genes.
The Genopolis Microarray data model is related to MAGE, with two main
differences: the array description is ‘not relevant’ (standard technology, can be
‘imported’ from the provider), and the experiment description is simplified.
(The relation between stimuli and samples is also redesigned.)
Design:: information modeling (experiment)
• Objects represent
entities relevant for
experiment annotation

• Description organized as
a tree

• ‘Sample centric’
[Diagram: tree with nodes Experiment, Source, Sample, Hybridization and Measure, annotated as follows]
• An experiment is a ‘container’.
• Each sample has associated with it the list of all stimuli affecting it.
• Supports different measures.
Design:: information modeling (experiment)
• ‘Sample centric’ (object centric)

[Same diagram as on the previous slide, with two regions labelled “Experiment description” and “Measurements”]
Design:: information modeling (experiment)
[Same diagram, with two added labels: “Replicates” and “No stimuli -> controls”]
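The sample-centric tree and the two labels above can be illustrated with a minimal Python sketch. It is not the original PHP code. The nesting order (Experiment, Source, Sample, Hybridization, Measure) is read off the diagram labels, and the rule that samples sharing exactly the same stimuli are replicates is an assumption made for illustration. All class names, field names and sample values are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Measure:
    kind: str  # hypothetical label, e.g. "Signal"


@dataclass
class Hybridization:
    chip_id: str
    measures: list[Measure] = field(default_factory=list)


@dataclass
class Sample:
    name: str
    stimuli: frozenset[str] = frozenset()  # all stimuli affecting this sample
    hybridizations: list[Hybridization] = field(default_factory=list)

    @property
    def is_control(self) -> bool:
        # "No stimuli -> controls"
        return not self.stimuli


@dataclass
class Source:
    name: str
    samples: list[Sample] = field(default_factory=list)


@dataclass
class Experiment:
    title: str
    sources: list[Source] = field(default_factory=list)

    def samples(self):
        """Iterate over every sample in the experiment tree."""
        for source in self.sources:
            yield from source.samples


def replicate_groups(experiment):
    """Group sample names by stimulus set (assumed definition of replicates)."""
    groups = defaultdict(list)
    for sample in experiment.samples():
        groups[sample.stimuli].append(sample.name)
    return {stimuli: names for stimuli, names in groups.items() if len(names) > 1}


# Made-up example data.
experiment = Experiment("demo", [Source("donor-1", [
    Sample("s1", frozenset({"LPS"})),
    Sample("s2", frozenset({"LPS"})),
    Sample("s3"),  # no stimuli: a control
])])
controls = [s.name for s in experiment.samples() if s.is_control]
replicates = replicate_groups(experiment)
```

Because each sample carries its full list of stimuli, both controls and replicates can be derived from the annotation rather than entered by hand.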
Design:: information modeling (data)
• Which data should be stored in the database?

• The principle is to store the basic information needed by any
‘interpretation technology’ (such as raw scanned images) together with actual
expression values that can be used ‘live’ (such as Signal, evidence code...).
Some other useful intermediate data is stored as well.
Design:: annotation process
• Annotation process by database users

• Users with different views of the experiment can input different
types of information (experiment description, measurement, array
features...)

• In the description of terms, users make use of controlled
vocabularies generated by the community within this process
(ontologies)

• Checking the coherence of the database content (data and
annotation) is both automatic and carried out by supervisors:
information is either ‘draft’ or ‘certified’.

• Annotation process at large

• Information, once public, can be sent to a public repository (via
MAGE-ML).
Design:: implementation
• Web application (PHP/MySQL)

• Object based. Objects represent entities of the domain, and are containers
of objects representing fields (Display/Set/Store/Check methods).

• Two key concepts:

• Approximate relations among objects as a tree (stimuli are leaves). Use tree
traversal for: completeness/correctness checking, computation
(replicates), administration (more later...)

• Use two distinct databases: one for draft and one for complete information. This
can be used to improve efficiency (indexing, deployment on a cluster).
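As an illustration of the first key concept, here is a minimal Python sketch of entity objects that contain field objects, with a tree traversal that collects completeness problems. The original system is written in PHP; the class names `Field` and `Node`, the `check` and `walk` methods and the example values are all hypothetical.

```python
class Field:
    """One annotated value, with a check step like the Check method on the slide."""

    def __init__(self, name, value=None, required=True):
        self.name = name
        self.value = value
        self.required = required

    def check(self):
        # Return a list of problems; empty when the field is acceptable.
        if self.required and self.value in (None, ""):
            return [f"missing required field '{self.name}'"]
        return []


class Node:
    """A domain entity: a container of fields, with child entities below it."""

    def __init__(self, label, fields=(), children=()):
        self.label = label
        self.fields = list(fields)
        self.children = list(children)

    def walk(self, path=()):
        """Depth-first traversal yielding (path, node) pairs."""
        here = path + (self.label,)
        yield here, self
        for child in self.children:
            yield from child.walk(here)


def check_tree(root):
    """Collect every field problem in the tree, tagged with the node path."""
    problems = []
    for path, node in root.walk():
        for fld in node.fields:
            for message in fld.check():
                problems.append("/".join(path) + ": " + message)
    return problems


# Made-up example: the sample lacks a required "cell type" value.
sample = Node("sample-1", [Field("organism", "Mus musculus"), Field("cell type")])
experiment = Node("experiment-1", [Field("title", "demo")], [sample])
report = check_tree(experiment)
```

The same `walk` generator could serve the other uses named on the slide (computation over replicates, administration), since each only needs to visit every node once.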
Design:: implementation (object types)
• BaseObject: objects that just know about the system (e.g. MailManager)

• DBObject: objects that know of underlying databases and can make queries (e.g. DBQuery)

• DAOedDBObject: objects that can handle a web representation (e.g. Protocol)

• TreeDAOedObject: objects that are organized in a tree and allow iteration over the tree

• Specific objects: objects that represent entities, with specific properties
Design:: implementation (annotation process)
• Two databases

• TDB (temporary)

• SDB (‘standard’): read only; can be
duplicated on nodes of a cluster

[Diagram: TDB and SDB]
Design:: implementation (annotation process)
• Terms for controlled
vocabularies come
from SDB

• New terms proposed
are stored in SDB
[Diagram: TDB and SDB; users supply description + data (files)]
Design:: implementation (annotation process)
• Supervisor accepts
new terms proposed
by users
[Diagram: TDB and SDB; users supply description + data (files); supervisor]
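The propose/accept cycle for controlled vocabulary terms can be sketched as follows. This is an assumption-laden illustration in Python, not the original code: the class `ControlledVocabulary` and its methods are hypothetical, and where pending terms are physically stored is deliberately not modelled.

```python
class ControlledVocabulary:
    """Accepted terms plus a queue of terms proposed by users."""

    def __init__(self, accepted=()):
        self.accepted = set(accepted)
        self.proposed = {}  # term -> user who first proposed it

    def use(self, term, user):
        """Return True if the term is accepted; otherwise record it as proposed."""
        if term in self.accepted:
            return True
        self.proposed.setdefault(term, user)
        return False

    def accept(self, term):
        """Supervisor action: promote a proposed term to the accepted set."""
        if term not in self.proposed:
            raise KeyError(f"{term!r} has not been proposed")
        del self.proposed[term]
        self.accepted.add(term)


# Made-up example terms and user names.
cv = ControlledVocabulary({"dendritic cell"})
first_use = cv.use("macrophage", "alice")   # not yet accepted, so it is queued
cv.accept("macrophage")                     # supervisor approves the term
second_use = cv.use("macrophage", "bob")    # now an accepted term
```

Keeping proposed terms separate from accepted ones lets annotation continue in draft form while the community vocabulary grows under supervision.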
Design:: implementation (annotation process)
• System checks for:

• completeness of data
(required fields)

• common errors

• accepted terms

• Generates and sends
reports to those responsible
[Diagram: TDB and SDB; users supply description + data (files); supervisor; system check for completeness and errors]
Design:: implementation (annotation process)
• System publishes data

• off-line operation

• possible performance
optimization

• data files are parsed
in this phase
[Diagram: TDB and SDB; users supply description + data (files); supervisor; system check for completeness and errors; system “publishes” data as a batch job, file -> SQL tables]
Design:: implementation (annotation process)
• Un-publishing for
revisions
[Diagram: TDB and SDB; users supply description + data (files); supervisor; system check for completeness and errors; system “un-publishes” data]
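The publish and un-publish steps can be illustrated with a small batch loader. This is a sketch under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in for MySQL, and the table name `expression`, its columns, the tab-separated file layout and the probe set names are all hypothetical.

```python
import csv
import io
import sqlite3


def publish(conn, hybridization_id, data_file):
    """Parse a tab-separated data file and load it in a single transaction."""
    reader = csv.DictReader(data_file, delimiter="\t")
    rows = [
        (hybridization_id, row["probe_set"], float(row["signal"]))
        for row in reader
    ]
    with conn:  # commits on success, rolls back if an error is raised
        conn.executemany(
            "INSERT INTO expression (hybridization_id, probe_set, signal) "
            "VALUES (?, ?, ?)",
            rows,
        )
    return len(rows)


def unpublish(conn, hybridization_id):
    """Remove published rows so the annotation can be revised."""
    with conn:
        cursor = conn.execute(
            "DELETE FROM expression WHERE hybridization_id = ?",
            (hybridization_id,),
        )
    return cursor.rowcount


# Made-up example: an in-memory database and a two-row data file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE expression (hybridization_id TEXT, probe_set TEXT, signal REAL)"
)
data = io.StringIO("probe_set\tsignal\nprobe_a\t123.4\nprobe_b\t56.7\n")
loaded = publish(conn, "hyb-1", data)
count_after_publish = conn.execute("SELECT COUNT(*) FROM expression").fetchone()[0]
removed = unpublish(conn, "hyb-1")
count_after_unpublish = conn.execute("SELECT COUNT(*) FROM expression").fetchone()[0]
```

Parsing files only at publish time keeps the interactive annotation step light, and doing it as one transaction means a malformed file leaves the published tables untouched.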
Design:: data access
• Who can access

• Users belong to groups with a role. Experiments (data +
description) belong to groups. Depending on their role in groups
users can edit, query, view... experiments’ information. 

• How to access data

• Several interfaces. Some related to data inspection (related to the
structure), some oriented to data analysis.

• It is always possible to export a subset of data as a table (for
analysis tools...)

• MAGE-ML

• Examples shown: 

• Tree view

• Interactive “context based” browsing
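The "who can access" rule (users hold a role in each group, experiments belong to groups) can be sketched as below. The role names echo the service provider and customers mentioned earlier, but the `guest` role and the permission table are assumptions made for illustration, as are the function and variable names.

```python
# Hypothetical mapping from role to the actions that role allows.
PERMISSIONS = {
    "provider": {"view", "query", "edit"},
    "customer": {"view", "query"},
    "guest": {"view"},
}


def can(user_roles, experiment_groups, action):
    """Check whether a user may perform an action on an experiment.

    user_roles maps group name -> the user's role in that group;
    experiment_groups is the set of groups the experiment belongs to.
    """
    return any(
        action in PERMISSIONS.get(user_roles.get(group), set())
        for group in experiment_groups
    )


# Made-up example: alice is a customer in one group only.
alice = {"lab-a": "customer"}
may_query = can(alice, {"lab-a"}, "query")
may_edit = can(alice, {"lab-a"}, "edit")
may_view_other = can(alice, {"lab-b"}, "view")
```

Resolving permissions per group rather than per user is what allows the same experiment to be editable by one group and merely visible to another.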
Design:: data access (tree view)
Design:: data access (interactive browsing)
• Gene expression data as a matrix (genes x samples).

• Each sub-matrix holds the “data” connecting a selected subset of
samples and genes.

• The idea is to provide a way to navigate between sub-matrices, based on
genes’ annotation, samples’ features or data.

• An example follows...

• Extensions to this interface include pluggable search/view modules and sharing of
gene lists among groups.
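Navigation between sub-matrices can be illustrated with plain Python dictionaries. This is only a sketch of the idea, not the interface itself: the function names, the nested-dictionary representation and all gene, sample and annotation values are hypothetical.

```python
def submatrix(matrix, genes, samples):
    """Restrict a {gene: {sample: value}} matrix to the chosen genes and samples."""
    return {g: {s: matrix[g][s] for s in samples} for g in genes}


def genes_by_annotation(annotation, term):
    """Select genes annotated with a given term."""
    return [g for g, terms in annotation.items() if term in terms]


def samples_by_stimulus(stimuli, stimulus):
    """Select samples affected by a given stimulus."""
    return [s for s, applied in stimuli.items() if stimulus in applied]


def genes_by_data(matrix, samples, threshold):
    """Select genes whose value exceeds the threshold in every chosen sample."""
    return [
        g for g, row in matrix.items()
        if all(row[s] > threshold for s in samples)
    ]


# Made-up example data.
matrix = {
    "geneA": {"s1": 10.0, "s2": 250.0},
    "geneB": {"s1": 12.0, "s2": 11.0},
    "geneC": {"s1": 300.0, "s2": 320.0},
}
annotation = {
    "geneA": {"immune response"},
    "geneB": {"metabolism"},
    "geneC": {"immune response"},
}
stimuli = {"s1": set(), "s2": {"LPS"}}

genes = genes_by_annotation(annotation, "immune response")
samples = samples_by_stimulus(stimuli, "LPS")
view = submatrix(matrix, genes, samples)
high = genes_by_data(matrix, samples, 100.0)
```

Each selector narrows one axis of the matrix, so chaining them moves from one sub-matrix to another by annotation, by sample feature or by the data values themselves.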
The Genopolis Microarray database
The End
