Generic Model Management: A Database Infrastructure For Schema Manipulation
Philip A. Bernstein
1 Introduction
Many of the problems of cooperative information systems involve the design,
integration, and maintenance of complex application artifacts: application pro-
grams, databases, web sites, workflow scripts, formatted messages, user inter-
faces, etc. Engineers who perform this work use tools to manipulate formal de-
scriptions, or models, of these artifacts: object diagrams, database schemas, web
site layouts, control flow diagrams, XML schemas, form definitions, etc. The tools
are usually model-driven, in the sense that changes to the tool can be accom-
plished by updating the tool’s meta-model. For example, an entity-relationship
design tool may allow a user to add fields to the types Entity, Attribute, and
Relationship, and to add a function that uses those fields to generate SQL DDL.
This allows the user to customize the tool for database systems that have special
DDL capabilities not anticipated by the tool designer. Often, the run-time appli-
cation itself is model-driven. For example, a message-translator for business-to-
business e-commerce might be built as an interpreter of mappings that describe
transformations between types of messages. Changes to the message-translator
can be accomplished simply by modifying the mapping.
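As a rough illustration of such a model-driven run-time application, the following sketch treats a mapping as plain data and the translator as a small interpreter over it. The `FieldRule` type, the `translate` function, and the message field names are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class FieldRule:
    """One entry of a hypothetical mapping: copy `source` to `target`, optionally converting it."""
    source: str
    target: str
    convert: Callable[[Any], Any] = lambda value: value


def translate(message: Dict[str, Any], mapping: List[FieldRule]) -> Dict[str, Any]:
    """Interpret the mapping against one message; no message-specific logic lives here."""
    return {rule.target: rule.convert(message[rule.source]) for rule in mapping}


# Hypothetical mapping between two purchase-order message types.
order_mapping = [
    FieldRule("PONumber", "order_id"),
    FieldRule("Qty", "quantity", int),
    FieldRule("UnitPrice", "unit_price", float),
]

incoming = {"PONumber": "A-17", "Qty": "3", "UnitPrice": "9.50"}
outgoing = translate(incoming, order_mapping)
```

Supporting a new message type in this sketch means supplying a different list of rules; `translate` itself is unchanged.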
Model-driven tools and applications need a persistent representation of the
models that they manipulate. The database field’s last big push to meet the spe-
cial needs of these types of design applications was the development of object-
oriented database systems (OODBs) starting in the mid-to-late 1980s. Yet to-
day, most model-driven applications are built using relational databases and
files. This is rather odd, given that OODBs not only have the usual database
amenities, such as a query language and transactions, but also offer several
C. Batini et al. (Eds.): CoopIS 2001, LNCS 2172, pp. 1–6, 2001.
© Springer-Verlag Berlin Heidelberg 2001
valuable benefits over relational databases: a rich type system in which to de-
scribe models, smooth integration with object-oriented programming languages,
a versioning facility, and dynamically extensible types. These features are surely
helpful to developers of model-driven applications, among others, yet customer
acceptance of OODBs has been disappointing. The reason usually cited is the
difficulty of displacing a dominant technology like SQL database systems by a
less mature and robust one like OODBs. This may not, however, be the whole
story.
In our opinion, what is missing from today’s persistent storage systems are
features that offer an order-of-magnitude productivity gain for programmers of
model-driven applications over existing relational and OO databases, much like
relational database systems achieved over their predecessors for business-oriented
applications. Such an order-of-magnitude gain requires a leap of abstraction in
the programming interface. The opportunity to achieve this gain lies in the
fact that most model-driven applications include a lot of record-at-a-time or
object-at-a-time navigational code, even when a query language like SQL or
OQL is available. To avoid this navigational programming, we need a higher-
level data model, one with higher-level operations on higher-level objects that
would greatly reduce the amount of navigational code required by model-driven
applications. Our proposal for this higher-level data model is summarized in the
next section.
– Match - which takes two models as input and returns a mapping between
them
– Compose - which takes two mappings as input and returns their composition
– Merge - which takes two models and a mapping between them as input and
returns a model that is the merge of the two models (using the mapping to
guide the merge)
– Set operations on models - union, intersection, difference
– Project and Select on models - comparable to relational algebra
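To make the shape of these operations concrete, here is a deliberately simplified sketch. It assumes a model is just a set of element names and a mapping is a set of element pairs. Both representations, and the particular reading of Merge, are assumptions made for illustration and not the semantics proposed in this paper.

```python
from typing import Callable, FrozenSet, Set, Tuple

Model = FrozenSet[str]             # toy model: a set of element names
Mapping = Set[Tuple[str, str]]     # toy mapping: (element of model 1, element of model 2)


def merge(m1: Model, m2: Model, mapping: Mapping) -> Model:
    """Keep all of m1, plus the elements of m2 that the mapping does not equate with an m1 element."""
    already_covered = {b for (a, b) in mapping if a in m1 and b in m2}
    return frozenset(m1 | (m2 - already_covered))


def select(model: Model, predicate: Callable[[str], bool]) -> Model:
    """Keep only the elements that satisfy the predicate."""
    return frozenset(e for e in model if predicate(e))


customers_a = frozenset({"CustId", "Name", "Phone"})
customers_b = frozenset({"CustomerID", "FullName", "Email"})
correspondences = {("CustId", "CustomerID"), ("Name", "FullName")}

merged = merge(customers_a, customers_b, correspondences)
shared = customers_a & frozenset({"Name", "Email", "Phone"})   # intersection
only_in_a = customers_a - frozenset({"Phone"})                 # difference
keys = select(merged, lambda e: e.endswith("Id"))
```

In this toy reading, the mapping guides the merge by preventing `CustomerID` and `FullName` from being added a second time under different names.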
To see how these operations might be used, consider the problem of populating
a data warehouse. Suppose we have a model S1 of a data source and a
mapping map1W from S1 to a model W of a data warehouse. Now we are given
a model S2 of a second data source (see Figure 1). We can integrate S2 into
the data warehouse as follows: (1) Match S2 and S1, yielding a mapping map21;
(2) Compose map21 with map1W, to produce a mapping map2W from S2 to W.
Step (1) characterizes those parts of S2 that are the same as S1. Step (2) reuses
map1W by applying it to those parts of S2 that are the same as S1. In [5], we
described a detailed version of this scenario and others like it, to show how to
use the model management algebra to solve some practical data warehousing
problems.
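The two steps of the warehouse scenario can be sketched as follows, under strong simplifying assumptions: a model is a set of element names, a mapping is a dictionary from source elements to target elements, and Match pairs elements by normalized name only. The element names and the `normalize` helper are hypothetical; a real Match would be far more elaborate.

```python
from typing import Dict, Set

Model = Set[str]
Mapping = Dict[str, str]   # element of the source model -> element of the target model


def normalize(name: str) -> str:
    """Hypothetical name normalization: ignore case and underscores."""
    return name.replace("_", "").lower()


def match(source: Model, target: Model) -> Mapping:
    """Toy Match: pair elements whose normalized names are equal."""
    by_name = {normalize(t): t for t in target}
    return {s: by_name[normalize(s)] for s in source if normalize(s) in by_name}


def compose(first: Mapping, second: Mapping) -> Mapping:
    """Compose: follow `first`, then `second`; drop elements where the chain breaks."""
    return {a: second[b] for a, b in first.items() if b in second}


s1 = {"CustID", "CustName", "City"}
s2 = {"cust_id", "cust_name", "Region"}
map_1w = {"CustID": "W.CustomerKey", "CustName": "W.CustomerName", "City": "W.City"}

map_21 = match(s2, s1)              # step (1)
map_2w = compose(map_21, map_1w)    # step (2)
```

Here `Region` has no counterpart in S1, so it is absent from `map_2w`; those parts of S2 would still have to be mapped to W by other means.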
Even if one agrees that the above algebra is a plausible candidate for raising
the level of abstraction of meta-data management, there is still the question of
defining a simple and precise semantics for generic schema management opera-
tions. A panel on this question took place at VLDB 2000 [3]. On the whole, the
panelists found it worthwhile to pursue the goal of generic schema management.
However, there was some disagreement on how difficult it will be to define the
semantics of operations that can address the complexity of real-world data.
The examples in [5] were one step toward defining and validating an operational
semantics to handle this complexity. More recently, we have made progress
on two related fronts. They are described in the next two sections.
3 Schema Matching
these approaches into a hybrid algorithm, called Cupid. Cupid uses properties
of individual elements, linguistic information of names, structural similarities,
key and referential constraints, and context-dependent matching. The algorithm
is described in [6], along with an experimental comparison to two earlier imple-
mentations, DIKE [8] and MOMIS [2], some of whose ideas were incorporated
into Cupid. While a truly robust solution is not around the corner, results so far
have been promising.
We have found that our implementation of Match is immediately useful to
some tools, even without implementations of the other model management op-
erations. For example, consider a graphical tool for defining a mapping between
XML message types. By incorporating Match, the tool can offer a user a draft
mapping, which the user can then review and refine. We expect that an im-
plementation of other operations will be similarly useful outside of a complete
model management implementation.
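A tool of this kind might produce its draft mapping along the following lines. This sketch is not the Cupid algorithm: it combines only a string similarity on names with an exact comparison of data types, and the weight, threshold, and element names are arbitrary assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Element = Tuple[str, str]   # (element name, data type)


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(a: Element, b: Element, name_weight: float = 0.7) -> float:
    """Weighted blend of name similarity and data type equality (weights are assumed)."""
    type_similarity = 1.0 if a[1] == b[1] else 0.0
    return name_weight * name_similarity(a[0], b[0]) + (1 - name_weight) * type_similarity


def draft_mapping(source: List[Element], target: List[Element],
                  threshold: float = 0.6) -> List[Tuple[str, str, float]]:
    """Propose, for each source element, its best-scoring target if it clears the threshold."""
    draft = []
    for s in source:
        best = max(target, key=lambda t: score(s, t))
        best_score = score(s, best)
        if best_score >= threshold:
            draft.append((s[0], best[0], round(best_score, 2)))
    return draft


source_msg = [("OrderId", "int"), ("ShipDt", "date"), ("Comment", "string")]
target_msg = [("orderid", "int"), ("ShipDate", "date")]
draft = draft_mapping(source_msg, target_msg)
```

Elements that fall below the threshold, such as `Comment` here, are left out of the draft so that the user can map them by hand.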
4 Formal Semantics
To achieve the goal of genericity, we need a semantics for model management
that is independent of any particular data model. One approach is to define a rich data model
that includes all of the popular constructs present in data models of interest.
This would be an extended entity-relationship (EER) model of some kind, which
includes an is-a hierarchy, complex types, and the usual kinds of built-in con-
straints, such as keys and referential integrity. This is the approach used in the
Cupid prototype. Models expressed in some other data model, such as SQL DDL
or XML schema definitions (XSD [10]), are imported into the native Cupid data
model before they are matched. The resulting mapping is then exported back
into the data model of interest, such as one that can be displayed in a design
tool (e.g., for SQL or XSD). That way, Cupid can support many different data
models and, hence, many different applications.
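The import step can be pictured with a small sketch. The `Entity` and `Attribute` classes below are a hypothetical stand-in for a native data model, not Cupid's actual one. The XSD importer handles only named complex types with typed elements, and the relational importer assumes the column list has already been parsed out of the DDL.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Tuple

XS = "{http://www.w3.org/2001/XMLSchema}"   # XML Schema namespace in ElementTree notation


@dataclass
class Attribute:
    name: str
    type_name: str


@dataclass
class Entity:
    name: str
    attributes: List[Attribute] = field(default_factory=list)


def import_xsd(xsd_text: str) -> List[Entity]:
    """Turn each top-level named complexType into an Entity, one Attribute per element."""
    root = ET.fromstring(xsd_text)
    entities = []
    for ctype in root.findall(f"{XS}complexType"):
        entity = Entity(ctype.get("name"))
        for el in ctype.iter(f"{XS}element"):
            entity.attributes.append(Attribute(el.get("name"), el.get("type")))
        entities.append(entity)
    return entities


def import_table(table: str, columns: List[Tuple[str, str]]) -> Entity:
    """Turn an already-parsed table definition into the same Entity structure."""
    return Entity(table, [Attribute(name, sql_type) for name, sql_type in columns])


xsd = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="Customer">
    <xs:sequence>
      <xs:element name="Id" type="xs:int"/>
      <xs:element name="Name" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
"""

from_xsd = import_xsd(xsd)[0]
from_sql = import_table("Customer", [("Id", "INTEGER"), ("Name", "VARCHAR(40)")])
```

Once both sources are expressed as `Entity` objects, a single matcher can compare them without knowing which data model each came from.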
It would be desirable to have a carefully developed mathematical semantics
that shows that the EER data model used by Cupid, or one like it, is generic.
One way to go about this would be to demonstrate a translation of each data
model of interest into the chosen EER model and analyze properties of that
translation, such as how much semantics is preserved and lost. This approach is
likely to be enlightening, but it is still on the docket.
A second approach is to analyze models and mappings in the abstract, using
category theory [7]. A category is an abstract mathematical structure consisting
of a set of uninterpreted objects and morphisms (i.e., transformations) between
them. To apply it to models and mappings, one represents the well-formed
schemas of a data model as a category, whose internal structure is uninterpreted.
Models are objects of a schema category, and mappings between models are mor-
phisms between objects of a schema category. A mapping between two categories
is called a functor. A functor can be used to provide an interpretation of schemas;
the functor maps each schema in a schema category to a category of instances,
which is the set of all databases that conform to the schema. One can then use
5 Conclusion
Acknowledgments
The author is very grateful to the following researchers whose work is sum-
marized in this paper: Suad Alagic, Alon Halevy, Jayant Madhavan, Rachel
Pottinger, and Erhard Rahm.
References
1. Alagic, S. and P. A. Bernstein: “A Model Theory for Generic Schema Management,”
submitted for publication.
2. Bergamaschi, S., S. Castano, and M. Vincini: “Semantic Integration of Semistructured
and Structured Data Sources.” SIGMOD Record 28,1 (Mar. 1999), 54-59.
3. Bernstein, P. A.: “Panel: Is Generic Metadata Management Feasible?” Proc. VLDB
2000, pp. 660-662. Panelists’ slides available at
http://www.research.microsoft.com/~philbe.
4. Bernstein, P. A., A. Halevy, and R. A. Pottinger: “A Vision for Management of
Complex Models.” ACM SIGMOD Record 29, 4 (Dec. 2000).
5. Bernstein, P. A. and E. Rahm: “Data Warehouse Scenarios for Model Management,”
Conceptual Modelling - ER2000, LNCS 1920, Springer-Verlag, pp. 1-15.
6. Madhavan, J., P. A. Bernstein, and E. Rahm: “Generic Schema Matching Using
Cupid,” VLDB 2001, to appear.
7. Mac Lane, S.: Categories for the Working Mathematician, Springer, 1998.
8. Palopoli, L., G. Terracina, and D. Ursino: “The System DIKE: Towards the
Semi-Automatic Synthesis of Cooperative Information Systems and Data Warehouses.”
ADBIS-DASFAA 2000, Matfyzpress, 108-117.
9. Rahm, E., and P. A. Bernstein: “On Matching Schemas Automatically,” Microsoft
Research Technical Report MSR-TR-2001-17, Feb. 2001.
10. W3C: XML Schema, http://www.w3c.org/XML/schema, 2001.