1 s2.0 S131915781100019X Main
ORIGINAL ARTICLE
KEYWORDS
Data warehouse;
ETL processes;
Database;
Data mart;
OLAP;
Conceptual modeling
1. Introduction
Figure 1 The ETL process: data is extracted from the data sources, transformed in the data staging area (DSA), and loaded into the data warehouse (DW).
to achieve the warehousing process. Queries are used to represent the mapping between the source and the target data, thus allowing the DBMS to play an expanded role as a data transformation engine as well as a data store. This approach enables a complete interaction between the mapping metadata and the warehousing tool. In addition, it addresses the efficiency of a query-based data warehousing ETL tool without suggesting any graphical model, and it describes a query generator for reusable and more efficient data warehouse (DW) processing.
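As an illustration of this query-based approach, a simple generator might assemble an INSERT ... SELECT statement from mapping metadata, letting the DBMS execute the transformation itself. The sketch below is not the paper's generator; all table and column names are hypothetical.

```python
# Minimal sketch: source-to-target mappings are stored as metadata
# (target column -> source expression), and a generator turns them into
# a query the DBMS runs, acting as the transformation engine.

def generate_load_query(target, source, column_map, predicate=None):
    """Build an INSERT ... SELECT statement from mapping metadata."""
    target_cols = ", ".join(column_map.keys())
    source_exprs = ", ".join(column_map.values())
    sql = f"INSERT INTO {target} ({target_cols}) SELECT {source_exprs} FROM {source}"
    if predicate:
        sql += f" WHERE {predicate}"
    return sql

# Hypothetical mapping: one direct copy and one concatenation expression.
mapping = {
    "customer_key": "o.cust_id",
    "full_name": "o.first_name || ' ' || o.last_name",
}
print(generate_load_query("dw_customer", "src_orders o", mapping, "o.amount > 0"))
```

Because the mapping lives in metadata rather than in hand-written scripts, the same generator can re-emit the load query whenever the mapping changes, which is the reusability the authors highlight.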
3.1.1. Mapping guideline
A mapping guideline is the set of information defined by the developers in order to achieve the mapping between the attributes of two schemas. In practice, different kinds of mapping guidelines are used across many applications. Traditionally, these guidelines are defined manually during system implementation and, in the best case, are saved as paper documents. They are consulted each time there is a need to understand how an attribute of a target schema has been generated from the source attributes. This method copes poorly with the maintenance and evolution of the system: keeping the guidelines up to date is very hard, especially across different guideline versions. Updating the mapping of an attribute in the system requires a matching update to the paper guideline, so maintaining such tasks is extremely difficult, especially with simultaneous updates by different users.
3.1.2. Mapping expressions
The mapping expression of an attribute is the information needed to recognize how a target attribute is created from the source attributes. Examples of applications where mapping expressions are used include the following:

- Schema mapping (Madhavan et al., 2001): for database schema mapping, the mapping expression is needed to define the correspondence between matched elements.
- Data warehousing (ETL) tools (Staudt et al., 1999): these include a transformation process in which the correspondence between the source data and the target DW data is defined.
- EDI message mapping: EDI requires complex message translation, where data must be transformed from one EDI message format into another.
- EAI (enterprise application integration): integrating information systems and applications needs a middleware to manage the process (Stonebraker and Hellerstein, 2001). It includes management rules for an enterprise's applications, data spread rules for the applications concerned, and data conversion rules. The data conversion rules define the mapping expressions of the integrated data.
3.1.3. Mapping expression examples
Some examples of mapping expressions identified in different types of applications are:

- Break-down/concatenation: the value of a field is established by breaking down the value of a source and concatenating part of it with another value, as shown in Fig. 2.
- Conditional mapping: sometimes the value of a target attribute depends on the value of another attribute; for example, if X = 1 then Y = A else Y = B, as shown in Fig. 3.

More about mapping expression rules and notation can be found in Jarke et al. (2003) and Miller et al. (2000).
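For illustration, the two mapping expressions above can be sketched as small functions. The field values and breakdown positions below are assumptions, not taken from the figures.

```python
# Sketch of the two example mapping expressions (assumed field layout).

def break_down_concat(source_value, other_value):
    """Break a source field such as 'DD\\MM\\AA' into parts and
    concatenate a part of it with another value (cf. Fig. 2)."""
    day, month, year = source_value.split("\\")
    return day + month + "-" + other_value

def conditional_map(x, a, b):
    """Conditional mapping (cf. Fig. 3): if X = 1 then Y = A else Y = B."""
    return a if x == 1 else b

print(break_down_concat("01\\12\\99", "1234"))  # 0112-1234
print(conditional_map(1, "A", "B"))             # A
```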
ized for the regular cases of ETL processes. Thus, the classes of the template layer represent specializations (i.e., subclasses) of the generic classes of the metamodel layer (depicted as IsA relationships). After defining this framework, the authors present the graphical notation and the metamodel of their proposed graphical model, as shown in Fig. 5. They then detail and formally define all the entities of the metamodel.
Figure 2 Break-down/concatenation mapping example: parts of the source values DD\MM\AA and 1234 - XYZ are broken down and recombined.
Figure 3 Conditional mapping example: if X = 1 then Y = A else Y = B.
Figure 4 The metamodel for the logical entities of the ETL environment (Vassiliadis et al., 2003): generic metamodel-layer classes (Elementary Activity, RecordSet, Relationships, data types, functions) are specialized (IsA) by template-layer classes (Domain Mismatch, NotNull, SK Assignment, Source Table, Fact Table, Provider Rel) and instantiated (InstanceOf) at the schema layer (e.g., S1.PW, NN1, DM1, SK1, DW.PS).
Figure 5 Graphical notation of the conceptual model: Concept, Attribute, Transformation, Note, ETL_Constraint, Part Of, Candidate, Active Candidate, Provider 1:1, Provider N:M, and Serial Composition.
Figure 6 The source S2.Partsupp: recent PartSupp's are selected using SysDate.
Figure 7 Mapping the sources S1 and S2 to DW.Partsupp: surrogate-key (SK) assignment for PKey, NotNull (NN) checks, and American-to-European date conversion, with annotations capturing constraints such as "Duration < 4h" and "Due to accuracy and small size".
Figure 8 The data mapping diagram organized in four levels (database, dataflow, table, and attribute).
attribute class. The authors formally define attribute/class diagrams, along with the new stereotypes, Attribute and Contain, defined as follows:

Attribute classes are materializations of the Attribute stereotype, introduced specifically for representing the attributes of a class. The following constraints apply for the correct definition of an attribute class as a materialization of an Attribute stereotype:

- Naming convention: the name of the attribute class is the name of the corresponding container class, followed by a dot and the name of the attribute.
- Features: an attribute class can contain neither attributes nor methods.

A Contain relationship is a composite aggregation between a container class and its corresponding attribute classes, originating at the end near the container class and highlighted with the Contain stereotype.

An attribute/class diagram is a regular UML class diagram extended with Attribute classes and Contain relationships.
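These two constraints can be sketched programmatically. The classes below are illustrative stand-ins for the UML stereotypes, not part of the original profile; the container name "Customers" and its attributes are assumptions.

```python
# Sketch of the Attribute/Contain stereotypes: an attribute class is named
# '<Container>.<attribute>' and carries no attributes or methods of its own.

class AttributeClass:
    """Materialization of the Attribute stereotype."""
    def __init__(self, container_name, attribute_name):
        # Naming convention: container class name, a dot, attribute name.
        self.name = f"{container_name}.{attribute_name}"

class ContainerClass:
    """A regular class, linked to its attribute classes via Contain."""
    def __init__(self, name, attributes):
        self.name = name
        # Contain: composite aggregation from the container to each
        # of its attribute classes.
        self.contains = [AttributeClass(name, a) for a in attributes]

customers = ContainerClass("Customers", ["id", "name"])
print([a.name for a in customers.contains])  # ['Customers.id', 'Customers.name']
```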
In the data warehouse context, the mapping relationship involves three logical parties: (a) the provider entity (schema, table, or attribute), responsible for generating the data to be further propagated; (b) the consumer, which receives the data from the provider; and (c) their intermediate matching, which captures the way the mapping is done, along with any transformation and filtering. Since a mapping diagram can be very complex, this approach offers the possibility of organizing it in different levels thanks to the use of UML packages.
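A minimal sketch of this three-party relationship follows; the Mapping type, its field names, and the concrete provider/consumer names are our own illustrations, not taken from the original UML profile.

```python
# The three logical parties of a data-mapping relationship: provider,
# consumer, and the matching (transformation/filtering) between them.
from typing import Callable, NamedTuple

class Mapping(NamedTuple):
    provider: str                   # schema, table, or attribute generating data
    consumer: str                   # target entity receiving the data
    matching: Callable[[str], str]  # transformation applied in between

# Hypothetical example: an upper-casing transformation between a source
# attribute (in the SCS) and a target attribute (in the DWCS).
m = Mapping("SCS.customer.name", "DWCS.dim_customer.full_name", str.upper)
print(m.matching("alice"))  # ALICE
```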
Their layered proposal consists of four levels, as shown in Fig. 8:

- Database level (or level 0). In this level, each schema of the DW environment (e.g., the data sources at the conceptual level in the SCS, source conceptual schema; the conceptual schema of the DW in the DWCS, data warehouse conceptual schema; etc.) is represented as a package (Lujan-Mora and Trujillo, 2003; Trujillo and Lujan-Mora, 2003). The mappings among the different schemata are modeled in a single mapping package, encapsulating all the lower-level mappings among the schemata.
- Dataflow level (or level 1). This level describes the data relationships among the individual source tables of the involved schemata and the respective targets in the DW. Practically, a mapping diagram at the database level is zoomed into a set of more detailed mapping diagrams, each capturing how a target table is related to the source tables in terms of data.
- Table level (or level 2). Whereas the mapping diagram of the dataflow level describes the data relationships among sources and targets using a single package, the mapping diagram at the table level details all the intermediate transformations and checks that take place during this flow. Practically, if a mapping is simple, a single package representing the mapping can be used at this level; otherwise, a set of packages is used to segment complex data mappings into sequential steps.
- Attribute level (or level 3). At this level, the mapping diagram captures inter-attribute mappings. Practically, this means that the diagram of the table is zoomed in, and the mapping of provider attributes to consumer attributes is traced, along with any intermediate transformation and cleaning.
At the leftmost part of Fig. 8, a simple relationship between the DWCS and the SCS exists: it is captured by a single data mapping package, and these three design elements constitute the data mapping diagram of the database level (or level 0). Assuming that there are three particular tables in the DW that we would like to populate, this data mapping package abstracts the fact that there are three main scenarios for the population of the DW, one for each of these tables. At the dataflow level (or level 1) of the framework, the data relationships among the sources and the targets in the context of each scenario are modeled by the respective package. If we zoom into one of these scenarios, e.g., mapping 1, we can observe its particularities in terms of data transformation and cleaning: the data of source 1 are transformed in two steps (i.e., they undergo two different transformations), as shown in Fig. 8. Observe also that an intermediate data store is employed to hold the output of the first transformation (step 1) before it is passed on to the second one (step 2). Finally, at the lower right part of Fig. 8, the way the attributes are mapped to each other for the data stores source 1 and intermediate is depicted. Note that when modeling a complex and huge data warehouse, the attribute transformations modeled at level 3 are hidden within a package definition.
Figure 10 EMD metamodel.
If the connection fails, an error message will appear to alert the user and the application will halt. If the connection succeeds, a new database, ETL, is created. ETL plays the role of a repository in which the metadata about the EMD scenarios is stored; this metadata is later used to generate the mapping document. After creating the ETL database, the user may either create a new EMD scenario or open an existing one to complete it. When creating a new scenario, a new building area appears to enable the user to draw and build the new model. When opening an existing EMD scenario, two files are read: the first is the .etl file, from which the old scenario is loaded into the drawing area so the user can complete it; the second is the .sql file, in which the SQL script of the existing part of the scenario was written and which is extended as the user completes the model. The next module loads both the metadata about the databases found on the database management system and the EMD Builder interface icons. The metadata includes the database names, tables, attributes, and so on.
The interface icons are loaded from our icon gallery; the interface elements are shown in the next sections. The next module facilitates the drawing process, by which the user can use our palette of controls to draw and build an EMD scenario. Through the execution module, the EMD model is translated into an SQL script and then executed on the incoming data from the source databases, so that the extraction, transformation, and loading processes are applied and the desired records are transferred to the target DW schema in the required format. The last module is responsible for saving the user's EMD model. During the save operation, three files are generated: the first contains the user's EMD model in a binary format, so the user can open it at any time to update the drawing; the second contains the generated SQL script; and the third is the mapping document, which serves as a dictionary and catalog for the ETL operations found in the user's EMD scenario. The user can specify the folder in which the generated files are saved. The generated files can be transferred from one machine to another that contains the same data sources and the same target data warehouse schema; that is, the files generated by our tool are machine independent, although they remain data source and destination schema dependent. Note that although the destination is a whole schema (data warehouse or data mart), each part of this schema (fact or dimension) is handled as a standalone destination in a single EMD scenario.
5. Models evaluation and comparison
Table 1 contains the matrix used to compare the different ETL modeling approaches and to evaluate our proposed model against the others. The letter P in the matrix means that the model only partially supports the corresponding criterion.
6. Other related work
Table 1  Comparison matrix of ETL modeling approaches (P = partial support).

Criteria                               Mapping      Conceptual   UML          EMD
                                       expressions  constructs   environment

Design aspects
  Complete graphical model             No           Yes          Yes          Yes
  New constructs                       No           Yes          No           Yes
  (OO) concept independent             Yes          P            No           Yes
  DBMS independent                     Yes          Yes          Yes          Yes
  Mapping operations                   Yes          Yes          Yes          Yes
  User defined transformation          No           No           No           Yes
  Mapping relationship                 Yes          Yes          Yes          Yes
  Source independent (non-relational)  No           No           No           Yes
  Source converting                    No           No           No           Yes
  Flat model                           Yes          Yes          No           Yes

Implementation aspects
  Develop a tool                       Yes          Yes          No           Yes
  Generate SQL                         Yes          No           No           Yes
  Generate mapping document            No           No           No           Yes
  Non-relational handling              No           No           No           No

Evaluation                             7            7.5          4            13
Inmon, B., 1997. The Data Warehouse Budget. DM Review Magazine, January 1997. <www.dmreview.com/master.cfm?NavID=55&EdID=1315>.
Inmon, W.H., 2002. Building the Data Warehouse, third ed. John Wiley and Sons, USA.
Jarke, M., Lenzerini, M., Vassiliou, Y., Vassiliadis, P., 2003. Fundamentals of Data Warehouses, second ed. Springer-Verlag.
Jorg, T., Dessloch, S., 2008. Towards generating ETL processes for incremental loading. In: Proceedings of the 2008 International Symposium on Database Engineering and Applications. ACM.
Kimball, R., Caserta, J., 2004. The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming and Delivering Data. Wiley.
Kimball, R., Reeves, L., Ross, M., Thornthwaite, W., 1998. The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing, and Deploying Data Warehouses. John Wiley and Sons.
Lujan-Mora, S., Trujillo, J., 2003. A comprehensive method for data warehouse design. In: Proceedings of the Fifth International Workshop on Design and Management of Data Warehouses (DMDW'03), Berlin, Germany.
Lujan-Mora, S., Vassiliadis, P., Trujillo, J., 2004. Data mapping diagrams for data warehouse design with UML. In: International Conference on Conceptual Modeling, Shanghai, China, November 2004.
Madhavan, J., Bernstein, P.A., Rahm, E., 2001. Generic schema matching with Cupid. In: Proceedings of the 27th International Conference on Very Large Databases, pp. 49-58.
Maier, T., 2004. A formal model of the ETL process for OLAP-based web usage analysis. In: Proceedings of the Sixth WEBKDD Workshop: Webmining and Web Usage Analysis (WEBKDD'04), in conjunction with the 10th ACM SIGKDD Conference (KDD'04), Seattle, Washington, USA, August 22, 2004 (accessed 2006).
Miller, R.J., Haas, L.M., Hernandez, M.A., 2000. Schema mapping as query discovery. In: Proceedings of the 26th VLDB Conference, Cairo.
Moss, L.T., 2005. Moving Your ETL Process into Primetime. <http://www.businessintelligence.com//ex/asp/code.44/xe/article.htm> (visited June 2005).
Mrunalini, M., Kumar, T.V.S., Kanth, K.R., 2009. Simulating secure data extraction in extraction transformation loading (ETL) processes. In: Third UKSim European Symposium on Computer Modeling and Simulation (EMS'09), November 2009, pp. 142-147. ISBN: 978-1-4244-5345-0.
Munoz, L., Mazon, J.-N., Trujillo, J., 2009. Measures for ETL processes models in data warehouses. In: Proceedings of the First International Workshop on Model Driven Service Engineering and Data Quality and Security. ACM, November 2009.
Munoz, L., Mazon, J.-N., Trujillo, J., 2010. Systematic review and comparison of modeling ETL processes in data warehouse. In: Proceedings of the Fifth Iberian Conference on Information Systems and Technologies (CISTI). IEEE, August 2010, pp. 1-6. ISBN: 978-1-4244-7227-7.