A Model-Driven Framework for ETL Process Development
All content following this page was uploaded by Juan Trujillo on 22 May 2014.
such ontologies is a tedious task since it requires a high degree of correctness and a blow-by-blow description of the data stores. Therefore, we advocate the use of a rich workflow language that provides various enhancements for the designer's work without requiring the definition of any ontology.

an attempt has been made to optimize the physical ETL design through a set of algorithms [14]. Also, a UML-based physical modeling of ETL processes was introduced in [4]. This approach formalizes the logical structure of the data storage, and the ETL hardware and software configurations. Although both works deal with interesting issues adjacent to implementation, neither produces code for executing ETL processes. Paradoxically, an exclusively programmatic approach to ETL using the Python language has been proposed in [12]. Yet, this approach does not supply a vendor-independent design, which decreases the reusability and ease of use of the provided framework.

[Figure residue: metamodel layers. Layer M3: MOF, EBNF. Layer M2: Behavioral Packages, Resource Packages, Tool 4GL Behavior Grammar, Tool 4GL Expression Grammar, M2T Transformation. Layer M1: ETL Process Model, M2T Application, ETL Process Tool Code. Column labels: Vendor-Independent System, Vendor-Specific System.]

3 http://www.omg.org/spec/MOF/2.0/
4 http://www.omg.org/spec/CWM/1.1/
Also, the data from the file should be filtered so as to drop records with a null Product ID, since there is no way to validate this constraint within a file, in contrast with a table. Then, both flows from the database and the file can be merged in order to compute the target measure for the fact table, the AvgPrice. Finally, a key lookup is included to check the referential integrity between the processed data and the old data in the Category Fact table before loading.

[Figure residue, panel (a): source schemas]
Product(Product_ID, Product_Name, UPrice, Category_ID)
Category(Category_ID, Category_Name)
Product_Category(P_C_ID, Product_ID, Product_Name, UPrice)
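The flow described in this paragraph can be sketched in plain Python to make the order of the steps concrete. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, the file column P_C_ID is assumed (purely for illustration) to identify the category, and the key lookup is assumed to keep only categories whose key already exists in the fact table.

```python
from collections import defaultdict


def join_db_rows(products, categories):
    """Inner-join Product and Category rows on Category_ID."""
    known = {c["Category_ID"] for c in categories}
    return [
        {"Category_ID": p["Category_ID"], "UPrice": p["UPrice"]}
        for p in products
        if p["Category_ID"] in known
    ]


def clean_file_rows(rows):
    """Drop rows with a null Product_ID and convert P_C_ID to an integer.

    Assumption: P_C_ID in the file identifies the category.
    """
    return [
        {"Category_ID": int(r["P_C_ID"]), "UPrice": r["UPrice"]}
        for r in rows
        if r["Product_ID"] is not None
    ]


def average_price(rows):
    """Compute AvgPrice per Category_ID over the merged flow."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in rows:
        sums[r["Category_ID"]] += r["UPrice"]
        counts[r["Category_ID"]] += 1
    return {k: sums[k] / counts[k] for k in sums}


def key_lookup(avg_by_category, fact_keys):
    """Assumed lookup: keep categories whose key exists in the fact table."""
    return {k: v for k, v in avg_by_category.items() if k in fact_keys}


def run_flow(products, categories, file_rows, fact_keys):
    """Join the database inputs, clean the file input, merge, aggregate, look up."""
    merged = join_db_rows(products, categories) + clean_file_rows(file_rows)
    return key_lookup(average_price(merged), fact_keys)
```

Each function corresponds to one task of the process (join, filter with conversion, merge with aggregation, lookup), which is the granularity at which the process model describes the flow.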
3.2.1 BPMN4ETL Model (M1)

[Figure residue: example process model. Labels: Product (DB Input), Category (DB Input), Product_Category (File Input), DConversion (Conversion), P_C_Join (Join), Null_Category (Filter), P_C_Merge (Merge), Loop.]

[Figure residue: class diagram. Classes: DataProcess, DataSubProcess, DataTask, OutputSet, Condition. Attribute labels: name, taskType, type. Multiplicities: 0..1, 0..*, 1, 1..*.]
Figure 5: Excerpt of the data process package
by the DataSubProcess class. The flows incoming to and outgoing from a data task are respectively represented by the InputSet and OutputSet classes. The Expression class captures the expressions applied by a data task on its input sets in order to produce its output set.

Additional packages of our metamodel are Condition, Computation and Query, representing different types of expressions. Other packages capture fundamental aspects of data storage in ETL processes. The Resource package represents the data resources from/to which data is extracted/loaded, and the Intermediate Data Store (IDS) package represents the staging area for storing non-persistent data participating in the ETL process.

be seen in Listing 1. Creating a data process (MAPPING) in OMB first requires creating a project and defining a module in which to locate the mapping. Also, LOCATIONs for the resources used by the mapping should be established. Then,

[Listing residue: excerpt of OMB code]

    OMBALTER MAPPING 'CATEGORY_FACT_MAP' \
      ADD TABLE OPERATOR 'CATEGORY' \
      BOUND TO TABLE 'CATEGORY'

    OMBALTER MAPPING 'CATEGORY_FACT_MAP' \
      ADD FLAT_FILE OPERATOR 'PRODUCT_CATEGORY' \
      BOUND TO FLAT_FILE '/ETLTOOL/PRODUCT_CATEGORY'

    #TRANSFORMATIONS
    #JOIN
    OMBALTER MAPPING 'CATEGORY_FACT_MAP' \
      ADD JOINER OPERATOR 'P_C_JOIN' \
      SET PROPERTIES (JOIN_CONDITION) \
      VALUES ('INGRP1.CATEGORY_ID = INGRP2.CATEGORY_ID')

    #CONVERSION
    OMBALTER MAPPING 'CATEGORY_FACT_MAP' \
      ADD EXPRESSION OPERATOR 'CONVERSION' \
      ALTER ATTRIBUTE 'P_C_ID' OF GROUP 'INGRP1' OF OPERATOR 'CONVERSION' \
      SET PROPERTIES (DATATYPE,PRECISION,SCALE) \
      VALUES (INTEGER,30,0) \
      ...

    #FILTER
    OMBALTER MAPPING 'CATEGORY_FACT_MAP' \

    #ADD CONNECTIONS BETWEEN TASKS
    OMBALTER MAPPING 'CATEGORY_FACT_MAP' \
      ADD CONNECTION FROM GROUP 'OUTGRP1' \
      OF OPERATOR 'JOIN' TO GROUP 'INGRP1' \
      OF OPERATOR 'MERGE' BY NAME
    ...
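Because the OMB statements in the listing follow a regular shape per operator type, a model-to-text step can be sketched as a small dispatch on the task type. The Python sketch below is an assumption-laden illustration, not the Acceleo transformation used by the framework: the class DataTask, the property keys "table" and "condition", and all function names are hypothetical, and only statement shapes that appear in the listing are emitted.

```python
from dataclasses import dataclass, field


@dataclass
class DataTask:
    """Hypothetical in-memory stand-in for a data task of the model."""
    name: str
    task_type: str                      # here: "TABLE" or "JOINER"
    properties: dict = field(default_factory=dict)


def _statement(mapping, clauses):
    """Assemble one OMBALTER statement with backslash line continuations."""
    lines = [f"OMBALTER MAPPING '{mapping}'"] + [f"  {c}" for c in clauses]
    return " \\\n".join(lines)


def add_table(mapping, task):
    """Emit the table-operator statement shape shown in the listing."""
    return _statement(mapping, [
        f"ADD TABLE OPERATOR '{task.name}'",
        f"BOUND TO TABLE '{task.properties['table']}'",
    ])


def add_joiner(mapping, task):
    """Emit the joiner-operator statement shape shown in the listing."""
    return _statement(mapping, [
        f"ADD JOINER OPERATOR '{task.name}'",
        "SET PROPERTIES (JOIN_CONDITION)",
        f"VALUES ('{task.properties['condition']}')",
    ])


# One generator function per task type, selected by the task's type.
TEMPLATES = {"TABLE": add_table, "JOINER": add_joiner}


def generate(mapping, tasks):
    """Generate the statements for all tasks of one mapping."""
    parts = []
    for task in tasks:
        if task.task_type not in TEMPLATES:
            raise ValueError(f"no template for task type {task.task_type!r}")
        parts.append(TEMPLATES[task.task_type](mapping, task))
    return "\n\n".join(parts)
```

Supporting another operator type, or another target tool, would then amount to registering further functions in the dispatch table rather than changing the model.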
[Figure residue: abstract template pattern. Templates shown: addVGTask, addSeqTask, addSubprocess, matchDTasks, addVDTask, addJTask, addDOTask, useLocation, ConvertCond., ConvertQuery, ConvertComput. Legend entries: template relationships; template execution order; "Use": a template is replaced by the used template, with creation of the current template; "Inherent": a template is replaced without creating the current template.]
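The conversion of a combined condition, which this section describes as recursive over a tree structure, can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the Acceleo template itself: the classes SimpleCondition and CombinedCondition and the function convert_combined are hypothetical names, and the parenthesization of nested conditions is a choice made here for readability.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class SimpleCondition:
    """Hypothetical leaf of the condition tree: left operator right."""
    left: str
    operator: str
    right: str


@dataclass
class CombinedCondition:
    """Hypothetical inner node: a simple condition combined with a subtree."""
    left: SimpleCondition
    connector: str          # e.g. "AND" or "OR"
    right: Union["SimpleCondition", "CombinedCondition"]


def convert_condition(cond: SimpleCondition) -> str:
    """Render a simple condition as an SQL-like expression."""
    return f"{cond.left} {cond.operator} {cond.right}"


def convert_combined(cond) -> str:
    """Render the left part with the simple template and recurse on the right."""
    if isinstance(cond, SimpleCondition):
        return convert_condition(cond)
    right = convert_combined(cond.right)
    if isinstance(cond.right, CombinedCondition):
        right = f"({right})"    # keep the nesting explicit in the output
    return f"{convert_condition(cond.left)} {cond.connector} {right}"
```

The recursion mirrors the tree: the left child is always a simple condition, so only the right child can lead to a further call of the combined-condition conversion.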
the necessary code to be generated by the transformation. For example, the abstract template addDProcess depicted in Fig. 6 is implemented by the Acceleo template addDProcess, an excerpt of which is defined in Listing 3 and can be seen in the Acceleo editor in Fig. 7.

Figure 7: Acceleo implementation

Concretely, the template addDProcess creates a new data process and calls the templates for creating all the entailed data tasks. Indeed, depending on the type of the operator taskType, a specific template is applied, e.g., addJoTask and addMeTask, which add a join task and a merge task, respectively. The final template is a recursive template responsible for transforming a combined condition into the expression language of OMB, i.e., SQL code. The combined condition is represented in our metamodel in a recursive tree structure. Following this structure, the transformation template calls the simple condition template convertCondition for the left part of the generated code, and calls the combined condition template again for its right part.

When running the resulting OMB code generated by the transformations of our example and then opening the OWB Design Center, it is possible to verify the conformance of the resulting process with the initial ETL model.

It is worth mentioning that some custom templates, not mentioned in the abstract template pattern, may be added during the implementation. These templates differ among ETL tools, so they need to be specified for each target tool. Further, the generated code is closely linked to how the transformations are programmed. In fact, the transformation algorithm through the abstract template pattern presented in Section 4 is the most direct way to create the ETL code. However, more optimized code can be generated by using specific ’macros’ provided by tools. For instance, the tasks of type Lookup and DB Input in the example process in Fig. 4 can be designed differently. Both tasks can be combined and mapped to the cube operator in the OWB, instead of a direct mapping to a lookup operator and a table operator, as proposed by our approach.

6. CONCLUSIONS AND FUTURE WORK
Even though data warehouses have been used since the early 1990s, the design and modeling of ETL processes is still accomplished in a vendor-dependent way, by using tools that allow them to be specified according to the schemas represented in concrete platforms. Therefore, their design, implementation and, even more, their maintenance must be done depending on the target platform.

In this paper, we have provided, to the best of our knowledge, the first modeling approach for the specification of ETL processes in a vendor-independent way and the automatic generation of the corresponding code for commercial platforms. This is done thanks to the expressiveness of our BPMN-based metamodel and to the model-driven development capabilities for code generation provided by Eclipse and Acceleo.

Our approach considers Oracle and Microsoft as representative target implementation and execution tools, although only Oracle’s solution has been described in this paper for space reasons. Moreover, it offers some new techniques to guide developers in building other code generators for new platforms.

Currently, our framework covers the design and implementation phases of ETL process development. Future work will extend this framework in order to handle the whole ETL development life-cycle, i.e., by involving the analysis phase as well. This work should enhance the quality of the code generated by our approach by means of: i) enhancing the transformation implementation using some existing knowledge-based techniques, and ii) building the necessary metrics for proposing the ’best’ implementation related to a certain ETL process.

7. REFERENCES
[1] J. Bézivin. On the unification power of models. Software and System Modeling, 4(2):171–188, 2005.
[2] Z. El Akkaoui and E. Zimányi. Defining ETL workflows using BPMN and BPEL. In DOLAP’09, pages 41–48. ACM Press.
[3] W. Inmon. Building the Data Warehouse. Wiley, 2002.
[4] S. Luján-Mora and J. Trujillo. Physical modeling of data warehouses using UML. In DOLAP’04, pages 48–57, 2005. ACM Press.
[5] J. Mazón and J. Trujillo. An MDA approach for the development of data warehouses. Decision Support Systems, 45(1):41–58, 2008.
[6] A. Simitsis. Mapping conceptual to logical models for ETL processes. In DOLAP’05, pages 67–76, 2005. ACM Press.
[7] A. Simitsis and P. Vassiliadis. A methodology for the conceptual modeling of ETL processes. In CAiSE’03 Workshops, pages 305–316, 2003. CEUR Workshop Proceedings.
[8] A. Simitsis and P. Vassiliadis. A method for the mapping of conceptual designs to logical blueprints for ETL processes. Decision Support Systems, 45(1):22–40, 2008.
[9] D. Skoutas and A. Simitsis. Designing ETL processes using semantic web technologies. In DOLAP’06, pages 67–74, 2005. ACM Press.
[10] D. Skoutas and A. Simitsis. Ontology-based conceptual design of ETL processes for both structured and semi-structured data. International Journal on Semantic Web and Information Systems, 3(4):1–24, 2007.
[11] D. Skoutas, A. Simitsis, and T. Sellis. Ontology-driven conceptual design of ETL processes using graph transformations. In Journal on Data Semantics XIII, number 5530 in LNCS, pages 122–149. Springer, 2009.
[12] C. Thomsen and T. Pedersen. pygrametl: A powerful programming framework for extract-transform-load programmers. In DOLAP’09, pages 49–56. ACM Press.
[13] J. Trujillo and S. Luján-Mora. A UML based approach for modeling ETL processes in data warehouses. In ER’03, LNCS 2813, pages 307–320, 2003. Springer.
[14] V. Tziovara, P. Vassiliadis, and A. Simitsis. Deciding the physical implementation of ETL workflows. In DOLAP’07, pages 49–56, 2007. ACM Press.
[15] P. Vassiliadis, A. Simitsis, and E. Baikous. A taxonomy of ETL activities. In DOLAP’09, pages 25–32. ACM Press.
[16] P. Vassiliadis, A. Simitsis, P. Georgantas, M. Terrovitis, and S. Skiadopoulos. A generic and customizable framework for the design of ETL scenarios. Information Systems, 30(7):492–525, 2005.
[17] P. Vassiliadis, A. Simitsis, and S. Skiadopoulos. Conceptual modeling for ETL processes. In DOLAP’02, pages 14–21, 2002. ACM Press.
[18] L. Wyatt, B. Caufield, and D. Pol. Principles for an ETL benchmark. In TPCTC’09, LNCS 5895, pages 183–198, 2009. Springer.