Using Semantic Web Technologies For Exploratory OLAP: A Survey
Abstract—This paper describes the convergence of some of the most influential technologies in the last few years, namely
data warehousing (DW), On-Line Analytical Processing (OLAP), and the Semantic Web (SW). OLAP is used by enterprises to
derive important business-critical knowledge from data inside the company. However, the most interesting OLAP queries can no
longer be answered on internal data alone; external data must also be discovered (most often on the Web), acquired, integrated,
and (analytically) queried, resulting in a new type of OLAP: exploratory OLAP. When using external data, an important issue is
knowing the precise semantics of the data. Here, SW technologies come to the rescue, as they allow semantics (ranging from
very simple to very complex) to be specified for web-available resources. SW technologies do not only support capturing the
“passive” semantics, but also support active inference and reasoning on the data. The paper first presents a characterization of
DW/OLAP environments, followed by an introduction to the relevant SW foundation concepts. Then, it describes the relationship
of multidimensional (MD) models and SW technologies, including the relationship between MD models and SW formalisms.
Next, the paper goes on to survey the use of SW technologies for data modeling and data provisioning, including semantic
data annotation and semantic-aware extract, transform, and load (ETL) processes. Finally, all the findings are discussed and a
number of directions for future research are outlined, including SW support for intelligent MD querying, using SW technologies
for providing context to data warehouses, and scalability issues.
Index Terms—Business Intelligence, Data Warehousing, OLAP, ETL, Semantic Web, Reasoning
performed in a "closed-world" scenario based only on internal data, an essential part of Exploratory OLAP is to discover, acquire, integrate, and analytically query new external data.

The Semantic Web (SW) has been conceived as a means to build semantic spaces over Web-published contents so that Web information can be effectively retrieved and processed by both humans and machines for a great variety of tasks.

A recent paper [5] introduced the concept of fusion cubes: cubes that, based on a core of internal multidimensional data, gradually merge in (fuse with) external data in order to support self-service BI. The paper provides a motivating example, which in our view captures the essence of exploratory OLAP well and shows why SW technologies are needed in this scenario. The example concerns a group of concerned citizens (watchdogs) who want to monitor whether the fishing catches being landed in the various EU countries respect the overall limits set up by the EU marine protection schemes and how they are related to marine protection areas. The watchdogs want to analyze the data by Time, Location, and Species, where each of these three dimensions should be organized into a hierarchy of levels, e.g., Day-Week-Month-Year, Port-Province-Country-Region, and Subspecies-Species-Family. To do this, they must integrate statistical catch data (in a flat tabular format) with geographical data about marine protection areas (from a public database, in SW format), fish population data (from various research databases, in a multitude of formats ranging from comma-separated files to SW data), and finally with ontology data describing geo and species hierarchies (in SW formats).

Reasoning capabilities are needed to perform the complex integration and resolve conflicts, e.g., contradicting catch data or species classifications. In our view, SW technologies are powerful enough both to model all these different types of data and to provide the needed reasoning capabilities on top.

Several industrial OLAP tools already use data semantics to some extent. A notable example is the TARGIT Decision Suite (formerly TARGIT BI Suite) [6], which uses extended semantics to do so-called "meta-morphing". The TARGIT meta-morphing model extends the traditional multidimensional model with associations between measures and dimensions/levels, and records and learns from the users' behavior, e.g., which combinations of measures and dimensions/levels are used, which types of charts are used to display results, etc. This enables users to ask questions that are in some sense "incomplete", using the semantics and learned user preferences to fill in the gaps, thus enabling easier and more intuitive interaction. However, the semantics are captured in a closed, internal format and are applicable only to already known internal cube data.

To enable Exploratory OLAP, there is thus a great need for capturing semantics in an open and powerful way that can apply seamlessly across both internal and (newly discovered) external data. We believe that SW technologies are a good choice for this.

• To support the discovery of relevant data, it is essential that the meaning of the data is "declared" in an accurate, rich, and unambiguous way, so the right data can be found. The rich ontology languages found in the SW are ideal for this.
• To support data acquisition, external sources must be queried in a precise, yet efficient, way, to avoid having to download complete large data sets, which are perhaps never used again, and where most of the content is irrelevant to the particular query. SW query languages and technologies, such as SPARQL and SPARQL endpoints, serve this purpose very well (see the sketch after this list).
• To support data integration, facilities must be provided to resolve conflicts in the data, to combine data from many different formats and sources, and to structure data in a multidimensional format. Again, SW technologies such as reasoning provide a powerful foundation for this.
• To support analytical querying, measure data must be aggregated along the OLAP dimensions. Both measures and dimensions (hierarchies and levels) can now be based on external data, and it is thus very important to capture the exact semantics of the multidimensional data and its lineage, i.e., providing not just a result, but also a precise specification of its meaning and where it came from. Again, SW technologies have the rich modeling constructs to support this.
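As a concrete illustration of the acquisition point above, the following minimal sketch pulls only the relevant slice of catch statistics from a SPARQL endpoint instead of downloading the whole data set. It is not taken from any of the surveyed systems; the endpoint URL and the ex: vocabulary (catch observations with port, species, and quantity) are hypothetical.

```python
# Minimal sketch: acquire only the needed external data via a SPARQL endpoint.
# The endpoint URL and the ex: vocabulary are hypothetical placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://example.org/sparql"  # hypothetical catch-statistics endpoint

query = """
PREFIX ex: <http://example.org/fishery#>
SELECT ?port ?species (SUM(?quantity) AS ?totalCatch)
WHERE {
  ?obs a ex:CatchObservation ;
       ex:port ?port ;
       ex:species ?species ;
       ex:quantity ?quantity ;
       ex:year 2012 .
}
GROUP BY ?port ?species
"""

client = SPARQLWrapper(ENDPOINT)
client.setQuery(query)
client.setReturnFormat(JSON)
results = client.query().convert()

# Each binding is one (port, species) cell of the requested cube slice.
for row in results["results"]["bindings"]:
    print(row["port"]["value"], row["species"]["value"], row["totalCatch"]["value"])
```

Only the aggregated bindings are transferred; the meaning of ex:CatchObservation and its properties would be declared in the publisher's ontology, which is exactly the kind of "declared" semantics the discovery step relies on.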
Thus, the goal of this paper is to survey how SW technologies can aid in data discovery, acquisition, integration, and analytical querying of external data, and thus serve as a foundation for Exploratory OLAP.

We note that a solid foundation for Exploratory OLAP is also ideal for the wider scenario of "Exploratory BI", where deep analytics and data mining are performed on the exploratory OLAP cubes. However, considering the full range of Exploratory BI is beyond the scope of this paper.

As DW mainly involves the integration of disparate information sources, semantic issues are highly important for effectively discovering and merging data. These semantic issues are similar to those faced in the SW. As a consequence, SW technologies have recently been applied to some DW tasks such as Extract, Transform, and Load (ETL) processes, MD design and validation, and so on. Although they are usually limited to Traditional OLAP scenarios (see for example [7]), we will show that SW technologies can also be useful in highly heterogeneous and open scenarios.

The main purpose of this paper is twofold: a) to survey and categorize how SW technologies have been applied to solve the new requirements of Exploratory OLAP systems, and analyze the associated
feasibility and benefits; and b) to use the analysis to identify future challenges where the potential to use SW technologies for Exploratory OLAP is high, but current technologies are lacking, i.e., a "position paper" approach.

To classify current approaches for the survey, we define five separate categorization criteria (i.e., Materialization, Transformations, Freshness, Structuredness, and Extensibility) and show that along all five dimensions there are challenges to overcome. It is important to notice that the first three criteria are related to data provisioning, while the latter two are more related to the data sources and the schema design. Thus, we will below survey and analyze how SW technologies are applied by current work to solve both data schema design and data provisioning requirements.

Moreover, the feasibility of applying SW technologies needs to be analyzed because adding semantics generally increases system complexity. In order to analyze the classical trade-off between expressiveness and complexity, in the context of each particular work, we have used three more criteria specifically related to SW (namely Reasoning, Computation, and Expressiveness).

We believe that these eight dimensions allow us to (1) cover the most relevant aspects of the usage of SW technologies towards Exploratory OLAP, and (2) separate different issues to facilitate the analysis and solution. Aspects that we find less relevant for this purpose, and thus do not consider, include the type of data storage (relational, NoSQL, etc.), any distribution or parallelism in data storage or computations, the specific types of analysis performed (OLAP only, or also specific types of data mining, etc.), and the specific technologies and systems used.

Our main conclusion from the survey part is that SW technologies are a promising way to approach the involved semantic integration issues. Although contributions and problems in data provisioning are more relevant than for modeling, there are fewer approaches in the former, the reason probably being not only the difficulty, but also that the relationship between data provisioning and SW is not mature enough. For the expressiveness/complexity tradeoff, our main conclusion is that researchers tend to use ad hoc algorithms on top of more mature standard services and maintain a medium level of expressiveness with easier computation at the expense of incomplete (but sound) inferences. In the position part of the paper, we list challenges related to schema design, data provisioning, as well as semantic and computational issues. The main contributions of this paper can be summarized as:
1) Propose a set of five novel criteria to categorize DW/OLAP systems,
2) Analyze how these criteria affect the need for semantics and the feasibility of the design and data provisioning processes,
3) Analyze how semantic-aware reasoning techniques can aid,
4) Survey and categorize existing DW/OLAP work according to the five criteria and the three reasoning criteria, and
5) Identify research trends in this area.

The paper is organized as follows. Section 2 introduces the basic concepts on DW and SW, and describes the categories used in the comparison of the different papers. Section 3 introduces the survey part of the paper by presenting a summarized comparison of the different approaches reviewed. The following two sections present the details of the survey: Section 4 surveys traditional and new features of data schema design and Section 5 does the same for data provisioning processes. Section 6 provides the position part of the paper through a global discussion of the main issues of the paper and potential research lines to address them. Finally, conclusions and future research directions are provided in the last section.

2 METHODOLOGY

Nowadays, a new trend of OLAP work has emerged, which applies SW technologies mainly to address data integration issues and the automation of data processing. The purpose of this paper is to categorize the
main requirements of these new OLAP approaches, as well as to show how SW technologies can help to fulfill the new requirements.

As there are many papers proposing a large variety of system features, in this section we present a methodology that guides this survey and produces a clear picture of this intricate area.

We first present the characteristics of Traditional OLAP use cases to frame the area of interest of our survey. Then, five criteria related to the different relevant aspects of DW/OLAP systems are defined. By means of these criteria, in the rest of the paper, current approaches are categorized. Furthermore, the five criteria define a space that allows us to locate Exploratory OLAP use cases and to distinguish them from Traditional OLAP use cases. In addition, we use another three criteria related to expressiveness, reasoning and complexity, to characterize existing work with regard to SW technologies.

2.1 The structure of OLAP systems

OLAP technology is aimed at gathering, transforming and summarizing available data from existing sources to generate analytical information suitable for decision-making tasks. Traditionally, OLAP has been associated with data warehouses (DW), following the three-layered structure shown in Fig. 1, namely:
• the data sources layer, which consists of all the potential data of any nature (e.g., relational, object-oriented, semi-structured, and textual) that can help to fulfill the analysis goals,
• the integration layer, which transforms and cleanses the data gathered from the sources, as well as stores them in an appropriate format for the subsequent analysis (i.e., the DW), and
• the analysis layer, which contains a number of tools for extracting information and knowledge from the integrated data and presenting it to the analysts (i.e., OLAP cubes, charts, reports, etc).

As is clear from this description, the integration model of Traditional OLAP systems (DW/OLAP) is based on a global schema (i.e., the DW schema), which is seen as a view over the underlying data source schemas (usually known as Global as View, or GaV for short). In this integration model, query answering is simple. The external data sources are (implicitly) assumed to be known in advance, as are the user needs guiding the design of the global schema. This works well when the sources and requirements are indeed known in advance, but encounters problems when this does not occur. For those cases, more flexible integration models are needed. In particular, the integration of external data schemas in terms of a global schema (often in the form of a global domain ontology) has been studied (see [8]). From the global schema, local schemas can be derived; i.e., the local schemas are seen as (more specialized) views of the unified general global schema. The resulting integration model (usually known as Local as View, or LaV for short) is thus highly extensible, at the expense of considerably more complicated query answering. Therefore, in this integration model the reasoning power of SW technologies is especially needed.

DW/OLAP systems use a special data model, the multidimensional data model (MD), for the integration layer. Here, factual data gathered from the data sources layer must be expressed in terms of numerical measures and categorical hierarchical dimensions. The semantics of this model consists of representing any interesting observation of the domain (i.e., measures) in its context (i.e., dimensions). The typical processes in charge of transforming data from the data sources layer to the integration layer are called ETL processes. In some cases (e.g., to enable fast loading and querying at the expense of delivering only partly cleansed/transformed data at first), the order of the steps is switched, or interleaved, leading to Extract, Load, Transform (ELT) and Extract, Transform, Load, Transform (ETLT), where the transformations are (partially) delayed to provide fresh but less refined data. Taking this to the extreme, we have Extract, Transform, Query (ETQ), which delays transformations to the last minute and serves data directly to the user on demand. Thus, given an MD query derived from a particular analysis goal, an ETQ process directly extracts the required information from the data sources and transforms it to fit into the OLAP results.

ETQ processes are becoming essential for performing analyses that involve external data published on the Web, and therefore they usually deal with semi-structured, streamed and dynamic data sources (e.g., [9], [10]). Fig. 1 shows how ETQ processes can interact with the DW/OLAP data flow. Thus, an ETQ process can take "fresh" data from the ETL staging area, blend it with both external and DW integrated data (e.g., dimension hierarchies), and eventually deliver the results to the analytical tools. Notice that ETQ processes can also live apart from traditional DW/OLAP, thus avoiding the need to load the integrated data into a DW. It can be noticed that Exploratory OLAP systems are tightly related to ETQ processes.
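To ground the idea, the following sketch shows an ETQ-style flow in which no warehouse is loaded: the data needed for one MD query is extracted from a (hypothetical) SPARQL endpoint, lightly transformed, and served directly to the analyst. This is an illustrative sketch, not a design proposed by any of the surveyed papers; the endpoint, the vocabulary, and the port-to-country roll-up mapping are assumptions.

```python
# ETQ sketch: Extract via SPARQL, Transform lightly on the fly, Query/serve on demand.
# Endpoint, vocabulary, and the port-to-country mapping are hypothetical.
from collections import defaultdict
from SPARQLWrapper import SPARQLWrapper, JSON

PORT_TO_COUNTRY = {"Esbjerg": "Denmark", "Vigo": "Spain"}  # tiny stand-in dimension hierarchy

def etq_total_catch_by_country(endpoint: str, year: int) -> dict:
    """Extract catch facts for one year, roll ports up to countries, serve the result."""
    client = SPARQLWrapper(endpoint)
    client.setQuery(f"""
        PREFIX ex: <http://example.org/fishery#>
        SELECT ?port ?quantity WHERE {{
            ?obs a ex:CatchObservation ; ex:port ?port ;
                 ex:quantity ?quantity ; ex:year {year} .
        }}""")
    client.setReturnFormat(JSON)
    bindings = client.query().convert()["results"]["bindings"]

    totals = defaultdict(float)          # Transform: only a lightweight roll-up
    for b in bindings:
        country = PORT_TO_COUNTRY.get(b["port"]["value"], "Unknown")
        totals[country] += float(b["quantity"]["value"])
    return dict(totals)                  # served directly; nothing is loaded into a DW

# Example (hypothetical endpoint):
# print(etq_total_catch_by_country("http://example.org/sparql", 2012))
```

Note that the transformation here is of the lightweight kind; anything heavier (e.g., cleansing or slowly changing dimensions) would push the flow back towards a traditional ETL design.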
SW technologies can help in all DW/OLAP layers in order to support Semantic-aware and Exploratory OLAP systems. In the data sources layer, they can aid in capturing the precise semantics of the data sources. In the integration layer, they can be used to specify the transformations and capture the data lineage. In the analysis layer, they can help specify the semantics of the presented information and reason about it. Finally, SW technologies can serve as a proper basis for defining ETQ processes, since most external data is now being published as linked data (see [10], [11]). As far as we know, the number of approaches that use SW technologies in the first two layers is large, whereas there are very few proposals that apply them
to the third layer and ETQ processes. For this reason, our primary focus is to investigate the main issues of the first two components. More specifically, in this survey we analyze how SW technologies can aid in the integration and provisioning of data, and in the MD schema design of OLAP systems.

Fig. 2. OLAP Systems Categorization Criteria (the five axes and their values: Materialization: Full, Partial, Result Keeping, Virtual; Transformations: Complex, Partition-tolerant, Lightweight; Freshness: Periodic, Micro-batches, On demand, Right-time, Streaming; Structuredness: Structured, Semi-structured, Unstructured; Extensibility: Static, Evolving, Dynamic)

2.2 OLAP use cases categorization criteria

In this section we propose a series of criteria that aim to capture the main aspects of emerging DW/OLAP systems. The criteria aim to reflect the main components of a traditional DW/OLAP system, and how these components are evolving to cover the new requirements posed by the new scenarios (e.g., social networks, linked data and big data). The resulting categorization schema aims to identify the commonalities and differences of emerging approaches in terms of the changes they propose with respect to the traditional components of DW/OLAP systems, and is the result of long discussions and analysis of some previous work around the new scenarios for OLAP (e.g., fusion cubes in [5]).

The proposed categorization of DW/OLAP systems relies on five criteria. These criteria are generally orthogonal, but for certain types of DW/OLAP systems there will be correlations between them, as discussed below. The criteria are of a functionality-only nature (i.e., independent of the underlying technologies used to provide them). The first three criteria are related to data provisioning, while the latter two are closer to the data sources and the data schema design. The criteria are shown in Fig. 2. Generally, the center of the figure represents Traditional use cases, whereas the outer rim holds use cases that are Exploratory or otherwise somehow harder in terms of the required reasoning computation. SW technologies can provide large benefits even in the central area, where Semantics-aware Traditional systems/use cases are located, but in the complex cases found at the outer rim, where Exploratory requirements appear, having sufficient semantics and reasoning power is essential.

2.2.1 Materialization

Starting from the top of the figure, we first find Materialization. This criterion concerns the level of materialization of the integrated data. In Traditional DWs, all the integrated data is fully materialized (i.e., Full), often including a so-called data staging area for performing transformations and cleansing. At the other extreme, Virtual DWs extract data from sources at query time, integrate them on the fly, return the result to the user, and then throw away the integrated data. Notice that the ETQ processes described in the previous section fall in this category. A compromise, where some data is materialized while other data, typically data with many changes, is extracted at query time, is sometimes used (see [12]). Closer to the Virtual DW, the Result Keeping approach first extracts data on demand from the sources and computes the result on the fly (e.g., for displaying in a dashboard), but then stores/keeps the results to allow repeated requests for the same result to be delivered quickly (see [9]). Complex ETL flows may actually have subparts each residing in different categories (i.e., Partial). For example, it is common to have an "on-line" flow that performs fast, but less thorough, on-the-fly integration in main memory for immediate use, while a parallel "off-line" flow performs more thorough integration for historical use and stores all data persistently (see [13]). Here, SW technologies can be used to describe the data and the results, as well as the steps in between.

2.2.2 Transformations

Proceeding clockwise, the next criterion is Transformations. This criterion concerns the level of transformations applied to the source data during the integration process. In Traditional DWs, it is common to apply many Complex and significant transformations, e.g., creating versions of data (i.e., Slowly Changing Dimensions, SCDs, in [14]), significant cleansing, computing holistic aggregates, etc. At the other end of the spectrum, some use cases demand only Lightweight transformations that can be done quickly on the fly (even for streaming data), e.g., moving averages, simple and approximate aggregations, renaming/removing columns, etc. As mentioned above, such light data quality improvement and integration is sometimes complemented with a parallel flow performing complex transformations for later use. As a middle category, some systems apply transformations that are non-simplistic, but Partition-tolerant and thus parallelizable, such as categorizing values, etc. SW technologies provide a powerful framework for describing the transformations and for managing the lineage of results through these.
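As a small, hedged illustration of how SW technologies can describe a transformation and the lineage of its results, the sketch below records that a derived (categorized) value was generated by a specific transformation activity from a given source record, using the W3C PROV-O vocabulary in rdflib. The URIs and the categorization rule are invented for the example, not taken from the surveyed work.

```python
# Sketch: recording transformation lineage with PROV-O (URIs are illustrative only).
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/etl#")
g = Graph()

source = URIRef("http://example.org/source/catch-row-42")       # raw source record
derived = URIRef("http://example.org/dw/catch-category-42")     # value produced by the transformation
activity = URIRef("http://example.org/etl/run-1/categorize-quantity")

g.add((activity, RDF.type, PROV.Activity))
g.add((derived, PROV.wasGeneratedBy, activity))   # which transformation produced it
g.add((derived, PROV.wasDerivedFrom, source))     # which source record it came from
g.add((activity, EX.rule, Literal("quantity >= 1000 -> 'large catch'")))

print(g.serialize(format="turtle"))
```

A later query over such a graph can answer both what a derived figure means and where it came from, which is the kind of lineage information discussed above.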
2.2.3 Freshness

The next criterion is Freshness, which concerns how often the data integration process is performed (i.e., how often the DW is refreshed). Traditional DWs were refreshed Periodically (e.g., daily, in batch mode). A variation of this is Micro-batches, where the refreshment is run often (e.g., every 15 or 30 minutes) on the smaller batch of data accumulated in that period. Other DWs (e.g., the Virtual DWs mentioned above) refresh the data On demand, when requested by users. More recently, there has been a trend to refresh the DW even more frequently (e.g., with propagation delays of at most a minute or so); this is often referred to as near-real-time (or Right-time) DWs. Finally, we can have Streaming data arriving at very high rates, like thousands of items per second, and handled more or less immediately: the so-called data stream approach. Thus, this criterion is somewhat related to Materialization. For Full Materialization, all levels of Freshness make sense, and the same applies for the specific data covered by Partial Materialization. The categories Result Keeping and Virtual are tied to On demand Freshness. This criterion can be heavily affected by the usage of SW technologies (e.g., for formally stating the freshness and the associated criteria and policies).
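As an example of the kind of formal statement hinted at above, a source could publish its own refresh behaviour as RDF metadata, which an Exploratory OLAP system could read when deciding how fresh its answers can be. This is only a sketch under assumptions: the vocabulary choice (DCAT plus Dublin Core terms) and the dataset URI are ours, not something prescribed by the surveyed work.

```python
# Sketch: a data source describing its own refresh policy in RDF (illustrative URIs).
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import DCTERMS, RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")
g = Graph()
dataset = URIRef("http://example.org/datasets/eu-catch-statistics")

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.accrualPeriodicity, Literal("P1D")))   # refreshed daily (ISO 8601 duration)
g.add((dataset, DCTERMS.modified, Literal("2015-06-01")))      # last refresh

# A consumer can query this metadata before deciding whether on-demand extraction is needed.
for _, _, freq in g.triples((dataset, DCTERMS.accrualPeriodicity, None)):
    print("Declared refresh period:", freq)
```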
2.2.4 Structuredness

The next criterion is Structuredness, which concerns which types of data are found in the data sources or, more specifically, how Structured the least structured type of source data is. In Traditional OLAP cases, all sources consist of structured data, typically relational tables or, in a few cases, structured spreadsheets. More recently, Semi-structured data sources such as XML, RDF and OWL have become more common. Lately, Unstructured data such as text files, HTML files, and others like e-mails or tweets (part of the "big data" movement) have become important sources. SW technologies can be used for all three kinds, but their power is especially necessary to manage the more complex needs of semi- and unstructured data.

2.2.5 Extensibility

The next and last criterion is Extensibility. This criterion concerns how Dynamic the set of data sources can be, i.e., how easily new data sources could be brought into the system. In Traditional DW/OLAP, the same (mostly internal) Static data sources are used over and over, and new sources are only brought in at new major DW releases (i.e., at most a few times per year). Recently, there has been a trend to include new data sources, often from external data suppliers, into an existing DW more often in order to answer new questions, making the source set Evolving. In this context, an Evolving system is able to adapt its MD schemas to evolving data sources as well as user requirements. These systems have been widely studied in the literature (see reviews from [15], [16], [17]), and they can be considered within the traditional DW/OLAP boundaries. Finally, some cases can be very Exploratory, looking for new sources most of the time in order to answer specific, but constantly changing, questions, making the source set completely Dynamic. This criterion should be seen as how easy it is to evolve the schema, more than how often it is actually done, which does not depend on the technologies used, but only on the business needs. While it makes good sense to use SW technologies to describe even rather static DWs, it becomes essential to have a powerful semantics and reasoning framework for the Dynamic case.

2.2.6 Discussion

DW/OLAP systems take many forms, but we see two separate "rings" in the five-dimensional space emerging, with many possible stands in between. The inner ring (core) is Traditional OLAP, which is well understood and aims at answering a rather static and well-defined set of questions mostly on structured data. The outer ring is what we will call Exploratory OLAP, which aims at answering new and constantly changing questions on a much wider range of data. We note that a number of similar terms have been suggested for the broader case of BI systems (covering not only DW/OLAP, but also analytics/data mining), including "live BI" in [18], "on-demand BI" in [19], "ad-hoc BI" in [20], "open BI" in [21], "situational BI" in [22], or lately "fusion cubes" in [5]. However, given our focus on (multidimensional) data modeling and data acquisition, we think that the term Exploratory OLAP is more precise and better captures its essence. The most important criterion for distinguishing between them is Extensibility. Traditional OLAP cases have a Static set of data sources, while Exploratory OLAP often, or all the time, brings in new data sources, making them Evolving or Dynamic. For the Structuredness criterion, Traditional OLAP tends to use mostly Structured data, while Exploratory cases also use Semi-structured and Unstructured data sources (e.g., text, social media data, etc). For the Materialization criterion, Traditional OLAP typically uses a Materialized DW, including an intermediate materialized data staging area. In contrast, Exploratory OLAP will often use a Virtual approach where data is pulled from sources on-demand, although some level of materialization is possible (e.g., caching). Data staging areas are typically not used. For the Transformations criterion, Traditional OLAP will typically have Complex transformations such as maintaining SCDs and computing precise holistic aggregates, while Exploratory OLAP will only employ transformations that can be performed sufficiently fast (perhaps on streaming data) and in parallel (i.e., the transformations have to be Lightweight or at least Partition-tolerant). We note that even such transformations can in fact require heavy computations (e.g., machine learning computations, data/text analytics, User Defined Functions, etc.).
TABLE 1
Generalized categories

Criterion / Category | Traditional | Medium | Exploratory
Structuredness | Structured | Semi-structured | Unstructured
Materialization | Full | Partial | Virtual
Transformation | Complex | Not only relational algebra | Relational algebra
Freshness | Periodic | Some sources on-demand | On-demand
Extensibility | Hard to add sources | Sources easily added | Sources added automatically

Criterion / Category | Easy | Feasible | Difficult
Reasoning | None | Standard | Non-standard
Expressiveness | RDF/Taxonomies | Datalog, OWL2 profiles, DL-Lite | OWL-DL
Computation | Less than polynomial | Polynomial | Exponential
the MD schema design of OLAP systems (i.e., at the schema level), or on integration and provisioning of data (i.e., at the data/instance level). For each of these two categories (schema vs. data), we have identified representative papers, and subsequently divided them into two subcategories (for a total of four categories of papers), depending on whether SW technologies are applied to satisfy the requirements of Traditional OLAP systems (here denoted Semantic-aware OLAP systems), or to support (to some extent) the new set of requirements of Exploratory systems. The papers were selected based on our experience, depending on how well they exemplify the categories.

As previously explained, being Traditional or Exploratory is not boolean, but a continuum with a blurry border. Also, inside each of the five criteria in Fig. 2, there is a continuum. Thus, to facilitate the visual comparison, we have simplified the descriptive scale of each criterion into only three generic categories, depending on how close they are to the mid point of the space (i.e., Traditional OLAP). Table 1 summarizes these fifteen resulting categories, together with the nine categories for the SW criteria in Fig. 3. Notice that the max (resp. min) of these three categories does not necessarily coincide with the max (resp. min) of the corresponding edge, because the purpose of the table is just to outline the differences in the approaches (i.e., the table shows the max, resp. min, found in the analyzed papers). These exactly coincide for Structuredness and Materialization; for Transformation, the typical Lightweight set of operations we found is Relational Algebra (so we reflected this in the table); in the case of Freshness, the more Exploratory papers we found considered On-demand refresh; finally, with regard to Extensibility, we mapped papers automatically adding sources to Dynamic. To decide the classification of each work in each category, we checked the content of the corresponding papers looking for matches to the definitions in Section 2. If not enough information was provided to classify some work in a given category, we crossed out the corresponding cell.

Table 2 summarizes the findings in the analyzed work, the details of which will be provided in Sections 4 and 5. This table is aimed at evaluating existing work under the specific criteria, and allows us to visually analyze the correlation between being more Exploratory, using semantics, and the incurred cost. For each of the relevant papers identified, we show its position in each of the five DW criteria, and also the value in the three SW ones. Horizontally, the table is divided into four parts corresponding to semantic-aware MD design, multidimensional query definition, semantic-aware ETL processes, and ETQ processes. When one paper deals with both issues, MD design and data provisioning, it appears twice in the table and is analyzed from both perspectives (which may result in apparently contradictory classifications, caused by the different viewpoints of the analysis).

TABLE 2. Summarized comparison: classification of the surveyed approaches (Niemi et al. [29], [30], Priebe et al. [31], Bakhtouchi et al. [32], Prat and Akoka [33], Neumayr et al. [34], [35], Abelló et al. [36], Romero and Abelló [37], Kämpgen et al. [11], [41], Romero et al. [38], Nebot et al. [39], Khouri et al. [40], Nebot and Berlanga [10], among others) against the five DW/OLAP criteria and the three SW criteria, divided into semantic-aware and Exploratory data schema design and data provisioning.

Let us start from the upper part of Table 2 (i.e., semantic-aware MD design). As previously discussed (and further discussed in Section 4.1), the Transformation and Freshness OLAP criteria do not apply for these papers. These criteria are more related to data than to schema and have not typically been considered for MD design. Focusing on the other criteria, the common characteristic of all papers is that they completely materialize the DW. Four of them deal with semi-structured data, and among these, two facilitate to some extent their inclusion in the DW. The exception to this is Neumayr et al. [34], because they present an extension of the work (i.e., [35]) which adds the Exploratory part. As a general rule, when a work only deals with the schema, it allows High Expressiveness, but dealing with schemaless unstructured data requires lowering the Expressiveness to Medium in order to process the huge amount of data in a DW. In any case, the computation would consume a lot of resources, except if one restricts the Expressiveness and at the same time only deals with the schema of a well-structured database.

The second part of the table corresponds to more Exploratory design systems (see Section 4.2). We can see that in this case, the common characteristic to all of them is that they are able to deal with unstructured data, as soon as ontological mappings are provided. Also common to all of them is that they to some extent facilitate the addition of new sources to the DW. However, most of them materialize extracted data, do it off-line and allow only light-weight transformations.
Kämpgen et al. [41] and Nebot and Berlanga [10] seem to move in a more Exploratory space, though. Not surprisingly, High Expressiveness results in the need for heavy computations. As the two last approaches in this part show, the only way to avoid heavy computations is to avoid reasoning.

The third part of the table corresponds to semantic-aware ETL (see Section 5.1). We can see that the common characteristics in this case are Extensibility, Materialization, and Structuredness. The work in this area tries to automate to some extent the generation of data flows into a DW. Then, they allow more or less unstructured data and simple transformations. Relevantly, since some of the approaches work at the conceptual level, they do not pay attention to more physical characteristics like Freshness. Due to the inherent complexity of the problem, some drastically limit their Expressiveness in order to be feasible. The rest of them result in high Computation needs.

Finally, the last rows in the table summarize the work related to ETQ (see Section 5.2). They all avoid the complete materialization of data and thus improve their Freshness. Some facilitate Extensibility more than others, but all allow dealing with semi- and/or unstructured data. Given the cost of querying this kind of data, they avoid reasoning or keep it to a minimum (e.g., Pedersen et al. [9] do not even use RDF semantic constructs). Also, they do not deal with heavy-weight transformations, but keep these as simple as possible. Despite this fact, the computing power needed to run them is not low, in general.

Returning to our watchdogs example from the introduction, a semantic-aware data design could be used if we can (beforehand) define the mappings from all the different data sources to the ontology capturing the watchdogs' vocabulary (i.e., we must know in advance that we will use statistical catch data, geographical information about marine protection areas, etc). Once this technical work is done by someone in the watchdogs group, the others could use the ontological knowledge to navigate and analyze all these data, which would most probably have been replicated onto their server for the sake of performance. Conversely, if such planning cannot be done beforehand and the watchdogs need to discover the different data sources on the fly (or there is no such expert in the group able to define the mappings), then there should be a public ontology where the publishers of catch information and marine protection areas map their contents. Given that, the watchdogs could use exploratory techniques to navigate that public ontology and find what is worth analyzing among the available sources.

With regard to data provisioning, given the same premises as in the data design, semantic-aware techniques can help to design the extraction processes, which may include complex cleaning algorithms, because the data is extracted beforehand and locally stored. This would need the intervention of some expert inside the group. On the contrary, if no complex cleaning is necessary and we do not want to replicate the data in our server, the watchdogs could use an exploratory approach to issue simple queries directly to the data providers. The ontological knowledge should facilitate the integration of those data on the fly. We do not preclude the expert from participating,
but just consider that his/her intervention can and should be minimized.

4 DATA SCHEMA DESIGN

MD design is a well-known paradigm in the area of DW and databases in general, always related to OLAP tools. It was popularized by Ralph Kimball at the logical level in [14].

Multidimensionality is based on the fact-dimension dichotomy. This paradigm aims at analyzing the fact (or subject of analysis) instances from different analysis dimensions (i.e., points of view). Several measures (i.e., metrics) are available for each fact instance in order to gain insight. Furthermore, the MD model also provides foundations to study/analyze the available measures at various aggregation levels determined by the hierarchical structure of the dimensions. Indeed, aggregation is one of the main characteristics of the MD model, setting the foundations for the well-known roll-up and drill-down operators.

Conceptually, this entails that the fact is related by means of to-one relationships (i.e., functional dependencies) to the dimensions, which in turn identify the fact. Thus, a fact has no atomic identifier but instead a compound one, composed of the set of dimension values univocally identifying the factual data (from now on we will refer to these identifiers as MD identifiers). Finally, dimension hierarchies are composed of aggregation relationships between the dimension levels. Discovering these kinds of relationships is crucial in the design of the OLAP cubes and in turn of the DW.
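To make the fact-dimension dichotomy and the compound MD identifiers tangible, the toy sketch below stores catch facts keyed by their dimension values (Day, Port, Species) and rolls the measure up one level along the Location hierarchy. It is purely illustrative, reusing the watchdogs example from the introduction; the data and the Port-to-Country mapping are made up.

```python
# Toy MD example: facts identified by a compound key of dimension values,
# and a roll-up along the Location hierarchy (Port -> Country). Data is invented.
from collections import defaultdict

# Fact table: (day, port, species) -> catch quantity in tonnes (the measure)
facts = {
    ("2012-03-01", "Esbjerg", "Cod"):     12.5,
    ("2012-03-01", "Vigo",    "Cod"):      8.0,
    ("2012-03-02", "Esbjerg", "Herring"): 30.0,
}

port_to_country = {"Esbjerg": "Denmark", "Vigo": "Spain"}  # one step of the Location hierarchy

def roll_up_to_country(facts):
    """Aggregate the measure after replacing the Port level by its parent Country level."""
    out = defaultdict(float)
    for (day, port, species), qty in facts.items():
        out[(day, port_to_country[port], species)] += qty
    return dict(out)

print(roll_up_to_country(facts))
```

The to-one relationship from each fact to its dimension values is what makes the compound key an MD identifier, and the functional Port-to-Country mapping is what keeps the aggregation well formed.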
In the next subsections, we first review the Traditional OLAP work using SW technologies in the design and then discuss more Exploratory approaches, where no clearly delimited design phase exists.

4.1 Semantic-aware Multidimensional Design

The main features of Traditional OLAP systems directly impact how MD design has been tackled for such systems. Since these systems largely materialize the integration layer, designers arrange the integrated data in an MD fashion ready to be exploited by non-expert users. Complex and Periodic ETL processes guarantee a quality threshold for the integrated data, which comes from Static and Structured sources (typically relational databases). In some cases, this data is completed with data coming from external or non-structured sources.

Thus, Traditional MD design first focuses on identifying the needed subset of source data to answer the end-user analytical needs and then arranges them. Since sources are mainly relational and static, the process can be automated to some extent by inferring a mapping between the two schemas. In these scenarios, MD schemas are identified by exploring the data sources in order to discover MD patterns fulfilling the MD integrity constraints. Although no standard is available, much work has been devoted to identifying these constraints (e.g., [45]), which can be summarized as a formal definition of the MD space and the notion of well-formed hierarchies, in order to preserve a correct aggregation of data.

To automate MD design, classical approaches (e.g., [14]) focus on the organization of data and assume relational (or homogeneous) and well-structured sources; therefore, they are hardly effective (or feasible) in heterogeneous scenarios with disparate sources. Indeed, the more automatable they are, the more tied they are to a specific formalism or language (typically relational sources). Consequently, they do not tackle the integration of different data models.

At this point, some work proposed semantic-aware approaches to integrate external data. Basically, these first approaches follow the same principles but start from an integrated view of the sources in XML, RDF, or ontologies. SW technologies are a promising foundation for integrating heterogeneous data, and most work further exploring this direction can be classified either as work focusing on Web data (or similar scenarios), where the presence of the SW technologies is granted, or as work using SW technologies to tackle integration in any scenario.

Among the first ones, one may find the work discovering MD schemas from XML (a review and deep discussion of DW approaches for XML and Web data can be found in [46]). In the general case, however, these approaches can be clearly improved by using more expressive SW formalisms that facilitate integration by incorporating a reference semantic layer onto which every domain concept and relationship can be mapped.

The suitability of DL for data modeling was claimed back in 1998 (see [47]), and by then the first work acknowledging the benefits of modeling the DW by using such formalisms appeared in the DWQ project. This line of work highlights the need for capturing functional dependencies and aggregation relationships to model the DW. A few papers focusing on Datalog to model the DW also appeared at that time. Although Datalog has been overlooked for modeling DWs until very recently, it was, indeed, the first logic-based approach proposed for DW modeling in [48]. This work tackled the very same problems just discussed for DL: how to deal with functional dependencies and summarizability. A proper Datalog extension was presented to do that.

However, the state of the art on DL and Datalog at the time did not provide strong evidence about the feasibility of such approaches, and reasoning algorithms (needed for validation purposes) over such languages were computationally expensive and thus unfeasible for real cases. For this reason, it was not until ten years later that the first work presenting semi-automatic methods to support the MD design task by
Fig. 4. Ontologies for Semantic Annotations

Fig. 5. Ontologies for Domain Modeling

exploiting the SW technologies appeared, empowered by recent advances both in DL and Datalog.

4.1.1 Ontologies for Semantic Annotations

Fig. 4 depicts one trend in this category of work, where the ontology plays a passive role. Thus, the data sources are checked (mainly by sampling data using data mining techniques) to identify functional dependencies and MD identifiers, and the findings are then annotated in a reference ontology, following a GaV integration model. From this asserted information, facts are identified and, from each fact, a star-shaped schema is produced by deploying correct hierarchies and aggregations by means of functional dependencies. Consequently, note that the ontology is a repository of semantic annotations aimed at hiding heterogeneities. Reasoning is used to perform satisfiability and consistency checks (in the general case, reducible to subsumption checking) to validate the asserted knowledge, whose complexity varies according to the ontology language expressiveness.

As shown in Table 2 (see Section 3), all approaches in this section (see Data Schema Design - Sem.-aware) assume Full materialization of the DW. As a common trend, although they use SW technologies to overcome heterogeneities, the final DW produced mostly follows the Traditional assumptions and thus does not fulfill the requirements of Exploratory cases.

Those papers following the trend depicted in Fig. 4 (i.e., [29], [30], [31], [32], [33], [34], [35]) neither provide support for tackling the system Extensibility in an automatic way nor deal with Semi-structured or Unstructured data. More specifically, Niemi et al. [29], [30] and Priebe et al. [31] have Low Expressiveness and thus yield Medium Computation (e.g., OWL 2.0 RL). Somehow, these approaches can be considered precursors of current Exploratory efforts aimed at bridging the gap between OLAP and linked data (e.g., [49]). Oppositely, Bakhtouchi et al. [32] and Prat et al. [33] try to tackle any scenario and choose an expressive standard DL language such as OWL DL. Thus, they provide High Expressiveness and yield Hard Computation.

Finally, Neumayr et al. [34] and Anderlik et al. [35] present a detailed layered approach consisting of a flat domain (i.e., with no MD meaning), a hierarchy domain (showing roll-up relationships) and an MD domain. By means of integrity constraints (Datalog rules without head) and Datalog inference capabilities, they guarantee that the asserted information does make MD sense (i.e., preserves the MD constraints). Thus, they propose to analyze traditional data by using external ontologies as semantic dimensions (earlier work in this line was proposed in [50], in which a domain ontology serves as a basis to define an MD schema for aggregating instance data).

For all these papers, adding new data sources is reduced to linking the sources to the reference ontology. Thus, they provide a certain degree of automation to model the DW only once the sources have been linked (i.e., mapped) to the reference ontology. This approach to adding new data sources has two main consequences: the ontology languages used are less expressive (e.g., DL-Lite), but still the reasoning tasks carried out are computationally expensive: besides Standard reasoning such as subsumption checking, they need to compute the transitive closure of functional dependencies (a Non-standard service). Yet, these papers would rather use ad hoc algorithms than Non-standard reasoning services. Nevertheless, linking sources to the ontology can turn out to be hard in some scenarios (because mappings must be provided). Thus, in any case, adding new sources is time-consuming, and it is not suitable for very Dynamic scenarios but for those where new data sources are added from time to time and are intended to stay in the setting.

All in all, the main difference between these approaches is the ontology language used, distinguishing two main options: those focusing solely on RDF and those aiming to support a more generic and expressive approach. However, we note that both options have paid little attention to aggregation relationships up till now.
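The following sketch illustrates, in a deliberately simplified form, the kind of subsumption-based check used to validate asserted knowledge: a small RDFS taxonomy is materialized with a rule reasoner and then queried to confirm that an asserted instance is subsumed by the expected top level of a dimension hierarchy. The surveyed approaches use richer DL or Datalog machinery; the taxonomy and URIs here are invented, and RDFS entailment via the owlrl package merely stands in for those reasoning services.

```python
# Simplified stand-in for subsumption checking: RDFS entailment over a toy taxonomy.
# Vocabulary and data are invented; owlrl materializes the rdfs:subClassOf closure.
from rdflib import Graph
import owlrl

g = Graph()
g.parse(data="""
@prefix ex:   <http://example.org/species#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
ex:Subspecies rdfs:subClassOf ex:Species .
ex:Species    rdfs:subClassOf ex:Family .
ex:BalticCod  a ex:Subspecies .
""", format="turtle")

owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)   # materialize inferred triples

# Is the asserted instance (indirectly) an instance of the top level of the hierarchy?
ok = g.query("ASK { <http://example.org/species#BalticCod> a <http://example.org/species#Family> }")
print("BalticCod rolls up to Family:", bool(ok))
```

In the surveyed work, the analogous services are satisfiability and subsumption tests over the reference ontology, whose cost depends on the expressiveness of the ontology language.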
4.1.2 Ontologies for Domain Modeling

Fig. 5 sketches an alternative that achieves a larger degree of automation, while still in a Traditional scenario. In this case, sources are mapped to a reference ontology (with no MD meaning) and reasoning is exploited to identify functional dependencies and/or MD identifiers over the ontology rather than relying on the designer explicitly asserting them; all in all, this makes the ontology play an active role. In general, expensive algorithms are triggered to identify MD patterns (by checking the MD constraints) and then an MD schema based on this knowledge is created.

Most of this work assumes that an ontological representation of the sources is available (i.e., they are
In an Exploratory OLAP scenario, MD design should be driven by the user queries. This is because Exploratory users expect to be aware of new external data that fulfill their current business requirements, which are not covered by their internal corporate data. In this scenario, MD design must be flexible enough to accommodate new incoming external data and to reconcile them with both internal data and user requirements. To fulfill user requirements, an Exploratory OLAP system should ask for Fresh data in a highly Dynamic environment (e.g., the Internet), where an integration layer is hard to materialize. Thus, heterogeneity among sources becomes the main issue, as does the lack of structure in the sources, which calls for new techniques to identify MD patterns on the fly.

4.2.1 Ontologies for Semantic Linked Modeling

Fig. 6 sketches the idea behind the work in this category. The main difference with regard to Fig. 5 is the

…ments, and get the subsets that are of interest;
• Extract useful relationships between the retrieved data subsets, to discover potential MD facts; and
• Validate the discovered facts, for mixed datasets that may have incompatible semantics, that need to be integrated.

In the area of MD modeling, several user-driven and semantic-aware approaches have been proposed, which to some extent follow the integration models above. The feature common to all of them is that they allow the integration of Unstructured data by just mapping them to the available ontologies.

Firstly, Kämpgen et al. [11] propose a borderline Traditional DW to store and analyze statistical Linked Data (sLD) through OLAP queries. However, in this approach, there is no MD design because the DW only accepts sLD already expressed as MD data. It uses SPARQL as the query language, and does not perform any kind of reasoning over semantics.
In this context, semantic-aware ETL processes are those that take into account these semantic annotations to improve the integration processes required in OLAP solutions. In the rest of this section, we first review work related to semantic-aware ETL processes and then discuss preliminary efforts that deal with heterogeneous and unconventional data sources, which are dynamically incorporated into the analysis.

5.1 Semantic-Aware ETL Processes

During the initial steps of a Traditional ETL project, the main goal is to construct a conceptual ETL design that identifies the data sources that are useful to the project and describes the corresponding data transformations needed to map these sources to the target DW concepts. For achieving that, it is imperative to identify and understand the semantics of both the data sources and the target data stores. Although this is largely an open problem, some approaches propose the use of SW technologies to facilitate and to automate the design and construction of the ETL part.

5.1.1 Ontologies for ETL Modeling

Fig. 7 sketches (using Pentaho PDI icons) an approach to using SW technology for ETL. It uses a global ontology and maps all the involved data stores to it. This idea resembles the LaV paradigm, where the application ontology, constructed as a conceptual schema of the domain, corresponds to the global schema, and the semantic descriptions of the data stores, in terms of classes and properties defined in the ontology, correspond to the views describing the local schemas. However, the use of an OWL ontology, instead of a global schema, provides a formal model on which automated reasoning mechanisms may be applied. Furthermore, in ETL, it is not sufficient to consider the integration problem as a query rewriting problem, since the transformations taking place in real-case ETL scenarios usually include operations that cannot be easily captured by a query rewriting process (see [42]).
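The sketch below gives a feel for such LaV-style descriptions: two hypothetical source attributes are described in terms of classes and properties of a global ontology, and a simple query over these annotations finds which sources can supply a given target DW concept. It is an illustration written for this survey, not the encoding used by any particular approach; all URIs, sources, and attribute names are assumptions.

```python
# Sketch: describing data stores in terms of a global ontology (LaV-style annotations).
# All URIs, sources, and attributes are hypothetical.
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF

ONT = Namespace("http://example.org/ontology#")   # global domain ontology
MAP = Namespace("http://example.org/mappings#")   # annotation vocabulary for this sketch

g = Graph()

def describe(source_attr: str, ont_property: URIRef):
    """Annotate a source attribute with the ontology property it provides."""
    node = URIRef("http://example.org/source/" + source_attr)
    g.add((node, RDF.type, MAP.SourceAttribute))
    g.add((node, MAP.providesProperty, ont_property))
    g.add((node, MAP.attributeName, Literal(source_attr)))

describe("catch_csv.landed_kg", ONT.quantity)      # flat-file column
describe("fishdb.catch.weight", ONT.quantity)      # relational column
describe("fishdb.catch.port_code", ONT.port)

# Which sources can feed the DW measure ont:quantity?
for s, _, _ in g.triples((None, MAP.providesProperty, ONT.quantity)):
    print("candidate source attribute:", s)
```

Because the descriptions live in RDF, the same reasoning services used for MD design (e.g., subsumption over the ontology) could also decide whether an attribute described with a subproperty of ont:quantity qualifies.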
The mixture of SW and ETL technologies is still quite immature. However, the preliminary efforts in this area have shown a potential solution to the automation of ETL designs (mainly at the conceptual level) given a set of business requirements, focusing on being more Exploratory along Extensibility and Structuredness by facilitating the loading of semi-structured (i.e., XML or RDF) data. With respect to the other categorization criteria presented in Section 2.2, the entire spectrum of Transformations presented in Fig. 2 could be supported by the methods proposed. However, in practice, so far only simple and relational-style operations seem to be considered. How to generalize this work to capture a richer set of transformations (like user-defined functions) remains to be seen. Freshness is basically left aside and not even mentioned in some of the work, since they mainly deal with the conceptual level of ETL design, whereas this criterion involves lower design levels (i.e., physical). The same holds for Materialization, and all the authors more or less implicitly propose just to load the results into some kind of data storage.

Regarding the reasoning mechanisms used, if any, they are always Standard (i.e., subsumption) and only involve the schema, never the data. Despite that, given the Expressiveness chosen by some authors, Computation turns out to be Hard in most of them. Only Skoutas et al. [44] reduce the computational needs by restricting the reasoning to simple taxonomies.

Firstly, Niemi et al. [29] describe methods for OLAP cube construction using SW technology. The authors use a generic OLAP ontology as an upper ontology for all OLAP cubes. This ontology defines only general OLAP concepts and is independent of the application area. Per application need, they consider domain-specific ontologies (e.g., CarModel, Branch, Country) based on the upper one. Ontology mapping transformations may need to be defined, describing how the source data should be converted to conform to the global domain ontology. In order to integrate data from different sources, the authors consider an RDF data format and an RDF query language.

As an extension to their work, Niemi et al. [30] discuss in more detail the method for automating the construction of OLAP schemas. Again, the source and target schemas are considered as known. The mapping between the source data and the OLAP schema is done by converting the data to RDF using ontology maps. Then, the relevant source data are extracted by RDF queries generated using the ontology describing the OLAP schema. At the end, the extracted data are stored in a database and analyzed using typical OLAP techniques. Both these approaches aim at an end-to-end design approach, but they have two main limitations. First, they both require prior knowledge of the source and target schemas and, second, they consider only simple data transformations.

Skoutas and Simitsis [42], [43] present an approach to ETL design using SW technology that elaborates more on the complexity of the data transformations required for integrating source data from heterogeneous sources into a DW. This work deals with a core ETL design challenge: structural and semantic heterogeneity. For example, two sources S1 and S2 may contain similar information under two different schemas, or they may use different representation formats. This approach uses ontologies to formally and explicitly specify the semantics of the source and DW schemas and thus to automate to a large extent the ETL generation. It also assumes that the source and target schemas are previously known. A conceptual ETL design is obtained, whose generation involves the automatic derivation of the mappings from the source to the target attributes, along with the appropriate ETL transformations. However, computationally and
The techniques discussed so far consider the source and target schemas as given and search for the mapping of one to the other, benefiting from ISA relationships in a reference taxonomy. As a further step, Romero et al. [38] aim at building ETL and MD conceptual designs (see Section 4.2) starting from a set of data sources and a set of business requirements (e.g., expressed as service level objectives). This approach first analyzes the data sources mapped against a domain ontology. Then, in an iterative fashion, it produces two results: (a) an MD design that satisfies the requirement at hand (e.g., fact and dimension tables, relationships among them, etc.); and (b) a flow of conceptual ETL operations required to feed the MD schema by connecting it to the related data sources.

5.2 ETQ Processes

As mentioned earlier, Exploratory OLAP aims at analysing “fresh” data coming from a wide range of data sources. Fig. 1 shows that the provision of this kind of data would require alternative ETL processes, which we called ETQ. As discussed before, this area is largely unexplored by current approaches, which contrasts with the great demand for this kind of process within the OLAP community.

SW technologies can play a crucial role in the development of ETQ processes. On the one hand, they facilitate, to some extent, the processing of unconventional data sources containing semi-structured or unstructured data. Adding semantics to such data helps shape them under some notion of structure. This can be very helpful, given the effort spent in ETL research and materialized in ETL tools for dealing with structured information. On the other hand, SW technologies are enabling a new emerging scenario (i.e., the Web of Data) that offers new possibilities to publish semantic data for easy consumption by analytical tools. In this new scenario, virtualization and freshness can be fully achieved thanks to the use of SW references (e.g., URIs), which allow applications to access the required linked data at any time.

Thus, the main aim of ETQ processes is to avoid as much as possible complex computations over the imported data, so that these data can be included as fast as possible within the analytical queries (Freshness criterion). Unlike traditional ETL processes, ETQ processes can also be defined over unconventional data sources (Structuredness criterion), such as those usually found on the Internet. Current approaches partially cover some of these aspects by adopting RDF-based technology. However, applying reasoning within these processes is still an unexplored issue.
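To give a feel for this “query only what you need” style of provisioning, the hedged sketch below pulls a handful of bindings from a public SPARQL endpoint at analysis time instead of bulk-loading the source; the endpoint, the DBpedia properties, and the LIMIT are illustrative choices and are not taken from any of the surveyed systems.

# Illustrative ETQ-style access: fetch only the bindings needed for one analysis,
# at query time, from a remote SPARQL endpoint (no bulk download of the source).
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")   # example endpoint
endpoint.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?country ?population WHERE {
        ?country a dbo:Country ;
                 dbo:populationTotal ?population .
    }
    LIMIT 100
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    # Each binding can now feed a cube dimension (country) and a measure (population).
    print(row["country"]["value"], row["population"]["value"])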
Firstly, Nebot and Berlanga [10], as a step towards the definition of semantic-based ETQ processes, propose a method to generate fact tables directly from semantic data published as RDF(S) and OWL. The method starts with the target MD query, which must be expressed in terms of concepts and properties of the source ontologies, and then performs a series of transformations that guarantee that the generated factual data conform to both the MD query and the source semantics. The main drawback of this approach comes from its computational complexity, since it requires computing the aggregation patterns within the ontologies through a reasoner.

Pedersen et al. [9] provide an example of ETQ processes, proposed to extend OLAP with external data. This method allows the execution of OLAP operations that involve data contained in external XML (not even RDF) sources through XPath queries. In doing so, external data can be used as dimensions and/or measures of OLAP cubes, offering high Extensibility and Freshness of the results. Still, the XML sources need to be logically integrated within the OLAP system, preventing it from being completely Dynamic.

Finally, Kämpgen et al. [41] directly define OLAP operations over external data expressed in RDF and annotated using the RDF Data Cube vocabulary (QB). The authors define an MD schema based on QB and map OLAP operations over this schema to SPARQL queries over the sources. In this approach, the Extensibility criterion is enabled through the use of RDF and QB to annotate the data sources. One drawback is that the annotation with QB is performed manually, which implies a bottleneck in the design of the transformation processes, affecting the Extensibility criterion. Another limitation comes in terms of efficiency, as the approach depends on the SPARQL processor, which has been criticized as inefficient for complex MD queries.
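The flavour of such a mapping can be illustrated with the following sketch, in which a roll-up (sum of a measure per dimension member) over QB observations is expressed as a SPARQL aggregation and executed with rdflib over a hypothetical local file; qb: and sdmx-measure: are the standard namespaces, while ex:refArea and observations.ttl are invented for the example.

# Illustrative roll-up over RDF Data Cube (QB) observations:
# group observations by one dimension and aggregate one measure with SPARQL.
from rdflib import Graph

g = Graph()
g.parse("observations.ttl", format="turtle")   # hypothetical QB-annotated data

ROLLUP = """
PREFIX qb:           <http://purl.org/linked-data/cube#>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>
PREFIX ex:           <http://example.org/ns#>

SELECT ?area (SUM(?value) AS ?total)
WHERE {
    ?obs a qb:Observation ;
         ex:refArea ?area ;                 # dimension (hypothetical property)
         sdmx-measure:obsValue ?value .     # measure
}
GROUP BY ?area
"""

for area, total in g.query(ROLLUP):
    print(area, total)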
6 CHALLENGES

In this section, we summarize our findings and identify a list of challenges that require a fresh look in the future. We divide our discussion between the two areas of interest in this survey, namely schema design and data provisioning, but we also comment on whether SW technologies are ready to fully support the needs of next-generation OLAP systems. Fig. 8 depicts the locality of these challenges (shown as numbers from 1 to 8) with respect to the various components of the ecosystem described in Fig. 1.
The more expressive a language is, the more expensive the reasoning (easily becoming undecidable). Thus, to dynamically add data sources, we have to pay higher computational costs, which easily becomes infeasible.

6.3 Semantic and Computational Challenges

Being Exploratory means automating the access not only to schema but also to data to some extent. This implies that we have to be able to cope with semantics and reasoning at the instance level (#7 in Fig. 8). Therefore, most Exploratory approaches imply Hard Computation if Expressiveness is not drastically reduced. It is easier for any reasonably expressive ontology language to perform reasoning tasks on the terminology without asserted instances (combined complexity) than to answer queries over the asserted axioms (data complexity). The fact that work in ETL related to reasoning is also done at the schema level, without actually involving data, supports this assertion. Nevertheless, recent approaches such as OBDA (Ontology-Based Data Access, see [56]) open new directions. Thus, tractable DLs, such as DL-Lite, allow us to exploit query answering by querying the terminology.

Typically, Datalog has been used to reason on the instances of databases. It provides query answering (the model and data are somehow intertwined) in polynomial time under a closed-world assumption. Definitely, this would be the case for a DW supporting decision-making based only on the data it physically contains (i.e., it is a database and assumes a closed world). However, if a system has to be truly Exploratory, we cannot guarantee that the necessary information will always be available. It makes sense to think of decision-making in open-world scenarios (e.g., what-if analysis), considering also those data on the Internet that could be dynamically added to our analysis. Thus, more research is necessary on efficient reasoning algorithms under an open-world assumption, maybe constrained to specific types of schemas (e.g., MD).
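The closed-world, polynomial-time behaviour referred to above can be made concrete with a minimal sketch of naive bottom-up Datalog evaluation; the relation names and facts are invented, and real engines would use semi-naive evaluation with indexes.

# Illustrative naive bottom-up Datalog evaluation of:
#   rollsUpTo(X, Z) :- partOf(X, Z).
#   rollsUpTo(X, Z) :- partOf(X, Y), rollsUpTo(Y, Z).
part_of = {("Shop1", "City1"), ("City1", "Region1"), ("Region1", "Country1")}

rolls_up_to = set(part_of)          # first rule: copy the base facts
changed = True
while changed:                      # iterate to fixpoint (polynomial in the data)
    changed = False
    for (x, y) in part_of:
        for (y2, z) in set(rolls_up_to):
            if y == y2 and (x, z) not in rolls_up_to:
                rolls_up_to.add((x, z))
                changed = True

# Closed-world query answering: only derivable facts are true.
print(("Shop1", "Country1") in rolls_up_to)   # True
print(("Shop1", "Mars") in rolls_up_to)       # False (assumed false, closed world)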
Also, there is a controversy related to the polynomial complexity of Datalog, since it is only achieved for standard query answering (i.e., against a fixed program); when this restriction cannot be guaranteed (and it happens to be rather restrictive and not assumable in OLAP), query answering becomes EXPTIME-complete (see [25]). It is also argued that tractable DLs like the DL-Lite family provide query answering in LogSPACE (the same complexity as relational databases) and that, therefore, query answering in such families can be reduced to querying a relational database. This assumption is behind concepts such as OBDA. However, in practice, current OBDA tools have problems matching relational database response times.
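To illustrate why this reduction is attractive, the following simplified sketch rewrites an atomic concept query into a SQL union over the tables that the (transitively) subsumed concepts are mapped to; the toy TBox, table names, and mappings are hypothetical, and actual OBDA rewriting algorithms handle far richer TBoxes and conjunctive queries.

# Illustrative DL-Lite/OBDA-style rewriting: a query over concept "Professor"
# becomes a union over the tables of all concepts subsumed by it.
SUBCLASS_OF = {                       # hypothetical TBox (atomic inclusions only)
    "AssistantProfessor": "Professor",
    "FullProfessor": "Professor",
    "Professor": "Staff",
}
TABLE_FOR = {                         # hypothetical mappings to relational tables
    "Professor": "professor",
    "AssistantProfessor": "assistant_professor",
    "FullProfessor": "full_professor",
}

def subsumed_concepts(concept):
    """All concepts whose instances are also instances of `concept` (incl. itself)."""
    result = {concept}
    changed = True
    while changed:
        changed = False
        for sub, sup in SUBCLASS_OF.items():
            if sup in result and sub not in result:
                result.add(sub)
                changed = True
    return result

def rewrite_to_sql(concept):
    parts = ["SELECT id FROM " + TABLE_FOR[c]
             for c in sorted(subsumed_concepts(concept)) if c in TABLE_FOR]
    return "\nUNION ALL\n".join(parts)

print(rewrite_to_sql("Professor"))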
Datalog and DL represent two different ways to deal with ontologies. In our opinion, however, both approaches cannot adequately cope with large amounts of instances, and although these techniques can facilitate integration, in general, we do not envision a pure logic-based approach. We need semantics and we need reasoning, but it looks like they are not compatible in the presence of a high data volume.

The column Computation in Table 2 describes the worst case. Thus, Computation and Expressiveness are usually correlated (High Expressiveness implies Hard Computation). Nevertheless, it is worth mentioning that in practice some of these approaches do not reach the worst-case complexity, presenting, for most ontologies, a Medium Complexity for fairly expressive languages. Moreover, some approaches may prefer to maintain Medium/High Expressiveness with easier Computation at the expense of incomplete (but sound) inferences, when complete reasoning is not required.

It is worth noting that no work uses Non-standard reasoning services (e.g., least common subsumer, pattern matching, etc.). Researchers prefer to use ad hoc algorithms on top of more mature Standard services. Thus, for example, Abelló and Romero [36], [37] incur higher Computation because they implement ad hoc algorithms on top of the reasoner. We think that constraining or at least detecting certain kinds of expressions (e.g., MD queries) in the reasoners could be a solution to this puzzle (i.e., similar to detecting star-join patterns in relational query optimizers). It would not (only) be a trade-off between language expressiveness and reasoning cost; the kinds of queries allowed should also come into play. So far, we have described how OLAP can benefit from SW, but it is also worth researching how SW can benefit from OLAP, considering the latter as an enabling paradigm for making complex processing and reasoning more efficient.

From a different perspective, the computational problem could be tackled by increasing the computing power, e.g., by using a cluster of parallel machines (see [29]). In this sense, [57] addresses the challenge of distributed reasoning by using MapReduce. A drawback of this approach is the current structure of the SW. The existence of data niches hidden behind endpoints hinders the analysis of data. On the one hand, these conceal the contents of the different sites. On the other hand, the data to be analyzed have to be moved from one site to another due to the limited query functionalities offered. The RDF Data Cube Vocabulary (a W3C recommendation [49]) goes towards solving this by offering decisional functionalities.
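A toy, single-machine imitation of that map/reduce pattern is sketched below for one RDFS rule (x rdf:type C and C rdfs:subClassOf D entail x rdf:type D); the triples are invented, and systems such as WebPIE additionally cover the full rule set, large-scale duplicate elimination, and iteration to a fixpoint.

# Illustrative map/reduce-style application of one RDFS rule over triples.
from collections import defaultdict

triples = [                                        # hypothetical input data
    ("ex:obs1", "rdf:type", "ex:SalesObservation"),
    ("ex:SalesObservation", "rdfs:subClassOf", "qb:Observation"),
]

def map_phase(triple):
    """Key triples by the class they mention, so matching pairs meet in one reducer."""
    s, p, o = triple
    if p == "rdf:type":
        yield (o, ("instance", s))
    elif p == "rdfs:subClassOf":
        yield (s, ("superclass", o))

def reduce_phase(key, values):
    """Join instances of a class with its superclasses to emit inferred types."""
    instances = [v for tag, v in values if tag == "instance"]
    superclasses = [v for tag, v in values if tag == "superclass"]
    for i in instances:
        for c in superclasses:
            yield (i, "rdf:type", c)

groups = defaultdict(list)
for t in triples:
    for key, value in map_phase(t):
        groups[key].append(value)

inferred = [t for key, values in groups.items() for t in reduce_phase(key, values)]
print(inferred)   # [('ex:obs1', 'rdf:type', 'qb:Observation')]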
Finally, we should also question whether the current management of the SW is the most adequate for decision support (#8 in Fig. 8). One of the main difficulties is integrating data, and the complete independence of repositories does not facilitate it. A more hierarchical organization (trading autonomy for consistency) would be more appropriate. At the very least, registries showing the contents, availability, and data quality of each source are clearly needed.
[43] A. Simitsis, D. Skoutas, and M. Castellanos, "Representation of conceptual ETL designs in natural language using Semantic Web technology," DKE, vol. 69, no. 1, 2010.
[44] D. Skoutas, A. Simitsis, and T. K. Sellis, "Ontology-driven conceptual design of ETL processes using graph transformations," JoDS, vol. 13, 2009.
[45] J.-N. Mazón, J. Lechtenbörger, and J. Trujillo, "A survey on summarizability issues in multidimensional modeling," DKE, vol. 68, no. 12, 2009.
[46] J. M. Pérez, R. Berlanga, M. J. Aramburu, and T. B. Pedersen, "Integrating data warehouses with web data: A survey," IEEE TKDE, vol. 20, no. 7, 2008.
[47] D. Calvanese, M. Lenzerini, and D. Nardi, "Description logics for conceptual data modeling," in Logics for Databases and Information Systems. Kluwer, 1998.
[48] M.-S. Hacid, P. Marcel, and C. Rigotti, "A rule-based data manipulation language for OLAP systems," in DOOD, ser. LNCS, vol. 1341. Springer, 1997.
[49] W3C, "The RDF Data Cube vocabulary," W3C Working Draft, http://www.w3.org/TR/vocab-data-cube, 2012.
[50] R. Dánger and R. Berlanga, "Analysis of ontological instances - a data warehouse for the Semantic Web," in ICSOFT (ISDM/EHST/DC). INSTICC Press, 2007.
[51] P. G. Kolaitis, "Schema mappings, data exchange, and metadata management," in PODS, 2005.
[52] D. Calvanese, E. Kharlamov, W. Nutt, and C. Thorne, "Aggregate queries over ontologies," in ONISW. ACM, 2008.
[53] U. Sattler, "Description logics for the representation of aggregated objects," in ECAI. IOS Press, 2000.
[54] J. Seo, S. Guo, and M. S. Lam, "SociaLite: Datalog extensions for efficient social network analysis," in ICDE, 2013.
[55] E. Jiménez-Ruiz, B. C. Grau, I. Horrocks, and R. B. Llavori, "Supporting concurrent ontology development: Framework, algorithms and tool," DKE, vol. 70, no. 1, 2011.
[56] A. Poggi, D. Lembo, D. Calvanese, G. D. Giacomo, M. Lenzerini, and R. Rosati, "Linking data to ontologies," JoDS, vol. 10, 2008.
[57] J. Urbani, S. Kotoulas, J. Maassen, F. van Harmelen, and H. E. Bal, "WebPIE: A Web-scale parallel inference engine using MapReduce," J. Web Sem., vol. 10, 2012.

Alberto Abelló is a tenure-track professor with a PhD in Informatics from UPC and local coordinator of the Erasmus Mundus PhD program IT4BI-DC. An active researcher with more than 50 peer-reviewed publications and an H-factor of 18, his interests include data warehousing and OLAP, ontologies, NoSQL databases, and Big Data management. He has served as Program Chair of DOLAP and MEDI, and as a PC member of other database conferences such as DaWaK, CIKM, and VLDB.

Torben Bach Pedersen is a Professor of Computer Science at Aalborg University, Denmark. His research concerns business intelligence and big data, especially "Big Multidimensional Data": the integration and analysis of large amounts of complex and highly dynamic multidimensional data. He is an ACM Distinguished Scientist, an IEEE Senior Member, and a member of the Danish Academy of Technical Sciences.

Rafael Berlanga is Professor of Computer Science at Universitat Jaume I, Spain, and leader of the Temporal Knowledge Bases group. His current research concerns novel text mining methods for performing Business Intelligence over very large resources such as the Semantic Web and social media. He has published more than 25 contributions in high-impact international journals and more than 50 contributions in international conferences.

Victoria Nebot obtained her Ph.D. in computer science from Universitat Jaume I, Spain, in 2013. She is currently a postdoctoral researcher in the Temporal Knowledge Bases Group (TKBG) at Universitat Jaume I. Her main research focuses on analyzing and exploiting semistructured and complex data derived mainly from the Semantic Web. She is also interested in text mining and information extraction methods for the Semantic Web.

María José Aramburu is an associate professor of Computer Science at Universitat Jaume I. She received the PhD degree from the University of Birmingham (UK) in 1998. Author of articles in international journals such as Information Processing & Management and Decision Support Systems, as well as papers in international conferences such as ICDE, DEXA, and ECIR, her main research interests include knowledge repositories, decision support systems, and the integration of information.