
Using Semantic Web Technologies for Exploratory OLAP: A Survey

Alberto Abelló, Oscar Romero, Torben Bach Pedersen, Senior Member, IEEE, Rafael Berlanga, Victoria Nebot, María José Aramburu, Alkis Simitsis

Abstract—This paper describes the convergence of some of the most influential technologies in the last few years, namely
data warehousing (DW), On-Line Analytical Processing (OLAP), and the Semantic Web (SW). OLAP is used by enterprises to
derive important business-critical knowledge from data inside the company. However, the most interesting OLAP queries can no
longer be answered on internal data alone; external data must also be discovered (most often on the Web), acquired, integrated,
and (analytically) queried, resulting in a new type of OLAP, exploratory OLAP. When using external data, an important issue is
knowing the precise semantics of the data. Here, SW technologies come to the rescue, as they allow semantics (ranging from
very simple to very complex) to be specified for web-available resources. SW technologies not only support capturing the
“passive” semantics, but also support active inference and reasoning on the data. The paper first presents a characterization of
DW/OLAP environments, followed by an introduction to the relevant SW foundation concepts. Then, it describes the relationship
of multidimensional (MD) models and SW technologies, including the relationship between MD models and SW formalisms.
Next, the paper goes on to survey the use of SW technologies for data modeling and data provisioning, including semantic
data annotation and semantic-aware extract, transform, and load (ETL) processes. Finally, all the findings are discussed and a
number of directions for future research are outlined, including SW support for intelligent MD querying, using SW technologies
for providing context to data warehouses, and scalability issues.

Index Terms—Business Intelligence, Data Warehousing, OLAP, ETL, Semantic Web, Reasoning

• A. Abelló and O. Romero are with Polytechnic Univ. of Catalonia-BarcelonaTech, Spain. E-mail: [aabello|oromero]@essi.upc.edu
• T. B. Pedersen is with Aalborg Univ., Denmark. E-mail: tbp@cs.aau.dk
• R. Berlanga, V. Nebot, M.-J. Aramburu are with Univ. Jaume I, Spain. E-mail: [berlanga|romerom|aramburu]@uji.es
• A. Simitsis is with HP Labs, USA. E-mail: alkis@hp.com

1 INTRODUCTION

Business Intelligence (BI) is aimed at gathering, transforming and summarizing available data from existing sources to generate analytical information suitable for decision-making tasks. The most widely used approach to BI has been the combination of Data Warehousing (DW), On-Line Analytical Processing (OLAP) technologies and the Multidimensional (MD) data model (see [1]).

DW/OLAP technologies have been successfully applied for analysis purposes, but always in a well-controlled “closed-world” scenario, where the set of data sources is rather static, and well-structured data is periodically loaded in batch mode applying heavy cleansing transformations. However, the eruption of XML and other richer semi-structured formats like RDF has opened up much more heterogeneous and open scenarios than those of such traditional in-house DW applications.

In [2], Inmon outlines the opportunity and importance of using unstructured and semi-structured data (either textual or not) in the decision-making process. Nowadays, Web 2.0 sites and Linked Open Data initiatives are becoming sources of huge amounts of valuable semi-structured data. Currently no one questions the need of adding all this information to the traditional corporate analysis processes. A significant amount of information, and thus knowledge, can be found in “unconventional” data sources like Web portals, social media, and unstructured or less-structured data stores like product reviews, customer complaints, e-mails, and so on.

Enterprises have started to look into such rich information sources to increase their profits and improve their products and services. As an example, populating a business report that shows the effect of a product campaign in a specific time period may require combining information from historical, structured data like product sales and customer data, residing in a DW, with sentiments extracted from Big Data (e.g., tweets) relating to products promoted by the respective campaign (see [3], [4]).

Thus, companies want to explore all these new data opportunities and include them in their OLAP analyses, leading to a new type of OLAP: Exploratory OLAP. The main difference of Exploratory OLAP from Traditional OLAP is naturally the issue of exploration: of new data sources, of new ways of structuring data, of new ways of putting data together, of new ways of querying data. Whereas Traditional OLAP is
performed in a “closed-world” scenario based only on internal data, an essential part of Exploratory OLAP is to discover, acquire, integrate, and analytically query new external data.

The Semantic Web (SW) has been conceived as a means to build semantic spaces over Web-published contents so that Web information can be effectively retrieved and processed by both humans and machines for a great variety of tasks.

A recent paper [5] introduced the concept of fusion cubes to mean cubes that, based on a core of internal multidimensional data, gradually merge in (fuse with) external data, in order to support self-service BI. The paper provides a motivating example, which in our view captures the essence of exploratory OLAP well, and shows why SW technologies are needed in this scenario. The example concerns a group of concerned citizens (watchdogs) that want to monitor if the fishing catches being landed in the various EU countries respect the overall limits set up by the EU marine protection schemes and how they are related to marine protection areas. The watchdogs want to analyze the data by Time, Location, and Species, where each of these three dimensions should be organized into a hierarchy of levels, e.g., Day-Week-Month-Year, Port-Province-Country-Region, and Subspecies-Species-Family. To do this, they must integrate statistical catch data (in a flat tabular format) with geographical data about marine protection areas (from a public database, in SW format), fish population data (from various research databases, in a multitude of formats ranging from comma-separated files to SW data), and finally with ontology data describing geo and species hierarchies (in SW formats).

Reasoning capabilities are needed to perform the complex integration and resolve conflicts, e.g., contradicting catch data or species classifications. In our view, SW technologies are powerful enough to both model all these different types of data and provide the needed reasoning capabilities on top.

Several industrial OLAP tools already use data semantics to some extent. A notable example is the TARGIT Decision Suite (formerly TARGIT BI Suite) [6], which uses extended semantics to do so-called “meta-morphing”. The TARGIT meta-morphing model extends the traditional multidimensional model with associations between measures and dimensions/levels, and records and learns from the users’ behavior, e.g., what combinations of measures and dimensions/levels are used, which type of charts is used to display results, etc. This enables users to ask questions that are in some sense “incomplete”, using the semantics and learned user preferences to fill in the gaps, and thus enabling easier and more intuitive interaction. However, the semantics are captured in a closed, internal format, and are applicable only to already known internal cube data.

To enable Exploratory OLAP, there is thus a great need for capturing semantics in an open and powerful way that can apply seamlessly across both internal and (newly discovered) external data. We believe that SW technologies are a good choice for this.

• To support the discovery of relevant data, it is essential that the meaning of the data is “declared” in an accurate, rich, and unambiguous way, so the right data can be found. The rich ontology languages found in SW are ideal for this.
• To support the data acquisition, external sources must be queried in a precise, yet efficient, way, to avoid having to download complete large data sets, which are perhaps never used again, and where most of the content is irrelevant to the particular query. SW query languages and technologies, such as SPARQL and SPARQL endpoints, serve this purpose very well (a minimal sketch follows this list).
• To support the data integration, facilities must be provided to resolve conflicts in the data, to combine data from many different formats and sources, and to structure data in a multidimensional format. Again, SW technologies such as reasoning provide a powerful foundation for this.
• To support analytical querying, measure data must be aggregated along the OLAP dimensions. Both measures and dimensions (hierarchies and levels) can now be based on external data, and it is thus very important to capture the exact semantics of the multidimensional data and its lineage, i.e., providing not just a result, but also a precise specification of its meaning and where it came from. Again, SW technologies have the rich modeling constructs to support this.
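Taking the watchdogs scenario as an example, the acquisition step above could look as follows. This is a minimal sketch in Python using the SPARQLWrapper library; the endpoint URL and the ex: vocabulary are hypothetical, but any public SPARQL endpoint is queried in the same way:

# A targeted extraction: only the catch records relevant to the analysis
# cross the wire, instead of a bulk download of the whole data set.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://stats.example.org/sparql"  # hypothetical endpoint

query = """
PREFIX ex: <http://example.org/fishery#>
SELECT ?port ?species ?tonnes WHERE {
  ?c a ex:Catch ;
     ex:landedAt ?port ;
     ex:species ?species ;
     ex:weightTonnes ?tonnes ;
     ex:year 2013 .
}
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["port"]["value"], row["species"]["value"], row["tonnes"]["value"])

Precisely because the query is expressed against declared vocabulary rather than a physical schema, the same request can be posed to any endpoint exposing compatible terms.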
Thus, the goal of this paper is to survey how SW technologies can aid in data discovery, acquisition, integration, and analytical querying of external data, and thus serve as a foundation for Exploratory OLAP.

We note that a solid foundation for Exploratory OLAP is also ideal for the wider scenario of “Exploratory BI”, where deep analytics and data mining are performed on the exploratory OLAP cubes. However, considering the full range of Exploratory BI is beyond the scope of this paper.

As DW mainly involves the integration of disparate information sources, semantic issues are highly important for effectively discovering and merging data. These semantic issues are similar to those faced in the SW. As a consequence, SW technologies have recently been applied to some DW tasks such as Extract, Transform, and Load (ETL) processes, MD design and validation, and so on. Although they are usually limited to Traditional OLAP scenarios (see for example [7]), we will show that SW technologies can also be useful in highly heterogeneous and open scenarios.

The main purpose of this paper is twofold: a) to survey and categorize how SW technologies have been applied to solve the new requirements of Exploratory OLAP systems, and analyze the associated
Fig. 1. DW/OLAP Elements and Data Flows

feasibility and benefits; and b) to use the analysis to identify future challenges where the potential to use SW technologies for Exploratory OLAP is high, but current technologies are lacking, i.e., a “position paper” approach.

To classify current approaches for the survey, we define five separate categorization criteria (i.e., Materialization, Transformations, Freshness, Structuredness, and Extensibility) and show that along all five dimensions there are challenges to overcome. It is important to notice that the first three criteria are related to data provisioning, while the latter two are more related to the data sources and the schema design. Thus, we will below survey and analyze how SW technologies are applied by current work to solve both data schema design and data provisioning requirements.

Moreover, the feasibility of applying SW technologies needs to be analyzed because adding semantics generally increases system complexity. In order to analyze the classical trade-off between expressiveness and complexity, in the context of each particular work, we have used three more criteria specifically related to SW (namely Reasoning, Computation, and Expressiveness).

We believe that these eight dimensions allow us to (1) cover the most relevant aspects of the usage of SW technologies towards Exploratory OLAP, and (2) separate different issues to facilitate the analysis and solution. Aspects that we find less relevant for this purpose, and thus do not consider, include the type of data storage (relational, NoSQL, etc.), any distribution or parallelism in data storage or computations, the specific types of analysis performed (OLAP only, or also specific types of data mining, etc.), and the specific technologies and systems used.

Our main conclusion from the survey part is that SW technologies are a promising way to approach the involved semantic integration issues. Although contributions and problems in data provisioning are more relevant than for modeling, there are fewer approaches in the former, the reason probably being not only the difficulty, but also that the relationship between data provisioning and SW is not mature enough. For the expressiveness/complexity tradeoff, our main conclusion is that researchers tend to use ad hoc algorithms on top of more mature standard services and maintain a medium level of expressiveness with easier computation at the expense of incomplete (but sound) inferences. In the position part of the paper, we list challenges related to schema design, data provisioning, as well as semantic and computational issues. The main contributions of this paper can be summarized as:
1) Propose a set of five novel criteria to categorize DW/OLAP systems,
2) Analyze how these criteria affect the need for semantics and the feasibility of the design and data provisioning processes,
3) Analyze how semantic-aware reasoning techniques can aid,
4) Survey and categorize existing DW/OLAP work according to the five criteria and the three reasoning criteria, and
5) Identify research trends in this area.

The paper is organized as follows. Section 2 introduces the basic concepts on DW and SW, and describes the categories used in the comparison of the different papers. Section 3 introduces the survey part of the paper by presenting a summarized comparison of the different approaches reviewed. The following two sections present the details of the survey: Section 4 surveys traditional and new features of data schema design and Section 5 does the same for data provisioning processes. Section 6 provides the position part of the paper through a global discussion of the main issues of the paper and potential research lines to address them. Finally, conclusions and future research directions are provided in the last section.

2 METHODOLOGY

Nowadays, a new trend of OLAP work has emerged, which applies SW technologies to mainly address data integration issues and the automation of data processing. The purpose of this paper is to categorize the
main requirements of these new OLAP approaches, as well as to show how SW technologies can help to fulfill the new requirements.

As there are many papers proposing a large variety of system features, in this section we present a methodology that guides this survey and produces a clear picture of this intricate area.

We first present the characteristics of Traditional OLAP use cases to frame the area of interest of our survey. Then, five criteria related to the different relevant aspects of DW/OLAP systems are defined. By means of these criteria, in the rest of the paper, current approaches are categorized. Furthermore, the five criteria define a space that allows us to locate Exploratory OLAP use cases and to distinguish them from Traditional OLAP use cases. In addition, we use another three criteria related to expressiveness, reasoning and complexity, to characterize existing work with regard to SW technologies.

2.1 The structure of OLAP systems

OLAP technology is aimed at gathering, transforming and summarizing available data from existing sources to generate analytical information suitable for decision-making tasks. Traditionally, OLAP has been associated with data warehouses (DW), following the three-layered structure shown in Fig. 1, namely:
• the data sources layer, which consists of all the potential data of any nature (e.g., relational, object-oriented, semi-structured, and textual) that can help to fulfill the analysis goals,
• the integration layer, which transforms and cleanses the data gathered from the sources, as well as stores them in an appropriate format for the subsequent analysis (i.e., the DW), and
• the analysis layer, which contains a number of tools for extracting information and knowledge from the integrated data and presenting it to the analysts (i.e., OLAP cubes, charts, reports, etc).

As is clear from this description, the integration model of Traditional OLAP systems (DW/OLAP) is based on a global schema (i.e., the DW schema), which is seen as a view over the underlying data source schemas (usually known as Global as View, or GaV for short). In this integration model, query answering is simple. The external data sources are (implicitly) assumed to be known in advance, as are the user needs guiding the design of the global schema. This works well when the sources and requirements are indeed known in advance, but encounters problems when this does not occur. For those cases, more flexible integration models are needed. In particular, the integration of external data schemas in terms of a global schema (often in the form of a global domain ontology) has been studied (see [8]). From the global schema, local schemas can be derived; i.e., the local schemas are seen as (more specialized) views of the unified general global schema. The resulting integration model (usually known as Local as View, or LaV for short) is thus highly extensible, at the expense of considerably more complicated query answering. Therefore, in this integration model the reasoning power of SW technologies is especially needed.

DW/OLAP systems use a special data model, the multidimensional data model (MD), for the integration layer. Here, factual data gathered from the data sources layer must be expressed in terms of numerical measures and categorical hierarchical dimensions. The semantics of this model consists of representing any interesting observation of the domain (i.e., measures) in its context (i.e., dimensions). The typical processes in charge of transforming data from the data sources layer to the integration layer are called ETL processes. In some cases (e.g., to enable fast loading and querying at the expense of delivering only partly cleansed/transformed data at first), the order of the steps is switched, or interleaved, leading to Extract, Load, Transform (ELT) and Extract, Transform, Load, Transform (ETLT), where the transformations are (partially) delayed to provide fresh but less refined data. Taking this to the extreme, we have Extract, Transform, Query (ETQ) by delaying transformations to the last minute and serving data directly to the user on demand. Thus, given an MD query derived from a particular analysis goal, an ETQ process directly extracts the required information from the data sources, and transforms it to fit into the OLAP results.

ETQ processes are becoming essential for performing analyses that involve external data published in the Web, and therefore they usually deal with semi-structured, streamed and dynamic data sources (e.g., [9], [10]). Fig. 1 shows how ETQ processes can interact with the DW/OLAP data flow. Thus, an ETQ process can take “fresh” data from the ETL staging area, blend it with both external and DW integrated data (e.g., dimension hierarchies), and eventually deliver the results to the analytical tools. Notice that ETQ processes can also live apart from traditional DW/OLAP, thus avoiding the need of loading the integrated data into a DW. It can be noticed that Exploratory OLAP systems are tightly related to ETQ processes.
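To make the ETQ idea concrete, the following minimal Python sketch shows the shape of such a process under simplifying assumptions: fetch_catches() is a hypothetical stand-in for any on-demand source access (e.g., a SPARQL call as sketched in the introduction), and the transformation is deliberately lightweight:

from collections import defaultdict

def fetch_catches(year):
    # Extract: hypothetical on-demand access to an external source.
    return [("Aalborg", "Cod", 1.2), ("Vigo", "Hake", 3.4), ("Vigo", "Cod", 0.7)]

def etq(year, dimension=0):
    # Transform + Query: aggregate the measure along the chosen dimension
    # at the last minute, serving the result directly to the user.
    cube = defaultdict(float)
    for row in fetch_catches(year):
        cube[row[dimension]] += row[2]
    return dict(cube)

print(etq(2013, dimension=1))  # e.g., {'Cod': 1.9, 'Hake': 3.4}

Nothing is loaded into a DW: each analysis goal triggers its own extraction and just enough transformation to fit the OLAP result.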
SW technologies can help in all DW/OLAP layers in order to support Semantic-aware and Exploratory OLAP systems. In the data sources layer, they can aid in capturing the precise semantics of the data sources. In the integration layer, they can be used to specify the transformations and capture the data lineage. In the analysis layer, they can help specifying the semantics of the presented information and reasoning about it. Finally, SW technologies can serve as a proper basis for defining ETQ processes, since most external data is now being published as linked data (see [10], [11]).

As far as we know, the number of approaches that use SW technologies in the first two layers is large, whereas there are very few proposals that apply them
to the third layer and ETQ processes. For this reason, our primary focus is to investigate the main issues of the first two components. More specifically, in this survey we analyze how SW technologies can aid in the integration and provisioning of data, and in the MD schema design of OLAP systems.

2.2 OLAP use cases categorization criteria

In this section we propose a series of criteria that aim to capture the main aspects of emerging DW/OLAP systems. The criteria aim to reflect the main components of a traditional DW/OLAP system, and how these components are evolving to cover the new requirements posed by the new scenarios (e.g., social networks, linked data and big data). The resulting categorization schema aims to identify the commonalities and differences of emerging approaches in terms of the changes they propose with respect to the traditional components of DW/OLAP systems, and is the result of long discussions and analysis of some previous work around the new scenarios for OLAP (e.g., fusion cubes in [5]).

The proposed categorization of DW/OLAP systems relies on five criteria. These criteria are generally orthogonal, but for certain types of DW/OLAP systems there will be correlations between them, as discussed below. The criteria are of a functionality-only nature (i.e., independent of the underlying technologies used to provide them). The first three criteria are related to data provisioning, while the latter two are closer to the data sources and the data schema design. The criteria are shown in Fig. 2. Generally, the center of the figure represents Traditional use cases, whereas the outer rim are use cases that are Exploratory or otherwise somehow harder, in terms of the required reasoning computation. SW technologies can provide large benefits even in the central area, where Semantics-aware Traditional systems/use cases are located, but in the complex cases found at the outer rim, where Exploratory requirements appear, having sufficient semantics and reasoning power is essential.

Fig. 2. OLAP Systems Categorization Criteria (axes: Materialization: Full, Partial, Result Keeping, Virtual; Transformations: Complex, Partition-tolerant, Lightweight; Freshness: Periodic, Micro-batches, On demand, Right-time, Streaming; Structuredness: Structured, Semi-structured, Unstructured; Extensibility: Static, Evolving, Dynamic)

2.2.1 Materialization

Starting from the top of the figure, we first find Materialization. This criterion concerns the level of materialization of the integrated data. In Traditional DWs, all the integrated data is fully materialized (i.e., Full), often including a so-called data staging area for performing transformations and cleansing. At the other extreme, Virtual DWs extract data from sources at query time, integrate them on the fly, return the result to the user, and then throw away the integrated data. Notice that the ETQ processes described in the previous section fall in this category. A compromise, where some data is materialized, while other data, typically data with many changes, are extracted at query time, is sometimes used (see [12]).

Closer to the Virtual DW, the Result Keeping approach first extracts data on-demand from sources and computes the result on the fly (e.g., for displaying in a dashboard), but then stores/keeps the results to allow repeated requests for the same result to be delivered quickly (see [9]). Complex ETL flows may actually have subparts each residing in different categories (i.e., Partial). For example, it is common to have an “on-line” flow that performs fast, but less thorough, on-the-fly integration in main memory for immediate use, while a parallel “off-line” flow performs more thorough integration for historical use and stores all data persistently (see [13]). Here, SW technologies can be used to describe the data and the results, as well as the steps in between.
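A minimal sketch of the Result Keeping behavior, assuming a hypothetical run_query() that performs the on-demand extraction and integration:

_kept_results = {}

def result_keeping(query_key, run_query):
    # First request: behave like a Virtual DW and compute on the fly;
    # later requests for the same result are served from the kept copy.
    if query_key not in _kept_results:
        _kept_results[query_key] = run_query()
    return _kept_results[query_key]

total = result_keeping(("catches", 2013), lambda: sum([1.2, 3.4, 0.7]))
again = result_keeping(("catches", 2013), lambda: None)  # not recomputed
assert total == again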
2.2.2 Transformations

Proceeding clockwise, the next criterion is Transformations. This criterion concerns the level of transformations applied to the source data during the integration process. In Traditional DWs, it is common to apply many Complex and significant transformations, e.g., creating versions of data (i.e., Slowly Changing Dimensions, SCD, in [14]), significant cleansing, computing holistic aggregates, etc. At the other end of the spectrum, some use cases demand only Lightweight transformations that can be done quickly on the fly (even for streaming data), e.g., moving averages, simple and approximate aggregations, renaming/removing columns, etc. As mentioned above, such light data quality improvement and integration are sometimes complemented with a parallel flow performing complex transformations for later use. As a middle category, some systems apply transformations that are non-simplistic, but Partition-tolerant and thus parallelizable, such as categorizing values, etc. SW technologies provide a powerful framework for describing the transformations and for managing the lineage of results through these.
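The dividing line between Partition-tolerant and truly holistic transformations can be illustrated with a small Python sketch (invented data): a sum can be computed per partition and merged, whereas a median cannot be assembled from per-partition medians:

partitions = [[3.0, 1.0], [4.0], [1.0, 5.0]]    # data split across nodes

# Partition-tolerant: independent partial results plus a cheap merge.
partial_sums = [sum(p) for p in partitions]
total = sum(partial_sums)                        # 14.0

# Holistic: all values must be gathered before the answer exists.
all_values = sorted(v for p in partitions for v in p)
median = all_values[len(all_values) // 2]        # 3.0

The cheap merge step is what makes the first kind parallelizable, and hence acceptable in Exploratory settings.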
2.2.3 Freshness

The next criterion is Freshness, which concerns how often the data integration process is performed (i.e., how often the DW is refreshed). Traditional DWs were refreshed Periodically (e.g., daily, in batch mode). A variation of this is Micro-batches, where the refreshment is run often (e.g., every 15 or 30 minutes), on the smaller batch of data accumulated in that period. Other DWs (e.g., the Virtual DWs mentioned above) refresh the data On demand, when requested by users. More recently, there has been a trend to refresh the DW even more frequently (e.g., with propagation delays of at most a minute or so). This is often referred to as near real time (or Right time) DWs. Finally, we can have Streaming data at very high rates like thousands of items per second, handled more or less immediately: the so-called data stream approach. Thus, this criterion is somewhat related to Materialization. For Full Materialization, all levels of Freshness make sense and the same applies specifically for the specific data covered by Partial Materialization. The categories Result Keeping and Virtual are tied to On demand Freshness. This criterion can be heavily affected by the usage of SW technologies (e.g., for formally stating the freshness and the associated criteria and policies).
2.2.4 Structuredness

The next criterion is Structuredness, which concerns which types of data are found in the data sources or, more specifically, how Structured the least structured type of source data is. In Traditional OLAP cases, all sources consist of structured data, typically relational tables or, in a few cases, structured spreadsheets. More recently, Semi-structured data sources such as XML, RDF and OWL have become more common. Lately, Unstructured data such as text files, HTML files, and others like e-mails or tweets (part of the “big data” movement) have become important sources. SW technologies can be used for all three kinds, but their power is especially necessary to manage the more complex needs of semi- and unstructured data.
2.2.5 Extensibility

The next and last criterion is Extensibility. This criterion concerns how Dynamic the set of data sources can be, i.e., how easily new data sources could be brought into the system. In Traditional DW/OLAP, the same (mostly internal) Static data sources are used over and over, and new sources are only brought in at new major DW releases (i.e., at most a few times per year). Recently, there has been a trend to include new data sources, often from external data suppliers, into an existing DW more often in order to answer new questions, making the source set Evolving. In this context, an Evolving system is able to adapt its MD schemas to evolving data sources as well as user requirements. These systems have been widely studied in the literature (see reviews from [15], [16], [17]), and they can be considered within the traditional DW/OLAP boundaries.

Finally, some cases can be very Exploratory, looking for new sources most of the time, in order to answer specific, but constantly changing questions, making the source set completely Dynamic. This criterion should be seen as how easy it is to evolve the schema, more than how often it is actually done, which does not depend on the technologies used, but only on the business needs. While it makes good sense to use SW technologies to describe even rather static DWs, it becomes essential to have a powerful semantics and reasoning framework for the Dynamic case.
2.2.6 Discussion

DW/OLAP systems take many forms, but we see two separate “rings” in the five-dimensional space emerging, with many possible stands in between. The inner ring (core) is Traditional OLAP, which is well understood and aims at answering a rather static and well-defined set of questions, mostly on structured data. The outer ring is what we will call Exploratory OLAP, which aims at answering new and constantly changing questions on a much wider range of data.

We note that a number of similar terms have been suggested for the broader case of BI systems (covering not only DW/OLAP, but also analytics/data mining), including “live BI” in [18], “on-demand BI” in [19], “ad-hoc BI” in [20], “open BI” in [21], “situational BI” in [22], or lately “fusion cubes” in [5]. However, given our focus on (multidimensional) data modeling and data acquisition, we think that the term Exploratory OLAP is more precise and better captures its essence. The most important criterion for distinguishing between them is Extensibility. Traditional OLAP cases have a Static set of data sources, while Exploratory OLAP often, or all the time, brings in new data sources, making them Evolving or Dynamic. For the Structuredness criterion, Traditional OLAP tends to use mostly Structured data, while Exploratory cases also use Semi-structured and Unstructured data sources (e.g., text, social media data, etc). For the Materialization criterion, Traditional OLAP typically uses a Materialized DW, including an intermediate materialized data staging area. In contrast, Exploratory OLAP will often use a Virtual approach where data is pulled from sources on-demand, although some level of materialization is possible (e.g., caching). Data staging areas are typically not used. For the Transformations criterion, Traditional OLAP will typically have Complex transformations such as maintaining SCDs and computing precise holistic aggregates, while Exploratory OLAP will only employ transformations that can be performed sufficiently fast (perhaps on streaming data) and in parallel (i.e., the transformations have to be Lightweight or at least Partition-tolerant). We note that even such transformations can in fact require heavy computations (e.g., machine learning computations, data/text analytics, User Defined Functions (UDFs), etc). Also, such transformations often blend several traditional stages all in one complex step (e.g., starting with business processes, over ELTing, analytics, to reporting and visualization). For the Freshness criterion, Traditional OLAP will typically employ Periodic updates, perhaps executed very frequently (i.e., Micro-batches), while Exploratory OLAP will typically extract some data either On demand, in Right-time, or from Streaming.

A given system is often not clearly of either type, but rather somewhere in the continuum between them, facilitating more or less Exploratory use cases. Also, both kinds of cases will often co-exist in the same organization, or even as a “hybrid case” in the same system, with more Traditional cases, or system parts, for the core day-to-day analytical tasks on internal data, and other more Exploratory cases/parts for ad hoc analyses.

To summarize, while SW technologies can provide significant benefits even for Traditional systems, they become indispensable to handle the complexities and dynamics found in Exploratory ones.

Fig. 3. SW Technologies Categorization Criteria (axes: Reasoning: None, Standard, Non-standard; Expressiveness: Low, Medium, High; Computation: Easy, Medium, Hard)
2.3 SW technologies for OLAP systems

SW technology is aimed at providing the necessary representation languages and tools to express semantic-based metadata. This focus on semantics is very useful for Exploratory OLAP systems, where the vast amount of unstructured or semi-structured sources demands new semantic-aware solutions that enable machine-processable data integration.

SW technologies can aid the development of Exploratory OLAP systems in two aspects: on the one hand, ontologies serve the purpose of formally conceptualizing both the domain of interest and the business concepts. On the other hand, by means of semantic annotation, different data sources can be mapped to ontology concepts, resulting in a homogeneous conceptual space where we capture the meaning of the integrated elements.

Most ontology languages, such as the Web Ontology Language (OWL; the W3C recommendation), have strong foundations in logics and differ from other semantic-aware technologies in that they are machine processable and support reasoning. Thus, we can describe concepts and relationships but also infer implicit knowledge from that explicitly stated. Two main families of logic-based languages currently underlie most of the research done in this direction: Description Logics (DL) and Datalog-related logics (see [23] and [24], respectively). As discussed in [25], both paradigms can be used to establish ontologies, but from different points of view.

DL-based languages, such as OWL, assume a decentralized approach, and information is stored separated from data. Thus, one talks about terminology and asserted instances. DL also follows the open-world assumption and, accordingly, a DL ontology can have many different interpretations. In contrast, Datalog follows a centralized viewpoint, the closed-world assumption and the unique name assumption. A direct consequence is that DL ontologies are more difficult to model but they better deal with incomplete data (such as Web data), whereas Datalog ontologies are more intuitive for the database community but might not be that interesting for integration cases with missing or partial information. In our case, DL suits decision-making processes where not having information about a fact does not necessarily mean it is not true; e.g., what-if analysis.

In this paper, we pay special attention to the language used to describe the ontology and semantic annotations, as the more expressive the semantics of this language, the more machine-processable the annotations, but also the more computationally expensive. In the next section, we describe three criteria that capture this trade-off and categorize state-of-the-art OLAP systems literature w.r.t. their use of logic-based ontology languages.
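The annotation mechanism itself can be as simple as asserting typing triples. The following rdflib sketch (hypothetical source URLs and ontology) maps elements of two heterogeneous sources to the same ontology concept, after which both can be treated uniformly:

from rdflib import Graph, Namespace, URIRef, RDF

EX = Namespace("http://example.org/onto#")
g = Graph()
g.bind("ex", EX)

# Semantic annotations: source elements become instances of ontology concepts.
g.add((URIRef("http://src1.example.org/sales#prod_id"), RDF.type, EX.Product))
g.add((URIRef("http://src2.example.org/csv#column3"), RDF.type, EX.Product))

for src_element in g.subjects(RDF.type, EX.Product):
    print(src_element)  # a homogeneous conceptual space over both sources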
2.4 SW technologies categorization criteria

Although logic-based languages are very appealing for their semantic-awareness and reasoning features, reasoning is computationally hard. For this reason, most research done in this direction is focused on balancing the language expressiveness and the reasoning services provided according to each scenario (see [23]). This trade-off is traditionally captured in terms of three criteria (represented in Fig. 3): Reasoning capabilities provided, language Expressiveness, and Computation complexity. Without loss of generality, in the remainder of this paper we focus on how research on OLAP makes use of logic-based ontology languages and the trade-off offered w.r.t. these criteria.

2.4.1 Reasoning

Starting clockwise from the top, the Reasoning criterion concerns the inference algorithms needed. We mainly talk about the use of Standard reasoning
services (such as subsumption), Non-standard inferences (such as schema matching, transitive closures, temporal reasoning [26], and so on), and no use of reasoning (i.e., None). We say a reasoning service is Standard if it is supported by most reasoners. The typical inferences provided by DL reasoners are concept satisfiability, subsumption and query answering (see [23]). Concept satisfiability and subsumption sit at the terminological level, whereas query answering also deals with instances. Relevantly, very few DL languages (e.g., DL-Lite in [27] and the OWL2 QL profile, based on DL-Lite) properly support query answering, which means that, in practice, query answering is prohibitively costly for large data sets, such as those in OLAP scenarios. Thus, most DL languages are typically used at the terminological level. Concerning Datalog, since terminology and instances are not separated, its reasoning services are query-oriented and its most typical inference is query answering. Among Non-standard reasoning services, we underline: concept matching, the least common subsumer, and computing the functional transitive closure. Discovering functional dependencies, by means of computing the functional transitive closure, would also be of great interest for the OLAP community because, as discussed in Section 4, discovering MD schemas is nowadays mainly based on functional dependencies. All these inferences sit at the terminological level. However, matching and the least common subsumer have been shown to be costly and, in some cases, matching turns out to be undecidable (see [23]). For computing functional dependencies at the terminological level we need to compute the functional transitive closure, which is prohibitively costly in the general case. However, its feasibility has recently been shown in [28] for less expressive DLs. In case no reasoning is used (i.e., None), the ontology is primarily used as a dictionary or common terminology among peers.
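As a small illustration of a Standard-flavored inference, the subsumption closure of a (hypothetical) class hierarchy can be materialized with rdflib; a DL reasoner would derive the same subClassOf closure as part of classification:

from rdflib import Graph, Namespace, RDFS

EX = Namespace("http://example.org/species#")
g = Graph()
g.add((EX.Subspecies, RDFS.subClassOf, EX.Species))
g.add((EX.Species, RDFS.subClassOf, EX.Family))

# All transitive superclasses (subsumers) of EX.Subspecies.
subsumers = set(g.transitive_objects(EX.Subspecies, RDFS.subClassOf))
print(EX.Family in subsumers)  # True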
2.4.2 Computation

The next criterion is Computation. For this axis we do not mean classic theoretical computational complexity, but instead the feasibility of computing certain reasoning tasks under certain assumptions (i.e., in a given scenario). An expensive inference (e.g., computing the transitive closure of all properties in an ontology) computed once may indeed be more feasible than a relatively less complex reasoning task (e.g., computing subsumption in OWL DL ontologies) conducted relatively often (e.g., over a very large ontology and triggered by a certain event in the application GUI). Thus, this axis refers to how feasible certain reasoning tasks are from a practical point of view. Accordingly, by Easy we mean feasible tasks, Medium means a lot of computer time is needed (but the task is still feasible), and Hard means infeasible (either in practice for large data sets, which is the most typical scenario for OLAP systems, or theoretically proven).

2.4.3 Expressiveness

Finally, the Expressiveness criterion concerns to which extent we can describe the domain semantics by means of the terminological constructs at hand. For this axis, we distinguish between Low (mainly semantic annotations such as RDF triples or simplistic ontology-based descriptions such as concept taxonomies), Medium (ontology languages such as DL-Lite, OWL2 profiles and most Datalog languages, which are known to have a limited expressiveness but are able to deal with reasoning in terms of data complexity, i.e., considering the instances to resolve the inference algorithms) and High (highly expressive ontology languages, such as OWL DL, still feasible for inferences at the terminological level, but not in terms of data complexity). We do not consider more expressive languages such as OWL Full because even their cheapest inferences are prohibitive for OLAP.

2.4.4 Discussion

Both DL and Datalog approaches tackle the same problem: modeling ontologies to overcome the (potential) lack of semantics, which is the core of our discussion. However, their inherent characteristics make them tackle this problem from different perspectives. The relationship between the latter two criteria for most DL and Datalog languages is nowadays clear for Standard reasoning services. However, many OLAP solutions are built on top of semantic-aware technologies and need Non-standard reasoning services. In these cases, the feasibility (in terms of the Computation axis discussed above) of using reasoning services for certain ontology languages under certain assumptions (e.g., functional transitivity for multidimensional schemas) must be explored.

In the next sections, we analyze how semantic-aware solutions and specifically ontology languages can be used to overcome the difficulties discussed in Section 2.2 as we move away from the center along any of the five axes there presented (i.e., Extensibility, Materialization, Transformations, Structuredness and Freshness), with regard to the categorization (i.e., Reasoning, Expressiveness and Computation) for SW technologies presented in this section.

3 COMPARISON

In previous sections, we have introduced the four main stages of a DW/OLAP system (discovery, acquisition, integration and querying) and later, we have introduced a set of criteria to categorize current approaches. As we explained before, our main focus is to investigate how SW technologies can aid throughout these stages.

In practice, current approaches are transversal to these four conceptual stages and thus, they cannot be classified according to them. As in classical software design approaches, current solutions either focus on
Criterion/Category | Traditional | Medium | Exploratory
Structuredness | Structured | Semi-structured | Unstructured
Materialization | Full | Partially | Virtual
Transformation | Complex | Not only relational algebra | Relational algebra
Freshness | Periodic | Some sources on-demand | On-demand
Extensibility | Hard to add sources | Sources easily added | Sources added automatically

Criterion/Category | Easy | Feasible | Difficult
Reasoning | None | Standard | Non-standard
Expressiveness | RDF/Taxonomies | Datalog, OWL2 profiles, DL-Lite | OWL-DL
Computation | Less than polynomial | Polynomial | Exponential

TABLE 1
Generalized categories

the MD schema design of OLAP systems (i.e., at the schema level), or on the integration and provisioning of data (i.e., at the data/instance level). For each of these two categories (schema vs. data), we have identified representative papers, and subsequently divided them into two subcategories (for a total of four categories of papers), depending on whether SW technologies are applied to satisfy the requirements of Traditional OLAP systems (here denoted Semantic-aware OLAP systems), or to support (to some extent) the new set of requirements of Exploratory systems. The papers were selected based on our experience, depending on how well they exemplify the categories.

As previously explained, being Traditional or Exploratory is not boolean, but a continuum with a blurry border. Also, inside each of the five criteria in Fig. 2, there is a continuum. Thus, to facilitate the visual comparison, we have simplified the descriptive scale of each criterion into only three generic categories, depending on how close they are to the mid point of the space (i.e., Traditional OLAP). Table 1 summarizes these fifteen resulting categories, together with the nine categories for the SW criteria in Fig. 3. Notice that the max (resp. min) of these three categories does not necessarily coincide with the max (resp. min) of the corresponding edge, because the purpose of the table is just to outline the differences in the approaches (i.e., the table shows the max, resp. min, found in the analyzed papers). These exactly coincide for Structuredness and Materialization; for Transformation, the typical Lightweight set of operations we found is Relational Algebra (so we reflected this in the table); in the case of Freshness, the more Exploratory papers we found considered On-demand refresh; finally, with regard to Extensibility, we mapped papers automatically adding sources to Dynamic. To decide the classification of each work in each category, we checked the content of the corresponding papers looking for matchings to the definitions in Section 2. If not enough information was provided to classify some work in a given category, we crossed out the corresponding cell.

Table 2 summarizes the findings in the analyzed work, the details of which will be provided in Sections 4 and 5. This table is aimed at evaluating existing work under the specific criteria, and allows us to visually analyze the correlation between being more Exploratory, using semantics, and the incurred cost. For each of the relevant papers identified, we show its position in each of the five DW criteria, and also the value in the three SW ones. Horizontally, the table is divided into four parts corresponding to semantic-aware MD design, multidimensional query definition, semantic-aware ETL processes, and ETQ processes. When one paper deals with both issues, MD design and data provisioning, it appears twice in the table and is analyzed from both perspectives (which may result in apparently contradictory classifications, caused by the different viewpoints of the analysis).

Let us start from the upper part of Table 2 (i.e., semantic-aware MD design). As previously discussed (and further discussed in Section 4.1), the Transformation and Freshness OLAP criteria do not apply for these papers. These criteria are more related to data than to schema and have not been typically considered for MD design. Focusing on the other criteria, the common characteristic of all papers is that they completely materialize the DW. Four of them deal with semi-structured data, and among these, two facilitate to some extent their inclusion in the DW. The exception to this is Neumayr et al. [34], because they present an extension of the work (i.e., [35]) which adds the Exploratory part. As a general rule, when a work only deals with the schema, it allows High Expressiveness, but dealing with schemaless unstructured data requires lowering the Expressiveness to Medium in order to process the huge amount of data in a DW. In any case, the computation would consume a lot of resources, except if one restricts the Expressiveness and at the same time only deals with the schema of a well-structured database.

The second part of the table corresponds to more Exploratory design systems (see Section 4.2). We can see that in this case, the common characteristic to all of them is that they are able to deal with unstructured data, as soon as ontological mappings are provided. Also common to all of them is that they to some extent facilitate the addition of new sources to the DW. However, most of them materialize extracted data, do it off-line and allow only lightweight transformations.
Per-paper ratings on the five DW/OLAP criteria (Str., Mat., Tra., Fre., Ext.) and the three SW criteria (Rea., Exp., Com.), using the generalized categories of Table 1, grouped into four parts:
Semantic-aware data schema design: Niemi et al. [29], [30]; Priebe et al. [31]; Bakhtouchi et al. [32]; Prat and Akoka [33]; Neumayr et al. [34], [35]; Abelló et al. [36]; Romero and Abelló [37]
Exploratory data schema design: Kämpgen et al. [11]; Romero et al. [38]; Nebot et al. [39]; Khouri et al. [40]; Kämpgen et al. [41]; Nebot and Berlanga [10]
Semantic-aware data provisioning: Niemi et al. [29], [30]; Skoutas and Simitsis [42], [43]; Skoutas et al. [44]
Exploratory data provisioning: Romero et al. [38]; Nebot and Berlanga [10]; Pedersen et al. [9]; Kämpgen et al. [41]

TABLE 2
Summarized comparison

Kämpgen et al. [41] and Nebot and Berlanga [10] seem to move in a more Exploratory space, though. Not surprisingly, High Expressiveness results in the need for heavy computations. As the two last approaches in this part show, the only way to avoid heavy computations is to avoid reasoning.

The third part of the table corresponds to semantic-aware ETL (see Section 5.1). We can see that the common characteristics in this case are Extensibility, Materialization, and Structuredness. The work in this area tries to automate to some extent the generation of data flows into a DW. Then, they allow more or less unstructured data and simple transformations. Relevantly, since some of the approaches work at the conceptual level, they do not pay attention to more physical characteristics like Freshness. Due to the inherent complexity of the problem, some drastically limit their Expressiveness in order to be feasible. The rest of them result in high Computation needs.

Finally, the last rows in the table summarize the work related to ETQ (see Section 5.2). They all avoid the complete materialization of data and thus improve their Freshness. Some facilitate Extensibility more than others, but all allow dealing with semi- and/or unstructured data. Given the cost of querying this kind of data, they avoid reasoning or keep it to a minimum (e.g., Pedersen et al. [9] does not even use RDF semantic constructs). Also, they do not deal with heavyweight transformations, but keep these as simple as possible. Despite this fact, the computing power needed to run them is not low, in general.

Retaking our watchdogs example from the introduction, a semantic-aware data design could be used if we can (beforehand) define the mappings from all the different data sources to the ontology showing the watchdogs' vocabulary (i.e., we must know in advance that we will use statistical catch data, geographical information about marine protection areas, etc). Once this technical work is done by someone in the watchdogs group, the others could use the ontological knowledge to navigate and analyze all these data, which would most probably have been replicated onto their server for the sake of performance. Oppositely, if such planning cannot be done beforehand and the watchdogs need to discover the different data sources on the fly (or there is no such expert in the group able to define the mappings), then there should be a public ontology where the publishers of catch information and marine protection areas map their contents. Given that, the watchdogs could use exploratory techniques to navigate that public ontology and find what is worth analyzing among the available sources.
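For instance, if the species taxonomy were published as a SKOS concept scheme, rolling catch figures up to the Family level would need nothing beyond following broader links; a hedged rdflib sketch with an invented taxonomy:

from rdflib import Graph, Namespace
from rdflib.namespace import SKOS

EX = Namespace("http://example.org/species#")
g = Graph()
g.add((EX.BalticCod, SKOS.broader, EX.Cod))   # Subspecies -> Species
g.add((EX.Cod, SKOS.broader, EX.Gadidae))     # Species -> Family

catches = {EX.BalticCod: 1.2, EX.Cod: 0.7}    # measures at mixed levels

# Roll-up: a catch counts towards every broader level above it.
family_total = sum(
    t for concept, t in catches.items()
    if EX.Gadidae in set(g.transitive_objects(concept, SKOS.broader)))
print(family_total)  # 1.9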
With regard to data provisioning, given the same premises as in the data design, semantic-aware techniques can help to design the extraction processes, which may include complex cleaning algorithms, because the data is extracted beforehand and locally stored. This would need the intervention of some expert inside the group. On the contrary, if no complex cleaning is necessary and we do not want to replicate the data in our server, the watchdogs could use an exploratory approach to issue simple queries directly to the data providers. The ontological knowledge should facilitate the integration of those data on the fly. We do not preclude the expert from participating,
but just consider that his/her intervention can and should be minimized.

4 DATA SCHEMA DESIGN

MD design is a well-known paradigm in the area of DW and databases in general, always related to OLAP tools. It was popularized by Ralph Kimball at the logical level in [14].

Multidimensionality is based on the fact-dimension dichotomy. This paradigm aims at analyzing the fact (or subject of analysis) instances from different analysis dimensions (i.e., points of view). Several measures (i.e., metrics) are available for each fact instance in order to gain insight. Furthermore, the MD model also provides foundations to study/analyze the available measures at various aggregation levels determined by the hierarchical structure of the dimensions. Indeed, aggregation is one of the main characteristics of the MD model, setting foundations for the well-known roll-up and drill-down operators.
roll-up and drill-down operators.
foundation for integrating heterogeneous data and
Conceptually, it entails that the fact is related by
most work further exploring this direction can be clas-
means of to-one relationships (i.e., functional depen-
sified either as those focusing on Web data (or similar
dencies) to the dimensions, which in turn identify
scenarios), where the presence of the SW technologies
the fact. Thus, a fact has no atomic identifier but
is granted, and those using SW technologies to tackle
instead a compound one composed by the set of
integration in any scenario.
dimension values univocally identifying the factual
Among the first ones, one may find the work dis-
data (from now on we will refer to these identifiers
covering MD schemas from XML (a review and deep
as MD identifiers). Finally, dimension hierarchies are
discussion of DW approaches for XML and Web data
composed of aggregation relationships between the
can be found in [46]). In the general case, however,
dimension levels. Discovering these kinds of relation-
these approaches can be clearly improved by using
ships is crucial in the design of the OLAP cubes and
more expressive SW formalisms that facilitate inte-
in turn of the DW.
gration by incorporating a reference semantic layer
In next subsections, we first review the Traditional
whereto every domain concept and relationship can
OLAP work using SW technologies in the design and
be mapped.
then, we discuss more Exploratory approaches, where
The suitability of DL for data modeling was claimed
no clearly delimited design phase exist.
back in 1998 (see [47]) and by then the first work
acknowledging the benefits of modeling the DW
4.1 Semantic-aware Multidimensional Design by using such formalisms appeared in the DWQ
The main features of Traditional OLAP systems di- project. This line of work highlights the need of
rectly impact on how MD design has been tackled for capturing functional dependencies and aggregation
such systems. Since these systems largely materialize relationships to model the DW. A few papers focus-
the integration layer, designers arrange the integrated ing on Datalog to model the DW also appeared at
data in an MD fashion ready to be exploited by non- that time. Although Datalog has been overlooked for
expert users. Complex and Periodic ETL processes modeling DWs until very recently, it was, indeed, the
guarantee a quality threshold for the integrated data, first logic-based approach proposed for DW modeling
which comes from Static and Structured sources in [48]. This work tackled the very same problems
(typically relational databases). In some cases, this just discussed for DL: how to deal with functional
data is completed with data coming from external or dependencies and summarizability. A proper Datalog
non-structured sources. extension was presented to do that.
Thus, Traditional MD design first focuses on iden- However, the state of the art on DL and Datalog
tifying the needed subset of source data to answer at the time did not provide strong evidence about
the end-user analytical needs and then arranges them. the feasibility of such approaches, and reasoning al-
Since sources are mainly relational and static, the gorithms (needed for validation purposes) over such
process can be automated up to some extent by in- languages were computationally expensive and thus
ferring a mapping between the two schemas. In these unfeasible for real cases. For this reason, it was not un-
scenarios, MD schemas are identified by exploring til ten years later that the first work presenting semi-
the data sources in order to discover MD patterns automatic methods to support the MD design task by
12

Fig. 4. Ontologies for Semantic Annotations

Fig. 5. Ontologies for Domain Modeling

4.1.1 Ontologies for Semantic Annotations

Fig. 4 depicts one trend in this category of work, where the ontology plays a passive role. Thus, the data sources are checked (mainly sampling data using data mining techniques) to identify functional dependencies and MD identifiers and then annotate the findings in a reference ontology, following a GaV integration model. From this asserted information, facts are identified and, from each fact, a star-shaped schema is produced by deploying correct hierarchies and aggregations by means of functional dependencies. Consequently, note that the ontology is a repository of semantic annotations aimed at hiding heterogeneities. Reasoning is used to perform satisfiability and consistency checks (in the general case, reducible to subsumption checking) to validate the asserted knowledge, whose complexity varies according to the ontology language expressiveness.

As shown in Table 2 (see Section 3), all approaches in this section (see Data Schema Design - Sem.-aware) assume Full materialization of the DW. As a common trend, although they use SW technologies to overcome heterogeneities, the final DW produced mostly follows the Traditional assumptions and thus does not fulfill the requirements of Exploratory cases.

Those papers following the trend depicted in Fig. 4 (i.e., [29], [30], [31], [32], [33], [34], [35]) neither provide support for tackling the system Extensibility in an automatic way nor deal with Semi-structured or Unstructured data. More specifically, Niemi et al. [29], [30] and Priebe et al. [31] have Low Expressiveness and thus yield Medium Computation (e.g., OWL 2.0 RL). Somehow, these approaches can be considered precursors to current Exploratory efforts aimed at bridging the gap between OLAP and linked data (e.g., [49]). Oppositely, Bakhtouchi et al. [32] and Prat et al. [33] try to tackle any scenario and choose an expressive standard DL language such as OWL DL. Thus, they provide High Expressiveness and yield Hard Computation.

Finally, Neumayr et al. [34] and Anderlik et al. [35] present a detailed layered approach consisting of a flat domain (i.e., with no MD meaning), a hierarchy domain (showing roll-up relationships) and an MD domain. By means of integrity constraints (Datalog rules without head) and Datalog inference capabilities, they guarantee that the asserted information does make MD sense (i.e., preserves the MD constraints). Thus, they propose to analyze traditional data by using external ontologies as semantic dimensions (former work in this line was proposed in [50], in which a domain ontology serves as a basis to define an MD schema for aggregating instance data).
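To give the flavor of such headless integrity constraints, the toy sketch below (hypothetical data, with the corresponding Datalog denial rule in a comment) rejects roll-up assignments that are not to-one and would therefore break summarizability:

    # Toy check of an MD integrity constraint in the spirit of Datalog denial
    # rules: ":- rollsUpTo(X, Y1), rollsUpTo(X, Y2), Y1 != Y2." (a rule without
    # head fails whenever a City member rolls up to more than one Country).
    rollup = {                      # hypothetical City -> Country assignments
        "Barcelona": {"Spain"},
        "Aalborg":   {"Denmark"},
        "Nicosia":   {"Cyprus", "Greece"},   # violates the to-one constraint
    }
    violations = [c for c, countries in rollup.items() if len(countries) != 1]
    print("Non-strict roll-ups:", violations)   # -> ['Nicosia']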
For all these papers, adding new data sources is reduced to linking the sources to the reference ontology. Thus, they provide a certain degree of automation to model the DW only once the sources have been linked (i.e., mapped) to the reference ontology. This approach to add new data sources has two main consequences: the ontology languages used are less expressive (e.g., DL-Lite) but still the reasoning tasks carried out are computationally expensive; besides Standard reasoning such as subsumption checking, they need to compute the transitive closure of functional dependencies (a Non-standard service). Yet, these papers would rather use ad hoc algorithms than Non-standard reasoning services. Nevertheless, linking sources to the ontology can turn out to be hard in some scenarios (because mappings must be provided). Thus, in any case, adding new sources is time-consuming and it is not suitable for very Dynamic scenarios but for those where new data sources are added from time to time and intended to stay in the setting.

All in all, the main difference between these approaches is the ontology language used, distinguishing two main options: those focusing solely on RDF and those aiming to support a more generic and expressive approach. However, we note that both options have paid little attention to aggregation relationships up till now.

4.1.2 Ontologies for Domain Modeling

Fig. 5 sketches an alternative that achieves a larger degree of automation, while still in a Traditional scenario. In this case, sources are mapped to a reference ontology (with no MD meaning) and reasoning is exploited to identify functional dependencies and/or MD identifiers over the ontology rather than relying on the designer explicitly asserting them; all in all, this makes the ontology play an active role. In general, expensive algorithms are triggered to identify MD patterns (by checking the MD constraints) and then, an MD schema based on this knowledge is created.

Most of this work assumes that an ontological representation of the sources is available (i.e., they are working with Web data and taking the presence of SW technologies for granted).
Abelló and Romero [36] examine the ontology to
look for MD identifiers (in order to identify facts),
which is known to be expensive in the general case
due to the large amount of instances a DW must deal
with. To yield a tractable complexity, this work pro-
poses the use of a reference ontology to pre-identify
candidates and drastically reduce the number of tests
(i.e., data samplings) to be carried out. Standard
reasoning inferences (mainly subsumption checking)
are used to spot out potential identifiers. However, ad
hoc algorithms are introduced to deal with transitivity at the terminological level for OWL DL.
Romero and Abelló [37] identify, for each fact, potential dimensional concepts which are arranged in well-formed hierarchies by means of functional dependencies. In this sense, we can build an MD schema from the inferred functional dependencies. As common background, this kind of work suffers from the problem of computing the transitive closure of functional dependencies. In general, the transitive closure cannot be computed with Standard reasoning services and therefore [28] proposes a trick to simulate it in a DL-Lite_A ontology by means of certain answers, while other similar approaches use ad hoc algorithms. Fortunately, transitive functional dependencies have been considered in OWL 2.0, under certain constraints, which opens the door for tackling this issue with Standard reasoning services (to at least some extent).
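As an illustration, the hedged sketch below (hypothetical ex: vocabulary, assuming the rdflib and owlrl packages) uses an OWL 2 property chain to derive the composed to-one relationship Sales to City to Country and materializes the inference under OWL 2 RL semantics; note that OWL 2 imposes regularity restrictions on such composite properties, which is why this only works "under certain constraints":

    # Hedged sketch: deriving a transitive functional dependency via an OWL 2
    # property chain (ex:inCountry is entailed by ex:inCity followed by
    # ex:cityInCountry), materialized here with an OWL 2 RL rule engine.
    from rdflib import Graph
    import owlrl

    g = Graph()
    g.parse(data="""
        @prefix ex:  <http://example.org/cube#> .
        @prefix owl: <http://www.w3.org/2002/07/owl#> .
        ex:inCountry owl:propertyChainAxiom ( ex:inCity ex:cityInCountry ) .
        ex:sale1 ex:inCity ex:Barcelona .
        ex:Barcelona ex:cityInCountry ex:Spain .
    """, format="turtle")

    owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)  # apply OWL 2 RL rules
    ask = """PREFIX ex: <http://example.org/cube#>
             ASK { ex:sale1 ex:inCountry ex:Spain }"""
    print(g.query(ask).askAnswer)  # -> True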
To our knowledge, no work has tackled the problem of discovering dimensions or MD identifiers from Datalog-based ontologies. However, Neumayr et al. [34] set foundations towards discovering functional dependencies by means of query answering over their Datalog metamodel.

4.2 Multidimensional Query Definition

In an Exploratory OLAP scenario, MD design should be driven by the user queries. This is because Exploratory users expect to be aware of new external data that fulfill their current business requirements, which are not covered by their internal corporate data. In this scenario, MD design must be flexible enough to allocate new incoming external data and to conciliate them with both internal data and user requirements. To fulfill user requirements, an Exploratory OLAP system should ask for Fresh data in a highly Dynamic environment (e.g., the Internet), where an integration layer is hard to materialize. Thus, heterogeneity among sources becomes the main issue, as does the lack of structure in the sources, which calls for new techniques to identify MD patterns on the fly.

4.2.1 Ontologies for Semantic Linked Modeling

Fig. 6. Ontologies for Semantic Linked Modeling

Fig. 6 sketches the idea behind the work in this category. The main difference with regard to Fig. 5 is the lack of a centralized view of the domain (in terms of a single ontology) but the presence of several domain ontologies. Typically, data sources are published as semantic linked data, and users specify their requirements in the form of queries to existing catalogues of published datasets, which can be accessed via SW query languages (e.g., SPARQL). Exploratory MD design requires methods that should be user-driven and highly extensible. Semantic-aware approaches have been shown appropriate for conciliating user requirements and existing data sources during the MD design phase. Extensibility is usually achieved through a loose coupling between conceptual schemas and user requirements. Reasoning is applied for functional dependency discovery and for validating the resulting MD schemas, which can be expressed as DL expressions. However, a fully Exploratory MD design method has not been proposed yet. This would basically consist of discovering, selecting and automatically linking those available data sources that best fit the user requirements. Assuming that all data sources are published under SW standards (e.g., Linked Open Data -LOD-), the main issues that we have to address are the following:
• Discover datasets that fulfill the user requirements, and get the subsets that are of interest;
• Extract useful relationships between the retrieved data subsets, to discover potential MD facts; and
• Validate the discovered facts for mixed datasets that may have incompatible semantics and need to be integrated.

In the area of MD modeling, several user-driven and semantic-aware approaches have been proposed, which to some extent follow the integration models above. The feature common to all of them is that they allow the integration of Unstructured data by just mapping them to the available ontologies.

Firstly, Kämpgen et al. [11] propose a borderline Traditional DW to store and analyze statistical Linked Data (sLD) through OLAP queries. However, in this approach, there is no MD design because the DW only accepts sLD already expressed as MD data. It uses SPARQL as query language, and does not perform any kind of reasoning over semantics.
Romero et al. [38] follow the GaV philosophy. Thus, all data sources are tightly integrated into one global schema (called the annotation ontology). Then, an MD schema is constructed and validated constrained to the user's requirements, which are also specified in terms of the annotation ontology. Finally, ETL processes are semi-automatically designed to populate the generated model (see also Section 5.1). According to Extensibility, this work can be considered Evolving, since the inclusion or change of data sources implies the re-definition of the global schema, as well as their derived MD schemas and ETL flows.

Nebot et al. [39] propose a semantic DW, where both external and internal data are expressed, linked and stored as SW data (OWL format). Users specify their requirements by picking up concepts from the stored ontologies. Then, the system extracts the necessary logical modules to set up a global ontology from which facts and dimensions will be validated and generated from the DW.

Although this approach can be considered a Traditional DW, it has several features that make it Exploratory. First, if data sources are already expressed in OWL/RDF like in LOD, then their inclusion into the DW is straightforward. If not, a process of semantic annotation will be necessary. Mappings between ontologies and data sources of the DW can be semi-automatically obtained as in current semantic integration literature. Second, as the MD design is performed according to the user requirements, which are expressed in terms of the current ontologies in the DW, the resulting schemas are fitted to their specific needs at any moment. However, like Romero et al. [38], any change in the data sources will imply the re-definition and validation of the generated schemas and ETL flows. It must be pointed out that both approaches, [38] and [39], apply reasoning to discovering facts as well as validating the generated MD schemas.

Khouri et al. [40] propose another user-driven semantic integration approach. In this case, the authors propose a pre-defined global ontology into which data source ontologies are loosely integrated. User requirements are expressed as queries over the global ontology, which are executed to build up the DW conceptual schema (local ontology). In this approach, reasoning is applied to classify and validate the classes of the local ontology.

Kämpgen et al. [41] propose to directly query sLD to fulfill OLAP queries expressed in MDX, which resembles an Exploratory ETL flow as those described in Section 5.2. As in Khouri et al. [40], SPARQL is used, and no reasoning tasks are performed.

Finally, an alternative approach for extracting semantic dimensions from external ontologies is presented by Nebot and Berlanga [10]. In this case, two statistical measures are proposed for extracting MD-shaped hierarchies from the concept hierarchy. It is worth mentioning that these approaches deal with the concept taxonomic hierarchy (i.e., is-a relationships) to define roll-up operations, whereas most approaches mentioned in Section 4.1 deal with role chains.

5 DATA PROVISIONING

Another challenge towards Exploratory OLAP is to shift from DW-centric, Traditional ETL flows to broader data flows consisting of complex analytic operations, involving a plurality of different data types and sources, spanning multiple execution engines, and running under different freshness requirements and at different paces, ranging from slow or frequent batches to micro-batches or real-time processing.

As with Traditional and Exploratory OLAP, where the latter requires a solution that captures, transforms, and presents fresh data in order to answer changing questions, we also see the need for Exploratory ETL processes. Depicted in Fig. 1, we have named these processes Extract, Transform, and Query (ETQ) in order to differentiate them from traditional ETL-ing. ETQs should be able to gather data, apply computations, and produce dynamic reports or populate dashboards directly from (potentially evolving) user requirements. In addition, ETQ processes deviate from Traditional processing in that they may affect various stages of the design, as for example they may be used to populate DW constructs or to answer a business query by fetching data directly from the sources (like on-demand ETL in [12]).

With the widespread adoption of new technologies in the Web, such as XML and other richer semi-structured formats like RDF, important and useful information is being captured in a large variety of data sources. Thus, real-world ETL scenarios have to deal with the integration of heterogeneous information sources. Dealing with the problem of accessing structured, less-structured, or unstructured data in an integrated and transparent way still has open challenges. Semantic annotation appears as a promising means for dealing with such issues, as it provides a semantic abstraction that can support both homogeneous access to disparate data sources and resource discovery.
In this context, semantic-aware ETL processes are those that take into account these semantic annotations to improve the integration processes required in OLAP solutions. In the rest of this section, we first review work related to semantic-aware ETL processes and then, we discuss preliminary efforts that deal with heterogeneous and unconventional data sources, which are dynamically incorporated to the analysis.

5.1 Semantic-Aware ETL Processes

During the initial steps of a Traditional ETL project, the main goal is to construct a conceptual ETL design that identifies the data sources that are useful to the project and describes the corresponding data transformations needed to map these sources to the target DW concepts. For achieving that, it is imperative to identify and understand the semantics of both the data sources and the target data stores. Although this is at large an open problem, some approaches propose the use of SW technologies to facilitate and to automate the design and construction of the ETL part.

5.1.1 Ontologies for ETL Modeling

Fig. 7. Ontologies for ETL Modeling

Fig. 7 sketches (using Pentaho PDI icons) an approach to use SW technology for ETL. It uses a global ontology for mapping all the involved data stores to it. This idea resembles the LaV paradigm, where the application ontology, constructed as a conceptual schema of the domain, corresponds to the global schema and the semantic descriptions of the data stores, in terms of classes and properties defined in the ontology, correspond to the views describing the local schemas. However, the use of an OWL ontology, instead of a global schema, provides a formal model on which automated reasoning mechanisms may be applied. Furthermore, in ETL, it is not sufficient to consider the integration problem as a query rewriting problem, since the transformations taking place in real-case ETL scenarios usually include operations that cannot be easily captured by a query rewriting process (see [42]).
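As a simple illustration of such semantic descriptions in action, the sketch below (assuming rdflib; the s: and dom: vocabularies are hypothetical) uses a SPARQL CONSTRUCT query as an ontology map that rewrites source-level triples into the terms of a global domain ontology. As noted above, this only covers the simplest, rewriting-style fraction of real ETL transformations:

    # Hedged sketch of an "ontology map": rewriting source triples into a global
    # domain ontology with SPARQL CONSTRUCT (only the simplest ETL case).
    from rdflib import Graph

    src = Graph()
    src.parse(data="""
        @prefix s: <http://example.org/src#> .
        s:row42 s:car_model "Corolla" ; s:branch_id "B7" .
    """, format="turtle")

    mapped = src.query("""
        PREFIX s:   <http://example.org/src#>
        PREFIX dom: <http://example.org/domain#>
        CONSTRUCT { ?r dom:hasCarModel ?m ; dom:soldAtBranch ?b . }
        WHERE     { ?r s:car_model ?m ; s:branch_id ?b . }
    """)

    target = Graph()
    for triple in mapped:        # a CONSTRUCT result iterates over derived triples
        target.add(triple)
    print(target.serialize(format="turtle"))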
The mixture of SW and ETL technologies is still quite immature. However, the preliminary efforts in this area have shown a potential solution to the automation of ETL designs (mainly at the conceptual level) given a set of business requirements, focusing on being more Exploratory along Extensibility and Structuredness, by facilitating the load of semi-structured (i.e., XML or RDF) data. With respect to other categorization criteria presented in Section 2.2, the entire spectrum of Transformations presented in Fig. 2 could be supported by the methods proposed. However, in practice, so far only simple and relational-style operations seem to be considered. How to generalize this work to capture a richer set of transformations (like user-defined functions) remains to be seen. Freshness is basically left aside and not even mentioned in some of the work, since they mainly deal with the conceptual level of ETL design, whereas this criterion involves lower design levels (i.e., physical). The same holds for Materialization, and all the authors more or less implicitly propose just to load the results into some kind of data storage.

Regarding the reasoning mechanisms used, if any, they are always Standard (i.e., subsumption) and only involve the schema, never the data. Despite that, given the Expressiveness chosen by some authors, Computation turns out to be Hard in most of them. Only Skoutas et al. [44] reduce the computational needs by restricting the reasoning to simple taxonomies.

Firstly, Niemi et al. [29] describe methods for OLAP cube construction using SW technology. The authors use a generic OLAP ontology as an upper ontology for all OLAP cubes. This ontology defines only general OLAP concepts and it is independent of the application area. Per application need, they consider domain-specific ontologies (e.g., CarModel, Branch, Country) based on the upper one. Ontology mapping transformations may then need to be defined to describe how the source data should be converted to conform to the global domain ontology. In order to integrate data from different sources, the authors consider an RDF data format and an RDF query language.

As an extension to their work, Niemi et al. [30] discuss in more detail the method for automating the construction of OLAP schemas. Again, the source and target schemas are considered as known. The mapping among the source data and the OLAP schema is done by converting the data in RDF using ontology maps. Then, the relevant source data are extracted by RDF queries generated using the ontology describing the OLAP schema. At the end, the extracted data are stored in a database and analyzed using typical OLAP techniques. Both these approaches aim at an end-to-end design approach, but they have two main limitations. First, they both require prior knowledge of the source and target schemas and second, they consider only simple data transformations.

Skoutas and Simitsis [42], [43] present an approach to ETL design using SW technology that elaborates more on the complexity of the data transformations required for integrating source data from heterogeneous sources into a DW. This work deals with a core ETL design challenge: the structural and semantic heterogeneity. For example, two sources S1 and S2 may contain similar information under two different schemas or they may use different representation formats. This approach uses ontologies to formally and explicitly specify the semantics of the source and DW schemas and thus, to automate to a large extent the ETL generation. It also assumes that the source and target schemas are previously known. A conceptual ETL design is obtained, whose generation involves the automatic derivation of the mappings from the source to the target attributes, along with the appropriate ETL transformations. However, computationally and semantically complex ETL operations like pivot and slowly changing dimensions are not supported.
Skoutas et al. [44] extend the above mentioned work and propose using a graph-based representation as a conceptual model for the source and target data stores and, based on that, the ETL transformations can be identified by means of graph transformations, instead of generic reasoning (which turns out to be computationally cheaper). The authors describe how the operations comprising the ETL process can be derived through graph transformation rules, the choice and applicability of which are determined by the semantics of the data with respect to an attached domain ontology.

The techniques discussed so far consider as given the source and target schemas and search for the mapping of one to the other, by benefiting from ISA relationships in a reference taxonomy. As a further step, Romero et al. [38] aim at building ETL and MD conceptual designs (see Section 4.2) starting from a set of data sources and a set of business requirements (e.g., expressed as service level objectives). This approach first analyzes the data sources mapped against a domain ontology. Then, in an iterative fashion, it produces two results: (a) an MD design that satisfies the requirement at hand (e.g., fact and dimension tables, relationships among them, etc.); and (b) a flow of conceptual ETL operations required to feed the MD schema by connecting it to the related data sources.

5.2 ETQ Processes

As earlier mentioned, Exploratory OLAP aims at analyzing "fresh" data coming from a wide range of data sources. Fig. 1 shows that the provision of this kind of data would require alternative ETL processes, which we called ETQ. As discussed before, this area is generally unexplored by current approaches, but this contrasts with the great demand for this kind of process within the OLAP community.

SW technologies can play a crucial role in the development of ETQ processes. On the one hand, they facilitate to some extent the processing of unconventional data sources, containing semi-structured or unstructured data. Adding semantics to such data helps in shaping them under some notion of structure. This can be very helpful, given the effort spent in ETL research and materialized in ETL tools for dealing with structured information. On the other hand, SW technologies are enabling a new emerging scenario (i.e., the Web of Data) that offers new possibilities to publish semantic data for easy consumption by analytical tools. In this new scenario, virtualization and freshness can be fully achieved thanks to the use of SW references (e.g., URIs), which allows applications to access the required linked data at any time.

Thus, the main aim of ETQ processes is to avoid as much as possible complex computations over the imported data, so that these data can be included as fast as possible within the analytical queries (Freshness criterion). Unlike traditional ETL processes, ETQ processes can also be defined over unconventional data sources (Structuredness criterion), such as those usually found on the Internet. Current approaches partially cover some of these aspects by adopting RDF-based technology. However, applying reasoning within these processes is still an unexplored issue.

Firstly, Nebot and Berlanga [10], as a step towards the definition of semantic-based ETQ processes, propose a method to generate fact tables directly from semantic data published as RDF(S) and OWL. This method starts with the target MD query, which must be expressed in terms of concepts and properties of the source ontologies, and then it performs a series of transformations that guarantee that the generated factual data conforms to both the MD query and the source semantics. The main drawback of this approach comes from its computational complexity, since it requires computing the aggregation patterns within the ontologies through a reasoner.

Pedersen et al. [9] provide an example of ETQ processes proposed to extend OLAP with external data. This method allows the execution of OLAP operations that involve data contained in external XML (not even RDF) sources through XPath queries. In doing so, external data can be used as dimensions and/or measures of OLAP cubes, offering high Extensibility and Freshness of the results. Still, the XML sources need to be logically integrated within the OLAP system, preventing it from being completely Dynamic.
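The following toy fragment (hypothetical XML document) illustrates the underlying idea of fetching external values through XPath-style selection, e.g., to serve as an extra dimension or measure of a cube:

    # Toy illustration: selecting an external value with the XPath subset
    # supported by Python's ElementTree; such a value could feed a cube
    # dimension or measure at query time.
    import xml.etree.ElementTree as ET

    doc = ET.fromstring("""
    <rates date="2014-01-01">
      <rate currency="DKK">7.46</rate>
      <rate currency="USD">1.36</rate>
    </rates>
    """)
    dkk = doc.find("./rate[@currency='DKK']")
    print(dkk.text)  # -> 7.46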
Finally, Kämpgen et al. [41] directly define OLAP operations over external data expressed in RDF and annotated using the RDF Data Cube vocabulary (QB). The authors define an MD schema based on QB and map OLAP operations in this schema to SPARQL queries over the sources. In this approach, the Extensibility criterion is enabled through the use of RDF and QB to annotate the data sources. One drawback of this approach is that the annotation with QB is manually performed, which implies a bottleneck in the design of the transformation processes, affecting the Extensibility criterion. Another limitation of this approach comes in terms of efficiency, as it depends on the SPARQL processor, which has been criticized as inefficient for complex MD queries.
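To make this concrete, the sketch below (assuming rdflib; the dataset is hypothetical and only qb:Observation and qb:dataSet come from the actual QB vocabulary) publishes a few observations and expresses a roll-up as the kind of SPARQL aggregation such OLAP-to-SPARQL mappings generate:

    # Minimal QB-annotated data and an OLAP-style roll-up as a SPARQL aggregation.
    from rdflib import Graph

    g = Graph()
    g.parse(data="""
        @prefix qb: <http://purl.org/linked-data/cube#> .
        @prefix ex: <http://example.org/stats#> .
        ex:o1 a qb:Observation ; qb:dataSet ex:sales ; ex:refStore ex:store7 ; ex:amount 10 .
        ex:o2 a qb:Observation ; qb:dataSet ex:sales ; ex:refStore ex:store7 ; ex:amount 25 .
        ex:o3 a qb:Observation ; qb:dataSet ex:sales ; ex:refStore ex:store9 ; ex:amount 7 .
    """, format="turtle")

    rollup = """
        PREFIX qb: <http://purl.org/linked-data/cube#>
        PREFIX ex: <http://example.org/stats#>
        SELECT ?store (SUM(?a) AS ?total)
        WHERE { ?o a qb:Observation ; qb:dataSet ex:sales ;
                   ex:refStore ?store ; ex:amount ?a . }
        GROUP BY ?store
    """
    for row in g.query(rollup):
        print(row.store, row.total)   # ex:store7 -> 35, ex:store9 -> 7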
6 CHALLENGES

In this section, we summarize our findings and identify a list of challenges that require a fresh look in the future. We divide our discussion between the two areas of interest in this survey, namely schema design and data provisioning, but we also comment on whether SW technologies are ready to fully support the needs of next generation OLAP systems. Fig. 8 depicts the locality of these challenges (shown as numbers from 1 to 8) with respect to the various components of the ecosystem described in Fig. 1.
Fig. 8. Challenges in next generation OLAP

6.1 Schema Design Challenges

Most of the existing work for bringing together OLAP and SW technologies focuses on schema design issues. But our analysis on the work surveyed in Section 4 reveals that there are still unresolved challenges.

The common characteristic of the existing approaches is that these can deal with unstructured (or less structured) data, as soon as ontological mappings are provided. Getting these mappings is not trivial. In many cases, appropriate domain ontologies do not even exist. Thus, a first challenge involves the automatic (or semi-automatic) derivation of such mappings (#1 in Fig. 8). Research on other, but related, fields, like schema mappings and data exchange, has provided very useful results that have not been exploited yet by research on DW design (see [51]).

Another limitation going toward more Exploratory schema designs is formed by existing approaches that assume that the extracted data are materialized, which typically happens off-line and is based on complex transformations. As we discussed, the reality and modern trends require a departure from this practice.

Our survey also revealed that there is a lack of SW tools for fully supporting OLAP functionality. An important element of MD design is aggregation, as aggregation relationships are needed to roll up data. For automating DW design, we need to be able to reason on aggregations (#2 in Fig. 8). However, these are not part of the OWL standard and only preliminary theoretical results are available. In [52], the authors argue that calculating aggregates under an open-world assumption is hard. As a step toward tackling this challenge, [53] shows that part-whole relationships can be represented in DL (despite that some of their properties cannot be modeled) and [54] explains how some kinds of aggregates can actually be managed using Datalog.
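The flavor of such Datalog-managed aggregates can be conveyed with a toy sketch (hypothetical facts); the grouping below mirrors an aggregate rule such as total(Country, $sum(Amount)) :- sales(Store, Country, Amount) in SociaLite-like syntax [54]:

    # Toy roll-up aggregate in the spirit of Datalog extensions like SociaLite:
    # total(Country, $sum(Amount)) :- sales(Store, Country, Amount).
    from collections import defaultdict

    sales = [("store7", "Spain", 10), ("store9", "Spain", 7), ("store3", "Denmark", 25)]
    total = defaultdict(int)
    for _store, country, amount in sales:
        total[country] += amount
    print(dict(total))  # -> {'Spain': 17, 'Denmark': 25}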
Recent work has dealt with the problem of ontology evolution and versioning in the SW scenario, even though it is in its infancy (see [55]). Future work should investigate how this kind of tools can be integrated with traditional DW evolution like in [17] to address the temporal issues in Exploratory OLAP scenarios (#3 in Fig. 8). On the other hand, temporal analysis has not yet been approached by Exploratory OLAP work. However, one must bear in mind that temporalities usually lead to hard non-standard reasoning (see [26]).

6.2 Data Provisioning Challenges

As we presented in Section 5, there are preliminary efforts that use SW technologies to automate and to improve data provisioning. Our main conclusion is that SW technologies can be a powerful tool for data and semantic integration. Counterintuitively, reality shows that most approaches so far have dealt with the design problem, leaving the data integration part largely unexplored.

The mixture of SW and ETL technologies is still quite immature. Preliminary efforts proved that automating the ETL design is doable. But existing approaches solve the problem partially and mostly at the conceptual level. Example solutions proposed involve creating ETL designs based on an analysis of business requirements and providing solutions for loading semi-structured data. Although these efforts seem to be in the correct direction, additional work is needed to move to a more Exploratory level (#4 in Fig. 8).

A challenge that follows from our discussion on MD design issues involves the richness of integration operations that can be automated with SW technologies. Current practice considers ETL processes as a series of relational-style operations, but in practice ETL is much more complex. Dealing with non-relational operations, like user-defined functions, machine learning based computations, etc., is not straightforward. Dealing with complex data types like bag-semantics, handling nulls, etc., is not trivial either. Considering non-persistent data, like streaming data, events, etc., complicates the semantics involved too.

Referring back to our criteria, Freshness and Materialization are not even considered by existing approaches to semantic-aware integration. Future research needs to couple the power coming from the semantic richness offered by SW technologies to the physical design, characteristics, and performance of integration designs (#5 in Fig. 8). For example, the derivation of integration operations from the ontological mappings should take into account performance related issues (e.g., data selectivity or availability) to optimize the design created.

Despite it not being reflected in Table 2, another relevant characteristic is the integration model being followed. Clearly, Exploratory OLAP systems need to use LaV to be able to dynamically integrate data sources, without defining the global schema and mappings in advance (#6 in Fig. 8). However, the computational cost of answering queries in this kind of architecture (especially for highly expressive logics) is well known.
The more expressive a language is, the more expensive the reasoning (being easily undecidable). Thus, to dynamically add data sources, we have to pay higher computational costs, which easily becomes infeasible.

6.3 Semantic and Computational Challenges

Being Exploratory means automating the access not only to schema but also to data to some extent. This implies we have to be able to cope with semantics and reasoning at the instance level (#7 in Fig. 8). Therefore, most Exploratory approaches imply Hard Computation if Expressiveness is not drastically reduced. It is easier for any reasonably expressive ontology language to perform reasoning tasks on the terminology without asserted instances (combined complexity) than query answering over the asserted axioms (data complexity). The fact that work in ETL related to reasoning is also done at the schema level, without actually involving data, supports this assertion. Nevertheless, recent approaches such as OBDA (Ontology-Based Data Access, see [56]) open new directions. Thus, tractable DLs, such as DL-Lite, allow us to exploit query answering by querying the terminology.

Typically, Datalog has been used to reason on the instances of databases. It provides query answering (the model and data are somehow intertwined) in polynomial time under a closed-world assumption. Definitely, this would be the case for a DW supporting decision-making based only on the data it physically contains (i.e., it is a database and assumes a closed world). However, if a system has to be truly Exploratory, we cannot guarantee that the necessary information will always be available. It makes sense to think of decision-making in open-world scenarios (e.g., what-if analysis), considering also those data on the Internet that could be dynamically added to our analysis. Thus, more research is necessary on efficient reasoning algorithms under an open-world assumption, maybe constrained to specific types of schemas (e.g., MD).
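A compact sketch of this closed-world, instance-level style of reasoning is the fixpoint computation below (hypothetical facts), which evaluates the recursive Datalog rule rollsUpTo(X,Z) :- rollsUpTo(X,Y), rollsUpTo(Y,Z):

    # Naive fixpoint evaluation of a recursive Datalog rule over asserted facts:
    # rollsUpTo(X, Z) :- rollsUpTo(X, Y), rollsUpTo(Y, Z).
    facts = {("Barcelona", "Catalonia"), ("Catalonia", "Spain"), ("Spain", "EU")}
    closure = set(facts)
    while True:
        derived = {(x, z) for (x, y) in closure for (y2, z) in closure if y == y2}
        if derived <= closure:
            break
        closure |= derived
    print(("Barcelona", "EU") in closure)  # -> True, under the closed-world assumption

Under an open-world assumption, by contrast, the absence of a roll-up edge would not license any negative conclusion, which is exactly what makes the Exploratory setting harder.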
Also, there is a controversy related to the polynomial complexity of Datalog, since it is only achieved for standard query answering (i.e., against a fixed program), and failing to guarantee this restriction (which happens to be rather restrictive and not assumable in OLAP) it becomes EXPTIME-complete (see [25]). It is also argued that tractable DLs like the DL-Lite family provide query answering in LogSPACE (the same complexity as relational databases) and therefore, query answering in such families can be reduced to querying a relational database. This assumption is behind concepts such as OBDA. However, in practice, current OBDA tools have problems matching relational database response time.

Datalog and DL represent two different ways to deal with ontologies. In our opinion, however, both approaches cannot adequately cope with large amounts of instances and, although these techniques can facilitate integration, in general, we do not envision a pure logic-based approach. We need semantics and we need reasoning, but it looks like they are not compatible in the presence of a high data volume.

The column Computation in Table 2 describes the worst case. Thus, Computation and Expressiveness are usually correlated (High Expressiveness implies Hard Computation). Nevertheless, it is worth mentioning that in practice some of these approaches do not achieve the worst case complexity, presenting for most ontologies a Medium Complexity for fairly expressive languages. Moreover, some approaches may prefer to maintain Medium/High Expressiveness with easier Computation at the expense of incomplete (but sound) inferences, when complete reasoning is not required.

It is worth noting that no work is using Non-standard reasoning services (e.g., least common subsumer, pattern matching, etc.). Researchers prefer to use ad hoc algorithms on top of more mature Standard services. Thus, for example, Abelló and Romero [36], [37] incur higher Computation because of implementing ad hoc algorithms on top of the reasoner. We think that constraining or at least detecting some kinds of expressions (e.g., MD queries) in the reasoners could be a solution to the puzzle (i.e., similar to detecting star-join patterns in relational query optimizers). It would not (only) be a trade-off between language expressiveness and reasoning cost; the kind of queries allowed should also come into play. So far, we have described how OLAP can benefit from SW, but it is also worth researching how SW can benefit from OLAP, considering the latter as an enabling paradigm for making complex processing and reasoning more efficient.

From a different perspective, the computational problem could be tackled by increasing the computer power, e.g., by using a cluster of parallel machines (see [29]). In this sense, [57] addresses the challenge of distributed reasoning by using MapReduce. A drawback of this approach is the current structure of the SW. The existence of data niches hidden behind endpoints hinders the analysis of data. On the one hand, these conceal the contents of the different sites. On the other hand, data to be analyzed has to be moved from one site to another due to the limited query functionalities offered. The RDF Data Cube Vocabulary (a W3C recommendation in [49]) goes towards solving this, by offering decisional functionalities.

Finally, we should also question whether the management of the SW is the most adequate for decision support (#8 in Fig. 8). One of the main difficulties is integrating data, and the complete independence of repositories does not facilitate it. A more hierarchical organization (trading autonomy for consistency) would be more appropriate. At least, some registries showing the contents, availability, and data quality of each source are clearly needed.
7 CONCLUSIONS

This paper first classified OLAP schema design and data provisioning approaches that leverage SW technologies, based on the following criteria: Materialization, Transformations, Freshness, Structuredness, and Extensibility. It then analyzed the SW technologies according to the criteria of Reasoning, Computation, and Expressivity. The main conclusion was that SW technologies are indeed a promising approach for the new and challenging research area of Exploratory OLAP. The paper then identified a number of challenges for future research that must be met to fulfill this promise, related to schema design and data provisioning, as well as semantic and computational issues.

ACKNOWLEDGMENT

This work has been partly supported by the Spanish Ministerio de Ciencia e Innovación under projects TIN2011-24147 and TIN2011-24747.

REFERENCES

[1] S. Rizzi, A. Abelló, J. Lechtenbörger, and J. Trujillo, "Research in data warehouse modeling and design: dead or alive?" in DOLAP, 2006.
[2] W. Inmon, D. Strauss, and G. Neushloss, DW2.0. Morgan Kaufmann, 2008.
[3] A. Simitsis, K. Wilkinson, M. Castellanos, and U. Dayal, "Optimizing analytic data flows for multiple execution engines," in ACM SIGMOD Conf., 2012.
[4] A. Ghazal, T. Rabl, M. Hu, F. Raab, M. Poess, A. Crolotte, and H.-A. Jacobsen, "BigBench: towards an industry standard benchmark for big data analytics," in ACM SIGMOD, 2013.
[5] A. Abelló, J. Darmont, L. Etcheverry, M. Golfarelli, J.-N. Mazón, F. Naumann, T. B. Pedersen, S. Rizzi, J. Trujillo, P. Vassiliadis, and G. Vossen, "Fusion cubes: Towards self-service business intelligence," IJDWM, vol. 9, no. 2, 2013.
[6] M. Middelfart and T. B. Pedersen, "The meta-morphing model used in TARGIT BI Suite," in ER Workshops, 2011.
[7] R. Berlanga, O. Romero, A. Simitsis, V. Nebot, T. B. Pedersen, A. Abelló, and M. J. Aramburu, "Semantic web technologies for business intelligence," in Business Intelligence Applications and the Web: Models, Systems and Technologies. IGI Global, 2011.
[8] A. Y. Levy, "The information manifold approach to data integration," IEEE Intelligent Systems, vol. 13, 1998.
[9] D. Pedersen, J. Pedersen, and T. B. Pedersen, "Integrating XML data in the TARGIT OLAP system," in ICDE, 2004.
[10] V. Nebot and R. Berlanga, "Building data warehouses with semantic web data," DSS, vol. 52, no. 4, 2012.
[11] B. Kämpgen and A. Harth, "Transforming statistical linked data for use in OLAP systems," in I-SEMANTICS, ser. ACM Int. Conf. Proc., 2011.
[12] U. Dayal, M. Castellanos, A. Simitsis, and K. Wilkinson, "Data integration flows for business intelligence," in EDBT, 2009.
[13] P. Vassiliadis and A. Simitsis, "Near real time ETL," in New Trends in Data Warehousing and Data Analysis, 2009.
[14] R. Kimball and M. Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. J. Wiley & Sons, 2002.
[15] P. Vassiliadis, M. Bouzeghoub, and C. Quix, "Towards quality-oriented data warehouse usage and evolution," Information Systems, vol. 25, no. 2, 2000.
[16] M. Golfarelli, J. Lechtenbörger, S. Rizzi, and G. Vossen, "Schema versioning in data warehouses: Enabling cross-version querying via schema augmentation," DKE, vol. 59, no. 2, 2006.
[17] M. Golfarelli and S. Rizzi, "A survey on temporal data warehousing," IJDWM, vol. 5, no. 1, 2009.
[18] M. Castellanos, U. Dayal, and M. Hsu, "Live business intelligence for the real-time enterprise," in From Active Data Management to Event-Based Systems and More, 2010.
[19] M. Essaidi, "ODBIS: towards a platform for on-demand business intelligence services," in EDBT/ICDT Workshops, 2010.
[20] H. Berthold, P. Rösch, S. Zöller, F. Wortmann, A. Carenini, S. Campbell, P. Bisson, and F. Strohmaier, "An architecture for ad-hoc and collaborative business intelligence," in EDBT/ICDT Workshops, 2010.
[21] J.-N. Mazón, J. J. Zubcoff, I. Garrigós, R. Espinosa, and R. Rodríguez, "Open business intelligence: on the importance of data quality awareness in user-friendly data mining," in EDBT/ICDT Workshops, 2012.
[22] V. Markl, "Situational business intelligence," in BIRTE, 2008.
[23] F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi, and P. F. Patel-Schneider, Eds., The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge Univ. Press, 2003.
[24] S. Ceri, G. Gottlob, and L. Tanca, "What you always wanted to know about Datalog (and never dared to ask)," IEEE TKDE, vol. 1, no. 1, 1989.
[25] P. F. Patel-Schneider and I. Horrocks, "Position paper: a comparison of two modelling paradigms in the semantic web," in WWW, 2006.
[26] B. Motik, "Representing and querying validity time in RDF and OWL: A logic-based approach," J. Web Sem., vol. 12, 2012.
[27] A. Artale, D. Calvanese, R. Kontchakov, and M. Zakharyaschev, "The DL-Lite family and relations," J. of AI Research, vol. 36, 2009.
[28] O. Romero, D. Calvanese, A. Abelló, and M. Rodriguez-Muro, "Discovering functional dependencies for multidimensional design," in DOLAP. ACM, 2009.
[29] T. Niemi, S. Toivonen, M. Niinimäki, and J. Nummenmaa, "Ontologies with semantic web/grid in data integration for OLAP," Int. J. Semantic Web Inf. Syst., vol. 3, no. 4, 2007.
[30] M. Niinimäki and T. Niemi, "An ETL process for OLAP using RDF/OWL ontologies," JoDS, vol. 5530, 2009.
[31] T. Priebe, A. Reisser, and D. T. A. Hoang, "Reinventing the wheel?! Why harmonization and reuse fail in complex data warehouse environments and a proposed solution to the problem," in Tagung Wirtschaftsinformatik, 2011.
[32] A. Bakhtouchi, L. Bellatreche, S. Jean, and Y. Aitameur, "MIRSOFT: mediator for integrating and reconciling sources using ontological functional dependencies," Int. J. of Web and Grid Services, vol. 8, no. 1, 2012.
[33] N. Prat, I. Megdiche, and J. Akoka, "Multidimensional models meet the semantic web: Defining and reasoning on OWL-DL ontologies for OLAP," in DOLAP. ACM, 2012.
[34] B. Neumayr, S. Anderlik, and M. Schrefl, "Towards ontology-based OLAP: Datalog-based reasoning over multidimensional ontologies," in DOLAP. ACM, 2012.
[35] S. Anderlik, B. Neumayr, and M. Schrefl, "Using domain ontologies as semantic dimensions in data warehouses," in ER, ser. LNCS, vol. 7532. Springer, 2012.
[36] A. Abelló and O. Romero, "Ontology driven search of compound IDs," KAIS, vol. 32, no. 1, 2012.
[37] O. Romero and A. Abelló, "A framework for multidimensional design of data warehouses from ontologies," DKE, vol. 69, no. 11, 2010.
[38] O. Romero, A. Simitsis, and A. Abelló, "GEM: Requirement-driven generation of ETL and multidimensional conceptual designs," in DaWaK, ser. LNCS, vol. 6862. Springer, 2011.
[39] V. Nebot, R. Berlanga, J. M. Pérez-Martínez, M. J. Aramburu, and T. B. Pedersen, "Multidimensional integrated ontologies: A framework for designing semantic data warehouses," JoDS, vol. 5530, 2009.
[40] S. Khouri, I. Boukhari, L. Bellatreche, E. Sardet, S. Jean, and M. Baron, "Ontology-based structured web data warehouses for sustainable interoperability: requirement modeling, design methodology and tool," Computers in Industry, vol. 63, 2012.
[41] B. Kämpgen, S. O'Riain, and A. Harth, "Interacting with statistical linked data via OLAP operations," in Int. Workshop on Interacting with Linked Data (ILD 2012), Extended Semantic Web Conf. (ESWC). CEUR-WS.org, 2012.
[42] D. Skoutas and A. Simitsis, "Ontology-based conceptual design of ETL processes for both structured and semi-structured data," Int. J. Semantic Web Inf. Syst., vol. 3, no. 4, 2007.
[43] A. Simitsis, D. Skoutas, and M. Castellanos, "Representation of conceptual ETL designs in natural language using Semantic Web technology," DKE, vol. 69, no. 1, 2010.
[44] D. Skoutas, A. Simitsis, and T. K. Sellis, "Ontology-driven conceptual design of ETL processes using graph transformations," JoDS, vol. 13, 2009.
[45] J.-N. Mazón, J. Lechtenbörger, and J. Trujillo, "A survey on summarizability issues in multidimensional modeling," DKE, vol. 68, no. 12, 2009.
[46] J. M. Pérez, R. Berlanga, M. J. Aramburu, and T. B. Pedersen, "Integrating data warehouses with web data: A survey," IEEE TKDE, vol. 20, no. 7, 2008.
[47] D. Calvanese, M. Lenzerini, and D. Nardi, "Description logics for conceptual data modeling," in Logics for Databases and Information Systems. Kluwer, 1998.
[48] M.-S. Hacid, P. Marcel, and C. Rigotti, "A rule-based data manipulation language for OLAP systems," in DOOD, ser. LNCS, vol. 1341. Springer, 1997.
[49] W3C, "The RDF Data Cube vocabulary, W3C Working Draft," http://www.w3.org/TR/vocab-data-cube, 2012.
[50] R. Dánger and R. Berlanga, "Analysis of ontological instances - a data warehouse for the semantic web," in ICSOFT (ISDM/EHST/DC). INSTICC Press, 2007.
[51] P. G. Kolaitis, "Schema mappings, data exchange, and metadata management," in PODS, 2005.
[52] D. Calvanese, E. Kharlamov, W. Nutt, and C. Thorne, "Aggregate queries over ontologies," in ONISW. ACM, 2008.
[53] U. Sattler, "Description logics for the representation of aggregated objects," in ECAI. IOS Press, 2000.
[54] J. Seo, S. Guo, and M. S. Lam, "SociaLite: Datalog extensions for efficient social network analysis," in ICDE, 2013.
[55] E. Jiménez-Ruiz, B. C. Grau, I. Horrocks, and R. B. Llavori, "Supporting concurrent ontology development: Framework, algorithms and tool," DKE, vol. 70, no. 1, 2011.
[56] A. Poggi, D. Lembo, D. Calvanese, G. D. Giacomo, M. Lenzerini, and R. Rosati, "Linking data to ontologies," JoDS, vol. 10, 2008.
[57] J. Urbani, S. Kotoulas, J. Maassen, F. van Harmelen, and H. E. Bal, "WebPIE: A Web-scale parallel inference engine using MapReduce," J. Web Sem., vol. 10, 2012.

Alberto Abelló is a tenure-track professor with a PhD in Informatics from UPC, and the local coordinator of the Erasmus Mundus PhD program IT4BI-DC. An active researcher with more than 50 peer-reviewed publications and an H-index of 18, his interests include data warehousing and OLAP, ontologies, NOSQL databases, and Big Data management. He has served as Program Chair of DOLAP and MEDI, and as a PC member of other database conferences like DaWaK, CIKM, and VLDB.

Oscar Romero is a tenure-track lecturer at UPC. He obtained his PhD in Computing in 2010. His research interests concern data modeling and storage for next generation data warehousing systems, focusing on distributed data management, management and optimization of data-intensive flows, semantic-aware information integration, and the development of user-centric functionalities for service-oriented BI.

Torben Bach Pedersen is a Professor of Computer Science at Aalborg University, Denmark. His research concerns business intelligence and big data, especially "Big Multidimensional Data": the integration and analysis of large amounts of complex and highly dynamic multidimensional data. He is an ACM Distinguished Scientist, an IEEE Senior Member, and a member of the Danish Academy of Technical Sciences.

Rafael Berlanga is Professor of Computer Science at Universitat Jaume I, Spain, and leader of the Temporal Knowledge Bases group. His current research concerns novel text mining methods for performing Business Intelligence over very large resources such as the Semantic Web and social media. He has published more than 25 contributions in high-impact international journals and more than 50 contributions in international conferences.

Victoria Nebot obtained her PhD in computer science from Universitat Jaume I, Spain, in 2013. She is currently a postdoctoral researcher in the Temporal Knowledge Bases Group (TKBG) at Universitat Jaume I. Her main research is focused on analyzing and exploiting semistructured and complex data derived mainly from the Semantic Web. She is also interested in text mining and information extraction methods for the Semantic Web.

María José Aramburu is associate professor of Computer Science at Universitat Jaume I. She received the PhD degree from the University of Birmingham (UK) in 1998. Author of articles in international journals such as Information Processing & Management and Decision Support Systems, as well as papers in international conferences such as ICDE, DEXA, and ECIR, her main research interests include knowledge repositories, decision support systems, and integration of information.

Alkis Simitsis is a senior research scientist in the Analytics Lab at HP Labs. His research career spans multi-objective optimization of hybrid analytics flows, real-time business intelligence, massively parallel processing, query processing/optimization, data warehouses, ETL, web services, and user-friendly query interfaces focusing on keyword search and NLP techniques. He has published more than 90 papers in refereed international journals and conferences in the above areas and has served in various roles in many program committees, including SIGMOD, VLDB, ICDE, and EDBT. He is a member of ACM, IEEE, and the Technical Chamber of Greece.