Data Modeling Overview
Abstract
In recent years we have witnessed tremendous growth in the volume and availability of data. This growth results primarily from the emergence of a multitude of sources (e.g. computers, mobile devices, sensors or social networks) that continuously produce structured, semi-structured or unstructured data. Database Management Systems and Data Warehouses are no longer the only technologies used to store and analyze datasets, namely because the volume and complex structure of today's data degrade their performance and scalability. Big Data is one of the recent challenges, since it implies new requirements in terms of data storage, processing and visualization. Nevertheless, properly analyzing Big Data can bring great advantages, because it allows the discovery of patterns and correlations in datasets. Users can use this processed information to gain deeper insights and business advantages. Thus, data modeling and data analytics have evolved so that huge amounts of data can be processed without compromising performance and availability, namely by "relaxing" the usual ACID properties. This paper provides a broad view and discussion of the current state of this subject, with a particular focus on data modeling and data analytics, describing and clarifying the main differences between the three main approaches regarding these aspects, namely: operational databases, decision support databases and Big Data technologies.
Keywords
Data Modeling, Data Analytics, Modeling Language, Big Data
How to cite this paper: Ribeiro, A., Silva, A. and da Silva, A.R. (2015) Data Modeling and Data Analytics: A Survey from a Big Data Perspective. Journal of Software Engineering and Applications, 8, 617-634. http://dx.doi.org/10.4236/jsea.2015.812058

1. Introduction
We have been witnessing an exponential growth in the volume of data produced and stored. This can be explained by the evolution of technology, which has resulted in the proliferation of data with different formats, from the most varied domains (e.g. health care, banking, government or logistics) and sources (e.g. sensors, social networks or mobile devices). We have witnessed a paradigm shift from simple books to sophisticated databases that
keep being populated every second at an immensely fast rate. The Internet and social media also contribute greatly to this situation [1]. Facebook, for example, has an average of 4.75 billion pieces of content shared among friends every day [2]. Traditional Relational Database Management Systems (RDBMSs) and Data Warehouses (DWs) were designed to handle a certain amount of data, typically structured, which is completely different from the reality we face nowadays. Businesses are generating enormous quantities of data that are too big to be processed and analyzed by traditional RDBMS and DW technologies, which struggle to meet their performance and scalability requirements.
Therefore, in recent years, a new approach that aims to mitigate these limitations has emerged. Companies like Facebook, Google, Yahoo and Amazon were the pioneers in creating solutions to deal with these "Big Data" scenarios, namely by resorting to technologies like Hadoop [3] [4] and MapReduce [5]. Big Data is a generic term used to refer to massive and complex datasets, which are made of a variety of data structures (structured, semi-structured and unstructured data) from a multitude of sources [6]. Big Data can be characterized by three Vs: volume (amount of data), velocity (speed of data in and out) and variety (kinds of data types and sources) [7]. Other Vs, such as variability, veracity and value, have since been added [8].
Adopting Big Data-based technologies not only mitigates the problems presented above, but also opens new perspectives that allow extracting value from Big Data. Big Data-based technologies are being applied with success in multiple scenarios [1] [9] [10], such as: (1) e-commerce and marketing, where counting the clicks that crowds make on the web allows identifying trends that improve campaigns, and evaluating a user's personal profile so that the content shown is the one they will most likely enjoy; (2) government and public health, allowing the detection and tracking of disease outbreaks via social media, or the detection of frauds; (3) transportation, industry and surveillance, with real-time improved estimated times of arrival and smart use of resources.
This paper provides a broad view of the current state of this area based on two dimensions or perspectives:
Data Modeling and Data Analytics. Table 1 summarizes the focus of this paper, namely by identifying three
representative approaches considered to explain the evolution of Data Modeling and Data Analytics. These ap-
proaches are: Operational databases, Decision Support databases and Big Data technologies.
This research work has been conducted in the scope of the DataStorm project [11], led by our research group, which focuses on addressing the current problems in the design, implementation and operation of Big Data-based applications. More specifically, the goal of our team in this project is to identify the main concepts and patterns that characterize such applications, in order to define and apply suitable domain-specific languages (DSLs). These DSLs will then be used in a Model-Driven Engineering (MDE) [12]-[14] approach aiming to ease the design, implementation and operation of such data-intensive applications.
To ease the explanation and better support the discussion throughout the paper, we use a very simple case study based on a fictitious academic management system, described below:
The Academic Management System (AMS) should support two types of end-users: students and professors. Each person has a name, gender, date of birth, ID card, place of origin and country. Students are enrolled in a given academic program, which is composed of many courses. Professors have an academic degree, are associated with a given department and lecture one or more courses. Each course has a name and an academic term, and can have one or more locations and academic programs associated. Additionally, a course is associated with a schedule composed of many class periods determining its duration and the day it occurs.
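As a quick illustration, the entities of this case study can be sketched in code. The following Python fragment is only illustrative; every class and field name is our own assumption rather than part of the original case-study text.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical in-memory sketch of the AMS domain entities described above.
@dataclass
class Person:
    name: str
    gender: str
    date_of_birth: str
    id_card: str
    place_of_origin: str
    country: str

@dataclass
class Course:
    name: str
    academic_term: str
    locations: List[str] = field(default_factory=list)

@dataclass
class Student(Person):
    academic_program: str = ""

@dataclass
class Professor(Person):
    academic_degree: str = ""
    department: str = ""
    courses: List[Course] = field(default_factory=list)

alice = Student("Alice", "F", "1996-05-01", "ID-123", "Lisbon", "Portugal",
                academic_program="Computer Science")
```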
The outline of this paper is as follows: Section 2 describes Data Modeling and some representative types of data models used in operational databases, decision support databases and Big Data technologies. Section 3 details the type of operations performed in terms of Data Analytics for these three approaches. Section 4 compares and discusses each approach in terms of the Data Modeling and Data Analytics perspectives. Section 5 discusses our research in comparison with the related work. Finally, Section 6 concludes the paper by summarizing its key points and identifying future work.
2. Data Modeling
This section gives an in-depth look at the most popular data models used to define and support Operational Databases, Data Warehouses and Big Data technologies.
Table 1. Focus of this paper.

Perspective | Operational databases | Decision support databases | Big Data technologies
Data Analytics | OLTP | OLAP | Multiple classes (batch-oriented processing, stream-processing, OLTP and interactive ad-hoc queries)
Databases are widely used for both personal and enterprise purposes, namely due to their strong ACID (atomicity, consistency, isolation and durability) guarantees and the maturity level of the Database Management Systems (DBMSs) that support them [15].
The data modeling process may involve the definition of three data models (or schemas) defined at different
abstraction levels, namely Conceptual, Logical and Physical data models [15] [16]. Figure 1 shows part of the
three data models for the AMS case study. All these models define three entities (Person, Student and Professor)
and their main relationships (teach and supervise associations).
Conceptual Data Model. A conceptual data model is used to define, at a very high and platform-independent level of abstraction, the entities or concepts which represent the data of the problem domain, and their relationships. It leaves further details about the entities (such as their attributes, types or primary keys) for the next steps. This model is typically used to explore domain concepts with the stakeholders and can be omitted or used instead of the logical data model.
Logical Data Model. A logical data model is a refinement of the previous conceptual model. It details the domain entities and their relationships, while still standing at a platform-independent level. It depicts all the attributes that characterize each entity (possibly also including its unique identifier, the primary key) and all the relationships between the entities (possibly including the keys identifying those relationships, the foreign keys). Despite being independent of any DBMS, this model can easily be mapped onto a physical data model thanks to the details it provides.
Physical Data Model. A physical data model visually represents the structure of the data as implemented by a given class of DBMS. Therefore, entities are represented as tables, attributes are represented as table columns and have a given data type that can vary according to the chosen DBMS, and the relationships between tables are identified through foreign keys. Unlike the previous models, this model tends to be platform-specific, because it reflects the database schema and, consequently, some platform-specific aspects (e.g. database-specific data types or query language extensions).
Summarizing, complexity and detail increase from the conceptual to the physical data model. First, it is important to perceive, at a higher level of abstraction, the data entities and their relationships, using a Conceptual Data Model. Then, the focus is on detailing those entities, without worrying about implementation details, using a Logical Data Model. Finally, a Physical Data Model represents how the data is supported by a given DBMS [15] [16].
Figure 1. Example of three data models (at different abstraction levels) for the Academic Management System.

Relational Model. In the Relational Model, data is organized as a collection of tables, where each row of a table (known as a tuple) corresponds to a single element of the represented domain entity. In the Relational
Model each row is unique and therefore a table has an attribute or set of attributes known as primary key, used
to univocally identify those rows. Tables are related with each other by sharing one or more common attributes.
These attributes correspond to a primary key in the referenced (parent) table and are known as foreign keys in
the referencing (child) table. In one-to-many relationships, the referenced table corresponds to the entity on the "one" side of the relationship and the referencing table corresponds to the entity on the "many" side. In many-to-many relationships, an additional association table is used, which associates the entities involved through their respective primary keys. The Relational Model also features the concept of View, which is like a table whose rows are not explicitly stored in the database, but are computed as needed from the view's definition: a query on one or more base tables or other views [17].
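As an illustration of the View concept, the following Python sketch (using SQLite; the table, column and view names are hypothetical) shows that a view's rows are computed on demand from its defining query rather than stored:

```python
import sqlite3

# In-memory database with a hypothetical Student table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany("INSERT INTO Student (name, country) VALUES (?, ?)",
                 [("Alice", "Portugal"), ("Bob", "France")])

# The view stores no rows of its own; its rows are derived from the query below.
conn.execute("CREATE VIEW FrenchStudents AS "
             "SELECT name FROM Student WHERE country = 'France'")
rows = conn.execute("SELECT name FROM FrenchStudents").fetchall()
print(rows)  # [('Bob',)]
```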
Entity-Relationship (ER) Model. The Entity-Relationship (ER) Model [20], proposed by Chen in 1976, appeared as an alternative to the Relational Model in order to provide more expressiveness and semantics in the database design from the user's point of view. The ER model is a semantic data model, i.e. it aims to represent the meaning of the data involved in some specific domain. This model was originally defined by three main concepts: entities, relationships and attributes. An entity corresponds to an object in the real world that is distinguishable from all other objects and is characterized by a set of attributes. Each attribute has a range of possible values, known as its domain, and each entity has its own value for each attribute. Similarly to the Relational Model, the set of attributes that identify an entity is known as its primary key.
Entities can be thought of as nouns and correspond to the tables of the Relational Model. In turn, a relationship is an association established among two or more entities. A relationship can be thought of as a verb and includes the roles of each participating entity, with multiplicity constraints and their cardinality. For instance, a relationship can be one-to-one (1:1), one-to-many (1:M) or many-to-many (M:N). In an ER diagram, entities are usually represented as rectangles, attributes as circles connected to entities or relationships by a line, and relationships as diamonds connected to the intervening entities by lines.
The Enhanced ER Model [21] provided additional concepts to represent more complex requirements, such as
generalization, specialization, aggregation and composition. Other popular variants of ER diagram notations are
Crow’s foot, Bachman, Barker’s, IDEF1X and UML Profile for Data Modeling [22].
Figure 2. Example of two star schema models for the Academic Management System.
Such growth results largely from the increase in the number of sources (e.g. users, systems or sensors) that are continuously producing data. These data sources produce huge amounts of data with variable representations, which often makes their management by traditional RDBMSs and DWs impracticable. Therefore, there is a need to devise new data models and technologies that can handle such Big Data.
NoSQL (Not Only SQL) [26] is one of the most popular approaches to deal with this problem. It consists of a group of non-relational DBMSs that consequently do not represent databases using tables and usually do not use SQL for data manipulation. NoSQL systems allow managing and storing large-scale denormalized datasets, and are designed to scale horizontally. They achieve this by compromising consistency in favor of availability and partition tolerance, following Brewer's CAP theorem [27]. Therefore, NoSQL systems are "eventually consistent", i.e. they assume that writes are eventually propagated to all nodes over time, but offer limited guarantees that different users will read the same value at the same time. NoSQL systems provide BASE guarantees (Basically Available, Soft state and Eventually consistent) instead of the traditional ACID guarantees, in order to greatly improve performance and scalability [28].
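The following toy Python simulation (all names are invented for illustration; no real NoSQL system is involved) shows what eventual consistency means in practice: a reader may briefly observe stale data until a write has been propagated to every replica.

```python
# Toy illustration of eventual consistency: a write reaches replicas with a
# delay, so two readers may briefly observe different values.
class Replica:
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    def __init__(self, n):
        self.replicas = [Replica() for _ in range(n)]
        self.pending = []  # writes not yet propagated to all replicas

    def write(self, key, value):
        # The write is applied to one replica immediately...
        self.replicas[0].data[key] = value
        # ...and queued for the others (asynchronous replication).
        self.pending.append((key, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].data.get(key)

    def propagate(self):
        # Anti-entropy step: eventually every replica sees every write.
        for key, value in self.pending:
            for r in self.replicas:
                r.data[key] = value
        self.pending.clear()

store = EventuallyConsistentStore(3)
store.write("student:1", "Alice")
stale = store.read("student:1", 2)   # None: replica 2 not yet updated
store.propagate()
fresh = store.read("student:1", 2)   # "Alice" after convergence
```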
NoSQL databases can be classified into four categories [29]: (1) Key-value stores, (2) Document-oriented databases, (3) Wide-column stores, and (4) Graph databases.
Key-value Stores. A Key-value store represents data as a collection (known as a dictionary or map) of key-value pairs. Every key consists of a unique alphanumeric identifier that works like an index, which is used to access a corresponding value. Values can be simple text strings or more complex structures like arrays. The Key-value model can be extended to an ordered model whose keys are stored in lexicographical order. Being such a simple data model makes Key-value stores ideally suited to retrieve information in a very fast, available and scalable way. For instance, Amazon makes extensive use of a Key-value store system, named Dynamo, to manage the products in its shopping cart [30]. Amazon's Dynamo and Voldemort, which is used by LinkedIn, are two examples of systems that successfully apply this data model. An example of a key-value store for both students and professors of the Academic Management System is shown in Figure 4.
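In the spirit of Figure 4 (whose content is not reproduced here), a key-value store for students and professors can be sketched as follows; the keys and value layouts are our own assumptions. Note that values are opaque to the store: lookups happen by key only.

```python
import json

# Minimal key-value store sketch; a real system (e.g. Dynamo, Voldemort)
# would distribute and replicate these pairs across nodes.
kv_store = {}

def put(key, value):
    kv_store[key] = json.dumps(value)  # values are opaque serialized blobs

def get(key):
    return json.loads(kv_store[key])   # access is by unique key only

put("student:1", {"name": "Alice", "program": "Computer Science"})
put("professor:1", {"name": "Bob", "department": "Informatics"})

print(get("student:1")["name"])  # Alice
```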
Document-oriented Databases. Document-oriented databases (or document stores) were originally created to store traditional documents, like a notepad text file or a Microsoft Word document. However, their concept of document goes beyond that: a document can be any kind of domain object [26]. Documents contain encoded data in a standard format like XML, YAML, JSON or BSON (Binary JSON) and are univocally identified in the database by a unique key. Documents contain semi-structured data represented as name-value pairs, which can vary from row to row and can nest other documents. Unlike key-value stores, these systems support secondary indexes and allow full searches by either keys or values. Document databases are well suited for storing and managing huge collections of textual documents (e.g. text files or email messages), as well as semi-structured or denormalized data that would require an extensive use of "nulls" in an RDBMS [30]. MongoDB and CouchDB are two of the most popular Document-oriented database systems. Figure 5 illustrates two collections of documents for both students and professors of the Academic Management System.
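In the spirit of Figure 5 (not reproduced here), a document collection for students can be sketched as follows. The field names are assumptions; a real document store such as MongoDB would hold these as BSON documents and could answer the final query through a secondary index.

```python
# Sketch of a document collection: documents are semi-structured, may nest
# other documents, and need not share the same schema.
students = [
    {"_id": 1, "name": "Alice", "country": "France",
     "courses": [{"name": "Databases", "term": "Fall 2015"}]},  # nested document
    {"_id": 2, "name": "Carol", "country": "Japan"},  # fields vary per document
]

# Unlike a key-value store, documents can be searched by any field,
# not just by their unique key.
from_japan = [d["name"] for d in students if d.get("country") == "Japan"]
print(from_japan)  # ['Carol']
```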
Wide-column Stores. Wide-column stores (also known as column-family stores, extensible record stores or column-oriented databases) represent and manage data as sections of columns rather than rows (as in an RDBMS). Each section is composed of key-value pairs, where the keys are rows and the values are sets of columns, known as column families. Each row is identified by a primary key and can have column families different from those of the other rows. Each column family also acts as a primary key for the set of columns it contains. In turn, each column of a column family consists of a name-value pair. Column families can even be grouped into super column families [29]. This data model was highly inspired by Google's BigTable [31]. Wide-column stores are suited for scenarios like: (1) distributed data storage; (2) large-scale and batch-oriented data processing, using the famous MapReduce method for tasks like sorting, parsing, querying or conversion; and (3) exploratory and predictive analytics. Cassandra and Hadoop HBase are two popular frameworks of such data management systems [29]. Figure 6 depicts an example of a wide-column store for the entity "person" of the Academic Management System.
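In the spirit of Figure 6 (not reproduced here), the wide-column layout for the "person" entity can be sketched as nested maps: row key, then column family, then column name. All row keys, family names and column names below are our own assumptions.

```python
# Wide-column sketch: row key -> column family -> column name -> value.
# Different rows may carry different column families.
person_table = {
    "person:1": {                       # row key
        "identification": {             # column family
            "name": "Alice",
            "country": "Portugal",
        },
        "student": {                    # family present only on student rows
            "program": "Computer Science",
        },
    },
    "person:2": {
        "identification": {"name": "Bob", "country": "France"},
        "professor": {"department": "Informatics"},
    },
}

name = person_table["person:1"]["identification"]["name"]
print(name)  # Alice
```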
Graph Databases. Graph databases represent data as a network of nodes (representing the domain entities) that are connected by edges (representing the relationships among them) and are characterized by properties expressed as key-value pairs. Graph databases are quite useful when the focus is on exploring the relationships within the data, such as traversing social networks, detecting patterns or inferring recommendations. Due to their visual representation, they are more user-friendly than the aforementioned types of NoSQL databases. Neo4j and AllegroGraph are two examples of such systems.
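A property graph for the case study can be sketched as follows (all node and edge names are invented for illustration). The traversal at the end shows the kind of relationship-centric query these systems excel at.

```python
# Property-graph sketch: nodes with key-value properties, labeled edges.
nodes = {
    "p1": {"label": "Professor", "name": "Bob"},
    "s1": {"label": "Student", "name": "Alice"},
    "c1": {"label": "Course", "name": "Databases"},
}
edges = [
    ("p1", "teaches", "c1"),
    ("s1", "enrolled_in", "c1"),
]

# Traversal: which students attend a course taught by professor "Bob"?
taught = {dst for src, rel, dst in edges
          if rel == "teaches" and nodes[src]["name"] == "Bob"}
enrolled = [nodes[src]["name"] for src, rel, dst in edges
            if rel == "enrolled_in" and dst in taught]
print(enrolled)  # ['Alice']
```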
3. Data Analytics
This section presents and discusses the types of operations that can be performed over the data models described
in the previous section and also establishes comparisons between them. A complementary discussion is provided
in Section 4.
SQL (Structured Query Language) is the standard language of RDBMSs and comprises two main sub-languages: SQL-DDL (Data Definition Language) and SQL-DML (Data Manipulation Language) [15]. SQL-DDL enables the definition of the various database objects. First, it allows managing schemas, which are named collections of all the database objects that are related to one another. Then, inside a schema, it is possible to manage tables, specifying their columns and types, primary keys, foreign keys and constraints. It is also possible to manage views, domains and indexes. An index is a structure that speeds up the process of accessing one or more columns of a given table, possibly improving the performance of queries [15] [16].
For example, considering the Academic Management System, a system manager could create a table for storing the information of a student by executing the following SQL-DDL command:
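The original listing is not reproduced here; the following is a hypothetical reconstruction of such a command, exercised through Python's sqlite3 module, with table and column names assumed from the case-study description.

```python
import sqlite3

# Hypothetical reconstruction of the CREATE TABLE command; names and types
# are assumptions, not the paper's original listing.
ddl = """
CREATE TABLE Student (
    id            INTEGER PRIMARY KEY,
    name          VARCHAR(255) NOT NULL,
    gender        CHAR(1),
    date_of_birth DATE,
    id_card       VARCHAR(20) UNIQUE,
    origin        VARCHAR(100),
    country       VARCHAR(100)
)
"""
conn = sqlite3.connect(":memory:")
conn.execute(ddl)

# Inspect the created schema: column names in definition order.
columns = [row[1] for row in conn.execute("PRAGMA table_info(Student)")]
print(columns)
```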
On the other hand, SQL-DML is the language that enables manipulating database objects and, in particular, extracting valuable information from the database. The most commonly used and most complex operation is SELECT, which allows users to query data from the various tables of a database. It is a powerful operation because it is capable of performing, in a single query, the equivalent of the relational algebra's selection, projection and join operations. The SELECT operation returns a table with the results. With the SELECT operation it is simultaneously possible to: define which tables the user wants to query (through the FROM clause), which rows satisfy a particular condition (through the WHERE clause), and which columns should appear in the result (through the SELECT clause); order the result (in ascending or descending order) by one or more columns (through the ORDER BY clause); group rows with the same column values (through the GROUP BY clause); and filter those groups based on some condition (through the HAVING clause). The SELECT operation also allows using aggregation functions, which perform arithmetic computation or aggregation of data (e.g. counting or summing the values of one or more columns).
Many times there is the need to combine columns of more than one table in the result. To do so, the user can use the JOIN operation in the query. This operation returns a subset of the Cartesian product of the involved tables, i.e. the row pairs where the matching columns in each table have the same value. The most common queries that use joins involve tables that have one-to-many relationships. If the user wants to include in the result the rows that did not satisfy the join condition, then he can use the outer join operations (left, right and full outer join). Besides specifying queries, SQL-DML allows modifying the data stored in a database. Namely, it allows adding new rows to a table (through the INSERT statement), modifying the content of a given table's rows (through the UPDATE statement) and deleting rows from a table (through the DELETE statement) [16].
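The join behaviour described above can be illustrated as follows (SQLite via Python; the schema and data are hypothetical): the inner join discards rows that do not satisfy the join condition, while the left outer join keeps them.

```python
import sqlite3

# One-to-many relationship: each program has many students.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Program (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Student (id INTEGER PRIMARY KEY, name TEXT,
                      program_id INTEGER REFERENCES Program(id));
INSERT INTO Program VALUES (1, 'Computer Science'), (2, 'Mathematics');
INSERT INTO Student VALUES (1, 'Alice', 1), (2, 'Bob', 1), (3, 'Carol', NULL);
""")

# Inner join: Carol (no program) does not satisfy the join condition.
inner = conn.execute("""
    SELECT s.name, p.name FROM Student s
    JOIN Program p ON s.program_id = p.id ORDER BY s.name
""").fetchall()

# Left outer join: rows that failed the condition are kept, padded with NULLs.
outer = conn.execute("""
    SELECT s.name, p.name FROM Student s
    LEFT JOIN Program p ON s.program_id = p.id ORDER BY s.name
""").fetchall()
print(inner)  # [('Alice', 'Computer Science'), ('Bob', 'Computer Science')]
print(outer)  # the same rows plus ('Carol', None)
```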
SQL-DML also allows combining the results of two or more queries into a single result table by applying the Union, Intersect and Except operations, based on Set Theory [15].
For example, considering the Academic Management System, a system manager could get a list of all students who come from G8 countries by entering the following SQL-DML query:
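The original listing is not reproduced here; the following is a hedged reconstruction (run through SQLite in Python) of a query with the described intent, consistent with the MongoDB equivalent shown later in Section 3. The table schema and sample data are assumptions.

```python
import sqlite3

# Hypothetical Student table with a few sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Student (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.executemany("INSERT INTO Student (name, country) VALUES (?, ?)",
                 [("Alice", "France"), ("Bob", "Brazil"), ("Carol", "Japan")])

# Selection (WHERE ... IN), projection (name, country) and ordering.
g8 = ("Canada", "France", "Germany", "Italy", "Japan", "Russia", "UK", "USA")
rows = conn.execute(
    "SELECT name, country FROM Student WHERE country IN ({}) ORDER BY country"
    .format(",".join("?" * len(g8))), g8).fetchall()
print(rows)  # [('Alice', 'France'), ('Carol', 'Japan')]
```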
Figure 7. Representation of cube operations for the Academic Management System: slice (top-left), dice (top-right), drill
up/down (bottom-left) and pivot (bottom-right).
A common dimension is a dimension associated with time. The most popular language for manipulating OLAP cubes is MDX (Multidimensional Expressions) [32], which is a query language for OLAP databases that supports all the operations mentioned above. MDX is exclusively used to analyze and read data, since it was not designed with data-modification operations (as in SQL-DML) in mind. The star schema and the OLAP cube are designed a priori with a specific purpose in mind and cannot accept queries that differ much from the ones they were designed to respond to. The benefit of this is that queries are much simpler and faster, and by using a cube it is even quicker to detect patterns, find trends and navigate around the data while "slicing and dicing" it [23] [25].
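The slice, dice and drill-up/down operations can be illustrated over a tiny in-memory fact set. The dimension and measure names below are our own assumptions, not the paper's cube design.

```python
# Toy OLAP illustration: facts with three dimensions and one measure.
facts = [
    {"program": "Computer Science", "year": 2014, "gender": "F", "enrollments": 40},
    {"program": "Computer Science", "year": 2015, "gender": "M", "enrollments": 55},
    {"program": "Mathematics",      "year": 2015, "gender": "F", "enrollments": 30},
]

def dice(facts, **fixed):
    # Slice/dice: keep only the cells matching the fixed dimension values.
    return [f for f in facts if all(f[d] == v for d, v in fixed.items())]

def roll_up(facts, dim):
    # Drill-up: aggregate the measure along one dimension.
    totals = {}
    for f in facts:
        totals[f[dim]] = totals.get(f[dim], 0) + f["enrollments"]
    return totals

year_2015 = dice(facts, year=2015)        # slice on the time dimension
by_program = roll_up(year_2015, "program")
print(by_program)  # {'Computer Science': 55, 'Mathematics': 30}
```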
Again considering the Academic Management System example, the following query represents an MDX SELECT statement. The SELECT clause sets the query axes as the name and the gender of the Student dimension and the year 2015 of the Date dimension. The FROM clause indicates the data source, here the Students cube, and the WHERE clause defines the slicer axis as the "Computer Science" value of the Academic Program dimension. This query returns the students (by name and gender) that have enrolled in Computer Science in the year 2015.
SELECT
  { [Student].[Name],
    [Student].[Gender] } ON COLUMNS,
  { [Date].[Academic Year].&[2015] } ON ROWS
FROM [Students Cube]
WHERE ( [Academic Program].[Name].&[Computer Science] )
OLTP, as we have seen before, is mainly used in traditional RDBMSs. However, these systems cannot assure an acceptable performance when the volume of data and requests is huge, as in Facebook or Twitter. Therefore, it was necessary to adopt NoSQL databases, which achieve very high performance in systems with such large loads. Systems like Cassandra4, HBase5 or MongoDB6 are effective solutions currently in use. All of them provide their own query languages with CRUD operations equivalent to the ones provided by SQL. For example, in Cassandra it is possible to create column families using CQL, in HBase it is possible to delete a column using Java, and in MongoDB to insert a document into a collection using JavaScript. Below is a query in JavaScript for a MongoDB database, equivalent to the SQL-DML query presented previously.
db.students.find(
    { country: { $in: ["Canada", "France", "Germany", "Italy", "Japan", "Russia", "UK", "USA"] } },
    { name: 1, country: 1 }
).sort({ country: 1 })
Finally, interactive ad-hoc queries and analysis is a paradigm that allows querying different large-scale data sources and query interfaces with very low latency. This type of system argues that queries should not need more than a few seconds to execute, even at Big Data scale, so that users are able to react to changes if needed. The most popular of these systems is Drill7. Drill works as a query layer that transforms a query written in a human-readable syntax (e.g. SQL) into a logical plan (a query written in a platform-independent way). Then, the logical plan is transformed into a physical plan (a query written in a platform-specific way) that is executed on the desired data sources (e.g. Cassandra, HBase or MongoDB) [35].
4. Discussion
In this section we compare and discuss the approaches presented in the previous sections in terms of the two
perspectives that guide this survey: Data Modeling and Data Analytics. Each perspective defines a set of fea-
tures used to compare Operational Databases, DWs and Big Data approaches among themselves.
Regarding the Data Modeling Perspective, Table 2 considers the following features of analysis: (1) the data
model; (2) the abstraction level in which the data model resides, according to the abstraction levels (Conceptual,
Logical and Physical) of the database design process; (3) the concepts or constructs that compose the data model;
4 http://cassandra.apache.org
5 https://hbase.apache.org
6 https://www.mongodb.org
7 https://drill.apache.org
Table 2. Comparison of the approaches from the Data Modeling perspective.

Data model | Abstraction level | Concepts | Concrete languages | Modeling tools | Database tools

Operational:
- Entity-Relationship Model | Conceptual, Logical | Entity, Relationship, Attribute, Primary Key, Foreign Key | Chen's, Crow's foot, Bachman's, Barker's, IDEF1X | Sparx Enterprise Architect, Visual Paradigm, Oracle Designer, MySQL Workbench, ER/Studio | -
- Relational Model | Logical, Physical | Table, Row, Attribute, Primary Key, Foreign Key, View, Index | SQL-DDL, UML Data Profile | Sparx Enterprise Architect, Visual Paradigm, Oracle Designer, MySQL Workbench, ER/Studio | Microsoft SQL Server, Oracle, MySQL, PostgreSQL, IBM DB2

Decision Support:
- OLAP Cube | Conceptual, Logical | Dimensions, Levels, Cube faces, Time dimension, Local dimension | Common Warehouse Metamodel | Essbase Studio Tool, Enterprise Architect, Visual Paradigm | Oracle Warehouse Builder, Essbase Studio Tool, Microsoft Analysis Services
- Star Schema | Logical, Physical | Fact table, Attributes table, Dimensions, Foreign Key | SQL-DDL, DML, UML Data Model Profile, UML Profile for Data Modeling | Enterprise Architect, Visual Paradigm, Oracle SQL Data Modeler | Microsoft SQL Server, Oracle, MySQL, PostgreSQL, IBM DB2

Big Data:
- Key-Value | Logical, Physical | Key, Value | SQL-DDL, Dynamo Query Language | - | Dynamo, Voldemort
- Document | Logical, Physical | Document, Primary Key | SQL-DDL, JavaScript | - | MongoDB, CouchDB
- Wide-Column | Logical, Physical | Keyspace, Table, Column, Column Family, Super Column, Primary Key, Index | CQL, Groovy | - | Cassandra, HBase
(4) the concrete languages used to produce the data models and that apply the previous concepts; (5) the modeling tools that allow specifying diagrams using those languages; and (6) the database tools that support the data model. Table 2 presents the values of each feature for each approach. It is possible to verify that the majority of the data models are at the logical and physical levels, with the exception of the ER model and the OLAP cube model, which are more abstract and defined at the conceptual and logical levels. It is also possible to verify that Big Data has more data models than the other approaches, which can explain the amount of work and proposals conducted over the last years, as well as the absence of a de facto data model. In terms of concepts, again, Big Data-related data models have a greater variety of concepts than the other approaches, ranging from key-value pairs or documents to nodes and edges. Concerning concrete languages, it can be concluded that every data model presented in this survey is supported by an SQL-DDL-like language. However, we found that only the operational databases and DWs have concrete languages to express their data models in a graphical way, like Chen's notation for the ER model, the UML Data Profile for the Relational model or CWM [36] for multidimensional DW models. Also related to that point, there are no modeling tools to express Big Data models. Thus, defining such a modeling language and a respective supporting tool for Big Data models constitutes an interesting research direction that addresses this gap. At last, all approaches have database tools that support development based on their
data models, with the exception of the ER model that is not directly used by DBMSs.
On the other hand, in terms of the Data Analytics Perspective, Table 3 considers six features of analysis: (1) the class of application domains, which characterizes the approach's suitability; (2) the common operations used in the approach, which can be reads and/or writes; (3) the operation types most typically used in the approach; (4) the concrete languages used to specify those operations; (5) the abstraction level of these concrete languages (Conceptual, Logical and Physical); and (6) the technology support of these languages and operations.
Table 3 shows that Big Data is used in more classes of application domains than operational databases and DWs, which are used for OLTP and OLAP domains, respectively. It is also possible to observe that operational databases are commonly used for reads and writes of small operations (using transactions), because they need to handle fresh and critical data on a daily basis. On the other hand, DWs are mostly suited for read operations, since they perform analysis and data mining mostly over historical data. Big Data performs both reads and writes, but in a different way and at a different scale from the other approaches. Big Data applications are built to perform a huge amount of reads, and if a huge amount of writes is needed, as for OLTP, they sacrifice consistency (adopting eventual consistency) in order to achieve high availability and horizontal scalability. Operational databases support their data manipulation operations (e.g. select, insert or delete) using SQL-DML, which has slight variations according to the technology used. DWs also use SQL-DML through the select statement, because their operations (e.g. slice, dice or drill down/up) are mostly reads. DWs additionally use SQL-based languages, like MDX and XMLA (XML for Analysis) [37], for specifying their operations. On the other hand, regarding Big Data technologies, there is a great variety of languages to manipulate data according to the different classes of application domains. All of these languages provide operations equivalent to the ones offered by SQL-DML and add new constructs for supporting ETL, data stream processing (e.g. create stream, window) [34] and MapReduce operations. It is important to note that the concrete languages used in the different approaches reside at the logical and physical levels, because they are directly used by the supporting software tools.
5. Related Work
As mentioned in Section 1, the main goal of this paper is to present and discuss the concepts surrounding data
modeling and data analytics, and their evolution for three representative approaches: operational databases, de-
cision support databases and Big Data technologies. In our survey we have researched related works that also
explore and compare these approaches from the data modeling or data analytics point of view.
J.H. ter Bekke provides a comparative study of the Relational, Semantic, ER and Binary data models based on the results of an examination session [38]. In that session, participants had to create a model of a case study similar to the Academic Management System used in this paper. The purpose was to discover relationships between the modeling approach in use and the resulting quality. Therefore, this study addresses only the data modeling topic, and more specifically considers only data models associated with the database design process.
Several works focus on highlighting the differences between operational databases and data warehouses. For
example, R. Hou provides an analysis of operational databases and data warehouses, distinguishing them according to their underlying theory and technologies, and also establishing common points where combining both systems can bring benefits [39]. C. Thomsen and T.B. Pedersen compare open source ETL tools, OLAP clients
and servers, and DBMSs, in order to build a Business Intelligence (BI) solution [40].
P. Vassiliadis and T. Sellis conducted a survey that focuses only on OLAP databases and compares various proposals for the logical models behind them. They group the proposals into just two categories, commercial tools and academic efforts, the latter being subcategorized into relational model extensions and cube-oriented approaches [41]. However, unlike our survey, they do not cover Big Data technologies.
Several papers discuss the state of the art of the types of data stores, technologies and data analytics used in Big Data scenarios [29] [30] [33] [42]; however, they do not compare them with other approaches. Recently, P. Chandarana and M. Vijayalakshmi focused on Big Data analytics frameworks and provided a comparative study according to their suitability [35].
Summarizing, none of the aforementioned works provides an analysis as broad as the one in this paper; as far as we know, no previous work simultaneously compares operational databases, decision support databases and Big Data technologies. Instead, each focuses on describing one or two of these approaches more thoroughly.
6. Conclusions
In recent years, the term Big Data has emerged to classify the huge datasets that are continuously being produced from various sources and that are represented in a variety of structures. Handling this kind of data
represents new challenges, because the traditional RDBMSs and DWs reveal serious limitations in terms of per-
formance and scalability when dealing with such a volume and variety of data. Therefore, it is necessary to reinvent the ways in which data is represented and analyzed, in order to be able to extract value from it.
This paper presents a survey focused on two perspectives, data modeling and data analytics, which are reviewed in terms of the three most representative approaches nowadays: operational databases, decision support databases and Big Data technologies. First, concerning data modeling, this paper discusses the most common
data models, namely: relational model and ER model for operational databases; star schema model and OLAP
cube model for decision support databases; and key-value store, document-oriented database, wide-column store
and graph database for Big Data-based technologies. Second, regarding data analytics, this paper discusses the
common operations used for each approach. Namely, it observes that operational databases are more suitable for
OLTP applications, decision support databases are more suited for OLAP applications, and Big Data technolo-
gies are more appropriate for scenarios like batch-oriented processing, stream processing, OLTP and interactive
ad-hoc queries and analysis.
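As a minimal illustration of the modeling contrast summarized above (the entity, field names and values are hypothetical, chosen only for the example), the same student record can be represented as a flat relational tuple or as a self-contained document in a document-oriented store:

```python
# Relational model: fixed schema, one flat tuple per table row;
# related data (e.g. enrollments) would live in a separate table.
student_row = ("s1", "Alice", "Computer Science")  # (id, name, degree)

# Document-oriented model: a self-contained, possibly nested document,
# typically serialized as JSON; related data can be embedded directly.
student_document = {
    "_id": "s1",
    "name": "Alice",
    "degree": "Computer Science",
    "enrollments": [
        {"course": "Databases", "grade": 17},
        {"course": "Big Data", "grade": 18},
    ],
}

# The embedded enrollments avoid the join a relational schema would need.
courses = [e["course"] for e in student_document["enrollments"]]
```

The trade-off is the one discussed throughout the paper: the relational tuple enforces a schema and supports joins and transactions, while the document favors schema flexibility and horizontal partitioning.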
Third, it compares these approaches in terms of the two perspectives, based on a set of analysis features. From the data modeling perspective, it considers features such as the data model, its abstraction level, its concepts, the concrete languages used to describe it, and the modeling and database tools that support it. From the data analytics perspective, it takes into account features such as the class of application domains, the most common operations and the concrete languages used to specify those operations.
From this analysis, it is possible to verify that there are several data models for Big Data, but none of them is
represented by any modeling language, nor supported by a respective modeling tool. This issue constitutes an open research area whose exploration can improve the development process of Big Data targeted applications, namely by applying a Model-Driven Engineering approach [12]-[14]. Finally, this paper also presents some related work on
the data modeling and data analytics areas.
As future work, we consider that this survey may be extended to capture additional aspects and comparison
features that are not included in our analysis. It will also be interesting to survey concrete scenarios where Big
Data technologies prove to be an asset [43]. Furthermore, this survey constitutes a starting point for our ongoing
research goals in the context of the Data Storm and MDD Lingo initiatives. Specifically, we intend to extend
existing domain-specific modeling languages, like XIS [44] and XIS-Mobile [45] [46], and their MDE-based
framework to support both the data modeling and data analytics of data-intensive applications, such as those re-
searched in the scope of the Data Storm initiative [47]-[50].
Acknowledgements
This work was partially supported by national funds through FCT—Fundação para a Ciência e a Tecnologia,
under the projects POSC/EIA/57642/2004, CMUP-EPB/TIC/0053/2013, UID/CEC/50021/2013 and Data Storm
Research Line of Excellency funding (EXCL/EEI-ESS/0257/2012).
References
[1] Mayer-Schönberger, V. and Cukier, K. (2014) Big Data: A Revolution That Will Transform How We Live, Work, and
Think. Houghton Mifflin Harcourt, New York.
[2] Noyes, D. (2015) The Top 20 Valuable Facebook Statistics. https://zephoria.com/top-15-valuable-facebook-statistics
[3] Shvachko, K., Kuang, H., Radia, S. and Chansler, R. (2010) The Hadoop Distributed File System. 26th Symposium on Mass Storage Systems and Technologies (MSST), Incline Village, 3-7 May 2010, 1-10.
http://dx.doi.org/10.1109/msst.2010.5496972
[4] White, T. (2012) Hadoop: The Definitive Guide. 3rd Edition, O'Reilly Media, Inc., Sebastopol.
[5] Dean, J. and Ghemawat, S. (2008) MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51, 107-113. http://dx.doi.org/10.1145/1327452.1327492
[6] Hurwitz, J., Nugent, A., Halper, F. and Kaufman, M. (2013) Big Data for Dummies. John Wiley & Sons, Hoboken.
[7] Beyer, M.A. and Laney, D. (2012) The Importance of “Big Data”: A Definition. Gartner.
https://www.gartner.com/doc/2057415
[8] Duncan, A.D. (2014) Focus on the “Three Vs” of Big Data Analytics: Variability, Veracity and Value. Gartner.
https://www.gartner.com/doc/2921417/focus-vs-big-data-analytics
[9] Agrawal, D., Das, S. and El Abbadi, A. (2011) Big Data and Cloud Computing: Current State and Future Opportunities.
Proceedings of the 14th International Conference on Extending Database Technology, Uppsala, 21-24 March, 530-533.
http://dx.doi.org/10.1145/1951365.1951432
[10] McAfee, A. and Brynjolfsson, E. (2012) Big Data: The Management Revolution. Harvard Business Review.
[11] DataStorm Project Website. http://dmir.inesc-id.pt/project/DataStorm
[12] Stahl, T., Voelter, M. and Czarnecki, K. (2006) Model-Driven Software Development: Technology, Engineering,
Management. John Wiley & Sons, Inc., New York.
[13] Schmidt, D.C. (2006) Guest Editor’s Introduction: Model-Driven Engineering. IEEE Computer, 39, 25-31.
http://dx.doi.org/10.1109/MC.2006.58
[14] Silva, A.R. (2015) Model-Driven Engineering: A Survey Supported by the Unified Conceptual Model. Computer
Languages, Systems & Structures, 43, 139-155.
[15] Ramakrishnan, R. and Gehrke, J. (2012) Database Management Systems. 3rd Edition, McGraw-Hill, Inc., New York.
[16] Connolly, T.M. and Begg, C.E. (2005) Database Systems: A Practical Approach to Design, Implementation, and Man-
agement. 4th Edition, Pearson Education, Harlow.
[17] Codd, E.F. (1970) A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, 13, 377-
387. http://dx.doi.org/10.1145/362384.362685
[18] Bachman, C.W. (1969) Data Structure Diagrams. ACM SIGMIS Database, 1, 4-10.
http://dx.doi.org/10.1145/1017466.1017467
[19] Chamberlin, D.D. and Boyce, R.F. (1974) SEQUEL: A Structured English Query Language. In: Proceedings of the
1974 ACM SIGFIDET (Now SIGMOD) Workshop on Data Description, Access and Control (SIGFIDET’ 74), ACM
Press, Ann Arbor, 249-264.
[20] Chen, P.P.S. (1976) The Entity-Relationship Model—Toward a Unified View of Data. ACM Transactions on Database
Systems, 1, 9-36. http://dx.doi.org/10.1145/320434.320440
[21] Tanaka, A.K., Navathe, S.B., Chakravarthy, S. and Karlapalem, K. (1991) ER-R, an Enhanced ER Model with Situa-
tion-Action Rules to Capture Application Semantics. Proceedings of the 10th International Conference on Entity-
Relationship Approach, San Mateo, 23-25 October 1991, 59-75.
[22] Merson, P. (2009) Data Model as an Architectural View. Technical Note CMU/SEI-2009-TN-024, Software Engineer-
ing Institute, Carnegie Mellon.
[23] Kimball, R. and Ross, M. (2013) The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. 3rd
Edition, John Wiley & Sons, Inc., Indianapolis.
[24] Zhang, D., Zhai, C., Han, J., Srivastava, A. and Oza, N. (2009) Topic Modeling for OLAP on Multidimensional Text
Databases: Topic Cube and Its Applications. Statistical Analysis and Data Mining, 2, 378-395.
http://dx.doi.org/10.1002/sam.10059
[25] Gray, J., et al. (1997) Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-
Totals. Data Mining and Knowledge Discovery, 1, 29-53. http://dx.doi.org/10.1023/A:1009726021843
[26] Cattell, R. (2011) Scalable SQL and NoSQL Data Stores. ACM SIGMOD Record, 39, 12-27.
http://dx.doi.org/10.1145/1978915.1978919
[27] Gilbert, S. and Lynch, N. (2002) Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant
Web Services. ACM SIGACT News, 33, 51-59.
[28] Vogels, W. (2009) Eventually Consistent. Communications of the ACM, 52, 40-44.
http://dx.doi.org/10.1145/1435417.1435432
[29] Grolinger, K., Higashino, W.A., Tiwari, A. and Capretz, M.A.M. (2013) Data Management in Cloud Environments:
NoSQL and NewSQL Data Stores. Journal of Cloud Computing: Advances, Systems and Applications, 2, 22.
http://dx.doi.org/10.1186/2192-113x-2-22
[30] Moniruzzaman, A.B.M. and Hossain, S.A. (2013) NoSQL Database: New Era of Databases for Big data Analytics-
Classification, Characteristics and Comparison. International Journal of Database Theory and Application, 6, 1-14.
[31] Chang, F., et al. (2006) Bigtable: A Distributed Storage System for Structured Data. Proceedings of the 7th Symposium
on Operating Systems Design and Implementation (OSDI’ 06), Seattle, 6-8 November 2006, 205-218.
[32] Spofford, G., Harinath, S., Webb, C. and Civardi, F. (2005) MDX Solutions: With Microsoft SQL Server Analysis
Services 2005 and Hyperion Essbase. John Wiley & Sons, Inc., Indianapolis.
[33] Hu, H., Wen, Y., Chua, T.S. and Li, X. (2014) Toward Scalable Systems for Big Data Analytics: A Technology Tu-
torial. IEEE Access, 2, 652-687. http://dx.doi.org/10.1109/ACCESS.2014.2332453
[34] Golab, L. and Özsu, M.T. (2003) Issues in Data Stream Management. ACM SIGMOD Record, 32, 5-14.
http://dx.doi.org/10.1145/776985.776986
[35] Chandarana, P. and Vijayalakshmi, M. (2014) Big Data Analytics Frameworks. Proceedings of the International Con-
ference on Circuits, Systems, Communication and Information Technology Applications (CSCITA), Mumbai, 4-5 April
2014, 430-434. http://dx.doi.org/10.1109/cscita.2014.6839299
[36] Poole, J., Chang, D., Tolbert, D. and Mellor, D. (2002) Common Warehouse Metamodel. John Wiley & Sons, Inc.,
New York.
[37] XML for Analysis (XMLA) Specification. https://msdn.microsoft.com/en-us/library/ms977626.aspx
[38] ter Bekke, J.H. (1997) Comparative Study of Four Data Modeling Approaches. Proceedings of the 2nd EMMSAD
Workshop, Barcelona, 16-17 June 1997, 1-12.
[39] Hou, R. (2011) Analysis and Research on the Difference between Data Warehouse and Database. Proceedings of the
International Conference on Computer Science and Network Technology (ICCSNT), Harbin, 24-26 December 2011,
2636-2639.
[40] Thomsen, C. and Pedersen, T.B. (2005) A Survey of Open Source Tools for Business Intelligence. Proceedings of the
7th International Conference on Data Warehousing and Knowledge Discovery (DaWaK’05), Copenhagen, 22-26 Au-
gust 2005, 74-84. http://dx.doi.org/10.1007/11546849_8
[41] Vassiliadis, P. and Sellis, T. (1999) A Survey of Logical Models for OLAP Databases. ACM SIGMOD Record, 28, 64-
69. http://dx.doi.org/10.1145/344816.344869
[42] Chen, M., Mao, S. and Liu, Y. (2014) Big Data: A Survey. Mobile Networks and Applications, 19, 171-209.
http://dx.doi.org/10.1007/978-3-319-06245-7
[43] Chen, H., Chiang, R.H.L. and Storey, V.C. (2012) Business Intelligence and Analytics: From Big Data to Big Impact. MIS Quarterly, 36, 1165-1188.
[44] Silva, A.R., Saraiva, J., Silva, R. and Martins, C. (2007) XIS-UML Profile for Extreme Modeling Interactive Systems.
Proceedings of the 4th International Workshop on Model-Based Methodologies for Pervasive and Embedded Software