Data Modeling Guidelines For Nosql Document-Store Databases
Abdullahi Abubakar Imam1,a,b, Shuib Basri2,a, Rohiza Ahmad3,a, Junzo Watada4,a, Maria T. González-Aparicio5,c, Malek Ahmad Almomani6,a
a CIS Department, Universiti Teknologi PETRONAS, Bandar Seri Iskandar, 31570, Perak, Malaysia
b CS Department, Ahmadu Bello University, Zaria, Nigeria
c Computing Department, University of Oviedo, Gijón, Spain
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 9, No. 10, 2018 (www.ijacsa.thesai.org)

Abstract—Good database design is key to high data availability and consistency in traditional databases, and numerous techniques exist to aid designers in modeling schemas appropriately. These schemas are strictly enforced by traditional database engines. However, with the emergence of schema-free (NoSQL) databases coupled with voluminous and highly diversified datasets (big data), such aid becomes even more important, as schemas in NoSQL are enforced by application developers, which requires a high level of competence. Specifically, the existing modeling techniques and guides used in traditional databases are insufficient for big-data storage settings. As a synthesis, new modeling guidelines for NoSQL document-store databases are proposed. These guidelines cut across both the logical and physical stages of database design. Each is developed from solid empirical insights, yet they are prepared to be intuitive to developers and practitioners. To realize this goal, we employ an exploratory approach to the investigation of techniques, empirical methods and expert consultations. We analyze how industry experts prioritize requirements, and we analyze the relationships between datasets on the one hand and error prospects and awareness on the other. A few proprietary guidelines were extracted from a heuristic evaluation of five NoSQL databases. In this regard, the proposed guidelines have great potential to function as an important instrument of knowledge transfer from academia to NoSQL database modeling practice.

Keywords—Big Data; NoSQL; Logical and Physical Design; Data Modeling; Modeling Guidelines; Document-Stores; Model Quality

I. INTRODUCTION

With the rise in data sizes, types and rates of generation, i.e., big data, traditional datastores have become less capable for many reasons, such as structural rigidity and untimely responses due to high access latency [1], [2], [3], [4], [5]. This unacceptable performance has led to a reevaluation of how such data can be efficiently managed in a new generation of applications where performance and availability are paramount [5], [6]. As a result, NoSQL (Not Only SQL) databases were introduced to augment the features of Traditional Databases (TD) with new concepts such as schema flexibility, scalability, high performance, partition tolerance and other extended features [7]. The schemas of such databases are enforced by client-side application developers rather than by database engines, as in the case of TD [2], [8].

Consequently, several giant organizations, such as Google, Facebook and Amazon, have adopted NoSQL technology for data management and storage [5]. However, the inherent complexity and unpredictable nature of today's data [9], along with the low competence level of data modelers [3], [10], [11], developer autonomy [1], [12] and inadequate modeling guidelines [13], have posed numerous challenges to the implementation of NoSQL schema best practices. This has increasingly led to erroneous database modeling and design [1], [14], [15], [16], [17], which defeats the notion of robustness in NoSQL databases and results in the production of low-performance, non-secure and less durable systems.

For example, consider the security aspect of NoSQL document-oriented databases. These databases offer a query language or an Application Program Interface (API) that can retrieve the contents of any document in a collection. These APIs, although they provide flexibility in data access across heterogeneous platforms, can be used as entry points by hackers when incorrectly implemented [18], [19]. Recently, Flexcoin, a United States bank, was attacked, and more than a half-million USD was lost [20]. In addition, an airport in the UK was completely shut down due to a system failure [21], resulting in several flight cancellations. These tragic events were strongly attributed to improper database design, as discussed in Section III. Some of the latest reported security weaknesses are as follows: 1) schema: because of its flexibility, mere record insertion can automatically create a new schema within a collection; 2) queries: unsafe queries can be created via string concatenation; and 3) JavaScript (JS): the db.eval() and $where clauses take JS functions as parameters [18]. Issues of this kind are what drew the attention of researchers to provide viable and substantial solutions. However, many of the solutions come as testing tools for already-developed databases [4], [22], [23] or are proprietary [10], [17], [24], [25], which opposes our understanding that the solutions should come at the earliest stage of design (data modeling). Clearly, there is a need for a standard guide in practice.

As such, a set of NoSQL modeling guidelines for the logical and physical design of document-store databases is proposed. In these guidelines, all possible relationships are retrieved, analyzed, categorized and prioritized. The resulting guidelines are expected to serve as an important tool of knowledge for beginner, intermediate and even advanced NoSQL database developers. For the actualization of this goal, we employ an exploratory approach for the investigation of existing works, empirical methods and expert consultations. We analyze how industry experts prioritize the guidelines, and we analyze the relationships between datasets on the one hand and error prospects and awareness on the other. A few proprietary guidelines were extracted and harmonized from a heuristic evaluation of five different existing NoSQL databases. In this regard, the proposed guidelines have great potential to function as an important instrument of knowledge transfer from academia to NoSQL database modeling practice.

The remainder of this paper is structured as follows. Section II reviews and analyzes existing works. Section III puts forward the proposed guidelines and their application scenarios. Section IV prioritizes the guidelines in three different categories. Section V discusses the findings (limitations and potentials). Finally, Section VI concludes and highlights the future focus.

II. RELATED WORKS

The origin of Data Modeling (DM) in databases can be traced back to the mid-20th century as a technique for structuring and organizing data [33]. The exercise is astonishingly similar to construction design, where walls are planned, flows are optimized, and materials are chosen based on the type of utility the building will accommodate and the level of interaction needed between sections [34]. DM gained the attention of researchers in the fields of information systems and data visualization in the 1970s (see [35], [36]). In the late 1990s, the Unified Modeling Language (UML) [34] was introduced to consolidate the data modeling symbols and notations invented by [35], [36] into one standardized language, all for the purpose of simplifying data visualization and modeling in relational databases.

Now, with the emergence of unstructured, voluminous and complex datasets, i.e., big data, the requirement for more flexible and higher-performance databases has become essential [27], [28], [33], [37], which has given rise to the concept of NoSQL databases. The high flexibility of NoSQL databases makes data modeling even more challenging, as schemas are written and enforced by client-side application developers rather than by database engines, as in the case of RDBMS [12], [38], [26], [29]. This raises the question of competence, which may lead to the production of high- or low-quality models [10], [12]. A recent report by [20] shows how a low level of competence in NoSQL data modeling cost a United States-based company called Flexcoin a half-million US dollars: a hacker was able to make several transactions before the account-balance document was updated (low consistency). In another case, an airport in London was completely shut down as a result of a major IT system failure [21], for which the experts assigned the blame to poor back-end system design. These are officially reported instances, while several other cases, such as those discussed in [39], [40], do exist.

To mitigate these challenges, experts shared their experiences online on the most common questions asked by client-side application developers. Some of these questions are (i) how to model one-to-N relationships in document databases, (ii) how to know when to reference a document instead of embedding it, and (iii) whether document databases allow Entity Relationship modeling at all. In an attempt to address these and similar questions, experts highlighted the necessity of a standardized modeling guide for these powerful data stores [10], [12], [17], [30]. This is partly because many of the questions keep reappearing on multiple platforms or even on the same platform.

In the words of William (Lead Technical Engineer at MongoDB) [10], guidance is strongly required for MongoDB developers, upon which a few guidelines were produced to ease the modeling process. Moreover, Ryan CrawCuor and David Makogon [17] created a comprehensive presentation on how to model data in JSON. In addition, eBay [24] and Netflix [25] produced some guidelines for schema design in Cassandra. However, these guidelines, though comprehensive, are complex and designed for the referenced databases only, i.e., they are proprietary. Consequently, straightforward and more general guidelines are needed in practice.

In [8] and [12], reuse of existing modeling expertise (from RDBMS) is allowed in order to minimize the high level of competence required to model NoSQL databases. This was achieved using IDEF1X (a standard data-modeling language) and Formal Concept Analysis (FCA). However, an experiment conducted by [13] clearly showed the limitations of existing modeling expertise when applied to new-generation complex datasets (big data). Clearly, NoSQL databases need a different modeling approach to efficiently manage big data due to its diverse characteristics [32].

In [1], a cost-based approach for schema recommendation is proposed with the aim of replacing the rules of thumb currently followed by less competent NoSQL database designers. In this approach, the expected performance of the target application is estimated, upon which a candidate schema is recommended. The approach made schema modeling more stable and secure than before. However, more stages are added to the design process, such as data analysis, application of the tool to propose a schema, and then translation of the schema into a real application. Moreover, the approach is applicable to column-family databases only. In addition, the tool focuses only on the expected performance of the candidate schema, despite the fact that NoSQL schema design is largely driven by the nature of the target data [16]. Alternately, an interactive, schema-on-read approach was proposed in [41] for finding multidimensional structures in document stores. [42] proposed a data migration architecture that moves data from SQL to NoSQL document-stores while taking into account the data models of both categories of databases. Although these approaches yielded relatively good findings, more generic, simple and data-driven guidance prepared for at least one category of NoSQL databases [12], [31], [32] is still needed by practitioners.

The heterogeneity of today's systems, the growth of data complexity, and the lack of modeling expertise have been stated as motivations of the aforementioned works. These claims have been confirmed by error-rate reports [20], [21], [39], [40] on real-world NoSQL-driven projects. Undoubtedly, there is a need for well-founded guidelines in practice. The following section presents the proposed guidelines, which were synthesized from empirical research and professional involvement.

III. PROPOSED GUIDELINES

In this section, the proposed guidelines, which were synthesized from empirical work, are introduced. The section is divided into four subsections. In Section 3.1, an example model from a university social media networking system, which was used for this research, is described. Section 3.2 highlights, in
summary, the empirical research upon which the proposed guidelines are built. Section 3.3 presents the guidelines and their respective explanations. Section 3.4 shows how the proposed guidelines can improve the model presented in Section 3.1.

A. An Example Model

To illustrate the proposed guidelines, the running example shown in Fig. 1 is used. The model describes the entities, and the connections between them, of a university social media networking system developed by the university's programmers. The modeling was done without considering the proposed guidelines and, as will be seen later, improves when the proposed guidelines are applied.

The model shown in Fig. 1 follows the Entity Relationship Diagram (ERD) notations and symbols proposed by [35] and [36], which are the most popular relational database modeling technique in both academia and industry. Although the Unified Modeling Language (UML) [43] was introduced to standardize approaches and notations, the model in Fig. 1 adopts a few fundamental symbols and notations from [35], [36], [43] for demonstration purposes. Rectangles, arrows, and curly and square brackets are used to show, conceptually, the activity flow. Generally, in ERD, rectangles indicate entities, while arrows correspond to data flows or connections between the entities. Moreover, curly and square brackets indicate attributes and arrays of keys, respectively.

The model in Fig. 1 roughly describes a user entity and user-dependent entities. A user has direct entities such as contact info, basic info, friends and family, messages, and education and work. Each of these entities has other sub-entities, and as the tree expands, many entities repeatedly appear under different parent entities. For example, the likers and commenters entities contain the same list of people as the friends & family entity. Furthermore, the people in the friends and family entity are also the system users recorded in the User entity. Now, these repetitions might improve data availability, but at the expense of consistency, or of speed during inserts, updates and deletes. This will be further explained later, when the model in Fig. 1 is improved using our guidelines.

B. Empirical Research Background

The research background upon which the proposed guidelines are defined is described in this section. The wide acceptance and adoption of the ERD model in relational databases is connected with its ease of comprehension and application to structured datasets. In prior research, we thoroughly investigated the new-generation datasets (big data) while taking into account the connection between NoSQL databases and the factors leading to their comprehension and proper modeling. Factors such as understanding, error probability and ambiguity were experimented on, as well as other factors that motivated the guideline propositions. The findings are summarized as follows.

• Understanding relates to the degree to which datasets and system requirements can be easily understood. It is a strong basis on which data is classified, categorized and modeled. In an experiment reported in [13], we introduced new cardinality notations and relationship styles for NoSQL databases. From our engagement with programmers regarding the new notations and
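Returning to the example model of Section 3.1, the consistency cost of the repeated user lists (likers, commenters, friends & family) can be sketched with plain Python dictionaries standing in for stored documents. All entity names and field names below are hypothetical stand-ins loosely mirroring Fig. 1, not the actual schema of the system; this is a minimal sketch of the embed-versus-reference trade-off, not an implementation.

```python
# Hypothetical documents in a Fig. 1-style model; plain dicts stand in
# for a document store, so no database connection is required.

user = {"_id": "u1", "name": "Aisha", "email": "aisha@example.edu"}

# Embedded duplication: the same user data is copied into each post.
post = {
    "_id": "p1",
    "text": "Hello campus!",
    "likers": [{"user_id": "u1", "name": "Aisha"}],
    "commenters": [{"user_id": "u1", "name": "Aisha"}],
}

# Referencing instead: only IDs are stored, so one update serves all reads.
post_ref = {"_id": "p1", "text": "Hello campus!", "liker_ids": ["u1"]}

def rename_user_embedded(posts, user_id, new_name):
    # Under embedding, every duplicated copy of the user's name must be
    # found and rewritten; this is the consistency cost of duplication.
    touched = 0
    for p in posts:
        for copy in p.get("likers", []) + p.get("commenters", []):
            if copy["user_id"] == user_id:
                copy["name"] = new_name
                touched += 1
    return touched

print(rename_user_embedded([post], "u1", "Binta"))  # prints 2
```

Embedding keeps read paths self-contained (availability), but a single profile change fans out to every copy; referencing confines the update to one document at the cost of an extra lookup per read. This is precisely the trade-off the guidelines that follow prioritize.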
The proposed guidelines are as follows. They adhere to the categorization depicted in Fig. 2. In the beginning, embedding is put forward.

1) Embedding: This section presents the first set of the proposed guidelines (G1-G9), which aim to answer questions related to embedding (i.e., the insertion of one document into another).

G1: Embed sub-documents unless forced otherwise: For better system performance in terms of saving and retrieval speed, try to always embed child documents except when it is necessary to do otherwise. With embedding, there is no need to perform a separate query to retrieve the embedded documents [7].

G2: Use the array concept when embedding: It is recommended to use an array of embedded documents when modeling few relationships [10], [17].

G3: Define an array upper bound in the parent document: Avoid the use of an unlimited array of ObjectID references on the many side of the relationship if it contains a few thousand documents [17].

G4: Embed records that are managed together: When records are queried, operated on and updated together, they should be embedded [13].

G5: Embed dependent documents: Dependency is one of the key indicators for embedding a document [17]. For example, order details are solely dependent on the order itself; thus, they should be kept together.

G6: Embed one-to-one relationships (explained in [13]): When modeling a one-to-one relationship, the embedding style should be applied.

G7: Group data with the same volatility: Data should be grouped based on the rate at which it changes [13]. For example, consider a person's bio-data and the statuses of several social media accounts. The volatility of a social media status is higher than that of the bio-data, which changes rarely (e.g., an email address) or not at all (e.g., a date of birth).

G8: Two-way embedding is preferred when the N size is close to the M size in an N:M relationship (presented in [13]): In an N:M relationship, try to establish a relationship balance by predicting the maximum number of N and the maximum number of M [7], [13]. Two-way embedding is preferred when the N size is close to the M size.

G9: One-way embedding is preferred if there is a huge gap in size between N and M: If the gap is, for example, 3 on the N side and 300,000 on the M side, then one-way embedding should be considered [13].

2) Referencing: Referencing can be explained as the process of connecting two or more documents together using a unique identifier [13]. The following guidelines (G10-G15) aim to answer questions related to referencing.

G10: Reference highly volatile documents: The high volatility of a document is a good signal to reference it instead of embedding it. For example, consider a post made on social media (Fig. 1): the likes tag changes so often that it should be unbound from the main document, so that the main document is not accessed each time the likes button is hit.

G11: Reference standalone entities: Avoid embedding a child document/object if it will at some point be accessed alone. Documents, when embedded, cannot be retrieved alone as a single entity without retrieving the main entity [10].

G12: Use an array of references for the many side of the relationship: When a relationship is one-to-many, as in [13], or a document is a standalone document, an array of references is recommended.

G13: Parent referencing is recommended for a large quantity of documents: For instance, when the many side of a relationship is squillions (introduced in [13]), parent referencing is preferred.

G14: Do not embed sub-documents if they are many: A key entity with many other sub-entities should adopt referencing rather than embedding [13]. This will minimize high-cardinality arrays [41].

G15: Index all documents for better performance: If documents are indexed correctly and projection specifiers such as the relationship styles discussed in [13] are used, application-level joins are nothing to be worried about.

3) Bucketing: Bucketing refers to the splitting of documents into smaller, manageable sizes. It balances the rigidity of embedding against the flexibility of referencing [13].

G16: Combine embedding and referencing if necessary: Embedding and referencing can be merged and work perfectly together [10]. For example, consider a product advert on the Amazon website: there is the product information, the price, which may change, and a list of comments and likes. This advert combines reasons to embed as well as reasons to reference; thus, merging the two techniques can be the best practice in this case.

G17: Bucket documents with large content: To split a document into discrete batches, such as by day, month, hour, quantity, etc., bucketing should be considered [13]. For example, the squillions (introduced in [13]) side of a relationship can be divided into 500 records per display, as in the case of pagination.

4) General: A few guidelines do not fall into any of the earlier categories (embedding, referencing and bucketing). Such guidelines are grouped and presented as follows.

G18: Denormalize documents when the read/write frequency is very low: Denormalize a document only if it is not updated regularly; access-frequency prediction should therefore guide the decision to denormalize any entity.

G19: Denormalize two connected documents for semi-combined retrievals: Sometimes two documents are connected, but only one is to be retrieved along with a few
fields from the second document; denormalization can help here [13]. For example, when retrieving a presentation session, a speaker's name needs to be displayed as well, but not all of the speaker's details; so the second document (speaker) is denormalized to get only the name of the presenter and attach it to the session document.

G20: Use the tags implementation style for data transfer: If information is not sensitive, packaging it within tags, as in an XML document, is recommended [46].

G21: Use directory hierarchies if security is a priority: Apply role-based authorization to each of the directories for access protection [19]. A user can have the privilege to access one directory or a collection of directories, depending on the user's role.

G22: Use the document-collections implementation style for better read/write performance: This is the same as G21, but with the addition of better read/write performance.

G23: Use non-visible metadata for data transfer between nodes or servers: In many cases, APIs do not have security mechanisms embedded in them [47]. So, encoding sensitive information before transfer and decoding it upon arrival is strongly recommended. This will improve the security of data in transit.

TABLE I. OVERVIEW OF THE PROPOSED GUIDELINES

G1 Embed sub-documents unless forced otherwise
G2 Use array concept when embedding
G3 Define array upper bound in parent document
G4 Embed records which are managed together
G5 Embed dependent documents
G6 Embed one-to-one relationships
G7 Group data with same volatility
G8 Two-way embedding is preferred when N size is close to the M size in N:M relationship
G9 One-way embedding is preferred if there's a huge gap in size between N and M
G10 Reference highly volatile documents
G11 Reference standalone entities
G12 Use array of references for the many side of the relationship
G13 Parent referencing is recommended for large quantity of entities
G14 Do not embed sub-documents if they are many
G15 Index all documents for better performance
G16 Combine embedding and referencing if necessary
G17 Bucket documents with large content
G18 Denormalize document when read/write frequency is very low
G19 Denormalize two connected documents for semi-combined retrievals
G20 Use tags implementation style for data transfer
G21 Use directory hierarchies if security is a priority
G22 Use document collections implementation style
G23 Use non-visible metadata for data transfer between nodes or servers

The following section explains the application of the aforementioned guidelines.

D. Application

To demonstrate the proposed guidelines, we show how the original social media model (Fig. 1) can be transformed into a more stable model. In Fig. 3, we marked and labeled some areas of improvement on the same model using guideline identifiers. The transformed model is presented in Fig. 4, which results from the application of the proposed guidelines. The application of each of these guidelines is explained as follows.

In the original model, some modeling problems were identified, such as too much redundancy of information, which of course leads to inconsistencies among entities. For example, there exists a user entity that contains some information about users; this entity is fully repeated in places like "family & friends", "commenters", "likers", etc., in different branches of the model. The problem with this approach is that updating a single attribute, for instance, will require updating all documents carrying the same attribute. Now, in a situation where an attribute changes frequently and the affected documents are many, more serious issues such as inconsistency, temporary insecurity (for access authorization) and performance deterioration may arise. Such events motivated many guidelines, such as G1, which recommends the embedding of all documents, or G6, which recommends embedding a single document. To maintain the availability provided by duplicating users' data even when it is embedded into the "User" entity, G17 comes in to take the few rarely changed attributes from the main document to the areas where they are accessed quite often. However, as the "User" entity is bucketed, referencing became required (G11). Similarly, highly volatile documents like "Discussions" and "Posts" were bucketed from the "User" entity and grouped based on the rate at which they change (G7), which allows write/update operations without necessarily accessing the parent documents. Also, G11 was considered for independent access of "Discussions" and "Posts", since they may be accessed alone in most cases.

While referencing related documents, G2 was used, which states the use of the array concept when referencing documents by their IDs, especially on the M (many) side of the relationship (G12); in addition, an upper bound was defined for every array of IDs (G3). But for a large number of entities, as with "comments", the spirit of parent referencing (G13) was followed.

Referring again to the original model, since the write frequency is very high in the "comments" and "likes" entities, embedding was avoided (G14); instead, we denormalized the "commenters" and "likers" entities (G19) and referenced each of them (G10), such that embedding and referencing are combined (G16) using only the commenter's name and ID, achieving both availability and consistency at the same time. The rationale behind this is that only the commenter's or liker's name is usually required for each comment or like. Therefore, for high availability, only the name of a user should be denormalized, while for consistency during updates, the array of IDs can be used for further round trips.

In the "User" entity again, "Basic Info" and "Contact Info" are not only dependent on the "User" entity but are also managed together. Since such information has low read/write frequencies (G18), putting them together based on their collectivity in management (G4) or on their dependencies on one another (G5) will significantly minimize the number of round trips to the server for a single update. Given that the predicted records for all three entities ("Basic Info", "Contact Info" and "User") are almost at the same level, two-way embedding is recommended (G8) to permit connection from either direction. But in the case of "Posts"
and "Likes", one-way embedding is most preferred (G9), since the number of "likes" can be more than a million for a particular post.

In view of the fact that performance is usually a priority requirement, indexing all documents (G15) is strongly recommended. Also, considering the node-balance challenge posed by the hierarchical data modeling style, the document-collection implementation style (G22) is maintained for many reasons, such as horizontal schema flexibility as the database scales up and down.

Although it is not frequently needed, interfacing (data exchange) with other applications is an important aspect to consider right from the modeling stage. To avoid using a proprietary data export format, G20 proposes the use of a tag formatting style such as XML, which is open and can be formatted (G23) and read by almost all programming languages. In many cases, web services are allowed to determine everything, including the use of special characters; this flexibility creates security vulnerabilities such as NoSQL injection via RESTful APIs. Such high expectations of security breaches motivated the use of hierarchical data modeling (G21), which eases the application of role-based authorization on each node
of the tree, or G22, which clusters documents into collections of documents at different stages.

It is important to note that not all the guidelines are applicable to the original model; some guidelines, such as G20-G23, were exemplified in a more generic way. This is because the original model did not interface with other models or applications. Also, the overall number of entities has been reduced from 24 in the original model to 17 in the transformed model as a result of prioritizing guidelines such as G1, G4, G5 and G7. In summary, the original model is restructured and transformed into a less redundant model with high availability and consistency, without changing the model's behavior.

IV. PRIORITIZING GUIDELINES

In the preceding section, we illustrated how each element of the proposed guidelines can be applied to a real dataset. However, in a situation where two or more guidelines are applicable, the modeler needs to be guided toward the most appropriate direction based on system requirements. For instance, while embedding dependent documents (G5) increases read/write performance, the requirement to access a document independently may necessitate referencing standalone entities (G11) or bucketing the frequently accessed entities (G17) into affordable elements. This is because an embedded child document cannot be retrieved alone without retrieving the parent document [32]. This situation is explained in the previous section, and it clearly demands more sensible priorities when applying the proposed guidelines.

It is important to note that, as much as we tried to simplify the guidelines, their diverse nature significantly increases the challenge of resolving conflicts between them. For a given model, many conflicting guidelines can be applicable in one section, and many sections can adopt one guideline.

The scope of this paper does not include a more comprehensive prioritization that is theoretically motivated and empirically validated. Nevertheless, we have taken the following approach to arrive at some guidance on guideline application prioritization. First, a presentation of the proposed guidelines was made to experts at Universiti Teknologi PETRONAS, Malaysia, which led to a comprehensive refinement of the guidelines. Secondly, SQL and NoSQL professionals in our network were contacted to take part in reviewing, analyzing and prioritizing the guidelines based on their expert opinions; these included five experts from Malaysia, one from Sweden, one from Spain, and two from Nigeria. A total of nine professionals, with an average modeling experience of 4 years, complied with our request and assisted in refining the guidelines and prioritizing their application under different circumstances.

Each of the professionals contacted received a verbal or written presentation of the proposed guidelines from the researchers. After that, all professionals were asked to individually review each guideline and add to or remove from the list. Next, each professional was asked to rank the refined guidelines with respect to three different categories, namely, availability (read operations), consistency (write and update operations) and cardinality notations, using a scale of 1-23. On this scale, rank 1 indicates a perception of the highest relative potential, while rank 23 indicates the lowest relative potential. This inquiry guided the researchers in inferring a priority scheme that can resolve conflicts among rival guidelines.

While ranking the guidelines, all participating experts were allowed to give equal ranks to more than one guideline. However, for each participant, a total of 276 (= 1 + 2 + 3 + ... + 23) assigned rank points was expected.

These assigned ranks were accumulated per guideline, leading to the results presented in Table II. It can be seen from this table that G6 is considered to have the highest potential to improve data availability, with a total rank score of 12, while G21 is deemed to have the least potential to improve data availability, with a total score of 202. The total scores of the remaining guidelines fall between these extremes.

On the other hand, since availability is not always a priority for all systems [4], the prioritization also considered another important database concept, namely consistency in replicated, connected or dependent data. As such, another set of priority
www.ijacsa.thesai.org 551 | Page
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 9, No. 10, 2018
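The arithmetic behind this ranking scheme can be sketched as follows. This is a minimal illustration, not the paper's instrument: the guideline labels and any scores passed in are hypothetical, and the real accumulated totals are those reported in Tables II and III.

```python
# Sketch of the expert ranking scheme described in Section IV.
# Assumption: each expert ranks all 23 guidelines on a 1-23 scale,
# ties are allowed, and each expert's ranks must still sum to 276.

N_GUIDELINES = 23  # guidelines are ranked on a scale of 1-23

def check_expert_ranks(ranks):
    """Validate one expert's assignment: the ranks (ties allowed)
    are expected to sum to 1 + 2 + ... + 23 = 276."""
    expected = N_GUIDELINES * (N_GUIDELINES + 1) // 2  # = 276
    return sum(ranks.values()) == expected

def accumulate(experts):
    """Accumulate each guideline's ranks over all experts.
    A lower total signals a higher relative potential (rank 1 is best)."""
    totals = {}
    for ranks in experts:
        for guideline, rank in ranks.items():
            totals[guideline] = totals.get(guideline, 0) + rank
    return totals
```

Under this scheme, a guideline's accumulated total (e.g., 12 for G6 under availability) is simply the sum of the ranks the nine experts assigned to it, and a conflict between rival guidelines is resolved in favor of the lower total within the relevant category.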
TABLE III. PRIORITIZING GUIDELINES BASED ON CONSISTENCY (WRITE & UPDATE OPERATIONS)

list was debated, the results of which are shown in Table III. This table suggests that G1 has the highest potential to improve consistency among different clusters, documents or datasets, as it has an accumulated score of 17 ranks. In contrast, G17 is considered to have the lowest potential to do so, with an accumulated score of 201 ranks. The remaining guidelines fall between the two extremes.

In addition to prioritizing guidelines for availability and consistency, cardinality can also be considered an important factor for categorizing the proposed guidelines and, thereafter, prioritizing their application in each of the categories. To do so, the new-generation cardinalities proposed by [13] were considered. These cardinalities have the potential to categorize complex datasets in seven different relationships, such as one-to-one (1:1), one-to-few (1:F), etc. In line with this, our study reveals that more than one guideline can be in the same priority level for a single cardinality, as shown in Fig. 5.

In each of the cardinalities (in Fig. 5), guidelines are prioritized on a scale of seven (priority levels 1 - 7), which are color coded (light gray to dark gray). As mentioned before, professionals were allowed to allocate the same rank to more than one guideline; therefore, many guidelines were given the same level in the same category, which indicates their potential equality in improving design performance.

In general, the suggested use of these rankings in three different categories (availability, consistency and cardinalities) is
that guidelines with higher positions should be favored over guidelines with lower positions or conflicting guidelines. For instance, while referencing standalone entities (G11) increases data availability for independent or round-trip queries, a requirement for high consistency may necessitate combining G7 and G15. This means that, in the case of security access, authorization across clusters can be controlled (consistency), and the solo records within the main document can be bucketed into a different, smaller document for independent retrieval (availability). In other words, the application of G7 can interfere with the impact of applying G11 or G15 because it appears higher, but when categorized (availability, consistency and cardinalities), their levels of application change based on the requirement.

It is worth mentioning that most of the elements in the presented guidelines were broadly recognized by the experts, as they had already used some of them in their NoSQL modeling process, which led to a better understanding of how best they can be prioritized.

V. DISCUSSION

In this section, the proposed guidelines are examined from two different aspects. First, some limitations of the proposed guidelines are discussed. Thereafter, several aspects of their potential are elaborated.

A. Limitations

While the proposed guidelines are stronger in their foundation and more generalized than many existing proprietary guidelines, some limitations must be noted. The first limitation relates to the development of the proposed guidelines and their validity: the fundamental principles and the empirical insights that ground the introduction of said guidelines would have been more thorough and evolving if the number of professionals involved had been greater than nine. However, the scarcity of expert-level NoSQL modelers made it difficult to find the typically used number of professionals. This is because NoSQL databases are new and are used to manage new-generation datasets (big data), and they have thus not yet matured in academia and industry.

The second limitation is that the proposed guidelines assume that all modelers have basic SQL modeling skills. This means that the symbols, notations and terminologies proposed by [35] and [36] are prerequisite skills for the effective use of the proposed guidelines. People with no database modeling background may find it challenging to start modeling with the proposed guidelines. However, in the world of diversification, such individuals should also be considered in a more automated manner, where a modeler answers a few questions and a suitable model is automatically produced, subject to an expert's analysis. This will minimize errors in modeling, thereby producing more stable NoSQL models.

The third limitation relates to the guideline prioritization described in Section 4. The ranking was derived from a number of presentations and expert scorings. Although this could be seen as needing wider expert coverage, it also raises questions such as what alternative ranking routes are available, for instance, through experimentation. Nevertheless, it seems less attractive at this stage to focus on producing perfect guidance on how best the proposed guidelines can be prioritized and applied. This is why we have high expectations that the proposed guidelines will be further extended in the near future to cover more application scenarios, as professionals have already inspired us with a few guidelines to be considered in the future.

B. Potential

This section continues to discuss the potential of the proposed guidelines beyond their detailed explanations (see Section 3.3) and application (see Section 3.4). Being the first modeling guidelines prepared to guide data modelers for NoSQL document-store databases, coupled with the increase in complexity of today's data (big data), greatly increases the potential for the proposed guidelines to be widely accepted and adopted in both industry, for practice, and academia, for learning.

On the technical side, two potential aspects are identified. First, the proposed guidelines can be the basis for automating the modeling process from scratch, which may not require much technical background. Second, if a model already exists, improvement might be required, as shown in Fig. 4, which resulted from applying the guidelines to Fig. 3. Instead of manually transforming the model using the proposed guidelines, the process can be intelligently automated to identify errors and mark them such that existing models can be automatically transformed. Solutions or approaches like these will require further in-depth and formal research on both aspects, as well as potentially more.

The proposed guidelines also point to further potential for the competence analysis of modelers. This might be achieved by measuring the structures of the produced models, which might be based on some assumptions, such as to what extent the proposed guidelines considered the model requirements. Modelers with high levels of competence are likely to detect any model that deviates from the proposed guidelines. In an experiment that involved designing a complete mini-NoSQL-based system, a model was repeatedly redesigned for improvements as a result of a low level of competence, which can be associated with a lack of basic skills [1]. In this manner, the proposed model offers easier methods, with simple language, to identify difficulties associated with complex datasets as well as the best methods to relate the entities.

VI. CONCLUSION AND FUTURE WORK

In this paper, the mismatch between proprietary recommendations for NoSQL document-store modeling and technical insight into NoSQL modeling practice is addressed. Prior empirical research and expert suggestions were consolidated, which led to the derivation of the proposed guidelines. Contrary to proprietary guidelines, our guidelines were built on a strong research foundation, which was practically motivated, empirically derived and conceptually validated. In contrast to the existing research on database modeling, our guidelines were made specifically for document-store NoSQL databases, with simple and straightforward explanations. In this manner, the proposed guidelines address the practical modeling problems that are being faced by many modelers in industry. This fact, among others, was emphasized by the low competence level of casual NoSQL modelers [32], [10] and the high rates of errors, repetition and insecurity [19], [20].

In addition to these virtues, the proposed guidelines also revealed some limitations to reflect upon. Most significantly, although the guidelines were prioritized based on three identified categories (availability, consistency and cardinalities), we believe that, as big data and NoSQL mature, several other categories will be harnessed, which may call for re-prioritization to suit the new categories. Furthermore, humans
(who are naturally prone to errors) are, to a large extent, involved in the application of the proposed guidelines; as such, several automations are required to minimize possible human error, thereby producing more stable models. Such solutions are slotted into our future research schedule.

In addition to the future focuses mentioned earlier, the applicability and usability of the proposed guidelines are another important aspect. While considering other usability test approaches, such as that in [48], where the applicability of SEQUAL quality was assessed, the proposed guidelines might be subjected to a similar usability assessment in the future, particularly through the use of a standard survey, which may result in further improvement of the proposed guidelines.

Finally, with high optimism, the proposed guidelines have great potential to function as an imperative instrument of knowledge transfer from academia to NoSQL database modeling practice, which may bridge the two disconnected communities (academia and industry) with respect to NoSQL database modeling.

ACKNOWLEDGMENT

The authors wish to acknowledge the support from Universiti Teknologi PETRONAS (UTP) for funding this research through the Yayasan and Graduate Assistantship Scheme (UTP-GA).

REFERENCES

[1] M. J. Mior, K. Salem, A. Aboulnaga, and R. Liu, "NoSE: Schema design for NoSQL applications," IEEE Trans. Knowl. Data Eng., from 2016 IEEE 32nd Int. Conf. Data Eng. (ICDE 2016), pp. 181-192, 2016.
[2] H. Zhang, G. Chen, B. C. Ooi, K. L. Tan, and M. Zhang, "In-Memory Big Data Management and Processing: A Survey," IEEE Trans. Knowl. Data Eng., vol. 27, no. 7, pp. 1920-1948, 2015.
[3] G. C. Everest, "Stages of Data Modeling: Conceptual vs. Logical vs. Physical," Carlson School of Management, University of Minnesota, Presentation to DAMA, Minnesota, 2016, pp. 1-30.
[4] M. T. Gonzalez-Aparicio, M. Younas, J. Tuya, and R. Casado, "A New Model for Testing CRUD Operations in a NoSQL Database," in 2016 IEEE 30th International Conference on Advanced Information Networking and Applications (AINA), 2016, vol. 6, pp. 79-86.
[5] IBM, "Why NoSQL? Your database options in the new non-relational world," Couchbase, no. March, p. 6, 2014.
[6] J. Bhogal and I. Choksi, "Handling Big Data Using NoSQL," in Proceedings - IEEE 29th International Conference on Advanced Information Networking and Applications Workshops (WAINA 2015), 2015, pp. 393-398.
[7] MongoDB, "How a Database Can Make Your Organization Faster, Better, Leaner," MongoDB White Paper, no. October, p. 16, 2016.
[8] V. Jovanovic and S. Benson, "Aggregate Data Modeling Style," Proc. Southern Assoc. Inf. Syst. Conf., Savannah, GA, USA, March 8th-9th, pp. 70-75, 2013.
[9] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263-1284, 2009.
[10] Z. William, "6 Rules of Thumb for MongoDB Schema Design," MongoDB, 2014. [Online]. Available: https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-1. [Accessed: 23-Jan-2017].
[11] X. Wu, X. Zhu, G. Q. Wu, and W. Ding, "Data mining with big data," IEEE Trans. Knowl. Data Eng., vol. 26, no. 1, pp. 97-107, 2014.
[12] V. Varga, K. T. Jánosi, and B. Kálmán, "Conceptual Design of Document NoSQL Database with Formal Concept Analysis," Acta Polytech. Hungarica, vol. 13, no. 2, pp. 229-248, 2016.
[13] A. A. Imam, S. Basri, R. Ahmad, N. Abdulaziz, and M. T. González-Aparicio, "New Cardinality Notations and Styles for Modeling NoSQL Document-store Databases," in IEEE Region 10 Conference (TENCON), Penang, Malaysia, 2017, p. 6.
[14] A. Ron, A. Shulman-Peleg, and A. Puzanov, "Analysis and Mitigation of NoSQL Injections," IEEE Secur. Priv., vol. 14, no. 2, pp. 30-39, 2016.
[15] M. Obijaju, "NoSQL NoSecurity: Security issues with NoSQL Database," Perficient: Data and Analytics Blog, 2015. [Online]. Available: http://blogs.perficient.com/dataanalytics/2015/06/22/nosql-nosecuity-security-issues-with-nosql-database/. [Accessed: 21-Sep-2016].
[16] M. J. Mior, "Automated schema design for NoSQL databases," in Proc. 2014 SIGMOD PhD Symp., 2014, pp. 41-45.
[17] R. CrawCuor and D. Makogon, "Modeling Data in Document Databases," United States: Developer Experience & Document DB, 2016.
[18] M. Chow, "Abusing NoSQL Databases," Proceedings of DEF CON 21 Hacking Conference, 2013.
[19] L. Okman, N. Gal-Oz, Y. Gonen, E. Gudes, and J. Abramov, "Security issues in NoSQL databases," in Proc. 10th IEEE Int. Conf. on Trust, Security and Privacy in Computing and Communications (TrustCom 2011), 8th IEEE Int. Conf. on Embedded Software and Systems (ICESS 2011), 6th Int. Conf. on FCST 2011, 2011, pp. 541-547.
[20] E. G. Sirer, "NoSQL Meets Bitcoin and Brings Down Two Exchanges: The Story of Flexcoin and Poloniex," Hacking, Distributed, 2014. [Online]. Available: http://hackingdistributed.com/2014/04/06/another-one-bites-the-dust-flexcoin/. [Accessed: 31-Jul-2017].
[21] J. Fortin and A. Cruz, "System Failure at British Airways Shuts Down Flights Out of London," The New York Times, 2017. [Online]. Available: https://www.nytimes.com/2017/05/27/world/europe/british-airways-flights-heathrow-and-gatwick-airports-.html. [Accessed: 01-Aug-2017].
[22] W. Naheman, "Review of NoSQL Databases and Performance Testing on HBase," 2013 Int. Conf. Mechatron. Sci. Electr. Eng. Comput., pp. 2304-2309, 2013.
[23] C. O. Truica, F. Radulescu, A. Boicea, and I. Bucur, "Performance evaluation for CRUD operations in asynchronously replicated document oriented database," in Proceedings - 2015 20th International Conference on Control Systems and Computer Science (CSCS 2015), 2015, pp. 191-196.
[24] J. Patel, "Cassandra data modeling best practices, part 1," ebaytechblog, 2012. [Online]. Available: http://ebaytechblog.com/?p=1308. [Accessed: 02-Aug-2017].
[25] N. Korla, "Cassandra data modeling - practical considerations @ Netflix," Netflix, 2013. [Online]. Available: http://www.slideshare.net/nkorla1share/cass-summit-3. [Accessed: 02-Aug-2017].
[26] N. Jatana, S. Puri, and M. Ahuja, "A Survey and Comparison of Relational and Non-Relational Database," Int. J., vol. 1, no. 6, pp. 1-5, 2012.
[27] C. J. M. Tauro, A. S, and S. A. B, "Comparative Study of the New Generation, Agile, Scalable, High Performance NOSQL Databases," Int. J. Comput. Appl., vol. 48, no. 20, pp. 1-4, 2012.
[28] R. April, "NoSQL Technologies: Embrace NoSQL as a relational Guy - Column Family Store," DBCouncil, 2016. [Online]. Available: https://dbcouncil.net/category/nosql-technologies/. [Accessed: 21-Apr-2017].
[29] S. Visigenic, "ODBC 2.0 Programmer's Manual, Version 2," United States: TimesTen Performance Software, 2000.
[30] G. Matthias, "Knowledge Base of Relational and NoSQL Database Management Systems: DB-Engines Ranking per database model category," DB-Engines, 2017. [Online]. Available: https://db-engines.com/en/ranking_categories. [Accessed: 21-Apr-2017].
[31] Gartner and M. Fowler, "The NoSQL Generation: Embracing the Document Model," MarkLogic Corp., Hype Cycle Big Data, no. May, 2014.
[32] P. Atzeni, "Data Modelling in the NoSQL world: A contradiction?," Int. Conf. Comput. Syst. Technol. (CompSysTech'16), no. June, pp. 23-24, 2016.
[33] P. Lake and P. Crowther, "A History of Databases," in Concise Guide to Databases: A Practical Introduction, Springer-Verlag London, vol. 17, no. 1, p. 307, 2013.
[34] K. Dembczyński, "Evolution of Database Systems," Intell. Decis. Support Syst. Lab., Poznań Univ. Technol., Poland, vol. 16, p. 139, 2015.
[35] P. P.-S. Chen, "The Entity-Relationship Model - Toward a Unified View of Data," ACM Trans. Database Syst., vol. 1, no. 1, pp. 9-36, 1976.
[36] G. C. Everest, "Basic Data Structure Models Explained with a Common Example," in Proc. Fifth Texas Conference on Computing Systems, 1976, pp. 18-19.
[37] J. Han, E. Haihong, G. Le, and J. Du, "Survey on NoSQL database," in Proc. 2011 6th Int. Conf. Pervasive Comput. Appl. (ICPCA 2011), 2011, pp. 363-366.
[38] T. A. Alhaj, M. M. Taha, and F. M. Alim, "Synchronization Wireless Algorithm Based on Message Digest (SWAMD) For Mobile Device Database," 2013 Int. Conf. Comput. Electr. Electron. Eng., pp. 259-262, 2013.
[39] K. Storm, "How I stole roughly 100 BTC from an exchange and how I could have stolen more!," reddit, 2014. [Online]. Available: https://www.reddit.com/r/Bitcoin/comments/1wtbiu/how_i_stole_roughly_100_btc_from_an_exchange_and. [Accessed: 02-Aug-2017].
[40] G. Khan, "Why you should never, ever, ever use document-store databases like MongoDB," reddit, 2015. [Online]. Available: https://www.reddit.com/r/programming/comments/3dvzsl/why_you_should_never_ever_ever_use_mongodb. [Accessed: 02-Aug-2017].
[41] M. L. Chouder, S. Rizzi, and R. Chalal, "Enabling Self-Service BI on Document Stores," Workshop Proceedings of the EDBT/ICDT 2017 Joint Conference, Venice, Italy, 2017.
[42] M. Mughees, "Data Migration from Standard SQL to NoSQL," 2013.
[43] T. Halpin, "UML data models from an ORM perspective: Part 1-10," J. Concept. Model., vol. 8, no. August, pp. 1-7, 1999.
[44] V. Abramova and J. Bernardino, "NoSQL databases: MongoDB vs Cassandra," in Proc. Int. C* Conf. Comput. Sci. Softw. Eng., ACM, 2013, pp. 14-22.
[45] M. Gelbmann, "DB-Engines Ranking of Document Stores," DB-Engines, 2017. [Online]. Available: https://db-engines.com/en/ranking/document+store. [Accessed: 21-Feb-2017].
[46] G. Papamarkos, L. Zamboulis, and A. Poulovassilis, "XML Databases," School of Computer Science and Information Systems, Birkbeck College, University of London, 2013.
[47] A. Ron, A. Shulman-Peleg, and E. Bronshtein, "No SQL, No Injection? Examining NoSQL Security," arXiv preprint arXiv:1506.04082, 2015.
[48] D. L. Moody, G. Sindre, T. Brasethvik, and A. Sølvberg, "Evaluating the Quality of Process Models: Empirical Testing of a Quality Framework," in S. Spaccapietra, S. T. March, Y. Kambayashi (Eds.), Conceptual Modeling - ER 2002, 21st International Conference on Conceptual Modeling, Tampere, Finland, October 7-11, Proceedings, Lecture Notes in Computer Science, vol. 2503, Springer, 2002, pp. 380-396.