
Information Systems 47 (2015) 98–115

Contents lists available at ScienceDirect

Information Systems
journal homepage: www.elsevier.com/locate/infosys

The rise of “big data” on cloud computing: Review and open research issues

Ibrahim Abaker Targio Hashem a,*, Ibrar Yaqoob a, Nor Badrul Anuar a, Salimah Mokhtar a, Abdullah Gani a, Samee Ullah Khan b

a Faculty of Computer Science and Information Technology, University of Malaya, 50603 Kuala Lumpur, Malaysia
b NDSU-CIIT Green Computing and Communications Laboratory, North Dakota State University, Fargo, ND 58108, USA

article info

Article history:
Received 11 June 2014
Received in revised form 22 July 2014
Accepted 24 July 2014
Available online 10 August 2014
Recommended by: Prof. D. Shasha

Keywords:
Big data
Cloud computing
Hadoop

abstract

Cloud computing is a powerful technology to perform massive-scale and complex computing. It eliminates the need to maintain expensive computing hardware, dedicated space, and software. Massive growth in the scale of data, or big data, generated through cloud computing has been observed. Addressing big data is a challenging and time-demanding task that requires a large computational infrastructure to ensure successful data processing and analysis. The rise of big data in cloud computing is reviewed in this study. The definition, characteristics, and classification of big data along with some discussions on cloud computing are introduced. The relationship between big data and cloud computing, big data storage systems, and Hadoop technology is also discussed. Furthermore, research challenges are investigated, with focus on scalability, availability, data integrity, data transformation, data quality, data heterogeneity, privacy, legal and regulatory issues, and governance. Lastly, open research issues that require substantial research efforts are summarized.

© 2014 Elsevier Ltd. All rights reserved.

Contents

1. Introduction
2. Definition and characteristics of big data
   2.1. Classification of big data
3. Cloud computing
4. Relationship between cloud computing and big data
5. Case studies
   5.1. Organization case studies from vendors
      5.1.1. SwiftKey
      5.1.2. 343 Industries
      5.1.3. redBus
      5.1.4. Nokia
      5.1.5. Alacer

* Corresponding author. Tel.: +60 173946811.
E-mail addresses: targio@siswa.um.edu.my (I.A.T. Hashem), ibraryaqoob@siswa.um.edu.my (I. Yaqoob), badrul@um.edu.my (N.B. Anuar), salimah@um.edu.my (S. Mokhtar), abdullah@um.edu.my (A. Gani), samee.khan@ndsu.edu (S. Ullah Khan).

http://dx.doi.org/10.1016/j.is.2014.07.006
0306-4379/© 2014 Elsevier Ltd. All rights reserved.

6. Big data storage system
7. Hadoop background
   7.1. MapReduce in clouds
8. Research challenges
   8.1. Scalability
   8.2. Availability
   8.3. Data integrity
   8.4. Transformation
   8.5. Data quality
   8.6. Heterogeneity
   8.7. Privacy
   8.8. Legal/regulatory issues
   8.9. Governance
9. Open research issues
   9.1. Data staging
   9.2. Distributed storage systems
   9.3. Data analysis
   9.4. Data security
10. Conclusion
Acknowledgment
References

1. Introduction

The continuous increase in the volume and detail of data captured by organizations, such as the rise of social media, the Internet of Things (IoT), and multimedia, has produced an overwhelming flow of data in either structured or unstructured format. Data creation is occurring at a record rate [1], referred to herein as big data, and has emerged as a widely recognized trend. Big data is eliciting attention from academia, government, and industry. Big data are characterized by three aspects: (a) the data are numerous, (b) the data cannot be categorized into regular relational databases, and (c) the data are generated, captured, and processed rapidly. Moreover, big data is transforming healthcare, science, engineering, finance, business, and eventually, society. Advancements in data storage and mining technologies allow for the preservation of increasing amounts of data, accompanied by a change in the nature of data held by organizations [2]. The rate at which new data are being generated is staggering [3]. A major challenge for researchers and practitioners is that this growth rate exceeds their ability to design appropriate cloud computing platforms for data analysis and update-intensive workloads.

Cloud computing is one of the most significant shifts in modern ICT and service for enterprise applications and has become a powerful architecture to perform large-scale and complex computing. The advantages of cloud computing include virtualized resources, parallel processing, security, and data service integration with scalable data storage. Cloud computing can not only minimize the cost and restriction for automation and computerization by individuals and enterprises but can also provide reduced infrastructure maintenance cost, efficient management, and user access [4]. As a result of these advantages, a number of applications that leverage various cloud platforms have been developed, resulting in a tremendous increase in the scale of data generated and consumed by such applications. Some of the first adopters of big data in cloud computing were users that deployed Hadoop clusters in highly scalable and elastic computing environments provided by vendors such as IBM, Microsoft Azure, and Amazon AWS [5]. Virtualization is one of the base technologies applicable to the implementation of cloud computing. The basis for many platform attributes required to access, store, analyze, and manage distributed computing components in a big data environment is achieved through virtualization. Virtualization is a process of resource sharing and isolation of underlying hardware to increase computer resource utilization, efficiency, and scalability.

The goal of this study is to conduct a comprehensive investigation of the status of big data in cloud computing environments and provide the definition, characteristics, and classification of big data along with some discussions on cloud computing. The relationship between big data and cloud computing, big data storage systems, and Hadoop technology is discussed. Furthermore, research challenges are discussed, with focus on scalability, availability, data integrity, data transformation, data quality, data heterogeneity, privacy, legal and regulatory issues, and governance. Several open research issues that require substantial research efforts are likewise summarized.

The rest of this paper is organized as follows. Section 2 presents the definition, characteristics, and classification of big data. Section 3 provides an overview of cloud computing. The relationship between cloud computing and big data is presented in Section 4. Section 5 presents case studies from vendors. Section 6 presents the storage systems of big data. Section 7 presents the Hadoop background and MapReduce. Several issues, research challenges, and studies that have been conducted in the domain of big data are reviewed in Section 8. Section 9 provides a summary of current open research issues, and Section 10 presents the conclusions. Table 1 shows the list of abbreviations used in the paper.

2. Definition and characteristics of big data

Big data is a term utilized to refer to the increase in the volume of data that are difficult to store, process, and analyze

Table 1
List of abbreviations.

Abbreviation   Full meaning
ACID           Atomicity, Consistency, Isolation, Durability
ASF            Apache Software Foundation
DAS            Direct Attached Storage
Doc            Document
DSMS           Data Stream Management System
EC2            Amazon Elastic Compute Cloud
GFS            Google File System
HDDs           Hard Disk Drives
HDFS           Hadoop Distributed File System
IaaS           Infrastructure as a Service
ICT            Information Communication Technology
IoT            Internet of Things
IT             Information Technology
JSON           JavaScript Object Notation
KV             Key Value
NAS            Network Attached Storage
NoSQL          Not Only SQL
OLM            Online Lazy Migration
PaaS           Platform as a Service
PDF            Portable Document Format
RDBMS          Relational Database Management System
SAN            Storage Area Network
SQL            Structured Query Language
SDLM           Scientific Data Lifecycle Management
S3             Simple Storage Service
SaaS           Software as a Service
URL            Uniform Resource Locator
XML            Extensible Markup Language

Fig. 1. Four Vs of big data.

through traditional database technologies. The nature of big data is indistinct and involves considerable processes to identify and translate the data into new insights. The term “big data” is relatively new in IT and business. However, several researchers and practitioners have utilized the term in previous literature. For instance, [6] referred to big data as a large volume of scientific data for visualization. Several definitions of big data currently exist. For instance, [7] defined big data as “the amount of data just beyond technology's capability to store, manage, and process efficiently.” Meanwhile, [8] and [9] defined big data as characterized by three Vs: volume, variety, and velocity. The terms volume, variety, and velocity were originally introduced by Gartner to describe the elements of big data challenges. IDC also defined big data technologies as “a new generation of technologies and architectures, designed to economically extract value from very large volumes of a wide variety of data, by enabling the high velocity capture, discovery, and/or analysis.” [10] specified that big data is not only characterized by the three Vs mentioned above but may also extend to four Vs, namely, volume, variety, velocity, and value (Fig. 1, Fig. 2). This 4V definition is widely recognized because it highlights the meaning and necessity of big data. The following definition is proposed based on the above-mentioned definitions and our observation and analysis of the essence of big data: big data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from large datasets that are diverse, complex, and of a massive scale.

(1) Volume refers to the amount of all types of data generated from different sources, which continues to expand. The benefit of gathering large amounts of data includes the creation of hidden information and patterns through data analysis. Laurila et al. [11] provided a unique collection of longitudinal data from smart mobile devices and made this collection available to the research community. This initiative, motivated by Nokia, is called the mobile data challenge [11]. Collecting longitudinal data requires considerable effort and underlying investments. Nevertheless, the mobile data challenge produced interesting results, such as in the examination of the predictability of human behavior patterns, means to share data based on human mobility, and visualization techniques for complex data.

(2) Variety refers to the different types of data collected via sensors, smartphones, or social networks. Such data types include video, image, text, audio, and data logs, in either structured or unstructured format. Most of the data generated from mobile applications are in unstructured format. For example, text messages, online games, blogs, and social media generate different types of unstructured data through mobile devices and sensors. Internet users also generate an extremely diverse set of structured and unstructured data [12].

(3) Velocity refers to the speed of data transfer. The contents of data constantly change because of the absorption of complementary data collections, the introduction of previously archived data or legacy collections, and streamed data arriving from multiple sources [9].

(4) Value is the most important aspect of big data; it refers to the process of discovering huge hidden values from large datasets of various types and rapid generation [13].

2.1. Classification of big data

Big data are classified into different categories to better understand their characteristics. Fig. 2 shows the numerous categories of big data. The classification is important because of the large scale of data in the cloud. The classification is based on five aspects: (i) data sources, (ii) content format, (iii) data stores, (iv) data staging, and (v) data processing.

Each of these categories has its own characteristics and complexities, as described in Table 2. Data sources include internet data, sensing, and all stores of transactional information, which range from unstructured to highly

Fig. 2. Big data classification. The figure groups big data along five aspects: data sources (web & social, machine, sensing, transactions, IoT); content format (structured, semi-structured, unstructured); data stores (document-oriented, column-oriented, graph-based, key-value); data staging (cleaning, normalization, transform); and data processing (batch, real time).
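The five-aspect classification above can be sketched as a small lookup structure for tagging incoming datasets. This is an illustrative helper of our own; the dictionary keys and the `classify` function are assumptions, not part of the paper.

```python
# Hypothetical sketch: the Fig. 2 taxonomy as a plain dictionary.
BIG_DATA_CLASSIFICATION = {
    "data_sources": ["web & social", "machine", "sensing", "transactions", "IoT"],
    "content_format": ["structured", "semi-structured", "unstructured"],
    "data_stores": ["document-oriented", "column-oriented", "graph-based", "key-value"],
    "data_staging": ["cleaning", "normalization", "transform"],
    "data_processing": ["batch", "real time"],
}

def classify(dataset_tags):
    """Map free-form tags onto the five classification aspects."""
    result = {}
    for aspect, values in BIG_DATA_CLASSIFICATION.items():
        matches = [t for t in dataset_tags if t in values]
        if matches:
            result[aspect] = matches
    return result
```

For example, a sensor stream of raw log lines would be classified under data sources (sensing), content format (unstructured), and data processing (real time).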

structured, and are stored in various formats. The most popular is the relational database, which comes in a large number of varieties [29]. As a result of the wide variety of data sources, the captured data differ in size with respect to redundancy, consistency, noise, etc.

3. Cloud computing

Cloud computing is a fast-growing technology that has established itself in the next generation of the IT industry and business. Cloud computing promises reliable software, hardware, and IaaS delivered over the Internet and remote data centers [30]. Cloud services have become a powerful architecture to perform complex large-scale computing tasks and span a range of IT functions from storage and computation to database and application services. The need to store, process, and analyze large amounts of datasets has driven many organizations and individuals to adopt cloud computing [31]. A large number of scientific applications for extensive experiments are currently deployed in the cloud and may continue to increase because of the lack of available computing facilities in local servers, reduced capital costs, and the increasing volume of data produced and consumed by the experiments [32]. In addition, cloud service providers have begun to integrate frameworks for parallel data processing in their services to help users access cloud resources and deploy their programs [33].

Cloud computing “is a model for allowing ubiquitous, convenient, and on-demand network access to a number of configured computing resources (e.g., networks, server, storage, application, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction” [34]. Cloud computing has a number of favorable aspects to address the rapid growth of economies and technological barriers. Cloud computing lowers the total cost of ownership and allows organizations to focus on the core business without worrying about issues such as infrastructure, flexibility, and availability of resources [35]. Moreover, combining the cloud computing utility model with a rich set of computation, infrastructure, and storage cloud services offers a highly attractive environment where scientists can perform their experiments [36]. Cloud service models typically consist of PaaS, SaaS, and IaaS.

• PaaS, such as Google's App Engine, the Salesforce.com Force platform, and Microsoft Azure, refers to different resources operating on a cloud to provide platform computing for end users.
• SaaS, such as Google Docs, Gmail, Salesforce.com, and Online Payroll, refers to applications operating on a remote cloud infrastructure offered by the cloud provider as services that can be accessed through the Internet [37].
• IaaS, such as Flexiscale and Amazon's EC2, refers to hardware equipment operating on a cloud, provided by service providers and used by end users upon demand.

The increasing popularity of wireless networks and mobile devices has taken cloud computing to new heights because of the limited processing capability, storage capacity, and battery lifetime of each device [126]. This condition has led to the emergence of the mobile cloud computing paradigm. Mobile cloud facilities allow users to outsource tasks to external service providers. For example, data can be processed and stored outside of a mobile device [38]. Mobile cloud applications, such as Gmail, iCloud, and Dropbox, have become prevalent recently. Juniper Research predicts that cloud-based mobile applications will grow to approximately $9.5 billion by 2014 [39]. Such applications improve mobile cloud performance and user experience. However, the limitations associated with wireless networks and the intrinsic nature of mobile devices have imposed computational and data storage restrictions [40,127].

Table 2
Various categories of big data.

Classification Description

Data sources
Social media Social media is the source of information generated via URL to share or exchange information and ideas in virtual communities
and networks, such as collaborative projects, blogs and microblogs, Facebook, and Twitter.
Machine-generated Machine data are information automatically generated from a hardware or software, such as computers, medical devices, or
data other machines, without human intervention.
Sensing Several sensing devices exist to measure physical quantities and change them into signals.
Transactions Transaction data, such as financial and work data, comprise an event that involves a time dimension to describe the data.
IoT IoT represents a set of objects that are uniquely identifiable as a part of the Internet. These objects include smartphones, digital
cameras, and tablets. When these devices connect with one another over the Internet, they enable more smart processes and
services that support basic, economic, environmental, and health needs. A large number of devices connected to the Internet
provides many types of services and produces huge amounts of data and information [14].

Content format
Structured Structured data are often managed using SQL, a programming language created for managing and querying data in an RDBMS.
Structured data are easy to input, query, store, and analyze. Examples of structured data include numbers, words, and dates.
Semi-structured Semi-structured data are data that do not follow a conventional database system. Semi-structured data may be in the form of
structured data that are not organized in relational database models, such as tables. Capturing semi-structured data for analysis
is different from capturing a fixed file format. Therefore, capturing semi-structured data requires the use of complex rules that
dynamically decide the next process after capturing the data [15].
Unstructured Unstructured data, such as text messages, location information, videos, and social media data, are data that do not follow a
specified format. Considering that the size of this type of data continues to increase through the use of smartphones, the need
to analyze and understand such data has become a challenge.

Data stores
Document-oriented Document-oriented data stores are mainly designed to store and retrieve collections of documents or information and support
complex data forms in several standard formats, such as JSON, XML, and binary forms (e.g., PDF and MS Word). A document-
oriented data store is similar to a record or row in a relational database but is more flexible and can retrieve documents based
on their contents (e.g., MongoDB, SimpleDB, and CouchDB).
Column-oriented A column-oriented database stores its content in columns rather than rows, with attribute values belonging to the same column
stored contiguously. This differs from classical database systems, which store entire rows one after the other
[16]; BigTable [17] is a well-known column-oriented store.
Graph database A graph database, such as Neo4j, is designed to store and represent data that utilize a graph model with nodes, edges, and
properties related to one another through relations [18].
Key-value Key-value stores are an alternative to relational database systems; they store and access data in a way designed to scale to a very large
size [19]. Dynamo [20] is a good example of a highly available key-value storage system; it is used by amazon.com in some of its services.
Similarly, [21] proposed G-Store, a scalable key-value store that supports transactional multi-key access on top of a single-key-access
key-value store. [22] presented a scalable clustering method to perform large tasks on datasets. Other examples of key-value stores
are Apache HBase [23], Apache Cassandra [24], and Voldemort. HBase, an open-source version of Google's BigTable, uses HDFS for
storage. HBase stores data into tables, rows, and cells. Rows are sorted by row key, and each cell in a table is specified by a row key,
a column key, and a version, with the content contained as an uninterpreted array of bytes.

Data staging
Cleaning Cleaning is the process of identifying incomplete and unreasonable data [25].
Transform Transform is the process of transforming data into a form suitable for analysis.
Normalization Normalization is the method of structuring database schema to minimize redundancy [26].

Data processing
Batch MapReduce-based systems have been adopted by many organizations in the past few years for long-running batch jobs [27].
Such systems allow for the scaling of applications across large clusters of machines comprising thousands of nodes.
Real time One of the most famous and powerful real-time big data processing tools is the Simple Scalable Streaming System (S4) [28]. S4 is
a distributed computing platform that allows programmers to conveniently develop applications for processing continuous
unbounded streams of data. S4 is a scalable, partially fault-tolerant, general-purpose, and pluggable platform.
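The HBase-style cell model described in Table 2 (each cell addressed by a row key, a column key, and a version, holding an uninterpreted array of bytes) can be illustrated with a minimal in-memory sketch. The class and method names below are our own for illustration, not an actual HBase API.

```python
class KVTable:
    """Toy key-value table: each cell is addressed by (row key, column key)
    and keeps every versioned write, mimicking the HBase cell model."""

    def __init__(self):
        self.cells = {}  # (row, column) -> list of (version, value-bytes)

    def put(self, row, column, version, value):
        # Append a new versioned value; old versions are retained.
        self.cells.setdefault((row, column), []).append((version, value))

    def get(self, row, column, version=None):
        # Return the latest version by default, or an exact version on request.
        versions = sorted(self.cells.get((row, column), []))
        if not versions:
            return None
        if version is None:
            return versions[-1][1]
        for v, value in versions:
            if v == version:
                return value
        return None

t = KVTable()
t.put("user42", "name", 1, b"Alice")
t.put("user42", "name", 2, b"Alicia")
```

A read of `("user42", "name")` without a version returns the latest write, while passing `version=1` retrieves the earlier cell, mirroring how versioned cells let a store keep history without in-place updates.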

4. Relationship between cloud computing and big data

Cloud computing and big data are conjoined. Big data provides users the ability to use commodity computing to process distributed queries across multiple datasets and return resultant sets in a timely manner. Cloud computing provides the underlying engine through the use of Hadoop, a class of distributed data-processing platforms. The use of cloud computing in big data is shown in Fig. 3. Large data sources from the cloud and Web are stored in a distributed fault-tolerant database and processed through a programming model for large datasets with a parallel distributed algorithm in a cluster. The main purpose of data visualization, as shown in Fig. 3, is to view analytical results presented visually through different graphs for decision making.

Big data utilizes distributed storage technology based on cloud computing rather than local storage attached to a computer or electronic device. Big data evaluation is driven by fast-growing cloud-based applications developed using

Fig. 3. Cloud computing usage in big data. The figure depicts a pipeline in which data sources from the Web and cloud APIs flow into storage on the Hadoop Distributed File System (HDFS), a distributed fault-tolerant database for large unstructured datasets (e.g., NoSQL), and a programming model for processing large datasets with a parallel, distributed algorithm on a cluster (e.g., MapReduce), coordinated by a distributed configuration and synchronization service; a query engine (e.g., Hive, Mahout) then feeds analytics/reports, data visualization, and decision making.

Table 3
Comparison of several big data cloud platforms.

                      Google                         Microsoft                      Amazon                       Cloudera
Big data storage      Google cloud services          Azure                          S3
MapReduce             AppEngine                      Hadoop on Azure                Elastic MapReduce (Hadoop)   MapReduce, YARN
Big data analytics    BigQuery                       Hadoop on Azure                Elastic MapReduce (Hadoop)   Elastic MapReduce (Hadoop)
Relational database   Cloud SQL                      SQL Azure                      MySQL or Oracle              MySQL, Oracle, PostgreSQL
NoSQL database        AppEngine Datastore            Table storage                  DynamoDB                     Apache Accumulo
Streaming processing  Search API                     StreamInsight                  Nothing prepackaged          Apache Spark
Machine learning      Prediction API                 Hadoop + Mahout                Hadoop + Mahout              Hadoop + Oryx
Data import           Network                        Network                        Network                      Network
Data sources          A few sample datasets          Windows Azure marketplace      Public Datasets              Public Datasets
Availability          Some services in private beta  Some services in private beta  Public production            Industries

virtualized technologies. Therefore, cloud computing not only provides facilities for the computation and processing of big data but also serves as a service model. Table 3 shows a comparison of several big data cloud providers.

Talia [41] discussed the complexity and variety of data types and the processing power required to perform analysis on large datasets. The author stated that cloud computing infrastructure can serve as an effective platform to address the data storage required to perform big data analysis. Cloud computing is correlated with a new pattern for the provision of computing infrastructure and a big data processing method for all types of resources available in the cloud through data analysis. Several cloud-based technologies have to cope with this new environment because dealing with big data for concurrent processing has become increasingly complicated [42]. MapReduce [43] is a good example of big data processing in a cloud environment; it allows for the processing of large amounts of datasets stored in parallel in the cluster. Cluster computing exhibits good performance in distributed system environments, such as computer power, storage, and network communications. Likewise, Bollier and Firestone [44] emphasized the ability of cluster computing to provide a hospitable context for data growth. However, Miller [45] argued that the lack of data availability is expensive because users offload more decisions to analytical methods; incorrect use of the methods or inherent weaknesses in the methods may produce wrong and costly decisions. DBMSs are considered a part of the current cloud computing architecture and play an important role in ensuring the easy transition of applications from old enterprise infrastructures to new cloud infrastructure architectures. The pressure on organizations to quickly adopt and implement technologies such as cloud computing to address the challenge of big data storage and processing demands entails unexpected risks and consequences.

Table 4 presents several related studies that deal with big data through the use of cloud computing technology. The table provides a general overview of big data and cloud computing technologies based on the area of study and the current challenges, techniques, and technologies that restrict big data and cloud computing.
104 I.A.T. Hashem et al. / Information Systems 47 (2015) 98–115

Table 4
Several related studies that deal with big data through the use of cloud computing technology.

Reference Title of paper Objectives

[46] “Data quality management, data usage experience and To propose a model for the acquisition intention of big data analytics
acquisition intention of big data analytics”
[47] “Big Data Analytics Framework for Peer-to-Peer Botnet To develop open-source tools, such as Hadoop, to provide a scalable
Detection Using Random Forests” implementation of a quasi-real-time intrusion detection system
[48] “MERRA Analytic Services: Meeting the Big Data Challenges of To address big data challenges in climate science
Climate Science through Cloud-enabled Climate Analytics-as-a-
Service”
[49] “System of Systems and Big Data Analytics – Bridging the Gap” To demonstrate the construction of a bridge between System of
Systems and Data Analytics to develop reliable models
[50] “Symbioses of Big Data and Cloud Computing: Opportunities & To highlight big data opportunity
Challenges”
[51] “A Special Issue of Journal of Parallel and Distributed Computing: To address special issues in big data management and analytics
Scalable Systems for Big Data Management and Analytics”
[52] “Smarter fraud investigations with big data analytics” To investigate smarter fraud with big data analytics
[53] Moving Big Data to the Cloud: An Online Cost-Minimizing To upload data into the cloud from different geographical locations
Approach with minimum cost of data migration. Two algorithms (OLM, RFHC)
are proposed. These algorithms provide optimization for data
aggregation and processing and a route for data.
[54] “Leveraging the capabilities of service-oriented decision To propose a framework for decision support systems in a cloud
support systems: putting analytics and big data in cloud”
[32] “Cloud Computing and Scientific Applications — Big Data, To review some of the papers published in Cloud Computing and
Scalable Analytics, and Beyond” Scientific Applications (CCSA2012) event
[41] “Clouds for Scalable Big Data Analytics” To discuss the use of cloud for scalable big data analytics
[55] “Cloud Computing Availability: Multi-clouds for Big Data To overcome the issue of single cloud
Service”
[56] “Adapting scientific computing problems to clouds using To review the challenges of reducing the number of iterative
MapReduce” algorithms in the MapReduce model
[57] “p-PIC: Parallel Power Iteration Clustering for Big Data” To explore different parallelization strategies
[58] “Cloud and heterogeneous computing solutions exist today for To review cloud and heterogeneous computing solutions existing
the emerging big data problems in biology” today for the emerging big data problem in biology

Table 5
Summary of Organization case studies from Vendors.

Case Business needs Cloud service models Big data solution Assessment Reference

SwiftKey Language technology IaaS Amazon Elastic MapReduce Success [59]


343 Industries Video game developer IaaS Apache Hadoop Success [60]
redBus Online travel agency IaaS, PaaS BigQuery Success [61]
Nokia Mobile communications IaaS Apache Hadoop, Enterprise Data Warehouse Success [62]
Alacer Big data solution IaaS Big data algorithms Success [63]

5. Case studies

Our discussion on the relationship between big data and cloud computing is complemented by reported case studies on big data using cloud computing technology. Our discussion of the case studies is divided into two parts. The first part describes a number of reported case studies provided by different vendors that integrate big data technologies into their cloud environments. The second part describes a number of case studies that have been published by scholarly/academic sources.

5.1. Organization case studies from vendors

Customer case studies from vendors, such as Google, Amazon, and Microsoft, were obtained. These case studies show the use of cloud computing technologies in big data analytics and in managing the increasing volume, variety, and velocity of digital information. We selected this collection of five cases because they demonstrate the extensive variety of research communities that use cloud computing. Table 5 summarizes the case studies of big data implemented by using existing cloud computing platforms.

5.1.1. A. SwiftKey
SwiftKey is a language technology company founded in London in 2008. Its language technology aids touchscreen typing by providing personalized predictions and corrections. The company collects and analyzes terabytes of data to create language models for many active users. Thus, the company needs a highly scalable, multilayered model system that can keep pace with steadily increasing demand and that has a powerful processing engine for the artificial intelligence technology used in prediction generation. To achieve its goals, the company uses Apache Hadoop

running on Amazon Simple Storage Service and Amazon Elastic Compute Cloud to manage the processing of multiple terabytes of data. By using this new solution, SwiftKey is able to scale its services on demand during peak times.

5.1.2. B. 343 Industries
Halo is a science fiction media franchise that has grown into a global entertainment phenomenon. More than 50 million copies of the Halo video games have been sold worldwide. Before launching Halo 4, the developers analyzed data to obtain insights into player preferences and online tournaments. To complete this task, the team used the Windows Azure HDInsight Service, which is based on the Apache Hadoop big data framework. The team was able to provide game statistics to tournament operators, which used the data to rank players based on game play, by using the HDInsight Service to process and analyze raw data from Windows Azure. The team also used the HDInsight Service to update Halo 4 every week and to support daily e-mail campaigns designed to increase player retention. Organizations can also utilize data to make prompt business decisions.

5.1.3. C. redBus
The online travel agency redBus introduced Internet bus ticketing in India in 2006, thus unifying tens of thousands of bus schedules into a single booking operation. The company needed a powerful tool to analyze inventory and booking data across its system of hundreds of bus operators serving more than 10,000 routes. The company considered using clusters of Hadoop servers to process the data but decided that such a system would take considerable time and resources to maintain. Furthermore, clusters of Hadoop servers would not provide the lightning-fast analysis needed by the company. Thus, redBus implemented Google BigQuery to analyze large datasets by using the Google data processing infrastructure. The insights rapidly gained through BigQuery have made redBus a stronger company. By minimizing the time needed for staff members to solve technical problems, BigQuery helps improve customer service and reduce lost sales.

5.1.4. D. Nokia
Nokia is a mobile communications company whose products have become an integral part of people's lives. Many people around the world use Nokia mobile phones to communicate, capture photos, and share experiences. Thus, Nokia gathers and analyzes large amounts of data from mobile phones. However, to support its extensive use of big data, Nokia relies on a technology ecosystem that includes a Teradata Enterprise Data Warehouse, numerous Oracle and MySQL data marts, visualization technologies, and Hadoop. Nokia has over 100 terabytes of structured data on Teradata and petabytes of multistructured data on the Hadoop Distributed File System (HDFS). The HDFS data warehouse allows the storage of all semi/multistructured data and offers data processing at the petabyte scale.

5.1.5. E. Alacer
An online retailer was experiencing revenue leakage because of unreliable real-time notifications of service problems within its cloud-based e-commerce platform. Alacer used big data algorithms to create a cloud monitoring system that delivers reactive and proactive notifications. By using cloud computing with Alacer's monitoring platform, the incident response time was reduced from one hour to seconds, thus dramatically improving customer satisfaction and eliminating service level agreement penalties.

5.1.5.1. Case studies from scholarly/academic sources. The following case studies provide recent examples of how researchers have used cloud computing technology for their big data projects. Table 6 details the case studies that explored the use of the cloud for big data.

5.1.5.2. Case study 1: cloud computing in genome informatics. Reid et al. [64] investigated the growth of next-generation sequencing data in laboratories and hospitals. This growth has shifted the bottleneck in clinical genetics from DNA sequence production to DNA sequence analysis. However, accurate and reproducible genomic results should be provided at a scale ranging from individuals to large cohorts. The authors developed the Mercury analysis pipeline and deployed it in the Amazon Web Services cloud via the DNAnexus platform. Thus, they established a powerful combination of a robust and fully validated software pipeline and a scalable computational resource that has been applied to more than 10,000 whole genome and whole exome samples.

5.1.5.3. Case study 2: mining Twitter in the cloud. Noordhuis et al. [65] used cloud computing to analyze large amounts of Twitter data. The authors applied the PageRank algorithm to the Twitter user base to obtain user rankings. The Amazon cloud infrastructure was used to host all related computations. Computations were conducted in a two-phase process: in the crawling phase, all data were retrieved from Twitter; in the processing phase, the PageRank algorithm was applied to the acquired data. During the crawling stage, the authors crawled a graph containing 50 million nodes and 1.8 billion edges, which is approximately two-thirds of the estimated user base of Twitter. Thus, a relatively cheap solution for data acquisition and analysis was implemented by using the Amazon cloud infrastructure.

5.1.5.4. Case study 3: scientific data processing. Zhang et al. [66] developed a Hadoop-based cloud computing application that processes sequences of microscopic images of live cells by using MATLAB. The project was a collaboration between groups at Genome Quebec/McGill University in Montreal and at the University of Waterloo. The goal was to study the complex molecular interactions that regulate biological systems. The application, which was built on the basis of Hadoop, allows users to submit data processing jobs in the cloud. The authors used a homogeneous cluster to conduct initial system development and proof-of-concept tests. The cluster comprises 21 Sun Fire X4100 servers with

Table 6
Summary of case studies from scholarly/academic sources.

Case Situation/context Objective Approach Result

1 Massively parallel DNA To provide accurate and Develop a Mercury analysis Established a powerful
sequencing generates reproducible genomic results at pipeline and deploy it in the combination of a robust and
staggering amounts of data. a scale ranging from individuals Amazon web service cloud via fully validated software
to large cohorts. the DNAnexus platform. pipeline and a scalable
computational resource that
have been applied to more
than 10,000 whole genome
and whole exome samples.
2 Given that conducting analyses To use cloud services as a Use PageRank algorithm on the Implemented a relatively
on large social networks such possible solution for the Twitter user base to obtain user cheap solution for data
as Twitter requires analysis of large amounts of rankings. Use the Amazon acquisition and analysis by
considerable resources because data. cloud infrastructure to host all using the Amazon cloud
of the large amounts of data related computations. infrastructure.
involved, such activities are
usually expensive.
3 To study the complex To develop a Hadoop-based Use Hadoop cloud computing Allows users to submit data
molecular interactions that cloud computing application framework. processing jobs in the cloud
regulate biological systems. that processes sequences of
microscopic images of live cells.
4 Applications running on cloud To design failure scenarios Create a series of failure Helps to identify failure
computing may fail. scenarios on an Amazon cloud vulnerabilities in Hadoop
computing platform. applications running in the cloud.

two dual-core AMD Opteron 280 CPUs interconnected by gigabit Ethernet.

5.1.5.5. Case study 4: failure scenario as a service (FSaaS) for Hadoop clusters. Faghri et al. [67] created a series of failure scenarios on an Amazon cloud computing platform to provide Hadoop services with the means to test their applications against the risk of massive failure. They developed a set of failure scenarios for Hadoop clusters running on 10 Amazon Web Services EC2 machines. The types of failures that can occur inside Hadoop jobs include CPU-intensive, I/O-intensive, and network-intensive failures. Running such scenarios against Hadoop applications can thus help identify failure vulnerabilities in these applications.

6. Big data storage system

The rapid growth of data has restricted the capability of existing storage technologies to store and manage data. Over the past few years, traditional storage systems have been utilized to store data through structured RDBMSs [13]. However, most storage systems have limitations and are inapplicable to the storage and management of big data. A storage architecture that can be accessed in a highly efficient manner while achieving availability and reliability is required to store and manage large datasets. The storage media currently employed in enterprises are discussed and compared in Table 7.

Several storage technologies have been developed to meet the demands of massive data. Existing technologies can be classified as direct attached storage (DAS), network attached storage (NAS), and storage area network (SAN). In DAS, various hard disk drives (HDDs) are directly connected to the servers. Each HDD receives a certain amount of input/output (I/O) resources, which is managed by individual applications. Therefore, DAS is suitable only for servers that are interconnected on a small scale. Given this low scalability, storage capacity is increased, but expandability and upgradeability are limited significantly.

NAS is a storage device that supports a network. NAS is connected directly to a network through a switch or hub via TCP/IP protocols. In NAS, data are transferred as files. Given that the NAS server can indirectly access a storage device through networks, the I/O burden on a NAS server is significantly lighter than that on a DAS server. NAS is oriented toward networks, particularly scalable and bandwidth-intensive networks, including high-speed networks of optical-fiber connections. The SAN system of data storage is independent with respect to storage on the local area network (LAN). Multipath data switching is conducted among internal nodes to maximize data management and sharing. The organizational systems of data storage (DAS, NAS, and SAN) can be divided into three parts: (i) the disc array, the foundation of a storage system, which provides the fundamental guarantee; (ii) the connection and network subsystems, which connect one or more disc arrays and servers; and (iii) storage management software, which oversees data sharing, storage management, and disaster recovery tasks for multiple servers.

7. Hadoop background

Hadoop [73] is an open-source Apache Software Foundation project written in Java that enables the distributed processing of large datasets across clusters of commodity hardware. Hadoop has two primary components, namely, the HDFS and the MapReduce programming framework. The most significant feature of Hadoop is that HDFS and MapReduce are closely related to each other; they are co-deployed such that a single cluster is produced [73]. Therefore, the storage system is not physically separated from the processing system.

Table 7
Comparison of storage media.

Storage type Specific use Advantages Limitations Reference

Hard drives To store data up to four Density, cost per bit storage, and speedy Require special cooling and [68]
terabytes start up that may only take several seconds high read latency time; the
spinning of the platters can
sometimes result in vibration
and produce more heat than
solid state memory
Solid-state To store data up to two Fast access to data, fast movement of huge Ten times more expensive than [69]
memory terabytes quantities of data, start-up time only takes hard drives in terms of per
several milliseconds, no vibration, and gigabyte capacity
produces less heat than hard drives
Object storage To store data as Scales with ease to find information and Complexity in tracking indices. [70]
variable-size objects has a unique identifier to identify data
rather than fixed-size objects; ensures security because
blocks information on physical location cannot be
obtained from disk drives; supports
indexing access
Optical storage To store data at Least expensive removable storage medium Complex; its ability to produce [71]
different angles multiple optical disks in a
throughout the storage single unit is yet to be proven
medium
Cloud storage To serve as a Useful for small organizations that do not Security is the primary [72]
provisioning and have sufficient storage capacity; cloud challenge because of data
storage model and storage can store large amounts of data, but outsourcing
provide on-demand its services are billable
access to services, such
as storage

Table 8
Summary of the process of the map/reduce function.

Mapper: (key1, value1) → List [(key2, value2)]
Reducer: [key2, list (value2)] → List [(key3, value3)]

HDFS [74] is a distributed file system designed to run on top of the local file systems of the cluster nodes and to store extremely large files suitable for streaming data access. HDFS is highly fault tolerant and can scale up from a single server to thousands of machines, each offering local computation and storage. HDFS consists of two types of nodes, namely, a namenode called the “master” and several datanodes called “slaves.” HDFS can also include secondary namenodes. The namenode manages the file system hierarchy and the directory namespace (i.e., metadata). Files are presented on the namenode, which registers attributes, such as access time, modification, permission, and disk space quotas. The file content is split into large blocks, and each block of the file is independently replicated across datanodes for redundancy; the datanodes periodically send reports of all existing blocks to the namenode.

MapReduce [43] is a simplified programming model for processing large numbers of datasets, pioneered by Google for data-intensive applications. The MapReduce model was developed based on GFS [75] and is adopted through the open-source Hadoop implementation, which was popularized by Yahoo. Apart from the MapReduce framework, several other current open-source Apache projects are related to the Hadoop ecosystem, including Hive, Hbase, Mahout, Pig, Zookeeper, Spark, and Avro. Twister [76] provides support for efficient and iterative MapReduce computations. An overview of current MapReduce projects and related software is shown in Table 9. MapReduce allows an inexperienced programmer to develop parallel programs and create a program capable of using computers in a cloud. In most cases, programmers are required to specify two functions only: the map function (mapper) and the reduce function (reducer) commonly utilized in functional programming. The mapper regards the key/value pair as input and generates intermediate key/value pairs. The reducer merges all the pairs associated with the same (intermediate) key and then generates an output. Table 8 summarizes the process of the map/reduce function.

The map function is applied to each input (key1, value1), where the input domain is different from that of the generated output pairs list (key2, value2). The elements of the list (key2, value2) are then grouped by key. After grouping, the list (key2, value2) is divided into several lists [key2, list (value2)], and the reduce function is applied to each [key2, list (value2)] to generate a final result list (key3, value3).

7.1. MapReduce in clouds

MapReduce accelerates the processing of large amounts of data in a cloud; thus, MapReduce is the preferred computation model of cloud providers [86]. MapReduce is a popular cloud computing framework that robotically performs scalable distributed applications [56] and provides an interface that allows for parallelization and distributed computing in a cluster of servers [12]. Srirama

Table 9
Current MapReduce projects and related software.

Reference Software Brief description

[77] Hive Hive offers a warehouse structure in HDFS


[78] Hbase Scalable distributed database that supports structured data storage for large tables
[79] MahoutTM Mahout is a machine-learning and data-mining library that has four main groups: collaborative filtering,
categorization, clustering, and parallel frequent pattern mining; compared with other pre-existing
algorithms, the Mahout library belongs to the subset that can be executed in a distributed mode and is
executable by MapReduce
[80] Pig Pig framework involves a high-level scripting language (Pig Latin) and offers a run-time platform that
allows users to execute MapReduce on Hadoop
[81] ZookeeperTM High-performance service to coordinate the processes of distributed applications; ZooKeeper allows
distributed processes to manage and contribute to one another through a shared hierarchical namespace of
data registers (z-nodes) similar to a file system; ZooKeeper is a distributed service with master and slave
nodes and stores configuration information
[82] SparkTM A fast and general computation engine for Hadoop data
[83] Chukwa Chukwa has just passed its development stage; it is a data collection and analysis framework incorporated
with MapReduce and HDFS; the workflow of Chukwa allows for data collection from distributed systems,
data processing, and data storage in Hadoop; as an independent module, Chukwa is included in the Apache
Hadoop distribution
[76] TwisterTM Provides support for iterative MapReduce computations; Twister is considerably faster than Hadoop
MapR Comprehensive distribution for processing with Apache Hadoop and Hbase
YARN A new Apache–Hadoop–MapReduce framework
[84] Cassandra A scalable multi-master database with no single point of failure
[85] Avro The tasks performed by Avro include data serialization, remote procedure calls, and data passing from one
program or language to another; in the Avro framework, data are self-describing and are always stored with
their own schema; this software is suitable for application to scripting language, such as Pig, because of
these qualities.

Table 10
Summary of several SQL interfaces in the MapReduce framework in related literature.

Author(s) Title of paper Result/techniques/algorithm Objective/description

[89] “Jaql: A scripting language for large scale Jaql Declarative query language designed for
semi-structured data analysis” JavaScript Object Notation
[90] “Tenzing an SQL implementation in the Tenzing An SQL query execution engine
MapReduce framework”
[91] “HadoopDB: an architectural hybrid of HadoopDB Comparison between Hadoop
MapReduce and DBMS technologies for implementation of MapReduce
analytical workloads” framework and parallel SQL database
management systems
[92] “SQL/MapReduce: A practical approach to SQL/MapReduce Provides a parallel computation of
self-describing, polymorphic, and procedural functions across hundreds of
parallelizable user-defined functions” servers working together as a single
relational database
[77] “Hive - A Warehousing Solution Over a Data summarization and ad hoc Presents an open-source warehouse
Map-Reduce Framework” querying Hive solution built on top of Hadoop
[80] “Pig latin: a not-so-foreign language for Pig Latin The software takes a middle position
data processing” between expressing tasks using the
high-level declarative querying model
in the spirit of SQL and the low-level/
procedural programming model using
MapReduce
[93] “Interpreting the data: Parallel analysis Sawzall Sawzall defines the operations to be
with Sawzall” performed in a single record of the data
used at Google on top of MapReduce

et al. [56] presented an approach for adapting scientific computing problems to the MapReduce framework, whereby scientists can efficiently utilize existing resources in the cloud to solve large-scale scientific computing problems. Currently, many alternative solutions are available to deploy MapReduce in cloud environments; these solutions include using cloud MapReduce runtimes that maximize cloud infrastructure services, using MapReduce as a service, or setting up one's own MapReduce cluster in cloud instances [87]. Several strategies have been proposed to improve the performance of big data processing. Moreover, effort has been exerted to develop SQL interfaces in the MapReduce framework to assist programmers who prefer to use SQL as a high-level language to express their tasks while leaving all of the execution optimization details to the backend engine [88]. Table 10 shows a summary of several SQL interfaces in the MapReduce framework available in the existing literature.
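To make the mapper/reducer contract summarized in Table 8 concrete, the following minimal Python sketch simulates a MapReduce job that evaluates a SQL-style aggregation (SELECT word, COUNT(*) ... GROUP BY word) as a word count. This is an illustrative, single-process simulation of the programming model, not code for any of the engines listed in Table 10, and all function and variable names are ours.

```python
from itertools import groupby
from operator import itemgetter

# Mapper: (key1, value1) -> list of (key2, value2) pairs.
def mapper(doc_id, text):
    return [(word, 1) for word in text.split()]

# Reducer: (key2, list(value2)) -> list of (key3, value3) pairs.
def reducer(word, counts):
    return [(word, sum(counts))]

def run_mapreduce(inputs):
    # Map phase: apply the mapper to every input record.
    intermediate = [pair for k, v in inputs for pair in mapper(k, v)]
    # Shuffle phase: group the intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))
    grouped = [(k, [v for _, v in g])
               for k, g in groupby(intermediate, key=itemgetter(0))]
    # Reduce phase: apply the reducer to each [key2, list(value2)] group.
    return [out for k, vs in grouped for out in reducer(k, vs)]

result = run_mapreduce([(1, "big data in the cloud"), (2, "big data analytics")])
# result contains, e.g., ("big", 2) and ("data", 2)
```

In a real cluster, the map and shuffle phases run in parallel on many nodes over HDFS blocks; the simulation above only mirrors the data flow of the model.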

8. Research challenges

Although cloud computing has been broadly accepted by many organizations, research on big data in the cloud remains in its early stages. Several existing issues have not been fully addressed. Moreover, new challenges continue to emerge from the applications of organizations. In the subsequent sections, some of the key research challenges, such as scalability, availability, data integrity, data transformation, data quality, data heterogeneity, privacy and legal issues, and regulatory governance, are discussed.

8.1. Scalability

Scalability is the ability of the storage to handle increasing amounts of data in an appropriate manner. Scalable distributed data storage systems have been a critical part of cloud computing infrastructures [34]. The lack of cloud computing features to support RDBMSs associated with enterprise solutions has made RDBMSs less attractive for the deployment of large-scale applications in the cloud. This drawback has resulted in the popularity of NoSQL [94].

A NoSQL database provides the mechanism to store and retrieve large volumes of distributed data. The features of NoSQL databases include schema-free design, easy replication support, a simple API, and consistent and flexible modes. Different types of NoSQL databases, such as key-value [21], column-oriented, and document-oriented databases, provide support for big data. Table 11 shows a comparison of various NoSQL database technologies that provide support for large datasets.

The characteristics of scalable data storage in a cloud environment are shown in Table 12. Yan et al. [57] attempted to extend the data scalability of power iteration clustering (PIC) by implementing parallel power iteration clustering (p-PIC). The implementation considers two key components, namely, similarity matrix calculation and normalization, and iterative matrix–vector multiplication. The process begins with the master processor indicating the beginning and ending indices for the remote data chunk. Each processor then reads data from the input file and computes a similarity sub-matrix by performing the following calculation:

A_i(r, c) = (x_r · x_c) / (‖x_r‖₂ ‖x_c‖₂), where r ≠ c // from the input [57]
A_i(r, :) = A_i(r, :) / ∑_c A_i(r, c) // normalizes by row sum [57]

The master processor collects all row sums from the other processors and concatenates them into an overall row sum. Each processor that interacts with the main processor updates its vector by performing matrix–vector multiplication.

Wang et al. [95] proposed a new scalable data cube analysis technique called HaCube in big data clusters to overcome the challenges of large-scale data. HaCube is an extension of MapReduce; it incorporates some of MapReduce's features, such as scalability, together with features of parallel DBMSs. The experimental results provided in the study indicated that HaCube performs at least 1.6× to 2.8× faster than Hadoop in terms of view maintenance. However, some improvements in performance, such as integrating more techniques from DBMSs (e.g., indexing techniques), are still required.

8.2. Availability

Availability refers to the resources of the system being accessible on demand by an authorized individual [98]. In a cloud environment, one of the main issues concerning cloud service providers is the availability of the data stored in the cloud. For example, one of the pressing demands on cloud service providers is to effectively serve the needs of the mobile user who requires single or multiple data

Table 11
Comparison of NoSQL databases.

Feature/ NoSQL database name


capability
DynamoDB Redis Voldemort Cassandra Hbase MongoDB SimpleDB CouchDB BigTable Apache
Jackrabbit

Storage type KV KV KV KV KV Doc Doc & KV Doc CO Doc


Initial release 2012 2009 2009 2008 2010 2009 2007 2005 2005 2010
Consistency N/A ✓ N/A N/A ✓ ✓ N/A N/A ✓ ✓
Partition N/A ✓ ✓ ✓ ✓ ✓ N/A ✓ ✓ N/A
Tolerance
Persistence ✓ ✓ ✓ ✓ ✓ ✓ ✓ N/A ✓ ✓
High ✓ ✓ ✓ ✓ ✓ ✓ ✓ N/A ✓ ✓
Availability
Durability ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Scalability High High High High High High High High High High
Performance High High High High High High High High High High
Schema-free ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
Programming Java ANSI C Java Java Java C++ Erlang Erlang C, C++ Java
Language
Platform Linux Windows, Windows, Windows, Windows, Windows, Windows, Windows, Windows, Windows,
Linux, OS X Linux, OS X Linux, OS X Linux, OS X Linux, OS X Linux, OS X Linux, OS X Linux, OS X Linux, OS X
Open Source ✕ ✓ ✓ ✓ ✓ ✓ ✕ ✓ ✕ ✓
Developer Amazon Salvatore LinkedIn ASF ASF 10gen Amazon ASF Google Apache
Sanfilippo

ASF = Apache Software Foundation, Doc = Document, KV = Key-Value, N/A = No Answer, ✓ = Support, ✕ = Not support.
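To illustrate the schema-free key-value model shared by several of the systems compared in Table 11, the following toy Python sketch shows the core idea: opaque values stored under unique keys, with keys hashed across nodes for horizontal scaling. The class name and node count are illustrative only; production systems add replication, consistent hashing, and durable storage.

```python
import hashlib

class ToyKeyValueStore:
    """Minimal sharded key-value store: each key is hashed to one of N nodes."""

    def __init__(self, num_nodes=3):
        self.nodes = [{} for _ in range(num_nodes)]

    def _node_for(self, key):
        # A stable hash ensures the same key always maps to the same shard.
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        # Values are opaque to the store: no schema is enforced on them.
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)

store = ToyKeyValueStore()
store.put("user:42", {"name": "Ada", "visits": 7})
store.put("session:9", "opaque-blob")
value = store.get("user:42")
```

Adding capacity in this model means adding shards and redistributing keys, which is why key-value stores scale horizontally far more easily than a relational schema with cross-table constraints.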

Table 12
Characteristics of scalable data storage in a cloud environment.

Reference Characteristic Advantage Disadvantage

[96] DBMS Faster data access Less attractive for the deployment of large-scale data
Faster processing Limited
[20] Key Value Scales to a very large size
Limitless
[97] Google file system (GFS) Scalable distributed file system for large distributed Garbage collection could become a problem
data-intensive applications Performance might degrade if the number of writers
Delivers high aggregate performance and random writers increases
File data is stored in different chunk servers
[74] Hadoop distributed file Stores large amounts of datasets
system (HDFS) Uses a large cluster
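The per-processor similarity and normalization steps of p-PIC [57], discussed in Section 8.1, can be sketched in plain, single-process Python. This is a simplified illustration, not the parallel implementation: cosine similarity is assumed as the pairwise measure, the distributed data partitioning is omitted, and all function names are ours.

```python
import math

def similarity_matrix(points):
    """Cosine similarity A(r, c) = (x_r . x_c) / (||x_r||2 ||x_c||2), with A(r, r) = 0."""
    n = len(points)
    norms = [math.sqrt(sum(v * v for v in x)) for x in points]
    A = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            if r != c:
                dot = sum(a * b for a, b in zip(points[r], points[c]))
                A[r][c] = dot / (norms[r] * norms[c])
    return A

def row_normalize(A):
    """A(r, :) = A(r, :) / sum_c A(r, c), so that each row sums to 1."""
    return [[v / sum(row) for v in row] for row in A]

def power_iteration_step(A, vec):
    """One matrix-vector multiplication, the per-iteration work of each processor."""
    return [sum(a * v for a, v in zip(row, vec)) for row in A]

A = row_normalize(similarity_matrix([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]))
vec = power_iteration_step(A, [1 / 3] * 3)
```

In the parallel version, each processor computes only its assigned rows of A and its slice of the matrix-vector product, with the master collecting and concatenating the row sums between phases.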

Fig. 4. Transforming big data for analysis.

within a short amount of time. Therefore, services must remain operational even in the case of a security breach [98]. In addition, with the increasing number of cloud users, cloud service providers must address the issue of making the requested data available to users to deliver high-quality services. Lee et al. [55] introduced a multi-cloud model called "rain clouds" to support big data exploitation. "Rain clouds" involves cooperation among single clouds to provide accessible resources in an emergency. Schroeck et al. [99] predicted that the demand for more real-time access to data may continue to increase as business models evolve and organizations invest in technologies required for streaming data and smartphones.

8.3. Data integrity

A key aspect of big data security is integrity. Integrity means that data can be modified only by authorized parties or the data owner to prevent misuse. The proliferation of cloud-based applications provides users the opportunity to store and manage their data in cloud data centers. Such applications must ensure data integrity. However, one of the main challenges that must be addressed is ensuring the correctness of user data in the cloud. Given that users may not be able to physically access the data, the cloud should provide a mechanism for the user to check whether the data is maintained [100].

8.4. Transformation

Transforming data into a form suitable for analysis is an obstacle in the adoption of big data [101]. Owing to the variety of data formats, big data can be transformed into an analysis workflow in two ways, as shown in Fig. 4. In the case of structured data, the data are pre-processed before they are stored in relational databases to meet the constraints of schema-on-write; the data can then be retrieved for analysis. Unstructured data, by contrast, must first be stored in distributed databases, such as HBase, before they are processed for analysis; unstructured data are retrieved from distributed databases after meeting the schema-on-read constraints.

8.5. Data quality

In the past, data processing was typically performed on clean datasets from well-known and limited sources; therefore, the results were accurate [102]. However, with the emergence of big data, data originate from many different sources, and not all of these sources are well-known or verifiable. Poor data quality has become a serious problem for many cloud service providers because data are often collected from different sources. For example, huge amounts of data are generated from smartphones, where inconsistent data formats can be produced as a result of heterogeneous sources. The data quality problem is usually defined as "any difficulty encountered along one or more quality dimensions that render data completely or largely unfit for use" [103]. Therefore, obtaining high-quality data from vast collections of data sources is a challenge. High-quality data in the cloud are characterized by data consistency: if data from new sources are consistent with data from other sources, then the new data are of high quality [104].
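The consistency criterion used above to characterize data quality can be made concrete with a small sketch. The field names and the notion of "consistent" (identical field sets across sources) are simplifying assumptions for illustration, not a definition from the cited work:

```python
# Sketch of the data-quality rule above: a record arriving from a new
# source counts as high quality only if its format is consistent with
# the records already collected from other sources.
def is_consistent(new_record: dict, existing_records: list) -> bool:
    # "Consistent" is simplified to: same set of fields as every record
    # already stored. Real pipelines would also compare types, units,
    # and value ranges.
    expected = set(existing_records[0])
    return (all(set(r) == expected for r in existing_records)
            and set(new_record) == expected)

stored = [{"id": 1, "temp_c": 21.5}, {"id": 2, "temp_c": 19.0}]
print(is_consistent({"id": 3, "temp_c": 22.1}, stored))  # True
print(is_consistent({"id": 4, "temp_f": 71.8}, stored))  # False: inconsistent format
```

In the second call the new source reports temperature in a differently named field, the kind of format inconsistency that heterogeneous sources such as smartphones produce.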

8.6. Heterogeneity

Variety, one of the major aspects of big data characterization, is the result of the growth of virtually unlimited different sources of data. This growth leads to the heterogeneous nature of big data. Data from multiple sources are generally of different types and representation forms and significantly interconnected; they have incompatible formats and are inconsistently represented [105].

In a cloud environment, users can store data in structured, semi-structured, or unstructured format. Structured data formats are appropriate for today's database systems, whereas semi-structured data formats are appropriate only to some extent. Unstructured data are inappropriate [105] because they have a complex format that is difficult to represent in rows and columns. According to Kocarev and Jakimoski [110], the challenge is how to handle multiple data sources and types.

8.7. Privacy

Privacy concerns continue to hamper users who outsource their private data into cloud storage. This concern has become serious with the development of big data mining and analytics, which require personal information to produce relevant results, such as personalized and location-based services [105]. Information on individuals is exposed to scrutiny, a condition that gives rise to concerns on profiling, stealing, and loss of control [106].

Currently, encryption is utilized by most researchers to ensure data privacy in the cloud [107,108]. Encryption algorithms are usually written in the form of transformations, such as Y = EZ(X) [109], where X refers to the plaintext, Y is the cryptogram, and Z is the secret key. Encryption algorithms have a special case called block algorithms, as proposed by Kocarev and Jakimoski [110], where EZ is defined as fZ: X → X, with X = [0, 1, …, 2^m − 1] and m = 64. Xuyun et al. [111] discussed the problem of preserving the privacy of intermediate datasets in cloud computing; they argued that encrypting all intermediate datasets in the cloud is neither computationally effective nor cost effective because much time is required to encrypt or decrypt data. The researchers also performed experiments to reduce the cost of encryption by investigating which parts of the intermediate datasets must be encrypted and which must not.

Fan and Huang [112] proposed a variant of symmetric predicate encryption in cloud storage to control privacy and preserve search-based functionalities, such as un-decrypt and revocable delegated search. Therefore, controlling the lifetime and search privileges of cloud data could become easy for the owner of the cloud storage. Li et al. [113] proposed a flexible multi-keyword query scheme (MKQE) that significantly reduces the maintenance overhead during keyword dictionary expansion. MKQE considers the keyword weights and user access history to generate query results. MKQE improves the performance of multi-keyword ranked query over encrypted data to prevent information leakage and solve the data indexing problem. Squicciarini et al. [114] presented a three-tier data protection architecture to provide multiple levels of privacy to cloud users. Bhagat et al. [115] investigated the issue of social networks, such as Facebook and Twitter, in which users share sensitive information over the Internet. They presented a method to deal with privacy leakages of an anonymous user's information. Itani et al. [116] presented a privacy-as-a-service model that involves a set of security protocols to ensure the confidentiality of customer data in the cloud.

Agarwal and Aggarwal [117] proposed a privacy measure based on differential entropy. The differential entropy h(A) of a random variable A is defined as follows [119]:

h(A) = −∫_{Ω_A} f_A(a) log₂ f_A(a) da,

where Ω_A is the domain of A. Differential entropy is a measure of the uncertainty inherent in the value of A; a random variable distributed uniformly between 0 and a has h(A) = log₂ a. Therefore, a random variable with less uncertainty than a variable uniform on [0, 1] has negative differential entropy, whereas one with more uncertainty has positive differential entropy. An overview of privacy preservation studies and their proposed solutions, techniques, and limitations is presented in Table 13.

8.8. Legal/regulatory issues

Specific laws and regulations must be established to preserve the personal and sensitive information of users. Different countries have different laws and regulations to achieve data privacy and protection. In several countries, monitoring of company staff communications is not allowed. However, electronic monitoring is permitted under special circumstances [120]. Therefore, the question is whether such laws and regulations offer adequate

Table 13
Overview of privacy preservation in a cloud.

[117] Proposed solution: reconstruction algorithm for privacy-preserving data mining. Technique: expectation–maximization algorithm. Description: measurement of privacy preservation. Limitation: efficiency of randomization.
[114] Proposed solution: three-tier data protection architecture. Technique: portable data binding. Description: addresses the issue of privacy caused by data indexing. Limitation: protection from malicious attempts.
[118] Proposed solution: privacy-preserving layer (PPL) over a MapReduce framework. Description: ensures data privacy preservation before data are further processed by MapReduce subsequence tasks. Limitation: integration with other data processing.
[111] Proposed solution: upper bound privacy leakage constraint-based approach. Technique: privacy-preserving cost-reducing heuristic algorithm. Description: identifies which intermediate datasets need to be encrypted. Limitation: efficiency of the proposed technique.
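The differential-entropy privacy measure of [117], defined in Section 8.7, can be checked numerically. For a variable distributed uniformly on [0, a] the density is f(x) = 1/a, so h(A) = −∫ (1/a) log₂(1/a) dx over [0, a] = log₂ a. The quadrature step count below is an illustrative choice, not part of the cited method:

```python
import math

def differential_entropy_uniform(a: float, steps: int = 100000) -> float:
    # h(A) = -∫ f(x) log2 f(x) dx over [0, a], with constant density
    # f(x) = 1/a, approximated by a simple rectangle-rule quadrature.
    f = 1.0 / a
    dx = a / steps
    return sum(-f * math.log2(f) * dx for _ in range(steps))

# Closed form for the uniform distribution on [0, a] is log2(a):
print(round(differential_entropy_uniform(8.0), 6))  # 3.0
print(round(differential_entropy_uniform(1.0), 6))  # 0.0 (uniform on [0, 1])
print(differential_entropy_uniform(0.5) < 0)        # True: negative entropy
```

The last line illustrates the sign convention in the text: a variable concentrated on [0, 0.5] has less uncertainty than one uniform on [0, 1], hence negative differential entropy.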

protection for individuals' data while enjoying the many benefits of big data in the society at large [2].

8.9. Governance

Data governance embodies the exercise of control and authority over data-related rules of law, transparency, and accountabilities of individuals and information systems to achieve business objectives [121]. The key issues of big data in cloud governance pertain to applications that consume massive amounts of data streamed from external sources [122]. Therefore, a clear and acceptable data policy with regard to the type of data that need to be stored, how quickly an individual needs to access the data, and how to access the data must be defined [50].

Big data governance involves leveraging information by aligning the objectives of multiple functions, such as telecommunication carriers having access to vast troves of customer information in the form of call detail records and marketing seeking to monetize this information by selling it to third parties [123].

Moreover, big data provides significant opportunities to service providers by making information more valuable. However, policies, principles, and frameworks that strike a balance between risk and value in the face of increasing data size and deliver better and faster data management technology can create huge challenges [124].

Cloud governance recommends the use of various policies together with different models of constraints that limit access to underlying resources. Therefore, adopting governance practices that maintain a balance between risk exposure and value creation is a new organizational imperative to unlock competitive advantages and maximize value from the application of big data in the cloud [124].

9. Open research issues

Numerous studies have addressed a number of significant problems and issues pertaining to the storage and processing of big data in clouds. The amount of data continues to increase at an exponential rate, but the improvement of processing mechanisms is relatively slow. Only a few tools are available to address the issues of big data processing in cloud environments. State-of-the-art techniques and technologies in many important big data applications (i.e., MapReduce, Dryad, Pregel, Pig Latin, MongoDB, HBase, SimpleDB, and Cassandra) cannot solve the actual problems of storing and querying big data. For example, Hadoop and MapReduce lack query processing strategies and have low-level infrastructures with respect to data processing and management. Despite the plethora of work performed to address the problem of storing and processing big data in cloud computing environments, certain important aspects of storing and processing big data in cloud computing are yet to be solved. Some of these issues are discussed in the subsequent subsections.

9.1. Data staging

The most important open research issue regarding data staging is related to the heterogeneous nature of data. Data gathered from different sources do not have a structured format. For instance, mobile cloud-based applications, blogs, and social networking are inadequately structured, similar to pieces of text messages, videos, and images. Transforming and cleaning such unstructured data before loading them into the warehouse for analysis are challenging tasks. Efforts have been exerted to simplify the transformation process by adopting technologies such as Hadoop and MapReduce to support the distributed processing of unstructured data formats. However, understanding the context of unstructured data is necessary, particularly when meaningful information is required. The MapReduce programming model is the most common model that operates in clusters of computers; it has been utilized to process and distribute large amounts of data.

9.2. Distributed storage systems

Numerous solutions have been proposed to store and retrieve massive amounts of data. Some of these solutions have been applied in a cloud computing environment. However, several issues hinder the successful implementation of such solutions, including the capability of current cloud technologies to provide the necessary capacity and high performance to address massive amounts of data [68], the optimization of existing file systems for the volumes demanded by data mining applications, and how data can be stored in such a manner that they can be easily retrieved and migrated between servers.

9.3. Data analysis

The selection of an appropriate model for large-scale data analysis is critical. Talia [41] pointed out that obtaining useful information from large amounts of data requires scalable analysis algorithms to produce timely results. However, current algorithms are inefficient in terms of big data analysis. Therefore, efficient data analysis tools and technologies are required to process such data. The performance of each algorithm ceases to increase linearly with increasing computational resources. As researchers continue to probe the issues of big data in cloud computing, new problems in big data processing arise from the transitional data analysis techniques. Stream data arriving from different data sources must be processed and compared with historical information within a certain period of time. Such data sources may contain different formats, which makes the integration of multiple sources for analysis a complex task [125].

9.4. Data security

Although cloud computing has transformed modern ICT technology, several unresolved security threats exist in cloud computing. These security threats are magnified by the volume, velocity, and variety of big data. Moreover, several threats and issues, such as privacy, confidentiality, integrity, and availability of data, exist in big data using cloud computing platforms. Therefore, data security must be measured once data are outsourced to cloud service providers. The cloud must also be assessed at regular

intervals to protect it against threats. Cloud vendors must ensure that all service level agreements are met. Recently, some controversies have revealed how some security agencies use data generated by individuals for their own benefit without permission. Therefore, policies that cover all user privacy concerns should be developed. Traditionally, the most common technique for privacy and data control is to protect the systems utilized to manage the data rather than the data itself; however, such systems have proven to be vulnerable. Utilizing strong cryptography to encapsulate sensitive data in a cloud computing environment and developing a novel algorithm that efficiently allows for key management and secure key exchange are important to manage access to big data, particularly as they exist in the cloud independent of any platform. Moreover, the issue with integrity is that previously developed hashing schemes are no longer applicable to large amounts of data. Integrity verification is also difficult because of the lack of support, given remote data access and the lack of information on internal storage.

10. Conclusion

The size of data at present is huge and continues to increase every day. The variety of data being generated is also expanding. The velocity of data generation and growth is increasing because of the proliferation of mobile devices and other device sensors connected to the Internet. These data provide opportunities that allow businesses across all industries to gain real-time business insights. The use of cloud services to store, process, and analyze data has been available for some time; it has changed the context of information technology and has turned the promises of the on-demand service model into reality. In this study, we presented a review on the rise of big data in cloud computing. We proposed a classification for big data, a conceptual view of big data, and a cloud services model. This model was compared with several representative big data cloud platforms. We discussed the background of Hadoop technology and its core components, namely, MapReduce and HDFS. We presented current MapReduce projects and related software. We also reviewed some of the challenges in big data processing. The review covered volume, scalability, availability, data integrity, data protection, data transformation, data quality/heterogeneity, privacy and legal/regulatory issues, data access, and governance. Furthermore, the key issues of big data in clouds were highlighted. In the future, significant challenges and issues must be addressed by academia and industry. Researchers, practitioners, and social science scholars should collaborate to ensure the long-term success of data management in a cloud computing environment and to collectively explore new territories.

Acknowledgment

This paper is financially supported by the Malaysian Ministry of Education under the University of Malaya High Impact Research Grant UM.C/625/1/HIR/MoE/FCSIT/03.

References

[1] R.L. Villars, C.W. Olofson, M. Eastwood, Big data: what it is and why you should care, White Paper, IDC, MA, USA, 2011.
[2] R. Cumbley, P. Church, Is Big Data creepy? Comput. Law Secur. Rev. 29 (2013) 601–609.
[3] S. Kaisler, F. Armour, J.A. Espinosa, W. Money, Big data: issues and challenges moving forward, in: Proceedings of the 46th Hawaii International Conference on System Sciences (HICSS), IEEE, 2013, pp. 995–1004.
[4] L. Chih-Wei, H. Chih-Ming, C. Chih-Hung, Y. Chao-Tung, An improvement to data service in cloud computing with content sensitive transaction analysis and adaptation, in: Computer Software and Applications Conference Workshops (COMPSACW), 2013 IEEE 37th Annual, 2013, pp. 463–468.
[5] L. Chang, R. Ranjan, Z. Xuyun, Y. Chi, D. Georgakopoulos, C. Jinjun, Public auditing for big data storage in cloud computing – a survey, in: Computational Science and Engineering (CSE), 2013 IEEE 16th International Conference on, 2013, pp. 1128–1135.
[6] M. Cox, D. Ellsworth, Managing Big Data for Scientific Visualization, ACM Siggraph, MRJ/NASA Ames Research Center, 1997.
[7] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, A.H. Byers, Big Data: The Next Frontier for Innovation, Competition, and Productivity, 2011.
[8] P. Zikopoulos, K. Parasuraman, T. Deutsch, J. Giles, D. Corrigan, Harness the Power of Big Data: The IBM Big Data Platform, McGraw-Hill Professional, 2012.
[9] J.J. Berman, Introduction, in: Principles of Big Data, Morgan Kaufmann, Boston, 2013, pp. xix–xxvi.
[10] J. Gantz, D. Reinsel, Extracting value from chaos, IDC iView (2011) 1–12.
[11] J.K. Laurila, D. Gatica-Perez, I. Aad, J. Blom, O. Bornet, T.-M.-T. Do, O. Dousse, J. Eberle, M. Miettinen, The mobile data challenge: big data for mobile computing research, in: Workshop on the Nokia Mobile Data Challenge, in conjunction with the 10th International Conference on Pervasive Computing, 2012, pp. 1–8.
[12] D.E. O'Leary, Artificial intelligence and big data, IEEE Intell. Syst. 28 (2013) 96–99.
[13] M. Chen, S. Mao, Y. Liu, Big data: a survey, Mob. Netw. Appl. 19 (2) (2014) 1–39.
[14] B.P. Rao, P. Saluia, N. Sharma, A. Mittal, S.V. Sharma, Cloud computing for Internet of Things & sensing based applications, in: Proceedings of the Sixth International Conference on Sensing Technology (ICST), IEEE, 2012, pp. 374–380.
[15] B. Franks, Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced Analytics, John Wiley & Sons, 2012.
[16] D.J. Abadi, P.A. Boncz, S. Harizopoulos, Column-oriented database systems, Proc. VLDB Endow. 2 (2009) 1664–1665.
[17] F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows, T. Chandra, A. Fikes, R.E. Gruber, Bigtable: a distributed storage system for structured data, ACM Trans. Comput. Syst. (TOCS) 26 (2008) 4.
[18] P. Neubauer, Graph databases, NOSQL and Neo4j, 2010.
[19] M. Seeger, Key-value stores: a practical overview, Ultra-Large-Sites, Comput. Sci. Media (2009).
[20] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, W. Vogels, Dynamo: Amazon's highly available key-value store, SOSP 41 (6) (2007) 205–220.
[21] S. Das, D. Agrawal, A. El Abbadi, G-Store: a scalable data store for transactional multi key access in the cloud, in: Proceedings of the 1st ACM Symposium on Cloud Computing, ACM, 2010, pp. 163–174.
[22] F. Lin, W.W. Cohen, Power iteration clustering, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 655–662.
[23] R.C. Taylor, An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics, BMC Bioinf. 11 (2010) S1.
[24] A. Lakshman, P. Malik, The Apache Cassandra project, 2011.
[25] E. Rahm, H.H. Do, Data cleaning: problems and current approaches, IEEE Data Eng. Bull. 23 (2000) 3–13.
[26] J. Quackenbush, Microarray data normalization and transformation, Nat. Genet. 32 (2002) 496–501.
[27] Y. Chen, S. Alspaugh, R. Katz, Interactive analytical processing in big data systems: a cross-industry study of MapReduce workloads, Proc. VLDB Endow. 5 (2012) 1802–1813.

[28] L. Neumeyer, B. Robbins, A. Nair, A. Kesari, S4: distributed stream computing platform, in: Data Mining Workshops (ICDMW), 2010 IEEE International Conference on, 2010, pp. 170–177.
[29] J. Hurwitz, A. Nugent, F. Halper, M. Kaufman, Big Data for Dummies, For Dummies, 2013.
[30] M. Armbrust, A. Fox, R. Griffith, A.D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, M. Zaharia, A view of cloud computing, Commun. ACM 53 (2010) 50–58.
[31] L. Huan, Big data drives cloud adoption in enterprise, IEEE Internet Comput. 17 (2013) 68–71.
[32] S. Pandey, S. Nepal, Cloud computing and scientific applications — big data, scalable analytics, and beyond, Futur. Gener. Comput. Syst. 29 (2013) 1774–1776.
[33] D. Warneke, O. Kao, Nephele: efficient parallel data processing in the cloud, in: Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers, ACM, 2009, p. 8.
[34] P. Mell, T. Grance, The NIST definition of cloud computing (draft), NIST Spec. Publ. 800 (2011) 7.
[35] A. Giuseppe, B. Alessio, D. Walter, P. Antonio, Cloud monitoring: a survey, Comput. Netw. 57 (2013) 2093–2115.
[36] T. Gunarathne, B. Zhang, T.-L. Wu, J. Qiu, Scalable parallel computing on clouds using Twister4Azure iterative MapReduce, Futur. Gener. Comput. Syst. 29 (2013) 1035–1048.
[37] A. O'Driscoll, J. Daugelaite, R.D. Sleator, 'Big data', Hadoop and cloud computing in genomics, J. Biomed. Inform. 46 (2013) 774–781.
[38] N. Fernando, S.W. Loke, W. Rahayu, Mobile cloud computing: a survey, Futur. Gener. Comput. Syst. 29 (2013) 84–106.
[39] R. Holman, Mobile Cloud Application Revenues To Hit $9.5 Billion by 2014, Driven by Converged Mobile Services, Juniper Research, 2010.
[40] Z. Sanaei, S. Abolfazli, A. Gani, R. Buyya, Heterogeneity in mobile cloud computing: taxonomy and open challenges, IEEE Commun. Surv. Tutor. (2013) 1–24.
[41] D. Talia, Clouds for scalable big data analytics, Computer 46 (2013) 98–101.
[42] C. Ji, Y. Li, W. Qiu, U. Awada, K. Li, Big data processing in cloud computing environments, in: Pervasive Systems, Algorithms and Networks (ISPAN), 2012 12th International Symposium on, IEEE, 2012, pp. 17–23.
[43] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM 51 (2008) 107–113.
[44] D. Bollier, C.M. Firestone, The Promise and Peril of Big Data, Aspen Institute, Communications and Society Program, Washington, DC, USA, 2010.
[45] H.E. Miller, Big-data in cloud computing: a taxonomy of risks, Inf. Res. 18 (2013) 571.
[46] O. Kwon, N. Lee, B. Shin, Data quality management, data usage experience and acquisition intention of big data analytics, Int. J. Inf. Manag. 34 (3) (2014) 387–394.
[47] K. Singh, S.C. Guntuku, A. Thakur, C. Hota, Big data analytics framework for peer-to-peer botnet detection using random forests, Inf. Sci. (2014).
[48] J.L. Schnase, D.Q. Duffy, G.S. Tamkin, D. Nadeau, J.H. Thompson, C.M. Grieg, M.A. McInerney, W.P. Webster, MERRA Analytic Services: meeting the big data challenges of climate science through cloud-enabled Climate Analytics-as-a-Service, Computers, Environment and Urban Systems (2014).
[49] B.K. Tannahill, M. Jamshidi, System of systems and big data analytics – bridging the gap, Comput. Electr. Eng. 40 (2014) 2–15.
[50] J. Abawajy, Symbioses of Big Data and Cloud Computing: Opportunities & Challenges, 2013.
[51] S. Aluru, Y. Simmhan, A special issue of Journal of Parallel and Distributed Computing: scalable systems for big data management and analytics, J. Parallel Distrib. Comput. 73 (2013) 896.
[52] S. Hipgrave, Smarter fraud investigations with big data analytics, Netw. Secur. 2013 (2013) 7–9.
[53] Z. Linquan, W. Chuan, L. Zongpeng, G. Chuanxiong, C. Minghua, F.C.M. Lau, Moving big data to the cloud: an online cost-minimizing approach, IEEE J. Sel. Areas Commun. 31 (2013) 2710–2721.
[54] H. Demirkan, D. Delen, Leveraging the capabilities of service-oriented decision support systems: putting analytics and big data in cloud, Decis. Support Syst. 55 (2013) 412–421.
[55] S. Lee, H. Park, Y. Shin, Cloud computing availability: multi-clouds for big data service, Communications in Computer and Information Science 310 (2012) 799–806.
[56] S.N. Srirama, P. Jakovits, E. Vainikko, Adapting scientific computing problems to clouds using MapReduce, Futur. Gener. Comput. Syst. 28 (2012) 184–192.
[57] W. Yan, U. Brahmakshatriya, Y. Xue, M. Gilder, B. Wise, p-PIC: parallel power iteration clustering for big data, J. Parallel Distrib. Comput. 73 (3) (2012) 352–359.
[58] E.E. Schadt, M.D. Linderman, J. Sorenson, L. Lee, G.P. Nolan, Cloud and heterogeneous computing solutions exist today for the emerging big data problems in biology, Nat. Rev. Genet. 12 (2011) 224.
[59] Amazon, AWS Case Study: SwiftKey. 〈http://aws.amazon.com/solutions/case-studies/big-data〉 (accessed 05.07.14).
[60] Microsoft, 343 Industries Gets New User Insights from Big Data in the Cloud. 〈http://www.microsoft.com/casestudies/〉 (accessed 15.07.14).
[61] Google, Case study: How redBus uses BigQuery to Master Big Data. 〈https://developers.google.com/bigquery/case-studies/〉 (accessed 22.07.14).
[62] Cloudera, Nokia: Using Big Data to Bridge the Virtual & Physical Worlds. 〈http://www.cloudera.com/content/dam/cloudera/documents/Cloudera-Nokia-case-study-final.pdf〉 (accessed 24.07.14).
[63] Alacer, Case Studies: Big Data. 〈http://www.alacergroup.com/practice-category/big-data/case-studies-big-data/〉 (accessed 24.07.14).
[64] J.G. Reid, A. Carroll, N. Veeraraghavan, M. Dahdouli, A. Sundquist, A. English, M. Bainbridge, S. White, W. Salerno, C. Buhay, Launching genomics into the cloud: deployment of Mercury, a next generation sequence analysis pipeline, BMC Bioinf. 15 (2014) 30.
[65] P. Noordhuis, M. Heijkoop, A. Lazovik, Mining Twitter in the cloud: a case study, in: Proceedings of the IEEE 3rd International Conference on Cloud Computing (CLOUD), IEEE, Miami, FL, 2010, pp. 107–114.
[66] C. Zhang, H. De Sterck, A. Aboulnaga, H. Djambazian, R. Sladek, Case study of scientific data processing on a cloud using Hadoop, in: High Performance Computing Systems and Applications, Springer, 2010, pp. 400–415.
[67] F. Faghri, S. Bazarbayev, M. Overholt, R. Farivar, R.H. Campbell, W.H. Sanders, Failure scenario as a service (FSaaS) for Hadoop clusters, in: Proceedings of the Workshop on Secure and Dependable Middleware for Cloud Monitoring and Management, ACM, 2012, p. 5.
[68] N. Leavitt, Storage challenge: where will all that big data go? Computer 46 (2013) 22–25.
[69] K. Strauss, D. Burger, What the future holds for solid-state memory, Computer 47 (2014) 24–31.
[70] K. Mayama, W. Skulkittiyut, Y. Ando, T. Yoshimi, M. Mizukawa, Proposal of object management system for applying to existing object storage furniture, in: System Integration (SII), 2011 IEEE/SICE International Symposium on, IEEE, 2011, pp. 279–282.
[71] W. Hu, D. Hu, C. Xie, F. Chen, A new data format and a new error control scheme for optical-storage systems, in: IEEE International Conference on Networking, Architecture, and Storage (NAS 2007), 2007, pp. 193–198.
[72] L. Hao, D. Han, The study and design on secure-cloud storage system, in: International Conference on Electrical and Control Engineering (ICECE), 2011, pp. 5126–5129.
[73] T. White, Hadoop: The Definitive Guide, O'Reilly Media, Sebastopol, CA, 2009.
[74] K. Shvachko, K. Hairong, S. Radia, R. Chansler, The Hadoop Distributed File System, in: Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, 2010, pp. 1–10.
[75] S. Ghemawat, H. Gobioff, S.-T. Leung, The Google file system, ACM SIGOPS Oper. Syst. Rev. 37 (5) (2003) 29–43.
[76] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, G. Fox, Twister: a runtime for iterative MapReduce, in: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ACM, 2010, pp. 810–818.
[77] A. Thusoo, J.S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, R. Murthy, Hive: a warehousing solution over a map-reduce framework, Proc. VLDB Endow. 2 (2009) 1626–1629.
[78] L. George, HBase: The Definitive Guide, O'Reilly Media, Sebastopol, CA, 2011.
[79] S. Owen, R. Anil, T. Dunning, E. Friedman, Mahout in Action, Manning Publications, 2011.
[80] C. Olston, B. Reed, U. Srivastava, R. Kumar, A. Tomkins, Pig Latin: a not-so-foreign language for data processing, in: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, ACM, 2008, pp. 1099–1110.
[81] P. Hunt, M. Konar, F.P. Junqueira, B. Reed, ZooKeeper: wait-free coordination for internet-scale systems, in: Proceedings of the 2010 USENIX Annual Technical Conference, 2010, pp. 11–11.
[82] M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, I. Stoica, Spark: cluster computing with working sets, in: Proceedings of

the 2nd USENIX Conference on Hot Topics in Cloud Computing, 2010, pp. 10–10.
[83] A. Rabkin, R. Katz, Chukwa: a system for reliable large-scale log collection, in: Proceedings of the 24th International Conference on Large Installation System Administration, USENIX Association, 2010, pp. 1–15.
[84] A. Cassandra, The Apache Cassandra project.
[85] S. Hoffman, Apache Flume: Distributed Log Collection for Hadoop, Packt Publishing Ltd., Birmingham, UK, 2013.
[86] X. Zhifeng, X. Yang, Security and privacy in cloud computing, IEEE Commun. Surv. Tutor. 15 (2013) 843–859.
[87] T. Gunarathne, T.-L. Wu, J. Qiu, G. Fox, MapReduce in the clouds for science, in: IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), 2010, pp. 565–572.
[88] S. Sakr, A. Liu, A.G. Fayoumi, The family of MapReduce and large-scale data processing systems, ACM Comput. Surv. (CSUR) 46 (2013) 11.
[89] K.S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Eltabakh, C.-C. Kanne, F. Ozcan, E.J. Shekita, Jaql: a scripting language for large scale semistructured data analysis, Proc. VLDB Conf. (2011).
[90] L. Lin, V. Lychagina, W. Liu, Y. Kwon, S. Mittal, M. Wong, Tenzing: a SQL implementation on the MapReduce framework, 2011.
[91] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, A. Rasin, HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads, Proc. VLDB Endow. 2 (2009) 922–933.
[92] E. Friedman, P. Pawlowski, J. Cieslewicz, SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions, Proc. VLDB Endow. 2 (2009) 1402–1413.
[93] R. Pike, S. Dorward, R. Griesemer, S. Quinlan, Interpreting the data:
[106] O. Tene, J. Polonetsky, Privacy in the age of big data: a time for big decisions, Stanford Law Review Online 64 (2012) 63.
[107] L. Hsiao-Ying, W.G. Tzeng, A secure erasure code-based cloud storage system with secure data forwarding, IEEE Trans. Parallel Distrib. Syst. 23 (2012) 995–1003.
[108] C. Ning, W. Cong, M. Li, R. Kui, L. Wenjing, Privacy-preserving multi-keyword ranked search over encrypted cloud data, in: INFOCOM, 2011 Proceedings IEEE, 2011, pp. 829–837.
[109] C.E. Shannon, Communication theory of secrecy systems, Bell Syst. Tech. J. 28 (1949) 656–715.
[110] L. Kocarev, G. Jakimoski, Logistic map as a block encryption algorithm, Phys. Lett. A 289 (4–5) (2001) 199–206.
[111] Z. Xuyun, L. Chang, S. Nepal, S. Pandey, C. Jinjun, A privacy leakage upper bound constraint-based approach for cost-effective privacy preserving of intermediate data sets in cloud, IEEE Trans. Parallel Distrib. Syst. 24 (2013) 1192–1202.
[112] C.-I. Fan, S.-Y. Huang, Controllable privacy preserving search based on symmetric predicate encryption in cloud storage, Futur. Gener. Comput. Syst. 29 (2013) 1716–1724.
[113] R. Li, Z. Xu, W. Kang, K.C. Yow, C.-Z. Xu, Efficient multi-keyword ranked query over encrypted data in cloud computing, Futur. Gener. Comput. Syst. (2013).
[114] A. Squicciarini, S. Sundareswaran, D. Lin, Preventing information leakage from indexing in the cloud, in: Cloud Computing (CLOUD), 2010 IEEE 3rd International Conference on, 2010, pp. 188–195.
[115] S. Bhagat, G. Cormode, B. Krishnamurthy, D. Srivastava, Privacy in dynamic social networks, in: Proceedings of the 19th International Conference on World Wide Web, ACM, Raleigh, North Carolina, USA, 2010, pp. 1059–1060.
[116] W. Itani, A. Kayssi, A. Chehab, Privacy as a Service: Privacy-Aware
parallel analysis with Sawzall, Sci. Progr. 13 (2005) 277–298. Data Storage and Processing in Cloud Computing Architectures,
[94] R. Cattell, Scalable SQL and NoSQL data stores, ACM SIGMOD Dependable, Autonomic and Secure Computing, 2009. DASC ’09, in:
Record, 39 (4), ACM New York, NY, USA, 2011, 12–27. Proceedings of the Eighth IEEE International Conference on, 2009,
[95] Z. Wang, Y. Chu, K.-L. Tan, D. Agrawal, A.E. Abbadi, X. Xu, Scalable pp. 711–716.
Data Cube Analysis over Big Data, arXiv preprint arXiv:1311.5663 [117] D. Agrawal, C.C. Aggarwal, On the design and quantification of
(2013). privacy preserving data mining algorithms, in: Proceedings of the
[96] R. Ramakrishnan, J. Gehrke, Database Management Systems, Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles
Osborne/McGraw-Hill, New York, 2003. of Database Systems, ACM, Santa Barbara, California, USA, 2001,
[97] S. Ghemawat, H. Gobioff, S.-T. Leung, The Google file system, pp. 247–255.
SIGOPS Oper. Syst. Rev. 37 (2003) 29–43. [118] Z. Xuyun, L. Chang, S. Nepal, D. Wanchun, C. Jinjun, Privacy-
[98] D. Zissis, D. Lekkas, Addressing cloud computing security issues, Preserving Layer over MapReduce on Cloud, in: International Con-
Futur. Gener. Comput. Syst. 28 (2012) 583–592. ference on Cloud and Green Computing (CGC), 2012, pp. 304–310.
[99] M. Schroeck, R. Shockley, J. Smart, D. Romero-Morales, P. Tufano, [119] D.P. Bertsekas, Nonlinear programming, (1999).
Analytics: The real-world use of big data, in, IBM Global Business [120] C. Tankard, Big data security, Netw. Secur. 2012 (2012) 5–8.
Services, 2012. [121] P. Malik, Governing big data: principles and practices, IBM J. Res.
[100] R. Sravan Kumar, A. Saxena, Data integrity proofs in cloud storage, Dev. 57 (1) (2013) 1. (-1: 13).
in: Proceedings of the Third International Conference on Commu- [122] D. Loshin, Chapter 5 – data governance for big data analytics:
nication Systems and Networks (COMSNETS), 2011, pp. 1–4. considerations for data policies and processes, in: D. Loshin (Ed.),
[101] R. Akerkar, Big Data Computing, CRC Press, 2013. Big Data Analytics, Morgan Kaufmann, Boston, 2013, pp. 39–48.
[102] T.C. Redman, A. Blanton, Data Quality for the Information Age, [123] S. Soares, Big Data Governance, Sunilsoares, 2012.
Artech House, Inc., Norwood, MA, USA, 1997. [124] P.P. Tallon, Corporate governance of big data: perspectives on value,
[103] D.M. Strong, Y.W. Lee, R.Y. Wang, Data quality in context, Commun. risk, and cost, Computer 46 (2013) 32–38.
ACM, 40, , 1997, 103–110. [125] M.D. Assuncao, R.N. Calheiros, S. Bianchi, M.A. Netto, R. Buyya, Big
[104] K. Weber, G. Rincon, A. Van Eenennaam, B. Golden, J. Medrano, Data Computing and Clouds: Challenges, Solutions, and Future
Differences in allele frequency distribution of bovine high-density Directions, arXiv preprint arXiv:1312.4722, (2013).
genotyping platforms in holsteins and jerseys, Western section [126] Khan, Abdul Nasir, et al. BSS: block-based sharing scheme for
American society of Animal science, 2012, p. 70. secure data storage services in mobile cloud environment. The
[105] D. Che, M. Safran, Z. Peng, From big data to big data mining: Journal of Supercomputing (2014) 1–31.
challenges, issues, and opportunities, in: B. Hong, X. Meng, L. Chen, [127] Khan, Abdul Nasir, et al., Incremental proxy re-encryption scheme
W. Winiwarter, W. Song (Eds.), Database Systems for Advanced for mobile cloud computing environment, The Journal of Super-
Applications, Springer, Berlin Heidelberg, 2013, pp. 1–15. computing 68 (2) (2014) 624–651.