The Rise of Big Data On Cloud Computing
Information Systems
journal homepage: www.elsevier.com/locate/infosys
Article info
Article history:
Received 11 June 2014
Received in revised form 22 July 2014
Accepted 24 July 2014
Recommended by: Prof. D. Shasha
Available online 10 August 2014
Keywords: Big data; Cloud computing; Hadoop

Abstract
Cloud computing is a powerful technology to perform massive-scale and complex computing. It eliminates the need to maintain expensive computing hardware, dedicated space, and software. Massive growth in the scale of data or big data generated through cloud computing has been observed. Addressing big data is a challenging and time-demanding task that requires a large computational infrastructure to ensure successful data processing and analysis. The rise of big data in cloud computing is reviewed in this study. The definition, characteristics, and classification of big data along with some discussions on cloud computing are introduced. The relationship between big data and cloud computing, big data storage systems, and Hadoop technology are also discussed. Furthermore, research challenges are investigated, with focus on scalability, availability, data integrity, data transformation, data quality, data heterogeneity, privacy, legal and regulatory issues, and governance. Lastly, open research issues that require substantial research efforts are summarized.
© 2014 Elsevier Ltd. All rights reserved.
Contents
1. Introduction
2. Definition and characteristics of big data
2.1. Classification of big data
3. Cloud computing
4. Relationship between cloud computing and big data
5. Case studies
5.1. Organization case studies from vendors
5.1.1. A. SwiftKey
5.1.2. B. 343 Industries
5.1.3. C. redBus
5.1.4. D. Nokia
5.1.5. E. Alacer
* Corresponding author. Tel.: +60 173946811.
E-mail addresses: targio@siswa.um.edu.my (I.A.T. Hashem), ibraryaqoob@siswa.um.edu.my (I. Yaqoob), badrul@um.edu.my (N.B. Anuar),
salimah@um.edu.my (S. Mokhtar), abdullah@um.edu.my (A. Gani), samee.khan@ndsu.edu (S. Ullah Khan).
http://dx.doi.org/10.1016/j.is.2014.07.006
0306-4379/© 2014 Elsevier Ltd. All rights reserved.
I.A.T. Hashem et al. / Information Systems 47 (2015) 98–115 99
Table 1
List of abbreviations.
[Figure: classification of big data. Recoverable labels: Data sources, Content format, Data stores, Data staging, Data processing; Transactions, IoT, Key-value.]
structured are stored in various formats. Most popular is the relational database, which comes in a large number of varieties [29]. As a result of the wide variety of data sources, the captured data differ in size with respect to redundancy, consistency, noise, etc.

3. Cloud computing

Cloud computing is a fast-growing technology that has established itself in the next generation of IT industry and business. Cloud computing promises reliable software, hardware, and IaaS delivered over the Internet and remote data centers [30]. Cloud services have become a powerful architecture to perform complex large-scale computing tasks and span a range of IT functions from storage and computation to database and application services. The need to store, process, and analyze large datasets has driven many organizations and individuals to adopt cloud computing [31]. A large number of scientific applications for extensive experiments are currently deployed in the cloud, and this number may continue to increase because of the lack of available computing facilities in local servers, reduced capital costs, and the increasing volume of data produced and consumed by the experiments [32]. In addition, cloud service providers have begun to integrate frameworks for parallel data processing in their services to help users access cloud resources and deploy their programs [33].

Cloud computing “is a model for allowing ubiquitous, convenient, and on-demand network access to a number of configured computing resources (e.g., networks, server, storage, application, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction” [34]. Cloud computing has a number of favorable aspects to address the rapid growth of economies and technological barriers. Cloud computing reduces the total cost of ownership and allows organizations to focus on the core business without worrying about issues such as infrastructure, flexibility, and availability of resources [35]. Moreover, combining the cloud computing utility model with a rich set of computation, infrastructure, and storage cloud services offers a highly attractive environment where scientists can perform their experiments [36]. Cloud service models typically consist of PaaS, SaaS, and IaaS.

PaaS, such as Google's App Engine, Salesforce.com, Force platform, and Microsoft Azure, refers to different resources operating on a cloud to provide platform computing for end users.

SaaS, such as Google Docs, Gmail, Salesforce.com, and Online Payroll, refers to applications operating on a remote cloud infrastructure offered by the cloud provider as services that can be accessed through the Internet [37].

IaaS, such as Flexiscale and Amazon's EC2, refers to hardware equipment operating on a cloud provided by service providers and used by end users upon demand.

The increasing popularity of wireless networks and mobile devices has taken cloud computing to new heights because of the limited processing capability, storage capacity, and battery lifetime of each device [126]. This condition has led to the emergence of a mobile cloud computing paradigm. Mobile cloud facilities allow users to outsource tasks to external service providers. For example, data can be processed and stored outside of a mobile device [38]. Mobile cloud applications, such as Gmail, iCloud, and Dropbox, have become prevalent recently. Juniper Research predicts that cloud-based mobile applications will grow to approximately $9.5 billion by 2014 [39]. Such applications improve mobile cloud performance and user experience. However, the limitations associated with wireless networks and the intrinsic nature of mobile devices have imposed computational and data storage restrictions [40,127].
Table 2
Various categories of big data.
Classification | Description

Data sources
Social media | Social media is the source of information generated via URL to share or exchange information and ideas in virtual communities and networks, such as collaborative projects, blogs and microblogs, Facebook, and Twitter.
Machine-generated data | Machine data are information automatically generated by hardware or software, such as computers, medical devices, or other machines, without human intervention.
Sensing | Several sensing devices exist to measure physical quantities and change them into signals.
Transactions | Transaction data, such as financial and work data, comprise an event that involves a time dimension to describe the data.
IoT | IoT represents a set of objects that are uniquely identifiable as a part of the Internet. These objects include smartphones, digital cameras, and tablets. When these devices connect with one another over the Internet, they enable smarter processes and services that support basic, economic, environmental, and health needs. The large number of devices connected to the Internet provides many types of services and produces huge amounts of data and information [14].

Content format
Structured | Structured data are often managed with SQL, a programming language created for managing and querying data in RDBMS. Structured data are easy to input, query, store, and analyze. Examples of structured data include numbers, words, and dates.
Semi-structured | Semi-structured data are data that do not follow a conventional database system. Semi-structured data may be in the form of structured data that are not organized in relational database models, such as tables. Capturing semi-structured data for analysis is different from capturing a fixed file format. Therefore, capturing semi-structured data requires the use of complex rules that dynamically decide the next process after capturing the data [15].
Unstructured | Unstructured data, such as text messages, location information, videos, and social media data, are data that do not follow a specified format. Because the size of this type of data continues to increase through the use of smartphones, analyzing and understanding such data has become a challenge.

Data stores
Document-oriented | Document-oriented data stores are mainly designed to store and retrieve collections of documents or information and support complex data forms in several standard formats, such as JSON, XML, and binary forms (e.g., PDF and MS Word). A document in such a store is similar to a record or row in a relational database but is more flexible, and documents can be retrieved based on their contents (e.g., MongoDB, SimpleDB, and CouchDB).
Column-oriented | A column-oriented database stores its content in columns instead of rows, with attribute values belonging to the same column stored contiguously. Column-oriented stores, such as BigTable [17], differ from classical database systems that store entire rows one after the other [16].
Graph database | A graph database, such as Neo4j, is designed to store and represent data that utilize a graph model with nodes, edges, and properties related to one another through relations [18].
Key-value | Key-value is an alternative to relational database systems that stores and accesses data and is designed to scale to a very large size [19]. Dynamo [20] is a good example of a highly available key-value storage system; it is used by amazon.com in some of its services. Similarly, [21] proposed a scalable key-value store to support transactional multi-key access using a single key access supported by key-value for use in G-store designs. [22] presented a scalable clustering method to perform a large task in datasets. Other examples of key-value stores are Apache HBase [23], Apache Cassandra [24], and Voldemort. HBase is an open-source version of Google's BigTable that uses HDFS for storage. HBase stores data in tables, rows, and cells. Rows are sorted by row key, and each cell in a table is specified by a row key, a column key, and a version, with the content contained as an uninterpreted array of bytes.

Data staging
Cleaning | Cleaning is the process of identifying incomplete and unreasonable data [25].
Transform | Transform is the process of transforming data into a form suitable for analysis.
Normalization | Normalization is the method of structuring database schema to minimize redundancy [26].

Data processing
Batch | MapReduce-based systems have been adopted by many organizations in the past few years for long-running batch jobs [27]. Such systems allow applications to scale across large clusters of machines comprising thousands of nodes.
Real time | One of the most famous and powerful real-time big data processing tools is the simple scalable streaming system (S4) [28]. S4 is a distributed computing platform that allows programmers to conveniently develop applications for processing continuous unbounded streams of data. S4 is a scalable, partially fault tolerant, general purpose, and pluggable platform.
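The HBase-style cell addressing described in the key-value row of Table 2 (rows sorted by row key; each cell identified by a row key, a column key, and a version; content held as uninterpreted bytes) can be illustrated with a small in-memory model. This is only a sketch of the data model under stated assumptions, not the HBase API: the class name `VersionedTable` and its methods `put`, `get`, and `scan` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class VersionedTable:
    """Toy in-memory table: (row key, column key, version) -> bytes.

    Hypothetical illustration of the data model only; not the HBase client API.
    """

    def __init__(self):
        # row key -> column key -> {version: value bytes}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column_key, version, value):
        # Cell content is stored as an uninterpreted byte array.
        if not isinstance(value, bytes):
            raise TypeError("cell content must be bytes")
        self._rows[row_key][column_key][version] = value

    def get(self, row_key, column_key, version=None):
        """Return one cell; if no version is given, return the newest one."""
        versions = self._rows.get(row_key, {}).get(column_key, {})
        if not versions:
            return None
        if version is None:
            version = max(versions)
        return versions.get(version)

    def scan(self):
        """Yield (row key, column key, version, value) in sorted row-key order."""
        for row_key in sorted(self._rows):
            for column_key in sorted(self._rows[row_key]):
                cells = self._rows[row_key][column_key]
                for version in sorted(cells):
                    yield row_key, column_key, version, cells[version]
```

In this sketch the store never interprets the value, so any typing or decoding is left to the application, which matches the "uninterpreted array of bytes" description above.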
4. Relationship between cloud computing and big data

Cloud computing and big data are conjoined. Big data provides users the ability to use commodity computing to process distributed queries across multiple datasets and return resultant sets in a timely manner. Cloud computing provides the underlying engine through the use of Hadoop, a class of distributed data-processing platforms. The use of cloud computing in big data is shown in Fig. 3. Large data sources from the cloud and Web are stored in a distributed fault-tolerant database and processed through a programming model for large datasets with a parallel distributed algorithm in a cluster. The main purpose of data visualization, as shown in Fig. 3, is to view analytical results presented visually through different graphs for decision making.

Big data utilizes distributed storage technology based on cloud computing rather than local storage attached to a computer or electronic device. Big data evaluation is driven by fast-growing cloud-based applications developed using
[Fig. 3: use of cloud computing in big data. Recoverable labels: APIs; Web; distributed fault-tolerant database for large unstructured datasets (e.g., NoSQL); Hadoop Distributed File System (HDFS).]
Table 3
Comparison of several big data cloud platforms.
virtualized technologies. Therefore, cloud computing not only provides facilities for the computation and processing of big data but also serves as a service model. Table 3 shows the comparison of several big data cloud providers.

Talia [41] discussed the complexity and variety of data types and the processing power needed to perform analysis on large datasets. The author stated that cloud computing infrastructure can serve as an effective platform to address the data storage required to perform big data analysis. Cloud computing is correlated with a new pattern for the provision of computing infrastructure and big data processing method for all types of resources available in the cloud through data analysis. Several cloud-based technologies have to cope with this new environment because dealing with big data for concurrent processing has become increasingly complicated [42]. MapReduce [43] is a good example of big data processing in a cloud environment; it allows large datasets stored in a cluster to be processed in parallel. Cluster computing exhibits good performance in distributed system environments, such as computer power, storage, and network communications. Likewise, Bollier and Firestone [44] emphasized the ability of cluster computing to provide a hospitable context for data growth. However, Miller [45] argued that the lack of data availability is expensive because users offload more decisions to analytical methods; incorrect use of the methods or inherent weaknesses in the methods may produce wrong and costly decisions. DBMSs are considered a part of the current cloud computing architecture and play an important role in ensuring the easy transition of applications from old enterprise infrastructures to new cloud infrastructure architectures. The pressure on organizations to quickly adopt and implement technologies, such as cloud computing, to address the challenge of big data storage and processing demands entails unexpected risks and consequences.

Table 4 presents several related studies that deal with big data through the use of cloud computing technology. The table provides a general overview of big data and cloud computing technologies based on the area of study and current challenges, techniques, and technologies that restrict big data and cloud computing.
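The MapReduce model [43] mentioned above can be illustrated with a minimal single-process sketch. It runs on one machine and only mimics the three conceptual phases (map, shuffle, reduce); a real framework such as Hadoop distributes these phases across cluster nodes. The function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are illustrative assumptions, not part of any library.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different node; here we simply chain them.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Because each call to `map_phase` depends only on its own record and each call to `reduce_phase` only on its own key, both phases can in principle run in parallel, which is the property the model relies on.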
Table 4
Several related studies that deal with big data through the use of cloud computing technology.
Reference | Title | Objective
[46] | “Data quality management, data usage experience and acquisition intention of big data analytics” | To propose a model for the acquisition intention of big data analytics
[47] | “Big Data Analytics Framework for Peer-to-Peer Botnet Detection Using Random Forests” | To develop open-source tools, such as Hadoop, to provide a scalable implementation of a quasi-real-time intrusion detection system
[48] | “MERRA Analytic Services: Meeting the Big Data Challenges of Climate Science through Cloud-enabled Climate Analytics-as-a-Service” | To address big data challenges in climate science
[49] | “System of Systems and Big Data Analytics – Bridging the Gap” | To demonstrate the construction of a bridge between System of Systems and Data Analytics to develop reliable models
[50] | “Symbioses of Big Data and Cloud Computing: Opportunities & Challenges” | To highlight big data opportunity
[51] | “A Special Issue of Journal of Parallel and Distributed Computing: Scalable Systems for Big Data Management and Analytics” | To address special issues in big data management and analytics
[52] | “Smarter fraud investigations with big data analytics” | To investigate smarter fraud with big data analytics
[53] | “Moving Big Data to the Cloud: An Online Cost-Minimizing Approach” | To upload data into the cloud from different geographical locations with minimum cost of data migration. Two algorithms (OLM, RFHC) are proposed. These algorithms provide optimization for data aggregation and processing and a route for data.
[54] | “Leveraging the capabilities of service-oriented decision support systems: putting analytics and big data in cloud” | To propose a framework for decision support systems in a cloud
[32] | “Cloud Computing and Scientific Applications — Big Data, Scalable Analytics, and Beyond” | To review some of the papers published in Cloud Computing and Scientific Applications (CCSA2012) event
[41] | “Clouds for Scalable Big Data Analytics” | To discuss the use of cloud for scalable big data analytics
[55] | “Cloud Computing Availability: Multi-clouds for Big Data Service” | To overcome the issue of single cloud
[56] | “Adapting scientific computing problems to clouds using MapReduce” | To review the challenges of reducing the number of iterative algorithms in the MapReduce model
[57] | “p-PIC: Parallel Power Iteration Clustering for Big Data”; Journal of Parallel and Distributed Computing | To explore different parallelization strategies
[58] | “Cloud and heterogeneous computing solutions exist today for the emerging big data problems in biology” | To review cloud and heterogeneous computing solutions existing today for the emerging big data problem in biology
Table 5
Summary of Organization case studies from Vendors.
Case Business needs Cloud service models Big data solution Assessment Reference
Table 6
Summary of case studies from scholarly/academic sources.
1 | Massively parallel DNA sequencing generates staggering amounts of data. | To provide accurate and reproducible genomic results at a scale ranging from individuals to large cohorts. | Develop a Mercury analysis pipeline and deploy it in the Amazon web service cloud via the DNAnexus platform. | Established a powerful combination of a robust and fully validated software pipeline and a scalable computational resource that have been applied to more than 10,000 whole genome and whole exome samples.
2 | Given that conducting analyses on large social networks such as Twitter requires considerable resources because of the large amounts of data involved, such activities are usually expensive. | To use cloud services as a possible solution for the analysis of large amounts of data. | Use PageRank algorithm on the Twitter user base to obtain user rankings. Use the Amazon cloud infrastructure to host all related computations. | Implemented a relatively cheap solution for data acquisition and analysis by using the Amazon cloud infrastructure.
3 | To study the complex molecular interactions that regulate biological systems. | To develop a Hadoop-based cloud computing application that processes sequences of microscopic images of live cells. | Use Hadoop cloud computing framework. | Allows users to submit data processing jobs in the cloud.
4 | Applications running on cloud computing likely may fail. | Design a failure scenario. | Create a series of failure scenarios on an Amazon cloud computing platform. | Help to identify failure vulnerabilities in Hadoop applications running in cloud.
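Case 2 in Table 6 ranks Twitter users with the PageRank algorithm. The source does not describe that study's implementation, so the following is only a textbook power-iteration sketch on a tiny in-memory graph. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the example user names are all assumptions made for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank on a small directed graph.

    graph maps each node to the list of nodes it links to
    (for example, each user to the users they follow).
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node receives the same baseline share each round.
        new_ranks = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                # Split this node's rank evenly across its outgoing links.
                share = damping * ranks[node] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * ranks[node] / n
                for other in nodes:
                    new_ranks[other] += share
        ranks = new_ranks
    return ranks


# Hypothetical follower graph: alice and bob both follow carol; carol follows alice.
example = {"alice": ["carol"], "bob": ["carol"], "carol": ["alice"]}
ranks = pagerank(example)
```

On a graph the size of a real social network, each iteration would be expressed as a distributed job rather than a Python loop, which is where cloud infrastructure of the kind used in the case study becomes relevant.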
two dual-core AMD Opteron 280 CPUs interconnected by gigabit Ethernet.

5.1.5.5. Case study 4: failure scenario as a service (FSaaS) for Hadoop Clusters. Faghri et al. [67] created a series of failure scenarios on an Amazon cloud computing platform to provide Hadoop services with the means to test their applications against the risk of massive failure. They developed a set of failure scenarios for Hadoop clusters with 10 Amazon web service EC2 machines. The types of failures that can occur inside Hadoop jobs include CPU-intensive, I/O-intensive, and network-intensive failures. Thus, running such scenarios against Hadoop applications can help to identify failure vulnerabilities in these applications.

6. Big data storage system

The rapid growth of data has restricted the capability of existing storage technologies to store and manage data. Over the past few years, traditional storage systems have been utilized to store data through structured RDBMS [13]. However, most such storage systems have limitations and are inapplicable to the storage and management of big data. A storage architecture that can be accessed in a highly efficient manner while achieving availability and reliability is required to store and manage large datasets. The storage media currently employed in enterprises are discussed and compared in Table 7.

Several storage technologies have been developed to meet the demands of massive data. Existing technologies can be classified as direct attached storage (DAS), network attached storage (NAS), and storage area network (SAN). In DAS, various hard disk drives (HDDs) are directly connected to the servers. Each HDD receives a certain amount of input/output (I/O) resource, which is managed by individual applications. Therefore, DAS is suitable only for servers that are interconnected on a small scale. Given this low scalability, storage capacity can be increased, but expandability and upgradeability are significantly limited. NAS is a storage device that supports a network. NAS is connected directly to a network through a switch or hub via TCP/IP protocols. In NAS, data are transferred as files. Given that the NAS server can indirectly access a storage device through networks, the I/O burden on a NAS server is significantly lighter than that on a DAS server. NAS can orient networks, particularly scalable and bandwidth-intensive networks. Such networks include high-speed networks of optical-fiber connections. The SAN system of data storage is independent with respect to storage on the local area network (LAN). Multipath data switching is conducted among internal nodes to maximize data management and sharing. The organizational systems of data storage (DAS, NAS, and SAN) can be divided into three parts: (i) disc array, where the foundation of a storage system provides the fundamental guarantee, (ii) connection and network subsystems, which connect one or more disc arrays and servers, and (iii) storage management software, which oversees data sharing, storage management, and disaster recovery tasks for multiple servers.

7. Hadoop background

Hadoop [73] is an open-source Apache Software Foundation project written in Java that enables the distributed processing of large datasets across clusters of commodity hardware. Hadoop has two primary components, namely, HDFS and the MapReduce programming framework. The most significant feature of Hadoop is that HDFS and MapReduce are closely related to each other; they are co-deployed such that a single cluster is produced [73]. Therefore, the storage system is not physically separated from the processing system.
Table 7
Comparison of storage media.
Hard drives | To store data up to four terabytes | Density, cost per bit storage, and speedy start up that may only take several seconds | Require special cooling and high read latency time; the spinning of the platters can sometimes result in vibration and produce more heat than solid state memory | [68]
Solid-state memory | To store data up to two terabytes | Fast access to data, fast movement of huge quantities of data, start-up time only takes several milliseconds, no vibration, and produces less heat than hard drives | Ten times more expensive than hard drives in terms of per gigabyte capacity | [69]
Object storage | To store data as variable-size objects rather than fixed-size blocks | Scales with ease to find information and has a unique identifier to identify data objects; ensures security because information on physical location cannot be obtained from disk drives; supports indexing access | Complexity in tracking indices | [70]
Optical storage | To store data at different angles throughout the storage medium | Least expensive removable storage medium | Complex; its ability to produce multiple optical disks in a single unit is yet to be proven | [71]
Cloud storage | To serve as a provisioning and storage model and provide on-demand access to services, such as storage | Useful for small organizations that do not have sufficient storage capacity; cloud storage can store large amounts of data, but its services are billable | Security is the primary challenge because of data outsourcing | [72]
Table 9
Current MapReduce projects and related software.
Table 10
Summary of several SQL interfaces in the MapReduce framework in related literature.
[89] | “Jaql: A scripting language for large scale semi-structured data analysis” | Jaql | Declarative query language designed for JavaScript Object Notation
[90] | “Tenzing an SQL implementation in the MapReduce framework” | Tenzing | An SQL query execution engine
[91] | “HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads” | HadoopDB | Comparison between Hadoop implementation of MapReduce framework and parallel SQL database management systems
[92] | “SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions” | SQL/MapReduce | Provides a parallel computation of procedural functions across hundreds of servers working together as a single relational database
[77] | “Hive - A Warehousing Solution Over a Map-Reduce Framework” | Hive (data summarization and ad hoc querying) | Presents an open-source warehouse solution built on top of Hadoop
[80] | “Pig latin: a not-so-foreign language for data processing” | Pig Latin | The software takes a middle position between expressing tasks using the high-level declarative querying model in the spirit of SQL and the low-level/procedural programming model using MapReduce
[93] | “Interpreting the data: Parallel analysis with Sawzall” | Sawzall | Sawzall defines the operations to be performed in a single record of the data used at Google on top of MapReduce
et al. [56] presented an approach to apply scientific computing problems to the MapReduce framework, where scientists can efficiently utilize existing resources in the cloud to solve computationally large-scale scientific problems. Currently, many alternative solutions are available to deploy MapReduce in cloud environments; these solutions include using cloud MapReduce runtimes that maximize cloud infrastructure services, using MapReduce as a service, or setting up one's own MapReduce cluster in cloud instances [87]. Several strategies have been proposed to improve the performance of big data processing. Moreover, effort has been exerted to develop SQL interfaces in the MapReduce framework to assist programmers who prefer to use SQL as a high-level language to express their task while leaving all of the execution optimization details to the backend engine [88]. Table 10 shows a summary of several SQL interfaces in the MapReduce framework available in existing literature.
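The general idea behind SQL interfaces on MapReduce, in which a declarative query is expressed by the programmer and the engine turns it into map and reduce steps, can be sketched by hand-translating one query. This is an illustrative assumption about how such a translation might look, not a description of how Hive, Tenzing, or any system in Table 10 actually compiles queries; the table name `sales`, its columns, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical query being translated by hand:
#   SELECT region, SUM(amount) FROM sales WHERE amount > 0 GROUP BY region


def map_row(row):
    """The WHERE clause becomes a filter in the mapper; the GROUP BY column becomes the key."""
    if row["amount"] > 0:
        yield row["region"], row["amount"]


def reduce_group(region, amounts):
    """The aggregate function (SUM) becomes the reducer."""
    return region, sum(amounts)


def run_query(rows):
    # Single-process stand-in for the shuffle step that groups values by key.
    groups = defaultdict(list)
    for row in rows:
        for key, value in map_row(row):
            groups[key].append(value)
    return sorted(reduce_group(key, values) for key, values in groups.items())
```

The programmer only states the query; decisions such as how many mappers to run or how to partition keys are the "execution optimization details" that the text says are left to the backend engine.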
Table 11
Comparison of NoSQL databases.
ASF = Apache Software Foundation, Doc = Document, KV = Key-Value, N/A = No Answer, ✓ = Support, ✕ = Not support.
Table 12
Characteristics of scalable data storage in a cloud environment.
[96] DBMS Faster data access Less attractive for the deployment of large-scale data
Faster processing Limited
[20] Key Value Scales to a very large size
Limitless
[97] Google file system (GFS) Scalable distributed file system for large distributed Garbage collection could become a problem
data-intensive applications Performance might degrade if the number of writers
Delivers high aggregate performance and random writers increases
File data is stored in different chunk servers
[74] Hadoop distributed file Stores large amounts of datasets
system (HDFS) Uses a large cluster
within a short amount of time. Therefore, services must remain operational even in the case of a security breach [98]. In addition, with the increasing number of cloud users, cloud service providers must address the issue of making the requested data available to users to deliver high-quality services. Lee et al. [55] introduced a multi-cloud model called “rain clouds” to support big data exploitation. “Rain clouds” involves cooperation among single clouds to provide accessible resources in an emergency. Schroeck et al. [99] predicted that the demand for more real-time access to data may continue to increase as business models evolve and organizations invest in technologies required for streaming data and smartphones.

variety of data formats, big data can be transformed into an analysis workflow in two ways as shown in Fig. 4. In the case of structured data, the data are pre-processed before they are stored in relational databases to meet the constraints of schema-on-write. The data can then be retrieved for analysis. However, unstructured data must first be stored in distributed databases, such as HBase, before they are processed for analysis. Unstructured data are retrieved from distributed databases after meeting the schema-on-read constraints.

8.5. Data quality
Table 13
Overview of privacy preservation in a cloud.
protection for individuals' data while enjoying the many benefits of big data in society at large [2].

8.9. Governance

Data governance embodies the exercise of control and authority over data-related rules of law, transparency, and accountabilities of individuals and information systems to achieve business objectives [121]. The key issues of big data in cloud governance pertain to applications that consume massive amounts of data streamed from external sources [122]. Therefore, a clear and acceptable data policy must be defined with regard to the type of data that need to be stored, how quickly an individual needs to access the data, and how to access the data [50].

Big data governance involves leveraging information by aligning the objectives of multiple functions, such as telecommunication carriers having access to vast troves of customer information in the form of call detail records and marketing seeking to monetize this information by selling it to third parties [123].

Moreover, big data provides significant opportunities to service providers by making information more valuable. However, developing policies, principles, and frameworks that strike a balance between risk and value in the face of increasing data size, and that deliver better and faster data management technology, can pose huge challenges [124].

Cloud governance recommends the use of various policies together with different models of constraints that limit access to underlying resources. Therefore, adopting governance practices that maintain a balance between risk exposure and value creation is a new organizational imperative to unlock competitive advantages and maximize value from the application of big data in the cloud [124].

9. Open research issues

Numerous studies have addressed a number of significant problems and issues pertaining to the storage and processing of big data in clouds. The amount of data continues to increase at an exponential rate, but the improvement in the processing mechanisms is relatively slow. Only a few tools are available to address the issues of big data processing in cloud environments. State-of-the-art techniques and technologies in many important big data applications (e.g., MapReduce, Dryad, Pregel, Pig Latin, MongoDB, HBase, SimpleDB, and Cassandra) cannot solve the actual problems of storing and querying big data. For example, Hadoop and MapReduce lack query processing strategies and have low-level infrastructures with respect to data processing and management. Despite the plethora of work performed to address the problem of storing and processing big data in cloud computing environments, certain important aspects of storing and processing big data in cloud computing are yet to be solved. Some of these issues are discussed in the subsequent subsections.

9.1. Data staging

The most important open research issue regarding data staging is related to the heterogeneous nature of data. Data gathered from different sources do not have a structured format. For instance, data from mobile cloud-based applications, blogs, and social networking are inadequately structured, much like text messages, videos, and images. Transforming and cleaning such unstructured data before loading them into the warehouse for analysis are challenging tasks. Efforts have been exerted to simplify the transformation process by adopting technologies such as Hadoop and MapReduce to support the distributed processing of unstructured data formats. However, understanding the context of unstructured data is necessary, particularly when meaningful information is required. The MapReduce programming model is the most common model that operates in clusters of computers; it has been utilized to process and distribute large amounts of data.

9.2. Distributed storage systems

Numerous solutions have been proposed to store and retrieve massive amounts of data. Some of these solutions have been applied in a cloud computing environment. However, several issues hinder the successful implementation of such solutions, including the capability of current cloud technologies to provide the necessary capacity and high performance to address massive amounts of data [68], the optimization of existing file systems for the volumes demanded by data mining applications, and how data can be stored in such a manner that they can be easily retrieved and migrated between servers.

9.3. Data analysis

The selection of an appropriate model for large-scale data analysis is critical. Talia [41] pointed out that obtaining useful information from large amounts of data requires scalable analysis algorithms to produce timely results. However, current algorithms are inefficient in terms of big data analysis. Therefore, efficient data analysis tools and technologies are required to process such data. The performance of each algorithm ceases to increase linearly with increasing computational resources. As researchers continue to probe the issues of big data in cloud computing, new problems in big data processing arise from the transitional data analysis techniques. Stream data arriving at speed from different data sources must be processed and compared with historical information within a certain period of time. Such data sources may contain different formats, which makes the integration of multiple sources for analysis a complex task [125].

9.4. Data security

Although cloud computing has transformed modern ICT, several unresolved security threats exist in cloud computing. These security threats are magnified by the volume, velocity, and variety of big data. Moreover, several threats and issues, such as privacy, confidentiality, integrity, and availability of data, exist in big data using cloud computing platforms. Therefore, data security must be measured once data are outsourced to cloud service providers. The cloud must also be assessed at regular
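As a small illustration of the integrity concern named in Section 9.4, the sketch below shows one common way a data owner might detect modification of outsourced data: record a SHA-256 digest before upload and compare it after retrieval. This is not a method proposed in this survey. The function names `sha256_of_chunks` and `is_unmodified` are hypothetical, and the sample records are made up for the example.

```python
import hashlib
import hmac
from typing import Iterable


def sha256_of_chunks(chunks: Iterable[bytes]) -> str:
    """Return the SHA-256 hex digest of data supplied as a sequence of chunks.

    Feeding chunks incrementally avoids holding a large object in memory.
    """
    hasher = hashlib.sha256()
    for chunk in chunks:
        hasher.update(chunk)
    return hasher.hexdigest()


def is_unmodified(chunks: Iterable[bytes], expected_digest: str) -> bool:
    """Check retrieved data against a digest recorded before outsourcing."""
    # compare_digest performs a constant-time comparison of the two hex strings.
    return hmac.compare_digest(sha256_of_chunks(chunks), expected_digest)


if __name__ == "__main__":
    # Hypothetical records, used only for illustration.
    original = [b"record-1\n", b"record-2\n"]
    recorded = sha256_of_chunks(original)  # kept by the data owner, not the provider

    print(is_unmodified(original, recorded))   # True
    tampered = [b"record-1\n", b"record-X\n"]
    print(is_unmodified(tampered, recorded))   # False
```

A digest of this kind only detects changes to the data. It does not address the privacy, confidentiality, or availability issues listed above, and it assumes the recorded digest itself is stored somewhere the provider cannot alter.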
I.A.T. Hashem et al. / Information Systems 47 (2015) 98–115 113
[28] L. Neumeyer, B. Robbins, A. Nair, A. Kesari, S4: Distributed Stream Computing Platform, Data Mining Workshops (ICDMW), 2010 IEEE International Conference on, 2010, pp. 170–177.
[29] J. Hurwitz, A. Nugent, F. Halper, M. Kaufman, Big data for dummies, For Dummies (2013).
[30] M. Armbrust, A. Fox, R. Griffith, A.D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, M. Zaharia, A view of cloud computing, Commun. ACM 53 (2010) 50–58.
[31] L. Huan, Big data drives cloud adoption in enterprise, IEEE Internet Comput. 17 (2013) 68–71.
[32] S. Pandey, S. Nepal, Cloud computing and scientific applications – big data, Scalable Anal. Beyond, Futur. Gener. Comput. Syst. 29 (2013) 1774–1776.
[33] D. Warneke, O. Kao, Nephele: efficient parallel data processing in the cloud, in: Proceedings of the 2nd workshop on many-task computing on grids and supercomputers, ACM, 2009, p. 8.
[34] P. Mell, T. Grance, The NIST definition of cloud computing (draft), NIST Spec. Publ. 800 (2011) 7.
[35] A. Giuseppe, B. Alessio, D. Walter, P. Antonio, Survey cloud monitoring: a survey, Comput. Netw. 57 (2013) 2093–2115.
[36] T. Gunarathne, B. Zhang, T.-L. Wu, J. Qiu, Scalable parallel computing on clouds using Twister4Azure iterative MapReduce, Futur. Gener. Comput. Syst. 29 (2013) 1035–1048.
[37] A. O’Driscoll, J. Daugelaite, R.D. Sleator, ‘Big data’, Hadoop and cloud computing in genomics, J. Biomed. Inform. 46 (2013) 774–781.
[38] N. Fernando, S.W. Loke, W. Rahayu, Mobile cloud computing: a survey, Futur. Gener. Comput. Syst. 29 (2013) 84–106.
[39] R. Holman, Mobile Cloud Application Revenues To Hit $9.5 billion by 2014, Driven by Converged Mobile Services, in: The Juniper Research, 2010.
[40] Z. Sanaei, S. Abolfazli, A. Gani, R. Buyya, Heterogeneity in mobile cloud computing: taxonomy and open challenges, IEEE Commun. Surv. Tutor. (2013) 1–24.
[41] D. Talia, Clouds for scalable big data analytics, Computer 46 (2013) 98–101.
[42] C. Ji, Y. Li, W. Qiu, U. Awada, K. Li, Big data processing in cloud computing environments, Pervasive Systems, Algorithms and Networks (ISPAN), 2012, in: Proceedings of the 12th International Symposium on, IEEE, 2012, pp. 17–23.
[43] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM 51 (2008) 107–113.
[44] D. Bollier, C.M. Firestone, The Promise and Peril of Big Data, Aspen Institute, Communications and Society Program, Washington, DC, USA, 2010.
[45] H.E. Miller, Big-data in cloud computing: a taxonomy of risks, Inf. Res. 18 (2013) 571.
[46] O. Kwon, N. Lee, B. Shin, Data quality management, data usage experience and acquisition intention of big data analytics, Int. J. Inf. Manag. 34 (3) (2014) 387–394.
[47] K. Singh, S.C. Guntuku, A. Thakur, C. Hota, Big data analytics framework for peer-to-peer botnet detection using random forests, Inf. Sci. (2014).
[48] J.L. Schnase, D.Q. Duffy, G.S. Tamkin, D. Nadeau, J.H. Thompson, C.M. Grieg, M.A. McInerney, W.P. Webster, MERRA Analytic Services: Meeting the Big Data challenges of climate science through cloud-enabled Climate Analytics-as-a-Service, Computers, Environment and Urban Systems, (2014).
[49] B.K. Tannahill, M. Jamshidi, System of systems and big data analytics – bridging the gap, Comput. Electr. Eng. 40 (2014) 2–15.
[50] J. Abawajy, Symbioses of Big Data and Cloud Computing: Opportunities & Challenges, (2013).
[51] S. Aluru, Y. Simmhan, A special issue of journal of parallel and distributed computing: scalable systems for big data management and analytics, J. Parallel Distrib. Comput. 73 (2013) 896.
[52] S. Hipgrave, Smarter fraud investigations with big data analytics, Netw. Secur. 2013 (2013) 7–9.
[53] Z. Linquan, W. Chuan, L. Zongpeng, G. Chuanxiong, C. Minghua, F.C.M. Lau, Moving big data to the cloud: an online cost-minimizing approach, IEEE J. Sel. Areas Commun. 31 (2013) 2710–2721.
[54] H. Demirkan, D. Delen, Leveraging the capabilities of service-oriented decision support systems: putting analytics and big data in cloud, Decis. Support Syst. 55 (2013) 412–421.
[55] S. Lee, H. Park, Y. Shin, Cloud computing availability: multi-clouds for big data service, Communications in Computer and Information Science 310 (2012) 799–806.
[56] S.N. Srirama, P. Jakovits, E. Vainikko, Adapting scientific computing problems to clouds using MapReduce, Futur. Gener. Comput. Syst. 28 (2012) 184–192.
[57] W. Yan, U. Brahmakshatriya, Y. Xue, M. Gilder, B. Wise, p-PIC: parallel power iteration clustering for big data, J. Parallel Distrib. Comput. 73 (3) (2012) 352–359.
[58] E.E. Schadt, M.D. Linderman, J. Sorenson, L. Lee, G.P. Nolan, Cloud and heterogeneous computing solutions exist today for the emerging big data problems in biology, Nat. Rev. Genet. 12 (2011) 224.
[59] Amazon, AWS Case Study: SwiftKey. 〈http://aws.amazon.com/solutions/case-studies/big-data〉, (accessed 05.07.14).
[60] Microsoft, 343 Industries Gets New User Insights from Big Data in the Cloud. 〈http://www.microsoft.com/casestudies/〉, (accessed 15.07.14).
[61] Google, Case study: How redBus uses BigQuery to Master Big Data. 〈https://developers.google.com/bigquery/case-studies/〉, (accessed 22.07.14).
[62] Cloudera, Nokia: Using Big Data to Bridge the Virtual & Physical Worlds. 〈http://www.cloudera.com/content/dam/cloudera/documents/Cloudera-Nokia-case-study-final.pdf〉, (accessed 24.07.14).
[63] Alacer, Case Studies: Big Data. 〈http://www.alacergroup.com/practice-category/big-data/case-studies-big-data/〉, (accessed 24.07.14).
[64] J.G. Reid, A. Carroll, N. Veeraraghavan, M. Dahdouli, A. Sundquist, A. English, M. Bainbridge, S. White, W. Salerno, C. Buhay, Launching genomics into the cloud: deployment of Mercury, a next generation sequence analysis pipeline, BMC Bioinf. 15 (2014) 30.
[65] P. Noordhuis, M. Heijkoop, A. Lazovik, Mining twitter in the cloud: A case study, Cloud Computing (CLOUD), 2010, in: Proceedings of IEEE 3rd International Conference on, IEEE, Miami, FL, 2010, pp. 107–114.
[66] C. Zhang, H. De Sterck, A. Aboulnaga, H. Djambazian, R. Sladek, Case study of scientific data processing on a cloud using hadoop, High Performance Computing Systems and Applications, Springer, 2010, 400–415.
[67] F. Faghri, S. Bazarbayev, M. Overholt, R. Farivar, R.H. Campbell, W.H. Sanders, Failure scenario as a service (FsaaS) for Hadoop clusters, in: Proceedings of the Workshop on Secure and Dependable Middleware for Cloud Monitoring and Management, ACM, 2012, p. 5.
[68] N. Leavitt, Storage challenge: where will all that big data go? Computer 46 (2013) 22–25.
[69] K. Strauss, D. Burger, What the future holds for solid-state memory, Computer 47 (2014) 24–31.
[70] K. Mayama, W. Skulkittiyut, Y. Ando, T. Yoshimi, M. Mizukawa, Proposal of object management system for applying to existing object storage furniture, System Integration (SII), 2011 IEEE/SICE International Symposium on, IEEE, 2011, pp. 279–282.
[71] W. Hu, D. Hu, C. Xie, F. Chen, IEEE International Conference on A New Data Format and a New Error Control Scheme for Optical-Storage Systems, Networking, Architecture, and Storage, 2007, NAS 2007 2007, pp. 193–198.
[72] L. Hao, D. Han, IEEE Conference on The study and design on secure-cloud storage system, Electrical and Control Engineering (ICECE), 2011 International 2011, pp. 5126–5129.
[73] T. White, Hadoop: The Definitive Guide, O’Reilly Media, Sebastopol, CA, 2009.
[74] K. Shvachko, K. Hairong, S. Radia, R. Chansler, The Hadoop Distributed File System, Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, 2010, pp. 1–10.
[75] S. Ghemawat, H. Gobioff, S.-T. Leung, The Google file system, ACM SIGOPS Oper. Syst. Rev. ACM 37 (5) (2003) 29–43.
[76] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, G. Fox, Twister: a runtime for iterative mapreduce, in: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ACM, 2010, pp. 810–818.
[77] A. Thusoo, J.S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, R. Murthy, Hive: a warehousing solution over a map-reduce framework, Proc. VLDB Endow. 2 (2009) 1626–1629.
[78] L. George, HBase: The Definitive Guide, O’Reilly Media, Inc., Sebastopol, CA, 2011.
[79] S. Owen, R. Anil, T. Dunning, E. Friedman, Mahout in action, Manning Publications Co., 2011.
[80] C. Olston, B. Reed, U. Srivastava, R. Kumar, A. Tomkins, Pig latin: a not-so-foreign language for data processing, in: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, ACM, 2008, pp. 1099–1110.
[81] P. Hunt, M. Konar, F.P. Junqueira, B. Reed, ZooKeeper: wait-free coordination for internet-scale systems, in: Proceedings of the 2010 USENIX conference on USENIX annual technical conference, 2010, pp. 11–11.
[82] M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, I. Stoica, Spark: cluster computing with working sets, in: Proceedings of
the 2nd USENIX conference on Hot topics in cloud computing, 2010, pp. 10–10.
[83] A. Rabkin, R. Katz, Chukwa: A system for reliable large-scale log collection, in: Proceedings of the 24th international conference on Large installation system administration, USENIX Association, 2010, pp. 1–15.
[84] A. Cassandra, The Apache Cassandra project, in.
[85] S. Hoffman, Apache Flume: Distributed Log Collection for Hadoop, Packt Publishing Ltd., Birmingham, UK, 2013.
[86] X. Zhifeng, X. Yang, Security and privacy in cloud computing, IEEE Commun. Surv. Tutor. 15 (2013) 843–859.
[87] T. Gunarathne, T.-L. Wu, J. Qiu, G. Fox, MapReduce in the Clouds for Science, IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), 2010, pp. 565–572.
[88] S. Sakr, A. Liu, A.G. Fayoumi, The family of MapReduce and large-scale data processing systems, ACM Comput. Surv. (CSUR) 46 (2013) 11.
[89] K.S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Eltabakh, C.-C. Kanne, F. Ozcan, E.J. Shekita, Jaql: a scripting language for large scale semistructured data analysis, Proc. VLDB Conf. (2011).
[90] L. Lin, V. Lychagina, W. Liu, Y. Kwon, S. Mittal, M. Wong, Tenzing a sql implementation on the mapreduce framework, (2011).
[91] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, A. Rasin, HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads, Proc. VLDB Endow. 2 (2009) 922–933.
[92] E. Friedman, P. Pawlowski, J. Cieslewicz, SQL/MapReduce: a practical approach to self-describing, polymorphic, and parallelizable user-defined functions, Proc. VLDB Endow. 2 (2009) 1402–1413.
[93] R. Pike, S. Dorward, R. Griesemer, S. Quinlan, Interpreting the data: parallel analysis with Sawzall, Sci. Progr. 13 (2005) 277–298.
[94] R. Cattell, Scalable SQL and NoSQL data stores, ACM SIGMOD Record, 39 (4), ACM New York, NY, USA, 2011, 12–27.
[95] Z. Wang, Y. Chu, K.-L. Tan, D. Agrawal, A.E. Abbadi, X. Xu, Scalable Data Cube Analysis over Big Data, arXiv preprint arXiv:1311.5663 (2013).
[96] R. Ramakrishnan, J. Gehrke, Database Management Systems, Osborne/McGraw-Hill, New York, 2003.
[97] S. Ghemawat, H. Gobioff, S.-T. Leung, The Google file system, SIGOPS Oper. Syst. Rev. 37 (2003) 29–43.
[98] D. Zissis, D. Lekkas, Addressing cloud computing security issues, Futur. Gener. Comput. Syst. 28 (2012) 583–592.
[99] M. Schroeck, R. Shockley, J. Smart, D. Romero-Morales, P. Tufano, Analytics: The real-world use of big data, IBM Global Business Services, 2012.
[100] R. Sravan Kumar, A. Saxena, Data integrity proofs in cloud storage, in: Proceedings of the Third International Conference on Communication Systems and Networks (COMSNETS), 2011, pp. 1–4.
[101] R. Akerkar, Big Data Computing, CRC Press, 2013.
[102] T.C. Redman, A. Blanton, Data Quality for the Information Age, Artech House, Inc., Norwood, MA, USA, 1997.
[103] D.M. Strong, Y.W. Lee, R.Y. Wang, Data quality in context, Commun. ACM 40 (1997) 103–110.
[104] K. Weber, G. Rincon, A. Van Eenennaam, B. Golden, J. Medrano, Differences in allele frequency distribution of bovine high-density genotyping platforms in holsteins and jerseys, Western section American society of Animal science, 2012, p. 70.
[105] D. Che, M. Safran, Z. Peng, From big data to big data mining: challenges, issues, and opportunities, in: B. Hong, X. Meng, L. Chen, W. Winiwarter, W. Song (Eds.), Database Systems for Advanced Applications, Springer, Berlin Heidelberg, 2013, pp. 1–15.
[106] O. Tene, J. Polonetsky, Privacy in the age of big data: a time for big decisions, Stanford Law Review Online 64 (2012) 63.
[107] L. Hsiao-Ying, W.G. Tzeng, A secure erasure code-based cloud storage system with secure data forwarding, Parallel and Distributed Systems, IEEE Transactions on 23 (2012) pp. 995–1003.
[108] C. Ning, W. Cong, M. Li, R. Kui, L. Wenjing, Privacy-preserving multi-keyword ranked search over encrypted cloud data, INFOCOM, 2011 Proceedings IEEE, 2011, pp. 829–837.
[109] C.E. Shannon, Communication theory of secrecy systems, Bell Syst. Tech. J. 28 (1949) 656–715.
[110] L. Kocarev, G. Jakimoski, Logistic map as a block encryption algorithm, Phys. Lett. 289 (4–5) (2001) 199–206.
[111] Z. Xuyun, L. Chang, S. Nepal, S. Pandey, C. Jinjun, A Privacy Leakage Upper Bound Constraint-Based Approach for Cost-Effective Privacy Preserving of Intermediate Data Sets in Cloud, Parallel and Distributed Systems, IEEE Transactions on 24 (2013) pp. 1192–1202.
[112] C.-I. Fan, S.-Y. Huang, Controllable privacy preserving search based on symmetric predicate encryption in cloud storage, Futur. Gener. Comput. Syst. 29 (2013) 1716–1724.
[113] R. Li, Z. Xu, W. Kang, K.C. Yow, C.-Z. Xu, Efficient multi-keyword ranked query over encrypted data in cloud computing, Futur. Gener. Comput. Syst. (2013).
[114] A. Squicciarini, S. Sundareswaran, D. Lin, Preventing Information Leakage from Indexing in the Cloud, Cloud Computing (CLOUD), 2010 IEEE 3rd International Conference on, 2010, pp. 188–195.
[115] S. Bhagat, G. Cormode, B. Krishnamurthy, D. Srivastava, Privacy in dynamic social networks, in: Proceedings of the 19th international conference on World wide web, ACM, Raleigh, North Carolina, USA, 2010, pp. 1059–1060.
[116] W. Itani, A. Kayssi, A. Chehab, Privacy as a Service: Privacy-Aware Data Storage and Processing in Cloud Computing Architectures, Dependable, Autonomic and Secure Computing, 2009. DASC ’09, in: Proceedings of the Eighth IEEE International Conference on, 2009, pp. 711–716.
[117] D. Agrawal, C.C. Aggarwal, On the design and quantification of privacy preserving data mining algorithms, in: Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ACM, Santa Barbara, California, USA, 2001, pp. 247–255.
[118] Z. Xuyun, L. Chang, S. Nepal, D. Wanchun, C. Jinjun, Privacy-Preserving Layer over MapReduce on Cloud, in: International Conference on Cloud and Green Computing (CGC), 2012, pp. 304–310.
[119] D.P. Bertsekas, Nonlinear programming, (1999).
[120] C. Tankard, Big data security, Netw. Secur. 2012 (2012) 5–8.
[121] P. Malik, Governing big data: principles and practices, IBM J. Res. Dev. 57 (1) (2013) 1:1–1:13.
[122] D. Loshin, Chapter 5 – data governance for big data analytics: considerations for data policies and processes, in: D. Loshin (Ed.), Big Data Analytics, Morgan Kaufmann, Boston, 2013, pp. 39–48.
[123] S. Soares, Big Data Governance, Sunilsoares, 2012.
[124] P.P. Tallon, Corporate governance of big data: perspectives on value, risk, and cost, Computer 46 (2013) 32–38.
[125] M.D. Assuncao, R.N. Calheiros, S. Bianchi, M.A. Netto, R. Buyya, Big Data Computing and Clouds: Challenges, Solutions, and Future Directions, arXiv preprint arXiv:1312.4722, (2013).
[126] Khan, Abdul Nasir, et al., BSS: block-based sharing scheme for secure data storage services in mobile cloud environment, The Journal of Supercomputing (2014) 1–31.
[127] Khan, Abdul Nasir, et al., Incremental proxy re-encryption scheme for mobile cloud computing environment, The Journal of Supercomputing 68 (2) (2014) 624–651.