Uploading and Replicating Internet of Things (IoT) Data on Distributed Cloud Storage
I. INTRODUCTION

The Internet of Things (IoT) phenomenon is expected to usher in as many as 25 billion devices by the year 2020 [1]. As these devices (sensors, actuators, and the gateways, routers and switches that manage them) are not equipped with extensive storage, their data has to be stored outside them. Typical solutions so far have included storing all this data in centralized cloud data centers [1]. However, the data explosion is expected to overwhelm the capacity of even these data centers.

Additionally, transferring all this data to a centralized location would render data analysis difficult. Thus distributed data storage, among multiple geographically distributed mini-data centers (which we call mini-Clouds), is called for, as depicted in Figure 1. Note that, in this system setup, a group of sensors upload their data to a single sensor gateway, which is then responsible for pushing the data collected over a period of time to the mini-Clouds.

One of the key research problems in such a scenario is how to upload data from the sensor gateways and replicate the data in multiple mini-Clouds to cater to high availability and disaster recovery requirements. This calls for the data to be optimally replicated among the mini-Clouds in a minimal amount of time. The time taken is a function of many parameters such as network bandwidth, available data storage on each mini-Cloud, number of data items at each mini-Cloud and amount of data to be replicated. To that end, in this paper, we propose an approach to address this problem. As an optimal solution of this problem is intractable, we propose a number of heuristics for solving it.

We use simulation for evaluating different data configurations and the heuristics we propose. Our results show that scheduling the data items for transfer from the gateways to the mini-Clouds based on the highest bandwidth requirement first is optimal in most cases and can be up to 12 times as effective as other schedules based on different ordering heuristics.

The key contributions of this paper are:
• Formulating the problem of uploading data from a set of sensor gateways to multiple distributed mini-Clouds, taking the replication requirements for each data item into account. The formal problem is presented with the parameters of interest.
• Investigation of the performance of eleven possible orderings of the data items used in the uploading and replication. These orderings are based on the parameters considered in the problem.
• An analysis, based on simulation over the parameter space, with results.

The rest of this paper is organized as follows. Section II analyzes existing work in this space. We then present a formal description of the problem in Section III. Our solution approach follows this in Section IV. Finally, the paper concludes in Section V with suggestions for future work.
II. RELATED WORK

Most replica placement problems aim to minimize access time. [2] presents a replication strategy for reducing access latency and bandwidth utilization in data grid environments; the strategy applies k-means and a p-center model based on the weighted average of response time. In [3], the authors present a replication strategy for optimally replicating objects in CDN (Content Distribution Network) servers using heuristics based on object popularity and the storage capacity of the nodes, under different topologies. In [4], the proposed replication system tries to reduce delay, energy consumption, and cost for cloud uploading of IoT applications given the massive number of devices with tiny memory sizes; the author proposes deployment of local cloud computing resources to address the LTE architectural bottlenecks within the radio access network and proposes a memory replication protocol. In [5], the authors propose an object replication and placement scheme for wireless mesh networks (WMNs) through a divide-and-conquer strategy. In [6], the authors exploit graph partitioning to obtain a hierarchical replica placement scheme in WMNs.

[7] presents a replication strategy which balances the load among the servers when storing large volumes of data in Cloud data centers. In [8], the authors propose a method to evenly distribute data among the available storage by using a pseudo-random data distribution function. In [9], the authors present a replica placement scheme for cloud storage in an IoT environment in the health care domain; it uses a MOX (Mosquitoes Oviposition Mating) algorithm [10] to find a data replica placement. In [13], the reliability of clustered versus declustered replica placements in data storage systems is discussed. In [14], a self-managed key-value store is proposed which dynamically allocates the resources of a data cloud to several applications and thus maintains differentiated availability guarantees for different application requirements.

All of the efforts presented here attempt to store data in a way that minimizes the cost of accessing it later, and none of them consider the user-specified policy specifications associated with data items. Although the scenario discussed in [4] is quite similar to ours, our focus in this paper is to minimize the time taken to upload the data from the gateways and replicate the data amongst the different mini-Clouds. Our earlier work [15] presents an approach for optimal distribution of data among various mini-Clouds in a cloud-based IoT network, and also presents optimal data migration algorithms for migrating excess data between mini-Cloud storages to mitigate storage capacity issues. However, that work did not consider data replication as a requirement.

In comparison to the above, our approach has two primary advantages. First, it is based on the infrastructure and inherent properties of IoT data. It considers data replication in IoT in terms of multiple parameters, and proposes a framework by which the replication problem can be solved by ordering the transfers using different inherent properties of this scenario. Second, our approach is specific to IoT, since it deals with data items that are small in size but abundant in number, and tries to determine the nature of the time spent on data transfer as well as the amount of data moved with each transfer.

III. PROBLEM FORMULATION

Our system environment is modeled as E = <G, C, P>, where G is a set of gateways, C is a set of mini-Cloud storages and P is a set of policies. Gateways are responsible for receiving sensor data and forwarding it to a mini-Cloud. A gateway is modeled as G = <I, D>, where I is the gateway identifier and D is the set of data items in the gateway. A data item is modeled as D = <S, L>, where S is the data identifier and L is the size of the data in bytes. A mini-Cloud is modeled as C = <T, A, R, W>, where T is the total capacity in bytes, A is the available capacity in bytes, R is the read access latency and W is the write latency (bytes/sec) per unit data size. A policy P is modeled as P = <S, N>, where S is a data item and N is the required number of replicas.

Our research problem therefore is: given a network topology connecting the gateways and the mini-Clouds and the available bandwidths of the links in this topology, minimize the total time required to transfer the data from all of the gateways to the mini-Clouds and to replicate each data item according to the defined policy. It can be formally defined subject to the following constraints:

(i) $c \neq c'$, i.e. the source mini-Cloud cannot be the destination for data item d;
(ii) $\sum_{d=1,\ldots,m} L_d < \sum_{c=1,\ldots,n} A_c$, i.e. the total data must fit within the total available capacity;
(iii) $\sum_{d \subset D} L_d < A_c$ for each mini-Cloud c, i.e. the subset of data items placed on a mini-Cloud must fit within its available capacity;

where
n is the number of mini-Clouds,
m is the number of data items,
L_d is the size of data item D_d placed on Gateway G_j,
C' = R_d − 1 (1) is the number of mini-Cloud-to-mini-Cloud replica transfers for D_d (the first copy being uploaded from the gateway),
R_d is the number of replications required for D_d,
T_dj is the waiting time of data D_d at Gateway G_j,
T_dc is the waiting time of data D_d at mini-Cloud C_c,
B_jc is the rate of data transfer from Gateway G_j to mini-Cloud C_c,
B_cc' is the rate of data transfer from mini-Cloud C_c to mini-Cloud C_c',
R_j is the read latency at Gateway G_j,
R_c is the read latency at mini-Cloud C_c, and
W_c (resp. W_c') is the write latency at the source (resp. destination) mini-Cloud.
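To make the formulation concrete, the following minimal Python sketch shows one possible in-memory representation of the entities defined above, together with checks for capacity constraints (ii) and (iii). The class and field names are our own illustrative choices and are not part of the paper.

from dataclasses import dataclass, field
from typing import List

@dataclass
class DataItem:              # D = <S, L>
    ident: str               # S: data identifier
    size: int                # L: size in bytes

@dataclass
class Gateway:               # G = <I, D>
    ident: str               # I: gateway identifier
    items: List[DataItem] = field(default_factory=list)

@dataclass
class MiniCloud:             # C = <T, A, R, W>
    ident: str
    total: int               # T: total capacity in bytes
    available: int           # A: available capacity in bytes
    read_latency: float      # R: read access latency
    write_latency: float     # W: write latency per unit data size

@dataclass
class Policy:                # P = <S, N>
    data_ident: str          # S: data item identifier
    replicas: int            # N: required number of replicas

def capacity_feasible(gateways: List[Gateway], clouds: List[MiniCloud]) -> bool:
    """Constraint (ii): all data must fit within the total available capacity."""
    total_data = sum(d.size for g in gateways for d in g.items)
    return total_data < sum(c.available for c in clouds)

def placement_feasible(items: List[DataItem], cloud: MiniCloud) -> bool:
    """Constraint (iii): the data items placed on one mini-Cloud must fit its capacity."""
    return sum(d.size for d in items) < cloud.available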
The time complexity of the exact solution to the above problem can be reasoned as follows:
1) Let there be K data items. Each data item would be uploaded to a mini-Cloud. Hence, if R replications are needed, each data item would be placed at R mini-Clouds.
2) Hence there must be at least one assignment of R mini-Clouds to each data item that minimizes the replication time.
3) The gateway to mini-Cloud storage link would be used at most once for each data item.
4) The process of uploading can therefore be viewed as choosing the first mini-Cloud for each data item out of the selected R mini-Clouds for that data item; the other data transfers are to the remaining (R−1) mini-Clouds.
5) Hence the total time complexity has two parts: choose R from C for each of the K data items, and choose one from the selected R mini-Clouds for each of the data items.
6) So the total complexity of this problem is of the order of $[^{C}C_{R}]^{K} \times R \times K$.

Since this time complexity is obviously intractable, we present a heuristic solution in the next section.

IV. OUR SOLUTION

We now present our replication strategy, which is based on an ordering of data items that assigns a sequence to all the data items for uploading from the gateways to the mini-Clouds. We must assign the order of transfer of these data items at each iteration till all required replicas are stored on the mini-Clouds. We assume a store-and-forward type of transfer; hence, for any transfer to be started, a completely copied data item must be present at the source.

For the implementation of our ordering-based heuristic, we need an abstraction for simulating the transfer or copying to the mini-Clouds from the gateways. We call this abstraction a JOB. We model each JOB as a 4-tuple comprising the following: source identifier, destination identifier, value of the bandwidth of the link between source and destination, and data identifier. A JOB represents the transfer of a data item from either a gateway to a mini-Cloud or from one mini-Cloud to another (the latter for replication purposes). The transfer occurs when the following conditions are satisfied: (1) the data identified by the data identifier of a JOB should be available at the source identified by the source identifier in the JOB, (2) the data item should not be present at the destination identified by the destination identifier of the JOB, and (3) no transfer is taking place between the source and destination at the moment. In other words, each JOB simulates a transfer process in the program for a particular data item over a pair of source and destination for a required span of time. The source of a transfer may be a gateway or a mini-Cloud, and the destination of a transfer is always a mini-Cloud.

Given the set of all possible JOBs at any point in time, ours is a greedy heuristic which makes the following assumption: any data item will be sent only once from the gateway to a mini-Cloud. Since the uplinks from the gateways are usually not as reliable and robust as the links between the mini-Clouds, replication of data items is done from the initial destination mini-Cloud to other mini-Clouds opportunistically and greedily. However, the sequence in which the data items will be sent to the mini-Clouds, as well as replicated amongst different mini-Clouds, will be one of many possible sequences depending on how we order these transfers. There are eleven possible such sequences, as we shall show below. Our algorithm is formally described in Algorithm 1 and encodes a greedy strategy in that we keep all network links busy as long as there is a data item at the source end of a network link that has not had its replication factor satisfied.

As stated earlier, we customize our greedy algorithm using one of the following orderings in which the JOBs will be scheduled (a code sketch of these orderings appears below):
1) Arrival of data - earliest data first.
2) Smallest remaining bandwidth link first.
3) Largest remaining bandwidth link first.
4) Smallest available space at destination first - tightest destination first.
5) Largest available space at destination first - loosest destination first.
6) Data item with smallest data size first.
7) Data item with largest data size first.
8) Smallest ratio of (data size/bandwidth available) first.
9) Largest ratio of (data size/bandwidth available) first.
10) Smallest value of (bandwidth available * data size) first.
11) Largest value of (bandwidth available * data size) first.

We have evaluated all these orderings in our simulation. Note that our heuristic is greedy in the following ways:
• JOBs are scheduled as soon as the links over which the transfer needs to happen become free. Thus, if a gateway or mini-Cloud has more than one uplink to the network, more than one JOB can be in progress at the same time.
• As soon as a JOB completes, a new set of JOBs is immediately spawned for the new set of available data items and free links, until the replication factor specified for each data item has been satisfied.
• The ordering of the set of all JOBs is done on the basis of one of the selected orderings in any single simulation run.

The outermost loop of the algorithm executes until no more JOBs remain to be scheduled.
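As an illustration of how the JOB abstraction and the orderings could be realized, the following Python sketch (reusing the Gateway/MiniCloud/DataItem classes sketched in Section III) defines a JOB 4-tuple, sort keys for a few representative orderings, and a single greedy scheduling pass in the spirit of Algorithm 1. The names and the subset of orderings shown are illustrative assumptions, not code from the paper.

from collections import namedtuple

# A JOB is a 4-tuple: source, destination, bandwidth of the connecting link, data item.
Job = namedtuple("Job", ["source", "dest", "bandwidth", "item"])

# Sort keys for a few representative orderings (a smaller key is scheduled earlier).
ORDERINGS = {
    2:  lambda j: j.bandwidth,                   # smallest remaining bandwidth link first
    3:  lambda j: -j.bandwidth,                  # largest remaining bandwidth link first
    6:  lambda j: j.item.size,                   # smallest data size first
    7:  lambda j: -j.item.size,                  # largest data size first
    9:  lambda j: -(j.item.size / j.bandwidth),  # largest (data size / bandwidth) first
    11: lambda j: -(j.bandwidth * j.item.size),  # largest (bandwidth * data size) first
}

def order_jobs(jobs, ordering_id):
    """Return the JOB set sorted according to the chosen ordering heuristic."""
    return sorted(jobs, key=ORDERINGS[ordering_id])

def greedy_pass(jobs, ordering_id):
    """One greedy pass: schedule every JOB whose link is free, skipping JOBs whose
    destination lacks space (a simplified single iteration of Algorithm 1)."""
    busy_links, scheduled = set(), []
    for job in order_jobs(jobs, ordering_id):
        if job.dest.available < job.item.size:
            continue                              # destination has no space for this item
        link = (job.source.ident, job.dest.ident)
        if link not in busy_links:
            scheduled.append(job)                 # start the transfer
            busy_links.add(link)                  # link stays busy until the JOB completes
    return scheduled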
ALGORITHM 1: Greedy Heuristic
Data: a set of gateways with data items, a set of mini-Clouds with known capacity, and a set of communication links among them
Result: time taken to transfer all data items from the gateways and to replicate all data items
  Initialization;
  let job J be composed of data item Dj, source Sourcej, destination Destj and link Lj, reflecting a transfer event;
  let S be the set of all possible JOBs at any point in time; for D data items, M mini-Clouds and L links/gateways, S will have D × M × L JOBs initially;
  repeat
    order S based on the ordering heuristic;
    foreach JOBj ∈ S do
      if Destj has no space for Dj then
        delete JOBj;
      end
      if Lj is not busy then
        schedule JOBj;
        delete all JOBk from S such that Dk = Dj and Sourcek = Sourcej;
        mark Lj busy;
      end
    end
    in parallel, when any JOBj completes, mark Lj free;
  until S is empty;

A system configuration is characterized by four binary parameters: the read/write latency of the mini-Cloud storage (non-zero or negligible), the link bandwidth (variable or constant), the mini-Cloud storage capacity (limited or unlimited), and the data item size (constant or variable). A configuration is represented as a tuple of a collection of these parameters. Figure 2 illustrates the possible system configurations presented as a tree. A leaf node of this tree is a combination of these 4 parameters with specific values. For example, ACFH represents a configuration where the mini-Clouds have non-zero read and write latencies for their storage, network links have variable bandwidth, the mini-Clouds each have unlimited storage, and all the data items are of constant size. Other configurations such as BCFI and ADFH can be similarly interpreted.

We evaluated the configurations corresponding to every one of the 16 leaf nodes of the tree in Figure 2, for our greedy approach (Algorithm 1) with the eleven types of ordering presented above. We also varied the number of data items and present experimental results for the same.

Our simulation assumes the following fixed parameter values for all cases:
• Number of gateways: 50
• Number of mini-Cloud storages: 30
• Number of uplinks or downlinks: 2 for each gateway or mini-Cloud storage
• GC (gateway to mini-Cloud) link bandwidth: 1 Mbps to 5 Mbps
• CC (mini-Cloud to mini-Cloud) link bandwidth: 25 Mbps to 100 Mbps
• Write latency at cloud storage: 10 MB/sec to 25 MB/sec
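For concreteness, the snippet below enumerates the 16 leaf configurations of Figure 2 from the four binary parameters and records the fixed simulation constants listed above. The letter codes follow the configuration names used in this paper, while the dictionary layout is our own illustrative choice.

from itertools import product

# Letter codes for the four binary system parameters (cf. Figure 2).
LATENCY   = {"A": "non-zero read/write latency", "B": "negligible latency"}
BANDWIDTH = {"C": "variable link bandwidth",     "D": "constant link bandwidth"}
STORAGE   = {"E": "limited mini-Cloud storage",  "F": "unlimited mini-Cloud storage"}
DATA_SIZE = {"H": "constant data item size",     "I": "variable data item size"}

# The 16 leaf configurations of the tree, e.g. "ACFH", "BDFI", "ADEI", ...
CONFIGURATIONS = ["".join(c) for c in product(LATENCY, BANDWIDTH, STORAGE, DATA_SIZE)]

# Fixed simulation parameters shared by all configurations (values as stated above).
SIMULATION = {
    "gateways": 50,
    "mini_clouds": 30,
    "links_per_node": 2,                  # uplinks/downlinks per gateway or mini-Cloud
    "gc_bandwidth_mbps": (1, 5),          # gateway-to-mini-Cloud links
    "cc_bandwidth_mbps": (25, 100),       # mini-Cloud-to-mini-Cloud links
    "write_latency_mb_per_s": (10, 25),   # write latency at cloud storage
}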
Fig. 2. The System Configuration Space as a Tree
Fig. 6. BDFI Configuration
Fig. 7. BDFH Configuration

For configurations BDFI (negligible latency, constant bandwidth, unlimited storage and variable data size; see Figure 6) and BDFH (negligible latency, constant bandwidth, unlimited storage and constant data size; see Figure 7), we observe that orderings 2 and 3 show the better results, along with orderings 9 and 11. One would conclude from this that data size, together with bandwidth, plays a vital role in the replication time in these instances.

The graph for system configuration ACEI (latency, variable bandwidth, limited storage and variable data size) is shown in Figure 9. This graph suggests that orderings 9 (larger data size/bandwidth first) and 11 (larger bandwidth*data size first) yield the best results. Hence this leads us to conclude that data size and bandwidth are key criteria for optimal performance in this case, and in general that bandwidth and data size play an important role in our problem scenario. An important observation is that the performance for this configuration, even in the best case, is several times worse than the performance for the B* configurations. For 2000 data items the best performance is almost 7 to 8 times worse than the best performance for BDFI.

The graph for system configuration ACFI (latency, variable bandwidth, unlimited storage and variable data size) is shown in Figure 10. This also shows that orderings 7 and 9 provide better performance. We see similar results for configuration ADEI (latency, constant bandwidth, limited storage and variable data size; see Figure 11).

We see similar results for configurations ACFH (latency, variable bandwidth, unlimited storage and constant data size), ACEH (latency, variable bandwidth, limited storage and constant data size), ADEH (latency, constant bandwidth, limited storage and constant data size) and ADFH (latency, constant bandwidth, unlimited storage and constant data size). However, none of these shows any appreciable difference among the orderings, since a constant data size appears to invalidate the other ordering criteria. Moreover, as each gateway has a large number of data items, sequencing does not perform well in this case.

The final configuration is ADFI (latency, constant bandwidth, unlimited storage and variable data size; see Figure 12). Here, orderings 2 and 3, along with ordering 10, show better results, leading us to conclude that smaller-bandwidth jobs need to be scheduled first in such a configuration.

Finally, we present a comparison of all ordering strategies for all possible system configurations in one graph in Figure 8. This clearly shows that the best performance we can hope to achieve with any ordering under the A* configurations is significantly worse than the performance we can achieve under the B* configurations, leading us to conclude that performance is highly sensitive to the read and write latency of storage at the mini-Clouds. In fact, the ratio of the best to worst performance for 2000 data items is more than 12 times, which is significant.

C. Observations

Based on the experimental results above, we can draw the following conclusions:
1) The time for replication increases roughly linearly with the number of data items, suggesting that our heuristic is scalable. This is shown by Figure 8.
2) Selecting jobs based on a higher bandwidth requirement generally provides appreciably better performance.
3) Selecting jobs based on larger data size also provides better performance, but the improvement seems to be rather marginal.
4) Storage-capacity-based ordering should not be employed in general. However, if this ordering needs to be employed at all, it is better to replicate data to mini-Clouds possessing larger storage capacity.
5) Whenever bandwidth is constant, data elements with variable data size modify the trends of applying orderings. In the case of configurations ADFI (Figure 12), ADEI (Figure 11), BDEI (Figure 4) and BDFI (Figure 6), orderings 9 and 10 are optimal, but otherwise orderings 8 and 11 are optimal. This leads us to conclude that jobs with larger data size usually provide better performance, as already stated above.
Fig. 8. Ordering trends vs Configuration for 2000 data items
V. CONCLUSIONS

In this paper, we have investigated a crucial problem in integrating cloud computing and the Internet of Things (IoT): given a distributed IoT network comprising many mini-Clouds, how to implement optimal data replication in the network. By optimal, we mean minimizing the time taken for data replication, which can be quite considerable given the volume of data generated in IoT networks. We have characterized the problem by parameters such as latency, link bandwidth, data size, and storage capacity of the mini-Clouds. Since the general problem is intractable, we have presented a greedy heuristic comprising several orderings derived from these parameters. We have also presented detailed experimental results and derived conclusions useful both to further research and to any IoT network operator that needs to implement data replication in an IoT network.

For future work, we will investigate the impact of specific IoT network topologies on data replication, and also how our various ordering strategies would perform in such topologies. We will also conduct further experiments on larger data sets.

REFERENCES

[1] Gartner Inc. (2014; accessed 6th August, 2015) Gartner press release. [Online]. Available: http://www.gartner.com/newsroom/id/2684616
[2] R. Rahman, K. Barker, and R. Alhajj, "Replica placement strategies in data grid," Journal of Grid Computing, vol. 6, no. 1, pp. 103–123, 2008.
[3] J. Kangasharju, J. Roberts, and K. W. Ross, "Object replication strategies in content distribution networks," Comput. Commun., vol. 25, no. 4, pp. 376–383, Mar. 2002.
[4] S. Abdelwahab, B. Hamdaoui, M. Guizani, and T. Znati, "Replisom: Disciplined tiny memory replication for massive IoT devices in LTE edge cloud," IEEE Internet of Things Journal, vol. PP, no. 99, pp. 1–1, 2015.
[5] Z. Al-Arnaout, Q. Fu, and M. Frean, "A divide-and-conquer approach for content replication in WMNs," Comput. Netw., vol. 57, no. 18, pp. 3914–3928, Dec. 2013.
[6] Z. Al-Arnaout, Q. Fu, and M. Frean, "Exploiting graph partitioning for hierarchical replica placement in WMNs," in Proceedings of the 16th ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, ser. MSWiM '13. New York, NY, USA: ACM, 2013, pp. 5–14.
[7] Q. Zhang, S. Q. Zhang, A. Leon-Garcia, and R. Boutaba, "Aurora: Adaptive block replication in distributed file systems," in Distributed Computing Systems (ICDCS), 2015 IEEE 35th International Conference on, June 2015, pp. 442–451.
[8] S. A. Weil, S. A. Brandt, E. L. Miller, and C. Maltzahn, "CRUSH: Controlled, scalable, decentralized placement of replicated data," in Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, ser. SC '06. New York, NY, USA: ACM, 2006.
[9] B. Zhang, X. Wang, and M. Huang, "A data replica placement scheme for cloud storage under healthcare IoT environment," in Fuzzy Systems and Knowledge Discovery (FSKD), 2014 11th International Conference on, Aug 2014, pp. 542–547.
[10] F. u. A. A. Minhas and M. Arif, "MOX: A novel global optimization algorithm inspired from oviposition site selection and egg hatching inhibition in mosquitoes," Appl. Soft Comput., vol. 11, no. 8, pp. 4614–4625, Dec. 2011.
[11] R. N. Calheiros, R. Ranjan, A. Beloglazov, C. A. F. De Rose, and R. Buyya, "CloudSim: A toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms," Softw. Pract. Exper., vol. 41, no. 1, pp. 23–50, Jan. 2011.
[12] S. Zaman and D. Grosu, "A distributed algorithm for the replica placement problem," Parallel and Distributed Systems, IEEE Transactions on, vol. 22, no. 9, pp. 1455–1468, Sept 2011.
[13] V. Venkatesan, I. Iliadis, C. Fragouli, and R. Urbanke, "Reliability of clustered vs. declustered replica placement in data storage systems," in Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2011 IEEE 19th International Symposium on, July 2011, pp. 307–317.
[14] N. Bonvin, T. G. Papaioannou, and K. Aberer, "A self-organized, fault-tolerant and scalable replication scheme for cloud storage," in Proceedings of the 1st ACM Symposium on Cloud Computing, ser. SoCC '10. New York, NY, USA: ACM, 2010, pp. 205–216.
[15] N. Narendra, K. Koorapati, and V. Ujja, "Towards cloud-based decentralized storage for internet of things data," in Cloud Computing for Emerging Markets (CCEM), 2015 IEEE Conference on. IEEE, 2015.