


IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 4, NO. 1, JANUARY-MARCH 2016

Monetary Cost Optimizations for Hosting Workflow-as-a-Service in IaaS Clouds
Amelie Chi Zhou, Bingsheng He, and Cheng Liu

Abstract: Recently, we have witnessed workflows from science and other data-intensive applications emerging on Infrastructure-as-a-Service (IaaS) clouds, and many workflow service providers offering workflow-as-a-service (WaaS). The major concern of WaaS providers is to minimize the monetary cost of executing workflows in the IaaS clouds. The selection of virtual machine (instance) types significantly affects the monetary cost and performance of running a workflow. Moreover, the IaaS cloud environment is dynamic, with high performance dynamics caused by the interference from concurrent executions and price dynamics like spot prices offered by Amazon EC2. Therefore, we argue that WaaS providers should have the notion of offering probabilistic performance guarantees for individual workflows to explicitly expose the performance and cost dynamics of IaaS clouds to users. We develop a scheduling system called Dyna to minimize the expected monetary cost given the user-specified probabilistic deadline guarantees. Dyna includes an A*-based instance configuration method for performance dynamics, and a hybrid instance configuration refinement for using spot instances. Experimental results with three scientific workflow applications on Amazon EC2 and a cloud simulator demonstrate (1) the ability of Dyna to satisfy the probabilistic deadline guarantees required by the users; and (2) its effectiveness in reducing monetary cost in comparison with the existing approaches.

Index Terms: Cloud computing, cloud dynamics, spot prices, monetary cost optimizations, scientific workflows

The authors are with the School of Computer Engineering, Nanyang Technological University, Singapore 637598. E-mail: {czhou1, LIUC0012}@e.ntu.edu.sg, bshe@ntu.edu.sg.
Manuscript received 14 Aug. 2014; revised 14 Jan. 2015; accepted 12 Feb. 2015. Date of publication 17 Feb. 2015; date of current version 2 Mar. 2016.
Recommended for acceptance by K. Keahey, I. Raicu, K. Chard, B. Nicolae.
For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TCC.2015.2404807
2168-7161 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

1 INTRODUCTION

Cloud computing has become a popular computing infrastructure for many scientific applications. Recently, we have witnessed many workflows from various scientific and data-intensive applications deployed and hosted on the Infrastructure-as-a-Service (IaaS) clouds such as Amazon EC2 and other cloud providers. In those applications, workflows are submitted and executed in the cloud and each workflow is usually associated with a deadline as performance guarantee [1], [2], [3]. This has formed a new software-as-a-service model for hosting workflows in the cloud, and we refer to it as Workflow-as-a-Service (WaaS). WaaS providers charge users based on the execution of their workflows and QoS requirements. On the other hand, WaaS providers rent cloud resources from IaaS clouds, which induces the monetary cost. Monetary cost is an important optimization factor for WaaS providers, since it directly affects the profit of WaaS providers. In this paper, we investigate whether and how WaaS providers can reduce the monetary cost of hosting WaaS while offering performance guarantees for individual workflows.

Monetary cost optimizations have been classic research topics in grid and cloud computing environments. Over the era of grid computing, cost-aware optimization techniques have been extensively studied. Researchers have addressed various problems: minimizing cost given the performance requirements [4], maximizing the performance for given budgets [5] and scheduling optimizations with both cost and performance constraints [6]. When it comes to cloud computing, the pay-as-you-go pricing, virtualization and elasticity features of cloud computing open up various challenges and opportunities [1], [7]. Recently, there have been many studies on monetary cost optimizations with resource allocations and task scheduling according to the features of cloud computing (e.g., [1], [2], [7], [8], [9], [10], [11]). Although the above studies have demonstrated their effectiveness in reducing the monetary cost, all of them assume static task execution time and consider only a fixed pricing scheme (only on-demand instances in Amazon's terminology). Particularly, they have the following limitations.

First, cloud is by design a shared infrastructure, and the interference causes significant variations in the performance even with the same instance type. Previous studies [12], [13] have demonstrated significant variances on I/O and network performance. The assumption of static task execution time in the previous studies (e.g., [1], [2], [7], [8], [9], [10]) does not hold in the cloud. Under the static execution time assumption, the deadline notion is a "deterministic deadline". Due to performance dynamics, a more rigorous notion of deadline requirement is needed to cope with the dynamic task execution time.

Second, cloud, which has evolved into an economic market [14], has dynamic pricing. Amazon EC2 offers spot instances, whose prices are determined by market demand and supply. Spot instances have been an effective means to reduce monetary cost [15], [16], because the spot price is usually much lower than the price of on-demand instances of the same type. However, a spot instance may be terminated at any time when the bidding price is lower than the spot price (i.e., out-of-bid events). The usage of spot instances may cause excessively long

latency due to failures. Most of the previous studies do not consider deadline constraints of individual workflows when using spot instances.

In order to address performance and price dynamics, we define the notion of probabilistic performance guarantees to explicitly expose the performance dynamics to users. Each workflow is associated with a probabilistic deadline requirement of $p_r$ percent. The WaaS provider guarantees that the $p_r$-th percentile of the workflow's execution time distribution in the dynamic cloud environment is no longer than the predefined deadline. The WaaS provider may charge differently according to the deadline and the probability in the performance guarantee, and the users can select the suitable performance guarantee according to their requirements. This is similar to how many IaaS cloud providers offer different probabilistic availability guarantees. Under this notion, we propose a probabilistic scheduling system called Dyna to minimize the cost of the WaaS provider while satisfying the probabilistic performance guarantees of individual workflows predefined by the user. The system embraces a series of optimization techniques for monetary cost optimizations, which are specifically designed for cloud dynamics. We develop probabilistic models to capture the performance dynamics in I/O and network of instances in IaaS clouds. We further propose a hybrid instance configuration approach to adopt both spot and on-demand instances and to capture the price dynamics in IaaS clouds. The spot instances are adopted to potentially reduce monetary cost and on-demand instances are used as the last defense to meet deadline constraints.

We calibrate the cloud dynamics from a real cloud provider (Amazon EC2) for the probabilistic models on I/O and network performance as well as spot prices. We perform experiments using three workflow applications on Amazon EC2 and on a cloud simulator. Our experimental results demonstrate the following two major results. First, with the calibrations from Amazon EC2, Dyna can accurately capture the cloud dynamics and guarantee the probabilistic performance requirements predefined by the users. Second, the hybrid instance configuration approach significantly reduces the monetary cost by 15-73 percent over other state-of-the-art algorithms [1] which only adopt on-demand instances.

The rest of the paper is organized as follows. We formulate our problem and review the related work in Section 2. We present our detailed system design in Section 3, followed by the experimental results in Section 4. Finally, we conclude this paper in Section 5.

2 BACKGROUND AND RELATED WORK

2.1 Application Scenario

Fig. 1 illustrates our application scenario. In this study, we consider a typical scenario of offering a software-as-a-service model for workflows on IaaS clouds [1]. We call this model Workflow-as-a-Service. We consider three parties in this scenario, namely the workflow application owner, the WaaS provider and the IaaS cloud provider. In this hosting, different application owners submit a number of workflows with different parameters to WaaS and the WaaS provider rents resources from the cloud provider to serve the applications. The application owners submit workflows with specified deadlines for QoS purposes. WaaS providers charge users according to the execution of workflows and their QoS requirements. In this proposal, we argue that the WaaS provider should offer a probabilistic performance guarantee for users. Particularly, we can offer some fuzzy-style interfaces for users to specify their probabilistic deadline requirements, such as "Low", "Medium" and "High", as illustrated in Fig. 2. Inside Dyna, we translate these requirements into probabilities of deadline. For example, the user may select the loose deadline of 4 hours with the probability of 96 percent. Ideally, the WaaS provider tends to charge higher prices to users when they specify a tighter deadline and/or a higher probabilistic deadline guarantee. The design of the billing scheme for WaaS is beyond the scope of this paper, and we will explore it as future work.

Fig. 1. Application scenario of this study.

Fig. 2. Illustrative user interface.

Different workflow scheduling and resource provisioning algorithms can result in significant differences in the monetary cost of WaaS providers running the service on IaaS clouds. Considering the cloud dynamics, our goal is to provide a probabilistic scheduling system for WaaS providers, aiming at minimizing the expected monetary cost while satisfying users' probabilistic deadline requirements.

2.2 Terminology

Instance. An instance is a virtual machine offered by the cloud provider. Different types of instances can have different amounts of resources such as CPUs and RAM and different capabilities such as CPU speed, I/O speed and network bandwidth. We model the dynamic I/O and network performance as probabilistic distributions. The details are presented in Section 3.

We adopt the instance definition of Amazon EC2, where an instance can be on-demand or spot. Amazon adopts the hourly billing model, where any partial hour of instance usage is rounded up to one hour. Both on-demand and spot instances can be terminated when users no longer need them. If an instance is terminated by the user, the user has to pay for any partial hour (rounded up to one hour). For a
spot instance, if it is terminated due to an out-of-bid event, users do not need to pay for any partial hour of usage. Table 1 shows some statistics of the price history of four types of spot instances on Amazon in the US East region during August 2013. We also show the price of the on-demand instances for those four types. We have the following observations: a) The spot instances are usually cheaper than on-demand instances. There are some outlier points where the maximum spot price is much higher than the on-demand price. b) Different types have different variations on the spot price. These observations are consistent with the previous studies [17], [18].

TABLE 1
Statistics on Spot Prices ($/hour, August 2013, US East Region) and On-Demand Prices of Amazon EC2

Instance type   Average   stdev   Min      Max   OnDemand
m1.small        0.048     0.438   0.007    10    0.06
m1.medium       0.246     1.31    0.0001   10    0.12
m1.large        0.069     0.770   0.026    40    0.24
m1.xlarge       0.413     2.22    0.052    20    0.48

Task. Tasks can have different characteristics, e.g., compute-intensive and I/O-intensive tasks, according to the dominating part of the total execution time. The execution time (or response time) of a task is usually estimated using estimation methods such as task profiling [19]. In this study, we use a simple performance estimation model to predict the task execution time. Since scientific workflows are often regular and predictable [1], [4], this simple approach is sufficiently accurate in practice. Specifically, given the input data size, the CPU execution time and output data size of a task, the overall execution time of the task on a cloud instance can be estimated as the sum of the CPU, I/O and network time of running the task on this instance. Note that the CPU performance is usually rather stable [12]. Since the I/O and network performance of the cloud are dynamic (modeled as probabilistic distributions in this paper), the estimated task execution time is also a probabilistic distribution.

Job. A job is expressed as a workflow of tasks with precedence constraints. A job has a soft deadline. In this study, we consider the deadline of a job as a probabilistic requirement. Suppose a workflow is specified with a probabilistic deadline requirement of $p_r$ percent. Rather than offering a 100 percent deadline guarantee, the WaaS provider guarantees that the $p_r$-th percentile of the workflow's execution time distribution in the dynamic cloud environment is no longer than a predefined deadline constraint. Our definition of probabilistic deadline is consistent with previous studies [20] on defining the QoS in a probabilistic manner.

Instance configuration. The hybrid instance configuration of a task is defined as an n-dimensional vector: $\langle (type_1, price_1, isSpot_1), (type_2, price_2, isSpot_2), \ldots, (type_n, price_n, isSpot_n) \rangle$, where $isSpot_i$ indicates whether the instance is spot (True) or on-demand (False). If instance i is a spot instance, $price_i$ is the specified bidding price, and the on-demand price otherwise. In our hybrid instance configuration, only the last dimension of the configuration is an on-demand instance and all previous dimensions are spot instances. We set the last dimension to an on-demand instance to ensure the deadline of the task.

A hybrid instance configuration indicates a sequence of instance types that the task is potentially to be executed on. In the hybrid execution of spot and on-demand instances, a task is initially assigned to a spot instance of the type indicated by the first dimension of its configuration (if any). If the task fails on this spot instance, it will be re-assigned to an instance of the next type indicated by its configuration until it successfully finishes. Since the last dimension is an on-demand instance type, the task can always finish the execution, even when the task fails on all previous spot instances.

2.3 Related Work

There is a large body of work related to our study, and we focus on the most relevant studies on cost optimizations and cloud performance dynamics.

Cost-aware optimizations. Workflow scheduling with deadline and budget constraints (e.g., [2], [4], [5], [21], [22], [23], [24], [25], [26]) has been widely studied. Yu et al. [4] proposed deadline assignment for the tasks within a job and used genetic algorithms to find optimal scheduling plans. Multi-objective methods such as evolutionary algorithms [27], [28] have been adopted to study the tradeoff between monetary cost and performance optimizations for workflow executions. Those studies consider only a single workflow with on-demand instances. Malawski et al. [2] proposed dynamic scheduling strategies for workflow ensembles. The previous studies [1], [29], [30], [31] proposed auto-scaling techniques based on static execution time of individual tasks. In comparison with the previous works, the unique feature of Dyna is that it targets offering probabilistic performance guarantees as QoS, instead of deterministic deadlines. Dyna schedules the workflow by explicitly capturing the performance dynamics (particularly for I/O and network performance) in the cloud. Calheiros and Buyya [21] proposed an algorithm with task replications to increase the likelihood of meeting deadlines.

Due to their ability to reduce monetary cost, Amazon EC2 spot instances have recently received a lot of interest. Related work can be roughly divided into two categories: modeling spot prices [17], [18] and leveraging spot instances [15], [16], [32].

For modeling spot prices, Yehuda et al. [18] conducted reverse engineering on the spot price and figured out a model consistent with existing price traces. Javadi et al. [17], [33] developed statistical models for different spot instance types. Those models can be adopted in our hybrid execution.

For leveraging spot instances, Yi et al. [15] introduced some checkpointing mechanisms for reducing the cost of spot instances. Further studies [16] used spot instances with different bidding strategies and incorporated fault tolerance techniques such as checkpointing, task duplication and migration. Those studies use spot instances only and, unlike Dyna, offer no guarantee on meeting the workflow deadline. Similar to Dyna, Chu and Simmhan [34] proposed a hybrid method to use both on-demand and spot instances for minimizing total cost while satisfying the deadline constraint. However, they did not consider the cloud performance dynamics.
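As an illustration of the terminology in Section 2.2, the following minimal Python sketch encodes a hybrid instance configuration and checks a probabilistic deadline against sampled workflow execution times. It is not the authors' implementation: the names `InstanceChoice`, `is_valid_hybrid_config`, `percentile` and `meets_probabilistic_deadline` are hypothetical, the nearest-rank percentile over samples is an assumed stand-in for the paper's distribution-based evaluation, and the bidding price and execution times in the usage part are made up.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class InstanceChoice:
    """One dimension (type_i, price_i, isSpot_i) of a hybrid configuration."""
    instance_type: str
    price: float   # bidding price if is_spot, on-demand price otherwise
    is_spot: bool


def is_valid_hybrid_config(config: Sequence[InstanceChoice]) -> bool:
    """True if only the last dimension is on-demand and all earlier ones are spot."""
    if not config:
        return False
    *earlier, last = config
    return (not last.is_spot) and all(choice.is_spot for choice in earlier)


def percentile(samples: Sequence[float], pr: float) -> float:
    """pr-th percentile (0 < pr <= 100) of the samples, nearest-rank method (assumed)."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pr <= 100:
        raise ValueError("pr must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pr * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def meets_probabilistic_deadline(samples: Sequence[float], pr: float,
                                 deadline: float) -> bool:
    """True if the pr-th percentile of execution time is no longer than the deadline."""
    return percentile(samples, pr) <= deadline


if __name__ == "__main__":
    # m1.small on-demand price from Table 1; the 0.01 bid is a made-up value.
    config = [InstanceChoice("m1.small", 0.01, True),
              InstanceChoice("m1.small", 0.06, False)]
    print(is_valid_hybrid_config(config))
    # Made-up execution times in hours, checked against a 4-hour, 96 percent deadline.
    print(meets_probabilistic_deadline([3.1, 3.4, 3.6, 3.8, 3.9, 4.2], 96, 4.0))
```

Keeping the on-demand entry as the final element mirrors the "last defense" role that on-demand instances play in the hybrid configuration, and the percentile check expresses the requirement that a workflow meets its deadline with probability $p_r$ percent rather than always.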
Cloud performance dynamics. There have been some proposals to reduce the performance interference and unpredictability in the cloud, such as network performance [35] and I/O performance [36], [37]. This paper offers a probabilistic notion to capture the performance and cost dynamics, and further develops a probabilistic scheduling system to minimize the monetary cost with the consideration of those dynamics.

3 SYSTEM DESIGN AND IMPLEMENTATION

We first present an overview of the Dyna system and then discuss the design details of the optimization techniques adopted in Dyna.

3.1 System Overview

We propose Dyna, a workflow scheduling system designed to minimize the monetary cost of executing workflows in IaaS clouds. Compared with existing scheduling algorithms or systems [1], Dyna is specifically designed to capture the cloud performance and price dynamics. The main components of Dyna are illustrated in Fig. 3.

Fig. 3. Overview of the Dyna system.

When a user has specified the probabilistic deadline requirement for a workflow, WaaS providers schedule the workflow by choosing the cost-effective instance types for each task in the workflow. The overall functionality of the Dyna optimizations is to determine the suitable instance configuration for each task of a workflow so that the monetary cost is minimized while the probabilistic performance requirement is satisfied. We formulate the optimization process as a search problem, and develop a two-step approach to find the solution efficiently. The instance configurations of the two steps are illustrated in Fig. 3. First, we adopt an A*-based instance configuration approach to select the on-demand instance type for each task of the workflow, in order to minimize the monetary cost while satisfying the probabilistic deadline guarantee. Second, starting from the on-demand instance configuration, we adopt the hybrid instance configuration refinement to consider using a hybrid of both on-demand and spot instances for executing tasks in order to further reduce cost. After the two optimization steps, the tasks of the workflow are scheduled to execute on the cloud according to their hybrid instance configuration. At runtime, we maintain a pool of spot instances and on-demand instances, organized in lists according to different instance types. Instance acquisition/release operations are performed in an auto-scaling manner. For the instances that do not have any task and are approaching multiples of full instance hours, we release them and remove them from the pool. We schedule tasks to instances in the earliest-deadline-first manner. When a task with the deadline residual of zero requests an instance and the task is not consolidated to an existing instance in the pool, we acquire a new instance from the cloud provider, and add it into the pool. In our experiment, for example, Amazon EC2 poses the capacity limitation of 200 instances. If this cap is met, we cannot acquire new instances until some instances are released.

The reason that we divide the search process into two steps is to reduce the solution space. For example, consider searching the instance configuration for a single task, where there are n on-demand types and m spot instance types. If we consider spot and on-demand instances together, the number of configurations to be searched is $\binom{n}{1} \times \binom{m}{1}$, while in our divide-and-conquer approach, the complexity is reduced to $\binom{n}{1} + \binom{m}{1}$. In each search step, we design efficient techniques to further improve the optimization effectiveness and efficiency. In the first step, we only consider on-demand instances and utilize the pruning capability of A* search to improve the optimization efficiency. In the second step, we consider the hybrid of spot and on-demand instances as the refinement of the instance configuration obtained from the first step. We give the following example to illustrate the feasibility of the two-step optimization.

Example 1. Consider the instance configuration for a single task. In the A*-based instance configuration step, the on-demand instance configuration found for the task is $\langle (0, 0.1, \mathrm{False}) \rangle$. In the refinement step, the on-demand instance configuration is refined to $\langle (0, 0.01, \mathrm{True}), (0, 0.1, \mathrm{False}) \rangle$. Assume the expected execution time of the task on a type 0 instance is 1 hour and the spot price is lower than $0.01 (equal to $0.006) for 50 percent of the time. The expected monetary cost of executing the task under the on-demand instance configuration is $0.1 and under the hybrid instance configuration is only $0.053 ($0.006 × 50% + $0.1 × 50%). Thus, it is feasible to reduce the expected monetary cost by instance configuration refinement in the second step.

In the remainder of this section, we outline the design of the optimization components, and discuss the implementation details.

3.2 A*-Based On-Demand Instance Configuration

In this optimization step, we determine an on-demand instance type for each task in the workflow so that the monetary cost is minimized while the probabilistic performance guarantee specified by the user is satisfied. We formulate the process as an A*-based search problem. The reason that we choose A* search is to take advantage of its pruning capability to reduce the large search space while targeting a high quality solution. The challenging issues of developing the A*-based on-demand instance configuration include 1) how to define the state, state transitions and the search heuristics in A* search; and 2) how to perform the
state evaluation so that the performance dynamics are captured to satisfy the probabilistic performance guarantee.

3.2.1 A* Search Process

The process of A* search can be modeled as a search tree. In the formulated A* search, we first need to clarify the definitions of the state and the state transitions in the search tree. A state is a configuration plan for the workflow, represented as a multi-dimensional vector. Each dimension of the vector represents the instance configuration of an on-demand instance type for each task in the workflow. This configuration is extended to a hybrid instance configuration in the hybrid instance configuration refinement (described in Section 3.3). For example, as shown in Fig. 4, a search state for a workflow with three tasks is represented as $(t_0, t_1, t_2)$, meaning that task i ($0 \le i \le 2$) is configured with on-demand instance type $t_i$. Starting from the initial state (root node of the search tree), the search tree is traversed by transitioning from a state to its child states level by level. At level l, the state transition is to replace the l-th dimension in the state with all equally or more expensive instance types. In the example of Fig. 4, suppose there are three on-demand instance types (type 0, 1 and 2 with increasing on-demand prices). From the initial state (represented as $(0, 0, 0)$) where all tasks are assigned to the cheapest instance type (instance type 0), we move to its child states by iterating the three available instance types for the first task (i.e., instance type 0, 1 and 2 and child states $(0, 0, 0)$, $(1, 0, 0)$ and $(2, 0, 0)$).

Fig. 4. An example of the configuration plan search tree in our A* algorithm.

A* search adopts several heuristics to enable its pruning capability. Particularly, A* evaluates a state s by combining two distance metrics $g(s)$ and $h(s)$, which are the actual distance from the initial state to the state s and the estimated distance from the state s to the goal state, respectively. $g(s)$ and $h(s)$ are also referred to as the g score and h score for s, respectively. We estimate the total search cost for s to be $f(s) = g(s) + h(s)$. In the A*-based instance configuration, we define both $g(s)$ and $h(s)$ to be the monetary cost of configuration plan s. This is because if the monetary cost of a state s is higher than the best found result, its successors are unlikely to be the goal state since they have more expensive configurations than s. For example, assuming state $(1, 1, 0)$ on the search tree in Fig. 4 has a high search cost, the grey states on the search tree are pruned since they have higher monetary cost than state $(1, 1, 0)$. During the A* search, we maintain two lists, namely the OpenList and ClosedList. The OpenList contains states that are potential solutions to the problem and are to be searched later. States that have already been searched or have a high search cost are added to the ClosedList and do not need to be considered again during the A* search.

Algorithm 1 shows the optimization process of the A*-based instance configuration algorithm. Iteratively, we fetch states from the OpenList and add their neighboring states into the OpenList. Note that we only consider the feasible states that satisfy the probabilistic deadline guarantee (Line 7-9). estimate_performance is used to estimate the feasibility of states. We maintain the lowest search cost found during the search process as the upper bound to prune the unuseful states on the search tree (Line 12-13). Function estimate_cost returns the estimation for the h and g scores of states. When expanding the OpenList, we only add the neighboring states with lower search cost than the upper bound (Line 17-23).

Algorithm 1. A*-Based Instance Configuration Search from Initial State S to Goal State D
Require: Max_iter: Maximum number of iterations; deadline; pr: Required probabilistic deadline guarantee
1: ClosedList = empty;
2: OpenList = S;
3: upperBound = 0;
4: g(S) = h(S) = estimate_cost(S);
5: f(S) = g(S) + h(S);
6: while not (OpenList is empty or reach Max_iter) do
7:     current = the state in OpenList having the lowest f value;
8:     percentile = estimate_performance(current, pr);
9:     if percentile < deadline then
10:        g(current) = h(current) = estimate_cost(current);
11:        f(current) = g(current) + h(current);
12:        if f(current) < upperBound then
13:            upperBound = f(current);
14:            D = current;
15:    Remove current from OpenList;
16:    Add current to ClosedList;
17:    for each neighbor in neighboring states of current do
18:        g(neighbor) = h(neighbor) = estimate_cost(neighbor);
19:        f(neighbor) = g(neighbor) + h(neighbor);
20:        if f(neighbor) > upperBound or neighbor is in ClosedList then
21:            continue;
22:        if neighbor is not in OpenList then
23:            Add neighbor to OpenList;
24: Return D;

3.2.2 State Evaluation

The core operations of evaluating a state are to estimate the expected monetary cost (function estimate_cost) and to evaluate the feasibility of a state (function estimate_performance), i.e., whether it satisfies the probabilistic performance guarantee. Due to cloud performance dynamics, we develop probabilistic methods for the evaluation.

We model the execution time of tasks as probabilistic distributions. We develop probabilistic distribution models to describe the performance dynamics of I/O and network. Previous studies [12], [14] show that I/O and network are the major sources of performance dynamics in the cloud due to resource sharing while the CPU performance is rather stable for a given instance type. We define the probability of the I/O and network bandwidth equaling a certain value x on instance type type to
ZHOU ET AL.: MONETARY COST OPTIMIZATIONS FOR HOSTING WORKFLOW-AS-A-SERVICE IN IAAS CLOUDS 39

PDF1 ; . . . ; PDFn2 PDFn1 , where PDFi (0  i  n  1)


is the probabilistic distribution of the execution time of task
i. The operation of two probabilistic distributions calcu-
lates the convolution of the two distributions and the MAX
operation finds the distribution of the maximum of two ran-
dom variables modeled by the two distributions. After
obtaining the execution time distribution of the workflow,
we check its percentile at the required probabilistic deadline
guarantee. According to our notion of probabilistic dead-
Fig. 5. Basic workflow structures and their probabilistic distributions of
the execution time, denoting the execution time distribution of Task 0, line, only if the returned percentile is no longer than the
1,. . ., n  1 to be PDF0, PDF1,. . ., PDFn1, respectively. deadline, the evaluated state is feasible.

be: $P_{seqBand,type}(seqBand = x)$, $P_{rndBand,type}(rndBand = x)$, $P_{inBand,type}(inBand = x)$ and $P_{outBand,type}(outBand = x)$ as the probabilistic distributions for the sequential I/O, random I/O, downloading and uploading network performance from/to the persistent storage, respectively. In our calibrations on Amazon EC2, $P_{rndBand,type}(rndBand = x)$ conforms to normal distributions and the other three conform to Gamma distributions (Section 4). Given the I/O and network performance distributions and the corresponding I/O and networking data size, we model the execution time of a task on different instance types with probabilistic distribution functions (PDFs). For example, if the size of the input data on the disk is $s_{in}$, the probability that reading the input data takes time $s_{in}/x$ is $P_{seqBand,type}(seqBand = x)$, assuming that the input data is read sequentially.

Having modeled the execution time of tasks as probabilistic distributions, we first introduce the implementation of function estimate_cost. The monetary cost of a state s is estimated to be the sum of the expected monetary cost of each task running on the type of instance specified in s. Consider a task with on-demand instance type $type$ and on-demand price $p$. We estimate the expected monetary cost of the task to be $p$ multiplied by the expected execution time of the task on the on-demand instance of type $type$. Here, we have ignored the rounding monetary cost in the estimation. This is because in the WaaS environment, this rounding monetary cost is usually amortized among many tasks. Enforcing the instance hour billing model could severely limit the optimization space, leading to a suboptimal solution (a configuration plan with suboptimal monetary cost).

Another core evaluation function is estimate_performance. Given a state s and the execution time distribution of each task under the evaluated state s, we first calculate the execution time distribution of the entire workflow. Since the execution time of a task is now a distribution, rather than a static value, the execution time on the critical path is also dynamic. To have a complete evaluation, we apply a divide-and-conquer approach to get the execution time distribution of the entire workflow. Particularly, we decompose the workflow structure into the three kinds of basic structures, as shown in Fig. 5. Each basic structure has n tasks ($n \ge 2$). The decomposition is straightforward by identifying the basic structures in a recursive manner from the starting task(s) of the workflow.

The execution time distribution of each basic structure is calculated with the execution time distributions of individual tasks. For example, the execution time distribution of the structure in Fig. 5b is calculated as MAX($PDF_0$,

3.3 Hybrid Instance Configuration Refinement
We consider the adoption of spot instances as a refinement to the configuration plan obtained from the previous step (the A*-based instance configuration algorithm) to further reduce monetary cost. The major problem of adopting spot instances is that running a task on spot instances may suffer from out-of-bid events and fail to meet the deadline requirements. We propose a simple yet effective hybrid instance configuration to tackle this issue. The basic idea is that, if the deadline allows, we can try to run a task on a spot instance first. If the task can finish on the spot instance, the monetary cost tends to be lower than the monetary cost of running the task on an on-demand instance. We may try more than one spot instance if the previous spot instance fails (as long as doing so can reduce the monetary cost and satisfy the probabilistic performance guarantee). If all spot instances in the hybrid instance configuration fail, the task is executed on an on-demand instance to ensure the deadline. When a task finishes the execution on a spot instance, it is checkpointed, and the checkpoint is stored on the persistent storage of the cloud (such as Amazon S3). This avoids triggering the re-execution of its precedent tasks. Dyna performs checkpointing only when the task ends, which is simple and has much less overhead than the general checkpointing algorithms [15].

A hybrid instance configuration of a task is represented as a vector of both spot and on-demand instance types, as described in Section 2.2. The last dimension in the vector is the on-demand instance type obtained from the A*-based instance configuration step. The initial hybrid configuration contains only the on-demand instance type. Starting from the initial configuration, we repeatedly add spot instances at the beginning of the hybrid instance configuration to find better configurations. Ideally, we can add n spot instances (n is a predefined parameter). A larger n gives a higher probability of benefiting from the spot instances, while a smaller n gives a higher probability of meeting the deadline requirement and reduces the optimization overhead. In our experiments, we find that n = 2 is sufficient for obtaining good optimization results. A larger n greatly increases the optimization overhead with only very small improvement on the optimization results.

It is a challenging task to develop an efficient and effective approach for hybrid instance configuration refinement. First, coupled with the performance dynamics, it is a non-trivial task to compare whether one hybrid instance configuration is better than the other in terms of cost and performance. Second, since the cloud provider usually offers multiple instance types and a wide range of spot prices, we

are facing a large space for finding the suitable spot instance type and spot price.

To address those two challenging issues, we develop efficient and effective heuristics to solve the problem. We describe the details in the remainder of this section. When refining a hybrid instance configuration $C_{orig}$ of a task to a hybrid instance configuration $C_{refined}$, we need to determine whether $C_{refined}$ is better than $C_{orig}$ in terms of monetary cost and execution time distributions. Particularly, we have the following two considerations. We accept the refined configuration $C_{refined}$ if both considerations are satisfied.

1) Probabilistic deadline guarantee consideration. $C_{refined}$ should not violate the probabilistic deadline guarantee of the entire workflow;
2) Monetary cost reduction. The estimated monetary cost of $C_{refined}$ should be less than that of $C_{orig}$.

Probabilistic deadline guarantee consideration. A naive way is to first calculate the probabilistic distribution of the entire workflow's execution time under the refined configuration $C_{refined}$ and then to decide whether the probabilistic deadline requirement is met. However, this calculation introduces large overhead. We implement this process in the Oracle algorithm presented in Section 4. In Dyna, we propose a light-weight localized heuristic to reduce the overhead. As the on-demand configurations (i.e., the initial hybrid instance configuration) of each task found in the A*-based instance configuration step have already ensured the probabilistic deadline requirement, we only need to make sure that the refined hybrid instance configuration $C_{refined}$ of each task satisfies $C_{refined} \preceq C_{orig}$, where $\preceq$ is defined in Definition 1. Fig. 6 illustrates this definition. The integrals are represented as cumulative distribution functions (CDFs). With this heuristic, when evaluating the probabilistic deadline guarantee consideration for a refined configuration, we only need to calculate the probabilistic distribution of the execution time of a task rather than the entire workflow and thus greatly reduce the optimization overhead.

Definition 1. Given two hybrid instance configurations $C_1$ and $C_2$ of task $T$, we have $C_2 \preceq C_1$ if for $\forall t$, we have $\int_0^t P_{T,C_2}(time = x)\,dx \ge \int_0^t P_{T,C_1}(time = x)\,dx$, where $P_{T,C_1}$ and $P_{T,C_2}$ are the PDFs of task $T$ under configuration $C_1$ and $C_2$, respectively.

Fig. 6. The definition of configuration $C_2 \preceq C_1$. $cdf_1$ and $cdf_2$ are the cumulative execution time distribution functions under configuration plan $C_1$ and $C_2$, respectively.

In order to compare two hybrid instance configurations according to Definition 1, we first discuss how to estimate the execution time distribution of a task given a hybrid instance configuration. Assume a hybrid instance configuration of task $T$ is $C_{refined} = \langle (type_1, P_b, True), (type_2, P_o, False) \rangle$. Assume the probabilistic distribution of the execution time of task $T$ on the spot instance of $type_1$ is $P_{T,type_1}$ and on the on-demand instance of $type_2$ is $P_{T,type_2}$. The overall execution time of task $T$ under $C_{refined}$ can be divided into two cases. If the task successfully finishes on the spot instance (with probability $p_{suc}$), the overall execution time equals the execution time of task $T$ on the spot instance, $t_s$, with the following probability

$$P_{T,C_{refined}}(time = t_s) = P_{T,type_1}(time = t_s) \times p_{suc}. \tag{1}$$

Otherwise, the overall execution time equals the time that task $T$ has run on the spot instance before it fails, $t_f$, plus the execution time of task $T$ on the on-demand instance, $t_o$, with the following probability

$$P_{T,C_{refined}}(time = t_f + t_o) = P_{T,type_1}(time = t_f) \times P_{T,type_2}(time = t_o) \times (1 - p_{suc}). \tag{2}$$

Now we discuss how to calculate $p_{suc}$. Since a spot instance may fail at any time, we define a probabilistic function $ffp(t, p)$ to calculate the probability that a spot instance fails at time $t$ for the first time when the bidding price is set to $p$. Existing studies have demonstrated that the spot prices can be predicted using statistical models [17] or reverse engineering [18]. We use the recent spot price history as a prediction of the real spot price for $ffp(t, p)$ to calculate the failing probability. We obtain that function with a Monte-Carlo based approach. Starting from a random point in the price history, if the price becomes larger than $p$ at time $t$ for the first time, we add one to the counter $count$. We repeat this process $NUM$ times ($NUM$ is sufficiently large) and return $count / NUM$ as the failing probability. Using the $ffp$ function, we can define $p_{suc}$ as follows

$$p_{suc} = 1 - \int_0^{t_s} ffp(x, P_b)\,dx. \tag{3}$$

After obtaining the execution time distribution of a task under the refined hybrid instance configuration $C_{refined}$, we compare it with the configuration $C_{orig}$ according to Definition 1. If $C_{refined} \preceq C_{orig}$ is satisfied, the probabilistic deadline guarantee consideration is satisfied.

Monetary cost reduction. We estimate the monetary cost of a hybrid instance configuration of a task as the sum of the cost spent on the spot instance and the cost on the on-demand instance. Using Equations (1)-(3), we calculate the expected monetary cost of configuration $C_{refined}$ in Equation (4). Note that we use the bidding price $P_b$ to approximate the spot price in calculating the cost on spot instances. This calculation gives an upper bound of the actual expected monetary cost of the refined configuration and thus assures the correctness when considering the monetary cost reduction. If the estimated monetary cost of the refined configuration is lower than the monetary cost of the original configuration, the monetary cost reduction consideration is satisfied.
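The Monte-Carlo estimate of the first-failure probability, the success probability of Equation (3), and the expected cost of Equation (4) can be sketched together as below. This is an illustrative sketch under stated assumptions, not the paper's code: time is discretized into equal steps (so the integral becomes a sum), the price history is a plain list with one price per step, and the function names, the price trace and all numeric values are hypothetical.

```python
import random
from collections import Counter


def first_failure_offset(history, start, bid, horizon):
    """Offset (in time steps) at which the price first exceeds the bid,
    or None if it never does within the horizon."""
    for offset in range(horizon + 1):
        if history[start + offset] > bid:
            return offset
    return None


def estimate_ffp(history, bid, horizon, num_trials, rng):
    """Monte-Carlo estimate of ffp(t, bid) for t = 0..horizon."""
    if horizon >= len(history):
        raise ValueError("price history is shorter than the horizon")
    counts = Counter()
    for _ in range(num_trials):
        # Random starting point chosen so the whole window fits in the trace.
        start = rng.randrange(len(history) - horizon)
        offset = first_failure_offset(history, start, bid, horizon)
        if offset is not None:
            counts[offset] += 1
    return {t: counts[t] / num_trials for t in range(horizon + 1)}


def success_probability(ffp, t_s):
    """Discrete analogue of Equation (3): one minus the failure mass up to t_s."""
    return 1.0 - sum(ffp[t] for t in range(t_s + 1))


def expected_cost(p_suc, bid, t_s, t_f, on_demand_price, t_o):
    """Expected cost of one spot attempt followed by an on-demand fallback,
    following Equation (4) with the bid standing in for the spot price."""
    return p_suc * bid * t_s + (1.0 - p_suc) * (bid * t_f + on_demand_price * t_o)


# Hypothetical hourly price trace with one short price spike.
history = [0.01] * 50 + [0.20] * 5 + [0.01] * 45
rng = random.Random(42)
ffp = estimate_ffp(history, bid=0.05, horizon=4, num_trials=20_000, rng=rng)
p_suc = success_probability(ffp, t_s=4)
print(round(p_suc, 2), round(expected_cost(p_suc, 0.05, 4, 2, 0.06, 4), 3))
```

Raising the bid above the spike in this toy trace drives the estimated failure mass to zero, which is the trade-off the refinement step has to balance against the higher cost bound.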

$$cost(C_{refined}) = p_{suc} \times P_b \times t_s + (1 - p_{suc}) \times (P_b \times t_f + P_o \times t_o). \tag{4}$$

Repeatedly, we add spot instances to generate better hybrid instance configurations for each task in the workflow. Specifically, for each added spot instance, we decide its type and associated bidding price that satisfy the probabilistic deadline guarantee and monetary cost reduction considerations. Due to price dynamics of spot instances, making the decision is non-trivial. One straightforward way is to consider the cost of all spot instance types and their associated bidding prices. The refined hybrid instance configuration is chosen as the one that has the smallest expected monetary cost and satisfies the probabilistic performance guarantee. However, this method needs to search a very large solution space. To reduce the search space, we design a heuristic as described in Algorithm 2. We notice that the added spot instance type should be at least as expensive as (i.e., the capability should be at least as good as) the on-demand instance type found in the A* search step in order to ensure the probabilistic deadline guarantee. Thus, instead of searching all spot instance types, we only need to evaluate the types that are equally or more expensive than the given on-demand instance type. For each evaluated spot instance type, we search the bidding price using the binary search algorithm described in Algorithm 3.

Algorithm 2. Hybrid Instance Configuration Refinement for a Task T.
Require: type_o: the on-demand instance type found in the A* instance configuration for task T
         n: the dimension of the hybrid instance configuration
 1: T.configList[n] ← type_o;
 2: T.prices[n] ← on-demand price of type_o;
 3: for dim ← 1 to n - 1 do
 4:    T.configList[dim] ← -1;
 5:    T.prices[dim] ← 0;
 6: for dim ← 1 to n - 1 do
 7:    for type_s ← type_o to the most expensive instance type do
 8:       P_max ← the on-demand price of instance type type_s;
 9:       P_b ← binary_search(P_min, P_max, type_s);
10:       if P_b = -1 then
11:          continue;
12:       else
13:          T.configList[dim] ← type_s;
14:          T.prices[dim] ← P_b;

The binary search algorithm in Algorithm 3 works as follows. If the probabilistic deadline guarantee consideration is not satisfied, it means the searched spot price is too low and we continue the search in the higher half of the search space (Lines 10-12). If the monetary cost reduction consideration is not met, it means the searched spot price is too high and we continue the search in the lower half of the search space (Lines 6-8). If both considerations are satisfied for a certain bidding price, this price is used as the bidding price in the hybrid instance configuration. We search for the bidding price in the range of $[P_{low}, P_{high}]$. In our implementation, $P_{low}$ is 0.001 and $P_{high}$ equals the on-demand price of the evaluated spot instance type. Note that a spot instance with a bidding price higher than $P_{high}$ does not contribute to monetary cost reduction.

Algorithm 3. Binary_Search(P_low, P_high, type_s) for a Task T.
Require: P_low: the lowest bidding price searched
         P_high: the highest bidding price searched
         type_s: the evaluated spot instance type
         C_orig: the hybrid configuration before adding the spot instance of type type_s
         C_refined: the refined hybrid configuration with the spot instance of type type_s added
 1: if P_low > P_high then
 2:    Return -1;
 3: P_mid ← (P_low + P_high)/2;
 4: originalcost ← estimate_cost(C_orig);
 5: C_refined ← ⟨(type_s, P_mid, True), C_orig⟩;
 6: refinedcost ← estimate_cost(C_refined);
 7: if refinedcost > originalcost then
 8:    Return binary_search(P_low, P_mid, type_s);
 9: else
10:    satisfied ← estimate_performance(C_refined);
11:    if not satisfied then
12:       Return binary_search(P_mid, P_high, type_s);
13: Return P_mid;

4 EVALUATION
In this section, we present the evaluation results of the proposed approach on Amazon EC2 and a cloud simulator.

4.1 Experimental Setup
We have two sets of experiments: first, calibrating the cloud dynamics from Amazon EC2 as the input of our optimization system; second, running scientific workflows on Amazon EC2 and a cloud simulator with the compared algorithms for evaluation.

Calibration. We measure the performance of CPU, I/O and network for four frequently used instance types, namely m1.small, m1.medium, m1.large and m1.xlarge. We find that CPU performance is rather stable, which is consistent with the previous studies [12]. Thus, we focus on the calibration for I/O and network performance. In particular, we repeat the performance measurement on each kind of instance 10,000 times (once every minute in seven days). When an instance has been acquired for a full hour, it will be released and a new instance of the same type will be created to continue the measurement. The measurement results are used to model the probabilistic distributions of I/O and network performance.

We measure both sequential and random I/O performance for local disks. The sequential I/O read performance is measured with hdparm. The random I/O performance is measured by generating random I/O reads of 512 bytes each. Reads and writes have similar performance results, and we do not distinguish them in this study.

We measure the uploading and downloading bandwidth between different types of instances and Amazon S3. The bandwidth is measured from uploading and downloading a file to/from S3. The file size is set to 8 MB. We also measured the network bandwidth between two instances using

Iperf [38]. We find that the network bandwidth between instances of different types is generally lower than that between instances of the same type and S3.

Workflows. There have been some studies on characterizing the performance behaviours of scientific workflows [19]. In this paper, we consider three common workflow structures, namely Ligo, Montage and Epigenomics. The three workflows have different structures and parallelism.

We create instances of Montage workflows using Montage source code. The input data is the 2MASS J-band images covering 8-degree by 8-degree areas retrieved from the Montage archive. The number of tasks in the workflow is 10,567. The input data size is 4 GB, where each of the 2,102 tasks on the first level of the workflow structure reads an input image of 2 MB. Initially, the input data is stored in Amazon S3 storage. Since Ligo and Epigenomics are not open-sourced, we construct synthetic Ligo and Epigenomics workflows using the workflow generator provided by Pegasus [39]. We use the DAX files with 1,000 and 997 tasks (Inspiral_1000.xml and Epigenomics_997.xml [39]) for Ligo and Epigenomics, respectively. The input data size of Ligo is 9.3 GB, where each of the 229 tasks on the first level of the workflow structure reads 40.5 MB of input data on average. The input data size of Epigenomics is 1.7 GB, where each of the seven tasks on the first level of the workflow structure reads 369 MB of DNA sequence data on average.

Implementations. In order to evaluate the effectiveness of the proposed techniques in Dyna, we have implemented the following algorithms.

• Static. This approach is the same as the previous study in [1] which only adopts on-demand instances. We adopt it as the state-of-the-art comparison. For a fair comparison, we set the workflow deadline according to the probabilistic QoS setting used in Dyna. For example, if the user requires 90 percent of probabilistic deadline guarantee, the deterministic deadline used for Static is set to the 90th percentile of the workflow's execution time distribution.
• DynaNS. This approach is the same as Dyna except that DynaNS does not use any spot instances. The comparison between Dyna and DynaNS is to assess the impact of spot instances.
• SpotOnly. This approach adopts only spot instances during execution. It first utilizes the A*-based instance configuration approach to decide the instance type for each task in the workflow. Then we set the bidding price of each task to be very high (in our studies, we set it to be $1,000) in order to guarantee the probabilistic deadline requirement.
• Oracle. We implement the Oracle method to assess the trade-off between the optimization overhead and the effectiveness of the optimizations in Dyna. Oracle differs from Dyna in that Oracle does not adopt the localized heuristic of Definition 1 (Section 3.3) when evaluating the probabilistic deadline guarantee consideration. This is an offline approach, since the time overhead of getting the solution in Oracle is prohibitively high.
• MOHEFT. We select a state-of-the-art multi-objective approach [40] for comparison. According to the previous study [40], MOHEFT is able to search the instance configuration space and obtain a set of non-dominated solutions on the monetary cost and execution time.

We conduct our experiments on both real clouds and a simulator. These two approaches are complementary, because some scientific workflows (such as Ligo and Epigenomics) are not publicly available. Specifically, when the workflows (including the input data and executables, etc.) are publicly available, we run them on public clouds. Otherwise, we simulate the execution with synthetic workflows according to the workflow characteristics from existing studies [19].

On Amazon EC2, we adopt a popular workflow management system (Pegasus [41]) to manage the execution of workflows. We create an Amazon Machine Image (AMI) installed with Pegasus and its prerequisites such as DAGMan [42] and Condor [43]. We modify the Pegasus (release 4.3.2) scheduler to enable scheduling the tasks onto instances according to the hybrid instance configurations. A script written with the Amazon EC2 API is developed for acquiring and releasing instances at runtime.

We develop a simulator based on CloudSim [44]. We mainly present our new extensions, and more details on cloud simulations can be found in the original paper [44]. The simulator includes three major components, namely Cloud, Instance and Workflow. The Cloud component maintains a pool of resources which supports acquisition and release of Instance components. It also maintains the I/O and network performance histograms measured from Amazon EC2 to simulate cloud dynamics. A spot price trace obtained from the Amazon EC2 history is also maintained to simulate the price dynamics. The Instance component simulates the on-demand and spot instances, with cloud dynamics from the calibration. We simulate the cloud dynamics in the granularity of seconds, which means the average I/O and network performance per second conform to the distributions from calibration. The Workflow component manages the workflow structures and the scheduling of tasks onto the simulated instances.

Experimental settings. We acquire the four measured types of instances from the US East region using the created AMI. The hourly costs of the on-demand instance for the four instance types are shown in Table 1. Those four instances have also been used in the previous studies [15]. As for the instance acquisition time (lag), our experiments show that each on-demand instance acquisition costs 2 minutes and spot instance acquisition costs 7 minutes on average. This is consistent with the existing studies [45].

The deadline of workflows is an important factor for the candidate space of determining the instance configuration. There are two deadline settings of particular interest: $D_{min}$ and $D_{max}$, the expected execution time of all the tasks in the critical path of the workflow all on the m1.xlarge and m1.small instances, respectively. By default, we set the deadline to be $(D_{min} + D_{max})/2$.

We assume there are many workflows submitted by the users to the WaaS provider. In each experiment, we submit 100 jobs of the same workflow structure to the cloud. We assume the job arrival conforms to a Poisson distribution.
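The submission pattern and the default deadline described above can be sketched as follows. This is a minimal illustration under stated assumptions: Poisson arrivals are modeled as a Poisson process with exponentially distributed gaps between jobs, the rate value and its time unit are illustrative, and the function names and the $D_{min}$, $D_{max}$ values are hypothetical.

```python
import random


def poisson_arrival_times(num_jobs, rate, rng):
    """Arrival times of a Poisson process: independent exponential gaps
    with mean 1 / rate between consecutive jobs."""
    times = []
    now = 0.0
    for _ in range(num_jobs):
        now += rng.expovariate(rate)
        times.append(now)
    return times


def default_deadline(d_min, d_max):
    """Midpoint between the two extreme deadline settings."""
    return (d_min + d_max) / 2.0


rng = random.Random(1)
arrivals = poisson_arrival_times(num_jobs=100, rate=0.1, rng=rng)
print(len(arrivals), round(arrivals[-1], 1))

# Hypothetical critical-path times in hours for the fastest and slowest types.
print(default_deadline(d_min=2.0, d_max=6.0))
```

A larger rate packs the same 100 jobs into a shorter window, so more jobs overlap in time.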

TABLE 2
Parameters of I/O Performance Distributions

Instance type   Sequential I/O (Gamma)   Random I/O (Normal)
m1.small        k = 129.3, θ = 0.79      μ = 150.3, σ = 50.0
m1.medium       k = 127.1, θ = 0.80      μ = 128.9, σ = 8.4
m1.large        k = 376.6, θ = 0.28      μ = 172.9, σ = 34.8
m1.xlarge       k = 408.1, θ = 0.26      μ = 1,034.0, σ = 146.4
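As a small worked example on the Table 2 parameters, the mean and the relative spread of each fitted distribution follow directly from standard formulas. The sketch assumes that θ is the Gamma scale parameter (so the mean is k·θ and the coefficient of variation is 1/√k); the helper names are hypothetical. For m1.small it gives values in line with the roughly 9 and 33 percent variation that the paper reports for sequential and random I/O.

```python
import math


def gamma_mean(k, theta):
    """Mean of a Gamma distribution with shape k and scale theta."""
    return k * theta


def gamma_cv(k):
    """Coefficient of variation of a Gamma distribution: std / mean = 1 / sqrt(k)."""
    return 1.0 / math.sqrt(k)


def normal_cv(mu, sigma):
    """Coefficient of variation of a normal distribution: sigma / mu."""
    return sigma / mu


# m1.small row of Table 2.
seq_k, seq_theta = 129.3, 0.79
rnd_mu, rnd_sigma = 150.3, 50.0

print(round(gamma_mean(seq_k, seq_theta), 1))
print(round(100 * gamma_cv(seq_k)), round(100 * normal_cv(rnd_mu, rnd_sigma)))
```

The same three helpers can be applied to the other rows to compare how stable each instance type is.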

The parameter λ in the Poisson distribution affects the chance for virtual machine reuse. By default, we set λ as 0.1.

As for metrics, we study the average monetary cost and elapsed time for a workflow. All the metrics in the figures in Sections 4.3 and 4.4 are normalized to those of Static. Given the probabilistic deadline requirement, we run the compared algorithms multiple times on the cloud and record their monetary cost and execution time. We consider monetary cost as the main metric for comparing the optimization effectiveness of different scheduling algorithms when they all satisfy the QoS requirements. By default, we set the probabilistic deadline requirement as 96 percent. By default, we present the results obtained when all parameters are set to their default setting. In Section 4.4, we experimentally study the impact of different parameters with sensitivity studies.

4.2 Cloud Dynamics
In this section, we present the performance dynamics observed on Amazon EC2. The price dynamics have been presented in Table 1 of Section 2.

Figs. 7 and 8 show the measurements of random I/O performance and downloading network performance from Amazon S3 of m1.medium instances. We have observed similar results on other instance types. We make the following observations.

Fig. 7. The histogram and probabilistic distribution of random I/O performance on m1.medium instances.

First, both I/O and network performances can be modeled with normal or Gamma distributions. We verify the distributions with null hypothesis testing, and find that (1) the sequential I/O performance, and uploading and downloading network bandwidth from/to S3 of the four instance types follow the Gamma distribution; (2) the random I/O performance distributions on the four instance types follow the normal distribution. The parameters of those distributions are presented in Tables 2 and 3. Those results are mainly based on the measurement on real clouds. They are the result of different network and disk I/O patterns interplaying with the shared virtualization environments. However, we do not know the underlying reason that the random disk I/O performance follows the normal distribution and other patterns follow the Gamma distribution.

Second, the I/O and network performance of the same instance type varies significantly, especially for m1.small and m1.medium instances. This can be observed from the θ parameter of Gamma distributions or the σ parameter of normal distributions in Tables 2 and 3. Additionally, random I/O performance varies more significantly than sequential I/O performance on the same instance type. The coefficients of variation of sequential and random I/O performance on m1.small are 9 and 33 percent, respectively. That indicates the necessity of capturing the performance dynamics into our performance guarantee.

Third, the performance of different instance types also differs greatly. This can be observed from the k · θ parameter (the expected value) of Gamma distributions or the μ parameter of normal distributions in Tables 2 and 3. Due to the significant differences among different instance types, we need to carefully select the suitable instance types so that the monetary cost is minimized.

Finally, we observe that the performance distributions of the on-demand instance types are the same as or very close to those of the spot instance types.

4.3 Overall Comparison
In this subsection, we present the overall comparison results of Dyna and the other compared algorithms on Amazon EC2 and the cloud simulator under the default settings. Sensitivity studies are presented in Section 4.4. Note that we have used the calibrations from Section 4.2 as input to Dyna for performance and cost estimations.

Fig. 9 shows the average monetary cost per job results of Static, DynaNS, SpotOnly, Dyna and Oracle methods on the Montage, Ligo and Epigenomics workloads. The standard

TABLE 3
Gamma Distribution Parameters on Bandwidth between an Instance and S3

Instance type   Uploading bandwidth     Downloading bandwidth
m1.small        k = 107.3, θ = 0.55     k = 51.8, θ = 1.8
m1.medium       k = 421.1, θ = 0.27     k = 279.9, θ = 0.55
m1.large        k = 571.4, θ = 0.22     k = 6,187.7, θ = 0.44
m1.xlarge       k = 420.3, θ = 0.29     k = 15,313.4, θ = 0.23

Fig. 8. The histogram and probability distribution of downloading bandwidth between m1.medium instances and S3 storage.

errors of the monetary cost results of Static, DynaNS, SpotOnly, Dyna and Oracle are 0.01-0.06, 0.02-0.04, 0.40-0.76, 0.01-0.03 and 0.01-0.03, respectively, on the tested workloads. The absolute values of the average monetary cost of Static are $285, $476 and $214 for Montage, Ligo and Epigenomics, respectively. Overall, Dyna obtains the smallest monetary cost among the online approaches in all three workloads, saving monetary cost over Static, DynaNS and SpotOnly by 15-73 percent, 1-33 percent and 78-85 percent, respectively. We make the following observations.

Fig. 9. The normalized average monetary cost optimization results of compared algorithms on Montage, Ligo and Epigenomics workflows.

First, DynaNS obtains smaller monetary cost than Static, because the proposed A* configuration search technique is capable of finding cheaper instance configurations and is suitable for different structures of workflows. This also shows that performing deadline assignment before instance configuration in the Static algorithm reduces the optimization effectiveness. For example, with the deadline assignment approach, the instance configuration of a task has to make sure that its execution time is no longer than its assigned sub-deadline. However, this task can actually make use of the left-over time from its previous tasks and be assigned to a cheaper instance type.

Second, Dyna obtains smaller monetary cost than DynaNS, meaning that the hybrid configuration with both spot and on-demand instances is effective in reducing monetary cost, in comparison with the on-demand only approach. For lower probabilistic deadline guarantees, the monetary cost saved by Dyna over DynaNS gets higher because the opportunity for leveraging spot instances gets higher. Fig. 10 shows the monetary cost results of the compared algorithms when the probabilistic deadline guarantee is set to 90 percent. In this setting, the monetary cost reduction of Dyna over DynaNS is even higher than in the default setting, by 28-37 percent.

Fig. 10. The normalized average monetary cost results of compared algorithms on Montage, Ligo and Epigenomics workflows when the probabilistic deadline guarantee is 90 percent.

Third, SpotOnly obtains the highest monetary cost among all the compared algorithms. This is due to the dynamic characteristic of spot price. Fig. 11 shows the histogram of the spot price during the month of the experiments. Although the spot price is lower than the on-demand price of the same type most of the time, it can be very high compared to the on-demand price at some times. As shown in Table 1, the highest spot price for an m1.small instance in August 2013 is $10, which is more than 160 times higher than the on-demand price. Nevertheless, this observation depends on the fluctuation of spot price. The results on comparing SpotOnly and Dyna can be different if we run the experiments at other times. We study the sensitivity of Dyna and SpotOnly to spot price with another spot price history in Section 4.4.

Fig. 11. Histogram of the spot price history in August 2013, US East Region of Amazon EC2.

Fig. 12. The normalized average execution time optimization results of compared algorithms on Montage, Ligo and Epigenomics workflows.

Fig. 12 shows the average execution time of a workflow of Static, DynaNS, SpotOnly, Dyna and Oracle methods on the Montage, Ligo and Epigenomics workloads. The standard errors of the execution time results of the compared algorithms are between 0.01-0.06 on the tested workloads. Static has the smallest average execution time, which is around 3.4, 6.3 and 2.5 hours for Montage, Ligo and Epigenomics, respectively. This is because Static configures each task in workflows with better and more expensive instance types. The careful selection of bidding price for each task in the workflow in Dyna and high bidding prices in SpotOnly diminish the out-of-bid events during execution. All of

DynaNS, SpotOnly, Dyna and Oracle are able to guarantee the probabilistic deadline requirement.

Finally, we analyze the optimization overhead of the compared algorithms. The optimization overhead results are shown in Table 4. Note that, for workflows with the same structure and profile, our system only needs to do the optimization once. Although Oracle obtains smaller monetary cost than Dyna, the optimization overhead of Oracle is 16-44 times as high as that of Dyna. This shows that Dyna is able to find optimization results close to the optimal results in much shorter time. Due to the long execution time of the Oracle optimization, in the rest of the experiments, we do not evaluate Oracle but only compare Dyna with Static, DynaNS and SpotOnly.

TABLE 4
Optimization Overhead of the Compared Algorithms on Montage, Ligo and Epigenomics Workflows (Seconds)

Workflow      Static   DynaNS   SpotOnly   Dyna   Oracle
Montage       1        153      153        163    2,997
Ligo          1        236      236        244    10,452
Epigenomics   1        166      166        175    2,722

Fig. 13. The normalized average monetary cost and average execution time results of sensitivity studies on deadline.

Fig. 14. Breakdown of the instance types adopted by compared algorithms when the deadlines are Tight and Loose.

Fig. 15. The normalized average monetary cost and average execution time results of sensitivity studies on the probabilistic deadline guarantees.
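As a quick arithmetic check on Table 4, the overhead of Oracle relative to the online approaches can be computed directly from the tabulated seconds. The dictionary layout and the helper name below are illustrative choices, not part of the paper. The ratios come out at roughly 16 to 43 relative to Dyna and roughly 16 to 44 relative to DynaNS.

```python
# Optimization overhead in seconds, copied from Table 4.
OVERHEAD_SECONDS = {
    "Montage": {"DynaNS": 153, "Dyna": 163, "Oracle": 2997},
    "Ligo": {"DynaNS": 236, "Dyna": 244, "Oracle": 10452},
    "Epigenomics": {"DynaNS": 166, "Dyna": 175, "Oracle": 2722},
}


def oracle_ratio(workflow, baseline):
    """How many times slower the Oracle optimization is than the baseline."""
    row = OVERHEAD_SECONDS[workflow]
    return row["Oracle"] / row[baseline]


for workflow in OVERHEAD_SECONDS:
    print(
        workflow,
        round(oracle_ratio(workflow, "Dyna"), 1),
        round(oracle_ratio(workflow, "DynaNS"), 1),
    )
```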

4.4 Sensitivity Studies

We have conducted sensitivity studies on different workflows. Since we observed similar results across workflows, we focus on Montage workflows in the following. In each study, we vary one parameter at a time and keep other parameters in their default settings.

Deadline. Deadline is an important factor for determining the instance configurations. We evaluate the compared algorithms under deadline requirement varying from $1.5 \times D_{min}$ (denoted as Tight), $0.5 \times (D_{min} + D_{max})$ (denoted as Medium) to $0.75 \times D_{max}$ (denoted as Loose). All results are normalized to those of Static when the deadline is Medium. Fig. 13 shows the average monetary cost per job and average execution time results. Dyna obtains the smallest average monetary cost among the compared algorithms under all tested deadline settings. As the deadline gets looser, the monetary cost decreases since more of the cheaper (on-demand) instances are used for execution. We break down the number of different types of on-demand instances when the deadlines are Tight and Loose as shown in Fig. 14. When the deadline is Loose, more cheap instances are utilized. Under the same deadline, e.g., Tight, DynaNS utilizes more cheap instances than Static, which again shows our A* approach is better than the existing heuristics [1] in the previous study. This trend does not apply to SpotOnly because the spot price of the m1.medium instance can be lower than that of the m1.small instance at some times. We have validated this phenomenon by studying the spot price trace.

Probabilistic deadline guarantee. We evaluate the effectiveness of Dyna on satisfying probabilistic deadline requirements when the requirement varies from 90 to 99.9 percent. Fig. 15 shows the average monetary cost per job and average execution time results of the compared algorithms. Dyna achieves the smallest monetary cost for different probabilistic deadline guarantee settings. With a lower probabilistic deadline requirement, the monetary cost saved by Dyna is higher. DynaNS, SpotOnly and Dyna can guarantee the probabilistic deadline requirement under all settings.

Fig. 16. The normalized average monetary cost results of sensitivity studies on the arrival rate of workflows.

Arrival rate. We evaluate the effectiveness of Dyna when the arrival rate λ of workflows varies from 0.1, 0.2, 0.4, 0.6, 0.8, 0.9 to 1.0. All results are normalized to those when the arrival rate is 0.1. Fig. 16 shows the optimized average monetary cost per job. Dyna obtains the smallest average
monetary cost under all job arrival rates. As the job arrival rate increases, the average cost per job decreases. This is because the runtime optimizations that we adopt from the previous study [1], including consolidation and instance reuse, can enable resource sharing between workflow jobs. When the arrival rate increases, there are more jobs arriving in the WaaS at the same time. The resource utilization of the WaaS is increased and the partial instance time is better utilized. The dashed lines in Fig. 16 indicate the average monetary cost of the compared algorithms without the runtime optimizations. For the arrival rates that we have evaluated, the instance configuration is still essential for the monetary cost reduction in Dyna, in comparison with runtime consolidations and instance reuse.

Spot price. To study the sensitivity of Dyna and SpotOnly to the spot price variance, we use simulations to study the compared algorithms on different spot price histories. Particularly, we study the compared algorithms with the spot price history of the Asia Pacific Region in December 2011. As shown in Table 5, the spot price during this period is very low and stable, in comparison with the period in August 2013 when we performed the experiments. Thus the spot instances are less likely to fail during the execution (the failing probability ffp is rather low). Fig. 17 shows the obtained monetary cost result. SpotOnly and Dyna obtain similar monetary cost results, which are much lower than those of Static and DynaNS. This demonstrates that Dyna is able to obtain good monetary cost optimization results for different spot price distributions.

TABLE 5
Statistics on Spot Prices ($/hour, December 2011, Asia Pacific Region) and On-Demand Prices of Amazon EC2

Instance type  Average  Stdev  Min    Max    OnDemand
m1.small       0.041    0.003  0.038  0.05   0.06
m1.medium      0.0676   0.003  0.064  0.08   0.12
m1.large       0.160    0.005  0.152  0.172  0.24
m1.xlarge      0.320    0.009  0.304  0.336  0.48

Fig. 17. The simulation result of the normalized average monetary cost obtained by the compared algorithms, using the spot price history of the Asia Pacific Region of Amazon EC2 in December, 2011.

4.5 Comparison with Multi-Objective Method

Finally, we present the comparison results of Dyna with MOHEFT [40] on the Epigenomics workflows with simulations. For each solution obtained from MOHEFT, we use its expected execution time as the deadline constraint for Dyna, and then we compare the monetary cost between Dyna and MOHEFT. All other parameters are set as default. All results are normalized to the lowest monetary cost obtained by MOHEFT.

Fig. 18 shows the normalized average monetary cost results of the compared algorithms. In this experiment, MOHEFT outputs five different solutions. We denote the average execution time of different MOHEFT solutions to be t1, t2, ..., t5. Dyna obtains smaller monetary cost than MOHEFT due to the utilization of spot instances, and can ensure the deadline constraint under all settings. Although MOHEFT optimizes both the monetary cost and execution time as a multi-objective optimization problem, none of its solutions dominates the solution of Dyna.

Fig. 18. Simulation results of Dyna and MOHEFT with Epigenomics workflow.

5 CONCLUSIONS

With the growing popularity of various scientific and data-intensive applications in the cloud, hosting WaaS in IaaS clouds is emerging. However, the IaaS cloud is a dynamic environment with performance and price dynamics, which make the assumption of static task execution time and the QoS definition of deterministic deadlines undesirable. In this paper, we propose the notion of probabilistic performance guarantees as QoS to explicitly expose the cloud dynamics to users. We develop a workflow scheduling system named Dyna to minimize the monetary cost for the WaaS provider while satisfying predefined probabilistic deadline guarantees for individual workflows. We develop an A* search based instance configuration method to address the performance dynamics, and hybrid instance configuration of both spot and on-demand instances for price dynamics. We deploy Dyna on both Amazon EC2 and a simulator and evaluate its effectiveness with three scientific workflow applications. Our experimental results demonstrate that Dyna achieves much lower monetary cost than the state-of-the-art approaches (by 73 percent) while guaranteeing users' probabilistic deadline requirements.

ACKNOWLEDGMENTS

The authors would like to thank anonymous reviewers for their valuable comments. The authors acknowledge the support from the Singapore National Research Foundation under its Environmental & Water Technologies Strategic Research Programme and administered by the Environment & Water Industry Programme Office (EWI) of the PUB,
under project 1002-IRIS-09. This work is partly supported by a MoE AcRF Tier 1 grant (MOE 2014-T1-001-145) in Singapore. Amelie Chi Zhou is also with Nanyang Environment and Water Research Institute (NEWRI). Amelie Chi Zhou is the corresponding author.

REFERENCES

[1] M. Mao and M. Humphrey, "Auto-scaling to minimize cost and meet application deadlines in cloud workflows," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2011, pp. 1-12.
[2] M. Malawski, G. Juve, E. Deelman, and J. Nabrzyski, "Cost- and deadline-constrained provisioning for scientific workflow ensembles in IaaS clouds," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2012, pp. 1-11.
[3] A. C. Zhou, B. He, and S. Ibrahim, "A taxonomy and survey on escience as a service in the cloud," arXiv preprint arXiv:1407.7360, 2014.
[4] J. Yu, R. Buyya, and C. K. Tham, "Cost-based scheduling of scientific workflow application on utility grids," in Proc. 1st Int. Conf. E-Science Grid Comput., 2005, pp. 8147.
[5] R. Sakellariou, H. Zhao, E. Tsiakkouri, and M. D. Dikaiakos, "Scheduling workflows with budget constraints," in Proc. CoreGRID, 2007, pp. 189-202.
[6] R. Duan, R. Prodan, and T. Fahringer, "Performance and cost optimization for multiple large-scale grid workflow applications," in Proc. ACM/IEEE Conf. Supercomput., 2007.
[7] S. Abrishami, M. Naghibzadeh, and D. H. J. Epema, "Deadline-constrained workflow scheduling algorithms for IaaS clouds," Future Gen. Comput. Syst., vol. 29, pp. 15169, 2013.
[8] E.-K. Byun, Y.-S. Kee, J.-S. Kim, and S. Maeng, "Cost optimized provisioning of elastic resources for application workflows," Future Gen. Comput. Syst., vol. 27, pp. 1011-1026, 2011.
[9] S. Maguluri, R. Srikant, and L. Ying, "Stochastic models of load balancing and scheduling in cloud computing clusters," in Proc. IEEE INFOCOM, 2012, pp. 702-710.
[10] F. Zhang, J. Cao, K. Hwang, and C. Wu, "Ordinal optimized scheduling of scientific workflows in elastic compute clouds," in Proc. IEEE 3rd Int. Conf. Cloud Comput. Technol. Sci., 2011, pp. 9-17.
[11] A. C. Zhou and B. He, "Simplified resource provisioning for workflows in IaaS clouds," in Proc. IEEE 6th Int. Conf. Cloud Comput. Technol. Sci., 2014, pp. 650-655.
[12] J. Schad, J. Dittrich, and J.-A. Quiane-Ruiz, "Runtime measurements in the cloud: observing, analyzing, and reducing variance," Proc. VLDB Endowment, vol. 3, pp. 460-471, 2010.
[13] A. Iosup, S. Ostermann, N. Yigitbasi, R. Prodan, T. Fahringer, and D. Epema, "Performance analysis of cloud computing services for many-tasks scientific computing," IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 6, pp. 931-945, Jun. 2011.
[14] H. Wang, Q. Jing, R. Chen, B. He, Z. Qian, and L. Zhou, "Distributed systems meet economics: pricing in the cloud," in Proc. HotCloud, 2010, pp. 1-7.
[15] S. Yi, A. Andrzejak, and D. Kondo, "Monetary cost-aware checkpointing and migration on amazon cloud spot instances," IEEE Trans. Services Comput., vol. 5, no. 4, pp. 512-524, 4th Quarter 2012.
[16] M. Mazzucco and M. Dumas, "Achieving performance and availability guarantees with spot instances," in Proc. IEEE 13th Int. Conf. High Perform. Commun., 2011, pp. 296-303.
[17] B. Javadi, R. Thulasiram, and R. Buyya, "Statistical modeling of spot instance prices in public cloud environments," in Proc. IEEE 4th Int. Utility Cloud Comput., 2011, pp. 219-228.
[18] O. Agmon Ben-Yehuda, M. Ben-Yehuda, A. Schuster, and D. Tsafrir, "Deconstructing amazon EC2 spot instance pricing," in Proc. IEEE 3rd Int. Conf. Cloud Comput. Technol. Sci., 2011, pp. 304-311.
[19] G. Juve, A. Chervenak, E. Deelman, S. Bharathi, G. Mehta, and K. Vahi, "Characterizing and profiling scientific workflows," Future Gen. Comput. Syst., vol. 29, pp. 682-692, 2013.
[20] L. Abeni and G. Buttazzo, "QoS guarantee using probabilistic deadlines," in Proc. Euromicro Conf. Real-Time Syst., 1999, pp. 242-249.
[21] R. N. Calheiros and R. Buyya, "Meeting deadlines of scientific workflows in public clouds with tasks replication," IEEE Trans. Parallel Distrib. Syst., 2013, pp. 1786-1796.
[22] H. Kloh, B. Schulze, R. Pinto, and A. Mury, "A bi-criteria scheduling process with CoS support on grids and clouds," Concurrency Computat. Pract. Exp., vol. 24, pp. 1443-1460, 2012.
[23] I. M. Sardiña, C. Boeres, and L. M. De A. Drummond, "An efficient weighted bi-objective scheduling algorithm for heterogeneous systems," in Proc. Int. Conf. Parallel Process., 2009, pp. 102-111.
[24] C. Lin and S. Lu, "Scheduling scientific workflows elastically for cloud computing," in Proc. IEEE Int. Conf. Cloud Comput., 2011, pp. 746-747.
[25] S. Di, C.-L. Wang, and F. Cappello, "Adaptive algorithm for minimizing cloud task length with prediction errors," IEEE Trans. Cloud Comput., vol. 2, no. 2, pp. 194-207, Apr.-Jun. 2014.
[26] M. Rodriguez and R. Buyya, "Deadline based resource provisioning and scheduling algorithm for scientific workflows on clouds," IEEE Trans. Cloud Comput., vol. 2, no. 2, pp. 222-235, Apr.-Jun. 2014.
[27] D. de Oliveira, V. Viana, E. Ogasawara, K. Ocana, and M. Mattoso, "Dimensioning the virtual cluster for parallel scientific workflows in clouds," in Proc. 4th ACM Workshop Sci. Cloud Comput., 2013, pp. 5-12.
[28] D. Oliveira, K. A. Ocaña, F. Baião, and M. Mattoso, "A provenance-based adaptive scheduling heuristic for parallel scientific workflows in clouds," J. Grid Comput., vol. 10, pp. 521-552, 2012.
[29] N. Roy, A. Dubey, and A. Gokhale, "Efficient autoscaling in the cloud using predictive models for workload forecasting," in Proc. IEEE Int. Conf. Cloud Comput., 2011, pp. 500-507.
[30] J. Yang, J. Qiu, and Y. Li, "A profile-based approach to just-in-time scalability for cloud applications," in Proc. IEEE Int. Conf. Cloud Comput., 2009, pp. 9-16.
[31] A. C. Zhou and B. He, "Transformation-based monetary cost optimizations for workflows in the cloud," IEEE Trans. Cloud Comput., vol. 2, no. 1, pp. 85-98, Jan.-Mar. 2013.
[32] S. Ostermann and R. Prodan, "Impact of variable priced cloud resources on scientific workflow scheduling," in Proc. 18th Int. Conf. Parallel Process., 2012, pp. 350-362.
[33] B. Javadi, R. K. Thulasiram, and R. Buyya, "Characterizing spot price dynamics in public cloud environments," Future Gen. Comput. Syst., vol. 29, pp. 988-999, 2013.
[34] H.-Y. Chu and Y. Simmhan, "Cost-efficient and resilient job life-cycle management on hybrid clouds," in Proc. IEEE 28th Int. Parallel Distrib. Process. Symp., 2014, pp. 327-336.
[35] H. Ballani, P. Costa, T. Karagiannis, and A. Rowstron, "Towards predictable datacenter networks," in Proc. ACM SIGCOMM Conf., 2011, pp. 242-253.
[36] M. Hovestadt, O. Kao, A. Kliem, and D. Warneke, "Evaluating adaptive compression to mitigate the effects of shared I/O in clouds," in Proc. IEEE Int. Symp. Parallel Distrib. Process. Workshops Phd Forum, 2011, pp. 1042-1051.
[37] S. Ibrahim, H. Jin, L. Lu, B. He, and S. Wu, "Adaptive disk i/o scheduling for MapReduce in virtualized environment," in Proc. Int. Conf. Parallel Process., 2011, pp. 335-344.
[38] Iperf [Online]. Available: http://iperf.sourceforge.net, Jul. 2014.
[39] Workflow Generator. (2014, Jul.) [Online]. Available: https://confluence.pegasus.isi.edu/display/pegasus/WorkflowGenerator
[40] J. J. Durillo, R. Prodan, and H. M. Fard, "MOHEFT: A multi-objective list-based method for workflow scheduling," in Proc. IEEE 4th Int. Conf. Cloud Comput. Technol. Sci., 2012, pp. 185-192.
[41] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, G. B. Berriman, J. Good, A. Laity, J. C. Jacob, and D. S. Katz, "Pegasus: A framework for mapping complex scientific workflows onto distributed systems," Sci. Program., vol. 13, pp. 219-237, 2005.
[42] CondorTeam, DAGMan [Online]. Available: http://cs.wisc.edu/condor/dagman, Jul. 2014.
[43] M. Litzkow, M. Livny, and M. Mutka, "Condor - A hunter of idle workstations," in Proc. 8th Int. Conf. Distrib. Comput. Syst., 1988, pp. 104-111.
[44] R. N. Calheiros, R. Ranjan, A. Beloglazov, C. A. F. De Rose, and R. Buyya, "Cloudsim: A toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms," Softw. Pract. Exper., vol. 41, pp. 23-50, 2011.
[45] M. Mao and M. Humphrey, "A performance study on the VM startup time in the cloud," in Proc. IEEE 5th Int. Conf. Cloud Comput., 2012, pp. 423-430.

Amelie Chi Zhou received the bachelor's and master's degrees from Beihang University. She is currently working toward the PhD degree at School of Computer Engineering of NTU, Singapore. Her research interests include cloud computing and database systems.

Bingsheng He received the bachelor's degree in computer science from SJTU, and the PhD degree in computer science from HKUST. He is an assistant professor in School of Computer Engineering of NTU, Singapore. His research interests include high-performance computing, cloud computing, and database systems.

Cheng Liu received the bachelor's and master's degrees from USTC. He is currently a research assistant in School of Computer Engineering of NTU, Singapore. His areas of expertise include structured peer-to-peer network and compiler.