ABSTRACT
Cloud computing has emerged as a paradigm for delivering Information Technology services over the Internet. Services are provided according to a pricing model and meet requirements that are specified in Service Level Agreements (SLA). Recently, most cloud providers have included services for DataBase (DB) querying that run on a MapReduce platform and a virtualized architecture. Classical resource allocation methods for query optimization need to be revised to handle the pricing models of cloud environments. In this work, we propose a resource allocation method for query optimization in the cloud based on Integer Linear Programming (ILP). The proposed linear models can be solved with any fast ILP solver. The method is compared with some existing greedy algorithms. Experimental evaluation shows that the solution offers a good trade-off between the allocation quality and the allocation cost.

CCS CONCEPTS
• Computer systems organization → Cloud computing; • Computing methodologies → MapReduce algorithms;

KEYWORDS
Cloud Computing, PaaS, MapReduce, Query Optimization, Resource Allocation, Integer Linear-Programming

1 INTRODUCTION
Cloud computing has become a common way to provide on-demand Information Technology services. Cloud services are offered by a provider who owns a hardware architecture and a set of software tools that meet client needs. In the cloud, resources can be reserved and released in an elastic way, which means that it is possible to change the allocated amount at any time. The services are provided according to a pricing model and meet a set of performance requirements that are specified in Service Level Agreements (SLA). If the requirements are not met, the provider pays penalties to the client.

We are interested in cloud services for database querying (Platform-as-a-Service database, PaaS), particularly the problem of resource allocation. Most of the current cloud providers include services for database querying with languages similar to SQL, in which queries run on MapReduce [2] clusters (Hive [7]). Among these services, we mention Amazon Elastic MapReduce¹, Microsoft Azure HDInsight² and Oracle BigData Cloud service³. The proposed query languages are usually called SQL-like⁴. With these services, a SQL-like query is transformed into a set of dependent MapReduce jobs. Each job contains a set of parallel tasks. These tasks are submitted to an allocator that places them on the available resources and defines an execution schedule over time that respects precedence constraints and resource availability.

¹ https://aws.amazon.com/fr/emr/
² https://azure.microsoft.com/fr-fr/services/hdinsight/
³ https://cloud.oracle.com/bigdata
⁴ https://docs.treasuredata.com/articles/hive

Several solutions have been proposed for resource allocation in the MapReduce paradigm [4][9][11]. The aim of most of this work is to ensure fairness (i.e., assign resources so that all jobs get an equal share of resources over time) and data locality (i.e., assign each task to the node that contains its data). These methods are better suited to classical parallel environments and do not handle cloud constraints. In classical parallel environments, resource allocation is efficient when it minimizes execution time and maximizes throughput. However, in the cloud, the aim is to maximize the monetary gain of the provider while meeting the client requirements established in SLAs. Existing methods that take these aspects into account are generally based on greedy algorithms [3], which have the advantage of quick decision-making and simplicity of design. However, greedy methods do not give a theoretical guarantee on the quality of the solution in terms of monetary gain, which can negatively affect the provider's gain.

Motivated by the limitations of the above methods, we propose a resource allocation method for the execution of SQL-like queries in the cloud. The solution consists of two phases: (1) place tasks on the available resources and (2) choose the time windows allocated to each task. Each phase is modeled as an ILP, so it can be solved with any exact ILP optimization algorithm. In the experimental section, we compare our method with a one-phase ILP method and some existing greedy methods [3]. We show that
our method offers a good trade-off between the allocation cost and the monetary cost generated by the execution of queries.

The rest of this paper is organized as follows. In section 2, we present the considered database cloud services. Then we detail our resource allocation method in section 3. Section 4 reports the experimental results, while section 5 reviews some related work on resource allocation for MapReduce applications. Finally, we conclude in section 6.

2 CLOUD DB SERVICE DESCRIPTION

2.1 Query Compilation and Execution
Figure 1 shows the considered architecture. SQL-like queries are submitted through a client interface. The lexical and syntactic analyzer checks that the query is correct and generates a graph of operators (joins, projections, ...). Logical optimization consists in reducing the volume of manipulated data by applying classical algebraic tree transformation rules. Physical optimization determines the join algorithms and the join order, then generates the execution plan graph. The nodes of this graph are MapReduce jobs and its edges represent the dependencies between them. A query can be transformed into a job graph in different ways: (1) associate a job with each join operator [7], (2) associate one job with all operators [1], or (3) decompose the join operators into several groups and associate a MapReduce job with each group [10]. Intra-job parallelization consists in defining the number of Map and Reduce tasks of each job of the graph.
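To illustrate the execution-plan representation described above, the following sketch models a job graph as a DAG of MapReduce jobs with intra-job parallelism. The class names, fields and the example plan are illustrative assumptions, not the paper's implementation.

```python
# Illustrative representation of an execution plan: a DAG of MapReduce jobs.
# Class and field names are assumptions for the example only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class MapReduceJob:
    job_id: str
    n_map_tasks: int          # intra-job parallelism of the Map phase
    n_reduce_tasks: int       # intra-job parallelism of the Reduce phase
    depends_on: List[str] = field(default_factory=list)  # upstream jobs

@dataclass
class ExecutionPlan:
    query_id: str
    jobs: List[MapReduceJob]

    def ready_jobs(self, finished: set) -> List[MapReduceJob]:
        """Jobs whose dependencies are all finished and can be submitted."""
        return [j for j in self.jobs
                if j.job_id not in finished
                and all(d in finished for d in j.depends_on)]

# Example: a query compiled into one job per join operator.
plan = ExecutionPlan("q1", [
    MapReduceJob("join1", n_map_tasks=16, n_reduce_tasks=16),
    MapReduceJob("join2", n_map_tasks=24, n_reduce_tasks=16),
    MapReduceJob("join3", n_map_tasks=32, n_reduce_tasks=24,
                 depends_on=["join1", "join2"]),
])
print([j.job_id for j in plan.ready_jobs(finished=set())])  # ['join1', 'join2']
```

The allocator only has to respect the `depends_on` edges when it schedules tasks, which is the precedence constraint mentioned in the introduction.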
The provider's cloud infrastructure consists of a set of physical machines. A hypervisor, whose role is to manage the Virtual Machines (VMs), is installed on each physical machine. Each VM represents a MapReduce node. It contains a set of logical resources and a local resource allocation manager that receives allocation decisions from the global manager and returns the state of its resources. Each logical resource can contain only one task at a given time. A logical resource is an abstract representation of a certain reserved CPU, memory and storage capacity. The global resource allocation manager receives the graphs of execution plans and performs task placement and scheduling given the available resources. Section 3 is devoted to a new resource allocation method that takes economic aspects into account.
2.2 Economic Model
We propose an economic model for a PaaS database provider. The profit is defined by:

Profit = Income − Expenditure    (1)

We assume that O is the set of client classes, C(o) is the set of clients belonging to the class o ∈ O and Q(c) is the set of queries issued by the client c. We assume a query-based pricing model, i.e., the client pays an amount of money for each submitted query. The price of a query depends on its nature (number of operators, manipulated data sizes, ...) and on the client class. The income of the PaaS provider is equal to the price of all submitted queries:

Income = Σ_{o∈O} Σ_{c∈C(o)} Σ_{q∈Q(c)} QueryPrice(q)    (2)

Expenditures consist of resource costs and penalties:

Expenditure = Resources + Penalties    (3)

The resources are made available to the PaaS provider by an Infrastructure-as-a-Service (IaaS) provider in the form of VMs. VMs are rented by the PaaS provider depending on the duration of use. In Equation (4), T is the set of VM types, V(t) is the set of VMs of type t, Price_t is the price of using a VM of type t for one time unit, and D_v is the duration of use of the VM v ∈ V(t):

Resources = Σ_{t∈T} Σ_{v∈V(t)} Price_t · D_v + NetworkAccess    (4)

The penalties depend on the duration of the deadline violation (ViolationDur), the price of the query (QueryPrice) and a percentage that depends on the class of the client (PercentageSLA):

Penalties = Σ_{o∈O} Σ_{c∈C(o)} Σ_{q∈Q(c)} ViolationDur(q) · W(q)    (5)

where W(q) = QueryPrice(q) · PercentageSLA(o). The resource allocation models that we propose in the following are intended to minimize the expenditure (resources + penalties).
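As a concrete illustration of Equations (1)-(5), the Python sketch below evaluates the profit of a toy workload. The dictionaries, client classes and the flat NetworkAccess charge are hypothetical; only the VM prices ($1.5 and $0.75 per hour) are borrowed from the experimental setup later in the paper.

```python
# Illustrative evaluation of the economic model (Equations 1-5).
# All input structures and values are hypothetical examples.

query_price = {"q1": 4.0, "q2": 6.5}            # QueryPrice(q), in $
violation_dur = {"q1": 0.0, "q2": 2.0}          # ViolationDur(q), in time units
percentage_sla = {"gold": 0.10, "silver": 0.05} # PercentageSLA(o)
queries = {"gold": {"c1": ["q1"]}, "silver": {"c2": ["q2"]}}  # O -> C(o) -> Q(c)

price_per_type = {"type1": 1.5, "type2": 0.75}  # Price_t, $ per time unit
vm_usage = {"type1": {"vm1": 3.0}, "type2": {"vm2": 5.0}}     # D_v per rented VM
network_access = 0.4                            # flat NetworkAccess charge

# Equation (2): income = sum of query prices over all classes, clients, queries.
income = sum(query_price[q]
             for clients in queries.values()
             for qs in clients.values()
             for q in qs)

# Equation (4): resource cost = rented VM time priced per type + network access.
resources = network_access + sum(price_per_type[t] * dur
                                 for t, vms in vm_usage.items()
                                 for dur in vms.values())

# Equation (5): penalties weighted by W(q) = QueryPrice(q) * PercentageSLA(o).
penalties = sum(violation_dur[q] * query_price[q] * percentage_sla[o]
                for o, clients in queries.items()
                for qs in clients.values()
                for q in qs)

# Equations (3) and (1).
expenditure = resources + penalties
profit = income - expenditure
print(f"income={income:.2f} expenditure={expenditure:.2f} profit={profit:.2f}")
```

Since income is fixed once the queries are submitted, maximizing profit reduces to minimizing expenditure, which is exactly the objective targeted by the allocation models that follow.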
3 RESOURCE ALLOCATION METHOD
We propose a method based on Integer Linear Programming (ILP) for the problem of resource allocation. Given a set of logical resources, the aim of our allocation method is to find a placement and a schedule over time that minimize the monetary costs of the PaaS cloud provider. The proposed solution adopts a two-phase approach. First, placement chooses the pool of resources on which each task group will be executed. Then, scheduling chooses the time windows allocated to each task group. A resource pool is a set of Map (or Reduce) resources that have the same characteristics and are physically close to each other. A task group is a set of Map (or Reduce) tasks that belong to the same job. We assume that the cardinality of resource pools is equal to the cardinality of task groups. In the following, we present the ILP placement model (section 3.1) and then the ILP scheduling model (section 3.2).

3.1 Placement Model (1st phase)
The placement consists of choosing the resource pool on which each task group is executed. We introduce the following variables:

• x_{i,m,a} = 1 if the Map task group m of job i is placed on the Map resource pool a; 0 otherwise.
• y_{i,r,b} = 1 if the Reduce task group r of job i is placed on the Reduce resource pool b; 0 otherwise.
• z_{a,b} = the maximum amount of data transferred between the task groups placed in pool a and the task groups placed in pool b.

J is the set of jobs of all submitted queries, M_i is the set of Map task groups of job i, R_j is the set of Reduce task groups of job j, A is the set of Map resource pools and B is the set of Reduce resource pools:

x_{i,m,a} ∈ {0, 1}, ∀ i ∈ J, m ∈ M_i, a ∈ A    (6)

y_{j,r,b} ∈ {0, 1}, ∀ j ∈ J, r ∈ R_j, b ∈ B    (7)

z_{a,b} ∈ {0, 1, ..., UpperBound(z)}, ∀ a ∈ A, b ∈ B    (8)

Multiple tasks can be assigned to the same resource. The exclusivity of execution is then ensured over time by the scheduling model, which is presented in section 3.2.
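For readers who want to see how the decision variables (6)-(8) translate into solver input, the sketch below declares them with the open-source PuLP library and solves a toy instance. The sets, the data-volume matrix, the single-assignment constraint and the communication objective are assumptions added for illustration; the paper's full objective and constraint set is not reproduced here.

```python
# Minimal sketch of the placement variables (6)-(8) built with PuLP.
# Sets, data volumes, the extra constraints and the objective are
# illustrative assumptions, not the paper's complete placement model.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, LpInteger

jobs = ["j1", "j2"]                                # J
map_groups = {"j1": ["m1"], "j2": ["m1", "m2"]}    # M_i
red_groups = {"j1": ["r1"], "j2": ["r1"]}          # R_j
map_pools = ["a1", "a2"]                           # A
red_pools = ["b1", "b2"]                           # B
z_upper = 100                                      # UpperBound(z)
# Assumed data volume shipped from Map group m of job i to its Reduce group r.
data = {("j1", "m1", "r1"): 40, ("j2", "m1", "r1"): 10, ("j2", "m2", "r1"): 25}

model = LpProblem("placement", LpMinimize)

# (6) x_{i,m,a} and (7) y_{i,r,b}: binary placement variables.
x = {(i, m, a): LpVariable(f"x_{i}_{m}_{a}", cat=LpBinary)
     for i in jobs for m in map_groups[i] for a in map_pools}
y = {(i, r, b): LpVariable(f"y_{i}_{r}_{b}", cat=LpBinary)
     for i in jobs for r in red_groups[i] for b in red_pools}
# (8) z_{a,b}: bounded integer, the largest data volume exchanged between a and b.
z = {(a, b): LpVariable(f"z_{a}_{b}", lowBound=0, upBound=z_upper, cat=LpInteger)
     for a in map_pools for b in red_pools}

# Illustrative objective: a proxy for inter-pool communication.
model += lpSum(z.values())

# Each task group is placed on exactly one pool (assumed constraint).
for i in jobs:
    for m in map_groups[i]:
        model += lpSum(x[i, m, a] for a in map_pools) == 1
    for r in red_groups[i]:
        model += lpSum(y[i, r, b] for b in red_pools) == 1

# z_{a,b} dominates the data sent between Map/Reduce groups co-placed on (a, b).
for (i, m, r), d in data.items():
    for a in map_pools:
        for b in red_pools:
            model += z[(a, b)] >= d * (x[(i, m, a)] + y[(i, r, b)] - 1)

model.solve()
print({k: v.value() for k, v in z.items()})
```

Because the model is a plain ILP, any exact solver backend (CBC, Gurobi, CPLEX, ...) can be plugged in, which matches the claim that the proposed linear models can be handed to any fast ILP solver.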
Σ_{i∈J} Σ_{m∈M_i} Tm_i · x_{i,m,a} + Σ_{t<T} (1 − Fm_{a,t}) ≤ α, ∀ a ∈ A    (19)

Σ_{i∈J} Σ_{r∈R_i} Tr_i · x_{i,r,b} + Σ_{t<T} (1 − Fr_{b,t}) ≤ β, ∀ b ∈ B    (20)
[Figure 3 omitted: average monetary cost ($) per time unit for G-BRT, G-MPT, G-MPM and ILP2P, broken down by cost component (including communication and penalty costs). Panels: (a) 2 simple queries per time unit; (b) 3 simple queries per time unit; (c) 4 simple queries per time unit; (d) 2 complex queries per time unit; (e) 3 complex queries per time unit; (f) 4 complex queries per time unit.]
(1) G-BRT assigns the task with the maximum execution time to the resource that minimizes the standard deviation of resource utilization, (2) G-MPT assigns the task with the maximum execution time to the resource that minimizes its completion time, and (3) G-MPM assigns the task with the maximum output size to the resource that minimizes the monetary cost.
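To make the flavor of these baselines concrete, here is a small sketch of a G-MPT-style rule (longest task first, placed on the resource that minimizes its completion time). It is a simplified reconstruction over assumed in-memory structures, not the implementation evaluated in the paper.

```python
# Simplified G-MPT-style greedy rule: repeatedly take the pending task with the
# maximum execution time and assign it to the resource whose resulting
# completion time is smallest. Data structures are illustrative assumptions.

def greedy_mpt(task_durations, n_resources):
    """task_durations: dict task_id -> estimated execution time (time units)."""
    finish_time = [0.0] * n_resources          # when each resource becomes free
    assignment = {}                            # task_id -> (resource, start, end)
    # Longest task first.
    for task, dur in sorted(task_durations.items(), key=lambda kv: -kv[1]):
        # Resource that minimizes this task's completion time.
        r = min(range(n_resources), key=lambda i: finish_time[i] + dur)
        start = finish_time[r]
        finish_time[r] = start + dur
        assignment[task] = (r, start, finish_time[r])
    return assignment

# Example: 5 Map tasks on 2 logical resources.
print(greedy_mpt({"t1": 4.0, "t2": 3.5, "t3": 2.0, "t4": 1.5, "t5": 1.0}, 2))
```

Each decision is made locally with the information available at that step, which is precisely the behavior whose monetary consequences are analyzed below.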
In the simulation, we consider two types of VMs. A type1 VM contains 32 CPUs and 8 GB of RAM; its price per hour of use is $1.5. A type2 VM contains 16 CPUs and 4 GB of RAM; its price per hour of use is $0.75. We consider the arrival of simple queries (< 6 jobs per query) in sub-figures (3a), (3b), (3c) and of complex queries (≥ 6 jobs per query) in sub-figures (3d), (3e), (3f). Each job contains 16 to 40 Map (resp. Reduce) tasks. The size of each Map block is 256 or 512 MB. The initial resource availability rate is generated randomly. Each sub-figure reports the average monetary cost per time unit for a different arrival rate.

Sub-figures (3a) to (3f) show that G-MPM and ILP2P have a lower cost than G-BRT and G-MPT. The latter two methods handle load balancing and execution time reduction, but this is not sufficient to reduce monetary costs. Indeed, when we have a set of queries to place and schedule and we want to reduce costs, we should first schedule the query with the most restrictive deadline and penalty weight, not the query that minimizes the global execution time. G-MPM handles monetary cost but uses a greedy method in which a part of the solution is determined at each step of the algorithm. This part is determined with the information available at the current step, without taking into account all possible placement and scheduling configurations. This may lead to choices that seem attractive given the information available at the step where the choice is made but turn out to be poor choices later. Its results are thus worse than those of ILP2P, which adopts an exact approach.

In a second step, we compare our two-phase ILP method (ILP2P) with another ILP method (ILP1P) designed to show the advantages of adopting a two-phase approach. ILP1P is based on a single-phase approach, i.e., one ILP model that handles both placement and scheduling at the same time.

Table 1: Allocation cost (seconds)

         average   minimum   maximum
G-BRT      0.020     0.017     0.052
G-MPT      0.223     0.178     0.401
G-MPM      0.228     0.176     0.483
ILP2P      2.272     0.931    19.405
ILP1P    376.043    54.763  1201.742

The results in Figure 4 show that ILP1P obviously has the best monetary cost. Indeed, if the problems of placement and scheduling are treated at the same time, the search space is significantly larger, so it is more likely that a better solution in terms of monetary cost is found. On the other hand, dealing with the scheduling and placement problems at the same time gives rise to a more complex problem. Table 1 reports the average, minimum and maximum allocation times of the different methods. Given the complexity of ILP1P, its allocation time is very long and unreasonable in practice. Although ILP2P is slower than the greedy methods, its
allocation time remains reasonable and significantly better than that of ILP1P. Indeed, the allocation time is negligible compared to the execution time of the query. As shown in Figure 4, ILP2P makes it possible to gain about $1 per time unit compared to G-MPM. ILP2P thus offers a good trade-off between the allocation cost and the monetary cost.

[Figure 4 plot omitted: stacked CPU, memory, storage, communication and penalty costs per method; y-axis: monetary cost ($).]
Figure 4: Monetary cost comparison (GBRT, GMPT, GMPM, ILP2P, ILP1P) - 2 simple queries per time unit

5 RELATED WORK
Several methods have been proposed in the literature to improve resource allocation for MapReduce applications. Some solutions are more suitable for classical parallel environments [5][8][9][11], while others are dedicated to the cloud [3][6]. The goal of resource allocation in classical parallel environments is to minimize execution time and maximize throughput. Allocation in the cloud, on the other hand, supposes the existence of a provider and several clients with different needs; the goal is to meet the client requirements (specified in SLAs) while maximizing profit.

Most of the existing work for parallel environments is limited to independent tasks. The basic allocation algorithm for these environments is FIFO: the allocator assigns the oldest waiting task to the first available resource. This solution is unfair. Indeed, when long tasks are submitted, later short tasks must wait until the earlier ones finish. FAIR [11] is a resource allocation algorithm that solves this problem by considering fairness. This algorithm ensures that each user's queries receive a minimum resource capacity as long as there is sufficient demand. When a user does not need its minimum capacity, other users are allowed to take it. Despite its advantages, FAIR does not offer mechanisms to handle deadlines. ARIA [8] is a framework that manages this problem. For this purpose, ARIA builds a job profile that reflects the performance characteristics of the job for the Map and Reduce phases, then it defines a performance model that estimates the amount of Map and Reduce tasks needed for the job and its deadline. Finally, ARIA determines the job order for meeting deadlines based on the earliest-deadline-first policy.

The above resource allocation algorithms do not consider cloud features. Tan et al. [6] position their work in the context of multi-tenant parallel databases. Their solution handles SLAs. Nevertheless, they consider only performance metrics in the allocation decision, not economic aspects. Among the existing resource allocation work dedicated to the cloud, the work of Kllapi et al. [3] is the closest to ours. This work considers economic aspects. The authors explore three different problems: (1) minimize the execution time given a fixed budget, (2) minimize the monetary cost given deadlines and (3) find the right trade-off between execution time and monetary cost. They propose some greedy methods and a local search algorithm to allocate resources to dependent tasks. They show that the local search does not significantly improve the results compared to the greedy methods. However, it is known that greedy approaches do not theoretically guarantee the quality of the solution. This can have a negative impact on the provider's benefit. Unlike greedy methods, we propose in our work an ILP formulation of the problem, so an exact solution can be found. Our work is compared with greedy methods in the experimental section.

6 CONCLUSION
We addressed in this work the resource allocation problem for SQL-like queries in the cloud. We proposed an ILP-based method that handles placement and scheduling. We implemented the models and compared our work with some existing methods. The results showed that our method provides a higher monetary gain than the greedy algorithms with a reasonable allocation time. As future work, we plan to consider parameter estimation errors and to design efficient dynamic strategies that detect estimation errors during execution, then change the allocation plan to reduce the impact of these errors on the monetary cost.

REFERENCES
[1] Foto N. Afrati and Jeffrey D. Ullman. 2010. Optimizing joins in a map-reduce environment. In Proceedings of the 13th International Conference on Extending Database Technology. ACM, 99–110.
[2] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107–113.
[3] Herald Kllapi, Eva Sitaridi, Manolis M. Tsangaris, and Yannis Ioannidis. 2011. Schedule optimization for data processing flows on the cloud. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. ACM, 289–300.
[4] Minghong Lin, Li Zhang, Adam Wierman, and Jian Tan. 2013. Joint optimization of overlapping phases in MapReduce. Performance Evaluation 70, 10 (2013), 720–735.
[5] Zhihong Liu, Qi Zhang, Mohamed Faten Zhani, Raouf Boutaba, Yaping Liu, and Zhenghu Gong. 2015. DREAMS: Dynamic resource allocation for MapReduce with data skew. In 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM). IEEE, 18–26.
[6] Zilong Tan and Shivnath Babu. 2016. Tempo: robust and self-tuning resource management in multi-tenant parallel databases. Proceedings of the VLDB Endowment 9, 10 (2016), 720–731.
[7] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Antony, Hao Liu, and Raghotham Murthy. 2010. Hive - a petabyte scale data warehouse using Hadoop. In 2010 IEEE 26th International Conference on Data Engineering (ICDE). IEEE, 996–1005.
[8] Abhishek Verma, Ludmila Cherkasova, and Roy H. Campbell. 2011. ARIA: automatic resource inference and allocation for MapReduce environments. In Proceedings of the 8th ACM International Conference on Autonomic Computing. ACM, 235–244.
[9] Weina Wang, Kai Zhu, Lei Ying, Jian Tan, and Li Zhang. 2016. MapTask scheduling in MapReduce with data locality: Throughput and heavy-traffic optimality. IEEE/ACM Transactions on Networking 24, 1 (2016), 190–203.
[10] Sai Wu, Feng Li, Sharad Mehrotra, and Beng Chin Ooi. 2011. Query optimization for massively parallel data processing. In Proceedings of the 2nd ACM Symposium on Cloud Computing. ACM, 12.
[11] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. 2010. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the 5th European Conference on Computer Systems. ACM, 265–278.