Machine Status Prediction for Dynamic and Heterogenous Cloud Environment

Jinliang Xu, Ao Zhou∗, Shangguang Wang, Qibo Sun, Jinglin Li, Fangchun Yang
State Key Laboratory of Networking and Switching Technology
Beijing University of Posts and Telecommunications, Beijing, China
{jlxu,aozhou,sgwang,qbsun,jlli,fcyang}@bupt.edu.cn
Abstract—The widespread use of cloud computing services has made
cloud service reliability an important issue for both cloud providers
and users. To enhance cloud service reliability and reduce the
resulting losses, virtual machines should be monitored in real time
and their future status predicted before they crash. However, most
existing methods ignore two characteristics of real cloud
environments, which degrades their prediction performance: (1) the
cloud environment changes dynamically; (2) the cloud environment
consists of many heterogeneous physical and virtual machines. In this
paper, we investigate the predictive power of data collected from the
cloud environment and propose StaP, a simple yet general machine
learning model that predicts multiple machine statuses. We describe
the motivation, development, and optimization of StaP. Experimental
results validate its effectiveness.
Keywords—cloud environment, status prediction, heterogeneous,
dynamic.
I. INTRODUCTION
The large-scale use of cloud computing services for hosting
industrial applications has made cloud service reliability an
important issue for both cloud service providers and users [1]. A
cloud environment employs a large number of interconnected physical
machines to deliver highly reliable cloud computing services.
Although the crash probability of a single virtual machine might be
low, it is magnified across the whole cloud environment [2].
To enhance cloud service reliability and reduce the resulting losses,
the future status of virtual machines, such as missing, locked,
paused, deleting, or the time to crash, should be predicted before it
occurs. However, most existing methods ignore the following
characteristics of real cloud environments, which degrades their
prediction performance [1], [3]: (1) the cloud environment changes
dynamically, which requires real-time response; (2) the cloud
environment consists of many heterogeneous physical machines, which
produce different items to handle, such as fan speed, CPU
temperature, memory, disk errors, application updates, and software
architecture.
In this paper, to tackle the above problems, we propose StaP, a
simple yet general machine learning model that predicts multiple
machine statuses. StaP automatically learns the representations of
different items and the correlations among them and, once trained,
predicts multiple statuses in real time. We detail the development
and optimization of StaP. Experimental results validate its
effectiveness.
II. THE PROPOSED MODEL
Let $V = \{v_m \in \mathbb{R}^D \mid m = 1, 2, \cdots, M\}$ denote
the representation vectors of the collected items in a
$D$-dimensional continuous space, where vector $v_m$ represents the
$m$-th item $b_m$. $V$ is shared across all machines and can be
learned automatically from the collected data. Let $S_n$ denote the
list of items collected from machine $u_n$; its length and elements
may vary from machine to machine. We denote the collected data by
$R = \{r_{m,n} \mid n = 1, \cdots, N,\ m \in S_n\}$, where $r_{m,n}$
stands for the value collected from $u_n$ on $b_m$. We use a one-hot
representation for each status and concatenate them all to form the
representation $y_n$ of all statuses of machine $u_n$ [4].
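As a minimal sketch of the status representation just described, the snippet below builds $y_n$ by concatenating one one-hot block per status. The function names and the example status sizes are hypothetical and serve only as illustration.

```python
def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def status_vector(status_values, status_sizes):
    """Concatenate one one-hot block per status into a single vector y_n."""
    y = []
    for value, size in zip(status_values, status_sizes):
        y.extend(one_hot(value, size))
    return y


# Hypothetical machine with two statuses: the first has 2 possible values,
# the second has 3. The machine takes value 1 of the first and value 0 of
# the second.
y_n = status_vector([1, 0], [2, 3])
print(y_n)  # [0.0, 1.0, 1.0, 0.0, 0.0]
```

Under this encoding, the length of $y_n$ is the total number of possible status values across all statuses.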
Based on this notation, we represent the conditional probability that
machine $u_n$ takes item $b_m$, using the corresponding value
$r_{m,n}$ as a scaling factor, as follows:
$$
p(v_m \mid y_n) = \big(p_r(v_m \mid y_n)\big)^{r_{m,n}}
= \frac{\exp\big(r_{m,n}\, v_m^T W y_n\big)}
       {\Big(\sum_{m' \in S_n} \exp\big(v_{m'}^T W y_n\big)\Big)^{r_{m,n}}},
\tag{1}
$$
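where $W \in \mathbb{R}^{D \times C}$ is a parameter that needs to be
learned from data. Then, according to the product rule of probability
theory, the probability that $u_n$ takes all items in $S_n$ is:
$$
p(S_n \mid y_n) = \prod_{m \in S_n} p(v_m \mid y_n)
= \frac{\exp\Big(\big(\sum_{m \in S_n} r_{m,n} v_m\big)^T W y_n\Big)}
       {\Big(\sum_{m \in S_n} \exp\big(v_m^T W y_n\big)\Big)^{\sum_{m \in S_n} r_{m,n}}}.
\tag{2}
$$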
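To make Eqs. 1 and 2 concrete, the following sketch evaluates both in log space on toy values and checks that the closed form of Eq. 2 equals the sum of the per-item terms of Eq. 1. The helper names and all numbers are illustrative assumptions, and plain Python lists are used to keep the example dependency-free.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, y):
    """W is a D x C nested list, y has length C; returns a length-D list."""
    return [dot(row, y) for row in W]


def log_p_item(m, items, r, V, W, y):
    """Log of Eq. (1): r_m * (v_m^T W y - log sum_{m'} exp(v_{m'}^T W y))."""
    Wy = matvec(W, y)
    log_norm = math.log(sum(math.exp(dot(V[j], Wy)) for j in items))
    return r[m] * (dot(V[m], Wy) - log_norm)


def log_p_items(items, r, V, W, y):
    """Log of Eq. (2), computed from its closed form."""
    Wy = matvec(W, y)
    phi = [sum(r[m] * V[m][d] for m in items) for d in range(len(Wy))]
    log_norm = math.log(sum(math.exp(dot(V[m], Wy)) for m in items))
    return dot(phi, Wy) - sum(r[m] for m in items) * log_norm


# Toy setting (all values hypothetical): D = 2, C = 2, three known items,
# of which the machine reports items 0 and 2.
V = {0: [0.1, 0.2], 1: [0.3, -0.1], 2: [-0.2, 0.4]}
W = [[0.5, -0.3], [0.2, 0.1]]
y = [1.0, 0.0]
items = [0, 2]
r = {0: 2.0, 2: 1.0}

per_item_sum = sum(log_p_item(m, items, r, V, W, y) for m in items)
closed_form = log_p_items(items, r, V, W, y)
print(abs(per_item_sum - closed_form) < 1e-12)  # True
```

The sketch does not guard against overflow in `math.exp`; for large scores a log-sum-exp with the maximum subtracted would be the safer choice.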
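Finally, we obtain the objective function as the log-likelihood over
all the machines:
$$
\mathcal{L}_{\mathrm{StaP}} = \sum_{n=1}^{N} \log p(y_n \mid S_n),
\tag{3}
$$
The parameters are additionally regularized during training; the
regularization constant is written $\lambda$ in Algorithm 1, where
$\eta$ denotes the learning rate.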
2016 IEEE International Conference on Cluster Computing
2168-9253/16 $31.00 © 2016 IEEE
DOI 10.1109/CLUSTER.2016.73

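Training the model amounts to finding the optimal parameters
$\Theta = \{W, V\}$ that maximize Eq. 3. With elementary algebraic
manipulations, the training target becomes:
$$
\Theta^* = \operatorname*{argmax}_{W, V} \sum_{n=1}^{N}
\Big[ \Big(\sum_{m \in S_n} r_{m,n} v_m\Big)^T W y_n
- \Big(\sum_{m \in S_n} r_{m,n}\Big)
  \log \sum_{m' \in S_n} \exp\big(v_{m'}^T W y_n\big) \Big].
\tag{4}
$$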
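The summand of Eq. 4 is the logarithm of Eq. 2. Written out as a short worked step, using only the symbols defined above:

```latex
\begin{aligned}
\log p(S_n \mid y_n)
&= \log \exp\Big(\big(\textstyle\sum_{m \in S_n} r_{m,n} v_m\big)^T W y_n\Big)
 - \log \Big(\textstyle\sum_{m' \in S_n} \exp\big(v_{m'}^T W y_n\big)\Big)^{\sum_{m \in S_n} r_{m,n}} \\
&= \Big(\sum_{m \in S_n} r_{m,n} v_m\Big)^T W y_n
 - \Big(\sum_{m \in S_n} r_{m,n}\Big) \log \sum_{m' \in S_n} \exp\big(v_{m'}^T W y_n\big).
\end{aligned}
```

The first term is linear in the weighted item vector $\sum_{m \in S_n} r_{m,n} v_m$, while the second term contains the normalizing sum over the items in $S_n$.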
Because direct optimization would incur a high computational cost,
we resort to the negative sampling technique [5] for efficiency.
The optimization process is shown in Algorithm 1, where $\sigma(x)$
denotes the logistic function.
Algorithm 1: Learning algorithm
Input: $R$, $Y = \{y_n\}$, learning rate $\eta$, maximum iterations
       maxIt, sampling number $k$;
Output: $W$, $V$;
Initialize $W$, $V$, $t = 0$; define $\varphi_n = \sum_{m \in S_n} r_{m,n} v_m$;
while $t{+}{+} <$ maxIt do
    for $n = 1$; $n \le N$; $n{+}{+}$ do
        $W = W + \eta\,\sigma(-\varphi_n^T W y_n)\,\varphi_n y_n^T$;
        for $m \in S_n$ do
            $v_m = v_m + \eta\,\sigma(-\varphi_n^T W y_n)\,r_{m,n} W y_n$;
        for $i = 1$; $i \le k$; $i{+}{+}$ do
            sample negative sample $y_i$;
            $W = W - \eta\,\sigma(\varphi_n^T W y_n)\,\varphi_n y_i^T$;
            for $m \in S_n$ do
                $v_m = v_m - \eta\,\sigma(\varphi_n^T W y_n)\,r_{m,n} W y_i$;
    Update $W = W - 2\lambda\eta W$, $V = V - 2\lambda\eta V$;
return $W$, $V$;
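The training loop of Algorithm 1 can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' implementation: negatives are drawn uniformly from the observed status vectors, the sigmoid for a negative sample is evaluated at the sampled $y_i$ (the standard negative-sampling gradient, whereas Algorithm 1 as printed evaluates it at $y_n$), $W$ and $v_m$ are updated from the same pre-update values, and the initialization, hyperparameters, and toy data are all hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def score(items, V, W, y):
    """Return (phi^T W y, phi, W y); `items` maps item index m -> r_{m,n}."""
    phi = [sum(r * V[m][d] for m, r in items.items()) for d in range(len(W))]
    Wy = [dot(row, y) for row in W]
    return dot(phi, Wy), phi, Wy


def sgd_step(items, V, W, y, label, eta):
    """One update on (S_n, y): label 1.0 for y_n, 0.0 for a negative y_i."""
    x, phi, Wy = score(items, V, W, y)
    # eta * sigma(-x) for a positive pair, -eta * sigma(x) for a negative one.
    g = eta * (label - sigmoid(x))
    for d in range(len(W)):
        for c in range(len(y)):
            W[d][c] += g * phi[d] * y[c]
    for m, r in items.items():
        for d in range(len(W)):
            V[m][d] += g * r * Wy[d]  # uses W y from before the W update


def train(R, Y, M, D, eta=0.05, lam=1e-4, max_it=200, k=2, seed=0):
    rng = random.Random(seed)
    C = len(Y[0])
    W = [[rng.gauss(0.0, 0.1) for _ in range(C)] for _ in range(D)]
    V = [[rng.gauss(0.0, 0.1) for _ in range(D)] for _ in range(M)]
    for _ in range(max_it):
        for n, items in enumerate(R):
            sgd_step(items, V, W, Y[n], 1.0, eta)
            for _ in range(k):
                # Assumed sampler: uniform over observed status vectors.
                y_neg = Y[rng.randrange(len(Y))]
                if y_neg != Y[n]:
                    sgd_step(items, V, W, y_neg, 0.0, eta)
        shrink = 1.0 - 2.0 * lam * eta  # W = W - 2*lam*eta*W, same for V
        for row in W + V:
            row[:] = [shrink * w for w in row]
    return W, V


# Hypothetical toy data: two kinds of machines with disjoint item sets.
R = [{0: 1.0, 1: 2.0}, {2: 1.0, 3: 2.0}, {0: 1.0, 1: 2.0}, {2: 1.0, 3: 2.0}]
Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W, V = train(R, Y, M=4, D=3)
print(score(R[0], V, W, Y[0])[0] > score(R[0], V, W, Y[1])[0])
```

On this toy data the observed status of each machine should end up with a higher score $\varphi_n^T W y$ than the alternative status.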
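With the trained parameters $W$ and $V$, we can predict $y_z$ for a
new machine $u_z$ according to
$$
\begin{aligned}
y_z^* &= \operatorname*{argmax}_{y \in Y} p(y \mid S_z)
       = \operatorname*{argmax}_{y \in Y} \tilde{p}(y)\, p(S_z \mid y) \\
      &= \operatorname*{argmax}_{y \in Y} \Big[ \log \tilde{p}(y)
       + \Big(\sum_{m \in S_z} r_{m,z} v_m\Big)^T W y
       - \Big(\sum_{m \in S_z} r_{m,z}\Big)
         \log \sum_{m' \in S_z} \exp\big(v_{m'}^T W y\big) \Big],
\end{aligned}
\tag{5}
$$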
where $\tilde{p}(y)$ is the empirical distribution of the machine
status representation $y$ given by $R$.
As the terms $\big(\sum_{m \in S_z} r_{m,z} v_m\big)^T W$,
$\sum_{m \in S_z} r_{m,z}$ and $v_m^T W$ are constant for all $y$,
they can be computed once, so obtaining $y_z^*$ is inexpensive. In
Eq. 5, the empirical distribution $\tilde{p}(y)$ can be regarded as
the prior probability of $y$, and $p(S_z \mid y)$ is closely related
to the likelihood function.
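The prediction rule of Eq. 5 can be sketched as follows. The terms that do not depend on $y$ are computed once before the loop over candidates, which is the cost argument made above. The function name, the candidate list, the prior, and the toy parameters are all hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def predict(items, V, W, candidates, prior):
    """Return the candidate y maximizing Eq. (5).

    `items` maps item index m -> r_{m,z}; `candidates` is a list of status
    vectors; `prior[i]` is the empirical probability of `candidates[i]`.
    """
    D, C = len(W), len(W[0])
    phi = [sum(r * V[m][d] for m, r in items.items()) for d in range(D)]
    r_total = sum(items.values())
    # y-independent pieces: phi^T W and v_m^T W for every m in S_z.
    phi_W = [sum(phi[d] * W[d][c] for d in range(D)) for c in range(C)]
    v_W = {m: [sum(V[m][d] * W[d][c] for d in range(D)) for c in range(C)]
           for m in items}

    best_y, best_score = None, -math.inf
    for y, p in zip(candidates, prior):
        log_norm = math.log(sum(math.exp(dot(v_W[m], y)) for m in items))
        s = math.log(p) + dot(phi_W, y) - r_total * log_norm
        if s > best_score:
            best_y, best_score = y, s
    return best_y


# Toy parameters (hypothetical): item 0 is aligned with the first status
# value, item 1 with the second, and the new machine mostly reports item 0.
V = {0: [1.0, 0.0], 1: [0.0, 1.0]}
W = [[2.0, 0.0], [0.0, 2.0]]
candidates = [[1.0, 0.0], [0.0, 1.0]]
prior = [0.5, 0.5]
print(predict({0: 3.0, 1: 1.0}, V, W, candidates, prior))  # [1.0, 0.0]
```

Swapping the two collected values, so that item 1 dominates, flips the prediction to the second candidate.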
III. EXPERIMENT
The experimental dataset contains 210,000 ratings given by 1,075
users to 2,000 books. Each user has individual information, such as
gender and age group, together with a rating list. We choose this
dataset because a user, the user's rating list, and the user's
individual information can be mapped to a virtual machine, its set of
collected items, and its two different future statuses, respectively.
We employ POP and SNE as baseline models, and weighted F1 and Hamming
Loss as evaluation metrics [4]. They are commonly used in multi-task
multi-class classification problems, which are similar to status
prediction tasks in cloud environments.
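For reference, the two metrics can be computed as in the sketch below. It assumes common definitions: weighted F1 is the support-weighted mean of per-class F1 within one task, and Hamming Loss is the fraction of wrongly predicted (machine, status) pairs. The exact variants used in the experiments follow [4] and may differ in detail; the labels in the example are made up.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += count * f1
    return total / len(y_true)


def hamming_loss(Y_true, Y_pred):
    """Fraction of wrong labels over all (machine, status) pairs.

    Each element of Y_true / Y_pred is a tuple with one label per status.
    """
    wrong = sum(t != p
                for yt, yp in zip(Y_true, Y_pred)
                for t, p in zip(yt, yp))
    return wrong / (len(Y_true) * len(Y_true[0]))


# Made-up labels for one status (task) with two classes.
f1 = weighted_f1([0, 0, 1, 1], [0, 1, 1, 1])
# Made-up labels for two machines with two statuses each.
hl = hamming_loss([(0, 1), (1, 2)], [(0, 2), (1, 2)])
print(round(f1, 4), hl)  # 0.7333 0.25
```

Higher is better for weighted F1, while lower is better for Hamming Loss.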
TABLE I
PERFORMANCE COMPARISON.

Training    |      Weighted F1      |     Hamming Loss
ratio (%)   |  POP    SNE    StaP   |  POP    SNE    StaP
50          | 0.095  0.213  0.278   | 0.464  0.469  0.467
70          | 0.096  0.315  0.350   | 0.489  0.463  0.458
90          | 0.096  0.367  0.379   | 0.451  0.452  0.443
The experimental results are shown in Table I. StaP achieves the
highest weighted F1 at every training ratio in Table I, and the
lowest Hamming Loss at the two larger training ratios; at the 50%
ratio its Hamming Loss (0.467) lies between those of POP (0.464) and
SNE (0.469). These results support the assumption that StaP is a
suitable model for predicting the status of virtual or physical
machines from data collected from the cloud environment.
IV. CONCLUSION
In this paper, we address the problem of virtual machine status
prediction for dynamic and heterogeneous cloud environments. More
specifically, we investigate the predictive power of data collected
on different items from the cloud environment and propose StaP, a
simple yet general machine learning model that automatically learns
the representations of different items and the correlations among
them, and predicts multiple statuses in real time. The experimental
results validate the effectiveness of the proposed model.
ACKNOWLEDGMENT
This work was supported by NSFC (61272521), NSFC
(61571066, 2016.01-2019.12), and the Fundamental Research
Funds for the Central Universities (2016RC19).
REFERENCES
[1] M. Dong, H. Li, K. Ota, L. T. Yang, and H. Zhu, “Multicloud-based
evacuation services for emergency management,” Cloud Computing,
IEEE, vol. 1, no. 4, pp. 50–59, 2014.
[2] P. Gill, N. Jain, and N. Nagappan, “Understanding network failures in data
centers: measurement, analysis, and implications,” in ACM SIGCOMM
Computer Communication Review, vol. 41, pp. 350–361, ACM, 2011.
[3] J. Liu, S. Wang, A. Zhou, S. Kumar, F. Yang, and R. Buyya, “Using
proactive fault-tolerance approach to enhance cloud service reliability,”
IEEE Transactions on Cloud Computing, pp. 1–13, 2016.
[4] P. Wang, J. Guo, Y. Lan, J. Xu, and X. Cheng, “Your cart tells you:
Inferring demographic attributes from purchase data,” in Proceedings of
ACM International Conference on Web Search and Data Mining (WSDM),
pp. 251–260, ACM, 2016.
[5] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean,
“Distributed representations of words and phrases and their
compositionality,” in Proceedings of Advances in Neural Information
Processing Systems (NIPS), pp. 3111–3119, 2013.