Network Traffic Prediction based on Diffusion Convolutional Recurrent Neural Networks
Abstract—By predicting the traffic load on network links, a network operator can proactively arrange resource-allocation strategies to address, e.g., an incoming congestion event at an early stage. Traffic loads on different links of a telecom network are known to be strongly correlated, and this correlation, if properly represented, can be exploited to refine the prediction of future congestion events. Machine Learning (ML) is nowadays the state-of-the-art methodology for discovering complex relations among data. However, ML has traditionally been applied to data represented in the Euclidean space (e.g., images), and it may not be straightforward to employ it effectively to model graph-structured data (e.g., the events that take place in telecom networks). Recently, several ML algorithms specifically designed to learn models of graph-structured data have appeared in the literature. The main novelty of these techniques lies in their ability to learn a representation of each node of the graph considering both its properties (e.g., features) and the structure of the network (e.g., the topology). In this paper, we employ a recently proposed graph-based ML algorithm, the Diffusion Convolutional Recurrent Neural Network (DCRNN), to forecast the traffic load on the links of a real backbone network. We evaluate DCRNN's ability to forecast the volume of expected traffic and to predict congestion events, and we compare this approach to other existing approaches (such as LSTM and Fully-Connected Neural Networks). Results show that DCRNN outperforms the other methods both in terms of its forecasting ability (e.g., MAPE is reduced from 210% to 43%) and in terms of the prediction of congestion events, and represent a promising starting point for the application of DCRNN to other network management problems.
Index Terms—traffic forecasting, graph-based machine learning, network congestion
I. INTRODUCTION
As telecom networks become more and more complex (see, e.g., the enormous set of adjustable parameters to be managed in modern systems), it is also becoming increasingly important to limit human intervention and speed up network management procedures. Novel software solutions for network automation make it possible to automatically configure, provision, manage and test network devices, and can be used to increase infrastructure efficiency and reduce human error and operational expenditures.
In particular, network traffic prediction plays an important role in many areas of networking, such as network management, network design, short- and long-term resource allocation, traffic (re)routing and anomaly detection. Two categories of prediction methods, based on long-term and short-term periods, are typically considered. Long-term traffic prediction is used to estimate future capacity requirements, and therefore enables more effective planning decisions. Short-term traffic prediction (i.e., predictions within minutes or even seconds) is usually linked to dynamic resource allocation, and can be used to improve Quality of Service (QoS) mechanisms as well as for congestion control and optimal resource management. Several different techniques, including time series models, modern data mining techniques, soft computing approaches, and neural networks, have been used for network traffic analysis and prediction [1].
Within a telecom network, traffic is exchanged between nodes and crosses network links. Such links are related to each other, i.e., due to their adjacency, their behaviour is correlated. For example, congestion is more likely to occur in links adjacent to a congested link than elsewhere. Due to the large amount of data that is available today in telecom networks, algorithms coming from the area of Machine Learning (ML) have been investigated to enable network intelligence [2], thanks to the ability of ML to extract useful (and sometimes "hidden") information from data. However, despite the significant amount of research in this direction, the topological relation among the links has not traditionally been leveraged by these machine learning algorithms and, to the best of our knowledge, no existing traffic-prediction solution is specifically designed to process graph-structured data.
In this paper we employ a recently proposed machine learning algorithm (originally developed for road traffic forecasting [3]) to predict the traffic load on the links of a telecom network. This algorithm is referred to as Diffusion Convolutional Recurrent Neural Network (DCRNN) and, differently from traditional machine learning approaches, it can capture important topological properties of the network, which are expected to significantly influence the patterns followed by the traffic when propagating through the network.
Our objective is to predict the next load on a link of a telecom network, given the sequence of past observations of link loads. The problem is modeled as a regression where the objective is to minimize the error between the predicted and the actual next load on the links. In the literature, this specific problem has already been addressed by using ML methods. However, to the best of our knowledge, this is the first time that an ML algorithm able to capture the topological relations among the links of a telecom network is employed to perform this task. Specifically, we train a DCRNN using real data gathered from a backbone network (i.e., Abilene) and compare this approach
with several baseline traffic-prediction algorithms (e.g., the LSTM network [4]). We compare both the effectiveness of the regression (e.g., measured in terms of mean absolute error) and the ability to detect congestion events (which are defined following a threshold-based criterion). Results show a remarkable improvement of the DCRNN with respect to all the baseline methods on both aspects, and encourage its application to other network management tasks.
The rest of the paper is structured as follows. In Section II we review some works related to the use of machine learning as a tool for network traffic prediction, as well as several machine learning algorithms specifically designed to work on graph-structured data. Section III briefly reviews the concepts needed to understand the proposed methodology, such as recurrent neural networks and the diffusion convolutional operator. In Section IV, we present the problem statement and describe the employed methodology. The simulation settings and the results are presented in Section V. Finally, Section VI concludes the paper.
II. RELATED WORK
An accurate prediction of network traffic is of utmost importance for network operators, as it enables efficient management of resources and load balancing. Given the importance of the topic, the related literature is abundant. We focus here on several related works evaluating ML-based methods for network traffic prediction.
The authors of [5] propose a framework for network Traffic Matrix (TM) prediction based on Recurrent Neural Networks equipped with Long Short-Term Memory units, i.e., RNN LSTM. TM prediction is defined as the problem of estimating the future network traffic matrix from the previous ones. Similar approaches can be found in [6], [7]. The authors of [6] propose an end-to-end deep learning architecture consisting of a convolutional and a recurrent module that, combined, can extract both spatial and temporal information from the traffic flows. The authors of [7] propose a neural network model that combines LSTM with Deep Neural Networks (DNN). The main novelty of [7] is to include the autocorrelation of the time series in the input of the ML algorithm, which improves the accuracy of predictions and leads to superior performance with respect to existing methods. The combination of a variant of the LSTM unit, i.e., the Gated Recurrent Unit (GRU), and the Convolutional Neural Network (CNN) in the 2D domain (CNN-2D) has been proposed for the task of network traffic prediction in datacenters in [8]. The underlying idea of the work in [8] is to treat network matrices as images and use the CNN-2D to find the correlations among traffic exchanged between different pairs of nodes. Note that, in the literature, the prediction of traffic exchanged among network nodes is more common than the prediction of the load on network links. However, examples of application of ML to this specific task can be found, e.g., in [9], where Support Vector Machines are employed to perform the regression.
To our knowledge, none of the existing methods of traffic prediction explicitly considers the topological information of the network. Arguably, this is due to the fact that ML solutions specifically designed to process data that do not belong to the Euclidean domain, and in particular those with a graph-based structure, have appeared in the literature only recently [10]. At a high level, these methods are based on filtering operations designed to be suitable for graphs. These filters are used within machine learning algorithms and their parameters are learned to make them able to capture hidden patterns of the relations among the nodes of the graph. For example, [11], [12] propose a generalization of the CNN that is suitable to process graphs to perform, for example, classification of the nodes. The authors of [3] propose the diffusion convolution operator and build a machine learning algorithm based on it. This algorithm is then used to perform road traffic forecasting in [3], [13]. Here, we use the same methodology to forecast the load on the links of a telecommunication network.
III. BACKGROUND
In this Section, we briefly review the background concepts needed to understand the DCRNN, as well as the benchmark algorithms that we compare with the proposed approach.
A. Convolutional Neural Networks
Convolution is widely employed in signal processing to perform filtering operations. The convolution between two signals x and w is defined as:
$$(x * w)(t) = \sum_{\tau=0}^{T} x(\tau) \cdot w(t - \tau) \tag{1}$$
where w is generally referred to as the kernel or filter and T is its support. In general, the kernel w is hand-crafted by expert designers in such a way that the convolution captures some desired properties of the signal. A Convolutional Neural Network (CNN) is a machine learning module that is trained to learn the parameters of a number of filters (whose support, i.e., their length, is fixed).
CNN networks can be formed by stacking together multiple CNN layers. In general, these architectures are characterized by a Dropout layer on top of each CNN. Although CNNs are more commonly used in the 2D domain (e.g., to perform image recognition), it is not rare to see them employed also in the 1D domain, e.g., for time-series forecasting. The support of the kernel tunes the level of temporal dynamics that the filters can capture. Namely, filters with long support can extract longer temporal dynamics than shorter ones.
B. Recurrent Neural Networks
Recurrent Neural Networks (RNNs) have been designed with the specific purpose of overcoming the limitations of feedforward neural networks in modeling sequences. RNN networks are composed of units (i.e., neurons) capable of keeping track of past observations. This allows RNNs to model the input data based on both current and previously
2019 IEEE INFOCOM WKSHPS: NI 2019: Network Intelligence: Machine Learning for Networking
seen observations. Among the proposed RNN architectures, the Long Short-Term Memory (LSTM) proved particularly effective in modeling long-range temporal dependencies of input data.
More formally, given the current observation x(t) and the previous hidden state h(t − 1), the LSTM unit recursively performs the following operations:
$$i(t) = \sigma(W_i [x(t), h(t-1)] + b_i) \tag{2}$$
$$\tilde{c}(t) = \tanh(W_c [x(t), h(t-1)] + b_c) \tag{3}$$
$$f(t) = \sigma(W_f [x(t), h(t-1)] + b_f) \tag{4}$$
$$c(t) = f(t) \odot c(t-1) + i(t) \odot \tilde{c}(t) \tag{5}$$
$$o(t) = \sigma(W_o [x(t), h(t-1)] + b_o) \tag{6}$$
$$h(t) = o(t) \odot \tanh(c(t)) \tag{7}$$
where $\odot$ is the element-wise matrix multiplication. $W_i$, $W_c$, $W_f$, $W_o$ and $b_i$, $b_c$, $b_f$, $b_o$ are learnable kernels and biases, respectively, whereas $i$, $\tilde{c}$, $f$, $c$, $o$ are referred to as the input, input modulation, forget, cell and output gates and jointly perform operations that make the LSTM able to select the information to remember and to forget from the input data. Finally, the hidden state h encodes what the LSTM unit retains about past observations and, along with x, is successively used as input data.
Many variants of LSTM units have been proposed in the literature, and the above formulation only refers to the most common implementation. For example, the RNN Gated Recurrent Unit (GRU) is a widely used and simplified version of the LSTM, which is used for example in the recently proposed Diffusion Convolutional Recurrent Neural Network (DCRNN).
C. Diffusion Convolutional Recurrent Neural Network
Machine Learning algorithms were originally conceived to learn models of data defined on Euclidean domains, and their application to other types of data, such as graphs, is not straightforward [10]. Specifically, a graph is defined as the pair G = (V, E), where V is the set of nodes and E is the set of edges. If the graph is characterized by attributes (e.g., properties of nodes and edges), G can be alternatively described as $(X \in \mathbb{R}^{N \times P}, W \in \mathbb{R}^{N \times N})$, where N is the number of nodes and P the number of their attributes (i.e., features). X is the feature matrix and W is a weighted matrix that encodes the relations among the nodes, e.g., the adjacency matrix of the graph.
Traditional ML algorithms (e.g., LSTM or CNN) can easily process X, but fall short in including the information encoded in W. Recent approaches proposed in the literature aim to enrich the feature matrix with this relational information. In this Section, we briefly review one of the most prominent solutions of this kind, which is based on the idea that the relation between two nodes can be represented as a diffusion process. Specifically, the probability that a random walk of K steps starting at the first node ends at the second can be computed from the state transition matrix $D_O^{-1} W$ (with $D_O$ being the out-degree diagonal matrix of the graph). Intuitively, the diffusion process gives important clues on the influence that each node exercises on all the others. This contextual knowledge may be used to improve the representation of the nodes within the feature space (i.e., X) through filtering performed using appropriate convolutional operations.
The K-step diffusion convolution between a graph signal $X \in \mathbb{R}^{N \times P}$ and a filter $f_\theta$ is denoted by $*_G$ and defined as:
$$X *_G f_\theta = \sum_{k=0}^{K-1} \left( \theta_{k,1} (D_O^{-1} W)^k + \theta_{k,2} (D_O^{-1} W^\top)^k \right) \cdot X \tag{8}$$
where $\theta \in \mathbb{R}^{K \times 2}$ are the parameters of the filter, $D_O^{-1} W$ is the state transition matrix of the diffusion process and $D_O^{-1} W^\top$ is its counterpart built on the transposed weight matrix.
The diffusion convolutional operator can be used as the building block of a Diffusion Convolutional Layer of a neural network, with θ learnt using common training approaches (e.g., backpropagation). Specifically, this layer can be trained to map the feature matrix $X \in \mathbb{R}^{N \times P}$ to an output $H \in \mathbb{R}^{N \times Q}$ as follows:
$$H_{:,q} = \sigma\left( \sum_{p=1}^{P} X_{:,p} *_G f_{\Theta_{q,p,:,:}} \right), \quad \forall q \in \{1, ..., Q\} \tag{9}$$
where $\Theta \in \mathbb{R}^{Q \times P \times K \times 2}$ is the tensor of the trainable parameters. By replacing the matrix multiplications described in Section III-B with the diffusion convolutional operation, the RNN unit becomes the Diffusion Convolutional Gated Recurrent Unit (DCGRU) [3]. To be precise, the authors of [3] present a modified version of the RNN GRU mentioned in Section III-B, which is formally described by the following equations:
$$r(t) = \sigma(\Theta_r *_G [X(t), H(t-1)] + b_r) \tag{10}$$
$$C(t) = \tanh(\Theta_C *_G [X(t), (r(t) \odot H(t-1))] + b_c) \tag{11}$$
$$u(t) = \sigma(\Theta_u *_G [X(t), H(t-1)] + b_u) \tag{12}$$
$$H(t) = u(t) \odot H(t-1) + (1 - u(t)) \odot C(t) \tag{13}$$
where $\odot$ is the element-wise tensor multiplication, and r, u and C are referred to as the reset, update and cell gates, respectively.
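As an illustration of Eq. (8), the following NumPy sketch applies a K-step diffusion convolution to a small graph signal. It is not the implementation used in the experiments: the function name `diffusion_conv`, the dense-matrix formulation and the handling of nodes without outgoing edges (their transition row is set to zero) are assumptions made only for this example.

```python
import numpy as np


def diffusion_conv(X, W, theta):
    """Illustrative K-step diffusion convolution following Eq. (8).

    X:     (N, P) graph signal (one row per node).
    W:     (N, N) weighted adjacency matrix.
    theta: (K, 2) filter parameters; column 0 weights the powers of
           D_O^{-1} W, column 1 the powers of D_O^{-1} W^T.
    """
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float)
    theta = np.asarray(theta, dtype=float)
    n_nodes = W.shape[0]

    out_deg = W.sum(axis=1)
    # Assumption: a node with zero out-degree gets an all-zero transition row.
    inv_deg = np.divide(1.0, out_deg, out=np.zeros_like(out_deg), where=out_deg > 0)
    p_fwd = inv_deg[:, None] * W      # D_O^{-1} W
    p_bwd = inv_deg[:, None] * W.T    # D_O^{-1} W^T, as written in Eq. (8)

    out = np.zeros_like(X)
    fwd_k = np.eye(n_nodes)           # k-th power of p_fwd, starting at k = 0
    bwd_k = np.eye(n_nodes)           # k-th power of p_bwd, starting at k = 0
    for k in range(theta.shape[0]):
        out += (theta[k, 0] * fwd_k + theta[k, 1] * bwd_k) @ X
        fwd_k = fwd_k @ p_fwd
        bwd_k = bwd_k @ p_bwd
    return out
```

With K = 1 the filter only rescales the signal; each additional step mixes in the features of nodes one hop further away, which is how the topology enters the learned representation.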
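The DCGRU update of Eqs. (10)-(13) can be sketched in the same spirit. The code below is a minimal, hedged illustration rather than the authors' implementation: the helper names (`transition_powers`, `graph_conv`, `dcgru_step`), the parameter dictionary layout and the use of dense matrices are all assumptions introduced for this example. The multi-channel graph convolution corresponds to the sum over input channels in Eq. (9), before the activation is applied.

```python
import numpy as np


def transition_powers(W, K):
    """Powers k = 0..K-1 of D_O^{-1} W and D_O^{-1} W^T, shape (K, 2, N, N)."""
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)
    # Assumption: nodes without outgoing edges get an all-zero transition row.
    inv = np.divide(1.0, deg, out=np.zeros_like(deg), where=deg > 0)
    fwd, bwd = inv[:, None] * W, inv[:, None] * W.T
    n = W.shape[0]
    S = np.empty((K, 2, n, n))
    S[0, 0] = S[0, 1] = np.eye(n)
    for k in range(1, K):
        S[k, 0] = S[k - 1, 0] @ fwd
        S[k, 1] = S[k - 1, 1] @ bwd
    return S


def graph_conv(S, Z, Theta):
    """Sum over input channels p of Z[:, p] *_G f_{Theta[q, p]}.

    S: (K, 2, N, N), Z: (N, P_in), Theta: (Q, P_in, K, 2) -> output (N, Q).
    """
    return np.einsum("kdnm,mp,qpkd->nq", S, Z, Theta)


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def dcgru_step(S, X_t, H_prev, params):
    """One DCGRU update: X_t is (N, P), H_prev is (N, Q); returns H(t), (N, Q)."""
    XH = np.concatenate([X_t, H_prev], axis=1)                    # [X(t), H(t-1)]
    r = sigmoid(graph_conv(S, XH, params["Theta_r"]) + params["b_r"])  # Eq. (10)
    u = sigmoid(graph_conv(S, XH, params["Theta_u"]) + params["b_u"])  # Eq. (12)
    XrH = np.concatenate([X_t, r * H_prev], axis=1)               # [X(t), r ⊙ H(t-1)]
    C = np.tanh(graph_conv(S, XrH, params["Theta_C"]) + params["b_C"])  # Eq. (11)
    return u * H_prev + (1.0 - u) * C                             # Eq. (13)
```

In this sketch each `Theta_*` tensor has shape (Q, P + Q, K, 2), because the gates operate on the concatenation of the P input features and the Q hidden features, and each bias has shape (Q,).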
TABLE I
COMPARISON OF THE DEEP LEARNING ARCHITECTURES CONSIDERING THEIR ABILITY TO PERFORM THE FORECAST OF THE NEXT TRAFFIC LOAD
Model             MAPE      MAE (Mbit/s)   RMSE (Mbit/s)   Convergence Epoch   Convergence Time (s)
DCRNN             43.2%     92.5           497.1           225                 525.1
LSTM              210.34%   142.43         525.21          87                  19.83
CNN               234.75%   121.32         506.55          252                 9.82
CNN-LSTM          248.16%   127.18         512.91          240                 5.76
Fully-Connected   220.75%   138.24         522.65          201                 3.14
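For reference, the three error metrics reported in Table I (MAPE, MAE and RMSE) can be computed as in the sketch below. The function name `forecast_errors` and the choice of skipping zero-valued loads when computing the MAPE are assumptions made for this example; the paper does not state how zero loads are treated.

```python
import numpy as np


def forecast_errors(y_true, y_pred):
    """Return MAPE (%), MAE and RMSE between observed and predicted loads."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true

    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err ** 2)))

    # Assumption: samples with a zero observed load are skipped, because the
    # percentage error is undefined for them.
    nonzero = y_true != 0
    mape = float(100.0 * np.mean(np.abs(err[nonzero] / y_true[nonzero])))
    return {"MAPE": mape, "MAE": mae, "RMSE": rmse}
```

Because the MAPE divides each error by the observed load, small absolute errors on lightly loaded links can produce percentages well above 100%, whereas the RMSE is dominated by the few largest errors.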
TABLE II
Percentile                  25%     50%      75%      100%
Traffic on links (Mbit/s)   59.67   180.33   389.41   5929.52
tions and ground-truth and stops when no improvement on the validation set is noticed for at least 50 training epochs. The training is performed using the Adam optimizer [15] with initial learning rate set to 0.01. In the following Section, we describe the results obtained by averaging over 100 simulations.
B. Experiment Results
The first set of experiments evaluates the employed method considering the Mean Absolute Percentage Error (MAPE), the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE), as well as metrics related to the speed of convergence, i.e., the number of epochs and the time at which training is interrupted due to an early stopping event.
The results of the evaluation are summarized in Table I, which shows that the DCRNN method significantly outperforms the baselines in MAPE, MAE and RMSE. In particular, the MAPE drops from ∼ 210% obtained with the LSTM-based architecture to ∼ 43% by using the DCRNN. We also notice an improvement with respect to the best MAE and RMSE among the baselines (both obtained with the CNN-based architecture), which decrease from ∼ 121 to ∼ 92 Mbit/s and from ∼ 506 to ∼ 497 Mbit/s, respectively.
The improvement of the MAE of ∼ 30 Mbit/s with respect to the best baseline is significant considering an average traffic on links of ∼ 301 Mbit/s and that 50% of the measured loads are below 180 Mbit/s (see Table II, where we show several percentiles of the load on links). The improvement of the RMSE is the least impressive. This result can be explained by the fact that the DCRNN performs in general a better prediction of the next link loads (as indicated by the remarkable decrease of the MAPE), but it hardly predicts sudden high peaks (i.e., burst events). We do not consider this a limitation of the model, since the prediction of this type of event is essentially not possible. As for the convergence speed, the time needed to train the DCRNN (i.e., ∼ 512 s) is one order of magnitude higher than that of the LSTM-based architecture, which has the most time-consuming training process among the baselines (i.e., ∼ 19 s). We underline that the forecasting process introduces a negligible delay for all the considered models.
A straightforward application of a reliable estimator of traffic load is the early detection of congestion events. In the second set of experiments, we assess the ability of our approach to perform this task and we compare it with the baselines. We assume that a congestion occurs on a link if the traffic load is above a threshold that is directly proportional (by a factor α) to the average amount of traffic observed on that link. In this way, we perform a fair comparison that takes into consideration the different patterns of link load that Abilene presents. The evaluation is done considering the following metrics: percentage of false positives, false negatives, true positives and true negatives, from which we derive precision, accuracy, recall and F-score.
In Table III we show the results obtained with α = 3 (i.e., a link is congested when the volume of traffic is above 3 times the average load). The DCRNN outperforms the baselines for all the considered metrics. In particular, the precision (i.e., the percentage of congestion predictions that are actually congestion events) is increased by up to 25% with respect to the best baseline (i.e., the LSTM-based architecture).
As far as the congestion prediction task is concerned, the recall is the percentage of congestion events that are correctly predicted, whereas the accuracy is the percentage of correct predictions (whether they refer to congestion or to normal loads). Hence, they both give essential indications to a network operator that takes decisions based on the likelihood that congestion will (or will not) occur.
We depict the accuracy and the recall in Fig. 1(a) and Fig. 1(b), respectively, as a function of the congestion threshold expressed by α ∈ [1.5, ..., 5]. We notice that the DCRNN always outperforms the baselines also in this task. Increasing α means limiting the congestion events to traffic volumes that are significantly higher than the average link load. This has two opposite effects on the accuracy and on the recall. In fact, the accuracy increases as a consequence of the increased number of non-congestion events, which positively affects the number of correct classifications. Conversely, the recall shows a general decrease with increasing α. This can be explained considering that the models hardly predict very high and sudden peaks, as already discussed in relation to the RMSE.
α = 5 represents the hardest conditions to detect a congestion event. In this scenario, in fact, the DCRNN reaches a recall of ∼ 34%, which means that 56% of the actual congestions are not detected. Notice, however, that this result is still 32% higher than the recall obtained by the best baseline
[Figure 1: two panels plotting Accuracy (%) and Recall (%) against the threshold of congestion (1.5 to 5.0) for CNN, LSTM, Fully-Connected, CNN-LSTM and DCRNN]
(a) Accuracy of the effectiveness to detect a congestion event
(b) Recall of the effectiveness to detect a congestion event
Fig. 1. Comparison of all the methods considering the ability to detect a congestion event in terms of Accuracy and Recall
TABLE III
COMPARISON OF THE DEEP LEARNING ARCHITECTURES CONSIDERING THEIR ABILITY TO DETECT A CONGESTION EVENT WHEN THRESHOLD FACTOR α = 3
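The threshold-based congestion criterion and the detection metrics used in this evaluation can be sketched as follows. This is an illustrative example only: the function name `congestion_metrics`, the (time, link) array layout and the choice of computing each link's average from the same observed series being evaluated are assumptions, since the text does not specify over which data the average load is taken.

```python
import numpy as np


def congestion_metrics(y_true, y_pred, alpha=3.0):
    """Detection metrics for the threshold-based congestion criterion.

    y_true, y_pred: arrays of shape (T, L) holding observed and predicted
    loads for T time steps on L links. A link is considered congested when
    its load exceeds alpha times its average observed load.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # Assumption: the per-link average is taken over the observed series itself.
    threshold = alpha * y_true.mean(axis=0)
    actual = y_true > threshold
    predicted = y_pred > threshold

    tp = int(np.sum(actual & predicted))
    fp = int(np.sum(~actual & predicted))
    fn = int(np.sum(actual & ~predicted))
    tn = int(np.sum(~actual & ~predicted))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f_score": f_score}
```

Sweeping `alpha` over the range 1.5 to 5 with this kind of function yields accuracy and recall curves of the type shown in Fig. 1: a larger `alpha` leaves fewer samples labelled as congestion, which tends to raise the accuracy while making the rare peaks harder to recall.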
(i.e., the CNN-based architecture).
VI. CONCLUSIONS
In this work, we employ an existing graph-based machine learning algorithm (i.e., the DCRNN) to forecast the next traffic load on the links of the backbone telecom network Abilene. The main novelty of this approach is the ability to learn a representation of the telecom network that considers both the features (i.e., the load on the links) and the topological relations among them (i.e., whether the links are connected or not). The DCRNN is compared to the baselines (e.g., LSTM and CNN) considering the effectiveness of the forecasting and the ability to detect congestion events. For example, a reduction of the MAPE from 210% to 43% is observed. These promising results suggest that the forecasting of events within a telecom network may significantly benefit from using ML approaches explicitly designed to capture, along with the properties of the events themselves, also the structure of the network.
VII. ACKNOWLEDGEMENTS
The work leading to these results has been supported by the European Community under grant agreement no. 761727 Metro-Haul project and by the EU FP7 ERANET program under grant CHIST-ERA-2016 UPRISE-IOT.
REFERENCES
[1] M. Joshi and T. H. Hadi, "A review of network traffic analysis and prediction techniques," CoRR, vol. abs/1507.05722, 2015. [Online]. Available: http://arxiv.org/abs/1507.05722
[2] R. Alvizu, S. Troia, G. Maier, and A. Pattavina, "Matheuristic with machine-learning-based prediction for software-defined mobile metro-core networks," Journal of Optical Communications and Networking, vol. 9, no. 9, pp. D19–D30, 2017.
[3] Y. Li, R. Yu, C. Shahabi, and Y. Liu, "Diffusion convolutional recurrent neural network: Data-driven traffic forecasting," 2018.
[4] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[5] A. Azzouni et al., "NeuTM: A neural network-based framework for traffic matrix prediction in SDN," CoRR, vol. abs/1710.06799, 2017.
[6] Y. Liu et al., "Short-term traffic flow prediction with Conv-LSTM," in Wireless Communications and Signal Processing (WCSP), 2017. IEEE, 2017, pp. 1–6.
[7] Q. Zhuo et al., "Long short-term memory neural network for network traffic prediction," in ISKE. IEEE, 2017, pp. 1–6.
[8] X. Cao et al., "Interactive temporal recurrent convolution network for traffic prediction in data centers," IEEE Access, vol. 6, pp. 5276–5289, 2018.
[9] P. Bermolen and D. Rossi, "Support vector regression for link load prediction," Computer Networks, vol. 53, no. 2, pp. 191–201, 2009.
[10] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, "Geometric deep learning: going beyond Euclidean data," IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, 2017.
[11] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.
[12] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in Neural Information Processing Systems, 2016, pp. 3844–3852.
[13] X. Wang, C. Chen, Y. Min, J. He, B. Yang, and Y. Zhang, "Efficient metropolitan traffic prediction based on graph recurrent neural network," arXiv preprint arXiv:1811.00740, 2018.
[14] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[15] D. P. Kingma et al., "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.