An Optical Communication's Perspective On Machine Learning and Its Applications
(Invited Paper)
I. INTRODUCTION

Fig. 1. Given a data set, ML attempts to solve two main types of problems: (a) functional description of given data and (b) classification of data by deriving appropriate decision boundaries. (c) Laser frequency offset and phase estimation for quadrature phase-shift keying (QPSK) systems by raising the signal phase φ to the 4th power and performing regression to estimate the slope and intercept. (d) Decision boundaries for a received QPSK signal distribution.

Artificial intelligence (AI) makes use of computers/machines to perform cognitive tasks, i.e., the ones requiring knowledge, perception, learning, reasoning, understanding and other similar cognitive abilities. An AI system is expected to do three things: (i) store knowledge, (ii) apply the stored knowledge to solve problems, and (iii) acquire new knowledge via experience. The three key components of an AI system include knowledge representation, machine learning (ML), and automated reasoning. ML is a branch of AI which is based on the idea that patterns and trends in a given data set can be learned automatically through algorithms. The learned patterns and structures can then be used to make decisions or predictions on some other data in the system of interest [1].

ML is not a new field, as ML-related algorithms have existed at least since the 1970s. However, the tremendous increase in computational power over the last decade, recent groundbreaking developments in theory and algorithms surrounding ML, and easy access to an overabundance of all types of data worldwide (thanks to three decades of Internet growth) have all contributed to the advent of modern deep learning (DL) technology, a class of advanced ML approaches that displays superior performance in an ever-expanding range of domains. In the near future, ML is expected to power numerous aspects of modern society such as web searches, computer translation, content filtering on social media networks, healthcare, finance, and laws [2].

ML is an interdisciplinary field which shares common threads with the fields of statistics, optimization, information theory, and game theory. Most ML algorithms perform one of the following two types of pattern recognition tasks as shown in Fig. 1. In the first type, the algorithm tries to find some functional description of given data with the aim of predicting values for new inputs, …
Fig. 4. The complexity of classification problems depends on how the different classes of data are distributed across the variable space.

Fig. 5. Structure of a single-hidden-layer ANN with input vector x(l), target vector y(l) and actual output vector o(l).

… recurrent neural networks (RNNs). The analytical derivations presented in this paper are slightly different from those in standard introductory ML texts to better align with the fields of communications and signal processing. We will then provide an overview of applications of ML techniques in various aspects of optical communications and networking.

We emphasize that this is by no means an exhaustive and in-depth discussion of state-of-the-art ML techniques and their respective challenges. Also, the views presented are not the only way to understand the fundamental properties of ML methods. By discussing ML through the language of communications and DSP, we hope to provide a more intuitive understanding of ML, its relation to optical communications and networking, and why/where/how it can play a unique role in specific areas of optical communications and networking.

The rest of the paper is organized as follows. In Section II, we will illustrate the fundamental conditions that warrant the use of a neural network and discuss the technical details of ANNs and SVMs. Section III will describe a range of basic unsupervised ML techniques and briefly discuss reinforcement learning (RL). Section IV will be devoted to more recent ML algorithms. Section V will provide an overview of existing ML applications in optical communications and networking while Section VI will discuss their future role. Links for online resources and codes for standard ML algorithms will be provided in Section VII. Section VIII will conclude the paper. A video presentation of this review paper is available at [6].

II. ANNS AND SVMS

What are the conditions that need ML for classification? Fig. 4 shows three scenarios with 2-dimensional (2D) data x = [x1 x2]^T and their respective class labels depicted as 'o' and '×' in the figure. In the first case, classifying the data is straightforward: the decision rule is to see whether σ(x1 − c) or σ(x2 − c) is greater or less than 0, where σ(·) is the decision function as shown. The second case is slightly more complicated as the decision boundary is a slanted straight line. However, a simple rotation and shifting of the input, i.e., Wx + b, will map one class of data to below zero and the other class above. Here, the rotation and shifting are described by matrix W and vector b, respectively. This is followed by the decision function σ(Wx + b). The third case is even more complicated. The region for the 'green' class depends on the outputs of the 'red' and 'blue' decision boundaries. Therefore, one will need to implement an extra decision step to label the 'green' region. The graphical representation of this 'decision of decisions' algorithm is the simplest form of an ANN [7]. The intermediate decision output units are known as hidden neurons and they form the hidden layer.

A. Artificial Neural Networks (ANNs)

Let {(x(1), y(1)), (x(2), y(2)), . . . , (x(L), y(L))} be a set of L input-output pairs of M- and K-dimensional column vectors. ANNs are information processing systems comprising an input layer, one or more hidden layers, and an output layer. The structure of a single-hidden-layer ANN with M input, H hidden and K output neurons is shown in Fig. 5. Neurons in two adjacent layers are interconnected, where each connection has a variable weight assigned. Such an ANN architecture is the simplest and most commonly used one [7]. The number of neurons M in the input layer is determined by the dimension of the input data vectors x(l). The hidden layer enables the modeling of complex relationships between the input and output parameters of an ANN. There are no fixed rules for choosing the optimum number of neurons for a given hidden layer and the optimum number of hidden layers in an ANN. Typically, the selection is made via experimentation, experience and other prior knowledge of the problem. These are known as the hyperparameters of an ANN. For regression problems, the dimension K of the vectors y(l) depends on the actual problem nature. For classification problems, K typically equals the number of class labels, such that if a data point x(l) belongs to class k, y(l) = [0 0 · · · 0 1 0 · · · 0 0]^T where the '1' is located at the kth position. This is called one-hot encoding. The ANN output o(l) will naturally have the same dimension as y(l) and the mapping between input x(l) and o(l) can be expressed as

$$ o(l) = \sigma_2(r(l)) = \sigma_2\left(W_2 u(l) + b_2\right) = \sigma_2\left(W_2 \sigma_1(q(l)) + b_2\right) = \sigma_2\left(W_2 \sigma_1\left(W_1 x(l) + b_1\right) + b_2\right) \tag{1} $$

where σ1(·) and σ2(·) are the activation functions for the hidden and output layer neurons, respectively. W1 and W2 are matrices containing the weights of connections between the input and
hidden layer neurons and between the hidden and output layer neurons, respectively, while b1 and b2 are the bias vectors for the hidden and output layer neurons, respectively. For a vector z = [z1 z2 · · · zK] of length K, σ1(·) is typically an element-wise nonlinear function such as the sigmoid function

$$ \sigma_1(z) = \left[ \frac{1}{1+e^{-z_1}} \;\; \frac{1}{1+e^{-z_2}} \;\; \cdots \;\; \frac{1}{1+e^{-z_K}} \right]. \tag{2} $$

As for the output layer neurons, σ2(·) is typically chosen to be a linear function for regression problems. In classification problems, one will normalize the output vector o(l) using the softmax function, i.e.,

$$ o(l) = \mathrm{softmax}\left(W_2 u(l) + b_2\right) \tag{3} $$

where

$$ \mathrm{softmax}(z) = \frac{1}{\sum_{k=1}^{K} e^{z_k}} \left[ e^{z_1} \; e^{z_2} \; \cdots \; e^{z_K} \right]. \tag{4} $$

The softmax operation ensures that the ANN outputs conform to a probability distribution, for reasons we will discuss below.

To train the ANN is to optimize all the parameters θ = {W1, W2, b1, b2} such that the difference between the actual ANN outputs o and the target outputs y is minimized. One commonly used objective function (also called loss function in the ML literature) to optimize is the mean square error (MSE)

$$ E = \frac{1}{L}\sum_{l=1}^{L} E(l) = \frac{1}{L}\sum_{l=1}^{L} \left\| o(l) - y(l) \right\|^2. \tag{5} $$

Like most optimization procedures in practice, gradient descent is used instead of full analytical optimization. In this case, the parameter estimates for the (n+1)th iteration are given by

$$ \theta^{(n+1)} = \theta^{(n)} - \alpha \left. \frac{\partial E}{\partial \theta} \right|_{\theta^{(n)}} \tag{6} $$

where the step size α is known as the learning rate. Note that for computational efficiency, one can use a single input-output pair instead of all the L pairs for each iteration in Eq. (6). This is known as stochastic gradient descent (SGD), which is the standard optimization method used in common adaptive DSP such as the constant modulus algorithm (CMA) and least mean squares (LMS) algorithm. As a trade-off between computational efficiency and accuracy, one can use a mini-batch of data {(x(nP + 1), y(nP + 1)), (x(nP + 2), y(nP + 2)), . . . , (x(nP + P), y(nP + P))} of size P for the nth iteration instead. This can reduce the stochastic nature of SGD and improve accuracy. When all the data set has been used, the update algorithm will have completed one epoch. However, it is often the case that one epoch equivalent of updates is not enough for all the parameters to converge to their optimal values. Therefore, one can reuse the data set and the algorithm goes through a 2nd epoch for further parameter updates. There is no fixed rule to determine the number of epochs required for convergence [8].

The update algorithm comprises the following main steps: (i) Model initialization: all the ANN weights and biases are randomly initialized, e.g., by drawing random numbers from a normal distribution with zero mean and unit variance; (ii) Forward propagation: in this step, the inputs x are passed through the network to generate the outputs o using Eq. (1). The input can be a single data point, a mini-batch or the complete set of L inputs. This step is named so because the computation flow is in the natural forward direction, i.e., starting from the input, passing through the network, and going to the output; (iii) Backward propagation and weights/biases update: for simplicity, let us assume SGD using one input-output pair (x(n), y(n)) for the (n+1)th iteration, a sigmoid activation function for the hidden layer neurons and a linear activation function for the output layer neurons such that o(n) = W2 u(n) + b2. The parameters W2, b2 will be updated first, followed by W1, b1. Since E(n) = ‖o(n) − y(n)‖² and ∂E(n)/∂o(n) = 2(o(n) − y(n)), the corresponding update equations are

$$ W_2^{(n+1)} = W_2^{(n)} - 2\alpha \sum_{k=1}^{K} \left( o_k(n) - y_k(n) \right) \frac{\partial o_k(n)}{\partial W_2}, \qquad b_2^{(n+1)} = b_2^{(n)} - 2\alpha \, \frac{\partial o(n)}{\partial b_2}\left( o(n) - y(n) \right) \tag{7} $$

where ok(n) and yk(n) denote the kth element of the vectors o(n) and y(n), respectively. In this case, ∂o(n)/∂b2 is the Jacobian matrix in which the jth row and mth column is the derivative of the mth element of o(n) with respect to the jth element of b2. Also, the jth row and mth column of the matrix ∂ok(n)/∂W2 denotes the derivative of ok(n) with respect to the jth row and mth column of W2. Interested readers are referred to [9] for an overview of matrix calculus. Since o(n) = W2 u(n) + b2, ∂o(n)/∂b2 is simply the identity matrix. For ∂ok(n)/∂W2, its kth row is equal to u(n)^T (where (·)^T denotes transpose) and it is zero otherwise.¹ Eq. (7) can be simplified as

$$ W_2^{(n+1)} = W_2^{(n)} - 2\alpha \left( o(n) - y(n) \right) u(n)^T, \qquad b_2^{(n+1)} = b_2^{(n)} - 2\alpha \left( o(n) - y(n) \right). \tag{8} $$

With the updated W2^(n+1) and b2^(n+1), one can calculate

$$ W_1^{(n+1)} = W_1^{(n)} - 2\alpha \sum_{k=1}^{K} \left( o_k(n) - y_k(n) \right) \frac{\partial o_k(n)}{\partial W_1}, \qquad b_1^{(n+1)} = b_1^{(n)} - 2\alpha \, \frac{\partial o(n)}{\partial b_1}\left( o(n) - y(n) \right). \tag{9} $$

Since the derivative of the sigmoid function is given by σ1′(z) = σ1(z) ∘ (1 − σ1(z)), where ∘ denotes element-wise multiplication and 1 denotes a column vector of 1's with the same length as z,

$$ \frac{\partial o(n)}{\partial b_1} = \frac{\partial q(n)}{\partial b_1}\,\frac{\partial u(n)}{\partial q(n)}\,\frac{\partial o(n)}{\partial u(n)} = \mathrm{diag}\left\{ u(n) \circ \left(1 - u(n)\right) \right\} \cdot W_2^{(n+1)T}. \tag{10} $$

¹ One can also express the update of W2 using the 3rd-order tensor notation ∂o(n)/∂W2 instead of the sum over ∂ok(n)/∂W2.
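To make the above procedure concrete, the following is a minimal NumPy sketch of a single-hidden-layer ANN trained with SGD according to Eqs. (1) and (6)–(10): sigmoid hidden layer, linear output layer, MSE loss. The network dimensions and the toy regression data are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative dimensions: M inputs, H hidden neurons, K outputs, L training pairs
M, H, K, L = 4, 8, 2, 1000
rng = np.random.default_rng(0)

# Toy regression data set {(x(l), y(l))}; any smooth target works for illustration
X = rng.normal(size=(L, M))
Y = np.stack([np.sin(X[:, 0]), X[:, 1] * X[:, 2]], axis=1)

# Step (i), model initialization: random weights, zero biases
W1 = rng.normal(0, 1, size=(H, M)); b1 = np.zeros((H, 1))
W2 = rng.normal(0, 1, size=(K, H)); b2 = np.zeros((K, 1))
alpha = 0.01  # learning rate of Eq. (6)

for epoch in range(20):
    for l in rng.permutation(L):                 # SGD: one (x, y) pair per iteration
        x = X[l].reshape(M, 1); y = Y[l].reshape(K, 1)
        # Step (ii), forward propagation per Eq. (1)
        q = W1 @ x + b1
        u = sigmoid(q)
        o = W2 @ u + b2
        # Step (iii), backward propagation: output layer first, Eq. (8)
        e = o - y
        W2 = W2 - 2 * alpha * e @ u.T
        b2 = b2 - 2 * alpha * e
        # Then the hidden layer, Eqs. (9)-(10), using the updated W2
        do_db1 = np.diag((u * (1 - u)).ravel()) @ W2.T
        W1 = W1 - 2 * alpha * (do_db1 @ e) @ x.T
        b1 = b1 - 2 * alpha * (do_db1 @ e)

print("final MSE:", np.mean((W2 @ sigmoid(W1 @ X.T + b1) + b2 - Y.T) ** 2))
```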
$$ E = -\frac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{K} y_k(l)\log\left(o_k(l)\right) + \underbrace{\frac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{K} y_k(l)\log\left(y_k(l)\right)}_{=\,0} = \frac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{K} y_k(l)\log\frac{y_k(l)}{o_k(l)} \tag{15} $$

Fig. 9. Example showing how a linearly inseparable problem (in the original 2D data space) can undergo a nonlinear transformation and become a linearly separable one in the 3-dimensional (3D) feature space.

Fig. 10. (a) Mapping from the input space to a higher-dimensional feature space using a nonlinear kernel function ϕ. (b) Separation of two data classes in the feature space through an optimal hyperplane.

… points is conceptually analogous to finding a maximum likelihood decision boundary. The borderline points, represented as a solid dot and square in Fig. 10(b), are referred to as support vectors and are often the most informative for the classification task.

More technically, in the feature space, a general hyperplane is defined as w^T v + b = 0. If it classifies all the data points correctly, all the blue points will lie in the region w^T v + b > 0 and the red points will lie in the region w^T v + b < 0. We seek to find a hyperplane w^T v + b = 0 that maximizes the margin d, as shown in Fig. 10(b). Without loss of generality, let the point v(i) reside on the hyperplane w^T v + b = 1 and be closest to the hyperplane w^T v + b = 0, on which v⁺ resides. Since the vectors v(i) − v⁺, w and the angle φ are related by cos φ = w^T(v(i) − v⁺)/(‖w‖‖v(i) − v⁺‖), the margin d is given as

$$ d = \left\|v(i) - v^+\right\|\cos\varphi = \left\|v(i) - v^+\right\| \cdot \frac{w^T\left(v(i) - v^+\right)}{\|w\|\left\|v(i) - v^+\right\|} = \frac{w^T v(i) - w^T v^+}{\|w\|} = \frac{w^T v(i) + b}{\|w\|} = \frac{1}{\|w\|}. \tag{17} $$

Maximizing the margin d is therefore equivalent to minimizing ‖w‖, i.e., solving

$$ \underset{w,\,b}{\operatorname{argmin}} \ \|w\| \quad \text{subject to} \quad y(l)\left(w^T v(l) + b\right) \ge 1, \quad l = 1, 2, \ldots, L, \tag{18} $$

and thus standard convex programming software packages such as CVXOPT [16] can be used to solve Eq. (18).

Let us come back to the task of choosing the nonlinear function ϕ(·) that maps the original input space x to the feature space v. For an SVM, one would instead find a kernel function K(x(i), x(j)) = ϕ(x(i)) · ϕ(x(j)) = v(i)^T v(j) that maps to the inner product. Typical kernel functions include:
• Polynomials: K(x(i), x(j)) = (x(i)^T x(j) + a)^b for some scalars a, b;
• Gaussian radial basis function: K(x(i), x(j)) = exp(−a‖x(i) − x(j)‖²) for some scalar a;
• Hyperbolic tangent: K(x(i), x(j)) = tanh(a x(i)^T x(j) + b) for some scalars a, b.

The choice of a kernel function is often determined by the designer's knowledge of the problem domain [3]. Note that a larger separation margin typically results in better generalization of the SVM classifier. SVMs often demonstrate better generalization performance than conventional ANNs in various pattern recognition applications. Furthermore, multiple SVMs can be applied to the same data set to realize non-binary classifications such as detecting 16-QAM signals [17]–[19] (to be discussed in more detail in Section V).

It should be noted that ANNs and SVMs can be seen as two complementary approaches for solving classification problems. While an ANN derives curved decision boundaries in the input variable space, the SVM performs nonlinear transformations of the input variables followed by determining a simple decision boundary or hyperplane, as shown in Fig. 11.
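Since Eq. (18) is a convex program, it can indeed be handed to a package such as CVXOPT [16]. The snippet below is a rough sketch that solves the standard dual of the hard-margin problem with a Gaussian RBF kernel on synthetic 2D data; the dual formulation, the toy data and the kernel parameter are textbook assumptions rather than material from this paper.

```python
import numpy as np
from cvxopt import matrix, solvers

def rbf(X1, X2, a=1.0):
    # Gaussian radial basis function kernel K(x_i, x_j) = exp(-a ||x_i - x_j||^2)
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-a * d2)

# Toy, linearly inseparable 2D data: class +1 inside a disc, -1 outside a ring,
# with a gap left between the two classes so that a hard margin exists
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(400, 2))
r2 = np.sum(X**2, 1)
keep = (r2 < 1.2) | (r2 > 1.8)
X, r2 = X[keep], r2[keep]
y = np.where(r2 < 1.2, 1.0, -1.0)

# Dual of the hard-margin problem of Eq. (18):
#   min_a 0.5 a^T P a - 1^T a, s.t. a >= 0, y^T a = 0, with P_ij = y_i y_j K(x_i, x_j)
K = rbf(X, X)
P = matrix(np.outer(y, y) * K)
q = matrix(-np.ones(len(y)))
G = matrix(-np.eye(len(y)))          # encodes a >= 0
h = matrix(np.zeros(len(y)))
A = matrix(y.reshape(1, -1))
b = matrix(0.0)
solvers.options['show_progress'] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

sv = alpha > 1e-5                    # support vectors: borderline points with non-zero alpha
b_off = np.mean(y[sv] - (alpha * y) @ K[:, sv])

def predict(Xnew):
    return np.sign((alpha * y) @ rbf(X, Xnew) + b_off)

print("training accuracy:", np.mean(predict(X) == y))
```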
III. UNSUPERVISED AND REINFORCEMENT LEARNING

The ANN and SVM are examples of the supervised learning approach, in which the class labels y of the training data are known. Based on this data, the ML algorithm generalizes to react accurately to new data to the best possible extent. Supervised learning can be considered as a closed-loop feedback system, as the error between the ML algorithm's actual outputs and the targets is used as a feedback signal to guide the learning process.

In unsupervised learning, the ML algorithm is not provided with correct labels of the training data. Rather, it learns to identify similarities between various inputs with the aim to either categorize together those inputs which have something in common or to determine some better representation/description of the original input data. It is referred to as "unsupervised" because the ML algorithm is not told what the output should be; rather, it has to come up with it itself [20]. One example of unsupervised learning is data clustering as shown in Fig. 12.

Fig. 12. Data clustering based on unsupervised learning.

Unsupervised learning is becoming more and more important because in many real circumstances it is practically not possible to obtain labeled training data. In such scenarios, an unsupervised learning algorithm can be applied to discover some similarities between different inputs for itself. Unsupervised learning is typically used in tasks such as clustering, vector quantization, dimensionality reduction, and feature extraction. It is also often employed as a preprocessing tool for extracting useful (in some particular context) features of the raw data before supervised learning algorithms can be applied. We hereby provide a review of a few key unsupervised learning techniques.

A. K-Means Clustering

Let {x(1), x(2), . . . , x(L)} be the set of data points which is to be split into K clusters C1, C2, . . . , CK. K-means clustering is an iterative unsupervised learning algorithm which aims to partition the L observations into K clusters such that the sum of squared errors for data points within a group is minimized [14]. An example of this algorithm is graphically shown in Fig. 13. The algorithm initializes by randomly picking K locations µ(j), j = 1, 2, . . . , K, as cluster centers. This is followed by two iterative steps. In the first step, each data point x(i) is assigned to the cluster Ck with the minimum Euclidean distance, i.e.,

$$ C_k = \left\{ x(i) : \left\|x(i) - \mu(k)\right\| < \left\|x(i) - \mu(j)\right\| \;\; \forall j \in \{1, 2, \ldots, K\} \setminus \{k\} \right\}. \tag{19} $$

In the second step, each cluster center µ(k) is updated to the mean of the data points currently assigned to Ck. The two steps are repeated iteratively until the cluster centers converge. Several variants of the K-means algorithm have been proposed over the years to improve its computational efficiency as well as to achieve smaller errors. These include fuzzy K-means, hierarchical K-means, K-means++, K-medians, K-medoids, etc.
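A minimal NumPy sketch of the two iterative K-means steps described above (the assignment rule of Eq. (19) followed by the center update) is given below; the three-cluster toy data are assumed purely for illustration.

```python
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]      # random initial cluster centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Step 1 (Eq. (19)): assign each x(i) to the nearest center in Euclidean distance
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # L x K distances
        labels = np.argmin(d, axis=1)
        # Step 2: move each center to the mean of the points assigned to it
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                   # cluster centers have converged
            break
        mu = new_mu
    return mu, labels

# Toy data: three well-separated 2D clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ([0, 0], [3, 3], [0, 3])])
centers, labels = k_means(X, K=3)
print(np.round(centers, 2))
```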
B. Expectation-Maximization (EM) Algorithm

One drawback of the K-means algorithm is that it requires the use of hard decision boundaries, whereby a data point can only be assigned to one cluster even though it might lie somewhere midway between two or more clusters. The EM algorithm is an improved clustering technique which assigns a probability to the data point belonging to each cluster rather than forcing it to belong to one particular cluster during each iteration [20]. The algorithm assumes that a given data distribution can be modeled as a superposition of K jointly Gaussian probability distributions with distinct means and covariance matrices µ(k), Σ(k) (also referred to as Gaussian mixture models). The EM algorithm is a two-step iterative procedure comprising expectation (E) and maximization (M) steps [3]. The E step computes the a posteriori probability of the class label given each data point using the current means and covariance matrices of the Gaussians, i.e.,

$$ p_{ij} = p\left(C_j \mid x(i)\right) = \frac{p\left(x(i) \mid C_j\right) p\left(C_j\right)}{\sum_{k=1}^{K} p\left(x(i) \mid C_k\right) p\left(C_k\right)} = \frac{\mathcal{N}\left(x(i) \mid \mu(j), \Sigma(j)\right)}{\sum_{k=1}^{K} \mathcal{N}\left(x(i) \mid \mu(k), \Sigma(k)\right)} \tag{21} $$

where N(x(i)|µ(k), Σ(k)) is the Gaussian PDF with mean µ(k) and covariance matrix Σ(k). Note that we have inherently assumed equal probability p(Cj) of each class, which is a valid …
… are initialized with random means and unit covariance matrices and are depicted using red and blue circles. The results after the first E step are shown in Fig. 14(b), where the posterior probabilities in Eq. (21) are expressed by the proportion of red and blue colors for each data point. Fig. 14(c) depicts the results after the first M step, where the means and covariance matrices of the red and blue Gaussian distributions are updated using Eq. (22), which in turn uses the posterior probabilities computed by Eq. (21). This completes the 1st iteration of the EM algorithm. Fig. 14(d) to (f) show the results after 2, 5 and 20 complete EM iterations, respectively, where the convergence of the algorithm and consequently the effective splitting of the data points into two clusters can be clearly observed.
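The following NumPy/SciPy sketch mirrors the E step of Eq. (21); since the M step of Eq. (22) is not reproduced in this excerpt, the standard Gaussian-mixture mean/covariance updates are assumed in its place, and equal class probabilities p(Cj) are kept fixed as in the text. The two-cluster toy data are illustrative only.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K=2, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    L, D = X.shape
    mu = X[rng.choice(L, K, replace=False)]            # random initial means
    Sigma = np.array([np.eye(D)] * K)                  # unit covariance matrices
    for _ in range(n_iter):
        # E step, Eq. (21): p_ij = N(x(i)|mu(j),Sigma(j)) / sum_k N(x(i)|mu(k),Sigma(k))
        # (equal class probabilities p(C_j) are assumed, as in the text)
        pdfs = np.stack([multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                         for k in range(K)], axis=1)   # L x K
        p = pdfs / pdfs.sum(axis=1, keepdims=True)
        # M step (standard Gaussian-mixture updates for the means and covariances)
        for k in range(K):
            w = p[:, k][:, None]
            mu[k] = (w * X).sum(axis=0) / w.sum()
            diff = X - mu[k]
            Sigma[k] = (w * diff).T @ diff / w.sum() + 1e-6 * np.eye(D)
    return mu, Sigma, p

# Toy data: two overlapping 2D Gaussian clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, size=(150, 2)),
               rng.normal([3, 2], 1.0, size=(150, 2))])
mu, Sigma, posteriors = em_gmm(X)
print(np.round(mu, 2))
```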
C. Principal Component Analysis (PCA)

PCA is an unsupervised learning technique for feature extraction and data representation [21], [22]. It is often used as a preprocessing tool in many pattern recognition applications for the extraction of limited but most critical data features. The central idea behind PCA is to project the original high-dimensional data onto a lower-dimensional feature space that retains most …

Fig. 15. Example to illustrate the concept of PCA. (a) Data points in the original 3D data space; (b) three PCs ordered according to the variability in the original data; (c) projection of data points onto a plane defined by the first two PCs while discarding the third one.

… The dimension S of the retained subspace is chosen such that the eigenvalues λi of the data covariance matrix satisfy

$$ \frac{\sum_{i=1}^{S} \lambda_i}{\sum_{i=1}^{M} \lambda_i} > R \tag{24} $$

where R is typically above 0.9 [22]. Note that, as compared to the original M-dimensional data space, the chosen eigenvectors span only an S-dimensional subspace that in a way captures most of the data information. One can understand such a procedure intuitively by noting that, for a covariance matrix, finding the eigenvectors with large eigenvalues corresponds to finding linear combinations or particular directions of the input space that give large variances, which is exactly what we want to capture. A data vector x can then be approximated as a weighted sum of the chosen eigenvectors in this subspace, i.e.,

$$ x \approx \sum_{i=1}^{S} w_i \, \mu(i) \tag{25} $$

where µ(i), i = 1, 2, . . . , S, are the chosen orthogonal eigenvectors such that

$$ \mu(m)^T \mu(l) = \begin{cases} 1 & \text{if } l = m \\ 0 & \text{if } l \ne m. \end{cases} \tag{26} $$
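A compact NumPy sketch of the PCA procedure of Eqs. (24)–(26) is shown below: eigendecomposition of the data covariance matrix, selection of S from the eigenvalue ratio, and projection onto the retained eigenvectors. The threshold R = 0.95 and the toy 3D data (which are essentially two-dimensional, as in Fig. 15) are illustrative assumptions.

```python
import numpy as np

def pca(X, R=0.95):
    # Center the data and form the covariance matrix
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)
    lam, U = np.linalg.eigh(C)                    # eigenvalues/eigenvectors (ascending order)
    lam, U = lam[::-1], U[:, ::-1]                # reorder by decreasing variance
    # Choose S as the smallest subspace dimension satisfying Eq. (24)
    S = int(np.searchsorted(np.cumsum(lam) / lam.sum(), R) + 1)
    W = Xc @ U[:, :S]                             # weights w_i of Eq. (25)
    X_approx = X.mean(axis=0) + W @ U[:, :S].T    # reconstruction from the S retained PCs
    return U[:, :S], W, X_approx, S

# Toy 3D data lying close to a 2D plane plus a small amount of noise
rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 3))
X = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 3))
PCs, W, X_approx, S = pca(X, R=0.95)
print("retained PCs:", S, " reconstruction MSE:", round(float(np.mean((X - X_approx) ** 2)), 5))
```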
Fig. 19. Schematic diagram of a three-hidden-layer DNN (top). Two autoencoders used for the pretraining of the first two hidden layers of the DNN (bottom). The decoder parts in both autoencoders are shown in grey color with dotted weight lines.

… spatial locations is the same as convolving the input image with g(−sx, −sy) (hence the name convolutional neural networks). Alternatively, one can also view the w^T(·) + b operation as cross-correlating g(sx, sy) with the input image. Therefore, a high value will result if that part of the input image resembles g(sx, sy). Together with the decision-like nonlinear activation function, the overall feature map indicates which location in the original image best resembles g(sx, sy), which essentially tries to identify and locate a certain feature in the input image. With this insight, the interleaving convolutional and sub-sampling layers can be intuitively understood as identifying higher-level and more complex features of the input image.

The training of a CNN is performed using a modified BP algorithm which updates the convolutional filters' weights and also takes the sub-sampling layers into account. Since a lot of weights are supposedly identical, as the network is essentially performing the convolution operation, one will update those weights using the average of the corresponding gradients.

… computed as

$$ h(t) = \sigma_1\left(W_1 x(t) + W_r h(t-1) + b_1\right) \tag{28} $$
$$ o(t) = \sigma_2\left(W_2 h(t) + b_2\right) \tag{29} $$

where b1 and b2 are the bias vectors while σ1(·) and σ2(·) are the activation functions for the hidden and output layer neurons, respectively. Given a data set {(x(1), y(1)), (x(2), y(2)), . . . , (x(L), y(L))} of input-output pairs, the RNN is first unfolded in time to represent it as a multilayer network and then the BP algorithm is applied on this graph, as shown in Fig. 23, to compute all the necessary matrix derivatives {∂E/∂W1, ∂E/∂W2, ∂E/∂Wr, ∂E/∂b1, ∂E/∂b2}. The loss function can be cross-entropy or MSE. The matrix derivative ∂E/∂Wr is a bit more complicated to calculate since Wr is shared across all hidden layers. In this case,

$$ \frac{\partial E}{\partial W_r} = \sum_{t=1}^{L} \frac{\partial E(t)}{\partial W_r} = \sum_{t=1}^{L} \frac{\partial E(t)}{\partial h(t)}\,\frac{\partial h(t)}{\partial W_r} = \sum_{t=1}^{L} \sum_{l=1}^{t} \frac{\partial E(t)}{\partial h(t)}\,\frac{\partial h(t)}{\partial h(l)}\,\frac{\partial h(l)}{\partial W_r} \; \ldots $$
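A short NumPy sketch of the RNN recurrence of Eqs. (28)–(29) follows; the dimensions and random (untrained) weights are placeholders, and the unfolding in time needed for BP through the graph of Fig. 23 corresponds simply to the time loop below.

```python
import numpy as np

def rnn_forward(x_seq, W1, Wr, W2, b1, b2):
    """Run the recurrence of Eqs. (28)-(29) over an input sequence.
    x_seq has shape (T, M); returns hidden states (T, H) and outputs (T, K)."""
    H = Wr.shape[0]
    h = np.zeros(H)                                            # initial hidden state h(0)
    hs, os = [], []
    for x_t in x_seq:
        h = 1.0 / (1.0 + np.exp(-(W1 @ x_t + Wr @ h + b1)))    # Eq. (28), sigmoid hidden layer
        o = W2 @ h + b2                                        # Eq. (29), linear output layer
        hs.append(h); os.append(o)
    return np.array(hs), np.array(os)

# Illustrative dimensions and random parameters (placeholders, not trained values)
M, H, K, T = 3, 5, 2, 10
rng = np.random.default_rng(0)
W1, Wr, W2 = rng.normal(size=(H, M)), rng.normal(size=(H, H)), rng.normal(size=(K, H))
b1, b2 = np.zeros(H), np.zeros(K)
hidden, outputs = rnn_forward(rng.normal(size=(T, M)), W1, Wr, W2, b1, b2)
print(outputs.shape)   # (10, 2); BP through time would unfold this loop over the T steps
```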
A. Optical Performance Monitoring (OPM)

Fig. 25. Impairments-dependent patterns reflected by (a) eye-diagrams [34], (b) ADTPs [35] and (c) AHs [36], and their corresponding known impairments values which serve as data labels during the training process.

Optical communication networks are becoming increasingly complex, transparent and dynamic. Reliable operation and efficient management of these complex fiber-optic networks require incessant and real-time information of various channel impairments ubiquitously across the network, also known as OPM [33]. OPM is widely regarded as a key enabling technology for SDNs. Through OPM, SDNs can become aware of the real-time network conditions and subsequently adjust different transceiver/network element parameters such as launched powers, data rates, modulation formats, spectrum assignments, etc., for optimized transmission performance [4]. Unfortunately, conventional OPM techniques have shown limited success in simultaneous and independent monitoring of multiple transmission impairments, since the effects of different impairments are often difficult to separate analytically. Another crucial OPM requirement is low complexity, since the OPM devices need to be deployed ubiquitously across optical networks. ML techniques are proposed as an enabler for realizing low complexity (and hence low cost) multi-impairment monitoring in optical networks and have already shown tremendous potential.

Most existing ML-based OPM techniques adopt a supervised learning approach utilizing training data sets of labeled examples during the offline learning process of selected ML models. The training data may, e.g., consist of signal representations like eye-diagrams, asynchronous delay-tap plots (ADTPs), amplitude histograms (AHs), etc., and their corresponding known impairment values such as CD, differential group delay (DGD), OSNR, etc., serving as data labels, as shown in Fig. 25. During the training phase, the inputs to an ML model are the impairments-indicative feature vectors x of eye-diagrams/ADTPs/AHs while their corresponding labels y are used as the targets, as shown in Fig. 26(a). The ML model then learns the mapping between input features and the labels. Note that in case of eye-diagrams, the features can be parame…
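As a hedged illustration of the offline training and online monitoring phases of Fig. 26, the sketch below fits a small TensorFlow/Keras regression model that maps feature vectors to impairment labels; the synthetic features and the "[OSNR, CD]" labels are placeholders standing in for real eye-diagram/ADTP/AH data with known impairment values.

```python
import numpy as np
import tensorflow as tf

# Placeholder training data: feature vectors x (e.g., amplitude-histogram bin counts)
# and label vectors y (e.g., [OSNR in dB, CD in ps/nm]); real data would come from
# eye-diagrams/ADTPs/AHs with known impairment values, as in Fig. 25.
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 1.0, size=(5000, 40))
y_train = np.stack([15 + 10 * X_train[:, :10].mean(axis=1),        # synthetic "OSNR"
                    800 * X_train[:, 10:20].mean(axis=1)], axis=1)  # synthetic "CD"

# Offline training phase (Fig. 26(a)): learn the mapping from features x to labels y
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='sigmoid', input_shape=(40,)),
    tf.keras.layers.Dense(2)                      # linear outputs for regression
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=10, batch_size=64, verbose=0)

# Online monitoring phase (Fig. 26(b)): the trained model outputs impairment estimates o
x_new = rng.uniform(0.0, 1.0, size=(1, 40))
print("estimated [OSNR, CD]:", model.predict(x_new, verbose=0))
```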
Fig. 26. (a) ML model during the offline training phase with feature vectors x
as inputs and the labels y as targets. (b) Trained ML model used for online OPM
with feature vectors x as inputs and the impairments estimates o as outputs.
TABLE I
SOME KEY ML-BASED FIBER NLC TECHNIQUES. PPD: PRE/POST-DISTORTION, DCF: DISPERSION-COMPENSATING FIBER, DSM: DIGITAL SUBCARRIER
MODULATION, PS: PROBABILISTIC SHAPED
… outperforms transmitter-side perturbation-based pre-distortion methods by 0.3 dB in both single-channel and WDM systems [65].

Open issues: Table I shows some key techniques using ML for fiber NLC. Most of these works incorporate ML as an extra DSP module placed either at the transmitter or the receiver. While effective to a certain extent, it is not clear what is the best sequence of conventional signal processing and ML blocks in such a hybrid DSP configuration. One factor driving the choice of sequence is the dynamic effects such as carrier frequency offset, laser phase noise, PMD, etc., that are hard to capture in the learning process of an ML algorithm. In this case, one can perform ML-based NLC after linear compensations so as to avoid tackling these time-varying dynamics in ML. In the other extreme, RNN structures can embrace all the time-varying dynamics in principle, but this may be an overkill since we do know their underlying physics and it should be exploited in the overall DSP design. Also, in case of hybrid configurations, the accuracy of conventional DSP algorithms such as CMA or carrier phase estimation (CPE) plays a major role in the quality of the data sets on which ML fundamentally relies. Therefore, there are strong dependencies between ML and conventional DSP blocks, and the right balance is still an open area of research. Finally, to the best of our knowledge, an ML-based single-channel processing technique that outperforms DBP in practical WDM settings has yet to be developed.

Numerous studies are conducted to also address the computational complexity issues of conventional and ML techniques for fiber NLC. For conventional NLC algorithms, we direct the readers to the survey paper [66]. On the other hand, the computational complexity of ML algorithms for NLC varies significantly with the architecture and the training process used, which makes comparison with the conventional techniques difficult. Generally, the training processes are too complex to be performed online as they require a lot of iterations and potentially massive training data. For the inference phase (i.e., using the trained model for real-time data detection), most ML algorithms proposed involve relatively simple computations, leading to the perception that ML techniques are generally simple to implement since offline training processes are typically not counted towards the computational complexity. However, in reality, the training does take up a lot of computational resources and time, which should not be completely disregarded while evaluating the complexity of ML approaches for NLC.

C. Proactive Fault Detection

Reliable network operations are essential for the carriers to provide service guarantees, called service-level agreements (SLAs), to their customers regarding the system's availability and promised quality levels. Violation of these guarantees may result in severe penalties. It is thus highly desirable to have an early warning and proactive protection mechanism incorporated into the network. This can empower network operators to know when the network components are beginning to deteriorate, and preventive measures can then be taken to avoid serious disruptions [33].

Conventional fault detection and management tools in optical networks adopt a rigid approach where some fixed threshold limits are set by the system engineers and alarms are triggered to alert malfunctions if those limits are surpassed. Such traditional network protection approaches have the following main drawbacks: (i) These methods protect a network in a passive manner, i.e., they are unable to forecast the risks and tend to reduce the damages only after a failure occurs. This approach may result in the loss of immense amounts of data during the network recovery process once a failure happens. (ii) The inability to accurately forecast the faults leads to ultraconservative network designs involving large operating margins and protection switching paths, which in turn result in an underutilization of the system resources. (iii) They are unable to determine the root cause of faults. (iv) Apart from hard failures (i.e., the ones causing major signal disruptions), several kinds of soft failures (i.e., the ones degrading system performance slowly and slightly) may also occur in optical networks which cannot be easily detected using conventional methods.

ML-enabled proactive fault management has recently been conceived as a powerful means to assure reliable network operation [67]. Instead of using traditional fixed pre-engineered solutions, this new mechanism relies on dynamic data-driven operations, leveraging immense amounts of operational data retrieved through network monitors (e.g., using simple network management protocol (SNMP)). The data repository may include …
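To illustrate the "early warning" idea in the simplest possible terms, the sketch below extrapolates the trend of a monitored parameter and raises an alarm before a fixed threshold is actually crossed; the monitored quantity, the threshold and the linear trend model are illustrative assumptions only and do not represent the methods of [68]–[71].

```python
import numpy as np

# Synthetic monitored parameter, e.g., a daily performance margin (dB) slowly degrading
rng = np.random.default_rng(0)
days = np.arange(120)
margin = 6.0 - 0.02 * days + rng.normal(0, 0.15, size=days.size)

THRESHOLD = 3.0        # alarm limit a conventional fixed-threshold scheme would use
HORIZON = 30           # look-ahead window (days) for the early warning

# Fit a simple trend model on a sliding history window and extrapolate it forward;
# any regression model (ANN, etc.) could replace the least-squares line used here.
window = 60
coeffs = np.polyfit(days[-window:], margin[-window:], deg=1)
future_days = np.arange(days[-1] + 1, days[-1] + 1 + HORIZON)
predicted = np.polyval(coeffs, future_days)

if np.any(predicted < THRESHOLD):
    first = future_days[np.argmax(predicted < THRESHOLD)]
    print(f"Proactive warning: margin predicted to cross {THRESHOLD} dB on day {first}")
else:
    print("No threshold crossing predicted within the horizon")
```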
Looking to the future, we can foresee a vital role played by ML-based mechanisms across several diverse functional areas in optical networks, e.g., network planning and performance prediction, network maintenance and fault prevention, network resources allocation and management, etc. ML can also aid cross-layer optimization in future optical networks requiring big data analytics, since it can inherently learn and uncover hidden patterns and unknown correlations in big data, which can be extremely beneficial in solving complex network optimization problems. The ultimate objective of ML-driven next-generation optical networks will be to provide infrastructures which can monitor themselves, diagnose and resolve their problems, and provide intelligent and efficient services to the end users.

VII. ONLINE RESOURCES FOR ML ALGORITHMS

Standard ML algorithms' codes and examples are readily available online and one seldom needs to write their own codes from the very beginning. There are several off-the-shelf powerful frameworks available under open-source licenses such as TensorFlow, Pytorch, Caffe, etc. Matlab, which is widely used in optical communications research, is not the most popular programming language among the ML community. Instead, Python is the preferred language for ML research, partly because it is freely available, multi-platform, relatively easy to use/read, and has a huge number of libraries/modules available for a wide variety of tasks. We hereby report some useful resources including example Python codes using the TensorFlow library to help interested readers get started with applying simple ML algorithms to their problems. A more intuitive understanding of ANNs can be found at this visual playground [83]. The Python codes for most of the standard neural network architectures discussed in this paper can be found in these GitHub repositories [84], [85] with examples. For non-standard model design, TensorFlow also provides low-level programming interfaces for more custom and complex operations based on its symbolic building blocks, which are documented in detail in [86].

VIII. CONCLUSIONS

In this paper, we discussed how the rich body of ML techniques can be applied as a unique and powerful set of signal processing tools in fiber-optic communication systems. As optical networks become faster, more dynamic and more software-defined, we will see an increasing number of applications of ML and big data analytics in future networks to solve certain critical problems that cannot be easily tackled using conventional approaches. Basic knowledge and skills in ML will thus become necessary and beneficial for researchers in the field of optical communications and networking.

APPENDIX

For the cross-entropy loss function defined in Eq. (14), the derivative with respect to the output is given by

$$ \frac{\partial E(n)}{\partial o_j(n)} = -\frac{y_j(n)}{o_j(n)}. \tag{36} $$

With the softmax activation function for the output neurons,

$$ \frac{\partial o_j(n)}{\partial r_k(n)} = \frac{\left(\sum_{m=1}^{K} e^{r_m(n)}\right) e^{r_j(n)} \delta_{j,k} - e^{r_j(n)} \cdot e^{r_k(n)}}{\left(\sum_{m=1}^{K} e^{r_m(n)}\right)^2} = o_j(n)\,\delta_{j,k} - o_j(n)\,o_k(n) \tag{37} $$

where δj,k = 1 when j = k and 0 otherwise. Consequently,

$$ \frac{\partial E(n)}{\partial r_k(n)} = \sum_{j=1}^{K} \frac{\partial E(n)}{\partial o_j(n)}\,\frac{\partial o_j(n)}{\partial r_k(n)} = -\sum_{j=1}^{K} \frac{y_j(n)}{o_j(n)}\left(o_j(n)\,\delta_{j,k} - o_j(n)\,o_k(n)\right) = -\sum_{j=1}^{K} y_j(n)\left(\delta_{j,k} - o_k(n)\right) = o_k(n) - y_k(n) \tag{38} $$

as Σ_{j=1}^{K} yj(n) = 1. Therefore,

$$ \frac{\partial E(n)}{\partial r(n)} = o(n) - y(n). $$

Now, since ∂r(n)/∂b2, ∂r(n)/∂b1, ∂rk(n)/∂W2, ∂rk(n)/∂W1 are the same as ∂o(n)/∂b2, ∂o(n)/∂b1, ∂ok(n)/∂W2, ∂ok(n)/∂W1 for the MSE loss function and linear activation function for the output neurons (as o(n) = r(n) in that case), it follows that the update equations Eq. (8) to Eq. (12) also hold for ANNs with the cross-entropy loss function and softmax activation function for the output neurons.
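The result ∂E(n)/∂r(n) = o(n) − y(n) derived above can be checked numerically; the short sketch below compares Eq. (38) against a finite-difference derivative for a random softmax/cross-entropy example (the dimension K = 5 is an arbitrary choice for illustration).

```python
import numpy as np

def softmax(r):
    e = np.exp(r - r.max())          # subtracting the max does not change the result
    return e / e.sum()

def cross_entropy(o, y):
    return -np.sum(y * np.log(o))    # per-sample cross-entropy, consistent with Eq. (36)

rng = np.random.default_rng(0)
K = 5
r = rng.normal(size=K)               # pre-activation r(n) of the output layer
y = np.eye(K)[rng.integers(K)]       # one-hot target y(n)
o = softmax(r)

# Analytical gradient from Eq. (38): dE/dr = o - y
grad_analytical = o - y

# Finite-difference estimate of the same derivative
eps = 1e-6
grad_numerical = np.array([
    (cross_entropy(softmax(r + eps * np.eye(K)[k]), y) -
     cross_entropy(softmax(r - eps * np.eye(K)[k]), y)) / (2 * eps)
    for k in range(K)])

print(np.max(np.abs(grad_analytical - grad_numerical)))   # approximately 0, confirming Eq. (38)
```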
ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.

REFERENCES

[1] S. Marsland, Machine Learning: An Algorithmic Perspective, 2nd ed. Boca Raton, FL, USA: CRC Press, 2015.
[2] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
[3] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer-Verlag, 2006.
[4] Z. Dong, F. N. Khan, Q. Sui, K. Zhong, C. Lu, and A. P. T. Lau, "Optical performance monitoring: A review of current and future technologies," J. Lightw. Technol., vol. 34, no. 2, pp. 525–543, Jan. 2016.
[5] A. S. Thyagaturu, A. Mercian, M. P. McGarry, M. Reisslein, and W. Kellerer, "Software defined optical networks (SDONs): A comprehensive survey," IEEE Commun. Surv. Tut., vol. 18, no. 4, pp. 2738–2786, Oct.–Dec. 2016.
[6] [Online]. Available: https://www.youtube.com/channel/UCLZL8KsCzOODkDKBOI3S2lw
[7] R. A. Dunne, A Statistical Approach to Neural Networks for Pattern Recognition. Hoboken, NJ, USA: Wiley, 2007.
[8] I. Kaastra and M. Boyd, "Designing a neural network for forecasting financial and economic time series," Neurocomputing, vol. 10, no. 3, pp. 215–236, Apr. 1996.
[9] C. D. Meyer, Matrix Analysis and Applied Linear Algebra. Philadelphia, PA, USA: SIAM, 2000.
[10] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York, NY, USA: Wiley, 2007.
[11] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proc. Int. Conf. Artif. Intell. Statist., Fort Lauderdale, FL, USA, 2011, vol. 15, pp. 315–323.
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. Int. Conf. Comput. Vis., Santiago, Chile, 2015, pp. 1026–1034.
[13] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. Int. Conf. Artif. Intell. Statist., Sardinia, Italy, 2010, pp. 249–256.
[14] A. Webb, Statistical Pattern Recognition, 2nd ed. Chicester, U.K.: Wiley, 2002.
[15] A. Statnikov, C. F. Aliferis, D. P. Hardin, and I. Guyon, A Gentle Introduction to Support Vector Machines in Biomedicine. Singapore: World Scientific, 2011.
[16] M. S. Andersen, J. Dahl, and L. Vandenberghe, "CVXOPT: Python software for convex optimization." [Online]. Available: https://cvxopt.org. Accessed on: Feb. 5, 2019.
[17] M. Li, S. Yu, J. Yang, Z. Chen, Y. Han, and W. Gu, "Nonparameter nonlinear phase noise mitigation by using M-ary support vector machine for coherent optical systems," IEEE Photon. J., vol. 5, no. 6, Dec. 2013, Art. no. 7800312.
[18] D. Wang et al., "Nonlinear decision boundary created by a machine learning-based classifier to mitigate nonlinear phase noise," in Proc. Eur. Conf. Opt. Commun., Valencia, Spain, 2015, Paper P.3.16.
[19] T. Nguyen, S. Mhatli, E. Giacoumidis, L. V. Compernolle, M. Wuilpart, and P. Mégret, "Fiber nonlinearity equalizer based on support vector classification for coherent optical OFDM," IEEE Photon. J., vol. 8, no. 2, Apr. 2016, Art. no. 7802009.
[20] M. Kirk, Thoughtful Machine Learning With Python. Sebastopol, CA, USA: O'Reilly Media, 2017.
[21] I. T. Jolliffe, Principal Component Analysis, 2nd ed. New York, NY, USA: Springer-Verlag, 2002.
[22] J. E. Jackson, A User's Guide to Principal Components. Hoboken, NJ, USA: Wiley, 2003.
[23] L. J. Cao, K. S. Chua, W. K. Chong, H. P. Lee, and Q. M. Gu, "A comparison of PCA, KPCA and ICA for dimensionality reduction in support vector machine," Neurocomputing, vol. 55, no. 1/2, pp. 321–336, Sep. 2003.
[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA, USA: MIT Press, 2018.
[25] Y. Bengio, "Learning deep architectures for AI," Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, Nov. 2009.
[26] Y. Bengio and O. Delalleau, "On the expressive power of deep architectures," in Algorithmic Learning Theory, J. Kivinen, C. Szepesvári, E. Ukkonen, and T. Zeugmann, Eds. Heidelberg, Germany: Springer-Verlag, 2011, pp. 18–36.
[27] L. J. Ba and R. Caurana, "Do deep nets really need to be deep?" in Proc. Neural Inf. Process. Syst., Montreal, QC, Canada, 2014, pp. 2654–2662.
[28] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, "Exploring strategies for training deep neural networks," J. Mach. Learn. Res., vol. 10, pp. 1–40, Jan. 2009.
[29] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[30] D. P. Mandic and J. Chambers, Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability. Chicester, U.K.: Wiley, 2001.
[31] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, "How to construct deep recurrent neural networks," arXiv:1312.6026, 2013.
[32] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in Proc. Int. Conf. Mach. Learn., Atlanta, GA, USA, 2013, pp. 1310–1318.
[33] F. N. Khan, Z. Dong, C. Lu, and A. P. T. Lau, "Optical performance monitoring for fiber-optic communication networks," in Enabling Technologies for High Spectral-Efficiency Coherent Optical Communication Networks, X. Zhou and C. Xie, Eds. Hoboken, NJ, USA: Wiley, 2016, ch. 14.
[34] X. Wu, J. A. Jargon, R. A. Skoog, L. Paraschis, and A. E. Willner, "Applications of artificial neural networks in optical performance monitoring," J. Lightw. Technol., vol. 27, no. 16, pp. 3580–3589, Aug. 2009.
[35] T. B. Anderson, A. Kowalczyk, K. Clarke, S. D. Dods, D. Hewitt, and J. C. Li, "Multi impairment monitoring for optical networks," J. Lightw. Technol., vol. 27, no. 16, pp. 3729–3736, Aug. 2009.
[36] F. N. Khan, Y. Zhou, A. P. T. Lau, and C. Lu, "Modulation format identification in heterogeneous fiber-optic networks using artificial neural networks," Opt. Express, vol. 20, no. 11, pp. 12422–12431, May 2012.
[37] F. N. Khan, T. S. R. Shen, Y. Zhou, A. P. T. Lau, and C. Lu, "Optical performance monitoring using artificial neural networks trained with empirical moments of asynchronously sampled signal amplitudes," IEEE Photon. Technol. Lett., vol. 24, no. 12, pp. 982–984, Jun. 2012.
[38] T. S. R. Shen, Q. Sui, and A. P. T. Lau, "OSNR monitoring for PM-QPSK systems with large inline chromatic dispersion using artificial neural network technique," IEEE Photon. Technol. Lett., vol. 24, no. 17, pp. 1564–1567, Sep. 2012.
[39] M. C. Tan, F. N. Khan, W. H. Al-Arashi, Y. Zhou, and A. P. T. Lau, "Simultaneous optical performance monitoring and modulation format/bit-rate identification using principal component analysis," J. Opt. Commun. Netw., vol. 6, no. 5, pp. 441–448, May 2014.
[40] P. Isautier, K. Mehta, A. J. Stark, and S. E. Ralph, "Robust architecture for autonomous coherent optical receivers," J. Opt. Commun. Netw., vol. 7, no. 9, pp. 864–874, Sep. 2015.
[41] N. G. Gonzalez, D. Zibar, and I. T. Monroy, "Cognitive digital receiver for burst mode phase modulated radio over fiber links," in Proc. Eur. Conf. Opt. Commun., Torino, Italy, 2010, Paper P6.11.
[42] F. N. Khan, Y. Zhou, Q. Sui, and A. P. T. Lau, "Non-data-aided joint bit-rate and modulation format identification for next-generation heterogeneous optical networks," Opt. Fiber Technol., vol. 20, no. 2, pp. 68–74, Mar. 2014.
[43] R. Borkowski, D. Zibar, A. Caballero, V. Arlunno, and I. T. Monroy, "Stokes space-based optical modulation format recognition in digital coherent receivers," IEEE Photon. Technol. Lett., vol. 25, no. 21, pp. 2129–2132, Nov. 2013.
[44] F. N. Khan, K. Zhong, W. H. Al-Arashi, C. Yu, C. Lu, and A. P. T. Lau, "Modulation format identification in coherent receivers using deep machine learning," IEEE Photon. Technol. Lett., vol. 28, no. 17, pp. 1886–1889, Sep. 2016.
[45] F. N. Khan et al., "Joint OSNR monitoring and modulation format identification in digital coherent receivers using deep neural networks," Opt. Express, vol. 25, no. 15, pp. 17767–17776, Jul. 2017.
[46] T. Tanimura, T. Hoshida, J. C. Rasmussen, M. Suzuki, and H. Morikawa, "OSNR monitoring by deep neural networks trained with asynchronously sampled data," in Proc. OptoElectron. Commun. Conf., Niigata, Japan, 2016, Paper TuB3-5.
[47] T. Tanimura, T. Kato, S. Watanabe, and T. Hoshida, "Deep neural network based optical monitor providing self-confidence as auxiliary output," in Proc. Eur. Conf. Opt. Commun., Rome, Italy, 2018, Paper We1D.5.
[48] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, Jun. 2014.
[49] D. Wang et al., "Modulation format recognition and OSNR estimation using CNN-based deep learning," IEEE Photon. Technol. Lett., vol. 29, no. 19, pp. 1667–1670, Oct. 2017.
[50] A. S. Kashi et al., "Fiber nonlinear noise-to-signal ratio monitoring using artificial neural networks," in Proc. Eur. Conf. Opt. Commun., Gothenburg, Sweden, 2017, Paper M.2.F.2.
[51] F. J. V. Caballero et al., "Machine learning based linear and nonlinear noise estimation," J. Opt. Commun. Netw., vol. 10, no. 10, pp. D42–D51, Oct. 2018.
[52] E. Ip, "Nonlinear compensation using backpropagation for polarization-multiplexed transmission," J. Lightw. Technol., vol. 28, no. 6, pp. 939–951, Mar. 2010.
[53] P. Poggiolini and Y. Jiang, "Recent advances in the modeling of the impact of nonlinear fiber propagation effects on uncompensated coherent transmission systems," J. Lightw. Technol., vol. 35, no. 3, pp. 458–480, Feb. 2017.
[54] A. Carena, G. Bosco, V. Curri, Y. Jiang, P. Poggiolini, and F. Forghieri, "EGN model of non-linear fiber propagation," Opt. Express, vol. 22, no. 13, pp. 16335–16362, Jun. 2014.
[55] T. S. R. Shen and A. P. T. Lau, "Fiber nonlinearity compensation using extreme learning machine for DSP-based coherent communication systems," in Proc. OptoElectron. Commun. Conf., Kaohsiung, Taiwan, 2011, pp. 816–817.
[56] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: Theory and applications," Neurocomputing, vol. 70, no. 1–3, pp. 489–501, Dec. 2006.
[57] D. Zibar et al., "Nonlinear impairment compensation using expectation maximization for dispersion managed and unmanaged PDM 16-QAM transmission," Opt. Express, vol. 20, no. 26, pp. B181–B196, Dec. 2012.
[58] E. Alpaydin, Introduction to Machine Learning, 2nd ed. Cambridge, MA, USA: MIT Press, 2010.
[59] E. Giacoumidis et al., "Reduction of nonlinear intersubcarrier intermixing in coherent optical OFDM by a fast Newton-based support vector machine nonlinear equalizer," J. Lightw. Technol., vol. 35, no. 12, pp. 2391–2397, Jun. 2017.
[60] E. Giacoumidis et al., "Fiber nonlinearity-induced penalty reduction in CO-OFDM by ANN-based nonlinear equalization," Opt. Lett., vol. 40, no. 21, pp. 5113–5116, Nov. 2015.
[61] C. Pan, H. Bülow, W. Idler, L. Schmalen, and F. R. Kschischang, "Optical nonlinear phase noise compensation for 9 × 32-Gbaud PolDM-16QAM transmission using a code-aided expectation-maximization algorithm," J. Lightw. Technol., vol. 33, no. 17, pp. 3679–3686, Sep. 2015.
[62] C. Häger and H. D. Pfister, "Nonlinear interference mitigation via deep neural networks," in Proc. Opt. Fiber Commun., San Diego, CA, USA, 2018, Paper W3A.4.
[63] C. Häger and H. D. Pfister, "Wideband time-domain digital backpropagation via subband processing and deep learning," in Proc. Eur. Conf. Opt. Commun., Rome, Italy, 2018, Paper Tu4F.4.
[64] Z. Tao, L. Dou, W. Yan, L. Li, T. Hoshida, and J. C. Rasmussen, "Multiplier-free intrachannel nonlinearity compensating algorithm operating at symbol rate," J. Lightw. Technol., vol. 29, no. 17, pp. 2570–2576, Sep. 2011.
[65] V. Kamalov et al., "Evolution from 8QAM live traffic to PS 64-QAM with neural-network based nonlinearity compensation on 11000 km open subsea cable," in Proc. Opt. Fiber Commun., San Diego, CA, USA, 2018, Paper Th4D.5.
[66] R. Dar and P. J. Winzer, "Nonlinear interference mitigation: Methods and potential gain," J. Lightw. Technol., vol. 35, no. 4, pp. 903–930, Feb. 2017.
[67] F. N. Khan, C. Lu, and A. P. T. Lau, "Optical performance monitoring in fiber-optic networks enabled by machine learning techniques," in Proc. Opt. Fiber Commun., San Diego, CA, USA, 2018, Paper M2F.3.
[68] Z. Wang et al., "Failure prediction using machine learning and time series in optical network," Opt. Express, vol. 25, no. 16, pp. 18553–18565, Aug. 2017.
[69] F. Boitier et al., "Proactive fiber damage detection in real-time coherent receiver," in Proc. Eur. Conf. Opt. Commun., Gothenburg, Sweden, 2017, Paper Th.2.F.1.
[70] D. Rafique, T. Szyrkowiec, H. Griesser, A. Autenrieth, and J.-P. Elbers, "Cognitive assurance architecture for optical network fault management," J. Lightw. Technol., vol. 36, no. 7, pp. 1443–1450, Apr. 2018.
[71] D. Rafique, T. Szyrkowiec, A. Autenrieth, and J.-P. Elbers, "Analytics-driven fault discovery and diagnosis for cognitive root cause analysis," in Proc. Opt. Fiber Commun., San Diego, CA, USA, 2018, Paper W4F.6.
[72] J. Mata et al., "Artificial intelligence (AI) methods in optical networks: A comprehensive survey," Opt. Switching Netw., vol. 28, pp. 43–57, Apr. 2018.
[73] F. Musumeci et al., "An overview on application of machine learning techniques in optical networks," IEEE Commun. Surv. Tut., to be published, doi: 10.1109/COMST.2018.2880039.
[74] F. Morales, M. Ruiz, L. Gifre, L. M. Contreras, V. López, and L. Velasco, "Virtual network topology adaptability based on data analytics for traffic prediction," J. Opt. Commun. Netw., vol. 9, no. 1, pp. A35–A45, Jan. 2017.
[75] D. Rafique and L. Velasco, "Machine learning for network automation: Overview, architecture, and applications," J. Opt. Commun. Netw., vol. 10, no. 10, pp. D126–D143, Oct. 2018.
[76] R. Alvizu, S. Troia, G. Maier, and A. Pattavina, "Matheuristic with machine-learning-based prediction for software-defined mobile metro-core networks," J. Opt. Commun. Netw., vol. 9, no. 9, pp. D19–D30, Sep. 2017.
[77] Y. V. Kiran, T. Venkatesh, and C. S. Murthy, "A reinforcement learning framework for path selection and wavelength selection in optical burst switched networks," IEEE J. Sel. Areas Commun., vol. 25, no. 9, pp. 18–26, Dec. 2007.
[78] X. Chen, J. Guo, Z. Zhu, R. Proietti, A. Castro, and S. J. B. Yoo, "DeepRMSA: A deep-reinforcement-learning routing, modulation and spectrum assignment agent for elastic optical networks," in Proc. Opt. Fiber Commun., San Diego, CA, USA, 2018, Paper W4F.2.
[79] S. Yan et al., "Field trial of machine-learning-assisted and SDN-based optical network planning with network-scale monitoring database," in Proc. Eur. Conf. Opt. Commun., Gothenburg, Sweden, 2017, Paper Th.PDP.B.4.
[80] B. Karanov et al., "End-to-end deep learning of optical fiber communications," J. Lightw. Technol., vol. 36, no. 20, pp. 4843–4855, Oct. 2018.
[81] R. T. Jones, S. Gaiarin, M. P. Yankov, and D. Zibar, "Time-domain neural network receiver for nonlinear frequency division multiplexed systems," IEEE Photon. Technol. Lett., vol. 30, no. 12, pp. 1079–1082, Jun. 2018.
[82] J. Estaran et al., "Artificial neural networks for linear and non-linear impairment mitigation in high-baudrate IM/DD systems," in Proc. Eur. Conf. Opt. Commun., Düsseldorf, Germany, 2016, Paper M.2.B.2.
[83] D. Smilkov and S. Carter, "TensorFlow—A neural network playground." [Online]. Available: http://playground.tensorflow.org
[84] A. Damien, "GitHub repository—TensorFlow tutorial and examples for beginners with latest APIs." [Online]. Available: https://github.com/aymericdamien/TensorFlow-Examples
[85] M. Zhou, "GitHub repository—TensorFlow tutorial from basic to hard." [Online]. Available: https://github.com/MorvanZhou/Tensorflow-Tutorial
[86] TensorFlow, "Guide for programming with the low-level TensorFlow APIs." [Online]. Available: https://www.tensorflow.org/programmers_guide/low_level_intro

Faisal Nadeem Khan was born in Jhang, Pakistan. He received the B.Sc. degree in electrical engineering from the University of Engineering and Technology, Taxila, Pakistan, the M.Sc. degree in communications technology from the University of Ulm, Ulm, Germany, and the Ph.D. degree in electronic and information engineering from The Hong Kong Polytechnic University, Hong Kong. From 2012 to 2015, he was a Senior Lecturer with the School of Electrical and Electronic Engineering, Universiti Sains Malaysia. He is currently with the Photonics Research Centre, The Hong Kong Polytechnic University. He has authored or coauthored more than 50 research papers in prestigious international journals and conferences as well as written one book chapter. His research interests include machine learning and signal processing techniques for high-speed fiber-optic communication systems. He has been an invited speaker at various prestigious international conferences including Optical Fiber Communication 2018 and Signal Processing in Photonic Communications 2017, among others.

Qirui Fan was born in Zhejiang, China, in 1992. He received the B.Eng. and M.Eng. degrees in electrical engineering from Hunan University, Changsha, China, in 2014 and 2017, respectively. He is currently working toward the Ph.D. degree with the Department of Electrical Engineering, The Hong Kong Polytechnic University, Hong Kong. His research interests include machine learning techniques for fiber nonlinearity compensation.

Chao Lu received the B.Eng. degree in electronic engineering from Tsinghua University, Beijing, China, in 1985, and the M.Sc. and Ph.D. degrees from the University of Manchester, Manchester, U.K., in 1987 and 1990, respectively. In 1991, he joined, as a Lecturer, the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, where he has been an Associate Professor since January 1999. From June 2002 to December 2005, he was seconded to the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore, as a Program Director and Department Manager, helping to establish a research group in the area of optical communication and fiber devices. Since April 2006, he has been a Professor with the Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong. His research interests include optical communication systems and networks, fiber devices for optical communication, and sensor systems.

Alan Pak Tao Lau received the B.A.Sc. degree in engineering science (electrical option) and the M.A.Sc. degree in electrical and computer engineering from the University of Toronto, Toronto, ON, Canada, in 2003 and 2004, respectively, and the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, USA, in 2008. In 2008, he joined, as an Assistant Professor, The Hong Kong Polytechnic University, where he is currently a Professor. He collaborates with industry in various aspects of optical communications and serves in organizing committees of numerous conferences in optical communications. His current research interests include long-haul and short-reach coherent optical communication systems, optical performance monitoring, and machine learning applications in optical communications and networks.