An Optical Communication's Perspective On Machine Learning and Its Applications
(Invited Paper)
I. INTRODUCTION

Fig. 1. Given a data set, ML attempts to solve two main types of problems: (a) functional description of given data and (b) classification of data by deriving appropriate decision boundaries. (c) Laser frequency offset and phase estimation for quadrature phase-shift keying (QPSK) systems by raising the signal phase φ to the 4th power and performing regression to estimate the slope and intercept. (d) Decision boundaries for a received QPSK signal distribution.

Artificial intelligence (AI) makes use of computers/machines to perform cognitive tasks, i.e., the ones requiring knowledge, perception, learning, reasoning, understanding and other similar cognitive abilities. An AI system is expected to do three things: (i) store knowledge, (ii) apply the stored knowledge to solve problems, and (iii) acquire new knowledge via experience. The three key components of an AI system include knowledge representation, machine learning (ML), and automated reasoning. ML is a branch of AI which is based on the idea that patterns and trends in a given data set can be learned automatically through algorithms. The learned patterns and structures can then be used to make decisions or predictions on some other data in the system of interest [1].

ML is not a new field, as ML-related algorithms have existed at least since the 1970s. However, the tremendous increase in computational power over the last decade, recent groundbreaking developments in theory and algorithms surrounding ML, and easy access to an overabundance of all types of data worldwide (thanks to three decades of Internet growth) have all contributed to the advent of modern deep learning (DL) technology, a class of advanced ML approaches that displays superior performance in an ever-expanding range of domains. In the near future, ML is expected to power numerous aspects of modern society such as web searches, computer translation, content filtering on social media networks, healthcare, finance, and laws [2].

ML is an interdisciplinary field which shares common threads with the fields of statistics, optimization, information theory, and game theory. Most ML algorithms perform one of the following two types of pattern recognition tasks as shown in Fig. 1. In the first type, the algorithm tries to find some functional description of given data with the aim of predicting values for new inputs, …
Fig. 4. The complexity of classification problems depends on how the different classes of data are distributed across the variable space.

Fig. 5. Structure of a single-hidden-layer ANN with input vector x(l), target vector y(l) and actual output vector o(l).

… recurrent neural networks (RNNs). The analytical derivations presented in this paper are slightly different from those in standard introductory ML texts to better align with the fields of communications and signal processing. We will then provide an overview of applications of ML techniques in various aspects of optical communications and networking.

We emphasize that this is by no means an exhaustive and in-depth discussion of state-of-the-art ML techniques and their respective challenges. Also, the views presented are not the only way to understand the fundamental properties of ML methods. By discussing ML through the language of communications and DSP, we hope to provide a more intuitive understanding of ML, its relation to optical communications and networking, and why/where/how it can play a unique role in specific areas of optical communications and networking.

The rest of the paper is organized as follows. In Section II, we will illustrate the fundamental conditions that warrant the use of a neural network and discuss the technical details of ANNs and SVMs. Section III will describe a range of basic unsupervised ML techniques and briefly discuss reinforcement learning (RL). Section IV will be devoted to more recent ML algorithms. Section V will provide an overview of existing ML applications in optical communications and networking while Section VI will discuss their future role. Links for online resources and codes for standard ML algorithms will be provided in Section VII. Section VIII will conclude the paper. A video presentation of this review paper is available at [6].

II. ANNS AND SVMS

What are the conditions that need ML for classification? Fig. 4 shows three scenarios with 2-dimensional (2D) data x = [x1 x2]^T and their respective class labels depicted as 'o' and '×' in the figure. In the first case, classifying the data is straightforward: the decision rule is to see whether σ(x1 − c) or σ(x2 − c) is greater or less than 0, where σ(·) is the decision function as shown. The second case is slightly more complicated as the decision boundary is a slanted straight line. However, a simple rotation and shifting of the input, i.e., Wx + b, will map one class of data to below zero and the other class above. Here, the rotation and shifting are described by matrix W and vector b, respectively. This is followed by the decision function σ(Wx + b). The third case is even more complicated. The region for the 'green' class depends on the outputs of the 'red' and 'blue' decision boundaries. Therefore, one will need to implement an extra decision step to label the 'green' region. The graphical representation of this 'decision of decisions' algorithm is the simplest form of an ANN [7]. The intermediate decision output units are known as hidden neurons and they form the hidden layer.

A. Artificial Neural Networks (ANNs)

Let {(x(1), y(1)), (x(2), y(2)), . . . , (x(L), y(L))} be a set of L input-output pairs of M- and K-dimensional column vectors. ANNs are information processing systems comprising an input layer, one or more hidden layers, and an output layer. The structure of a single-hidden-layer ANN with M input, H hidden and K output neurons is shown in Fig. 5. Neurons in two adjacent layers are interconnected, where each connection has a variable weight assigned. Such an ANN architecture is the simplest and most commonly used one [7]. The number of neurons M in the input layer is determined by the dimension of the input data vectors x(l). The hidden layer enables the modeling of complex relationships between the input and output parameters of an ANN. There are no fixed rules for choosing the optimum number of neurons for a given hidden layer and the optimum number of hidden layers in an ANN. Typically, the selection is made via experimentation, experience and other prior knowledge of the problem. These are known as the hyperparameters of an ANN. For regression problems, the dimension K of the vectors y(l) depends on the actual problem nature. For classification problems, K typically equals the number of class labels, such that if a data point x(l) belongs to class k, y(l) = [0 0 · · · 0 1 0 · · · 0 0]^T where the '1' is located at the kth position. This is called one-hot encoding. The ANN output o(l) will naturally have the same dimension as y(l) and the mapping between input x(l) and o(l) can be expressed as

$$ o(l) = \sigma_2(r(l)) = \sigma_2\left(W_2 u(l) + b_2\right) = \sigma_2\left(W_2 \sigma_1(q(l)) + b_2\right) = \sigma_2\left(W_2 \sigma_1\left(W_1 x(l) + b_1\right) + b_2\right) \tag{1} $$

where σ1(·) and σ2(·) are the activation functions for the hidden and output layer neurons, respectively. W1 and W2 are matrices containing the weights of connections between the input and
hidden layer neurons and between the hidden and output layer neurons, respectively, while b1 and b2 are the bias vectors for the hidden and output layer neurons, respectively. For a vector z = [z1 z2 · · · zK] of length K, σ1(·) is typically an element-wise nonlinear function such as the sigmoid function

$$ \sigma_1(z) = \left[ \frac{1}{1+e^{-z_1}} \;\; \frac{1}{1+e^{-z_2}} \;\; \cdots \;\; \frac{1}{1+e^{-z_K}} \right]. \tag{2} $$

As for the output layer neurons, σ2(·) is typically chosen to be a linear function for regression problems. In classification problems, one will normalize the output vector o(l) using the softmax function, i.e.,

$$ o(l) = \mathrm{softmax}\left(W_2 u(l) + b_2\right) \tag{3} $$

where

$$ \mathrm{softmax}(z) = \frac{1}{\sum_{k=1}^{K} e^{z_k}} \left[ e^{z_1} \; e^{z_2} \; \cdots \; e^{z_K} \right]. \tag{4} $$

The softmax operation ensures that the ANN outputs conform to a probability distribution, for reasons we will discuss below.

To train the ANN is to optimize all the parameters θ = {W1, W2, b1, b2} such that the difference between the actual ANN outputs o and the target outputs y is minimized. One commonly used objective function (also called loss function in the ML literature) to optimize is the mean square error (MSE)

$$ E = \frac{1}{L}\sum_{l=1}^{L} E(l) = \frac{1}{L}\sum_{l=1}^{L} \left\| o(l) - y(l) \right\|^2. \tag{5} $$

Like most optimization procedures in practice, gradient descent is used instead of full analytical optimization. In this case, the parameter estimates for the (n+1)th iteration are given by

$$ \theta^{(n+1)} = \theta^{(n)} - \alpha \left. \frac{\partial E}{\partial \theta} \right|_{\theta^{(n)}} \tag{6} $$

where the step size α is known as the learning rate. Note that for computational efficiency, one can use a single input-output pair instead of all the L pairs for each iteration in Eq. (6). This is known as stochastic gradient descent (SGD), which is the standard optimization method used in common adaptive DSP such as the constant modulus algorithm (CMA) and least mean squares (LMS) algorithm. As a trade-off between computational efficiency and accuracy, one can use a mini-batch of data {(x(nP + 1), y(nP + 1)), (x(nP + 2), y(nP + 2)), . . . , (x(nP + P), y(nP + P))} of size P for the nth iteration instead. This can reduce the stochastic nature of SGD and improve accuracy. When all the data set has been used, the update algorithm will have completed one epoch. However, it is often the case that one epoch equivalent of updates is not enough for all the parameters to converge to their optimal values. Therefore, one can reuse the data set and the algorithm goes through a 2nd epoch for further parameter updates. There is no fixed rule to determine the number of epochs required for convergence [8].

The update algorithm comprises the following main steps: (i) Model initialization: all the ANN weights and biases are randomly initialized, e.g., by drawing random numbers from a normal distribution with zero mean and unit variance; (ii) Forward propagation: in this step, the inputs x are passed through the network to generate the outputs o using Eq. (1). The input can be a single data point, a mini-batch or the complete set of L inputs. This step is named so because the computation flow is in the natural forward direction, i.e., starting from the input, passing through the network, and going to the output; (iii) Backward propagation and weights/biases update: for simplicity, let us assume SGD using one input-output pair (x(n), y(n)) for the (n+1)th iteration, a sigmoid activation function for the hidden layer neurons and a linear activation function for the output layer neurons such that o(n) = W2 u(n) + b2. The parameters W2, b2 will be updated first, followed by W1, b1. Since E(n) = ‖o(n) − y(n)‖² and ∂E(n)/∂o(n) = 2(o(n) − y(n)), the corresponding update equations are

$$ W_2^{(n+1)} = W_2^{(n)} - 2\alpha \sum_{k=1}^{K} \left( o_k(n) - y_k(n) \right) \frac{\partial o_k(n)}{\partial W_2}, \qquad b_2^{(n+1)} = b_2^{(n)} - 2\alpha \, \frac{\partial o(n)}{\partial b_2}\left( o(n) - y(n) \right) \tag{7} $$

where ok(n) and yk(n) denote the kth element of the vectors o(n) and y(n), respectively. In this case, ∂o(n)/∂b2 is the Jacobian matrix in which the jth row and mth column is the derivative of the mth element of o(n) with respect to the jth element of b2. Also, the jth row and mth column of the matrix ∂ok(n)/∂W2 denotes the derivative of ok(n) with respect to the jth row and mth column of W2. Interested readers are referred to [9] for an overview of matrix calculus. Since o(n) = W2 u(n) + b2, ∂o(n)/∂b2 is simply the identity matrix. For ∂ok(n)/∂W2, its kth row is equal to u(n)^T (where (·)^T denotes transpose) and it is zero otherwise.¹ Eq. (7) can be simplified as

$$ W_2^{(n+1)} = W_2^{(n)} - 2\alpha \left( o(n) - y(n) \right) u(n)^T, \qquad b_2^{(n+1)} = b_2^{(n)} - 2\alpha \left( o(n) - y(n) \right). \tag{8} $$

With the updated W2^(n+1) and b2^(n+1), one can calculate

$$ W_1^{(n+1)} = W_1^{(n)} - 2\alpha \sum_{k=1}^{K} \left( o_k(n) - y_k(n) \right) \frac{\partial o_k(n)}{\partial W_1}, \qquad b_1^{(n+1)} = b_1^{(n)} - 2\alpha \, \frac{\partial o(n)}{\partial b_1}\left( o(n) - y(n) \right). \tag{9} $$

Since the derivative of the sigmoid function is given by σ1′(z) = σ1(z) ∘ (1 − σ1(z)), where ∘ denotes element-wise multiplication and 1 denotes a column vector of 1's with the same length as z,

$$ \frac{\partial o(n)}{\partial b_1} = \frac{\partial q(n)}{\partial b_1}\,\frac{\partial u(n)}{\partial q(n)}\,\frac{\partial o(n)}{\partial u(n)} = \mathrm{diag}\left\{ u(n) \circ \left(1 - u(n)\right) \right\} \cdot W_2^{(n+1)T}. \tag{10} $$

¹ One can also express the update of W2 using the 3rd-order tensor notation ∂o(n)/∂W2 instead of the sum over ∂ok(n)/∂W2.
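To make the above procedure concrete, the following is a minimal NumPy sketch of a single-hidden-layer ANN trained with SGD according to Eqs. (1) and (6)–(10): sigmoid hidden layer, linear output layer, MSE loss. The network dimensions and the toy regression data are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative dimensions: M inputs, H hidden neurons, K outputs, L training pairs
M, H, K, L = 4, 8, 2, 1000
rng = np.random.default_rng(0)

# Toy regression data set {(x(l), y(l))}; any smooth target works for illustration
X = rng.normal(size=(L, M))
Y = np.stack([np.sin(X[:, 0]), X[:, 1] * X[:, 2]], axis=1)

# Step (i), model initialization: random weights, zero biases
W1 = rng.normal(0, 1, size=(H, M)); b1 = np.zeros((H, 1))
W2 = rng.normal(0, 1, size=(K, H)); b2 = np.zeros((K, 1))
alpha = 0.01  # learning rate of Eq. (6)

for epoch in range(20):
    for l in rng.permutation(L):                 # SGD: one (x, y) pair per iteration
        x = X[l].reshape(M, 1); y = Y[l].reshape(K, 1)
        # Step (ii), forward propagation per Eq. (1)
        q = W1 @ x + b1
        u = sigmoid(q)
        o = W2 @ u + b2
        # Step (iii), backward propagation: output layer first, Eq. (8)
        e = o - y
        W2 = W2 - 2 * alpha * e @ u.T
        b2 = b2 - 2 * alpha * e
        # Then the hidden layer, Eqs. (9)-(10), using the updated W2
        do_db1 = np.diag((u * (1 - u)).ravel()) @ W2.T
        W1 = W1 - 2 * alpha * (do_db1 @ e) @ x.T
        b1 = b1 - 2 * alpha * (do_db1 @ e)

print("final MSE:", np.mean((W2 @ sigmoid(W1 @ X.T + b1) + b2 - Y.T) ** 2))
```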
$$ E = -\frac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{K} y_k(l)\log\left(o_k(l)\right) + \underbrace{\frac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{K} y_k(l)\log\left(y_k(l)\right)}_{=\,0} = \frac{1}{L}\sum_{l=1}^{L}\sum_{k=1}^{K} y_k(l)\log\frac{y_k(l)}{o_k(l)} \tag{15} $$

Fig. 9. Example showing how a linearly inseparable problem (in the original 2D data space) can undergo a nonlinear transformation and become a linearly separable one in the 3-dimensional (3D) feature space.

Fig. 10. (a) Mapping from the input space to a higher-dimensional feature space using a nonlinear kernel function ϕ. (b) Separation of two data classes in the feature space through an optimal hyperplane.

… points is conceptually analogous to finding a maximum likelihood decision boundary. The borderline points, represented as a solid dot and square in Fig. 10(b), are referred to as support vectors and are often the most informative for the classification task.

More technically, in the feature space, a general hyperplane is defined as w^T v + b = 0. If it classifies all the data points correctly, all the blue points will lie in the region w^T v + b > 0 and the red points will lie in the region w^T v + b < 0. We seek to find a hyperplane w^T v + b = 0 that maximizes the margin d, as shown in Fig. 10(b). Without loss of generality, let the point v(i) reside on the hyperplane w^T v + b = 1 and be closest to the hyperplane w^T v + b = 0, on which v⁺ resides. Since the vectors v(i) − v⁺, w and the angle φ are related by cos φ = w^T(v(i) − v⁺)/(‖w‖‖v(i) − v⁺‖), the margin d is given as

$$ d = \left\|v(i) - v^+\right\|\cos\varphi = \left\|v(i) - v^+\right\| \cdot \frac{w^T\left(v(i) - v^+\right)}{\|w\|\left\|v(i) - v^+\right\|} = \frac{w^T v(i) - w^T v^+}{\|w\|} = \frac{w^T v(i) + b}{\|w\|} = \frac{1}{\|w\|}. \tag{17} $$

Maximizing the margin d is therefore equivalent to minimizing ‖w‖, i.e., solving

$$ \underset{w,\,b}{\operatorname{argmin}} \ \|w\| \quad \text{subject to} \quad y(l)\left(w^T v(l) + b\right) \ge 1, \quad l = 1, 2, \ldots, L, \tag{18} $$

and thus standard convex programming software packages such as CVXOPT [16] can be used to solve Eq. (18).

Let us come back to the task of choosing the nonlinear function ϕ(·) that maps the original input space x to the feature space v. For an SVM, one would instead find a kernel function K(x(i), x(j)) = ϕ(x(i)) · ϕ(x(j)) = v(i)^T v(j) that maps to the inner product. Typical kernel functions include:
• Polynomials: K(x(i), x(j)) = (x(i)^T x(j) + a)^b for some scalars a, b;
• Gaussian radial basis function: K(x(i), x(j)) = exp(−a‖x(i) − x(j)‖²) for some scalar a;
• Hyperbolic tangent: K(x(i), x(j)) = tanh(a x(i)^T x(j) + b) for some scalars a, b.

The choice of a kernel function is often determined by the designer's knowledge of the problem domain [3]. Note that a larger separation margin typically results in better generalization of the SVM classifier. SVMs often demonstrate better generalization performance than conventional ANNs in various pattern recognition applications. Furthermore, multiple SVMs can be applied to the same data set to realize non-binary classifications such as detecting 16-QAM signals [17]–[19] (to be discussed in more detail in Section V).

It should be noted that ANNs and SVMs can be seen as two complementary approaches for solving classification problems. While an ANN derives curved decision boundaries in the input variable space, the SVM performs nonlinear transformations of the input variables followed by determining a simple decision boundary or hyperplane, as shown in Fig. 11.
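Since Eq. (18) is a convex program, it can indeed be handed to a package such as CVXOPT [16]. The snippet below is a rough sketch that solves the standard dual of the hard-margin problem with a Gaussian RBF kernel on synthetic 2D data; the dual formulation, the toy data and the kernel parameter are textbook assumptions rather than material from this paper.

```python
import numpy as np
from cvxopt import matrix, solvers

def rbf(X1, X2, a=1.0):
    # Gaussian radial basis function kernel K(x_i, x_j) = exp(-a ||x_i - x_j||^2)
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-a * d2)

# Toy, linearly inseparable 2D data: class +1 inside a disc, -1 outside a ring,
# with a gap left between the two classes so that a hard margin exists
rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(400, 2))
r2 = np.sum(X**2, 1)
keep = (r2 < 1.2) | (r2 > 1.8)
X, r2 = X[keep], r2[keep]
y = np.where(r2 < 1.2, 1.0, -1.0)

# Dual of the hard-margin problem of Eq. (18):
#   min_a 0.5 a^T P a - 1^T a, s.t. a >= 0, y^T a = 0, with P_ij = y_i y_j K(x_i, x_j)
K = rbf(X, X)
P = matrix(np.outer(y, y) * K)
q = matrix(-np.ones(len(y)))
G = matrix(-np.eye(len(y)))          # encodes a >= 0
h = matrix(np.zeros(len(y)))
A = matrix(y.reshape(1, -1))
b = matrix(0.0)
solvers.options['show_progress'] = False
alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])

sv = alpha > 1e-5                    # support vectors: borderline points with non-zero alpha
b_off = np.mean(y[sv] - (alpha * y) @ K[:, sv])

def predict(Xnew):
    return np.sign((alpha * y) @ rbf(X, Xnew) + b_off)

print("training accuracy:", np.mean(predict(X) == y))
```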
III. UNSUPERVISED AND REINFORCEMENT LEARNING

The ANN and SVM are examples of the supervised learning approach, in which the class labels y of the training data are known. Based on this data, the ML algorithm generalizes to react accurately to new data to the best possible extent. Supervised learning can be considered as a closed-loop feedback system, as the error between the ML algorithm's actual outputs and the targets is used as a feedback signal to guide the learning process.

In unsupervised learning, the ML algorithm is not provided with correct labels of the training data. Rather, it learns to identify similarities between various inputs with the aim to either categorize together those inputs which have something in common or to determine some better representation/description of the original input data. It is referred to as "unsupervised" because the ML algorithm is not told what the output should be; rather, it has to come up with it itself [20]. One example of unsupervised learning is data clustering as shown in Fig. 12.

Fig. 12. Data clustering based on unsupervised learning.

Unsupervised learning is becoming more and more important because in many real circumstances it is practically not possible to obtain labeled training data. In such scenarios, an unsupervised learning algorithm can be applied to discover some similarities between different inputs for itself. Unsupervised learning is typically used in tasks such as clustering, vector quantization, dimensionality reduction, and feature extraction. It is also often employed as a preprocessing tool for extracting useful (in some particular context) features of the raw data before supervised learning algorithms can be applied. We hereby provide a review of a few key unsupervised learning techniques.

A. K-Means Clustering

Let {x(1), x(2), . . . , x(L)} be the set of data points which is to be split into K clusters C1, C2, . . . , CK. K-means clustering is an iterative unsupervised learning algorithm which aims to partition the L observations into K clusters such that the sum of squared errors for data points within a group is minimized [14]. An example of this algorithm is graphically shown in Fig. 13. The algorithm initializes by randomly picking K locations µ(j), j = 1, 2, . . . , K, as cluster centers. This is followed by two iterative steps. In the first step, each data point x(i) is assigned to the cluster Ck with the minimum Euclidean distance, i.e.,

$$ C_k = \left\{ x(i) : \left\|x(i) - \mu(k)\right\| < \left\|x(i) - \mu(j)\right\| \;\; \forall j \in \{1, 2, \ldots, K\} \setminus \{k\} \right\}. \tag{19} $$

In the second step, each cluster center µ(k) is updated to the mean of the data points currently assigned to Ck. The two steps are repeated iteratively until the cluster centers converge. Several variants of the K-means algorithm have been proposed over the years to improve its computational efficiency as well as to achieve smaller errors. These include fuzzy K-means, hierarchical K-means, K-means++, K-medians, K-medoids, etc.
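A minimal NumPy sketch of the two iterative K-means steps described above (the assignment rule of Eq. (19) followed by the center update) is given below; the three-cluster toy data are assumed purely for illustration.

```python
import numpy as np

def k_means(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]      # random initial cluster centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Step 1 (Eq. (19)): assign each x(i) to the nearest center in Euclidean distance
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # L x K distances
        labels = np.argmin(d, axis=1)
        # Step 2: move each center to the mean of the points assigned to it
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                   # cluster centers have converged
            break
        mu = new_mu
    return mu, labels

# Toy data: three well-separated 2D clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ([0, 0], [3, 3], [0, 3])])
centers, labels = k_means(X, K=3)
print(np.round(centers, 2))
```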
B. Expectation-Maximization (EM) Algorithm

One drawback of the K-means algorithm is that it requires the use of hard decision boundaries, whereby a data point can only be assigned to one cluster even though it might lie somewhere midway between two or more clusters. The EM algorithm is an improved clustering technique which assigns a probability to the data point belonging to each cluster rather than forcing it to belong to one particular cluster during each iteration [20]. The algorithm assumes that a given data distribution can be modeled as a superposition of K jointly Gaussian probability distributions with distinct means and covariance matrices µ(k), Σ(k) (also referred to as Gaussian mixture models). The EM algorithm is a two-step iterative procedure comprising expectation (E) and maximization (M) steps [3]. The E step computes the a posteriori probability of the class label given each data point using the current means and covariance matrices of the Gaussians, i.e.,

$$ p_{ij} = p\left(C_j \mid x(i)\right) = \frac{p\left(x(i) \mid C_j\right) p\left(C_j\right)}{\sum_{k=1}^{K} p\left(x(i) \mid C_k\right) p\left(C_k\right)} = \frac{\mathcal{N}\left(x(i) \mid \mu(j), \Sigma(j)\right)}{\sum_{k=1}^{K} \mathcal{N}\left(x(i) \mid \mu(k), \Sigma(k)\right)} \tag{21} $$

where N(x(i)|µ(k), Σ(k)) is the Gaussian PDF with mean µ(k) and covariance matrix Σ(k). Note that we have inherently assumed equal probability p(Cj) of each class, which is a valid …
… are initialized with random means and unit covariance matrices and are depicted using red and blue circles. The results after the first E step are shown in Fig. 14(b), where the posterior probabilities in Eq. (21) are expressed by the proportion of red and blue colors for each data point. Fig. 14(c) depicts the results after the first M step, where the means and covariance matrices of the red and blue Gaussian distributions are updated using Eq. (22), which in turn uses the posterior probabilities computed by Eq. (21). This completes the 1st iteration of the EM algorithm. Fig. 14(d) to (f) show the results after 2, 5 and 20 complete EM iterations, respectively, where the convergence of the algorithm and consequently the effective splitting of the data points into two clusters can be clearly observed.
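The following NumPy/SciPy sketch mirrors the E step of Eq. (21); since the M step of Eq. (22) is not reproduced in this excerpt, the standard Gaussian-mixture mean/covariance updates are assumed in its place, and equal class probabilities p(Cj) are kept fixed as in the text. The two-cluster toy data are illustrative only.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K=2, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    L, D = X.shape
    mu = X[rng.choice(L, K, replace=False)]            # random initial means
    Sigma = np.array([np.eye(D)] * K)                  # unit covariance matrices
    for _ in range(n_iter):
        # E step, Eq. (21): p_ij = N(x(i)|mu(j),Sigma(j)) / sum_k N(x(i)|mu(k),Sigma(k))
        # (equal class probabilities p(C_j) are assumed, as in the text)
        pdfs = np.stack([multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
                         for k in range(K)], axis=1)   # L x K
        p = pdfs / pdfs.sum(axis=1, keepdims=True)
        # M step (standard Gaussian-mixture updates for the means and covariances)
        for k in range(K):
            w = p[:, k][:, None]
            mu[k] = (w * X).sum(axis=0) / w.sum()
            diff = X - mu[k]
            Sigma[k] = (w * diff).T @ diff / w.sum() + 1e-6 * np.eye(D)
    return mu, Sigma, p

# Toy data: two overlapping 2D Gaussian clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, size=(150, 2)),
               rng.normal([3, 2], 1.0, size=(150, 2))])
mu, Sigma, posteriors = em_gmm(X)
print(np.round(mu, 2))
```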
C. Principal Component Analysis (PCA)

PCA is an unsupervised learning technique for feature extraction and data representation [21], [22]. It is often used as a preprocessing tool in many pattern recognition applications for the extraction of limited but most critical data features. The central idea behind PCA is to project the original high-dimensional data onto a lower-dimensional feature space that retains most …

Fig. 15. Example to illustrate the concept of PCA. (a) Data points in the original 3D data space; (b) three PCs ordered according to the variability in the original data; (c) projection of data points onto a plane defined by the first two PCs while discarding the third one.

… The dimension S of the retained subspace is chosen such that the eigenvalues λi of the data covariance matrix satisfy

$$ \frac{\sum_{i=1}^{S} \lambda_i}{\sum_{i=1}^{M} \lambda_i} > R \tag{24} $$

where R is typically above 0.9 [22]. Note that, as compared to the original M-dimensional data space, the chosen eigenvectors span only an S-dimensional subspace that in a way captures most of the data information. One can understand such a procedure intuitively by noting that, for a covariance matrix, finding the eigenvectors with large eigenvalues corresponds to finding linear combinations or particular directions of the input space that give large variances, which is exactly what we want to capture. A data vector x can then be approximated as a weighted sum of the chosen eigenvectors in this subspace, i.e.,

$$ x \approx \sum_{i=1}^{S} w_i \, \mu(i) \tag{25} $$

where µ(i), i = 1, 2, . . . , S, are the chosen orthogonal eigenvectors such that

$$ \mu(m)^T \mu(l) = \begin{cases} 1 & \text{if } l = m \\ 0 & \text{if } l \ne m. \end{cases} \tag{26} $$
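A compact NumPy sketch of the PCA procedure of Eqs. (24)–(26) is shown below: eigendecomposition of the data covariance matrix, selection of S from the eigenvalue ratio, and projection onto the retained eigenvectors. The threshold R = 0.95 and the toy 3D data (which are essentially two-dimensional, as in Fig. 15) are illustrative assumptions.

```python
import numpy as np

def pca(X, R=0.95):
    # Center the data and form the covariance matrix
    Xc = X - X.mean(axis=0)
    C = np.cov(Xc, rowvar=False)
    lam, U = np.linalg.eigh(C)                    # eigenvalues/eigenvectors (ascending order)
    lam, U = lam[::-1], U[:, ::-1]                # reorder by decreasing variance
    # Choose S as the smallest subspace dimension satisfying Eq. (24)
    S = int(np.searchsorted(np.cumsum(lam) / lam.sum(), R) + 1)
    W = Xc @ U[:, :S]                             # weights w_i of Eq. (25)
    X_approx = X.mean(axis=0) + W @ U[:, :S].T    # reconstruction from the S retained PCs
    return U[:, :S], W, X_approx, S

# Toy 3D data lying close to a 2D plane plus a small amount of noise
rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 3))
X = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 3))
PCs, W, X_approx, S = pca(X, R=0.95)
print("retained PCs:", S, " reconstruction MSE:", round(float(np.mean((X - X_approx) ** 2)), 5))
```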
Fig. 19. Schematic diagram of a three-hidden-layer DNN (top). Two autoencoders used for the pretraining of the first two hidden layers of the DNN (bottom). The decoder parts in both autoencoders are shown in grey color with dotted weight lines.

… spatial locations is the same as convolving the input image with g(−sx, −sy) (hence the name convolutional neural networks). Alternatively, one can also view the w^T(·) + b operation as cross-correlating g(sx, sy) with the input image. Therefore, a high value will result if that part of the input image resembles g(sx, sy). Together with the decision-like nonlinear activation function, the overall feature map indicates which location in the original image best resembles g(sx, sy), which essentially tries to identify and locate a certain feature in the input image. With this insight, the interleaving convolutional and sub-sampling layers can be intuitively understood as identifying higher-level and more complex features of the input image.

The training of a CNN is performed using a modified BP algorithm which updates the convolutional filters' weights and also takes the sub-sampling layers into account. Since a lot of weights are supposedly identical, as the network is essentially performing the convolution operation, one will update those weights using the average of the corresponding gradients.

… computed as

$$ h(t) = \sigma_1\left(W_1 x(t) + W_r h(t-1) + b_1\right) \tag{28} $$
$$ o(t) = \sigma_2\left(W_2 h(t) + b_2\right) \tag{29} $$

where b1 and b2 are the bias vectors while σ1(·) and σ2(·) are the activation functions for the hidden and output layer neurons, respectively. Given a data set {(x(1), y(1)), (x(2), y(2)), . . . , (x(L), y(L))} of input-output pairs, the RNN is first unfolded in time to represent it as a multilayer network and then the BP algorithm is applied on this graph, as shown in Fig. 23, to compute all the necessary matrix derivatives {∂E/∂W1, ∂E/∂W2, ∂E/∂Wr, ∂E/∂b1, ∂E/∂b2}. The loss function can be cross-entropy or MSE. The matrix derivative ∂E/∂Wr is a bit more complicated to calculate since Wr is shared across all hidden layers. In this case,

$$ \frac{\partial E}{\partial W_r} = \sum_{t=1}^{L} \frac{\partial E(t)}{\partial W_r} = \sum_{t=1}^{L} \frac{\partial E(t)}{\partial h(t)}\,\frac{\partial h(t)}{\partial W_r} = \sum_{t=1}^{L} \sum_{l=1}^{t} \frac{\partial E(t)}{\partial h(t)}\,\frac{\partial h(t)}{\partial h(l)}\,\frac{\partial h(l)}{\partial W_r} \; \ldots $$
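A short NumPy sketch of the RNN recurrence of Eqs. (28)–(29) follows; the dimensions and random (untrained) weights are placeholders, and the unfolding in time needed for BP through the graph of Fig. 23 corresponds simply to the time loop below.

```python
import numpy as np

def rnn_forward(x_seq, W1, Wr, W2, b1, b2):
    """Run the recurrence of Eqs. (28)-(29) over an input sequence.
    x_seq has shape (T, M); returns hidden states (T, H) and outputs (T, K)."""
    H = Wr.shape[0]
    h = np.zeros(H)                                            # initial hidden state h(0)
    hs, os = [], []
    for x_t in x_seq:
        h = 1.0 / (1.0 + np.exp(-(W1 @ x_t + Wr @ h + b1)))    # Eq. (28), sigmoid hidden layer
        o = W2 @ h + b2                                        # Eq. (29), linear output layer
        hs.append(h); os.append(o)
    return np.array(hs), np.array(os)

# Illustrative dimensions and random parameters (placeholders, not trained values)
M, H, K, T = 3, 5, 2, 10
rng = np.random.default_rng(0)
W1, Wr, W2 = rng.normal(size=(H, M)), rng.normal(size=(H, H)), rng.normal(size=(K, H))
b1, b2 = np.zeros(H), np.zeros(K)
hidden, outputs = rnn_forward(rng.normal(size=(T, M)), W1, Wr, W2, b1, b2)
print(outputs.shape)   # (10, 2); BP through time would unfold this loop over the T steps
```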
A. Optical Performance Monitoring (OPM)

Fig. 25. Impairments-dependent patterns reflected by (a) eye-diagrams [34], (b) ADTPs [35] and (c) AHs [36], and their corresponding known impairments values which serve as data labels during the training process.

Optical communication networks are becoming increasingly complex, transparent and dynamic. Reliable operation and efficient management of these complex fiber-optic networks require incessant and real-time information of various channel impairments ubiquitously across the network, also known as OPM [33]. OPM is widely regarded as a key enabling technology for SDNs. Through OPM, SDNs can become aware of the real-time network conditions and subsequently adjust different transceiver/network element parameters such as launched powers, data rates, modulation formats, spectrum assignments, etc., for optimized transmission performance [4]. Unfortunately, conventional OPM techniques have shown limited success in simultaneous and independent monitoring of multiple transmission impairments, since the effects of different impairments are often difficult to separate analytically. Another crucial OPM requirement is low complexity, since the OPM devices need to be deployed ubiquitously across optical networks. ML techniques are proposed as an enabler for realizing low complexity (and hence low cost) multi-impairment monitoring in optical networks and have already shown tremendous potential.

Most existing ML-based OPM techniques adopt a supervised learning approach utilizing training data sets of labeled examples during the offline learning process of selected ML models. The training data may, e.g., consist of signal representations like eye-diagrams, asynchronous delay-tap plots (ADTPs), amplitude histograms (AHs), etc., and their corresponding known impairment values such as CD, differential group delay (DGD), OSNR, etc., serving as data labels, as shown in Fig. 25. During the training phase, the inputs to an ML model are the impairments-indicative feature vectors x of eye-diagrams/ADTPs/AHs while their corresponding labels y are used as the targets, as shown in Fig. 26(a). The ML model then learns the mapping between input features and the labels. Note that in case of eye-diagrams, the features can be parame…
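As a hedged illustration of the offline training and online monitoring phases of Fig. 26, the sketch below fits a small TensorFlow/Keras regression model that maps feature vectors to impairment labels; the synthetic features and the "[OSNR, CD]" labels are placeholders standing in for real eye-diagram/ADTP/AH data with known impairment values.

```python
import numpy as np
import tensorflow as tf

# Placeholder training data: feature vectors x (e.g., amplitude-histogram bin counts)
# and label vectors y (e.g., [OSNR in dB, CD in ps/nm]); real data would come from
# eye-diagrams/ADTPs/AHs with known impairment values, as in Fig. 25.
rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 1.0, size=(5000, 40))
y_train = np.stack([15 + 10 * X_train[:, :10].mean(axis=1),        # synthetic "OSNR"
                    800 * X_train[:, 10:20].mean(axis=1)], axis=1)  # synthetic "CD"

# Offline training phase (Fig. 26(a)): learn the mapping from features x to labels y
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='sigmoid', input_shape=(40,)),
    tf.keras.layers.Dense(2)                      # linear outputs for regression
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=10, batch_size=64, verbose=0)

# Online monitoring phase (Fig. 26(b)): the trained model outputs impairment estimates o
x_new = rng.uniform(0.0, 1.0, size=(1, 40))
print("estimated [OSNR, CD]:", model.predict(x_new, verbose=0))
```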
Fig. 26. (a) ML model during the offline training phase with feature vectors x
as inputs and the labels y as targets. (b) Trained ML model used for online OPM
with feature vectors x as inputs and the impairments estimates o as outputs.
TABLE I
SOME KEY ML-BASED FIBER NLC TECHNIQUES. PPD: PRE/POST-DISTORTION, DCF: DISPERSION-COMPENSATING FIBER, DSM: DIGITAL SUBCARRIER
MODULATION, PS: PROBABILISTIC SHAPED
… outperforms transmitter-side perturbation-based pre-distortion methods by 0.3 dB in both single-channel and WDM systems [65].

Open issues: Table I shows some key techniques using ML for fiber NLC. Most of these works incorporate ML as an extra DSP module placed either at the transmitter or the receiver. While effective to a certain extent, it is not clear what is the best sequence of conventional signal processing and ML blocks in such a hybrid DSP configuration. One factor driving the choice of sequence is the dynamic effects such as carrier frequency offset, laser phase noise, PMD, etc., that are hard to capture in the learning process of an ML algorithm. In this case, one can perform ML-based NLC after linear compensations so as to avoid tackling these time-varying dynamics in ML. In the other extreme, RNN structures can embrace all the time-varying dynamics in principle, but this may be an overkill since we do know their underlying physics and it should be exploited in the overall DSP design. Also, in case of hybrid configurations, the accuracy of conventional DSP algorithms such as CMA or carrier phase estimation (CPE) plays a major role in the quality of the data sets on which ML fundamentally relies. Therefore, there are strong dependencies between ML and conventional DSP blocks, and the right balance is still an open area of research. Finally, to the best of our knowledge, an ML-based single-channel processing technique that outperforms DBP in practical WDM settings has yet to be developed.

Numerous studies are conducted to also address the computational complexity issues of conventional and ML techniques for fiber NLC. For conventional NLC algorithms, we direct the readers to the survey paper [66]. On the other hand, the computational complexity of ML algorithms for NLC varies significantly with the architecture and the training process used, which makes comparison with the conventional techniques difficult. Generally, the training processes are too complex to be performed online as they require a lot of iterations and potentially massive training data. For the inference phase (i.e., using the trained model for real-time data detection), most ML algorithms proposed involve relatively simple computations, leading to the perception that ML techniques are generally simple to implement since offline training processes are typically not counted towards the computational complexity. However, in reality, the training does take up a lot of computational resources and time, which should not be completely disregarded while evaluating the complexity of ML approaches for NLC.

C. Proactive Fault Detection

Reliable network operations are essential for the carriers to provide service guarantees, called service-level agreements (SLAs), to their customers regarding the system's availability and promised quality levels. Violation of these guarantees may result in severe penalties. It is thus highly desirable to have an early warning and proactive protection mechanism incorporated into the network. This can empower network operators to know when the network components are beginning to deteriorate, and preventive measures can then be taken to avoid serious disruptions [33].

Conventional fault detection and management tools in optical networks adopt a rigid approach where some fixed threshold limits are set by the system engineers and alarms are triggered to alert malfunctions if those limits are surpassed. Such traditional network protection approaches have the following main drawbacks: (i) These methods protect a network in a passive manner, i.e., they are unable to forecast the risks and tend to reduce the damages only after a failure occurs. This approach may result in the loss of immense amounts of data during the network recovery process once a failure happens. (ii) The inability to accurately forecast the faults leads to ultraconservative network designs involving large operating margins and protection switching paths, which in turn result in an underutilization of the system resources. (iii) They are unable to determine the root cause of faults. (iv) Apart from hard failures (i.e., the ones causing major signal disruptions), several kinds of soft failures (i.e., the ones degrading system performance slowly and slightly) may also occur in optical networks which cannot be easily detected using conventional methods.

ML-enabled proactive fault management has recently been conceived as a powerful means to assure reliable network operation [67]. Instead of using traditional fixed pre-engineered solutions, this new mechanism relies on dynamic data-driven operations, leveraging immense amounts of operational data retrieved through network monitors (e.g., using simple network management protocol (SNMP)). The data repository may include …
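To illustrate the "early warning" idea in the simplest possible terms, the sketch below extrapolates the trend of a monitored parameter and raises an alarm before a fixed threshold is actually crossed; the monitored quantity, the threshold and the linear trend model are illustrative assumptions only and do not represent the methods of [68]–[71].

```python
import numpy as np

# Synthetic monitored parameter, e.g., a daily performance margin (dB) slowly degrading
rng = np.random.default_rng(0)
days = np.arange(120)
margin = 6.0 - 0.02 * days + rng.normal(0, 0.15, size=days.size)

THRESHOLD = 3.0        # alarm limit a conventional fixed-threshold scheme would use
HORIZON = 30           # look-ahead window (days) for the early warning

# Fit a simple trend model on a sliding history window and extrapolate it forward;
# any regression model (ANN, etc.) could replace the least-squares line used here.
window = 60
coeffs = np.polyfit(days[-window:], margin[-window:], deg=1)
future_days = np.arange(days[-1] + 1, days[-1] + 1 + HORIZON)
predicted = np.polyval(coeffs, future_days)

if np.any(predicted < THRESHOLD):
    first = future_days[np.argmax(predicted < THRESHOLD)]
    print(f"Proactive warning: margin predicted to cross {THRESHOLD} dB on day {first}")
else:
    print("No threshold crossing predicted within the horizon")
```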
Looking to the future, we can foresee a vital role played by ML-based mechanisms across several diverse functional areas in optical networks, e.g., network planning and performance prediction, network maintenance and fault prevention, network resources allocation and management, etc. ML can also aid cross-layer optimization in future optical networks requiring big data analytics, since it can inherently learn and uncover hidden patterns and unknown correlations in big data, which can be extremely beneficial in solving complex network optimization problems. The ultimate objective of ML-driven next-generation optical networks will be to provide infrastructures which can monitor themselves, diagnose and resolve their problems, and provide intelligent and efficient services to the end users.

VII. ONLINE RESOURCES FOR ML ALGORITHMS

Standard ML algorithms' codes and examples are readily available online and one seldom needs to write their own codes from the very beginning. There are several off-the-shelf powerful frameworks available under open-source licenses such as TensorFlow, Pytorch, Caffe, etc. Matlab, which is widely used in optical communications research, is not the most popular programming language among the ML community. Instead, Python is the preferred language for ML research, partly because it is freely available, multi-platform, relatively easy to use/read, and has a huge number of libraries/modules available for a wide variety of tasks. We hereby report some useful resources including example Python codes using the TensorFlow library to help interested readers get started with applying simple ML algorithms to their problems. A more intuitive understanding of ANNs can be found at this visual playground [83]. The Python codes for most of the standard neural network architectures discussed in this paper can be found in these GitHub repositories [84], [85] with examples. For non-standard model design, TensorFlow also provides low-level programming interfaces for more custom and complex operations based on its symbolic building blocks, which are documented in detail in [86].

VIII. CONCLUSIONS

In this paper, we discussed how the rich body of ML techniques can be applied as a unique and powerful set of signal processing tools in fiber-optic communication systems. As optical networks become faster, more dynamic and more software-defined, we will see an increasing number of applications of ML and big data analytics in future networks to solve certain critical problems that cannot be easily tackled using conventional approaches. Basic knowledge and skills in ML will thus become necessary and beneficial for researchers in the field of optical communications and networking.

APPENDIX

For the cross-entropy loss function defined in Eq. (14), the derivative with respect to the output is given by

$$ \frac{\partial E(n)}{\partial o_j(n)} = -\frac{y_j(n)}{o_j(n)}. \tag{36} $$

With the softmax activation function for the output neurons,

$$ \frac{\partial o_j(n)}{\partial r_k(n)} = \frac{\left(\sum_{m=1}^{K} e^{r_m(n)}\right) e^{r_j(n)} \delta_{j,k} - e^{r_j(n)} \cdot e^{r_k(n)}}{\left(\sum_{m=1}^{K} e^{r_m(n)}\right)^2} = o_j(n)\,\delta_{j,k} - o_j(n)\,o_k(n) \tag{37} $$

where δj,k = 1 when j = k and 0 otherwise. Consequently,

$$ \frac{\partial E(n)}{\partial r_k(n)} = \sum_{j=1}^{K} \frac{\partial E(n)}{\partial o_j(n)}\,\frac{\partial o_j(n)}{\partial r_k(n)} = -\sum_{j=1}^{K} \frac{y_j(n)}{o_j(n)}\left(o_j(n)\,\delta_{j,k} - o_j(n)\,o_k(n)\right) = -\sum_{j=1}^{K} y_j(n)\left(\delta_{j,k} - o_k(n)\right) = o_k(n) - y_k(n) \tag{38} $$

as Σ_{j=1}^{K} yj(n) = 1. Therefore,

$$ \frac{\partial E(n)}{\partial r(n)} = o(n) - y(n). $$

Now, since ∂r(n)/∂b2, ∂r(n)/∂b1, ∂rk(n)/∂W2, ∂rk(n)/∂W1 are the same as ∂o(n)/∂b2, ∂o(n)/∂b1, ∂ok(n)/∂W2, ∂ok(n)/∂W1 for the MSE loss function and linear activation function for the output neurons (as o(n) = r(n) in that case), it follows that the update equations Eq. (8) to Eq. (12) also hold for ANNs with the cross-entropy loss function and softmax activation function for the output neurons.
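The result ∂E(n)/∂r(n) = o(n) − y(n) derived above can be checked numerically; the short sketch below compares Eq. (38) against a finite-difference derivative for a random softmax/cross-entropy example (the dimension K = 5 is an arbitrary choice for illustration).

```python
import numpy as np

def softmax(r):
    e = np.exp(r - r.max())          # subtracting the max does not change the result
    return e / e.sum()

def cross_entropy(o, y):
    return -np.sum(y * np.log(o))    # per-sample cross-entropy, consistent with Eq. (36)

rng = np.random.default_rng(0)
K = 5
r = rng.normal(size=K)               # pre-activation r(n) of the output layer
y = np.eye(K)[rng.integers(K)]       # one-hot target y(n)
o = softmax(r)

# Analytical gradient from Eq. (38): dE/dr = o - y
grad_analytical = o - y

# Finite-difference estimate of the same derivative
eps = 1e-6
grad_numerical = np.array([
    (cross_entropy(softmax(r + eps * np.eye(K)[k]), y) -
     cross_entropy(softmax(r - eps * np.eye(K)[k]), y)) / (2 * eps)
    for k in range(K)])

print(np.max(np.abs(grad_analytical - grad_numerical)))   # approximately 0, confirming Eq. (38)
```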
ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.

REFERENCES

[1] S. Marsland, Machine Learning: An Algorithmic Perspective, 2nd ed. Boca Raton, FL, USA: CRC Press, 2015.
[2] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
[3] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer-Verlag, 2006.
[4] Z. Dong, F. N. Khan, Q. Sui, K. Zhong, C. Lu, and A. P. T. Lau, "Optical performance monitoring: A review of current and future technologies," J. Lightw. Technol., vol. 34, no. 2, pp. 525–543, Jan. 2016.
[5] A. S. Thyagaturu, A. Mercian, M. P. McGarry, M. Reisslein, and W. Kellerer, "Software defined optical networks (SDONs): A comprehensive survey," IEEE Commun. Surv. Tut., vol. 18, no. 4, pp. 2738–2786, Oct.–Dec. 2016.
[6] [Online]. Available: https://www.youtube.com/channel/UCLZL8KsCzOODkDKBOI3S2lw
[7] R. A. Dunne, A Statistical Approach to Neural Networks for Pattern Recognition. Hoboken, NJ, USA: Wiley, 2007.
[8] I. Kaastra and M. Boyd, "Designing a neural network for forecasting financial and economic time series," Neurocomputing, vol. 10, no. 3, pp. 215–236, Apr. 1996.
[9] C. D. Meyer, Matrix Analysis and Applied Linear Algebra. Philadelphia, PA, USA: SIAM, 2000.
[10] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York, NY, USA: Wiley, 2007.
[11] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proc. Int. Conf. Artif. Intell. Statist., Fort Lauderdale, FL, USA, 2011, vol. 15, pp. 315–323.
[12] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proc. Int. Conf. Comput. Vis., Santiago, Chile, 2015, pp. 1026–1034.
[13] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proc. Int. Conf. Artif. Intell. Statist., Sardinia, Italy, 2010, pp. 249–256.
[14] A. Webb, Statistical Pattern Recognition, 2nd ed. Chicester, U.K.: Wiley, 2002.
[15] A. Statnikov, C. F. Aliferis, D. P. Hardin, and I. Guyon, A Gentle Introduction to Support Vector Machines in Biomedicine. Singapore: World Scientific, 2011.
[16] M. S. Andersen, J. Dahl, and L. Vandenberghe, "CVXOPT: Python software for convex optimization." [Online]. Available: https://cvxopt.org. Accessed on: Feb. 5, 2019.
[17] M. Li, S. Yu, J. Yang, Z. Chen, Y. Han, and W. Gu, "Nonparameter nonlinear phase noise mitigation by using M-ary support vector machine for coherent optical systems," IEEE Photon. J., vol. 5, no. 6, Dec. 2013, Art. no. 7800312.
[18] D. Wang et al., "Nonlinear decision boundary created by a machine learning-based classifier to mitigate nonlinear phase noise," in Proc. Eur. Conf. Opt. Commun., Valencia, Spain, 2015, Paper P.3.16.
[19] T. Nguyen, S. Mhatli, E. Giacoumidis, L. V. Compernolle, M. Wuilpart, and P. Mégret, "Fiber nonlinearity equalizer based on support vector classification for coherent optical OFDM," IEEE Photon. J., vol. 8, no. 2, Apr. 2016, Art. no. 7802009.
[20] M. Kirk, Thoughtful Machine Learning With Python. Sebastopol, CA, USA: O'Reilly Media, 2017.
[21] I. T. Jolliffe, Principal Component Analysis, 2nd ed. New York, NY, USA: Springer-Verlag, 2002.
[22] J. E. Jackson, A User's Guide to Principal Components. Hoboken, NJ, USA: Wiley, 2003.
[23] L. J. Cao, K. S. Chua, W. K. Chong, H. P. Lee, and Q. M. Gu, "A comparison of PCA, KPCA and ICA for dimensionality reduction in support vector machine," Neurocomputing, vol. 55, no. 1/2, pp. 321–336, Sep. 2003.
[24] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA, USA: MIT Press, 2018.
[25] Y. Bengio, "Learning deep architectures for AI," Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, Nov. 2009.
[26] Y. Bengio and O. Delalleau, "On the expressive power of deep architectures," in Algorithmic Learning Theory, J. Kivinen, C. Szepesvári, E. Ukkonen, and T. Zeugmann, Eds. Heidelberg, Germany: Springer-Verlag, 2011, pp. 18–36.
[27] L. J. Ba and R. Caurana, "Do deep nets really need to be deep?" in Proc. Neural Inf. Process. Syst., Montreal, QC, Canada, 2014, pp. 2654–2662.
[28] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, "Exploring strategies for training deep neural networks," J. Mach. Learn. Res., vol. 10, pp. 1–40, Jan. 2009.
[29] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[30] D. P. Mandic and J. Chambers, Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability. Chicester, U.K.: Wiley, 2001.
[31] R. Pascanu, C. Gulcehre, K. Cho, and Y. Bengio, "How to construct deep recurrent neural networks," arXiv:1312.6026, 2013.
[32] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in Proc. Int. Conf. Mach. Learn., Atlanta, GA, USA, 2013, pp. 1310–1318.
[33] F. N. Khan, Z. Dong, C. Lu, and A. P. T. Lau, "Optical performance monitoring for fiber-optic communication networks," in Enabling Technologies for High Spectral-Efficiency Coherent Optical Communication Networks, X. Zhou and C. Xie, Eds. Hoboken, NJ, USA: Wiley, 2016, ch. 14.
[34] X. Wu, J. A. Jargon, R. A. Skoog, L. Paraschis, and A. E. Willner, "Applications of artificial neural networks in optical performance monitoring," J. Lightw. Technol., vol. 27, no. 16, pp. 3580–3589, Aug. 2009.
[35] T. B. Anderson, A. Kowalczyk, K. Clarke, S. D. Dods, D. Hewitt, and J. C. Li, "Multi impairment monitoring for optical networks," J. Lightw. Technol., vol. 27, no. 16, pp. 3729–3736, Aug. 2009.
[36] F. N. Khan, Y. Zhou, A. P. T. Lau, and C. Lu, "Modulation format identification in heterogeneous fiber-optic networks using artificial neural networks," Opt. Express, vol. 20, no. 11, pp. 12422–12431, May 2012.
[37] F. N. Khan, T. S. R. Shen, Y. Zhou, A. P. T. Lau, and C. Lu, "Optical performance monitoring using artificial neural networks trained with empirical moments of asynchronously sampled signal amplitudes," IEEE Photon. Technol. Lett., vol. 24, no. 12, pp. 982–984, Jun. 2012.
[38] T. S. R. Shen, Q. Sui, and A. P. T. Lau, "OSNR monitoring for PM-QPSK systems with large inline chromatic dispersion using artificial neural network technique," IEEE Photon. Technol. Lett., vol. 24, no. 17, pp. 1564–1567, Sep. 2012.
[39] M. C. Tan, F. N. Khan, W. H. Al-Arashi, Y. Zhou, and A. P. T. Lau, "Simultaneous optical performance monitoring and modulation format/bit-rate identification using principal component analysis," J. Opt. Commun. Netw., vol. 6, no. 5, pp. 441–448, May 2014.
[40] P. Isautier, K. Mehta, A. J. Stark, and S. E. Ralph, "Robust architecture for autonomous coherent optical receivers," J. Opt. Commun. Netw., vol. 7, no. 9, pp. 864–874, Sep. 2015.
[41] N. G. Gonzalez, D. Zibar, and I. T. Monroy, "Cognitive digital receiver for burst mode phase modulated radio over fiber links," in Proc. Eur. Conf. Opt. Commun., Torino, Italy, 2010, Paper P6.11.
[42] F. N. Khan, Y. Zhou, Q. Sui, and A. P. T. Lau, "Non-data-aided joint bit-rate and modulation format identification for next-generation heterogeneous optical networks," Opt. Fiber Technol., vol. 20, no. 2, pp. 68–74, Mar. 2014.
[43] R. Borkowski, D. Zibar, A. Caballero, V. Arlunno, and I. T. Monroy, "Stokes space-based optical modulation format recognition in digital coherent receivers," IEEE Photon. Technol. Lett., vol. 25, no. 21, pp. 2129–2132, Nov. 2013.
[44] F. N. Khan, K. Zhong, W. H. Al-Arashi, C. Yu, C. Lu, and A. P. T. Lau, "Modulation format identification in coherent receivers using deep machine learning," IEEE Photon. Technol. Lett., vol. 28, no. 17, pp. 1886–1889, Sep. 2016.
[45] F. N. Khan et al., "Joint OSNR monitoring and modulation format identification in digital coherent receivers using deep neural networks," Opt. Express, vol. 25, no. 15, pp. 17767–17776, Jul. 2017.
[46] T. Tanimura, T. Hoshida, J. C. Rasmussen, M. Suzuki, and H. Morikawa, "OSNR monitoring by deep neural networks trained with asynchronously sampled data," in Proc. OptoElectron. Commun. Conf., Niigata, Japan, 2016, Paper TuB3-5.
[47] T. Tanimura, T. Kato, S. Watanabe, and T. Hoshida, "Deep neural network based optical monitor providing self-confidence as auxiliary output," in Proc. Eur. Conf. Opt. Commun., Rome, Italy, 2018, Paper We1D.5.
[48] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, Jun. 2014.
[49] D. Wang et al., "Modulation format recognition and OSNR estimation using CNN-based deep learning," IEEE Photon. Technol. Lett., vol. 29, no. 19, pp. 1667–1670, Oct. 2017.
[50] A. S. Kashi et al., "Fiber nonlinear noise-to-signal ratio monitoring using artificial neural networks," in Proc. Eur. Conf. Opt. Commun., Gothenburg, Sweden, 2017, Paper M.2.F.2.
[51] F. J. V. Caballero et al., "Machine learning based linear and nonlinear noise estimation," J. Opt. Commun. Netw., vol. 10, no. 10, pp. D42–D51, Oct. 2018.
[52] E. Ip, "Nonlinear compensation using backpropagation for polarization-multiplexed transmission," J. Lightw. Technol., vol. 28, no. 6, pp. 939–951, Mar. 2010.
[53] P. Poggiolini and Y. Jiang, "Recent advances in the modeling of the impact of nonlinear fiber propagation effects on uncompensated coherent transmission systems," J. Lightw. Technol., vol. 35, no. 3, pp. 458–480, Feb. 2017.
[54] A. Carena, G. Bosco, V. Curri, Y. Jiang, P. Poggiolini, and F. Forghieri, "EGN model of non-linear fiber propagation," Opt. Express, vol. 22, no. 13, pp. 16335–16362, Jun. 2014.
[55] T. S. R. Shen and A. P. T. Lau, "Fiber nonlinearity compensation using extreme learning machine for DSP-based coherent communication systems," in Proc. OptoElectron. Commun. Conf., Kaohsiung, Taiwan, 2011, pp. 816–817.
[56] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: Theory and applications," Neurocomputing, vol. 70, no. 1–3, pp. 489–501, Dec. 2006.
[57] D. Zibar et al., "Nonlinear impairment compensation using expectation maximization for dispersion managed and unmanaged PDM 16-QAM transmission," Opt. Express, vol. 20, no. 26, pp. B181–B196, Dec. 2012.
[58] E. Alpaydin, Introduction to Machine Learning, 2nd ed. Cambridge, MA, USA: MIT Press, 2010.
[59] E. Giacoumidis et al., "Reduction of nonlinear intersubcarrier intermixing in coherent optical OFDM by a fast Newton-based support vector machine nonlinear equalizer," J. Lightw. Technol., vol. 35, no. 12, pp. 2391–2397, Jun. 2017.
[60] E. Giacoumidis et al., "Fiber nonlinearity-induced penalty reduction in CO-OFDM by ANN-based nonlinear equalization," Opt. Lett., vol. 40, no. 21, pp. 5113–5116, Nov. 2015.
[61] C. Pan, H. Bülow, W. Idler, L. Schmalen, and F. R. Kschischang, "Optical nonlinear phase noise compensation for 9 × 32-Gbaud PolDM-16QAM transmission using a code-aided expectation-maximization algorithm," J. Lightw. Technol., vol. 33, no. 17, pp. 3679–3686, Sep. 2015.
[62] C. Häger and H. D. Pfister, "Nonlinear interference mitigation via deep neural networks," in Proc. Opt. Fiber Commun., San Diego, CA, USA, 2018, Paper W3A.4.
[63] C. Häger and H. D. Pfister, "Wideband time-domain digital backpropagation via subband processing and deep learning," in Proc. Eur. Conf. Opt. Commun., Rome, Italy, 2018, Paper Tu4F.4.
[64] Z. Tao, L. Dou, W. Yan, L. Li, T. Hoshida, and J. C. Rasmussen, "Multiplier-free intrachannel nonlinearity compensating algorithm operating at symbol rate," J. Lightw. Technol., vol. 29, no. 17, pp. 2570–2576, Sep. 2011.
[65] V. Kamalov et al., "Evolution from 8QAM live traffic to PS 64-QAM with neural-network based nonlinearity compensation on 11000 km open subsea cable," in Proc. Opt. Fiber Commun., San Diego, CA, USA, 2018, Paper Th4D.5.
[66] R. Dar and P. J. Winzer, "Nonlinear interference mitigation: Methods and potential gain," J. Lightw. Technol., vol. 35, no. 4, pp. 903–930, Feb. 2017.
[67] F. N. Khan, C. Lu, and A. P. T. Lau, "Optical performance monitoring in fiber-optic networks enabled by machine learning techniques," in Proc. Opt. Fiber Commun., San Diego, CA, USA, 2018, Paper M2F.3.
[68] Z. Wang et al., "Failure prediction using machine learning and time series in optical network," Opt. Express, vol. 25, no. 16, pp. 18553–18565, Aug. 2017.
[69] F. Boitier et al., "Proactive fiber damage detection in real-time coherent receiver," in Proc. Eur. Conf. Opt. Commun., Gothenburg, Sweden, 2017, Paper Th.2.F.1.
[70] D. Rafique, T. Szyrkowiec, H. Griesser, A. Autenrieth, and J.-P. Elbers, "Cognitive assurance architecture for optical network fault management," J. Lightw. Technol., vol. 36, no. 7, pp. 1443–1450, Apr. 2018.
[71] D. Rafique, T. Szyrkowiec, A. Autenrieth, and J.-P. Elbers, "Analytics-driven fault discovery and diagnosis for cognitive root cause analysis," in Proc. Opt. Fiber Commun., San Diego, CA, USA, 2018, Paper W4F.6.
[72] J. Mata et al., "Artificial intelligence (AI) methods in optical networks: A comprehensive survey," Opt. Switching Netw., vol. 28, pp. 43–57, Apr. 2018.
[73] F. Musumeci et al., "An overview on application of machine learning techniques in optical networks," IEEE Commun. Surv. Tut., to be published, doi: 10.1109/COMST.2018.2880039.
[74] F. Morales, M. Ruiz, L. Gifre, L. M. Contreras, V. López, and L. Velasco, "Virtual network topology adaptability based on data analytics for traffic prediction," J. Opt. Commun. Netw., vol. 9, no. 1, pp. A35–A45, Jan. 2017.
[75] D. Rafique and L. Velasco, "Machine learning for network automation: Overview, architecture, and applications," J. Opt. Commun. Netw., vol. 10, no. 10, pp. D126–D143, Oct. 2018.
[76] R. Alvizu, S. Troia, G. Maier, and A. Pattavina, "Matheuristic with machine-learning-based prediction for software-defined mobile metro-core networks," J. Opt. Commun. Netw., vol. 9, no. 9, pp. D19–D30, Sep. 2017.
[77] Y. V. Kiran, T. Venkatesh, and C. S. Murthy, "A reinforcement learning framework for path selection and wavelength selection in optical burst switched networks," IEEE J. Sel. Areas Commun., vol. 25, no. 9, pp. 18–26, Dec. 2007.
[78] X. Chen, J. Guo, Z. Zhu, R. Proietti, A. Castro, and S. J. B. Yoo, "DeepRMSA: A deep-reinforcement-learning routing, modulation and spectrum assignment agent for elastic optical networks," in Proc. Opt. Fiber Commun., San Diego, CA, USA, 2018, Paper W4F.2.
[79] S. Yan et al., "Field trial of machine-learning-assisted and SDN-based optical network planning with network-scale monitoring database," in Proc. Eur. Conf. Opt. Commun., Gothenburg, Sweden, 2017, Paper Th.PDP.B.4.
[80] B. Karanov et al., "End-to-end deep learning of optical fiber communications," J. Lightw. Technol., vol. 36, no. 20, pp. 4843–4855, Oct. 2018.
[81] R. T. Jones, S. Gaiarin, M. P. Yankov, and D. Zibar, "Time-domain neural network receiver for nonlinear frequency division multiplexed systems," IEEE Photon. Technol. Lett., vol. 30, no. 12, pp. 1079–1082, Jun. 2018.
[82] J. Estaran et al., "Artificial neural networks for linear and non-linear impairment mitigation in high-baudrate IM/DD systems," in Proc. Eur. Conf. Opt. Commun., Düsseldorf, Germany, 2016, Paper M.2.B.2.
[83] D. Smilkov and S. Carter, "TensorFlow—A neural network playground." [Online]. Available: http://playground.tensorflow.org
[84] A. Damien, "GitHub repository—TensorFlow tutorial and examples for beginners with latest APIs." [Online]. Available: https://github.com/aymericdamien/TensorFlow-Examples
[85] M. Zhou, "GitHub repository—TensorFlow tutorial from basic to hard." [Online]. Available: https://github.com/MorvanZhou/Tensorflow-Tutorial
[86] TensorFlow, "Guide for programming with the low-level TensorFlow APIs." [Online]. Available: https://www.tensorflow.org/programmers_guide/low_level_intro

Faisal Nadeem Khan was born in Jhang, Pakistan. He received the B.Sc. degree in electrical engineering from the University of Engineering and Technology, Taxila, Pakistan, the M.Sc. degree in communications technology from the University of Ulm, Ulm, Germany, and the Ph.D. degree in electronic and information engineering from The Hong Kong Polytechnic University, Hong Kong. From 2012 to 2015, he was a Senior Lecturer with the School of Electrical and Electronic Engineering, Universiti Sains Malaysia. He is currently with the Photonics Research Centre, The Hong Kong Polytechnic University. He has authored or coauthored more than 50 research papers in prestigious international journals and conferences as well as written one book chapter. His research interests include machine learning and signal processing techniques for high-speed fiber-optic communication systems. He has been an invited speaker at various prestigious international conferences including Optical Fiber Communication 2018 and Signal Processing in Photonic Communications 2017, among others.

Qirui Fan was born in Zhejiang, China, in 1992. He received the B.Eng. and M.Eng. degrees in electrical engineering from Hunan University, Changsha, China, in 2014 and 2017, respectively. He is currently working toward the Ph.D. degree with the Department of Electrical Engineering, The Hong Kong Polytechnic University, Hong Kong. His research interests include machine learning techniques for fiber nonlinearity compensation.

Chao Lu received the B.Eng. degree in electronic engineering from Tsinghua University, Beijing, China, in 1985, and the M.Sc. and Ph.D. degrees from the University of Manchester, Manchester, U.K., in 1987 and 1990, respectively. In 1991, he joined, as a Lecturer, the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, where he has been an Associate Professor since January 1999. From June 2002 to December 2005, he was seconded to the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore, as a Program Director and Department Manager, helping to establish a research group in the area of optical communication and fiber devices. Since April 2006, he has been a Professor with the Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong. His research interests include optical communication systems and networks, fiber devices for optical communication, and sensor systems.

Alan Pak Tao Lau received the B.A.Sc. degree in engineering science (electrical option) and the M.A.Sc. degree in electrical and computer engineering from the University of Toronto, Toronto, ON, Canada, in 2003 and 2004, respectively, and the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, USA, in 2008. In 2008, he joined, as an Assistant Professor, The Hong Kong Polytechnic University, where he is currently a Professor. He collaborates with industry in various aspects of optical communications and serves in organizing committees of numerous conferences in optical communications. His current research interests include long-haul and short-reach coherent optical communication systems, optical performance monitoring, and machine learning applications in optical communications and networks.