
Machine Learning in the Air


Deniz Gündüz, Paul de Kerret, Nicholas D. Sidiropoulos, David Gesbert,
Chandra R. Murthy, and Mihaela van der Schaar

Deniz Gündüz is with the Information Processing and Communications Laboratory, Department of Electrical and Electronics Engineering, Imperial College London, London, UK. Paul de Kerret and David Gesbert are with the Communication Systems Department, EURECOM, Sophia Antipolis, France. Nicholas D. Sidiropoulos is with the Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA. Chandra R. Murthy is with the Department of Electrical and Computer Engineering at the Indian Institute of Science, Bangalore, India. Mihaela van der Schaar is with the University of California at Los Angeles (UCLA).

D. Gündüz received support from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program, Starting Grant BEACON (grant agreement no. 677854). P. de Kerret and D. Gesbert are supported by the ERC under the European Union's Horizon 2020 research and innovation program (agreement no. 670896). N. D. Sidiropoulos was partially supported by NSF CIF-1525194, ECCS-1608961, and ECCS-1807660. C. Murthy's work was supported in part by the Young Faculty Research Fellowship from the Ministry of Electronics and Information Technology, Government of India.

Digital Object Identifier: 10.1109/JSAC.2019.2933969

Abstract—Thanks to the recent advances in processing speed and data acquisition and storage, machine learning (ML) is penetrating every facet of our lives, and transforming research in many areas in a fundamental manner. Wireless communications is another success story – ubiquitous in our lives, from handheld devices to wearables, smart homes, and automobiles. While recent years have seen a flurry of research activity in exploiting ML tools for various wireless communication problems, the impact of these techniques on practical communication systems and standards is yet to be seen. In this paper, we review some of the major promises and challenges of ML in wireless communication systems, focusing mainly on the physical layer. We present some of the most striking recent accomplishments that ML techniques have achieved with respect to classical approaches, and point to promising research directions where ML is likely to make the biggest impact in the near future. We also highlight the complementary problem of designing physical layer techniques to enable distributed ML at the wireless network edge, which further emphasizes the need to understand and connect ML with fundamental concepts in wireless communications.

I. INTRODUCTION

Recent advances in machine learning (ML) have caused a wave that has swept across all walks of science and engineering. The main premise of ML is to enable computers to learn and perform certain tasks (e.g., classification and prediction) without being explicitly programmed to do so. This is achieved by training algorithms on vast amounts of data available for the task to be accomplished. While the basic ideas and ambitions of ML go back to the 1950s, recent years have witnessed an unprecedented surge in interest in this area, fuelled by the availability of increasingly powerful computers, large and well-curated datasets, and developments in the theoretical understanding of various learning algorithms. Arguably, the most impressive success stories of modern ML are due to the remarkable efficacy of deep learning, in the form of deep neural networks (DNNs), generative adversarial networks (GANs), and the resurgence of (deep) reinforcement learning ((D)RL) [1], [2]. These tools have resulted in remarkable advances in audio and image recognition, natural language processing, and recommender systems, and have beaten human grandmasters in chess and Go. They have also led to many advances in applications from healthcare to autonomous driving, finance, marketing and robotics. The success of these approaches in many practical applications, and particularly the fact that they perform far better than disciplined approaches based on sound theory, has challenged the very foundation of our engineering education. Very few people believed that such a resurgence of 'black box' methods would ever happen, much less that they would work so remarkably well in practice. Hence, the latest wave in ML caught many communications engineers by surprise. But because we are engineers, we cannot look away from something that works. We have to understand it and use it to our advantage when possible. ML is certainly 'in the air', with many special issues, workshops, special sessions and panels exploring the potentials and promises of ML for wireless systems. While the activity in this area is growing at an exponential rate, some seasoned researchers in the community are skeptical, and the impact of ML techniques on practical communication systems and standards is indeed yet to be seen.

Before we go into the challenges of applying ML in wireless systems, we would like to understand whether ML is really a novelty to communications researchers. Is it a completely new paradigm that can transform communications research and future communication systems, or is it yet another "old wine in a new bottle", presenting various old and known techniques with a new flavour? Indeed, the connections between ML and the theory of information transmission and storage are numerous and often striking. The fundamental problem of communication, as stated by Shannon [3], "reproducing at one point either exactly or approximately a message selected at another point," can in fact be recast as a classification problem. More specifically, symbol and sequence detection, which constitute the core of any communication system, are special cases of the general classification problem, which is at the heart of ML. Shannon's entropy, mutual information, and Kullback-Leibler divergence are widely used in ML as training objectives. Vector quantization (VQ) is key for source coding and rate-distortion, going back to Shannon [4]. VQ is also known as k-means clustering – a staple of unsupervised ML. Universal source coding inherently learns the distribution of the underlying information source in an online fashion [5], [6], and the most successful lossless compression algorithms are based on information theoretic principles, such as the Lempel-Ziv and Burrows-Wheeler transforms, and have been successfully implemented for everyday use (gzip, pdf, GIF, etc.).

Channel estimation is the task of learning a linear system in a supervised fashion when training/pilots are used. When (e.g., power amplifier) nonlinearities come into play, one has to learn a more general nonlinear system. Coding can be considered as controlled dimensionality expansion from the reduced-dimension latent information symbols to the channel input space, and decoding reduces things back to the original low-dimensional information space.

Despite all the fascinating connections, there remain some key differences between generic ML and conventional wireless communication systems. Perhaps the most crucial differences are that i) in communications we have a fairly good grasp of what to expect by way of channel and system models, which obey physical laws; and ii) we have complete control of what, when, and how to transmit. In principle, this makes communications an overall better playing field for model-based solutions than generic ML applications.

The most striking aspect of the recent success of ML is its 'data-driven' nature – we have access to a lot of data nowadays; hence, we can rely on data to draw conclusions and design systems like never before. The data-driven approach of ML is significantly different from the model-based approaches that have long dominated communication system design. Communication and networking engineers have for many years developed models with ever-increasing complexity and accuracy for the underlying physical communication channels, antenna patterns, data traffic, user mobility, interference, and many other aspects of communication systems. They have then designed highly complex modulation/demodulation techniques, error correction codes, and networking protocols based on these models, which can be implemented efficiently (even on low-complexity and energy-limited mobile devices), and can enable reliable communications at fairly high data rates. The model-based approach has been tremendously successful for communication system design, taking us from the first to the fifth generation (5G) of wireless networks, successfully keeping up with the rapidly growing demand for higher quality and lower latency content delivery. However, as we move towards implementing 5G networks and adopting a more flexible network architecture (network function virtualization, software defined networking, etc.), it is likely that there will be many scenarios in which the modeling assumptions used in traditional designs become questionable. For example, with network slicing and multiple service classes, the interference can become highly non-stationary and non-Gaussian. Also, the low-latency communications that 5G should support may not allow accurate channel estimation, and short blocklength codes cannot benefit from the ergodicity of some of the randomness in the channel. Similarly, low latency requirements make the highly structured and modular layered network architecture highly suboptimal, requiring more integrated cross-layer designs, which increases the complexity of optimizing and operating these networks. These challenges point to the need for less structured solutions that are robust to model mismatches. Can the data-driven approach of ML be useful for designing such wireless communication systems and protocols? Modern ML techniques can help us make inferences and predictions about network traffic, user behaviour, application requirements and security threats, all of which can be used for better resource provisioning and improved network operation. 3GPP has already introduced the network data analytics function (NWDAF) in order to standardize the way such data is collected and communicated across various network functions [7]. While this has limited functionality at the moment, it is widely accepted that analytics using higher layer network and user behaviour data will be an integral part of 5G and future communication network architectures, where network functions will interact through the NWDAF to provide relevant data to be used by other network functions, and will apply various ML techniques on the available data to make control and resource allocation decisions. Therefore, an important question in this context is what type of lower layer data can be used to improve the utilization of limited physical layer resources, and which ML techniques would provide timely and useful inferences and predictions based on this data.

In this paper, we will try to answer these questions, focusing mainly on some exemplary applications of ML tools for lower layer design. We note here that the goal of this paper is not to provide a survey of recent results in this very active research area, but to highlight some of the striking recent results that promise significant gains compared to conventional physical layer design techniques, and to provide a general discussion on why these techniques are promising, and on the potential roadblocks for their implementation in real systems and adoption in standards. We refer the readers to excellent survey and overview papers on various aspects of ML in wireless communications to gather a more complete picture of its recent applications in different settings [8]–[13]. Next, we will go over some of the major challenges of applying ML in the lower layers of the protocol stack.

A. Challenges of Applying ML Tools in Wireless Communications

A major criticism of the data-driven approach to communication system design is the 'black-box' nature of some of the ML algorithms, e.g., DNNs, and the lack of guarantees for performance; whereas communication engineers are accustomed to providing performance guarantees on error probability, interference level, channel outage, latency, etc. In many cases, such as emergency communication networks or critical infrastructures, reliability and latency requirements can be extremely stringent. However, such provable guarantees hinge on the assumed channel, traffic, and device models, and their validity is as good as the accuracy of these models. Channel modeling, for example, no matter how ingenious, is always approximate, and the true channel dynamically evolves and is subject to all sorts of nonlinear/phase-transition effects, from amplifier nonlinearities to loss of synchronization, which bring us closer to the realm of more general ML. On the contrary, the data-driven approach does not need powerful models, and instead can learn the optimal system architecture simply from available data. One particularly striking example is the use of the autoencoder as a general nonlinear detection mechanism – without having to physically model, estimate, and explicitly implement an equalizer or an error control mechanism. The advantage of such an approach is that it can "invert" even unknown nonlinear channels directly, based only on training data and nothing else. Therefore, it is not clear which would provide a more reliable communication system: the one optimally designed based on complex yet approximate models, or the one designed by black-box ML algorithms based on training data. While the former is limited by the accuracy of the model in representing reality, the latter uses real data in the learning process, but the data is always limited in size and generalizability. We expect that ML-based solutions will be effective particularly when an accurate model for the problem of interest is not available (i.e., 'model-deficit', as referred to in [13]), and a sufficiently large and representative training dataset is available.

The 'black box' aspect of DNNs also brings along the interpretability problem. Understanding the reasons behind the success or failure of ML methods, particularly those based on DNNs, is an on-going research challenge [14], which is yet to be addressed satisfactorily. From an engineering perspective, not knowing the reasons behind the decisions taken by an algorithm makes it very difficult to tackle failures, or to predict the impact of changes in the environment on the performance. Also relevant for communication networks is the need to guarantee some sense of fairness across users, such that they are not penalized unintentionally by an ML algorithm due to the type of their device, their location, the protocol being used, etc. Fairness in ML is a very important and growing research challenge, particularly for applications that involve personal data, and those that make decisions that have a direct impact on our lives, e.g., automatic evaluation and ranking of CVs, recommender systems for online shopping, and even criminal investigations [15]. Given the sensitivity of our mobile traces, and the close association between users and their mobile devices, fairness will likely be an important concern in future applications of ML in wireless systems.

Another challenge of applying data-driven ML tools to wireless systems is the limited availability of training data. Unlike in computer vision, speech processing, or healthcare applications, in most wireless applications standardized datasets for testing and comparison of proposed ML techniques are not available. However, we expect that, with the increasing adoption of ML techniques, more public datasets will become available to the community. There are a number of initial efforts in this direction, and several publicly available datasets can be used to perform and compare some basic ML tasks on wireless signals [16]–[19].

Even if such datasets become available, it is questionable whether success on them can promise success under other channel and network conditions. Wireless channels are often highly non-stationary, and offline training on a generic dataset may not lead to satisfactory results when tested in a very different wireless environment. This may require online training of the existing models to adapt them to the current scenario; however, the training time of an ML model to reach reasonable performance is often beyond the operation timescales of communication systems.

On the other hand, when a reasonably accurate model (e.g., for the underlying communication channel) is available, one can generate synthetic data from the model, which can be used for offline training. In this case, by a judicious choice of the architecture, one can arrive at an ML algorithm that (at least empirically) outperforms its conventional counterparts, when algorithms of similar computational complexity are compared. This points to another setting in which ML-based techniques have proven useful: even when the system model is accurately or perfectly known, the optimal solution may be too complex, or even intractable [20] (referred to as 'algorithm-deficit' in [13]). In such a case, the model can be used to generate data, which can then be used to train a limited-complexity ML model, which can either try to imitate an available model-based approximate or optimal solution, or directly achieve the optimal performance. Such an approach has been shown to provide approximate solutions to even NP-hard problems using moderate computational resources [21], [22], or to outperform human experts in fully known yet highly complex models, such as chess, Go, or Atari games [23], [24]. We will provide more details in Section V, along with several examples of how ML can be used in wireless networks with known models as a way to optimize the network performance (e.g., the sum rate in a multi-user network).

Another issue raised when adopting ML-based techniques in wireless communication networks is the limited computational and memory resources available to most wireless devices, especially low-complexity terminals at the network's edge. Many of the impressive results with data-driven ML techniques are obtained using very powerful computing machinery and massive datasets, which may not be within reach of mobile devices with limited computation, memory, and energy resources. The computing power of even the most recent mobile devices is orders of magnitude less than that of the high performance computers used to train complex ML models. Also, each wireless end-user device typically has only a limited amount of data, further limiting the training capabilities. The current approach to overcoming the limitations of wireless devices is cloud or edge processing, in which all the data available at wireless devices are transferred to an edge or cloud unit, where a powerful ML algorithm can be centrally trained using all the data. However, such a solution comes at the cost of transferring the data from energy- and bandwidth-limited wireless edge devices to a central edge or cloud processor, and the latency this would incur – not to mention privacy concerns, which are increasingly becoming a serious challenge to centralized data processing. This necessitates new ways of achieving decentralized learning in the wireless setting. Decentralized learning and decision making is ultimately limited by how much information is allowed to flow between the learning devices, and how much noise corrupts each local device's information. Clearly, algorithms which can adapt to arbitrarily distributed information settings would be highly desirable. More details and concrete examples will be given in Section VI on distributed ML at the wireless network edge.

In the rest of this paper, we will highlight some specific problems in wireless communication networks, in particular at the physical layer, where we believe ML techniques can make a significant impact. In some of the settings, data-driven solutions are posed to solve hard wireless networking problems, trained on data generated from existing models. These emphasize the use of ML as an optimization technique to obtain solutions that can surpass the state-of-the-art (SoA). We will also highlight some applications in which data-driven ML techniques are suitable due to the lack of accurate models. Although we provide pointers to a large number of key references, the presented examples are naturally influenced by our personal research experiences and interests. Nonetheless, we believe that the highlighted observations are likely to 'generalize' to other relevant problems and scenarios.

II. DEEP LEARNING BASED DETECTION AND DECODING

Data detection over a noisy channel, which is an essential component of any communication system, is inherently a classification problem. Current detection systems rely on model-based solutions, employing a mathematical model describing the underlying communication channel. Moreover, we typically use a detector derived assuming perfect channel state information (CSI) at the receiver, with the channel state replaced by its estimate computed from training symbols. This renders the detector sub-optimal in the presence of CSI estimation errors, and a well-trained ML algorithm can outperform classical approaches. In the context of data detection under a Poisson channel model (which arises in molecular communication), a recurrent NN (RNN) is used in [25] in the presence of intersymbol interference (ISI). While the proposed RNN structure can be trained to learn to disentangle the impact of ISI without any additional information, the performance of the classical Viterbi decoder (VD) depends heavily on the accuracy of CSI, as well as on the memory length of ISI in the channel. In [25], the authors also train a detector based purely on data collected from a molecular communication channel. Since accurate models for this system are lacking, they show that, under CSI estimation errors, the NN-based detector performs significantly better than SoA detectors. This result corresponds to a fully data-driven approach for a complex system with hard-to-model imperfections and nonlinearities, where ML provides an attractive alternative.

The detection problem is also studied in [26] considering a MIMO channel:

  y = Hx + w,   (1)

where y ∈ R^N is the received vector, H ∈ R^{N×K} is the channel matrix, x ∈ {−1, +1}^K is the unknown channel input vector consisting of independent and equally likely binary symbols, and w ∈ R^N is the noise vector consisting of independent zero-mean Gaussian variables of unknown variance. Even under perfect knowledge of the channel model and the channel matrix H, as the dimensionality of the problem, N × K, increases, the optimal maximum likelihood (ML) detector becomes impractical due to its formidable computational complexity. The authors propose a DNN-based detector. The challenge here is to find the best way to feed the CSI to the DNN to allow the network to learn to exploit this additional information. The authors exploit the structure of a projected gradient solution, and design the network architecture accordingly. In particular, they feed into each layer of the network H^T y, v_k, x_k and H^T H x_k, where v_k and x_k are obtained iteratively as outputs of each layer of the DNN. The results in [26] show that the DNN-based decoder achieves performance comparable to a high-complexity decoder based on semidefinite relaxation, while running 30 times faster. This is a good example of a solution that combines a DNN-based "black-box" solution with domain knowledge, which is used to "steer" the DNN to exploit the available information in an efficient manner. Note that the decoder is not provided any additional information, and in theory should be able to learn to mimic this structure; however, providing the same information in the most convenient form can speed up the learning process, and avoid suboptimal local optima.
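To make the unrolled architecture concrete, the following is a minimal sketch of one such layer in the spirit of [26]: each layer receives H^T y, the current estimate x_k, and H^T H x_k, applies a projected-gradient-style update with a learnable correction, and projects back towards {−1, +1}. The dimensions, the soft-sign projection, and the random (untrained) per-layer weights are our own illustrative assumptions, not the exact architecture of [26], where the parameters are trained end-to-end.

```python
import numpy as np

def soft_sign(t, scale=5.0):
    """Piecewise-linear projection onto [-1, 1], a differentiable
    surrogate for the hard slicer on {-1, +1} symbols."""
    return np.clip(scale * t, -1.0, 1.0)

def detector_layer(x_k, H, y, W, b, step):
    """One unrolled layer: a projected-gradient step on ||y - Hx||^2,
    i.e. x_{k+1} = proj(x_k - step * (H^T H x_k - H^T y)), plus a small
    learnable correction applied to the features [H^T y, x_k, H^T H x_k]."""
    HTy, HTHx = H.T @ y, H.T @ (H @ x_k)
    grad_step = x_k - step * (HTHx - HTy)        # classic gradient step
    features = np.concatenate([HTy, x_k, HTHx])  # inputs fed to the layer
    return soft_sign(grad_step + W @ features + b)

# Toy run: N = 8 receive dimensions, K = 4 binary symbols, L = 3 layers.
rng = np.random.default_rng(0)
N, K, L = 8, 4, 3
H = rng.normal(size=(N, K))
x_true = rng.choice([-1.0, 1.0], size=K)
y = H @ x_true + 0.1 * rng.normal(size=N)
x = np.zeros(K)
for _ in range(L):
    W = 0.01 * rng.normal(size=(K, 3 * K))  # untrained weights, purely to
    b = np.zeros(K)                         # show the data flow; training
    x = detector_layer(x, H, y, W, b, step=0.05)  # would fit W, b, step
print(np.sign(x), x_true)
```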

In [27], the authors study the data detection problem over a known channel model, without CSI at the receiver. The DNN-based decoder in this case is trained to output the estimated data symbols based purely on the received signal, without explicitly estimating the wireless channel. The DNN is trained using synthetically generated input-output data. Specifically, the channel is drawn from the Wireless World Initiative for New Radio (WINNER II) model, which models a typical urban channel with 24 sample-spaced paths. The DNN consists of an input layer, three hidden layers and an output layer, with a heuristically selected number of neurons in each layer. The rectified linear unit (ReLU) is used as the activation function in all but the output layer, where the sigmoid function is used to map the output to the interval [0, 1]. The MSE between the transmitted and predicted symbols is used as the loss function for training the DNN. Numerical results illustrate several interesting points. First, when sufficiently many pilots are present, the bit error rate (BER) performance of the DNN-based decoder matches that of the minimum mean square error (MMSE) receiver. However, under non-ideal conditions, such as fewer pilots, absence of the cyclic prefix, or nonlinear clipping noise, the DNN-based decoder can significantly outperform the MMSE receiver. Of course, the MMSE receiver is no longer optimal under these non-idealities, and another hand-crafted solution that addresses them could perform as well as or better than the DNN-based decoder. Nonetheless, the deep learning approach offers a relatively straightforward and promising solution that can potentially deal with a variety of non-idealities in a robust manner. In [28], the authors carry this idea further and present over-the-air results for an online trainable OFDM receiver.

A. Learning to Decode

While the above works mainly focus on detecting the channel input symbols, DNNs can also be used to recover coded symbols. Decoding the codewords of a given channel code is another classification problem. However, the number of classes into which the received signal must be classified grows exponentially with the blocklength, leading to exponentially growing training complexity. Therefore, most of the current approaches to DNN-based channel decoding incorporate DNNs into existing decoder structures. For example, [29] uses a NN to learn the weights that should be assigned to the Tanner graph of the belief propagation (BP) algorithm. In [30], the authors propose improving the performance of conventional iterative decoding for polar codes by complementing it with NN-based components. In particular, they divide the decoder into sub-blocks, each of which is replaced by a NN-based decoder, and the results of these decoders are fed into a BP decoder. This allows controlling the training complexity by adjusting the number of sub-blocks. A fully DNN-based channel decoder is considered in [31]. To keep the complexity reasonable, the codelength is limited to 16 while the code rate is fixed to 1/2. The authors trained the decoder NN both for a polar code and for a random code. While a performance close to that of a maximum a posteriori (MAP) decoder can be achieved for the polar code, the gap to the MAP decoder performance is much larger for the random code. Although this gap can be closed by increasing the number of training epochs, the result highlights the point that NNs are most effective when the data has an underlying structure that can be learned. The authors also considered limiting the set of codewords observed during training, to test whether the NN-based decoder can generalize to unseen codewords. They observed that this was indeed the case for the polar code; the decoder was able to learn to decode codewords it had never seen before, which can only be explained by the NN-based decoder having learned the structure of the decoding algorithm. This was not the case for the random code, which did not have any particular structure that could be exploited by the NN-based decoder.

B. Observations

Certain features that are common to the aforementioned works are worth mentioning. First, most of them use the so-called one-hot representation of the transmitted signal [32], [33]. In the one-hot representation, the signal is represented as a binary vector of length equal to the number of possible signals. The binary vector contains a single 1 at the location corresponding to the transmitted signal, and zeros everywhere else. While one-hot encoding typically provides better results, as it prevents any numerical ordering between the inputs, it also leads to an exponentially growing input size for channel decoding.

The output layer of the DNN that attempts to reconstruct the input signal is typically chosen as the sigmoid function. In this case, the DNN attempts to output the likelihoods of the possible signals, which is useful, for example, in the detection of coded symbols, where the bit log-likelihood ratios need to be fed into a channel decoder.

Another interesting recent development is the use of RNNs with long short-term memory (LSTM), which allows for a smaller generalization error [25]. This allows for unseen channel instantiations to be handled effectively.
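As a concrete note on the one-hot representation discussed in the first observation above, a minimal sketch (the 16-signal alphabet is an arbitrary choice):

```python
import numpy as np

def one_hot(index, num_signals):
    """Length-num_signals binary vector with a single 1 at `index`
    and zeros everywhere else."""
    v = np.zeros(num_signals)
    v[index] = 1.0
    return v

# 16 possible codewords -> a 16-dimensional input per transmitted signal;
# for an (n, k) code the alphabet size 2**k grows exponentially in k,
# and so does the one-hot input dimension.
print(one_hot(3, 16))
```
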
III. DEEP LEARNING FOR CHANNEL ESTIMATION AND CHANNEL STATE INFORMATION (CSI) FEEDBACK

CSI is essential in wireless communications. On the receiver side, knowledge of the channel state allows coherent detection and decoding. Channel estimation at the receiver is typically carried out by sending pilots from the transmitter. On the transmitter side, CSI allows employing adaptive transmission techniques, which can provide significant gains in performance and efficiency. Transmitter CSI is especially important in massive MIMO systems. For time division duplex schemes, CSI can be obtained at the transmitter side by exploiting reciprocity. However, in frequency division duplex schemes, the CSI estimated at the receiver needs to be conveyed to the transmitter over a feedback link. In order to minimize the resources dedicated to CSI feedback, it is essential to compress the CSI estimate at the receiver as efficiently as possible. Below, we review some recent applications of DNNs to the channel estimation and CSI compression problems. While the detection and decoding problems studied in the previous section correspond to classification problems, channel estimation is a regression problem, and CSI compression represents an unsupervised clustering problem.

A. Channel Estimation

Recall that MMSE channel estimation entails knowledge of the channel statistics and a potentially computationally expensive conditional mean computation. In [34], the authors model the channel as conditionally Gaussian given a set of (hyper)parameters. These hyperparameters are also random, and their distribution is eventually learned from training data. The MMSE estimator under this model can be written as a linear estimator, with weights depending on the statistics of the hyperparameters. By vectorizing the MMSE estimate, the authors write the estimator in a form that is amenable to implementation as a feed-forward neural network with two linear layers connected by a nonlinear activation function. These layers are made learnable, and are trained via stochastic gradient descent with the mean squared channel estimation error as the loss function. It is shown that, under certain assumptions, this can lead to a computationally inexpensive, near-optimal MMSE estimator when the channel covariance matrix is Toeplitz and has a shift-invariance structure. Simulation results suggest that the NN-based channel estimator outperforms SoA estimators and has low complexity.

In the context of wideband channels, [35] models the channel time-frequency response as an image, and the pilot-based samples as a low-resolution sampled version of that image. The authors use convolutional neural network (CNN) based image super-resolution and image restoration techniques to estimate the channel, with the mean squared error (MSE) as the loss function. Empirically, the performance is demonstrated to be similar to that of an ideal MMSE estimator that has perfect knowledge of the channel statistics.

As a final note on DNN-based channel estimation, the above ideas have been extended to the case where the receiver has fewer RF chains than antenna elements, for example, in mmWave systems. In this case, the key challenge for the receiver is to estimate the channel from compressed measurements. In [36], [37], it is shown that these estimators can even outperform estimators based on sparse signal recovery, when trained with a sufficient amount of data.
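The following toy sketch illustrates the regression view of channel estimation underlying these works: a linear estimator is fitted purely from model-generated (channel, noisy pilot) pairs, and approaches the LMMSE filter without the channel covariance ever being handed to it explicitly. The Gaussian channel model and the dimensions are illustrative assumptions; this is not the architecture of [34] or [35].

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 16
M = rng.normal(size=(dim, dim)) / np.sqrt(dim)
C = M @ M.T          # channel covariance: the "model" used to generate data
noise_var = 0.1

def sample(n):
    """Pilot observations y = h + noise for channels h ~ N(0, C)."""
    h = rng.multivariate_normal(np.zeros(dim), C, size=n)
    y = h + np.sqrt(noise_var) * rng.normal(size=(n, dim))
    return h, y

# "Training": fit a linear estimator h_hat = y @ W by least squares on
# synthetic data; this converges to the LMMSE filter (C + noise I)^-1 C.
h_tr, y_tr = sample(50_000)
W, *_ = np.linalg.lstsq(y_tr, h_tr, rcond=None)

h_te, y_te = sample(5_000)
mse_learned = np.mean((y_te @ W - h_te) ** 2)   # learned linear estimator
mse_raw = np.mean((y_te - h_te) ** 2)           # raw pilot estimate
print(mse_learned, mse_raw)                     # learned < raw
```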

B. CSI Compression

As mentioned above, accurate CSI knowledge at the transmitter can significantly increase the performance of wireless communication systems, for example, by avoiding poor channel states, or by employing beamforming. In frequency division duplex systems, the transmitter depends on feedback from the receiver to acquire CSI. In order to limit the resources dedicated to CSI feedback, it is important to design an efficient compression algorithm which can provide a high accuracy CSI estimate to the transmitter while using limited communication resources, measured in terms of bits per channel symbol. This becomes particularly important in massive MIMO systems, which require accurate downlink CSI to achieve the promised performance gains, while the CSI feedback overhead can be excessive due to the massive number of antennas.

Simple scalar quantization methods are not suitable for schemes that are highly sensitive to the CSI estimation quality at the transmitter. Moreover, they cannot exploit the spatial structure in the channel matrix, and result in high feedback overhead. CSI feedback reduction techniques based on vector quantization [38] are also limited, particularly for massive MIMO systems, as the codebook size, and thus the feedback overhead, grow proportionally with the number of transmit antennas. More recently, compressive sensing has been considered to exploit the sparse structure of the underlying channel in a transform domain [39], [40]. While this provides significant reductions in CSI feedback, such techniques do not fully exploit the correlations among antennas.

Since CSI compression is a special case of the more general data compression problem, let us briefly mention here how ML techniques can be used for data compression in general. Data compression is a fundamental problem in information and coding theory, and significant research efforts have been dedicated to developing efficient compression algorithms for various information sources, such as image, audio, or video. The traditional approach has been to leverage expert feature knowledge for each domain to design specific compression schemes; so much so that there have been separate research communities working on each of these data domains, and distinct compression standards, such as MP3, JPEG and MPEG, have been developed. In most cases the algorithms try to exploit the sparsity of the information source in a transform domain, such as the discrete cosine transform in image compression, or some other structure, such as motion compensation in video compression. While these highly specialized techniques, which have been refined and perfected over many decades of research and development, provide reasonably good performance in general, recently there has been significant progress in exploiting DNN architectures for compression in all these data domains [41]–[44], with results meeting or surpassing SoA expert-based compression techniques, which is again remarkable.

The main component in most of these implementations is the autoencoder structure. An autoencoder is a pair of NNs, called the encoder and the decoder networks. The output of the encoder network, called the bottleneck layer, is the input to the decoder network. The two networks are trained jointly with the goal of recovering the input at the output of the decoder. Typically the bottleneck layer has a lower dimension than the input data, and if the autoencoder can learn to recover the input with minimal distortion, this means that the bottleneck layer carries the essential information to approximately reconstruct the input data, and hence can be considered a compressed version of the input signal. Autoencoders are used in ML for feature extraction or as generative models for data [1]. The autoencoder is an unsupervised learning technique, as it does not require any labels.

The main advantage of autoencoders for data compression is that they do not require knowledge of the underlying data distribution, or the explicit identification of a certain structure; instead, they learn a low-dimensional representation directly from data. Moreover, autoencoders can be optimized for very specific information sources. While standard image compression techniques apply the same algorithm to all types of images, an autoencoder can be trained only on, say, underwater images, and learn specific features of these images, resulting in much higher compression efficiency.

This data-driven autoencoder-based compression approach is particularly attractive for CSI feedback compression, as it is difficult to identify and characterize the features of channel matrices, which can have quite complicated interdependencies through the physical environment. On the other hand, acquiring CSI data for training can be easy if we have a relatively simple model that can represent the physical channel accurately. Many such models have been developed over the years, such as the 3GPP spatial channel model (SCM) [45], WINNER [46], IEEE 802.16a,e [47], or the more advanced geometry-based COST 2100 stochastic channel model [48].

An autoencoder-based compression scheme, called CSINet, is studied in [49], and it is shown to provide significant improvement in compression efficiency compared to SoA techniques exploiting sparsity. In [50], the authors consider temporal correlations in time-varying channels, and improve the performance of CSINet for this scenario using an RNN. Utilizing channel reciprocity, the authors in [51] use the uplink CSI as additional correlated side information to further improve the compression efficiency.

However, the aforementioned works focus mainly on the dimensionality reduction aspect, and do not directly tackle the compression problem, which requires a binary representation of the CSI that is then transmitted reliably over the feedback link. While dimensionality reduction can potentially reduce the required feedback resources, in principle each autoencoder output is still a real number that needs to be quantized before being fed back to the transmitter. In practice, sufficient accuracy can be achieved by using 32-bit quantization for each of the autoencoder outputs; however, it is not clear if this leads to the most efficient binary representation of the CSI matrix. An alternative approach is to directly incorporate the quantization operation into the autoencoder training process. This is challenging, however, as the quantization operation is non-differentiable. Various methods have been proposed in the image compression literature to overcome this difficulty.

In [52], the quantizer gradient is approximated as 1 in the backward pass, while [42] replaces the quantization with additive uniform noise, and a stochastic binarization function is used in [53].

This approach is adopted in [54] for CSI compression, where a more advanced autoencoder architecture is used compared to [49], and trained together with the quantizer. The output of the quantizer is entropy coded, as in standard image compression algorithms, to further improve the compression rate. This leads to a significant improvement in compression efficiency. For example, for 32 transmit antennas and 256 subcarriers, the results in [54] show that including the quantization and entropy coding in the training process can provide approximately a 7 dB reduction in the mean-square error of the reconstructed channel matrix at 0.01–0.12 bits per channel symbol.
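A minimal sketch of such quantizer-in-the-loop training, in the spirit of [52] and [54]: the bottleneck of a CSI autoencoder is rounded to a few bits in the forward pass, while the backward pass approximates the quantizer gradient as 1 (straight-through). The layer sizes and bit depth are hypothetical, and the entropy coding stage of [54] is omitted.

```python
import torch
from torch import nn

class STEQuantize(torch.autograd.Function):
    """Round to the nearest of 2**bits levels in [0, 1]; pass gradients
    through unchanged in the backward pass (the approximation of [52])."""
    @staticmethod
    def forward(ctx, x, levels):
        return torch.round(x * (levels - 1)) / (levels - 1)
    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None   # d(quantize)/dx approximated as 1

class CSIAutoencoder(nn.Module):
    def __init__(self, csi_dim=2048, bottleneck=64, bits=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(csi_dim, 256), nn.ReLU(),
                                 nn.Linear(256, bottleneck), nn.Sigmoid())
        self.dec = nn.Sequential(nn.Linear(bottleneck, 256), nn.ReLU(),
                                 nn.Linear(256, csi_dim))
        self.levels = 2 ** bits

    def forward(self, x):
        z = STEQuantize.apply(self.enc(x), self.levels)  # feedback payload
        return self.dec(z)

model = CSIAutoencoder()
x = torch.randn(32, 2048)               # stand-in for vectorized CSI
loss = nn.functional.mse_loss(model(x), x)
loss.backward()                          # gradients flow despite rounding
```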

IV. AUTOENCODERS FOR END-TO-END COMMUNICATION SYSTEM DESIGN

The correspondence of the autoencoder structure to a communication system with an encoder and a decoder is quite obvious. As mentioned in Section III-B, autoencoders have been successfully applied to image and video compression, which can be considered as communication over a finite-rate error-free channel. End-to-end learning of encoder and decoder functions for communication over a physical layer channel was first proposed in [55], and later expanded in [56]. The noisy communication channel that connects the output of the encoder NN to the input of the decoder NN is treated as an untrainable layer with a fixed transformation. This end-to-end training of the physical layer bypasses the modular structure of conventional communication systems, which consists of separate blocks for data compression, channel coding, modulation, channel estimation and equalization, each of which can be individually optimized. While this modular structure has advantages in terms of complexity and ease of practical implementation, it is known to be suboptimal. An autoencoder is trained for coding and modulation over an additive white Gaussian noise (AWGN) channel in [56], and it is shown to have a performance very close to conventional coding and modulation schemes at short blocklengths.

The aforementioned works on autoencoder-based end-to-end physical layer design assume a known channel model, and the encoder and decoder networks are trained jointly by simulating many realizations of this channel model. While models for wireless channels are considered to be accurate in general, they may still have a mismatch with the real channel experienced by the transceivers, limiting the overall performance of the system. An alternative would be to use a GAN architecture to learn a channel model based on real data collected from the channel. This can provide a more accurate model of the channel, particularly if sufficient data can be collected. In [57], the authors propose to use the learned GAN as the channel layer between the encoder and decoder NNs of an end-to-end communication system.

A fundamental challenge in training autoencoders directly on a real channel is the significant delay this may cause. Since the encoder and decoder must be trained jointly, backpropagation has to propagate the gradient from the receiver to the transmitter, requiring a feedback link during training, which would significantly slow down training. To circumvent this limitation, [58] proposes a two-phase training approach: in the first phase, the encoder and decoder are trained jointly based on a channel model, as before. Once these networks are deployed at the transmitter and the receiver, the receiver network is trained further based on the transmission of known signals from the transmitter. This is similar to pilot transmission in channel estimation, and does not require feedback to the transmitter.
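A minimal end-to-end sketch of this idea, loosely following [55], [56]: one-hot messages are mapped by an encoder NN to a power-normalized block of channel uses, passed through an AWGN "layer" with no trainable parameters, and classified by a decoder NN. The network sizes, SNR convention, and training schedule are illustrative assumptions.

```python
import torch
from torch import nn

M, n = 16, 7    # 16 messages sent over n complex (2n real) channel uses

encoder = nn.Sequential(nn.Linear(M, 64), nn.ReLU(), nn.Linear(64, 2 * n))
decoder = nn.Sequential(nn.Linear(2 * n, 64), nn.ReLU(), nn.Linear(64, M))

def channel(x, snr_db=7.0):
    """AWGN 'layer': a fixed, untrainable transformation between the NNs."""
    x = x / x.norm(dim=1, keepdim=True) * (2 * n) ** 0.5  # power constraint
    sigma = (10 ** (-snr_db / 10)) ** 0.5
    return x + sigma * torch.randn_like(x)

opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], 1e-3)
for _ in range(2000):
    msgs = torch.randint(0, M, (256,))
    x = nn.functional.one_hot(msgs, M).float()            # one-hot input
    logits = decoder(channel(encoder(x)))
    loss = nn.functional.cross_entropy(logits, msgs)      # end-to-end loss
    opt.zero_grad(); loss.backward(); opt.step()
```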

A. Joint source-channel coding (JSCC)

All the above works have exclusively focused on transmitting bits over the noisy channel; that is, the goal is to design error correction codes jointly with modulation, channel estimation, etc. Note that, when the input to the encoder is a bit sequence, there is no structure in the data, and the goal is to learn the best mapping of the message bits into the channel input space, and jointly the best inverse mapping. This is achieved mainly by distributing the input signals as much as possible in the channel input space within the constraints of the transmitter, and taking into account the random channel transformation. However, in many real applications, the goal is to transmit some information signal, e.g., a picture, video, or audio signal, which is not in the form of a sequence of equally likely bits, and typically has significant redundancy.

The current standard approach to the transmission of such signals is to first compress them with a source coding algorithm in order to get rid of the inherent redundancy, and to reduce the amount of transferred information; the compressed bitstream is then encoded and modulated over the channel. Shannon's separation theorem proves that this two-step separate source and channel coding approach is theoretically optimal in the asymptotic limit of infinitely long source and channel blocks [60]. However, in practical applications, JSCC is known to outperform the separate approach, particularly in the short-blocklength and low-SNR regimes. Many emerging applications, from the Internet-of-things (IoT) to autonomous driving and the tactile Internet, require transmission of high data rate information (image/video, various sensor measurements) under extreme latency, bandwidth and/or energy constraints, which preclude computationally demanding long-blocklength source and channel coding techniques. However, characterizing the optimal JSCC in non-asymptotic regimes has remained an open problem, even for fully known source and channel distributions, and it is significantly more challenging for the transmission of complicated sources, such as images or videos, for which we do not have good statistical models.

Alternatively, a deep JSCC architecture can be trained to map the underlying signal samples directly to channel inputs. Such an architecture is studied for the transmission of images over wireless channels in [59]. This can be considered an "analog" JSCC scheme since, unlike digital systems built upon the separation approach, the input signal is never converted into bits, and the channel input signal is not limited to a finite number of constellation points. The deep JSCC architecture proposed in [59] is illustrated in Fig. 1. This fully convolutional architecture allows compression of images of any size. The results illustrated in Fig. 2 show that deep JSCC outperforms SoA digital image transmission schemes, e.g., JPEG/JPEG2000 image compression followed by capacity-achieving channel codes, particularly in the low SNR and short channel bandwidth regimes. Note that both JPEG and JPEG2000 fail completely at SNR = 0 dB. For SNR = 10 dB, JPEG2000 can provide reasonable quality if the channel bandwidth is sufficiently large. A few aspects of deep JSCC are particularly worth mentioning. First of all, it provides non-trivial image reconstruction even at very low SNR values and limited channel bandwidths, i.e., in the case of short blocklengths. Moreover, thanks to the analog nature of the encoder, the performance behaves like that of analog modulation schemes, and exhibits graceful degradation with channel SNR. This can be observed in the performance curves in Fig. 3. A deep JSCC architecture trained for a particular target channel SNR value gracefully degrades if the channel SNR falls below this value, and its performance improves gradually if the channel SNR goes above the target value.

[Fig. 1. Encoder and decoder NN architectures used in the implementation of the deep JSCC scheme in [59].]

[Fig. 2. Performance of the deep JSCC algorithm in [59] on CIFAR-10 test images transmitted over an AWGN channel, with respect to the available channel bandwidth per image pixel, for different SNR values (PSNR in dB vs. channel uses per pixel; deep JSCC vs. JPEG and JPEG2000, each at SNR = 0 dB and 10 dB). For each case, the same SNR value is used in training and evaluation, and a different network is used to obtain each point in the curve.]

[Fig. 3. Performance comparison between the deep JSCC algorithm in [59] and JPEG compression followed by LDPC coding (PSNR in dB vs. SNR_test in dB; deep JSCC trained at SNR_train = 1, 4, 7, 13 dB vs. 1/2- and 2/3-rate LDPC with 4QAM, 16QAM and 64QAM). Deep JSCC is trained on the ImageNet dataset for the indicated SNR_train values. The compression ratio is 1/12, i.e., 1 channel use per 12 pixels.]

This analog behaviour is particularly attractive for broadcasting to multiple receivers, or when transmitting over a time-varying channel. Indeed, it is shown in [59] that the performance improvement of deep JSCC compared to conventional digital schemes is much higher over fading channels. Note also that, while learning channel codes is challenging even for very limited blocklengths, deep JSCC can achieve performance levels above or comparable with SoA digital techniques even over large blocklengths.

It is shown in [61] that this deep JSCC architecture also allows bandwidth adaptation through successive refinement; that is, an image can be transmitted over n layers, and a user receiving the first k layers can recover the image with peak signal-to-noise ratio PSNR_k, k = 1, ..., n. While PSNR_1 < ... < PSNR_n, as expected, PSNR_k is very close to the performance one would obtain if the image were transmitted targeting the total bandwidth available for the first k layers; that is, transmitting the image in layers comes at almost no additional cost, providing seamless bandwidth adaptivity.
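A compact sketch of the analog deep JSCC idea, in the spirit of (but much smaller than) the architecture of [59] shown in Fig. 1: a fully convolutional encoder maps pixels directly to power-normalized channel symbols, AWGN is added, and a convolutional decoder reconstructs the image under an MSE loss. All layer choices here are illustrative assumptions.

```python
import torch
from torch import nn

c = 8   # channel symbols per 4x4 image patch: the bandwidth compression knob

class DeepJSCC(nn.Module):
    """Analog JSCC sketch: pixels -> real channel symbols -> AWGN -> pixels,
    trained end-to-end; the input is never converted into bits."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.PReLU(),
            nn.Conv2d(32, c, 4, stride=2, padding=1))
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(c, 32, 4, stride=2, padding=1), nn.PReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, img, snr_db=10.0):
        z = self.enc(img)
        z = z / z.pow(2).mean().sqrt()                 # average power 1
        z = z + (10 ** (-snr_db / 20)) * torch.randn_like(z)  # AWGN layer
        return self.dec(z)

model = DeepJSCC()
img = torch.rand(8, 3, 32, 32)                          # CIFAR-sized batch
loss = nn.functional.mse_loss(model(img), img)
loss.backward()
```
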
V. MACHINE LEARNING BASED RESOURCE ALLOCATION

An important class of problems where modern ML techniques can help is formed by (NP-)hard resource allocation and decision/scheduling problems, which are very common in wireless communications and networking. Examples range from classical multi-user detection to sum-rate optimal power control, multi-user scheduling, and transmission control – or "smart" data-driven TCP/IP. In the following paragraphs, we will review some illustrative examples.

The case of joint multicast beamforming and antenna selection is considered in [62], where it is shown how a DNN can be used to successfully solve the discrete optimization part of the problem. This is an example of a hybrid strategy, where a DNN is employed to solve part of the problem, synergistically with classical optimization.

Beamforming for minimum outage [63] has also been proven to be NP-hard even when the channel distribution is known exactly, and in fact no practically good approximation algorithm was known until very recently. Yet, relying on a sample average 'counting' approximation of outage, simple smoothing, and stochastic gradient updates, a lightweight and very effective algorithm was recently designed in [64] that performs remarkably well, using only recent channel data. The problem is formulated as follows:

  min_{w ∈ W} F(w) := Pr( |w^H h|^2 < γ ),   (2)

where γ > 0 denotes the outage threshold and W ⊂ C^N is a simple (element-wise or sum) power constraint. We can equivalently express (2) as

  min_{w ∈ W} Pr( |w^H h|^2 < γ )  ⇔  min_{w ∈ W} E_h[ 1{ |w^H h|^2 < γ } ].   (3)

Define

  f(w; h) := 1{ |w^H h|^2 < γ } = 1 if |w^H h|^2 < γ, and 0 otherwise,   (4)

i.e., the indicator function of the event |w^H h|^2 < γ. Consider a given set of 'recent' channel realizations H_T := {h_t}_{t=1}^T. Utilizing H_T, we may construct the following sample average estimate of E_h[f(w; h)]:

  F̂(w; H_T) := (1/T) Σ_{t=1}^T f(w; h_t).   (5)

The interpretation is that we minimize the total number of outages over ('recent') channel history – very reasonable, since under appropriate mixing conditions we have

  lim_{T→∞} F̂(w; H_T) = E_h[f(w; h)] = F(w),  ∀ w ∈ W,   (6)

almost surely. Replacing F(w) by F̂(w; H_T) in (2), we obtain

  min_{w ∈ W} F̂(w; H_T).   (7)

The final step is to construct a smooth approximation of f(w; h), and optimize the resulting function using stochastic gradient descent. As shown in [64], this approach works unexpectedly well, on a problem that has challenged many disciplined optimization experts for years.
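A toy implementation of the recipe in (2)–(7), under the assumptions that the smooth surrogate for the indicator is a sigmoid and that the channels are correlated complex Gaussian draws (so that the beamforming direction actually matters); [64] may use a different smoothing and step-size rule.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, gamma, tau, P = 8, 1000, 1.0, 0.1, 1.0
# 'Recent' channel realizations: a dominant direction plus scatter.
u = rng.normal(size=N) + 1j * rng.normal(size=N)
H = 0.9 * u + 0.3 * (rng.normal(size=(T, N)) + 1j * rng.normal(size=(T, N)))

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

w = np.ones(N, dtype=complex) / np.sqrt(N)
for _ in range(500):
    ht = H[rng.integers(T)]                   # one sampled channel draw
    s = sigmoid((gamma - np.abs(np.vdot(w, ht)) ** 2) / tau)
    # d/dw* of sigmoid((gamma - |w^H h|^2)/tau) = -(s(1-s)/tau) h h^H w
    grad = -(s * (1 - s) / tau) * ht * np.vdot(ht, w)
    w = w - 0.1 * grad                        # stochastic gradient step
    w = w * min(1.0, np.sqrt(P) / np.linalg.norm(w))  # project: ||w||^2 <= P

print("empirical outage:", np.mean(np.abs(H @ w.conj()) ** 2 < gamma))
```
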
Finally, the sum-rate optimal power control problem is known to be NP-hard, but we have good, albeit computationally expensive, approximation schemes at our disposal. These include the iterative weighted minimum mean squared error (WMMSE) approach [65], [66], and successive convex approximation [67]. These algorithms are too complex for practical implementation, but a key idea advocated in [20], [68] is that we can take this complexity offline by training a DNN to mimic the input-output behavior of the WMMSE algorithm. The way to do this is to use historical (measured) and/or simulated channel data, run the WMMSE algorithm offline to generate the associated power allocation values, and use these input-output pairs to train the DNN. At run time, we simply pass the input through the trained neural network, which is far cheaper than running WMMSE online, and works remarkably well.

The approach can be further refined by training the network to optimize the sum rate directly, as described in [69]. In that case, the sum rate is directly differentiated with respect to the coefficients of the DNNs, which allows further improving the performance. The existing SoA solutions and the approximation approach described above are still used to initialize the optimization, and hence avoid inefficient local optima.
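A sketch of the two-stage "learning to optimize" recipe just described. The function solve_power_control below is a hypothetical stand-in for the expensive offline solver (e.g., WMMSE [65]); only the pattern is the point here: run the solver offline on simulated channels, then train a cheap DNN to imitate its input-output map.

```python
import torch
from torch import nn

K = 4                                     # transmitter-receiver pairs

def solve_power_control(gains):
    """Hypothetical placeholder for the expensive offline solver
    (e.g., WMMSE); substitute a real implementation here."""
    p = gains.diagonal(dim1=1, dim2=2)    # NOT WMMSE; illustrative only
    return p / (p + gains.sum(dim=2))

net = nn.Sequential(nn.Linear(K * K, 128), nn.ReLU(),
                    nn.Linear(128, K), nn.Sigmoid())
opt = torch.optim.Adam(net.parameters(), 1e-3)

for _ in range(3000):                     # offline: imitate solver outputs
    g = torch.rand(256, K, K)             # simulated channel gain matrices
    target = solve_power_control(g)
    loss = nn.functional.mse_loss(net(g.flatten(1)), target)
    opt.zero_grad(); loss.backward(); opt.step()

# Online: one cheap forward pass replaces the iterative solver.
p_hat = net(torch.rand(1, K * K))
```
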

The optimization problem can then be written as

(s_1^\star, \ldots, s_n^\star) = \operatorname*{argmax}_{s_1, \ldots, s_n} \mathbb{E}\left[u(x, s_1(y_1), \ldots, s_n(y_n))\right], \quad (9)

where the expectation is taken over the joint distribution p_{x, y_1, \ldots, y_n}. We assume that all the agents know this distribution, or equivalently, as will become clear later on, that the training dataset is available at all the agents. It is important to note that, for the sake of clarity, we consider in (9) an optimization problem with only implicit constraints on the decision functions and no explicit constraints. Yet, this formulation trivially extends to constrained optimization problems.
This is, however, a rather simplified case of more general decentralized multi-agent optimization problems, as we consider only a one-shot optimization rather than a repeated setting where agents take decisions in multiple rounds, while receiving some form of feedback at the end of each round. The feedback could be in the form of a reward function, or explicit information exchange among the agents, which can be classified as passive and active feedback, respectively. In the case of active feedback, each agent would optimize the information to share with the other agents, possibly in multiple rounds, jointly with the decision functions. In either case, the problem could then be formulated as a reinforcement learning (RL) problem [73], which has been successfully applied to many communication problems [74].
In a naive approach to solving optimization problem (9), each agent assumes that its information about the world is perfect, and that all the other agents share the same information. Hence, the optimization problem solved by agent j is

(s_1^{\text{naive}}, \ldots, s_j^{\text{naive}}, \ldots) = \operatorname*{argmax}_{s_1, \ldots, s_n} \mathbb{E}\left[u(y_j, s_1(y_j), \ldots, s_n(y_j))\right]. \quad (10)
This approach can be improved by taking into account the imperfection in y_j with respect to x, i.e., taking the expectation over p_{x, y_j}, as conventionally done in robust signal processing [75], instead of simply taking y_j as being perfect, i.e., y_j = x. Yet, it is still fundamentally limited, as the decentralized information structure is not taken into account: coordination cannot be reached. The alternative best-response strategy is optimal given the strategies of the other agents, i.e., a Nash equilibrium [76]. Hence, best-response strategies (s_1^{\text{BR}}, \ldots, s_n^{\text{BR}}) satisfy

s_j^{\text{BR}} = \operatorname*{argmax}_{s_j} \mathbb{E}\left[u(x, s_j, \bar{s}_j^{\text{BR}})\right], \quad (11)
where we have used \bar{s}_j^{\text{BR}} as a short-hand notation for all the strategies s_k^{\text{BR}} with k \neq j, and omitted the functional dependencies for the sake of clarity. A best-response strategy is also called a per-agent optimal strategy, and can be reached by iterating over the agents, which transforms the decentralized optimization problem (9) into a succession of conventional centralized functional optimization problems that can be tackled with conventional optimization tools. Yet, this best-response solution suffers from two important limitations. First, it only enables a weak form of cooperation, as solutions necessitating a tight inter-dependency between the agents cannot be reached. More specifically, each agent can only update its action unilaterally, which means that a solution requiring several agents to update their strategies at the same time cannot be reached. Second, it still requires solving a functional optimization problem at each agent, and hence is severely limited by the complexity when the dimension of the problem grows.

1) Centralized Training of Decentralized Strategies: We will now discuss how the Team-DNN (T-DNN) approach proposed in [77] allows us to leverage recent developments in deep learning to solve the two main challenges: (i) achieving a strong form of cooperation, and (ii) reducing the complexity. This approach is extended in [78] to the design via DNNs of instantaneous and quantized message exchanges between the transmitters, where it is highlighted how the joint optimization of message sharing and transmission provides more robustness. It is further extended to other settings and arbitrary constraints in [79].

As a first step towards optimizing (9) over the space of functions, it is natural to resort to a set of basis functions to reduce the dimensionality of the optimization space (see, e.g., [80]). Hence, we propose to restrict the strategy of agent j to belong to a parameterized subspace, i.e., to be of the form s_j^{\theta_j}, with \theta_j being a vector of real parameters. We will consider DNNs to parametrize the decision functions for their many advantages, in particular their efficient implementation and the abundant literature [1], [81], but other functional approximation methods could also be considered. Optimization problem (9) can then be approximated as

(\theta_1^\star, \ldots, \theta_n^\star) = \operatorname*{argmax}_{\theta_1, \ldots, \theta_n} \mathbb{E}\left[u(x, s_1^{\theta_1}(y_1), \ldots, s_n^{\theta_n}(y_n))\right]. \quad (12)

Following a data-driven approach, we then aim at maximizing the average performance using training samples from the known distribution. This is possible as the objective utility function u is known and differentiable. This will be achieved by centralized training to optimize over the decision functions using the training samples. In practice, this means that we will jointly update the parameter vectors of all the agents (\theta_1, \ldots, \theta_n) using the stochastic gradient approach during the training phase, as is standard in deep learning, i.e., at step k,

(\theta_1^{(k)}, \ldots, \theta_n^{(k)}) = (\theta_1^{(k-1)}, \ldots, \theta_n^{(k-1)}) + \alpha_k \nabla_{(\theta_1, \ldots, \theta_n)} u\left(x, s_1^{\theta_1^{(k-1)}}(y_1), \ldots, s_n^{\theta_n^{(k-1)}}(y_n)\right). \quad (13)

We illustrate the proposed T-DNN approach in Fig. 4. Interestingly, it remains an open problem to determine how efficiently these methods, designed in the centralized setting, work in the decentralized setting at hand. In particular, it is not known which DNN architectures are better suited for the decentralized setting, and how to improve the efficiency of the training.
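To make the centralized training step (13) concrete, the sketch below jointly trains two per-agent DNNs by stochastic gradient ascent on a common differentiable utility, with each network restricted to its own (noisy) local observation. The toy utility and all dimensions are illustrative assumptions, not the precoding objective of [77].

```python
import torch
import torch.nn as nn

# Sketch of centralized training of decentralized strategies (T-DNN style):
# one DNN per agent, each fed only its local observation y_j, all parameters
# updated jointly to maximize a shared utility u. The utility below is a toy
# stand-in for, e.g., an expected sum rate.

torch.manual_seed(0)
d = 4                                     # dimension of the network state x

def make_agent():
    return nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))

agent1, agent2 = make_agent(), make_agent()
opt = torch.optim.Adam(list(agent1.parameters()) + list(agent2.parameters()),
                       lr=1e-3)

def utility(x, d1, d2):
    # Rewards decisions that track the state mean while staying coordinated.
    target = x.mean(dim=1, keepdim=True)
    return -((d1 - target) ** 2 + (d2 - target) ** 2 + (d1 - d2) ** 2).mean()

for step in range(2000):
    x = torch.randn(256, d)               # samples of the true state
    y1 = x + 0.5 * torch.randn_like(x)    # agent 1 sees a noisy version
    y2 = x                                # agent 2 has perfect information
    u = utility(x, agent1(y1), agent2(y2))
    opt.zero_grad()
    (-u).backward()                       # gradient ascent on E[u]
    opt.step()
```

After training, each network is deployed at its own agent and executed on local information only, which is precisely the centralized-training, decentralized-execution split of Fig. 4.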
2) Application in Wireless Networks: Learning to Cooperate in Coordinated Power Control: To illustrate the application of the concepts described above in wireless networks, we consider a toy example consisting of two single-antenna transmitters, with imperfect estimates of the channel coefficients, serving two single-antenna receivers.

Fig. 4. Illustration of the T-DNN approach with centralized training and decentralized testing.

We consider a standard Rayleigh fading scenario and joint precoding across the transmitters with the goal of maximizing the sum rate. We also extend the previous team decision formulation (9) to allow a one-stage limited exchange of information between the two transmitters. We consider a T-DNN architecture, where a different DNN is used to parameterize each decision function, while all the DNNs are trained jointly. Following the assumption of a one-step exchange, the two DNNs generate the messages to be exchanged, while they also learn the power control from all the inputs.

We consider a simple CSI configuration in which transmitter 1 has noisy CSI, with all the coefficients corrupted by additive independent Gaussian noise of variance σ², while transmitter 2 has access to perfect CSI. To facilitate the qualitative interpretation, we furthermore reduce to an asymmetric setting where only transmitter 1 can share a message with transmitter 2 via one-step cooperation.

In Fig. 5, we show the average sum rate after normalization by the performance achieved when perfect CSI is handed over to a central node controlling both transmitters, which serves as an upper bound on the performance. We first observe that the naive use of DNNs, in which each transmitter applies its learning algorithm assuming that it controls the two transmit antennas (i.e., transmitter j is trained using only samples of the locally available CSI), is outperformed by the proposed T-DNN approach, which is hence more robust to the distributed CSI configuration at hand. We can also notice the benefit of a cooperation link: the T-DNNs have learned during training how to use the limited cooperation link between them to exchange useful information, and hence to coordinate in order to maximize the sum rate.

Fig. 5. Percentage of the average sum rate achieved by a centralized DNN as a function of the maximum transmit power P for different precoding schemes.

VI. LEARNING AT THE WIRELESS EDGE

There is an ongoing rapid growth in Internet of things (IoT) applications, which depend heavily on data collected by sensor nodes being continuously communicated to centralized processing units, typically located at the network edge, made possible by the emerging multi-access edge computing (MEC) paradigm. The data collected at these centralized units is processed to make inferences and predictions about the state of the system being monitored, which in turn may lead to status updates communicated to users, or action instructions delivered to actuators.

ML tools are increasingly deployed for the analysis of the huge amount of data collected from IoT devices. With an increasing number of successful and promising IoT applications and deployments, we expect that communication of IoT data for learning tasks will constitute a significant portion of the wireless network traffic in the near future. However, there are two potential roadblocks in front of this MEC-based centralized training approach. First of all, offloading all the data to a cloud processor for centralized training will be challenging, particularly in wireless networks with limited bandwidth and energy resources. This is particularly true for data-intensive applications, such as autonomous vehicles or virtual reality. For example, a self-driving car is expected to generate about one gigabyte of data per second, and continuously offloading such an amount of data to the edge network is not realistic. Privacy is another concern that can prevent centralized ML for most sensor data collected by IoT devices, e.g., smart meters [82] or electric vehicles [83]. While local processing of IoT data is an alternative, often a single device is limited in terms of both available data and computation power. An alternative is to implement learning at the wireless edge, also called edge learning [84], in the form of distributed stochastic gradient descent (DSGD) or federated learning (FL) [85].

It is now commonly accepted that the main bottleneck in distributed learning is the communication load [86]. Due to the lack of centralized processing capabilities, these algorithms depend heavily on information exchange between multiple learning agents, representing different devices each with its own local dataset, either through a 'master' orchestrating node, called a parameter server, or in a fully distributed fashion through device-to-device communications. In either case, distributed learning requires iterative information exchanges among the participating devices and the parameter server, where the devices share either their local gradient estimates in DSGD, or local model updates in FL.
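The following minimal sketch shows one communication round of DSGD with a parameter server, under the idealized assumption of error-free uplinks; the quadratic local losses merely stand in for neural network training objectives, and all names and sizes are illustrative.

```python
import numpy as np

# One round of DSGD with a parameter server: each device computes a gradient
# on its local data; the server averages the uploads and updates the model.

rng = np.random.default_rng(0)
K, d = 10, 50                       # number of devices, model dimension
theta = np.zeros(d)                 # global model kept at the server
data = [(rng.standard_normal((100, d)), rng.standard_normal(100))
        for _ in range(K)]          # (features, targets) per device

def local_gradient(theta, A, b):
    # Gradient of the local least-squares loss ||A theta - b||^2 / (2 n).
    return A.T @ (A @ theta - b) / len(b)

lr = 0.1
for t in range(100):
    grads = [local_gradient(theta, A, b) for A, b in data]  # uplink payloads
    theta -= lr * np.mean(grads, axis=0)                    # server update
```

Every entry of every gradient crosses the wireless uplink at every iteration, which is exactly why the communication load dominates in this setting.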

Fig. 6. Wireless edge learning where various devices communicate their local model estimates/gradients to the parameter server over a shared wireless channel.

Fig. 7. Accuracy of analog and digital transmission schemes for wireless edge learning with different number of users.

There have been numerous studies that focus on communication-efficient distributed ML. These studies can be grouped into three different approaches, namely quantization, sparsification, and local updates. Quantization algorithms aim at reducing the amount of information that needs to be communicated to convey the result of a local learning iteration, e.g., the local gradient estimate [87], [88]. Sparsification, on the other hand, reduces the communication load by transmitting only the important values of the local estimates [89]-[91]. Another approach is to reduce the frequency of communication from the devices by allowing local parameter updates [92], [93].
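As an illustration of the sparsification idea, the sketch below implements a top-k compressor with local error accumulation, a common variant in this family of schemes; the function and its details are illustrative rather than the exact algorithms of [89]-[91].

```python
import numpy as np

# Gradient sparsification for communication-efficient DSGD: each device
# sends only the k largest-magnitude entries and accumulates the rest
# locally ("error feedback"), so nothing is permanently discarded.

def sparsify_top_k(grad, residual, k):
    """Return (values, indices, new residual) for one device's upload."""
    g = grad + residual                         # add back what was not sent
    idx = np.argpartition(np.abs(g), -k)[-k:]   # indices of k largest entries
    values = g[idx]
    new_residual = g.copy()
    new_residual[idx] = 0.0                     # sent entries leave the residual
    return values, idx, new_residual

# Example: a 10,000-dimensional gradient compressed to 1% of its entries.
rng = np.random.default_rng(0)
grad = rng.standard_normal(10_000)
residual = np.zeros_like(grad)
values, idx, residual = sparsify_top_k(grad, residual, k=100)
```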
We remark, however, that these studies do not explicitly model the underlying communication channel between the devices and the parameter server, and mainly focus on large-scale distributed learning within server farms, where hundreds, maybe thousands, of machines collaborate to learn a high-dimensional model on an extremely large dataset. However, as we will show below, taking the particular channel model into account is critical in wireless edge learning, where the channel can be severely limiting.

Consider DSGD over a shared wireless medium, as illustrated in Fig. 6, where the transmission of local gradient estimates from the devices participating in the learning process to the parameter server can be formulated as a wireless computation problem [94]. One approach to this problem is to treat communication and computation separately, and exploit a coding scheme across computing agents such that each of them is assigned a non-zero rate to convey its gradient estimate at each iteration. Each agent then employs quantization to reduce the amount of information to be transmitted to the level that is allowed by the wireless channel. This can be called a 'separate digital' scheme, as the gradient estimates are converted into bits, which are communicated by independent channel codes.

Note, however, that the parameter server is interested only in the average of the gradient estimates, rather than their individual values. Accordingly, a much more efficient communication strategy is to transmit the local estimates without any coding, in an 'analog' fashion. If the devices are synchronized, then the wireless channel adds their estimates, directly conveying the desired value to the parameter server (which simply divides this sum by the number of devices to find the average). A random projection of the gradient estimates is proposed in [94] to reduce the required channel bandwidth. This approach can also be extended to the scenario with fading [95], [96], in which case power control can be employed at the devices to align their transmissions at the same received power level.

In Fig. 7 we illustrate the performance of the digital and analog computation approaches for learning over the wireless edge. The figure compares the training accuracy when a single-layer NN is trained on the MNIST dataset. A total of 60000 data samples are distributed across K devices, which employ DSGD with the ADAM optimizer. The figure compares the accuracy achieved for a fixed average transmit power value for each user. We observe that analog transmission of the gradient estimates achieves a significantly higher accuracy compared to first quantizing the estimates and then transmitting the quantized bits with a channel code. We also make an interesting observation from Fig. 7: while the accuracy of the analog scheme increases with the number of devices, as each additional device comes with its own power source, the digital scheme has an optimal number of devices, beyond which the accuracy degrades. This is because the channel resources per device become limited beyond this optimal number of devices, which, in turn, limits the accuracy of the gradient estimates that are conveyed to the parameter server.

Overall, the results highlight the fact that, for efficient ML at the wireless edge, communication and computation have to be considered jointly, and distributed ML can benefit from physical layer techniques to improve the efficiency and accuracy. A similar observation is also made in [97] by considering coded wireless computation in the map-shuffle-reduce framework, where physical layer techniques are leveraged to provide robustness against both device and channel uncertainties.
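The following toy simulation illustrates the over-the-air aggregation principle on a Gaussian multiple-access channel: all devices transmit their scaled gradients simultaneously and the server reads off a noisy average from the channel output. The common scaling factor, dimensions and noise level are illustrative assumptions; with fading, the per-device power control of [95], [96] would replace the fixed scaling.

```python
import numpy as np

# Toy over-the-air ('analog') gradient aggregation in the spirit of [94].

rng = np.random.default_rng(0)
K, d = 20, 1000                  # number of devices, gradient dimension
P, noise_std = 1.0, 0.1          # per-device power budget, channel noise

local_grads = rng.standard_normal((K, d))

# A common scaling (assumed agreed upon in advance) that keeps every device
# within its average power budget P per channel use.
alpha = np.sqrt(P * d) / np.linalg.norm(local_grads, axis=1).max()

# All devices transmit simultaneously and uncoded: the channel itself
# computes the sum of the scaled gradients.
received = alpha * local_grads.sum(axis=0) + noise_std * rng.standard_normal(d)

# The server only wants the average, so a single rescaling suffices.
avg_estimate = received / (K * alpha)
true_avg = local_grads.mean(axis=0)
err = np.linalg.norm(avg_estimate - true_avg) / np.linalg.norm(true_avg)
print(f"relative aggregation error: {err:.4f}")
```

Because the channel performs the summation for free, the bandwidth cost does not grow with the number of devices, which is consistent with the behavior of the analog scheme in Fig. 7.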

VII. CONCLUSIONS

We have presented what we hope is a stimulating overview of the promises and challenges of ML for the physical layer of wireless networks. We have presented a wide variety of wireless communications problems in which ML tools have been shown to offer significant gains. As indicated earlier, these correspond to scenarios in which either we do not have an accurate model of the system, or we have an accurate model but the optimal solution is extremely complex and thus cannot be attained with conventional means. Various power allocation problems have been presented as good examples of the latter scenario. The joint source-channel coding problem can be considered as exhibiting both limitations. In the case of image transmission, we do not have a good statistical model of natural images; however, even when transmitting Gaussian sources over a noisy channel, the optimal solution is not known for finite blocklengths, as the separation theorem fails. We have also highlighted another connection between ML and the wireless physical layer through edge learning. We have shown that the accuracy of distributed ML over wireless channels can benefit greatly from the joint treatment of the physical layer and the employed learning algorithm.

In light of these intriguing achievements and challenges, an important question that our research community will tackle in the next few years is the following: Will the ongoing ML revolution completely transform communication system design, so that we will soon be designing autonomous communication devices that do not need standards or protocols, and can simply learn to communicate with one another using data-driven ML techniques? Or do the existing drawbacks of ML-based techniques limit their relevance for communication systems, and we should instead "stick to our guns", relying on time-tested, highly optimized model-based approaches? While time will tell what the answer is, it will probably land somewhere in the middle; strong domain knowledge and model-based approaches will need to be combined with powerful data-driven ML techniques. Another important question relates to the use of physical layer techniques for edge learning: Given the rapid speed of developments in ML, and particularly edge learning, do we need new standards and new communication techniques that can sustain the growing demand for ML applications at the edge?

We believe that ML and data-driven approaches in general have a lot to offer to all aspects of the communication network architecture, and they have already started to have an impact on the higher layers [98]-[100]. Yet, to realize this promise, significant research efforts are needed, from the adaptation of existing ML techniques to the development of new ones that can meet the constraints and requirements of communication networks, including the implementation of at least some of these capabilities in low-power chips that can be used in mobile devices [101], [102], and/or the development of fully distributed, yet efficient implementations that can employ low-power, low-complexity mobile devices.

To conclude, one message that comes out loud and clear from our recent experience with deep learning and other data-driven approaches is that we should think big and be bold. Communications engineers are trained to think about physical models and optimal solutions, but the success of deep learning hinges on using lots of data together with 'naive' lightweight approaches, like stochastic gradient descent, to solve NP-hard problems. It takes quite a bit of cultural transformation to digest this. Moreover, the performance of these generic lightweight tools can be improved significantly through domain expertise in wireless communications, complemented with a thorough knowledge of the tricks of the trade in ML.

REFERENCES

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[2] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. Cambridge, MA: MIT Press, 2018.
[3] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Technical Journal, vol. 27, no. 3, pp. 379-423, Jul. 1948.
[4] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression. Norwell, MA, USA: Kluwer Academic Publishers, 1991.
[5] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Transactions on Information Theory, vol. 23, no. 3, pp. 337-343, May 1977.
[6] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, "The context-tree weighting method: Basic properties," IEEE Transactions on Information Theory, vol. 41, no. 3, pp. 653-664, May 1995.
[7] 3GPP TS 23.501, "System architecture for the 5G system," LTE Rel. 15, v 15.1.0, Mar. 2018.
[8] C. Zhang, P. Patras, and H. Haddadi, "Deep learning in mobile and wireless networking: A survey," CoRR, vol. abs/1803.04311, 2018. [Online]. Available: http://arxiv.org/abs/1803.04311
[9] M. Chen, U. Challita, W. Saad, C. Yin, and M. Debbah, "Machine learning for wireless networks with artificial intelligence: A tutorial on neural networks," CoRR, vol. abs/1710.02913, 2017. [Online]. Available: http://arxiv.org/abs/1710.02913
[10] L. Liang, H. Ye, and G. Y. Li, "Toward intelligent vehicular networks: A machine learning framework," IEEE Internet of Things Journal, vol. 6, no. 1, pp. 124-135, Feb. 2019.
[11] J. Jagannath, N. Polosky, A. Jagannath, F. Restuccia, and T. Melodia, "Machine learning for wireless communications in the internet of things: A comprehensive survey," CoRR, vol. abs/1901.07947, 2019. [Online]. Available: http://arxiv.org/abs/1901.07947
[12] X. Zhou, M. Sun, G. Y. Li, and B. Fred Juang, "Intelligent wireless communications enabled by cognitive radio and machine learning," China Communications, vol. 15, no. 12, pp. 16-48, Dec. 2018.
[13] O. Simeone, "A very brief introduction to machine learning with applications to communication systems," IEEE Trans. on Cognitive Commun. and Networking, vol. 4, no. 4, pp. 648-664, Dec. 2018.
[14] F. Doshi-Velez and B. Kim, "Towards a rigorous science of interpretable machine learning," arXiv e-prints, p. arXiv:1702.08608, Feb. 2017.
[15] K. Holstein, J. W. Vaughan, H. Daumé III, M. Dudík, and H. M. Wallach, "Improving fairness in machine learning systems: What do industry practitioners need?" CoRR, vol. abs/1812.05239, 2018.
[16] T. O'Shea and N. West, "Radio machine learning dataset generation with GNU Radio," Proceedings of the GNU Radio Conference, vol. 1, no. 1, 2016. [Online]. Available: https://pubs.gnuradio.org/index.php/grcon/article/view/11
[17] A. Alkhateeb, "DeepMIMO: A generic deep learning dataset for millimeter wave and massive MIMO applications," in Proc. of Information Theory and Applications Workshop (ITA), San Diego, CA, Feb. 2019, pp. 1-8.
[18] I. Nascimento, F. Mendes, M. Dias, A. Silva, and A. Klautau, "Deep learning in RAT and modulation classification with a new radio signals dataset," in Proc. XXXVI Simposio Brasileiro de Telecomunicacoes e Processamento de Sinais (SBrT), Brazil, Sep. 2018.
[19] M. Arnold, J. Hoydis, and S. ten Brink, "Novel massive MIMO channel sounding data applied to deep learning-based indoor positioning," in Proc. Int'l ITG Conf. on Systems, Communications and Coding (SCC), Feb. 2019.
[20] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, "Learning to optimize: Training deep neural networks for interference management," IEEE Transactions on Signal Processing, vol. 66, no. 20, pp. 5438-5453, Oct. 2018.

[21] O. Vinyals, M. Fortunato, and N. Jaitly, "Pointer networks," in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 2692-2700. [Online]. Available: http://papers.nips.cc/paper/5866-pointer-networks.pdf
[22] A. Milan, S. Rezatofighi, R. Garg, A. Dick, and I. Reid, "Data-driven approximations to NP-hard problems," 2017. [Online]. Available: https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14700
[23] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller, "Playing Atari with deep reinforcement learning," CoRR, vol. abs/1312.5602, 2013. [Online]. Available: http://arxiv.org/abs/1312.5602
[24] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, "Mastering the game of Go without human knowledge," Nature, vol. 550, pp. 354-359, Oct. 2017. [Online]. Available: https://doi.org/10.1038/nature24270
[25] N. Farsad and A. Goldsmith, "Neural network detection of data sequences in communication systems," IEEE Trans. Signal Process., vol. 66, no. 21, pp. 5663-5678, Nov. 2018.
[26] N. Samuel, T. Diskin, and A. Wiesel, "Deep MIMO detection," 2017. [Online]. Available: https://arxiv.org/pdf/1706.01151.pdf
[27] H. Ye, G. Y. Li, and B. Juang, "Power of deep learning for channel estimation and signal detection in OFDM systems," IEEE Wireless Comm. Letters, vol. 7, no. 1, pp. 114-117, Feb. 2018.
[28] P. Jiang, T. Wang, B. Han, X. Gao, J. Zhang, C.-K. Wen, S. Jin, and G. Y. Li, "Artificial intelligence-aided OFDM receiver: Design and experimental results," arXiv e-prints, p. arXiv:1812.06638, Dec. 2018.
[29] E. Nachmani, Y. Be'ery, and D. Burshtein, "Learning to decode linear codes using deep learning," in Proc. Allerton Conference on Communication, Control, and Computing (Allerton), 2016.
[30] S. Cammerer, T. Gruber, J. Hoydis, and S. ten Brink, "Scaling deep learning-based decoding of polar codes via partitioning," 2017. [Online]. Available: https://arxiv.org/abs/1702.06901
[31] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, "On deep learning-based channel decoding," in Proc. Conference on Information Sciences and Systems (CISS), 2017.
[32] A. Felix, S. Cammerer, S. Dörner, J. Hoydis, and S. ten Brink, "OFDM-autoencoder for end-to-end learning of communications systems," CoRR, vol. abs/1803.05815, 2018. [Online]. Available: http://arxiv.org/abs/1803.05815
[33] V. Raj and S. Kalyani, "Backpropagating through the air: Deep learning at physical layer without channel models," IEEE Communications Letters, vol. 22, no. 11, pp. 2278-2281, Nov. 2018.
[34] D. Neumann, T. Wiese, and W. Utschick, "Learning the MMSE channel estimator," IEEE Transactions on Signal Processing, vol. 66, pp. 2905-2917, 2018.
[35] M. Soltani, A. Mirzaei, V. Pourahmadi, and H. Sheikhzadeh, "Deep learning-based channel estimation," CoRR, vol. abs/1810.05893v2, 2018. [Online]. Available: http://arxiv.org/abs/1810.05893
[36] M. Koller, C. Hellings, M. Knoedlseder, T. Wiese, D. Neumann, and W. Utschick, "Machine learning for channel estimation from compressed measurements," in Proc. 15th International Symposium on Wireless Communication Systems (ISWCS), 2018, pp. 1-5.
[37] H. He, C. Wen, S. Jin, and G. Y. Li, "Deep learning-based channel estimation for beamspace mmWave massive MIMO systems," CoRR, vol. abs/1802.01290, 2018. [Online]. Available: http://arxiv.org/abs/1802.01290
[38] D. J. Love, R. W. Heath, V. K. N. Lau, D. Gesbert, B. D. Rao, and M. Andrews, "An overview of limited feedback in wireless communication systems," IEEE J. Sel. Areas Commun., vol. 26, no. 8, pp. 1341-1365, Oct. 2008.
[39] P. Kuo, H. T. Kung, and P. Ting, "Compressive sensing based channel feedback protocols for spatially-correlated massive antenna arrays," in Proc. IEEE Wireless Commun. and Netw. Conf. (WCNC), Apr. 2012, pp. 492-497.
[40] X. Rao and V. K. N. Lau, "Distributed compressive CSIT estimation and feedback for FDD multi-user massive MIMO systems," IEEE Transactions on Signal Processing, vol. 62, no. 12, pp. 3261-3271, June 2014.
[41] R. Setiono and G. Lu, "Image compression using a feedforward neural network," in Proc. IEEE International Conference on Neural Networks (ICNN'94), vol. 7, June 1994, pp. 4761-4765.
[42] J. Balle, V. Laparra, and E. P. Simoncelli, "End-to-end optimized image compression," in Proc. Int. Conf. on Learning Representations (ICLR), Apr. 2017, pp. 1-27.
[43] J. Han, S. Lombardo, C. Schroers, and S. Mandt, "Deep probabilistic video compression," CoRR, vol. abs/1810.02845, 2018.
[44] S. Kankanahalli, "End-to-end optimized speech coding with deep neural networks," CoRR, vol. abs/1710.09064, 2017. [Online]. Available: http://arxiv.org/abs/1710.09064
[45] 3GPP-3GPP2 Spatial Channel Model Ad-hoc Group, "Spatial channel model for multiple input multiple output (MIMO) simulations," 3GPP TR 25.996, Sep. 2003.
[46] The IST-WINNER project. [Online]. Available: http://www.ist-winner.org/
[47] V. Erceg, K. V. S. Hari et al., "Channel models for fixed wireless applications," Contribution IEEE 802.16.3c-01/29r4, IEEE 802.16 Broadband Wireless Access Working Group.
[48] L. Liu, C. Oestges, J. Poutanen, K. Haneda, P. Vainikainen, F. Quitin, F. Tufvesson, and P. D. Doncker, "The COST 2100 MIMO channel model," IEEE Wireless Commun., vol. 19, no. 6, pp. 92-99, Dec. 2012.
[49] T. Wang, C. Wen, S. Jin, and G. Y. Li, "Deep learning-based CSI feedback approach for time-varying massive MIMO channels," IEEE Wireless Communications Letters, vol. 8, no. 2, pp. 416-419, Apr. 2019.
[50] C. Lu, W. Xu, H. Shen, J. Zhu, and K. Wang, "MIMO channel information feedback using deep recurrent network," IEEE Commun. Lett., vol. 23, no. 1, pp. 188-191, Jan. 2019.
[51] Z. Liu, L. Zhang, and Z. Ding, "Exploiting bi-directional channel reciprocity in deep learning for low rate massive MIMO CSI feedback," IEEE Wireless Commun. Lett., pp. 1-1, 2019.
[52] L. Theis, W. Shi, A. Cunningham, and F. Huszár, "Lossy image compression with compressive autoencoders," in Proc. Int. Conf. on Learning Representations (ICLR), 2017.
[53] G. Toderici, S. M. O'Malley et al., "Variable rate image compression with recurrent neural networks," in Proc. Int. Conf. on Learning Representations (ICLR), 2016.
[54] Q. Yang, M. B. Mashhadi, and D. Gündüz, "Deep convolutional compression for massive MIMO CSI feedback," in Proc. IEEE International Workshop on Machine Learning for Signal Processing (MLSP), 2019.
[55] T. J. O'Shea, K. Karra, and T. C. Clancy, "Learning to communicate: Channel auto-encoders, domain specific regularizers, and attention," CoRR, vol. abs/1608.06409, 2016. [Online]. Available: http://arxiv.org/abs/1608.06409
[56] T. O'Shea and J. Hoydis, "An introduction to deep learning for the physical layer," IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563-575, Dec. 2017.
[57] H. Ye, G. Y. Li, B. F. Juang, and K. Sivanesan, "Channel agnostic end-to-end learning based communication systems with conditional GAN," CoRR, vol. abs/1807.00447, 2018. [Online]. Available: http://arxiv.org/abs/1807.00447
[58] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, "Deep learning-based communication over the air," 2017. [Online]. Available: https://arxiv.org/abs/1707.03384
[59] E. Bourtsoulatze, D. B. Kurka, and D. Gündüz, "Deep joint source-channel coding for wireless image transmission," CoRR, vol. abs/1809.01733, 2018. [Online]. Available: http://arxiv.org/abs/1809.01733
[60] T. Cover and A. Thomas, Elements of Information Theory. Wiley-Interscience, Jul. 2006.
[61] D. B. Kurka and D. Gündüz, "Successive refinement of images with deep joint source-channel coding," CoRR, vol. abs/1903.06333, 2019. [Online]. Available: http://arxiv.org/abs/1903.06333
[62] M. S. Ibrahim, A. S. Zamzam, X. Fu, and N. D. Sidiropoulos, "Learning-based antenna selection for multicasting," in Proc. IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), June 2018, pp. 1-5.
[63] V. Ntranos, N. D. Sidiropoulos, and L. Tassiulas, "On multicast beamforming for minimum outage," IEEE Transactions on Wireless Communications, vol. 8, no. 6, pp. 3172-3181, June 2009.
[64] Y. Shi, A. Konar, N. D. Sidiropoulos, X. Mao, and Y. Liu, "Learning to beamform for minimum outage," IEEE Transactions on Signal Processing, vol. 66, no. 19, pp. 5180-5193, Oct. 2018.
[65] S. S. Christensen, R. Agarwal, E. Carvalho, and J. M. Cioffi, "Weighted sum-rate maximization using weighted MMSE for MIMO-BC beamforming design," IEEE Trans. on Wireless Commun., vol. 7, no. 12, pp. 4792-4799, Dec. 2008.
[66] Q. Shi, M. Razaviyayn, Z. Luo, and C. He, "An iteratively weighted MMSE approach to distributed sum-utility maximization for a MIMO interfering broadcast channel," IEEE Trans. Signal Process., vol. 59, no. 9, pp. 4331-4340, Sep. 2011.

[67] J. Kaleva, A. Tolli, and M. Juntti, "Successive convex approximation for simultaneous linear TX/RX design in MIMO BC," in Proc. IEEE Asilomar Conference on Signals, Systems and Computers (ACSSC), Nov. 2015.
[68] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, "Learning to optimize: Training deep neural networks for wireless resource management," in Proc. IEEE 18th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), July 2017, pp. 1-6.
[69] W. Lee, M. Kim, and D. Cho, "Deep power control: Transmit power control scheme based on convolutional neural network," IEEE Communications Letters, vol. 22, no. 6, pp. 1276-1279, June 2018.
[70] R. Radner, "Team decision problems," The Annals of Mathematical Statistics, 1962.
[71] J. Marschak and R. Radner, Economic Theory of Teams. New Haven and London: Yale University Press, Feb. 1972.
[72] D. Gesbert and P. de Kerret, "Team methods for device cooperation in wireless networks," in Cooperative and Graph Signal Processing, P. M. Djuric and C. Richard, Eds. Academic Press, 2018, ch. 18, pp. 469-487.
[73] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[74] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y. Liang, and D. I. Kim, "Applications of deep reinforcement learning in communications and networking: A survey," IEEE Communications Surveys & Tutorials, pp. 1-1, 2019.
[75] D. A. Awan, R. L. G. Cavalcante, and S. Stanczak, "A robust machine learning method for cell-load approximation in wireless networks," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[76] J. Nash, "Non-cooperative games," Annals of Mathematics, 1951.
[77] P. de Kerret and D. Gesbert, "Robust decentralized joint precoding using team deep neural network," in Proc. IEEE International Symposium on Wireless Communication Systems (ISWCS), Aug. 2018.
[78] M. Kim, P. de Kerret, and D. Gesbert, "Robust decentralized joint precoding using team deep neural network," in Proc. IEEE Asilomar Conference on Signals, Systems and Computers (ACSSC), Nov. 2018.
[79] H. Lee, S. Hyun, and T. Q. S. Quek, "Deep learning for distributed optimization: Applications to wireless resource management," IEEE J. Sel. Areas Commun., 2019.
[80] G. Grecco and M. Sanguinetti, "Smooth optimal decision strategies for static team optimization problems and their approximations," in SOFSEM 2010: Theory and Practice of Computer Science, 2010.
[81] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Muller, "Efficient BackProp," in Neural Networks: Tricks of the Trade, ser. Lecture Notes in Computer Science, vol. 7700. Springer, 2012.
[82] G. Giaconi, D. Gunduz, and H. V. Poor, "Privacy-aware smart metering: Progress and challenges," IEEE Signal Processing Magazine, vol. 35, no. 6, pp. 59-78, Nov. 2018.
[83] N. Saxena, S. Grijalva, V. Chukwuka, and A. V. Vasilakos, "Network security and privacy challenges in smart vehicle-to-grid," IEEE Wireless Communications, vol. 24, no. 4, pp. 88-98, Aug. 2017.
[84] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, "Wireless network intelligence at the edge," CoRR, vol. abs/1812.02858, 2018. [Online]. Available: http://arxiv.org/abs/1812.02858
[85] J. Konecný, H. B. McMahan, D. Ramage, and P. Richtárik, "Federated optimization: Distributed machine learning for on-device intelligence," CoRR, vol. abs/1610.02527, 2016. [Online]. Available: http://arxiv.org/abs/1610.02527
[86] M. Li, D. G. Andersen, A. J. Smola, and K. Yu, "Communication efficient distributed machine learning with the parameter server," in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 19-27.
[87] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," in Proc. ICML, Jul. 2015.
[88] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, "1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs," in Proc. INTERSPEECH, Singapore, Sep. 2014, pp. 1058-1062.
[89] N. Strom, "Scalable distributed DNN training using commodity GPU cloud computing," in Proc. INTERSPEECH, 2015.
[90] A. F. Aji and K. Heafield, "Sparse communication for distributed gradient descent," arXiv:1704.05021v2 [cs.CL], Jul. 2017.
[91] F. Sattler, S. Wiedemann, K. Müller, and W. Samek, "Sparse binary compression: Towards distributed deep learning with minimal communication," CoRR, vol. abs/1805.08768, 2018.
[92] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proc. 20th International Conference on Artificial Intelligence and Statistics (AISTATS), ser. Proceedings of Machine Learning Research, A. Singh and J. Zhu, Eds., vol. 54. Fort Lauderdale, FL, USA: PMLR, Apr. 2017, pp. 1273-1282.
[93] T. Chen, G. B. Giannakis, T. Sun, and W. Yin, "LAG: Lazily aggregated gradient for communication-efficient distributed learning," in Proc. 32nd International Conference on Neural Information Processing Systems (NIPS'18), 2018, pp. 5055-5065.
[94] M. Mohammadi Amiri and D. Gündüz, "Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air," CoRR, vol. abs/1901.00844, 2019. [Online]. Available: http://arxiv.org/abs/1901.00844
[95] ——, "Federated learning over wireless fading channels," CoRR, vol. abs/1907.09769, 2019. [Online]. Available: https://arxiv.org/abs/1907.09769
[96] G. Zhu, Y. Wang, and K. Huang, "Low-latency broadband analog aggregation for federated edge learning," CoRR, vol. abs/1812.11494, 2018.
[97] S. Ha, J. Zhang, O. Simeone, and J. Kang, "Coded federated computing in wireless networks with straggling devices and imperfect CSI," CoRR, vol. abs/1901.05239, 2019. [Online]. Available: http://arxiv.org/abs/1901.05239
[98] R. Li, Z. Zhao, X. Zhou, G. Ding, Y. Chen, Z. Wang, and H. Zhang, "Intelligent 5G: When cellular networks meet artificial intelligence," IEEE Wireless Communications, vol. 24, no. 5, pp. 175-183, Oct. 2017.
[99] M. G. Kibria, K. Nguyen, G. P. Villardi, K. Ishizu, and F. Kojima, "Big data analytics and artificial intelligence in next-generation wireless networks," CoRR, vol. abs/1711.10089, 2017. [Online]. Available: http://arxiv.org/abs/1711.10089
[100] V. P. Kafle, Y. Fukushima, P. Martinez-Julia, and T. Miyazawa, "Consideration on automation of 5G network slicing with machine learning," in Proc. ITU Kaleidoscope: Machine Learning for a 5G Future (ITU K), Nov. 2018, pp. 1-8.
[101] S. I. Venieris, A. Kouris, and C. Bouganis, "Deploying deep neural networks in the embedded space," CoRR, vol. abs/1806.08616, 2018. [Online]. Available: http://arxiv.org/abs/1806.08616
[102] H. Yoo, "Intelligence on silicon: From deep-neural-network accelerators to brain-mimicking AI-SoCs," in Proc. IEEE Int'l Solid-State Circuits Conf. (ISSCC), Feb. 2019, pp. 20-26.

Deniz Gündüz (S'03-M'08-SM'13) received the B.S. degree in electrical and electronics engineering from METU, Turkey in 2002, and the M.S. and Ph.D. degrees in electrical engineering from NYU Tandon School of Engineering (formerly Polytechnic University) in 2004 and 2007, respectively. After his PhD, he served as a postdoctoral research associate at Princeton University, and as a consulting assistant professor at Stanford University. He was a research associate at CTTC in Barcelona, Spain until September 2012, when he joined the Electrical and Electronic Engineering Department of Imperial College London, UK, where he is currently a Reader (Associate Professor) in information theory and communications, and leads the Information Processing and Communications Laboratory (IPC-Lab).

His research interests lie in the areas of communications and information theory, machine learning, and privacy. Dr. Gündüz is an Editor of the IEEE Transactions on Green Communications and Networking, and a Guest Editor of the IEEE Journal on Selected Areas in Communications, Special Issue on Machine Learning in Wireless Communication. He served as an Editor of the Transactions on Communications from 2013 until 2018. He is the recipient of the IEEE Communications Society Communication Theory Technical Committee (CTTC) Early Achievement Award in 2017, a Starting Grant of the European Research Council (ERC) in 2016, the IEEE Communications Society Best Young Researcher Award for the Europe, Middle East, and Africa Region in 2014, the Best Paper Award at the 2016 IEEE Wireless Communications and Networking Conference (WCNC), and the Best Student Paper Awards at the 2018 IEEE Wireless Communications and Networking Conference (WCNC) and the 2007 IEEE International Symposium on Information Theory (ISIT). He was the General Co-chair of the 2019 London Symposium on Information Theory, 2018 International ITG Workshop on Smart Antennas, 2016 IEEE Information Theory Workshop, and 2012 European School of Information Theory.

Paul de Kerret received in 2010 an Engineering degree from the French Graduate School IMT Atlantique and a Diploma degree in Electrical Engineering and Information Technology from the Munich University of Technology through a double degree program, and in 2013 a Ph.D. degree from the French Graduate School Télécom Paris. Since 2015, he has been working as a Senior Researcher at EURECOM as part of the ERC-funded project PERFUME, investigating how to enable efficient and decentralized cooperation to boost performance in future wireless networks. He is particularly active in the decentralized use of machine learning methods to solve coordination problems. He has been involved in several European collaborative projects on mobile communications, co-presented several tutorials at major IEEE international conferences, and authored over 30 papers in IEEE flagship conferences.

Nicholas D. Sidiropoulos (F'09) received the Diploma degree in electrical engineering from Aristotelian University of Thessaloniki, Thessaloniki, Greece, and the M.S. and Ph.D. degrees in Electrical Engineering from the University of Maryland-College Park, College Park, MD, USA, in 1988, 1990, and 1992, respectively. He has served on the faculty of the University of Virginia (UVA), University of Minnesota, and the Technical University of Crete, Greece, prior to his current appointment as Louis T. Rader Professor and Chair of the Electrical and Computer Engineering Department at UVA.

His research interests are in signal processing, communications, optimization, tensor decomposition, and factor analysis, with applications in machine learning and communications. He received the NSF/CAREER award in 1998, the IEEE Signal Processing Society (SPS) Best Paper Award in 2001, 2007, and 2011, served as IEEE SPS Distinguished Lecturer (2008-2009), and currently serves as Vice President - Membership of IEEE SPS. He received the 2010 IEEE Signal Processing Society Meritorious Service Award, and the 2013 Distinguished Alumni Award from the University of Maryland, Dept. of ECE. He is a Fellow of IEEE (2009) and a Fellow of EURASIP (2014).

David Gesbert (IEEE Fellow) is Professor and Head of the Communication Systems Department, EURECOM. He obtained the Ph.D. degree from Ecole Nationale Superieure des Telecommunications, France, in 1997. From 1997 to 1999 he was with the Information Systems Laboratory, Stanford University. He was then a founding engineer of Iospan Wireless Inc., a Stanford spin-off pioneering MIMO-OFDM (now Intel). Before joining EURECOM in 2004, he was with the Department of Informatics, University of Oslo as an adjunct professor. D. Gesbert has published about 300 papers and 25 patents, some of them winning the 2019 ICC Best Paper Award, the 2015 IEEE Best Tutorial Paper Award (Communications Society), the 2012 SPS Signal Processing Magazine Best Paper Award, the 2004 IEEE Best Tutorial Paper Award (Communications Society), the 2005 Young Author Best Paper Award for Signal Processing Society journals, and paper awards at the 2011 IEEE SPAWC and 2004 ACM MSWiM conferences. He has been a Technical Program Co-chair for ICC 2017. He was named a Thomson Reuters Highly Cited Researcher in Computer Science. Since 2015, he holds the ERC Advanced grant "PERFUME" on the topic of smart device communications in future wireless networks. Since early 2019, he heads the Huawei-funded Chair on Advanced Wireless Systems Towards 6G Networks. Since 2017 he is also a visiting Academic Master within the Program 111 at the Beijing University of Posts and Telecommunications, as well as a member of the Joint BUPT-EURECOM Open5G Lab. He is a Board Member for the OpenAirInterface (OAI) Software Alliance.

Chandra R. Murthy received the B. Tech. degree


in Electrical Engineering from the Indian Institute
of Technology Madras, Chennai, India, in 1998, the
M.S. and Ph.D. degrees in Electrical and Computer
Engineering from Purdue University, West Lafayette,
IN and the University of California, San Diego,
CA, in 2000 and 2006, respectively. From 2000
to 2002, he worked as an engineer for Qualcomm
Inc., San Jose, USA, where he worked on WCDMA
baseband transceiver design and 802.11b baseband
receivers. From 2006 to 2007, he worked as a staff
engineer at Beceem Communications Inc., Bangalore, India on advanced
receiver architectures for the 802.16e Mobile WiMAX standard. Currently,
he is working as a Professor in the department of Electrical Communication
Engineering at the Indian Institute of Science, Bangalore, India.
His research interests are in the areas of energy harvesting communications,
multiuser MIMO systems, and sparse signal recovery techniques applied
to wireless communications. His paper won the best paper award in the
Communications Track at NCC 2014 and a paper co-authored with his student
won the student best paper award at the IEEE ICASSP 2018. He has 50+
journal papers and 80+ conference papers to his credit. He was an associate
editor for the IEEE Signal Processing Letters during 2012-16. He is an elected
member of the IEEE SPCOM Technical Committee for the years 2014-16,
and has been re-elected for the 2017-19 term. He is a past Chair of the
IEEE Signal Processing Society, Bangalore Chapter. He is currently serving
as an associate editor for the IEEE Transactions on Signal Processing and
IEEE Transactions on Information Theory, and as an editor for the IEEE
Transactions on Communications.

Mihaela van der Schaar is John Humphrey Plum-


mer Professor of Machine Learning, Artificial Intel-
ligence and Medicine at the University of Cambridge
and a Turing Fellow at The Alan Turing Institute in
London, where she leads the effort on data science
and machine learning for personalised medicine. She
is an IEEE Fellow (2009). She has received the Oon
Prize on Preventative Medicine from the University
of Cambridge (2018). She has also been the recipient
of an NSF Career Award, 3 IBM Faculty Awards,
the IBM Exploratory Stream Analytics Innovation
Award, the Philips Make a Difference Award and several best paper awards,
including the IEEE Darlington Award. She holds 35 granted USA patents.
