FSSC: Federated Learning of Transformer Neural Networks for Semantic Image Communication

Yuna Yan1, Xin Zhang1, Lixin Li1, Wensheng Lin1, Rui Li2, Wenchi Cheng3, and Zhu Han4 1School of Electronics and Information, Northwestern Polytechnical University, Xi’an, China, 710129 2Samsung AI Center, Cambridge, UK 3State Key Laboratory of Integrated Services Networks, Xidian University, Xi’an, China, 710071 4Department of Electrical and Computer Engineering, University of Houston, Houston, TX, 77004

Abstract

In this paper, we address the problem of image semantic communication in a multi-user deployment scenario and propose a federated learning (FL) strategy for a Swin Transformer-based semantic communication system (FSSC). First, we demonstrate that adopting a Swin Transformer for joint source-channel coding (JSCC) effectively extracts semantic information in the communication system. Next, an FL framework is introduced to collaboratively learn a global model by aggregating local model parameters rather than directly sharing clients’ data. This approach enhances user privacy protection and reduces the workload on the server or mobile edge. Simulation results indicate that our method outperforms the typical JSCC algorithm and traditional separation-based communication algorithms. In particular, after integrating local semantics, the global aggregated model further increases the peak signal-to-noise ratio (PSNR) by more than 2 dB, demonstrating the effectiveness of our algorithm.

Index Terms:

Semantic communication, federated learning, Swin Transformer, privacy protection.

I Introduction

Wireless communication is embracing an explosive growth of data traffic of a semantic nature, generated by various contemporary multimedia applications such as AR/VR, video streaming, gaming and telehealth. Traditional communication systems first encode the source messages (e.g., images and texts) with source coding algorithms on the application layer in order to remove redundant information. Subsequently, when the compressed information reaches the physical layer, it is further channel-encoded by injecting redundant bits so that it is resilient to the noise in the communication channel. Three major factors make this approach inefficient in the 5th Generation Mobile Communication Technology (5G) and future generations of mobile networks. First, although modern channel coding algorithms on their own have theoretically reached the Shannon limit on the physical layer [1], when source coding and channel coding are considered jointly, Shannon’s optimality theory holds only when the message blocks are asymptotically infinitely long, which is impractical in the real world [2, 3, 4, 5, 6]. Second, the separation-based approach [7] compresses and then upscales messages, thus inherently introducing extra latency, which leads to inferior performance for major time-sensitive 5G/6G and upcoming applications, e.g., the tactile internet and autonomous driving. Third, channel coding strategies treat all bits of information equally, missing the opportunity to utilize the abundant semantic information available in the messages, especially in image and video content.

To address these issues in the traditional communication regime, semantic communication [8, 9], also known as joint source and channel coding, has been proposed and studied. It focuses on the transmission of meaning rather than bits, helping to escape the “Shannon trap” [10]. Compared to conventional communication methods, semantic communication is centered around the semantic features that are essential for transmitting information, thereby eliminating unnecessary redundancy. This approach leads to a significant reduction in data transmission latency and bandwidth consumption. To enhance the versatility of semantic communication systems, we propose a Swin Transformer-based semantic communication (STSC) system that primarily focuses on the reconstruction of global image signals. Its architecture enables the image-oriented semantic communication system to capture deeper semantic features from potential semantic information and to represent complex semantic relationships within the input data. In addition, the Swin Transformer reduces computational complexity and speeds up both model training and inference compared to traditional Transformer models. As a result, it is well suited to large-scale semantic communication, especially in multi-user deployments.

Figure 1: The overall architecture of the proposed FSSC system.

However, image-oriented semantic communication targets applications that often involve users’ private data, such as their faces, home interiors and license plates, which cannot always be uploaded to a cloud server. Moreover, most prior work on image-based semantic communication considers point-to-point systems, which do not fit the many applications where multiple users are involved. In this paper, we propose a multi-user federated learning (FL) framework for semantic communication, showcasing the potential of efficiently learning a global semantic communication model that combines the knowledge learned from multiple users without direct access to users’ private data. We demonstrate that the proposed FL semantic system, when powered by a hierarchical transformer architecture, i.e., the Swin Transformer, achieves better image transmission quality at a faster convergence rate than the single-user counterpart.

The main contributions of this paper are summarized as follows:

• We propose a Swin Transformer-based semantic communication architecture, which aims to minimize the difference between the transmitted and received images so as to effectively communicate image semantic information.

• In contrast to the centralised approach of sharing user data directly with a cloud server in order to train a transformer neural network for semantic communication, we consider a cross-silo FL setting in which multiple users train their own models locally on local data and share the local updates with a global server, thereby preserving users’ data privacy.

• Using a well-established computer vision dataset, simulation results show that our proposed framework surpasses the representative JSCC methods, and that aggregating local semantics further boosts the peak signal-to-noise ratio (PSNR) by over $2$ dB.

The rest of this paper is organized as follows. In Section II, we introduce the framework of the image semantic communication system based on FL and formulate the corresponding research problem. We then present the federated learning strategy for a Swin Transformer-based semantic communication (FSSC) in Section III. In Section IV, we present numerical simulation results and showcase the semantic communication performance of FSSC as well as the convergence of the system. Finally, the paper is concluded in Section V.

II System Setup

In this paper, we consider a multi-user FL semantic communication system, as shown in Fig. 1, which consists of a cloud server and multiple clients. An FSSC client can be an edge server or a terminal device. When the client is an edge server, it is able to perform computational tasks, hence the training and inference of STSC happen locally on the client. When the client is a lightweight terminal device with limited or no computational power, it should be served by a corresponding edge server that the client trusts enough to share its data with for training. The federated training process is detailed as follows:

1. The cloud server initializes the STSC model for semantic communication and distributes it to all participating clients. Each client then performs $n$ epochs of local training using its own data. Once local training is completed, all clients transmit the updated model parameters to the cloud server, which aggregates them following Federated Averaging [13]; this completes one round of federated learning. The process is repeated until the model converges;

2. During deployment of a trained STSC model, a user initiates a communication request and sends a semantic message to the client. The client processes this message by running inference with the trained STSC model on the message sample, which serves as the joint source-channel encoding process, and then transmits the output of the STSC model through the uplink channel to the receiver;

3. The transmitted message then passes through a noisy channel, which typically corrupts the message with noise. Once the receiver receives the message, it uses the decoding part of the STSC model to recover it. Note that the receiver in this setting can be either a cloud server or a client.

Semantic information extraction and image reconstruction are the key to the success of this system. Therefore, in this paper, the average mean square error (MSE) between the original input image and the reconstructed image is used as the loss function, which is defined as

$$\mathrm{MSE}=\frac{1}{L}\sum_{i=1}^{n}\left(\boldsymbol{x}_{i}-\hat{\boldsymbol{x}}_{i}\right)^{2}, \tag{1}$$

where $\boldsymbol{x}_{i}$ is the original image vector, $\hat{\boldsymbol{x}}_{i}$ is the reconstructed image vector, and $L$ represents the length of the image vector. A smaller MSE indicates that the reconstructed image is closer to the original image, i.e., that the quality of image reconstruction is better.
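As a rough illustration of how the MSE loss relates to the PSNR reported later in the paper, the following minimal sketch computes both for a pair of pixel vectors. The PSNR formula used here, $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$, is the standard definition and is not spelled out in this paper; the peak value of 255 (8-bit pixels), the function names and the sample values are assumptions made for the example.

```python
import math


def mse(x, x_hat):
    """Mean squared error between two equal-length pixel vectors."""
    if len(x) != len(x_hat):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def psnr(x, x_hat, max_val=255.0):
    """PSNR in dB; max_val is the assumed peak pixel value (8-bit images)."""
    err = mse(x, x_hat)
    if err == 0:
        return math.inf  # identical vectors: PSNR is unbounded
    return 10.0 * math.log10(max_val ** 2 / err)


# Hypothetical 4-pixel example (values chosen only for illustration).
original = [52.0, 55.0, 61.0, 59.0]
reconstructed = [54.0, 55.0, 60.0, 57.0]
print(mse(original, reconstructed))             # 2.25
print(round(psnr(original, reconstructed), 2))  # 44.61
```

Because the PSNR is a decreasing function of the MSE, minimizing the loss in (1) during training directly raises the PSNR used for evaluation.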

III Image Semantic Encoder and Decoder Network

In this section, we first present an edge architecture for Swin Transformer-based semantic communication (STSC), which aims to capture long-range hierarchical semantic information from images. Additionally, to enhance the accuracy of semantic information extraction and ensure user privacy, we introduce the federated learning strategy for Swin Transformer-based semantic communication (FSSC), in which the edge models on multiple devices are trained with FL and coordinated by a cloud server. As a result, our proposed scheme can effectively learn semantic representations from images transmitted by different users and ensure reliable communication over various channels.

III-A STSC Model

Fig. 2 shows the structure of the semantic communication network based on the Swin Transformer. In this paper, joint source-channel coding is adopted for semantic communication, in which the Swin Transformer module [11, 12] is used for encoding and decoding. Meanwhile, a non-trainable fully connected layer is used to simulate the physical channel.

At the transmitter, a set of $N$ images $\mathbf{X}=\left\{\boldsymbol{x}_{i}\right\}_{i=1}^{N}$ is given, where $\boldsymbol{x}_{i}\in\mathbb{R}^{3\times H\times W}$ denotes the $i$-th image, and $H$ and $W$ denote the height and width of the image, respectively. First, the patch partition module divides each image into non-overlapping patches of size $4\times 4$, which converts the tensor dimension to $(H/4, W/4, 48)$, since each patch flattens $4\times 4\times 3=48$ values. The resulting sequence of feature blocks is then linearly embedded: a learnable embedding matrix $\boldsymbol{E}$ projects each feature block into an embedding of arbitrary dimension $C$. The embedded feature blocks are fed into the Swin Transformer module, as shown in Fig. 2. After passing through the first Swin Transformer block, the dimension is $(H/4, W/4, C)$; this whole process is called Stage 1.

The first patch merging layer and the subsequent Swin Transformer block together form Stage 2. The patch merging layer concatenates each group of $2\times 2$ adjacent patches, so that the number of patch tokens becomes $1/4$ of the original, i.e., $H/8\times W/8$, while the dimension of each patch token is expanded four times, i.e., to $4C$. In order to reduce the output dimension and realize the downsampling of the feature map, the patch merging layer then performs a fully connected operation (implemented by a $1\times 1$ convolutional layer), which reduces the dimension of the concatenated feature patch from $4C$ to $2C$. The patches then go through the Swin Transformer block for feature transformation, and the output dimension becomes $(H/8, W/8, 2C)$. Finally, the output $\tilde{\boldsymbol{x}}$ passes through a fully connected layer to facilitate transmission in the channel. Specifically, it can be expressed as

$$\boldsymbol{s}=\boldsymbol{W}_{1}\tilde{\boldsymbol{x}}+\boldsymbol{b}_{1}, \tag{2}$$

where $\boldsymbol{W}_{1}$ is the weight matrix of the fully connected layer, and $\boldsymbol{b}_{1}$ is its bias.
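The shape bookkeeping of the two encoder stages described above can be traced with a small helper. This is only a sketch of the dimension arithmetic, not of the network itself; the function name and dictionary keys are hypothetical, and the example values $H=W=32$ and $C=32$ are taken from the simulation settings in Section IV-A.

```python
def stsc_encoder_shapes(height, width, embed_dim, patch=4, channels=3):
    """Return the (rows, cols, feature_dim) shape after each encoder step.

    Only tracks tensor shapes; no Swin Transformer computation is performed.
    """
    if height % (2 * patch) or width % (2 * patch):
        raise ValueError("height and width must be divisible by 2 * patch")
    rows, cols = height // patch, width // patch
    return {
        # 4x4 patches of a 3-channel image flatten to 4 * 4 * 3 = 48 values
        "patch_partition": (rows, cols, patch * patch * channels),
        # linear embedding plus Swin Transformer block (Stage 1)
        "stage1": (rows, cols, embed_dim),
        # patch merging concatenates 2x2 neighbours: 1/4 tokens, 4C features
        "merge_concat": (rows // 2, cols // 2, 4 * embed_dim),
        # fully connected reduction 4C -> 2C plus Swin block (Stage 2)
        "stage2": (rows // 2, cols // 2, 2 * embed_dim),
    }


for step, shape in stsc_encoder_shapes(32, 32, 32).items():
    print(step, shape)
# patch_partition (8, 8, 48)
# stage1 (8, 8, 32)
# merge_concat (4, 4, 128)
# stage2 (4, 4, 64)
```

For a $32\times 32$ CIFAR-10 image, the encoder therefore hands a $4\times 4$ grid of $2C$-dimensional tokens to the fully connected layer in (2).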

In order to realize the joint training of the encoder and decoder, a single fully connected neural network layer is used to simulate the channel. This layer is in effect an input-output mapping determined by the neuron weight $\boldsymbol{W}_{n}$ and the bias $\boldsymbol{b}_{n}$. Here, the weight $\boldsymbol{W}_{n}$ represents the channel gain, and the bias $\boldsymbol{b}_{n}$ is a random variable added to simulate the noise in the channel. The variance of this variable corresponds to the power of the channel noise, and its value mainly depends on the SNR and the transmission power. After transmission over the physical channel, the signal $\boldsymbol{s}$ is corrupted by channel noise and becomes the signal $\boldsymbol{y}$ when it reaches the receiver.
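To make the relation between the noise variance, the SNR and the transmission power concrete, the following minimal sketch simulates a real-valued channel $y = g\,s + n$ with Gaussian noise. It assumes that the SNR is given in dB, that the noise variance is $\sigma^2 = P / 10^{\mathrm{SNR}/10}$ with $P$ the measured average power of the transmitted signal, and that the gain is 1 for an AWGN channel; these conventions and all names are assumptions for illustration, not details stated in the paper.

```python
import math
import random


def noise_variance(signal_power, snr_db):
    """Noise power implied by an SNR in dB (assumed convention)."""
    return signal_power / (10.0 ** (snr_db / 10.0))


def awgn_channel(s, snr_db, gain=1.0, rng=None):
    """Apply y = gain * s + n, with n ~ N(0, sigma^2), to a real-valued vector.

    gain plays the role of the fixed weight and n the role of the random bias
    of the non-trainable channel layer.
    """
    rng = rng or random.Random()
    power = sum(v * v for v in s) / len(s)  # average transmit power
    sigma = math.sqrt(noise_variance(power, snr_db))
    return [gain * v + rng.gauss(0.0, sigma) for v in s]


signal = [1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
received = awgn_channel(signal, snr_db=12.0, rng=random.Random(0))
print(noise_variance(1.0, 10.0))  # 0.1 (unit power at 10 dB SNR)
```

A fading channel could be sketched in the same way by drawing `gain` at random, which is outside what the simulations in this paper report.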

At the receiver, the decoder consists of Swin Transformer blocks. Its function is to decode the received signal $\boldsymbol{y}$ and recover the reconstructed image $\hat{\boldsymbol{x}}$.

III-B FL Training

In this paper, we develop the FSSC algorithm by applying STSC within an FL framework, which can extract semantic information accurately and carry out semantic communication while protecting user privacy. While Fig. 1 depicts the overall architecture of FSSC, we detail below the specific training process of the FSSC algorithm.

We use FedAvg [13] as the aggregation algorithm for our federated training. Assume that $N$ clients participate in FL training. $\boldsymbol{D}_{k}$ denotes the local dataset of client $E_{k}$, and $\left|\boldsymbol{D}_{k}\right|$ is the size of that dataset. Thus, the training loss of client $E_{k}$ with model parameters $\omega$ over its local samples is

$$\operatorname{Loss}^{k}(\omega)=\frac{1}{\left|\boldsymbol{D}_{k}\right|}\sum_{i=1}^{\left|\boldsymbol{D}_{k}\right|}\operatorname{Loss}_{i}^{k}(\omega), \tag{3}$$

where $\operatorname{Loss}^{k}(\cdot)$ is the training loss of the local client $E_{k}$, and $\operatorname{Loss}_{i}^{k}(\cdot)$ is the loss of the client’s sample $\boldsymbol{x}_{i}$. In the FSSC algorithm, $\operatorname{Loss}(\cdot)$ is the MSE given in (1).

Therefore, the global federated training loss can be obtained by weighted average of the loss of each client according to the dataset size. The specific formula is as follows.

$$\operatorname{Loss}^{\circ}(\omega)=\sum_{k=1}^{N}\frac{\left|\boldsymbol{D}_{k}\right|}{\left|\boldsymbol{D}_{o}\right|}\operatorname{Loss}^{k}(\omega), \tag{4}$$

where $\operatorname{Loss}^{\circ}(\omega)$ is the global loss after cloud aggregation, and $|\boldsymbol{D}_{o}|$ is the global dataset size. The purpose of the FL training is to find the global optimal model that minimizes the sum of training losses over all data. The specific federated training process is as follows.

Initialization. The dataset is non-uniformly split and distributed on each client. The semantic communication network based on Swin Transformer is initialized on the cloud server, and the initialized global model parameters $\boldsymbol{W}^{o}$ are broadcast to each client to build a local STSC network.

Model training. In communication round $t$, each client participating in the training trains STSC locally on its local dataset, using stochastic gradient descent to update the local model parameters. For each batch $i$, the process can be expressed as,

$$\omega_{i+1}^{k}\leftarrow\omega_{i}^{k}-\eta\frac{\partial\operatorname{Loss}\left(\omega_{i}^{k};b\right)}{\partial\omega_{i}^{k}}, \tag{5}$$

where $\eta$ is the learning rate and $b$ denotes the mini-batch of local data used in the step. Each client saves its own trained local parameters and uploads them to the cloud server.

Parameter aggregation. The cloud server aggregates the model parameters from the different clients and updates $\boldsymbol{W}_{t}^{o}$ to $\boldsymbol{W}_{t+1}^{o}$. The aggregation mechanism is shown in the following equation, which completes the update of the global shared model.

$$w_{t+1}\leftarrow\sum_{k=1}^{K}\frac{\left|\boldsymbol{D}_{k}\right|}{\left|\boldsymbol{D}_{o}\right|}w_{t+1}^{k}, \tag{6}$$

where $K$ is the total number of clients participating in training (the same quantity denoted $N$ in (4)). This aggregation mechanism provides a certain guarantee for the stability of the system from the perspective of model convergence. Specifically, a certain degree of client-side fluctuation can be tolerated without affecting the final convergence of the model.

Model convergence. The parameters of the global model $\boldsymbol{W}_{t+1}^{o}$ are downloaded to each client, and the next round of training begins. These steps are repeated until the specified number of global training rounds is reached or the model converges.
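The four steps above (initialization, local training with the update in (5), aggregation with (6), and redistribution) can be sketched end to end on a toy problem. The sketch below is not the STSC network: it assumes a hypothetical one-parameter "autoencoder" $\hat{x} = w\,x$ whose optimum is $w = 1$, plain SGD instead of the Adam optimizer used in the experiments, and made-up client data of different sizes. Only the number of clients (3) and rounds (60) follow Section IV-A.

```python
def batch_gradient(w, batch):
    """Gradient of the batch MSE of the toy model x_hat = w * x w.r.t. w."""
    return sum(2.0 * (w * x - x) * x for x in batch) / len(batch)


def local_training(w, data, lr, epochs, batch_size):
    """Mini-batch SGD as in (5), starting from the broadcast global weight."""
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w = w - lr * batch_gradient(w, batch)
    return w


def fedavg(local_weights, sizes):
    """Dataset-size-weighted average of the local weights, as in (6)."""
    total = sum(sizes)
    return sum(n / total * w for w, n in zip(local_weights, sizes))


def global_loss(w, datasets):
    """Size-weighted average of the per-client MSE losses, as in (4)."""
    total = sum(len(d) for d in datasets)
    return sum(
        len(d) / total * sum((w * x - x) ** 2 for x in d) / len(d)
        for d in datasets
    )


def federated_training(datasets, rounds, lr=0.05, epochs=1, batch_size=2, w0=0.0):
    """Run FedAvg rounds and record the global loss after each round."""
    w_global = w0                                   # initialization
    history = [global_loss(w_global, datasets)]
    sizes = [len(d) for d in datasets]
    for _ in range(rounds):
        local = [local_training(w_global, d, lr, epochs, batch_size)
                 for d in datasets]                 # model training
        w_global = fedavg(local, sizes)             # parameter aggregation
        history.append(global_loss(w_global, datasets))
    return w_global, history


# Three hypothetical clients holding different amounts of data.
datasets = [[1.0, 2.0],
            [0.5, 1.5, 1.0, 2.0],
            [1.0, 0.5, 1.5, 2.0, 1.0, 0.5]]
w_final, history = federated_training(datasets, rounds=60)
print(round(w_final, 6), history[0] > history[-1])
```

In this toy setting the global weight approaches 1 and the global loss decreases round by round, which mirrors the qualitative convergence behaviour discussed for FSSC without reproducing its numbers.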

IV Experiment and Numerical Results

IV-A Simulation Settings

In this paper, we use the CIFAR-10 dataset [14] to evaluate the performance of the proposed FSSC algorithm. This dataset contains 50,000 training images and 10,000 test images, each with a size of $32\times 32$. The number of clients is set to 3, and each client is assigned a different amount of non-overlapping training data. All clients use the same set of validation and test images. The training and testing environment is Windows 11 with CUDA 12.1, and the deep learning framework is PyTorch 2.2.1.
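One way to realize a non-uniform, non-overlapping assignment of training data to three clients is sketched below. The split fractions (0.5, 0.3, 0.2), the shuffling seed and the function name are hypothetical, since the paper does not state how much data each client receives; only the number of training images (50,000) and the number of clients (3) come from the text.

```python
import random


def split_indices(num_samples, fractions, seed=0):
    """Split sample indices into disjoint shards with the given proportions."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # shuffle so shards are not ordered
    shards, start = [], 0
    for i, frac in enumerate(fractions):
        # the last shard takes the remainder so that no sample is dropped
        if i == len(fractions) - 1:
            end = num_samples
        else:
            end = start + round(frac * num_samples)
        shards.append(indices[start:end])
        start = end
    return shards


shards = split_indices(50_000, [0.5, 0.3, 0.2])  # assumed proportions
print([len(s) for s in shards])  # [25000, 15000, 10000]
```

The shard sizes produced this way are exactly the $\left|\boldsymbol{D}_{k}\right|$ values that weight the clients in the FedAvg aggregation.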

In the experiments, the downloading and uploading of model parameters between the clients and the cloud server are assumed to be ideal. The focus is on the effects of the global models obtained through local training and federated aggregation. The initialization parameter settings for the Swin Transformer-based semantic communication network are as follows. The network hyper-parameter $C$ is set to 32. All networks are trained with MSE as the loss function; the number of communication rounds is set to 60, the batch size is 64, the learning rate is 0.001, and the Adam optimizer [15] is used. In all the tests, the compression ratio (CR) of the images is set to 0.33.

To demonstrate the superiority of our method, we choose the typical JSCC [2] and a traditional communication algorithm as the comparison algorithms. The traditional communication algorithm adopts JPEG [16] for source coding, LDPC [17] for channel coding, and Quadrature Amplitude Modulation (QAM) for modulation and demodulation, with the modulation order set to 4.
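For readers less familiar with the baseline, a modulation order of 4 means that each QAM symbol carries two bits. The following minimal sketch shows one possible 4-QAM mapper with hard-decision demodulation; the Gray bit-to-symbol mapping, the unit-energy normalization and the function names are assumptions for illustration, as the paper does not specify them.

```python
import math

SCALE = 1.0 / math.sqrt(2.0)  # normalizes each symbol to unit energy


def qam4_modulate(bits):
    """Map bit pairs to 4-QAM symbols (bit 0 -> +, bit 1 -> -, Gray mapping)."""
    if len(bits) % 2:
        raise ValueError("4-QAM needs an even number of bits")
    return [
        complex((1 - 2 * bits[i]) * SCALE, (1 - 2 * bits[i + 1]) * SCALE)
        for i in range(0, len(bits), 2)
    ]


def qam4_demodulate(symbols):
    """Hard decision: the sign of each component recovers one bit."""
    bits = []
    for sym in symbols:
        bits.append(0 if sym.real >= 0 else 1)
        bits.append(0 if sym.imag >= 0 else 1)
    return bits


bits = [0, 0, 0, 1, 1, 1, 1, 0]
symbols = qam4_modulate(bits)
print(qam4_demodulate(symbols) == bits)  # True (noiseless round trip)
```

With channel noise added between the two functions, sign flips in either component produce the bit errors that the LDPC code in the baseline is meant to correct.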

IV-B Simulation Results

In order to evaluate the semantic representation of the FSSC model, the experiments in this section compare the performance of the local model algorithm STSC, JSCC and the traditional communication algorithm on AWGN channels, and examine how the PSNR of the three methods changes under different SNR conditions, as shown in Fig. 3. It can be seen from Fig. 3 that the PSNR of STSC and JSCC gradually increases with the SNR. STSC outperforms JSCC in terms of PSNR, with an average improvement of over 5 dB. This advantage demonstrates the superior efficiency of STSC in semantic feature extraction and image reconstruction, resulting in a more accurate representation of the original image. At the same time, the PSNR of JPEG+LDPC+QAM remains unchanged both when the channel conditions are very bad and when they are greatly improved. When the SNR is low, the traditional communication algorithm can transmit essentially no semantic information; when the SNR is high, its PSNR saturates. Thus, the accuracy of the proposed algorithm is significantly better than that of the traditional communication algorithm, and the improvement is especially pronounced as the SNR increases. This is because, while LDPC codes and QAM modulation can improve the robustness of data transmission, the JPEG and BPG compression algorithms are lossy and can lead to irreversible information loss. Combining the JPEG, LDPC and QAM techniques may reduce the error resilience of the overall system, leading to degraded image quality or data integrity in the presence of transmission errors. In comparison, the STSC algorithm retains the semantic information to the greatest extent, which ensures the effective transmission of images.

In order to evaluate the convergence of the proposed STSC model under the FL framework, a convergence analysis of the FSSC model is also carried out in this section. Taking $\mathrm{SNR}=12$ as an example, Fig. 4 shows how the target loss varies with the number of training rounds for FL training and for local training, where the target loss reflects the quality of image reconstruction. As can be seen from the figure, on AWGN channels the target loss of the communication model decreases rapidly at first, then fluctuates for a period of time, and finally remains unchanged. In other words, the model eventually converges. Moreover, the FL-trained model converges more efficiently than the locally trained client model, and its final training loss is lower. This is because the local model is trained on a single local dataset, which prevents it from fully learning the global semantic distribution of the data. FL training, in contrast, integrates the knowledge from all clients’ data and thereby improves model performance, which addresses this problem well.

Fig. 5 shows how the PSNR changes with the channel SNR for the global model and for the local model without federated aggregation after model convergence. The PSNR of both models improves significantly as the SNR increases, because image reconstruction is better as the channel condition improves. Compared with the local model without federated aggregation, the global model further enhances the PSNR by 2-3 dB. This is attributed to the global model’s ability to learn richer noise models and finer image semantic features through effective information aggregation and synchronization. Consequently, under high-SNR conditions its PSNR can continue to improve, demonstrating superior generalization and noise resistance. Moreover, FSSC avoids the sharing of sensitive data, thus protecting the privacy of users and, to a certain extent, motivating users to participate in semantic communication under the FL framework.

V Conclusion

In this paper, we proposed a semantic communication architecture for image transmission over wireless channels in multi-user cases, called FSSC. In this architecture, the local STSC semantic communication model accurately extracts semantic information from each client’s dataset and is trained locally. Joint training is then performed through FL parameter aggregation to minimize the target loss function of the reconstructed image. Simulation results show that the algorithm converges effectively. This approach combines the semantic information that exists in the diverse data of the clients and distributes the computational workload from the edge to the users, without compromising users’ privacy. Under the premise of privacy protection, the proposed algorithm can greatly reduce the amount of data required by each client while ensuring the fidelity of the recovered images. Therefore, the FSSC proposed in this paper is a promising candidate scheme for multi-user image semantic communication systems.

References

[1] K. Lu, Q. Zhou, R. Li, Z. Zhao, X. Chen, J. Wu, and H. Zhang, “Rethinking modern communication from semantic coding to semantic communication,” IEEE Wireless Communications, vol. 30, no. 1, pp. 158-164, February 2023.
[2] E. Bourtsoulatze, D. Burth Kurka and D. Gündüz, “Deep Joint Source-Channel Coding for Wireless Image Transmission,” IEEE Transactions on Cognitive Communications and Networking, vol. 5, no. 3, pp. 567-579, September 2019.
[3] N. Farsad, M. Rao, and A. Goldsmith, “Deep learning for joint source-channel coding of text,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, April 2018.
[4] H. Xie, Z. Qin, G. Y. Li, and B. H. Juang, “Deep learning enabled semantic communication systems,” IEEE Transactions on Signal Processing, vol. 69, pp. 2663–2675, April 2021.
[5] K. Choi, K. Tatwawadi, A. Grover, T. Weissman, and S. Ermon, “Neural joint source-channel coding,” in Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, May 2019.
[6] J. Xu, B. Ai, W. Chen, A. Yang, P. Sun, and M. Rodrigues, “Wireless image transmission using deep source channel coding with attention modules,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 4, pp. 2315-2328, April 2022.
[7] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, July 1948.
[8] D. Gündüz, Z. Qin, I. E. Aguerri, H. S. Dhillon, Z. Yang, A. Yener, K. K. Wong, and C.-B. Chae, “Beyond transmitting bits: Context, semantics, and task-oriented communications,” IEEE Journal on Selected Areas in Communications, vol. 41, no. 1, pp. 5–41, November 2022.
[9] M. Kountouris and N. Pappas, “Semantics-empowered communication for networked intelligent systems,” IEEE Communications Magazine, vol. 59, no. 6, pp. 96–102, June 2021.
[10] W. Yang et al., “Semantic Communications for Future Internet: Fundamentals, Applications, and Challenges,” IEEE Communications Surveys & Tutorials, vol. 25, no. 1, pp. 213-250, November 2023.
[11] Z. Liu et al., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows,” in IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, October 2021.
[12] K. Yang, S. Wang, J. Dai, K. Tan, K. Niu and P. Zhang, “WITT: A Wireless Image Transmission Transformer for Semantic Communications,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, June 2023.
[13] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, “Communication-Efficient Learning of Deep Networks from Decentralized Data,” in International Conference on Artificial Intelligence and Statistics (AISTATS), Fort Lauderdale, Florida, USA, February 2017.
[14] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Master’s thesis, University of Toronto, April 2009.
[15] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, December 2014.
[16] G. K. Wallace, “The JPEG still picture compression standard,” IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii-xxxiv, February 1992.
[17] R. Gallager, “Low-density parity-check codes,” IRE Transactions on Information Theory, vol. 8, no. 1, pp. 21–28, January 1962.