Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

Two-Stage Resource Allocation in Reconfigurable Intelligent Surface Assisted Hybrid Networks via Multi-Player Bandits
thanks: This work was supported in part by the National Natural Science Foundation of China under Grant 61771017; in part by the US National Science Foundation under Grants CNS-2107216 and CNS-2128368. L. Fu is the corresponding author (e-mail: liqun@xmu.edu.cn)
thanks: J. Tong and L. Fu are with the Department of Information and Communication Engineering, Xiamen University, Xiamen 361005, China (e-mails: tongjingwen@stu.xmu.edu.cn; liqun@xmu.edu.cn). thanks: H. Zhang is with the Department of Electrical and Computer Engineering, Princeton University, NJ, USA (e-mail: hongliang.zhang92@gmail.com). thanks: A. Leshem is with the Faculty of Engineering, Bar-Ilan University, Ramat Gan 52900, Israel (e-mail: leshema@biu.ac.il). thanks: Z. Han is with the Department of Electrical and Computer Engineering at the University of Houston, Houston, TX 77004 USA, and also with the Department of Computer Science and Engineering, Kyung Hee University, Seoul, South Korea, 446-701, USA (e-mail: zhan2@uh.edu).

Jingwen Tong,   Hongliang Zhang,   Liqun Fu,   Amir Leshem,   and Zhu Han
Abstract

This paper considers a resource allocation problem where several Internet-of-Things (IoT) devices send data to a base station (BS) with or without the help of the reconfigurable intelligent surface (RIS) assisted cellular network. The objective is to maximize the sum rate of all IoT devices by finding the optimal RIS and spreading factor (SF) for each device. Since these IoT devices lack prior information on the RISs or the channel state information (CSI), a distributed resource allocation framework with low complexity and learning features is required to achieve this goal. Therefore, we model this problem as a two-stage multi-player multi-armed bandit (MPMAB) framework to learn the optimal RIS and SF sequentially. Then, we put forth an exploration and exploitation boosting (E2Boost) algorithm to solve this two-stage MPMAB problem by combining the ϵitalic-ϵ\epsilonitalic_ϵ-greedy algorithm, Thompson sampling (TS) algorithm, and non-cooperation game method. We derive an upper regret bound for the proposed algorithm, i.e., 𝒪(log21+δT)𝒪subscriptsuperscript1𝛿2𝑇\mathcal{O}(\log^{1+\delta}_{2}T)caligraphic_O ( roman_log start_POSTSUPERSCRIPT 1 + italic_δ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_T ), increasing logarithmically with the time horizon T𝑇Titalic_T. Numerical results show that the E2Boost algorithm has the best performance among the existing methods and exhibits a fast convergence rate. More importantly, the proposed algorithm is not sensitive to the number of combinations of the RISs and SFs thanks to the two-stage allocation mechanism, which can benefit high-density networks.

Index Terms:
Reconfigurable intelligent surface (RIS), Internet-of-Things (IoT), multi-player multi-armed bandit (MPMAB), Thompson sampling (TS), exploration and exploitation boosting (E2Boost) algorithm.

I Introduction

Reconfigurable intelligent surface (RIS), which enhances the communication quality by adjusting the amplitude and the phase shift of the incident signal on a 2D planar surface with massive low-cost passive reflecting elements, has drawn increasing attention in future communication networks [1, 2, 3]. There have existed some works accounting for this vision by studying the performance of the RIS-assisted cellular network [4, 5], RIS-assisted unmanned aerial vehicle network [6], and RIS-assisted secure wireless communications [7].

Meanwhile, the cellular Internet-of-Things (C-IoT) with RIS is regarded as one of the paradigms in future communication networks, providing the capabilities of low-cost, large-scale, and ultra-durable connectivity for everything [8, 9, 10]. By employing the LoRa (short for Long Range) technology, C-IoT can operate on the unlicensed band since the resulting signal has substantial anti-interference properties [11]. On the other hand, C-IoT can achieve the rate adaptation by employing the chirp spreading spectrum modulation at the physical layer with different spreading factors (SFs) [12]. However, the study of the network-level performance of these C-IoT devices in the RIS-assisted hybrid cellular network still needs more research.

In light of this, we consider a hybrid uplink network where several C-IoT devices transmit data to a base station (BS) by opportunistically accessing the RIS-assisted cellular network. The goal is to maximize the sum rate of all C-IoT devices by finding the best RIS and SF for each device. Although these C-IoT devices can directly send data to the BS, a higher SF that corresponds to a lower data rate will be assigned to combat the harsh channel environment or to enable a long-range transmission [13]. As pointed out in [9], the low data rate will result in high data latency and security problems. Therefore, these C-IoT devices may opportunistically access the vacant RISs to improve their data rate by reflecting their signal to the BS. However, finding the optimal RIS and collecting the exact channel state information (CSI) is challenging for these C-IoT devices. On the one hand, the C-IoT device has no information (e.g., the phase shifts) about the RISs since they are deployed for cellular users (UEs). On the other hand, there is no communication among C-IoT devices in such a distributed network. These features render most traditional optimization methods infeasible in this resource allocation problem, such as the convex optimization methods [14] and the combinatorial optimization methods [15].

To overcome the above impediments, the learning theory has been considered in [8, 11, 16, 17, 18] to address this problem by sequentially exploring all actions and automatically exploiting the best action. Refs. [16, 17] study the distributed resource allocation problem in wireless networks by formulating it as a Markov decision processing (MDP) problem. Then, the authors propose the multi-agent reinforcement learning (RL) based method to solve this MDP problem. Unfortunately, these solutions often suffer from the issues of the curse of dimensionality, lack of performance guarantee (e.g., the unknown convergence rate), and high computational complexity [18]. As pointed out in [8, 11], low complexity and fast convergence resource allocation algorithms are crucial for energy-constrained IoT devices in future communication networks.

This inspires us to consider the multi-armed bandit (MAB) technique. MAB is a basic framework for the sequential decision-making problem [19]. In the classic MAB setting, in each round, a decision-maker (or player) selects an arm from a set of arms (or arm space) with an unknown distribution. After that, the player will observe a reward from the environment (or the unknown distribution). The goal is to minimize the pseudo-regret that is defined as the difference between the mean rewards of the optimal arm and the currently selected arm. During this process, the player faces an exploration and exploitation (EE) dilemma. On the one hand, the player needs to explore the arm space sufficiently to ensure its long-term performance (i.e., not miss the optimal arm); on the other hand, it needs to exploit the current best arm as many times as possible to maximize its total rewards. Compared with the other learning-based methods, MAB has a theoretical guarantee (i.e., regret bound) and low computational complexity, and it is easy to implement.

Recently, the multi-player MAB (MPMAB) framework has gained much attention in wireless communications [20, 21, 22, 23]. Ref. [20] studies the SF allocation problem in the LoRa network by devising a fully distributed MPMAB framework. It solves this MPMAB problem by using the Exponential-weight algorithm for Exploration and Exploitation (Exp3) [24] algorithm. However, the solution of the Exp3 algorithm is selfish in that it cannot guarantee the optimal allocation for each device. The optimal MPMAB framework is considered in [21], where different players contend for the same set of channels in an ad-hoc network. Based on the Hungarian algorithm [25], the authors propose a probably approximately correct (PAC) based MPMAB algorithm to estimate the CSI matrix sequentially. However, the PAC-based MPMAB algorithm requires players to exchange messages, leading to extra signaling in the system. The fully distributed resource allocation framework with the optimal solution is investigated in [22] and [23]. Ref. [22] aims to maximize the sum rate of all users by combining the MAB algorithm and the auction algorithm. A more general version of the distributed MPMAB framework named the game-of-thrones (GoT) algorithm has been proposed in [23]. The authors intend to find the optimal assignment for each player by combining the MAB algorithm and the game theory. However, algorithms in [22] and [23] suffer from low convergence rate, especially when the arm space is large.

In this paper, we propose a two-stage MPMAB framework to attack this resource allocation problem in the hybrid uplink network. In this two-stage MPMAB framework, players are the IoT devices; arms are the RISs in the first stage and the SFs in the second stage, respectively. We assume that two or more players who select the same RIS will observe a collision and receive zero reward. This resource allocation problem is quite different from that in [21] and [22], because it not only needs to learn the CSI but also the phase shifts of the RISs. Moreover, the ascending order in the set of SFs and the corresponding descending order in the successful transmission probabilities enable us to devise a two-stage MPMAB framework. To address this two-stage MPMAB problem, we put forth an exploration and exploitation boosting (E2Boost) algorithm by combining the game theory and the MAB algorithm. The E2Boost algorithm proceeds in epochs and has three phases, i.e., ϵitalic-ϵ\epsilonitalic_ϵ-greedy EE phase, non-cooperation game phase, and Thompson sampling (TS) EE phase. Each phase contains a specific mechanism to trade off the EE dilemma, which is why we call it the E2Boost algorithm. In addition, we derive an upper pseudo-regret bound for the E2Boost algorithm, i.e., 𝒪(log21+δT)𝒪subscriptsuperscript1𝛿2𝑇\mathcal{O}(\log^{1+\delta}_{2}T)caligraphic_O ( roman_log start_POSTSUPERSCRIPT 1 + italic_δ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_T ) where 0δ<10𝛿10\leq\delta<10 ≤ italic_δ < 1, indicating that the per-round regret will trend to 00 when the time horizon T𝑇Titalic_T is sufficiently large. More importantly, this upper regret bound is about M𝑀Mitalic_M times lower than that in the GoT algorithm, where M𝑀Mitalic_M is the number of SFs. In other words, the proposed algorithm is not sensitive to the number of combinations of the RISs and SFs, which can benefit high-density networks.

Refer to caption
Figure 1: A RIS-assisted hybrid uplink network.

The difference between this work and the existing ones and the main contributions of this work are summarized as follows.

  • The E2Boost algorithm embeds the ϵitalic-ϵ\epsilonitalic_ϵ-greedy algorithm [26] in the first phase to reduce the regrets generated from the uniform exploration. Specifically, we use the Wasserstein distance (WD) [27] to measure the convergence rate of the second phase. In return, this measurement is regarded as a criterion to optimize the parameter ϵitalic-ϵ\epsilonitalic_ϵ.

  • The E2Boost algorithm adopts the TS algorithm [28, 29] in the third phase to determine the best SF. Since the only observed information is the success or failure transmission feedback, the TS algorithm maintains a Beta distribution on the successful transmission probability of each SF. For the Bernoulli reward processing, the TS algorithm accounts for the best performance among the existing stochastic MAB algorithms [29].

  • The E2Boost algorithm has a smaller arm space to explore than the GoT algorithm. Thanks to the two-stage allocation mechanism, the E2Boost algorithm only needs to explore the sets of the RISs or the SFs. In contrast, the GoT algorithm requires exploring the combinations of the RISs and SFs.

The remainder of this paper is organized as follows. In Section II, we introduce the channel model and the achievable data rate. The problem formulation is given in Section III. In Section IV, we introduce the two-stage MPMAB framework for this joint resource assignment problem. The E2Boost algorithm is presented in Section V. Numerical results are given in Section VI to evaluate the proposed algorithm. This paper is concluded in Section VII.

II System Model

We consider a hybrid uplink cellular network, as shown in Fig. 1, where several UEs and N𝑁Nitalic_N IoT devices are located in an area. Both UE and IoT device need to transmit data to the BS. Since there may exist some obstacles (e.g., buildings) between UEs and the BS, the signal will experience deep-fading. Thus, K𝐾Kitalic_K RISs are deployed to reflect the UEs’ signals to the BS by adjusting the RISs’ phase shifts. These RISs are operated over different frequencies111The RIS can operate at different frequencies by changing the location and the wave-number of each element [5].. The N𝑁Nitalic_N IoT device has no information about these RISs, but it may opportunistically access these RISs to improve its data rate. In this hybrid network, UE is the legal user to communicate with the BS through the RIS; while the IoT device is required to perform spectrum sensing222If the received signal strength (RSS) exceeds a threshold, the IoT device marks this RIS with the busy state; otherwise, the state of the RIS is idle. before access to the vacant RIS. Time is slotted in t=1,2,,T𝑡12𝑇t=1,2,\ldots,Titalic_t = 1 , 2 , … , italic_T. At each time slot, we assume that one RIS can serve multiple UEs but can be exploited by only one IoT device. The reason is that the BS requires the precoding and beamforming vectors to maintain communication quality in this multi-user RIS-assisted system [30]. These vectors often contain the information of the CSI and the RISs’ phase shifts determined by the UEs. As a result, an RIS can only support one IoT device since the IoT device lacks these precoding and beamforming vectors.

II-A Channel Models

There are two transmission patterns for each IoT device in this hybrid network. The first one is the RIS-assisted transmission pattern (Pattern I), where the IoT device transmits to the BS through the RIS when the target RIS is detected in an idle state. The second one is non-RIS-assisted transmission pattern (Pattern II), where the IoT device directly transmits to the BS with a low data rate if the target RIS is detected in a busy state.

Pattern I: Assume that each element on the RIS is equipped with b𝑏bitalic_b PIN diodes, producing 2bsuperscript2𝑏2^{b}2 start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT phase shifts in [0,2π)02𝜋[0,2\pi)[ 0 , 2 italic_π ) by controlling the ON/OFF state of each diode. Hence, the available phase shift at the (l1,l2)subscript𝑙1subscript𝑙2(l_{1},l_{2})( italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT )-th element is

τl1,l2=πρl1,l22b1,subscript𝜏subscript𝑙1subscript𝑙2𝜋subscript𝜌subscript𝑙1subscript𝑙2superscript2𝑏1\tau_{l_{1},l_{2}}=\frac{\pi\rho_{l_{1},l_{2}}}{2^{b-1}},italic_τ start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT = divide start_ARG italic_π italic_ρ start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG start_ARG 2 start_POSTSUPERSCRIPT italic_b - 1 end_POSTSUPERSCRIPT end_ARG , (1)

where (l1,l2)subscript𝑙1subscript𝑙2(l_{1},l_{2})( italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) is the index of the RIS elements’ matrix and ρl1,l2subscript𝜌subscript𝑙1subscript𝑙2\rho_{l_{1},l_{2}}italic_ρ start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT is an integer in [0,2b1]0superscript2𝑏1[0,2^{b}-1][ 0 , 2 start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT - 1 ]. Let Al1,l2subscript𝐴subscript𝑙1subscript𝑙2A_{l_{1},l_{2}}italic_A start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT be the reflection factor at the l1subscript𝑙1l_{1}italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT-th row and l2subscript𝑙2l_{2}italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT-th column of the RIS elements’ matrix, which is defined as

Al1,l2=Aejτl1,l2,subscript𝐴subscript𝑙1subscript𝑙2𝐴superscript𝑒𝑗subscript𝜏subscript𝑙1subscript𝑙2A_{l_{1},l_{2}}=Ae^{-j\tau_{l_{1},l_{2}}},italic_A start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT = italic_A italic_e start_POSTSUPERSCRIPT - italic_j italic_τ start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , (2)

where A𝐴Aitalic_A is a reflection amplitude with a constant value among (0,1]01(0,1]( 0 , 1 ]333The reflection amplitude can be a function of the phase shift as in [31]..

By taking advantage of the directional reflections of the RIS, the BS - RIS - IoT device link is usually stronger than other multi-paths as well as the deep-fading direct link between the BS and the IoT device [7]. Therefore, we model the channel between the BS and the IoT device as a Ricean model. In this way, the BS - (RIS k𝑘kitalic_k) - (IoT device n𝑛nitalic_n) link acts as the dominant “LoS” component; while all the other paths together form the “non-LoS (NLoS)” component. Hence, the RIS-assisted channel model hl1,l2n,ksubscriptsuperscript𝑛𝑘subscript𝑙1subscript𝑙2h^{n,k}_{l_{1},l_{2}}italic_h start_POSTSUPERSCRIPT italic_n , italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT is defined as

hl1,l2n,k=ζζ+1h~l1,l2n,k+1ζ+1h^l1,l2n,k,subscriptsuperscript𝑛𝑘subscript𝑙1subscript𝑙2𝜁𝜁1subscriptsuperscript~𝑛𝑘subscript𝑙1subscript𝑙21𝜁1subscriptsuperscript^𝑛𝑘subscript𝑙1subscript𝑙2h^{n,k}_{l_{1},l_{2}}=\sqrt{\frac{\zeta}{\zeta+1}}\tilde{h}^{n,k}_{l_{1},l_{2}% }+\sqrt{\frac{1}{\zeta+1}}\hat{h}^{n,k}_{l_{1},l_{2}},italic_h start_POSTSUPERSCRIPT italic_n , italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT = square-root start_ARG divide start_ARG italic_ζ end_ARG start_ARG italic_ζ + 1 end_ARG end_ARG over~ start_ARG italic_h end_ARG start_POSTSUPERSCRIPT italic_n , italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + square-root start_ARG divide start_ARG 1 end_ARG start_ARG italic_ζ + 1 end_ARG end_ARG over^ start_ARG italic_h end_ARG start_POSTSUPERSCRIPT italic_n , italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , (3)

where h~l1,l2n,ksubscriptsuperscript~𝑛𝑘subscript𝑙1subscript𝑙2\tilde{h}^{n,k}_{l_{1},l_{2}}over~ start_ARG italic_h end_ARG start_POSTSUPERSCRIPT italic_n , italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT and h^l1,l2n,ksubscriptsuperscript^𝑛𝑘subscript𝑙1subscript𝑙2\hat{h}^{n,k}_{l_{1},l_{2}}over^ start_ARG italic_h end_ARG start_POSTSUPERSCRIPT italic_n , italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT are the LoS component and the NLoS component with the k𝑘kitalic_k-th RIS and the n𝑛nitalic_n-th IoT device through the (l1,l2)subscript𝑙1subscript𝑙2(l_{1},l_{2})( italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT )-th element, respectively. Symbol ζ𝜁\zetaitalic_ζ is the Rician factor, indicating the ratio of the LoS component to the NLoS component. In the following, we omit the IoT device index n𝑛nitalic_n and the RIS index k𝑘kitalic_k in the superscript if no confusion occurs.

Let Dl1,l2subscript𝐷subscript𝑙1subscript𝑙2D_{l_{1},l_{2}}italic_D start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT be the distance between the BS and the (l1,l2)subscript𝑙1subscript𝑙2(l_{1},l_{2})( italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT )-th RIS element, and let dl1,l2subscript𝑑subscript𝑙1subscript𝑙2d_{l_{1},l_{2}}italic_d start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT be the distance between the (l1,l2)subscript𝑙1subscript𝑙2(l_{1},l_{2})( italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT )-th RIS element and the IoT device. The transmission distance of BS - ((l1,l2)subscript𝑙1subscript𝑙2(l_{1},l_{2})( italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT )-th RIS element) - (IoT device n𝑛nitalic_n) link is Ll1,l2=Dl1,l2+dl1,l2subscript𝐿subscript𝑙1subscript𝑙2subscript𝐷subscript𝑙1subscript𝑙2subscript𝑑subscript𝑙1subscript𝑙2L_{l_{1},l_{2}}=D_{l_{1},l_{2}}+d_{l_{1},l_{2}}italic_L start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT = italic_D start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT + italic_d start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT. According to [4], the LoS component of this link is given by

h~l1,l2=GDl1,l2ιdl1,l2ιej2πλLl1,l2=G[Dl1,l2ιej2πλDl1,l2][dl1,l2ιej2πλdl1,l2],subscript~subscript𝑙1subscript𝑙2𝐺subscriptsuperscript𝐷𝜄subscript𝑙1subscript𝑙2subscriptsuperscript𝑑𝜄subscript𝑙1subscript𝑙2superscript𝑒𝑗2𝜋𝜆subscript𝐿subscript𝑙1subscript𝑙2𝐺delimited-[]subscriptsuperscript𝐷𝜄subscript𝑙1subscript𝑙2superscript𝑒𝑗2𝜋𝜆subscript𝐷subscript𝑙1subscript𝑙2delimited-[]subscriptsuperscript𝑑𝜄subscript𝑙1subscript𝑙2superscript𝑒𝑗2𝜋𝜆subscript𝑑subscript𝑙1subscript𝑙2\small\begin{split}\tilde{h}_{l_{1},l_{2}}&=\sqrt{GD^{-\iota}_{l_{1},l_{2}}d^{% -\iota}_{l_{1},l_{2}}}e^{-j\frac{2\pi}{\lambda}L_{l_{1},l_{2}}}\\ &=\sqrt{G}\left[\sqrt{D^{-\iota}_{l_{1},l_{2}}}e^{-j\frac{2\pi}{\lambda}D_{l_{% 1},l_{2}}}\right]\left[\sqrt{d^{-\iota}_{l_{1},l_{2}}}e^{-j\frac{2\pi}{\lambda% }d_{l_{1},l_{2}}}\right],\end{split}start_ROW start_CELL over~ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL start_CELL = square-root start_ARG italic_G italic_D start_POSTSUPERSCRIPT - italic_ι end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_d start_POSTSUPERSCRIPT - italic_ι end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG italic_e start_POSTSUPERSCRIPT - italic_j divide start_ARG 2 italic_π end_ARG start_ARG italic_λ end_ARG italic_L start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = square-root start_ARG italic_G end_ARG [ square-root start_ARG italic_D start_POSTSUPERSCRIPT - italic_ι end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG italic_e start_POSTSUPERSCRIPT - italic_j divide start_ARG 2 italic_π end_ARG start_ARG italic_λ end_ARG italic_D start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ] [ square-root start_ARG italic_d start_POSTSUPERSCRIPT - italic_ι end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG italic_e start_POSTSUPERSCRIPT - italic_j divide start_ARG 2 italic_π end_ARG start_ARG italic_λ end_ARG italic_d start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ] , end_CELL end_ROW (4)

where ι𝜄\iotaitalic_ι is the path-loss exponent. Symbol G𝐺Gitalic_G is the antenna gain, and λ𝜆\lambdaitalic_λ is the wavelength of the signal. Meanwhile, the NLoS component is given by

h^l1,l2=PLNLoS(Ll1,l2)gl1,l2,subscript^subscript𝑙1subscript𝑙2𝑃subscript𝐿NLoSsubscript𝐿subscript𝑙1subscript𝑙2subscript𝑔subscript𝑙1subscript𝑙2\hat{h}_{l_{1},l_{2}}=\sqrt{PL_{\mathrm{NLoS}}(L_{l_{1},l_{2}})}g_{l_{1},l_{2}},over^ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT = square-root start_ARG italic_P italic_L start_POSTSUBSCRIPT roman_NLoS end_POSTSUBSCRIPT ( italic_L start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) end_ARG italic_g start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , (5)

where gl1,l2subscript𝑔subscript𝑙1subscript𝑙2g_{l_{1},l_{2}}italic_g start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT is the small-scale NLoS component, following the i.i.d. complex Gaussian distribution, i.e., gl1,l2𝒞𝒩(0,1)similar-tosubscript𝑔subscript𝑙1subscript𝑙2𝒞𝒩01g_{l_{1},l_{2}}\sim\mathcal{CN}(0,1)italic_g start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∼ caligraphic_C caligraphic_N ( 0 , 1 ). Term PLNLoS()𝑃subscript𝐿NLoSPL_{\mathrm{NLoS}}(\cdot)italic_P italic_L start_POSTSUBSCRIPT roman_NLoS end_POSTSUBSCRIPT ( ⋅ ) is the NLoS channel power gain that we adopt the urban macro (UMa) path-loss model444The calculation of PLNLoS()𝑃subscript𝐿NLoSPL_{\mathrm{NLoS}}(\cdot)italic_P italic_L start_POSTSUBSCRIPT roman_NLoS end_POSTSUBSCRIPT ( ⋅ ) in dB form is 10log10PLNLoS(d)=13.54+39.08log10(d)+20log10(fc)0.6(hIoT1.5)10subscript10𝑃subscript𝐿NLoS𝑑13.5439.08subscript10𝑑20subscript10subscript𝑓𝑐0.6subscriptIoT1.510\log_{10}PL_{\mathrm{NLoS}}(d)=13.54+39.08\log_{10}(d)+20\log_{10}(f_{c})-0.% 6(h_{\mathrm{IoT}}-1.5)10 roman_log start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT italic_P italic_L start_POSTSUBSCRIPT roman_NLoS end_POSTSUBSCRIPT ( italic_d ) = 13.54 + 39.08 roman_log start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT ( italic_d ) + 20 roman_log start_POSTSUBSCRIPT 10 end_POSTSUBSCRIPT ( italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ) - 0.6 ( italic_h start_POSTSUBSCRIPT roman_IoT end_POSTSUBSCRIPT - 1.5 ), where d𝑑ditalic_d is the Euclidean distance between the device and the BS, and hhitalic_h is the height of the device. Symbol fcsubscript𝑓𝑐f_{c}italic_f start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT is the central frequency. [32] in the simulation.

Pattern II: In the non-RIS-assisted transmission pattern, the IoT device directly transmits to the BS without the help of the RIS. Since there are some obstacles, the signal may experience deep fading. Thus, we use the shadow fading model to describe the channel between IoT device n𝑛nitalic_n and the BS [33], i.e.,

hn=ϱngn,subscript𝑛subscriptitalic-ϱ𝑛subscript𝑔𝑛h_{n}=\sqrt{\varrho_{n}}g_{n},italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = square-root start_ARG italic_ϱ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_ARG italic_g start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , (6)

where ϱnsubscriptitalic-ϱ𝑛\varrho_{n}italic_ϱ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT is the channel power gain, following the i.i.d. log-normal distribution with mean μnsubscript𝜇𝑛\mu_{n}italic_μ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT and standard deviation σnsubscript𝜎𝑛\sigma_{n}italic_σ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT of ln(ϱn)subscriptitalic-ϱ𝑛\ln\left(\varrho_{n}\right)roman_ln ( italic_ϱ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ). The typical value of σnsubscript𝜎𝑛\sigma_{n}italic_σ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT is between 6666 and 12 dB for practical radio channels [32]. In addition, gnsubscript𝑔𝑛g_{n}italic_g start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT is a small-scale NLoS component, following i.i.d. complex Gaussian distribution, i.e., gn𝒞𝒩(0,1)similar-tosubscript𝑔𝑛𝒞𝒩01g_{n}\sim\mathcal{CN}(0,1)italic_g start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∼ caligraphic_C caligraphic_N ( 0 , 1 ).

II-B Signal Model and Achievable Data Rate

The received signal from IoT device n𝑛nitalic_n to the BS through RIS k𝑘kitalic_k is given by

{ϝn,k=l1,l2Al1,l2khl1,l2n,kΩnx+y+ω,PatternI,ϝn=hnΩnx+y+ω,PatternII,casessubscriptitalic-ϝ𝑛𝑘subscriptsubscript𝑙1subscript𝑙2subscriptsuperscript𝐴𝑘subscript𝑙1subscript𝑙2subscriptsuperscript𝑛𝑘subscript𝑙1subscript𝑙2subscriptΩ𝑛𝑥𝑦𝜔PatternImissing-subexpressionsubscriptitalic-ϝ𝑛subscript𝑛subscriptΩ𝑛𝑥𝑦𝜔PatternIImissing-subexpression\left\{\begin{array}[]{lll}\digamma_{n,k}=\sum_{l_{1},l_{2}}A^{k}_{l_{1},l_{2}% }h^{n,k}_{l_{1},l_{2}}\sqrt{\Omega_{n}}x+y+\omega,&\mathrm{Pattern\ I},\\ \digamma_{n}=h_{n}\sqrt{\Omega_{n}}x+y+\omega,&\mathrm{Pattern\ II},\end{array% }\right.{ start_ARRAY start_ROW start_CELL italic_ϝ start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_A start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_h start_POSTSUPERSCRIPT italic_n , italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT square-root start_ARG roman_Ω start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_ARG italic_x + italic_y + italic_ω , end_CELL start_CELL roman_Pattern roman_I , end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL italic_ϝ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT square-root start_ARG roman_Ω start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_ARG italic_x + italic_y + italic_ω , end_CELL start_CELL roman_Pattern roman_II , end_CELL start_CELL end_CELL end_ROW end_ARRAY (7)

where x𝑥xitalic_x is the transmission signal with |x|2=1superscript𝑥21|x|^{2}=1| italic_x | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = 1 and y𝑦yitalic_y is the received interference signal555Notice that the interference y𝑦yitalic_y may come from the neighboring cellular networks or the local UEs when the UEs’ signals are missed detection by the IoT devices. which can be modeled as a log-normal distribution [34], i.e., yLog𝒩(μy,σy2)similar-to𝑦𝐿𝑜𝑔𝒩subscript𝜇𝑦subscriptsuperscript𝜎2𝑦y\sim Log\mathcal{N}(\mu_{y},\sigma^{2}_{y})italic_y ∼ italic_L italic_o italic_g caligraphic_N ( italic_μ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT , italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ). In addition, ω𝒞𝒩(0,σω2)similar-to𝜔𝒞𝒩0superscriptsubscript𝜎𝜔2\omega\sim\mathcal{CN}(0,\sigma_{\omega}^{2})italic_ω ∼ caligraphic_C caligraphic_N ( 0 , italic_σ start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) is the i.i.d. additive complex Gaussian noise and ΩnsubscriptΩ𝑛\Omega_{n}roman_Ω start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT is the transmit power. Then, the received signal-to-interference-plus-noise ratio (SINR) can be calculated by

{γn,k=Ωn(l1,l2Al1,l2kh~l1,l2n,kl1,l2(Al1,l2k)(h~l1,l2n,k))exp(2μy+2σy2)+σω2,γn=Ωnhnhnexp(2μy+2σy2)+σω2,casessubscript𝛾𝑛𝑘absentsubscriptΩ𝑛subscriptsubscript𝑙1subscript𝑙2subscriptsuperscript𝐴𝑘subscript𝑙1subscript𝑙2subscriptsuperscript~𝑛𝑘subscript𝑙1subscript𝑙2subscriptsubscript𝑙1subscript𝑙2superscriptsubscriptsuperscript𝐴𝑘subscript𝑙1subscript𝑙2superscriptsubscriptsuperscript~𝑛𝑘subscript𝑙1subscript𝑙22subscript𝜇𝑦2subscriptsuperscript𝜎2𝑦superscriptsubscript𝜎𝜔2missing-subexpressionsubscript𝛾𝑛absentsubscriptΩ𝑛subscript𝑛subscriptsuperscript𝑛2subscript𝜇𝑦2subscriptsuperscript𝜎2𝑦superscriptsubscript𝜎𝜔2missing-subexpression\small\left\{\begin{array}[]{lll}\gamma_{n,k}&=\frac{\Omega_{n}\left(\sum_{l_{% 1},l_{2}}A^{k}_{l_{1},l_{2}}\tilde{h}^{n,k}_{l_{1},l_{2}}\sum_{l_{1},l_{2}}% \left(A^{k}_{l_{1},l_{2}}\right)^{\ast}\left(\tilde{h}^{n,k}_{l_{1},l_{2}}% \right)^{\ast}\right)}{\exp\left(2\mu_{y}+2\sigma^{2}_{y}\right)+\sigma_{% \omega}^{2}},&\\ \gamma_{n}&=\frac{\Omega_{n}h_{n}h^{\ast}_{n}}{\exp\left(2\mu_{y}+2\sigma^{2}_% {y}\right)+\sigma_{\omega}^{2}},&\end{array}\right.{ start_ARRAY start_ROW start_CELL italic_γ start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT end_CELL start_CELL = divide start_ARG roman_Ω start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( ∑ start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_A start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT over~ start_ARG italic_h end_ARG start_POSTSUPERSCRIPT italic_n , italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_A start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ( over~ start_ARG italic_h end_ARG start_POSTSUPERSCRIPT italic_n , italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) end_ARG start_ARG roman_exp ( 2 italic_μ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT + 2 italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) + italic_σ start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG , end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL italic_γ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_CELL start_CELL = divide start_ARG roman_Ω start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT italic_h start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_ARG start_ARG roman_exp ( 2 italic_μ start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT + 2 italic_σ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT ) + italic_σ start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG , end_CELL start_CELL end_CELL end_ROW end_ARRAY (8)

where ()superscript(\cdot)^{\ast}( ⋅ ) start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is the conjugate operation.

In a practical system, each IoT device can only support a finite number of data rates according to the available SFs. Let ={c1,c2,,cM}subscript𝑐1subscript𝑐2subscript𝑐𝑀\mathcal{M}=\{c_{1},c_{2},\cdots,c_{M}\}caligraphic_M = { italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , italic_c start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT } and 𝒮={s1,s2,,sM}𝒮subscript𝑠1subscript𝑠2subscript𝑠𝑀\mathcal{S}=\{s_{1},s_{2},\cdots,s_{M}\}caligraphic_S = { italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , ⋯ , italic_s start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT } be the set of data rates and SFs, respectively. According to [12], the relationship between data rate and SF is given by

cm=Bsm2sm×CR,subscript𝑐𝑚𝐵subscript𝑠𝑚superscript2subscript𝑠𝑚𝐶𝑅c_{m}=\frac{Bs_{m}}{2^{s_{m}}}\times CR,italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT = divide start_ARG italic_B italic_s start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_ARG start_ARG 2 start_POSTSUPERSCRIPT italic_s start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUPERSCRIPT end_ARG × italic_C italic_R , (9)

where B𝐵Bitalic_B is the bandwidth in Hz and CR𝐶𝑅CRitalic_C italic_R is the code rate. It can be seen that a higher SF is associated with a lower data rate. In other words, if 𝒮𝒮\mathcal{S}caligraphic_S is in ascending order s1<s2<<sMsubscript𝑠1subscript𝑠2subscript𝑠𝑀s_{1}<s_{2}<\cdots<s_{M}italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT < italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT < ⋯ < italic_s start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT, \mathcal{M}caligraphic_M will be the descending order c1>c2>>cMsubscript𝑐1subscript𝑐2subscript𝑐𝑀c_{1}>c_{2}>\cdots>c_{M}italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT > italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT > ⋯ > italic_c start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT.

The achievable data rate not only corresponds to the selected modulation and coding scheme but also depends on the received SINR [35]. Thus, according to the instantaneous received SINR, the successful transmission probability of data rate cmsubscript𝑐𝑚c_{m}italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT is given by

{θk,cmnPr{γn,kΨm},PatternI,θcmnPr{γnΨm},PatternII,casessuperscriptsubscript𝜃𝑘subscript𝑐𝑚𝑛absentPrsubscriptsuperscript𝛾𝑛𝑘subscriptΨ𝑚PatternIsuperscriptsubscript𝜃subscript𝑐𝑚𝑛absentPrsubscriptsuperscript𝛾𝑛subscriptΨ𝑚PatternII\left\{\begin{array}[]{lll}\theta_{k,c_{m}}^{n}&\triangleq\text{Pr}\{\gamma^{% \prime}_{n,k}\geq\Psi_{m}\},&\mathrm{Pattern\ I},\\ \theta_{c_{m}}^{n}&\triangleq\text{Pr}\{\gamma^{\prime}_{n}\geq\Psi_{m}\},&% \mathrm{Pattern\ II},\end{array}\right.{ start_ARRAY start_ROW start_CELL italic_θ start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_CELL start_CELL ≜ Pr { italic_γ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT ≥ roman_Ψ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT } , end_CELL start_CELL roman_Pattern roman_I , end_CELL end_ROW start_ROW start_CELL italic_θ start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT end_CELL start_CELL ≜ Pr { italic_γ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ≥ roman_Ψ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT } , end_CELL start_CELL roman_Pattern roman_II , end_CELL end_ROW end_ARRAY (10)

where ΨmsubscriptΨ𝑚\Psi_{m}roman_Ψ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT is the minimum required SINR for the BS to demodulate the received signal when the data rate is cmsubscript𝑐𝑚c_{m}italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT. Note that SINR γn,ksubscriptsuperscript𝛾𝑛𝑘\gamma^{\prime}_{n,k}italic_γ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT (or γnsubscriptsuperscript𝛾𝑛\gamma^{\prime}_{n}italic_γ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT) is a random variable with mean γn,ksubscript𝛾𝑛𝑘\gamma_{n,k}italic_γ start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT (or γnsubscript𝛾𝑛\gamma_{n}italic_γ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT) due to the small-scale NLoS components and the received interference signal. According to the Shannon formula, the better received SINR, the higher the successful transmission probability when given a data rate. In other words, a descending data rates (c1>c2>>cMsubscript𝑐1subscript𝑐2subscript𝑐𝑀c_{1}>c_{2}>\cdots>c_{M}italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT > italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT > ⋯ > italic_c start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT) will lead to an ascending successful transmission probabilities (θc1<θc2<<θcMsubscript𝜃subscript𝑐1subscript𝜃subscript𝑐2subscript𝜃subscript𝑐𝑀\theta_{c_{1}}<\theta_{c_{2}}<\cdots<\theta_{c_{M}}italic_θ start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT < italic_θ start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT < ⋯ < italic_θ start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_M end_POSTSUBSCRIPT end_POSTSUBSCRIPT).

III Problem Formulation

The system’s goal is to maximize the sum rate of all IoT devices at each time slot by finding the optimal RIS and SF for each device under Pattern I, as well as determining the optimal SF for each device under Pattern II. Let ϑt={ϑ1t,ϑ2t,,ϑKt}superscriptitalic-ϑ𝑡subscriptsuperscriptitalic-ϑ𝑡1subscriptsuperscriptitalic-ϑ𝑡2subscriptsuperscriptitalic-ϑ𝑡𝐾\vec{\vartheta}^{t}=\{\vartheta^{t}_{1},\vartheta^{t}_{2},\ldots,\vartheta^{t}% _{K}\}over→ start_ARG italic_ϑ end_ARG start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = { italic_ϑ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_ϑ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_ϑ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT } be the state vector of the RISs at time slot t𝑡titalic_t, where ϑkt=1subscriptsuperscriptitalic-ϑ𝑡𝑘1\vartheta^{t}_{k}=1italic_ϑ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 1 means that the k𝑘kitalic_k-th RIS is vacant; otherwise, it is occupied. Note that this information is known prior to each IoT device with the spectrum sensing operation. Then, the resource allocation problem is given by

maxϕk,cmn,ψcmnabsentsubscriptsubscriptsuperscriptitalic-ϕ𝑛𝑘subscript𝑐𝑚subscriptsuperscript𝜓𝑛subscript𝑐𝑚\displaystyle\underset{}{\max\limits_{\phi^{n}_{k,c_{m}},\psi^{n}_{c_{m}}}}start_UNDERACCENT end_UNDERACCENT start_ARG roman_max start_POSTSUBSCRIPT italic_ϕ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_ψ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG t=1Tn=1Nm=1M(k=1Kcmϑktθk,cmnϕk,cmnPatternI+cmθcmnψcmnPatternII)superscriptsubscript𝑡1𝑇superscriptsubscript𝑛1𝑁superscriptsubscript𝑚1𝑀subscriptsuperscriptsubscript𝑘1𝐾subscript𝑐𝑚subscriptsuperscriptitalic-ϑ𝑡𝑘subscriptsuperscript𝜃𝑛𝑘subscript𝑐𝑚subscriptsuperscriptitalic-ϕ𝑛𝑘subscript𝑐𝑚PatternIsubscriptsubscript𝑐𝑚subscriptsuperscript𝜃𝑛subscript𝑐𝑚subscriptsuperscript𝜓𝑛subscript𝑐𝑚PatternII\displaystyle\sum_{t=1}^{T}\sum_{n=1}^{N}\sum_{m=1}^{M}\left(\underbrace{\sum_% {k=1}^{K}c_{m}\vartheta^{t}_{k}\theta^{n}_{k,c_{m}}\phi^{n}_{k,c_{m}}}\limits_% {\mathrm{Pattern\ I}}+\underbrace{c_{m}\theta^{n}_{c_{m}}\psi^{n}_{c_{m}}}% \limits_{\mathrm{Pattern\ II}}\right)∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT ( under⏟ start_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT italic_ϑ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_θ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_ϕ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG start_POSTSUBSCRIPT roman_Pattern roman_I end_POSTSUBSCRIPT + under⏟ start_ARG italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT italic_θ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_ψ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_ARG start_POSTSUBSCRIPT roman_Pattern roman_II end_POSTSUBSCRIPT ) (11)
s.t.formulae-sequencest\displaystyle\mathrm{s.t.}roman_s . roman_t . m=1Mk=1Kϑktϕk,cmn+m=1Mψcmn=1,n𝒩,formulae-sequencesuperscriptsubscript𝑚1𝑀superscriptsubscript𝑘1𝐾subscriptsuperscriptitalic-ϑ𝑡𝑘subscriptsuperscriptitalic-ϕ𝑛𝑘subscript𝑐𝑚superscriptsubscript𝑚1𝑀subscriptsuperscript𝜓𝑛subscript𝑐𝑚1for-all𝑛𝒩\displaystyle\sum_{m=1}^{M}\sum_{k=1}^{K}\vartheta^{t}_{k}\phi^{n}_{k,c_{m}}+% \sum_{m=1}^{M}\psi^{n}_{c_{m}}=1,\ \forall n\in\mathcal{N},∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_ϑ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_ϕ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_ψ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT = 1 , ∀ italic_n ∈ caligraphic_N ,
n=1Nϕk,cmn1,cm,andk𝒦,formulae-sequencesuperscriptsubscript𝑛1𝑁subscriptsuperscriptitalic-ϕ𝑛𝑘subscript𝑐𝑚1formulae-sequencefor-allsubscript𝑐𝑚andfor-all𝑘𝒦\displaystyle\sum_{n=1}^{N}\phi^{n}_{k,c_{m}}\leq 1,\ \forall c_{m}\in\mathcal% {M},\ \text{and}\ \forall k\in\mathcal{K},∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_ϕ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT ≤ 1 , ∀ italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ∈ caligraphic_M , and ∀ italic_k ∈ caligraphic_K ,

where ϕk,cmnsubscriptsuperscriptitalic-ϕ𝑛𝑘subscript𝑐𝑚\phi^{n}_{k,c_{m}}italic_ϕ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT and ψcmnsubscriptsuperscript𝜓𝑛subscript𝑐𝑚\psi^{n}_{c_{m}}italic_ψ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT are the binary variables, where ϕk,cmn=1subscriptsuperscriptitalic-ϕ𝑛𝑘subscript𝑐𝑚1\phi^{n}_{k,c_{m}}=1italic_ϕ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT = 1 denotes that IoT device n𝑛nitalic_n transmits on the k𝑘kitalic_k-th RIS with SF smsubscript𝑠𝑚s_{m}italic_s start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT to reflect its signal to the BS; otherwise, ϕk,cmn=0subscriptsuperscriptitalic-ϕ𝑛𝑘subscript𝑐𝑚0\phi^{n}_{k,c_{m}}=0italic_ϕ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT = 0. The symbol ψcmn=1subscriptsuperscript𝜓𝑛subscript𝑐𝑚1\psi^{n}_{c_{m}}=1italic_ψ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT = 1 denotes that IoT device n𝑛nitalic_n directly transmits to the BS with SF smsubscript𝑠𝑚s_{m}italic_s start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT; otherwise, ψcmn=0subscriptsuperscript𝜓𝑛subscript𝑐𝑚0\psi^{n}_{c_{m}}=0italic_ψ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT = 0. Thus, the first constraint indicates that each IoT device either transmits on Pattern I or Pattern II. If IoT device n𝑛nitalic_n transmits on Pattern I, then m=1Mk=1Kϑktϕk,cmn=1superscriptsubscript𝑚1𝑀superscriptsubscript𝑘1𝐾subscriptsuperscriptitalic-ϑ𝑡𝑘subscriptsuperscriptitalic-ϕ𝑛𝑘subscript𝑐𝑚1\sum_{m=1}^{M}\sum_{k=1}^{K}\vartheta^{t}_{k}\phi^{n}_{k,c_{m}}=1∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_ϑ start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_ϕ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT = 1 means that each IoT device can only select a pair of RIS and SF; if IoT device n𝑛nitalic_n transmits on Pattern II, then m=1Mψcmn=1superscriptsubscript𝑚1𝑀subscriptsuperscript𝜓𝑛subscript𝑐𝑚1\sum_{m=1}^{M}\psi^{n}_{c_{m}}=1∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_ψ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT = 1 denotes that each IoT device can only select a SF. The second constraint means that the number of IoT devices that select the k𝑘kitalic_k-RIS and the m𝑚mitalic_m-th SF is subject to 00 or 1111. In addition, 𝒩={1,2,,N}𝒩12𝑁\mathcal{N}=\{1,2,\cdots,N\}caligraphic_N = { 1 , 2 , ⋯ , italic_N } and 𝒦={1,2,,K}𝒦12𝐾\mathcal{K}=\{1,2,\cdots,K\}caligraphic_K = { 1 , 2 , ⋯ , italic_K } are the sets of IoT devices and RISs, respectively. The symbol θk,cmnsubscriptsuperscript𝜃𝑛𝑘subscript𝑐𝑚\theta^{n}_{k,c_{m}}italic_θ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT is the successful transmission probability that IoT device n𝑛nitalic_n transmits on the k𝑘kitalic_k-th RIS and the m𝑚mitalic_m-th SF; while θcmnsubscriptsuperscript𝜃𝑛subscript𝑐𝑚\theta^{n}_{c_{m}}italic_θ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT is the successful transmission probability that IoT device n𝑛nitalic_n directly sends data to the BS with SF m𝑚mitalic_m.

It is difficult to solve problem (11) in this distributed hybrid network, especially in Pattern I. First, cmsubscript𝑐𝑚c_{m}italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT and θk,cmnsubscriptsuperscript𝜃𝑛𝑘subscript𝑐𝑚\theta^{n}_{k,c_{m}}italic_θ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT are discrete values666According to (9), cmsubscript𝑐𝑚c_{m}italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT is discrete since the number of SFs is limited in practice. In addition, according to (8) and (10), θk,cmnsubscriptsuperscript𝜃𝑛𝑘subscript𝑐𝑚\theta^{n}_{k,c_{m}}italic_θ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT is a function of the RIS’s phase shifts, which are discrete values in range [0,2π)02𝜋[0,2\pi)[ 0 , 2 italic_π )., resulting in a non-convex problem. Second, it requires the exact value of θk,cmnsubscriptsuperscript𝜃𝑛𝑘subscript𝑐𝑚\theta^{n}_{k,c_{m}}italic_θ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT. This information is difficult to obtain since the channel characteristic is determined by the UE-controlled RISs. Third, it needs some communications among IoT devices to share the information of θk,cmnsubscriptsuperscript𝜃𝑛𝑘subscript𝑐𝑚\theta^{n}_{k,c_{m}}italic_θ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT so as to determine the optimal available RIS for each IoT device. In addition, problem (11) in Pattern II can be regarded as a rate adaptation problem [36] since the SF allocation is independent for each IoT device, but it still requires the exact CSI (or the value of θcmnsubscriptsuperscript𝜃𝑛subscript𝑐𝑚\theta^{n}_{c_{m}}italic_θ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT), which is hard to estimate from the time-varying channel.

To overcome these challenges, we adopt the online learning method to learn the values of θcmnsubscriptsuperscript𝜃𝑛subscript𝑐𝑚\theta^{n}_{c_{m}}italic_θ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT and θk,cmnsubscriptsuperscript𝜃𝑛𝑘subscript𝑐𝑚\theta^{n}_{k,c_{m}}italic_θ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT sequentially and to allocate the optimal RIS and SF to each IoT device adaptively. During this process, the IoT device not only needs to explore the combinations of the RISs and SFs sufficiently but also needs to exploit the current best RIS and SF as many times as possible at each time slot. To better tradeoff this EE dilemma, we introduce the MPMAB framework to solve this problem, where players are the IoT devices, and arms are the combinations of the RISs and SFs.

However, the MPMAB framework still suffers from a slow convergence rate due to the large arm space (i.e., the combinations of the RISs and SFs) under Pattern I. Therefore, we decouple the MPMAB problem into a two-stage MPMAB framework to shrink the feasible arm space. The reason is that, on the one hand, descending data rates will result in ascending successful transmission probabilities; on the other hand, an IoT device with different data rates will experience the same channel fading under a particular RIS. These features indicate that the average successful transmission probabilities of the ordered data rates over different RISs have the same trend. Therefore, we can explore these RISs by arbitrarily assigning a data rate to the IoT device. In other words, the SF allocation and the RIS allocation processes are independent of each other under Pattern I.

IV Two-Stage MPMAB-based Resource Allocation Framework

In this two-stage MPMAB framework, players are the IoT devices; arms are the RISs and the SFs in the first and second stages, respectively. The first-stage MPMAB problem is to determine the best RIS for each IoT device; while the second-stage MPMAB problem is to find the optimal SF based on the state of the determined RIS.

IV-A First-Stage MPMAB Framework

We first introduce the transmission feedback model and the collision model. In the transmission feedback model, the IoT device can receive the transmission feedback from the BS when it transmits on Pattern I or Pattern II. Specifically, let In,tsubscriptsuperscript𝐼𝑛𝑡I^{\prime}_{n,t}italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT be the selected arm by the n𝑛nitalic_n-th IoT device at time slot t𝑡titalic_t. After transmitting on the In,tsubscriptsuperscript𝐼𝑛𝑡I^{\prime}_{n,t}italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT-th arm, the IoT device n𝑛nitalic_n will receive transmission feedback XIn,t(t)subscript𝑋subscriptsuperscript𝐼𝑛𝑡𝑡X_{I^{\prime}_{n,t}}(t)italic_X start_POSTSUBSCRIPT italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ) from the BS. If the transmission is successful, then XIn,t(t)=1subscript𝑋subscriptsuperscript𝐼𝑛𝑡𝑡1X_{I^{\prime}_{n,t}}(t)=1italic_X start_POSTSUBSCRIPT italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ) = 1; otherwise, XIn,t(t)=0subscript𝑋subscriptsuperscript𝐼𝑛𝑡𝑡0X_{I^{\prime}_{n,t}}(t)=0italic_X start_POSTSUBSCRIPT italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ) = 0.

The collision model only exists in the first stage, referring to that two or more IoT devices that choose the same RIS will receive no rewards. We assume that each IoT device can deduce this collision information by observing the timeout feedback flag. Specifically, let η𝜂\etaitalic_η be the collision indicator. If an IoT device does not receive any feedback from the BS in the current time slot, then a collision happens, i.e., η=0𝜂0\eta=0italic_η = 0; otherwise, η=1𝜂1\eta=1italic_η = 1. Therefore, IoT devices can distinguish the collision and the transmission failure events by checking whether or not they receive transmission feedback from the BS. Moreover, this collision model also works in some extreme situations. For example, when the received SINR in Pattern I is too low to be recognized by the BS, the IoT device can always set η=0𝜂0\eta=0italic_η = 0 (i.e., the reward is 00) since the target RIS is suboptimal to it.

Denote 𝑰t={I1,t,I2,t,IN,t}subscriptsuperscript𝑰𝑡subscriptsuperscript𝐼1𝑡subscriptsuperscript𝐼2𝑡subscriptsuperscript𝐼𝑁𝑡\boldsymbol{I}^{\prime}_{t}=\{I^{\prime}_{1,t},I^{{}^{\prime}}_{2,t}\cdots,I^{% \prime}_{N,t}\}bold_italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = { italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 , italic_t end_POSTSUBSCRIPT , italic_I start_POSTSUPERSCRIPT start_FLOATSUPERSCRIPT ′ end_FLOATSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 , italic_t end_POSTSUBSCRIPT ⋯ , italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_N , italic_t end_POSTSUBSCRIPT } by the strategy profile at time slot t𝑡titalic_t. The collision indicator of RIS k𝑘kitalic_k is defined as

ηk(𝑰t)={0,|𝒩k|>1,1,otherwise,subscript𝜂𝑘subscriptsuperscript𝑰𝑡cases0subscript𝒩𝑘11otherwise\begin{split}\eta_{k}\left(\boldsymbol{I}^{\prime}_{t}\right)=\left\{\begin{% array}[]{ll}0,&|\mathcal{N}_{k}|>1,\\ 1,&\text{otherwise},\end{array}\right.\end{split}start_ROW start_CELL italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = { start_ARRAY start_ROW start_CELL 0 , end_CELL start_CELL | caligraphic_N start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT | > 1 , end_CELL end_ROW start_ROW start_CELL 1 , end_CELL start_CELL otherwise , end_CELL end_ROW end_ARRAY end_CELL end_ROW (12)

where 𝒩ksubscript𝒩𝑘\mathcal{N}_{k}caligraphic_N start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is the set of players that select the k𝑘kitalic_k-th RIS in strategy profile 𝑰tsubscriptsuperscript𝑰𝑡\boldsymbol{I}^{\prime}_{t}bold_italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT. The reward that IoT device n𝑛nitalic_n transmits on the k𝑘kitalic_k-th RIS is given by

rn,In,t=k(t)ηk(𝑰t)XIn,t=k(t).subscript𝑟𝑛subscriptsuperscript𝐼𝑛𝑡𝑘𝑡subscript𝜂𝑘subscriptsuperscript𝑰𝑡subscript𝑋subscriptsuperscript𝐼𝑛𝑡𝑘𝑡r_{n,I^{\prime}_{n,t}=k}(t)\triangleq\eta_{k}\left(\boldsymbol{I}^{\prime}_{t}% \right)X_{I^{\prime}_{n,t}=k}(t).italic_r start_POSTSUBSCRIPT italic_n , italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = italic_k end_POSTSUBSCRIPT ( italic_t ) ≜ italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) italic_X start_POSTSUBSCRIPT italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = italic_k end_POSTSUBSCRIPT ( italic_t ) . (13)

Then, the estimated average successful transmission probability that the n𝑛nitalic_n-th IoT device transmits on the k𝑘kitalic_k-th RIS is given by

θ^n,k=𝔼[rn,In,t=k(t)],subscript^𝜃𝑛𝑘𝔼delimited-[]subscript𝑟𝑛subscriptsuperscript𝐼𝑛𝑡𝑘𝑡\begin{split}\hat{\theta}_{n,k}&=\mathbb{E}\left[r_{n,I^{\prime}_{n,t}=k}(t)% \right],\\ \end{split}start_ROW start_CELL over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT end_CELL start_CELL = blackboard_E [ italic_r start_POSTSUBSCRIPT italic_n , italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = italic_k end_POSTSUBSCRIPT ( italic_t ) ] , end_CELL end_ROW (14)

where 𝔼[]𝔼delimited-[]\mathbb{E}[\cdot]blackboard_E [ ⋅ ] is the expectation operator.

IV-B Second-Stage MPMAB Framework

In this stage, each player has a targeted RIS after the first-stage allocation. Thus, the IoT device transmits directly or on a targeted RIS to the BS at each time slot. Note that there is no collision in this stage since two or more players can choose the same SF. Therefore, this stage can be regarded as a single-player MAB framework.

Let In,t′′subscriptsuperscript𝐼′′𝑛𝑡I^{\prime\prime}_{n,t}italic_I start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT be the currently selected arm at the second-stage allocation. The reward that the IoT device n𝑛nitalic_n chooses the m𝑚mitalic_m-th data rate is defined as

rn,In,t′′=m(t)cmηk(𝑰t)XIn,t′′=m(t),subscript𝑟𝑛subscriptsuperscript𝐼′′𝑛𝑡𝑚𝑡subscript𝑐𝑚subscript𝜂𝑘subscriptsuperscript𝑰𝑡subscript𝑋subscriptsuperscript𝐼′′𝑛𝑡𝑚𝑡r_{n,I^{{}^{\prime\prime}}_{n,t}=m}(t)\triangleq c_{m}\eta_{k}\left(% \boldsymbol{I}^{\prime}_{t}\right)X_{I^{\prime\prime}_{n,t}=m}(t),italic_r start_POSTSUBSCRIPT italic_n , italic_I start_POSTSUPERSCRIPT start_FLOATSUPERSCRIPT ′ ′ end_FLOATSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = italic_m end_POSTSUBSCRIPT ( italic_t ) ≜ italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) italic_X start_POSTSUBSCRIPT italic_I start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = italic_m end_POSTSUBSCRIPT ( italic_t ) , (15)

where cmsubscript𝑐𝑚c_{m}italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT is the m𝑚mitalic_m-th data rate and 𝑰tsubscriptsuperscript𝑰𝑡\boldsymbol{I}^{\prime}_{t}bold_italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT is the strategy profile of all players’ target RISs at the first stage. The instantaneous rewards rn,In,t′′(t)subscript𝑟𝑛subscriptsuperscript𝐼′′𝑛𝑡𝑡r_{n,I^{\prime\prime}_{n,t}}(t)italic_r start_POSTSUBSCRIPT italic_n , italic_I start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ) are independently and identically distributed w.r.t. player n𝑛nitalic_n and time slot t𝑡titalic_t. Therefore, the estimated average reward is given by

μ^n,m=𝔼[rn,In,t′′=m(t)]=cmθ^n,m,subscript^𝜇𝑛𝑚𝔼delimited-[]subscript𝑟𝑛subscriptsuperscript𝐼′′𝑛𝑡𝑚𝑡subscript𝑐𝑚subscript^𝜃𝑛𝑚\hat{\mu}_{n,m}=\mathbb{E}\left[r_{n,I^{\prime\prime}_{n,t}=m}(t)\right]=c_{m}% \hat{\theta}_{n,m},over^ start_ARG italic_μ end_ARG start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT = blackboard_E [ italic_r start_POSTSUBSCRIPT italic_n , italic_I start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = italic_m end_POSTSUBSCRIPT ( italic_t ) ] = italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT , (16)

where θ^n,msubscript^𝜃𝑛𝑚\hat{\theta}_{n,m}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT is the estimated average successful transmission probability that the n𝑛nitalic_n-th IoT device transmits on the m𝑚mitalic_m-th data rate.

IV-C Performance Metric for the Two-Stage MPMAB Framework

In the following, we design a criterion to quantify the performance loss that players select the suboptimal arms rather than the optimal arm in this two-stage MPMAB problem. According to (11), the objective function consists of Pattern I and Pattern II. For Pattern I, we define the joint RIS and SF selection profile by 𝒂={a1,a2,,aN}𝒂subscript𝑎1subscript𝑎2subscript𝑎𝑁\boldsymbol{a}=\{a_{1},a_{2},\ldots,a_{N}\}bold_italic_a = { italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_a start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT }, where an𝒦subscript𝑎𝑛tensor-product𝒦a_{n}\in\mathcal{K}\otimes\mathcal{M}italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ caligraphic_K ⊗ caligraphic_M and tensor-product\otimes is the Cartesian product of the RIS set and the data rate set. However, for Pattern II, the selection profile 𝒂𝒂\boldsymbol{a}bold_italic_a is the set of data rates, i.e., ansubscript𝑎𝑛a_{n}\in\mathcal{M}italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ caligraphic_M. Therefore, the two-stage allocation aims to solve the following problem,

𝒂=argmax𝒂n=1Nμ^n,an=argmax𝒂n=1N𝔼[c𝒂ηk(𝒂)X𝒂(t)],superscript𝒂argsubscript𝒂subscriptsuperscript𝑁𝑛1subscript^𝜇𝑛subscript𝑎𝑛argsubscript𝒂subscriptsuperscript𝑁𝑛1𝔼delimited-[]subscript𝑐𝒂subscript𝜂𝑘𝒂subscript𝑋𝒂𝑡\begin{split}\boldsymbol{a}^{\ast}&=\text{arg}\max\limits_{\boldsymbol{a}}\sum% \limits^{N}_{n=1}\hat{\mu}_{n,a_{n}}\\ &=\text{arg}\max\limits_{\boldsymbol{a}}\sum\limits^{N}_{n=1}\mathbb{E}\left[c% _{\boldsymbol{a}}\eta_{k}\left(\boldsymbol{a}\right)X_{\boldsymbol{a}}(t)% \right],\end{split}start_ROW start_CELL bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_CELL start_CELL = arg roman_max start_POSTSUBSCRIPT bold_italic_a end_POSTSUBSCRIPT ∑ start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT over^ start_ARG italic_μ end_ARG start_POSTSUBSCRIPT italic_n , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = arg roman_max start_POSTSUBSCRIPT bold_italic_a end_POSTSUBSCRIPT ∑ start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT blackboard_E [ italic_c start_POSTSUBSCRIPT bold_italic_a end_POSTSUBSCRIPT italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_a ) italic_X start_POSTSUBSCRIPT bold_italic_a end_POSTSUBSCRIPT ( italic_t ) ] , end_CELL end_ROW (17)

where 𝒂={a1,a2,,aN}superscript𝒂superscriptsubscript𝑎1superscriptsubscript𝑎2superscriptsubscript𝑎𝑁\boldsymbol{a}^{\ast}=\{a_{1}^{\ast},a_{2}^{\ast},\cdots,a_{N}^{\ast}\}bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = { italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT , ⋯ , italic_a start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT } is the optimal strategy profile.

Then, we define the difference between the optimal arm and the currently selected arm as the performance metric, also known as regret. According to [23], the expression of accumulated regrets is given by

egt=1Tn=1Nrn,an(t)t=1Tn=1Nrn,an(t),𝑒𝑔superscriptsubscript𝑡1𝑇superscriptsubscript𝑛1𝑁subscript𝑟𝑛superscriptsubscript𝑎𝑛𝑡superscriptsubscript𝑡1𝑇superscriptsubscript𝑛1𝑁subscript𝑟𝑛subscript𝑎𝑛𝑡\mathcal{R}eg\triangleq\sum_{t=1}^{T}\sum_{n=1}^{N}{r}_{n,a_{n}^{\ast}}(t)-% \sum_{t=1}^{T}\sum_{n=1}^{N}{r}_{n,a_{n}}(t),caligraphic_R italic_e italic_g ≜ ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_r start_POSTSUBSCRIPT italic_n , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ( italic_t ) - ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_r start_POSTSUBSCRIPT italic_n , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ) , (18)

where an𝒂subscriptsuperscript𝑎𝑛superscript𝒂a^{\ast}_{n}\in\boldsymbol{a}^{\ast}italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT and T𝑇Titalic_T is the total time slots. For mathematical analysis, we further define the pseudo-regret [19] w.r.t. the stochastic rewards and the randomly selected arms as

eg¯=n=1N(T×μn,an𝔼t=1Tμn,an)={n=1Ni=1K×MΔn,i𝔼[Wn,i],PatternI,n=1Ni=1MΔn,i𝔼[Wn,i],PatternII,¯𝑒𝑔superscriptsubscript𝑛1𝑁𝑇subscript𝜇𝑛subscriptsuperscript𝑎𝑛𝔼superscriptsubscript𝑡1𝑇subscript𝜇𝑛subscript𝑎𝑛casessuperscriptsubscript𝑛1𝑁superscriptsubscript𝑖1𝐾𝑀subscriptΔ𝑛𝑖𝔼delimited-[]subscript𝑊𝑛𝑖PatternIsuperscriptsubscript𝑛1𝑁superscriptsubscript𝑖1𝑀subscriptΔ𝑛𝑖𝔼delimited-[]subscript𝑊𝑛𝑖PatternII\small\begin{split}\overline{\mathcal{R}eg}&=\sum_{n=1}^{N}\left(T\times\mu_{n% ,a^{\ast}_{n}}-\mathbb{E}\sum_{t=1}^{T}\mu_{n,a_{n}}\right)\\ &=\left\{\begin{array}[]{ll}\sum_{n=1}^{N}\sum_{i=1}^{K\times M}\Delta_{n,i}% \mathbb{E}[W_{n,i}],&\mathrm{Pattern}\ \mathrm{I},\\ \sum_{n=1}^{N}\sum_{i=1}^{M}\Delta_{n,i}\mathbb{E}[W_{n,i}],&\mathrm{Pattern}% \ \mathrm{II},\end{array}\right.\end{split}start_ROW start_CELL over¯ start_ARG caligraphic_R italic_e italic_g end_ARG end_CELL start_CELL = ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( italic_T × italic_μ start_POSTSUBSCRIPT italic_n , italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT - blackboard_E ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_μ start_POSTSUBSCRIPT italic_n , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = { start_ARRAY start_ROW start_CELL ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K × italic_M end_POSTSUPERSCRIPT roman_Δ start_POSTSUBSCRIPT italic_n , italic_i end_POSTSUBSCRIPT blackboard_E [ italic_W start_POSTSUBSCRIPT italic_n , italic_i end_POSTSUBSCRIPT ] , end_CELL start_CELL roman_Pattern roman_I , end_CELL end_ROW start_ROW start_CELL ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT roman_Δ start_POSTSUBSCRIPT italic_n , italic_i end_POSTSUBSCRIPT blackboard_E [ italic_W start_POSTSUBSCRIPT italic_n , italic_i end_POSTSUBSCRIPT ] , end_CELL start_CELL roman_Pattern roman_II , end_CELL end_ROW end_ARRAY end_CELL end_ROW (19)

where Δn,i=μn,anμn,isubscriptΔ𝑛𝑖subscript𝜇𝑛subscriptsuperscript𝑎𝑛subscript𝜇𝑛𝑖\Delta_{n,i}=\mu_{n,a^{\ast}_{n}}-\mu_{n,i}roman_Δ start_POSTSUBSCRIPT italic_n , italic_i end_POSTSUBSCRIPT = italic_μ start_POSTSUBSCRIPT italic_n , italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT - italic_μ start_POSTSUBSCRIPT italic_n , italic_i end_POSTSUBSCRIPT and Wn,isubscript𝑊𝑛𝑖W_{n,i}italic_W start_POSTSUBSCRIPT italic_n , italic_i end_POSTSUBSCRIPT is the number of times that arm i𝑖iitalic_i has been selected up to time T𝑇Titalic_T. Term μn,isubscript𝜇𝑛𝑖\mu_{n,i}italic_μ start_POSTSUBSCRIPT italic_n , italic_i end_POSTSUBSCRIPT is the real expected throughput of player n𝑛nitalic_n at arm i𝑖iitalic_i.

V E2boost Algorithm

In this section, we propose an E2Boost algorithm to solve this two-stage MPMAB problem by combining the game theory and the MAB algorithms. The structure of the E2Boost algorithm is shown in Fig. 2. Since time horizon T𝑇Titalic_T is unknown to each player, the E2Boost algorithm proceeds in epochs (i.e., z=1,,Z𝑧1𝑍z=1,\cdots,Zitalic_z = 1 , ⋯ , italic_Z). Each epoch consists of three phases: ϵitalic-ϵ\epsilonitalic_ϵ-Greedy EE, non-cooperation game, and Thompson sampling EE phases. Each phase contains several time slots and specific mechanisms to balance the EE dilemma.

Refer to caption
Figure 2: The structure of the E2Boost algorithm.

V-A The Exploration and Exploitation Boosting Algorithm

The E2Boost algorithm is shown in Algorithm 1. The first two phases are designed to find the optimal RIS for each IoT device by solving the first-stage MPMAB problem; while the last phase is to determine the best SF by solving the second-stage MPMAB problem. In the following, we elaborate on the above three phases in detail.

Algorithm 1 E2Boost Algorithm Run by Player n𝑛nitalic_n
1:Initialization: δ>0,ε>0formulae-sequence𝛿0𝜀0\delta>0,\varepsilon>0italic_δ > 0 , italic_ε > 0 and ν,ν1,ν2,ν3>0𝜈subscript𝜈1subscript𝜈2subscript𝜈30\nu,\nu_{1},\nu_{2},\nu_{3}>0italic_ν , italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT > 0. Let ϵ=1,Vn,k(0)=0,Qn,k(0)=0,αn,m(0)=0,βn,m(0)=0,k𝒦formulae-sequenceitalic-ϵ1formulae-sequencesubscript𝑉𝑛𝑘00formulae-sequencesubscript𝑄𝑛𝑘00formulae-sequencesubscript𝛼𝑛𝑚00formulae-sequencesubscript𝛽𝑛𝑚00for-all𝑘𝒦\epsilon=1,V_{n,k}(0)=0,Q_{n,k}(0)=0,\alpha_{n,m}(0)=0,\beta_{n,m}(0)=0,\ % \forall k\in\mathcal{K}italic_ϵ = 1 , italic_V start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT ( 0 ) = 0 , italic_Q start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT ( 0 ) = 0 , italic_α start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT ( 0 ) = 0 , italic_β start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT ( 0 ) = 0 , ∀ italic_k ∈ caligraphic_K,  mfor-all𝑚\forall m\in\mathcal{M}∀ italic_m ∈ caligraphic_M.
2:for each epoch z=1,2,𝑧12z=1,2,\cdotsitalic_z = 1 , 2 , ⋯, Z do
3:     i) ϵitalic-ϵ\epsilonitalic_ϵ-Greedy EE Phase:  For the next ν1zδsubscript𝜈1superscript𝑧𝛿\nu_{1}z^{\delta}italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_z start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT time slots.
4:      a) Pick up a data rate In,t=msubscriptsuperscript𝐼𝑛𝑡𝑚I^{\prime}_{n,t}=mitalic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = italic_m uniformly from set \mathcal{M}caligraphic_M if z=1𝑧1z=1italic_z = 1, otherwise In,t=cnsubscriptsuperscript𝐼𝑛𝑡superscriptsubscript𝑐𝑛I^{\prime}_{n,t}=c_{n}^{\ast}italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = italic_c start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT;
5:      b) Select a RIS In,t=ksubscriptsuperscript𝐼𝑛𝑡𝑘I^{\prime}_{n,t}=kitalic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = italic_k uniformly from set 𝒦𝒦\mathcal{K}caligraphic_K with probability ϵitalic-ϵ\epsilonitalic_ϵ or In,t=knsubscriptsuperscript𝐼𝑛𝑡superscriptsubscript𝑘𝑛I^{\prime}_{n,t}=k_{n}^{\ast}italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = italic_k start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT with probability 1ϵ1italic-ϵ1-\epsilon1 - italic_ϵ;
6:      c) Detect the selected RIS: jump to Phase iii if busy; otherwise, continue the following steps:
7:      d) Observe the transmission feedback Xn,In,tsubscript𝑋𝑛subscriptsuperscript𝐼𝑛𝑡X_{n,I^{\prime}_{n,t}}italic_X start_POSTSUBSCRIPT italic_n , italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT. Set ηk=0subscript𝜂𝑘0\eta_{k}=0italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 0 if timeout and ηk=1subscript𝜂𝑘1\eta_{k}=1italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 1 otherwise;
8:      e) If ηk=1subscript𝜂𝑘1\eta_{k}=1italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 1 then update Vn,In,t(t)=Vn,In,t(t1)+1subscript𝑉𝑛subscriptsuperscript𝐼𝑛𝑡𝑡subscript𝑉𝑛subscriptsuperscript𝐼𝑛𝑡𝑡11V_{n,I^{\prime}_{n,t}}(t)=V_{n,I^{\prime}_{n,t}}(t-1)+1italic_V start_POSTSUBSCRIPT italic_n , italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ) = italic_V start_POSTSUBSCRIPT italic_n , italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t - 1 ) + 1 and Qn,In,t(t)=Qn,In,t(t1)+Xn,In,t(t)subscript𝑄𝑛subscriptsuperscript𝐼𝑛𝑡𝑡subscript𝑄𝑛subscriptsuperscript𝐼𝑛𝑡𝑡1subscript𝑋𝑛subscriptsuperscript𝐼𝑛𝑡𝑡Q_{n,I^{\prime}_{n,t}}(t)=Q_{n,I^{\prime}_{n,t}}(t-1)+X_{n,I^{\prime}_{n,t}}(t)italic_Q start_POSTSUBSCRIPT italic_n , italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ) = italic_Q start_POSTSUBSCRIPT italic_n , italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t - 1 ) + italic_X start_POSTSUBSCRIPT italic_n , italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t );
9:      At the end of this phase, compute the successful transmission probabilities of RISs by
θ^n,kz=Qn,kVn,k,k𝒦.formulae-sequencesubscriptsuperscript^𝜃𝑧𝑛𝑘subscript𝑄𝑛𝑘subscript𝑉𝑛𝑘for-all𝑘𝒦\hat{\theta}^{z}_{n,k}=\frac{Q_{n,k}}{V_{n,k}},\quad\forall k\in\mathcal{K}.over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT = divide start_ARG italic_Q start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT end_ARG start_ARG italic_V start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT end_ARG , ∀ italic_k ∈ caligraphic_K .
10:     ii) Non-cooperation Game Phase:  For the next ν2zδsubscript𝜈2superscript𝑧𝛿\nu_{2}z^{\delta}italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_z start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT time slots, play with the dynamics. Set STn=C𝑆subscript𝑇𝑛𝐶ST_{n}=Citalic_S italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_C, and let k¯¯𝑘\bar{k}over¯ start_ARG italic_k end_ARG be the last RIS chosen in the zz21𝑧𝑧21z-\lfloor\frac{z}{2}\rfloor-1italic_z - ⌊ divide start_ARG italic_z end_ARG start_ARG 2 end_ARG ⌋ - 1 Game phase, or a random choice if z=1,2𝑧12z=1,2italic_z = 1 , 2.
11:      a) If STn=C𝑆subscript𝑇𝑛𝐶ST_{n}=Citalic_S italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_C choose a RIS In,tsubscriptsuperscript𝐼𝑛𝑡I^{\prime}_{n,t}italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT using (21) and if STn=D𝑆subscript𝑇𝑛𝐷ST_{n}=Ditalic_S italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_D select In,tsubscriptsuperscript𝐼𝑛𝑡I^{\prime}_{n,t}italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT at random (22);
12:      b) Detect the selected RIS: jump to Phase iii if busy; otherwise continue the following steps:
13:      c) If In,tk¯subscriptsuperscript𝐼𝑛𝑡¯𝑘I^{\prime}_{n,t}\neq\bar{k}italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT ≠ over¯ start_ARG italic_k end_ARG or un=0subscript𝑢𝑛0u_{n}=0italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 0 or STn=D𝑆subscript𝑇𝑛𝐷ST_{n}=Ditalic_S italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_D then set STn=CorD𝑆subscript𝑇𝑛𝐶or𝐷ST_{n}=C\ \text{or}\ Ditalic_S italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_C or italic_D according to (24);
14:      d) Record the number of times each RIS has been selected within the content state using (25);
15:      e) Adjust parameter ϵitalic-ϵ\epsilonitalic_ϵ according to Lemma 1 when z2𝑧2z\geq 2italic_z ≥ 2.
16:      At the end of this phase, determine the current best RIS by
kn=argmaxk𝒦j=0z2Fnzj(k).superscriptsubscript𝑘𝑛argsubscript𝑘𝒦superscriptsubscript𝑗0𝑧2subscriptsuperscript𝐹𝑧𝑗𝑛𝑘k_{n}^{\ast}=\text{arg}\max\limits_{k\in\mathcal{K}}\ \sum_{j=0}^{\lfloor\frac% {z}{2}\rfloor}F^{z-j}_{n}(k).italic_k start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = arg roman_max start_POSTSUBSCRIPT italic_k ∈ caligraphic_K end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⌊ divide start_ARG italic_z end_ARG start_ARG 2 end_ARG ⌋ end_POSTSUPERSCRIPT italic_F start_POSTSUPERSCRIPT italic_z - italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_k ) .
17:     iii) Thompson Sampling EE Phase: For the next ν32zsubscript𝜈3superscript2𝑧\nu_{3}2^{z}italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT time slots, run the Thompson sampling algorithm based on the current state of the best RIS, as well as the corresponding collision indicator.
18:      a) Draw θ^n,mBeta(αn,m(t)+1,βn,m(t)+1)similar-tosubscript^𝜃𝑛𝑚Betasubscript𝛼𝑛𝑚𝑡1subscript𝛽𝑛𝑚𝑡1\hat{\theta}_{n,m}\sim\text{Beta}\left(\alpha_{n,m}(t)+1,\beta_{n,m}(t)+1\right)over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT ∼ Beta ( italic_α start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT ( italic_t ) + 1 , italic_β start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT ( italic_t ) + 1 );
19:      b) Select a data rate In,t′′=argmaxmcm×θ^n,msubscriptsuperscript𝐼′′𝑛𝑡argsubscript𝑚subscript𝑐𝑚subscript^𝜃𝑛𝑚I^{\prime\prime}_{n,t}=\mathrm{arg}\max_{m\in\mathcal{M}}c_{m}\times\hat{% \theta}_{n,m}italic_I start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = roman_arg roman_max start_POSTSUBSCRIPT italic_m ∈ caligraphic_M end_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT × over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT;
20:      c) Detect the target RIS: The device directly transmits to the BS if busy, otherwise continue the following steps:
21:      d) Transmit on the selected data rate and observe the random transmission feedback XIn,t′′(t)subscript𝑋subscriptsuperscript𝐼′′𝑛𝑡𝑡X_{I^{\prime\prime}_{n,t}}(t)italic_X start_POSTSUBSCRIPT italic_I start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t );
22:      e) Posterior update: Set αn,In,t′′(t)=αn,In,t′′(t1)+XIn,t′′(t)subscript𝛼𝑛subscriptsuperscript𝐼′′𝑛𝑡𝑡subscript𝛼𝑛subscriptsuperscript𝐼′′𝑛𝑡𝑡1subscript𝑋subscriptsuperscript𝐼′′𝑛𝑡𝑡\alpha_{n,I^{\prime\prime}_{n,t}}(t)=\alpha_{n,I^{\prime\prime}_{n,t}}(t-1)+X_% {I^{\prime\prime}_{n,t}}(t)italic_α start_POSTSUBSCRIPT italic_n , italic_I start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ) = italic_α start_POSTSUBSCRIPT italic_n , italic_I start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t - 1 ) + italic_X start_POSTSUBSCRIPT italic_I start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ) and βn,In,t′′(t)=βn,In,t′′(t1)+1XIn,t′′(t)subscript𝛽𝑛subscriptsuperscript𝐼′′𝑛𝑡𝑡subscript𝛽𝑛subscriptsuperscript𝐼′′𝑛𝑡𝑡11subscript𝑋subscriptsuperscript𝐼′′𝑛𝑡𝑡\beta_{n,I^{\prime\prime}_{n,t}}(t)=\beta_{n,I^{\prime\prime}_{n,t}}(t-1)+1-X_% {I^{\prime\prime}_{n,t}}(t)italic_β start_POSTSUBSCRIPT italic_n , italic_I start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ) = italic_β start_POSTSUBSCRIPT italic_n , italic_I start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t - 1 ) + 1 - italic_X start_POSTSUBSCRIPT italic_I start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_t ).
23:      At the end of this phase, determine the current best data rate by
cn=argmaxmcmαn,mαn,m+βn,m.superscriptsubscript𝑐𝑛argsubscript𝑚subscript𝑐𝑚subscript𝛼𝑛𝑚subscript𝛼𝑛𝑚subscript𝛽𝑛𝑚c_{n}^{\ast}=\text{arg}\max\limits_{m\in\mathcal{M}}\ \frac{c_{m}\alpha_{n,m}}% {\alpha_{n,m}+\beta_{n,m}}.italic_c start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = arg roman_max start_POSTSUBSCRIPT italic_m ∈ caligraphic_M end_POSTSUBSCRIPT divide start_ARG italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT italic_α start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT end_ARG start_ARG italic_α start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT + italic_β start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT end_ARG .
24:end for

ϵitalic-ϵ\epsilonitalic_ϵ-Greedy EE Phase: There are ν1zδsubscript𝜈1superscript𝑧𝛿\nu_{1}z^{\delta}italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_z start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT rounds in this phase for epoch z=1,,Z𝑧1𝑍z=1,\cdots,Zitalic_z = 1 , ⋯ , italic_Z, where ν1>0subscript𝜈10\nu_{1}>0italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT > 0 and δ>0𝛿0\delta>0italic_δ > 0 are two constants. It aims to estimate the average successful transmission probability of each RIS. The SF is randomly chosen from the set 𝒮𝒮\mathcal{S}caligraphic_S when z=1𝑧1z=1italic_z = 1; otherwise, it uses the SF determined in the last epoch of the third phase. We adopt the ϵitalic-ϵ\epsilonitalic_ϵ-greedy algorithm to balance the EE dilemma. Specifically, if z=1𝑧1z=1italic_z = 1, we set ϵ=1italic-ϵ1\epsilon=1italic_ϵ = 1 to uniformly explore all RISs; otherwise, we update the parameter ϵitalic-ϵ\epsilonitalic_ϵ according to Lemma 1, as given in the next paragraph. Hence, when the players’ strategy profile deviates from 𝒂superscript𝒂\boldsymbol{a}^{\ast}bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, Algorithm 1 tends to uniformly explore all actions; otherwise, it inherits the last epoch’s action with a high probability.

Non-cooperation Game Phase: This phase has a length of ν2zδsubscript𝜈2superscript𝑧𝛿\nu_{2}z^{\delta}italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_z start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT rounds, which is the core step of Algorithm 1 to allocate the optimal RIS for each player. By adopting the estimated average successful transmission probability θ^n,kzsubscriptsuperscript^𝜃𝑧𝑛𝑘\hat{\theta}^{z}_{n,k}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT in the first phase as a utility, players in this phase will play a non-cooperation game.

Specifically, let the utility of player n𝑛nitalic_n in strategy profile 𝑰superscript𝑰\boldsymbol{I}^{\prime}bold_italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT be

un(𝑰)ηk(𝑰)θ^n,kz,k𝒦,formulae-sequencesubscript𝑢𝑛superscript𝑰subscript𝜂𝑘superscript𝑰subscriptsuperscript^𝜃𝑧𝑛𝑘for-all𝑘𝒦u_{n}(\boldsymbol{I}^{\prime})\triangleq\eta_{k}(\boldsymbol{I}^{\prime})\hat{% \theta}^{z}_{n,k},\ \forall k\in\mathcal{K},italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) ≜ italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( bold_italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT , ∀ italic_k ∈ caligraphic_K , (20)

where θ^n,kzsubscriptsuperscript^𝜃𝑧𝑛𝑘\hat{\theta}^{z}_{n,k}over^ start_ARG italic_θ end_ARG start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT is the estimated successful transmission probability that the n𝑛nitalic_n-th IoT device transmits on the k𝑘kitalic_k-th RIS from epoch 1111 to z𝑧zitalic_z at the first phase. Let un,max=max𝑰un(𝑰)subscript𝑢𝑛subscriptsuperscript𝑰subscript𝑢𝑛superscript𝑰u_{n,\max}=\max\limits_{\boldsymbol{I}^{\prime}}u_{n}(\boldsymbol{I}^{\prime})italic_u start_POSTSUBSCRIPT italic_n , roman_max end_POSTSUBSCRIPT = roman_max start_POSTSUBSCRIPT bold_italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) be the maximum utility of player n𝑛nitalic_n. Assume that each player has a private state STn={C,D},n𝒩formulae-sequence𝑆subscript𝑇𝑛𝐶𝐷for-all𝑛𝒩ST_{n}=\{C,D\},\ \forall n\in\mathcal{N}italic_S italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = { italic_C , italic_D } , ∀ italic_n ∈ caligraphic_N, where C𝐶Citalic_C and D𝐷Ditalic_D represent content and discontent state, respectively. In addition, each player maintains a baseline RIS k¯¯𝑘\bar{k}over¯ start_ARG italic_k end_ARG. Then, a player chooses a RIS according to the following strategy:

  • A content player has a very high probability of staying at the current baseline RIS:

    Pn,k={ενK1,kk¯;1εν,k=k¯.subscript𝑃𝑛𝑘casessuperscript𝜀𝜈𝐾1𝑘¯𝑘1superscript𝜀𝜈𝑘¯𝑘\begin{split}P_{n,k}=\left\{\begin{array}[]{ll}\frac{\varepsilon^{\nu}}{K-1},&% k\neq\bar{k};\\ 1-\varepsilon^{\nu},&k=\bar{k}.\end{array}\right.\end{split}start_ROW start_CELL italic_P start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT = { start_ARRAY start_ROW start_CELL divide start_ARG italic_ε start_POSTSUPERSCRIPT italic_ν end_POSTSUPERSCRIPT end_ARG start_ARG italic_K - 1 end_ARG , end_CELL start_CELL italic_k ≠ over¯ start_ARG italic_k end_ARG ; end_CELL end_ROW start_ROW start_CELL 1 - italic_ε start_POSTSUPERSCRIPT italic_ν end_POSTSUPERSCRIPT , end_CELL start_CELL italic_k = over¯ start_ARG italic_k end_ARG . end_CELL end_ROW end_ARRAY end_CELL end_ROW (21)
  • A discontent player selects a RIS following a uniform distribution, i.e.,

    Pn,k=1K,k𝒦.formulae-sequencesubscript𝑃𝑛𝑘1𝐾for-all𝑘𝒦P_{n,k}=\frac{1}{K},\ \forall k\in\mathcal{K}.italic_P start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_K end_ARG , ∀ italic_k ∈ caligraphic_K . (22)

The transition between content state C𝐶Citalic_C and discontent state D𝐷Ditalic_D is given by:

  • If k=k¯𝑘¯𝑘k=\bar{k}italic_k = over¯ start_ARG italic_k end_ARG, un>0subscript𝑢𝑛0u_{n}>0italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT > 0, and STn=C𝑆subscript𝑇𝑛𝐶ST_{n}=Citalic_S italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_C, then a content player keeps state C𝐶Citalic_C with a probability of 1:

    (k¯,C)(k¯,C).¯𝑘𝐶¯𝑘𝐶(\bar{k},C)\rightarrow(\bar{k},C).( over¯ start_ARG italic_k end_ARG , italic_C ) → ( over¯ start_ARG italic_k end_ARG , italic_C ) . (23)
  • If kk¯𝑘¯𝑘k\neq\bar{k}italic_k ≠ over¯ start_ARG italic_k end_ARG or un=0subscript𝑢𝑛0u_{n}=0italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 0 or STn=D𝑆subscript𝑇𝑛𝐷ST_{n}=Ditalic_S italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_D, then transitions of baselines and states are given by

    (k¯,C/D)={(k,C),unun,maxεun,maxun;(k,D),1unun,maxεun,maxun.¯𝑘𝐶𝐷cases𝑘𝐶subscript𝑢𝑛subscript𝑢𝑛superscript𝜀subscript𝑢𝑛subscript𝑢𝑛𝑘𝐷1subscript𝑢𝑛subscript𝑢𝑛superscript𝜀subscript𝑢𝑛subscript𝑢𝑛\begin{split}\left(\bar{k},C/D\right)=\left\{\begin{array}[]{ll}\left(k,C% \right),&\frac{u_{n}}{u_{n,\max}}\varepsilon^{u_{n,\max}-u_{n}};\\ \left(k,D\right),&1-\frac{u_{n}}{u_{n,\max}}\varepsilon^{u_{n,\max}-u_{n}}.% \end{array}\right.\end{split}start_ROW start_CELL ( over¯ start_ARG italic_k end_ARG , italic_C / italic_D ) = { start_ARRAY start_ROW start_CELL ( italic_k , italic_C ) , end_CELL start_CELL divide start_ARG italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_ARG start_ARG italic_u start_POSTSUBSCRIPT italic_n , roman_max end_POSTSUBSCRIPT end_ARG italic_ε start_POSTSUPERSCRIPT italic_u start_POSTSUBSCRIPT italic_n , roman_max end_POSTSUBSCRIPT - italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ; end_CELL end_ROW start_ROW start_CELL ( italic_k , italic_D ) , end_CELL start_CELL 1 - divide start_ARG italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_ARG start_ARG italic_u start_POSTSUBSCRIPT italic_n , roman_max end_POSTSUBSCRIPT end_ARG italic_ε start_POSTSUPERSCRIPT italic_u start_POSTSUBSCRIPT italic_n , roman_max end_POSTSUBSCRIPT - italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUPERSCRIPT . end_CELL end_ROW end_ARRAY end_CELL end_ROW (24)

Eq. (24) indicates that, when a RIS records a collision or in a busy state (i.e., ηk=0subscript𝜂𝑘0\eta_{k}=0italic_η start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = 0), the player will transfer to the discontent state D𝐷Ditalic_D with a probability of 1111 as un=0subscript𝑢𝑛0u_{n}=0italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 0. On the other hand, when a RIS is optimal to the player, it will transfer to the content state C𝐶Citalic_C with a probability of 1111 as un=un,maxsubscript𝑢𝑛subscript𝑢𝑛u_{n}=u_{n,\max}italic_u start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_u start_POSTSUBSCRIPT italic_n , roman_max end_POSTSUBSCRIPT.

Assume that all players’ actions and states constitute a strategy profile 𝒂𝟏subscript𝒂1\boldsymbol{a_{1}}bold_italic_a start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT. Then, a strategy graph can be constructed at the end of this phase, where the vertex is the strategy profile, and an edge exists if the players can switch from one strategy to the other. Actually, this strategy graph forms a perturbed time-reversible Markov process over state space n=1N(𝒦n×(C,D))superscriptsubscriptproduct𝑛1𝑁subscript𝒦𝑛𝐶𝐷\prod_{n=1}^{N}\left(\mathcal{K}_{n}\times(C,D)\right)∏ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( caligraphic_K start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT × ( italic_C , italic_D ) ). As pointed out in [37, 23, 38], there exists an optimal strategy profile that players will visit many times than other strategy profiles. As a result, each player can agree on its optimal arm by recording the number of times that each arm has been selected, i.e.,

Fnz(k)t𝒢z𝕀(In,t=k,STn=C),k𝒦,formulae-sequencesubscriptsuperscript𝐹𝑧𝑛𝑘subscript𝑡subscript𝒢𝑧𝕀formulae-sequencesubscriptsuperscript𝐼𝑛𝑡𝑘𝑆subscript𝑇𝑛𝐶for-all𝑘𝒦F^{z}_{n}(k)\triangleq\sum_{t\in\mathcal{G}_{z}}\mathbb{I}\left(I^{\prime}_{n,% t}=k,ST_{n}=C\right),\ \forall k\in\mathcal{K},italic_F start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_k ) ≜ ∑ start_POSTSUBSCRIPT italic_t ∈ caligraphic_G start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_POSTSUBSCRIPT blackboard_I ( italic_I start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT = italic_k , italic_S italic_T start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_C ) , ∀ italic_k ∈ caligraphic_K , (25)

where Fnz(k)subscriptsuperscript𝐹𝑧𝑛𝑘F^{z}_{n}(k)italic_F start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_k ) is the number of times that the k𝑘kitalic_k-th RIS has been played by the n𝑛nitalic_n-th player at the z𝑧zitalic_z-th epoch under state C𝐶Citalic_C. The symbol 𝒢zsubscript𝒢𝑧\mathcal{G}_{z}caligraphic_G start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT represents the number of time slots in the z𝑧zitalic_z-th epoch and 𝕀()𝕀\mathbb{I}(\cdot)blackboard_I ( ⋅ ) is an indicator function. Finally, we can determine the best RIS by using the recent z/2+1𝑧21\lfloor z/2\rfloor+1⌊ italic_z / 2 ⌋ + 1 epochs’ Fnzsubscriptsuperscript𝐹𝑧𝑛F^{z}_{n}italic_F start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT, i.e.,

kn=argmaxk𝒦j=0z2Fnzj(k).superscriptsubscript𝑘𝑛argsubscript𝑘𝒦superscriptsubscript𝑗0𝑧2subscriptsuperscript𝐹𝑧𝑗𝑛𝑘k_{n}^{\ast}=\text{arg}\max\limits_{k\in\mathcal{K}}\ \sum_{j=0}^{\lfloor\frac% {z}{2}\rfloor}F^{z-j}_{n}(k).italic_k start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = arg roman_max start_POSTSUBSCRIPT italic_k ∈ caligraphic_K end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⌊ divide start_ARG italic_z end_ARG start_ARG 2 end_ARG ⌋ end_POSTSUPERSCRIPT italic_F start_POSTSUPERSCRIPT italic_z - italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_k ) . (26)

In addition, Fnzsubscriptsuperscript𝐹𝑧𝑛F^{z}_{n}italic_F start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT can be used to design a criterion to balance the EE dilemma in the first phase by adjusting the parameter ϵitalic-ϵ\epsilonitalic_ϵ. The reason is that it is unnecessary to uniformly explore all RISs when the non-cooperation game phase asymptotically approaches the optimal RIS. This asymptotical behavior can be quantified by the distance between two adjacent vectors of Fnzsubscriptsuperscript𝐹𝑧𝑛F^{z}_{n}italic_F start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT and Fnz1subscriptsuperscript𝐹𝑧1𝑛F^{z-1}_{n}italic_F start_POSTSUPERSCRIPT italic_z - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT, which can be measured by the WD777Wasserstein distance, also known as earth mover’s distance, is a measure to calculate the distance between two probability distributions on a metric space. In the simulation, we compute it using the corresponding function in Matlab.. As a result, the distance between the probability mass functions (PMFs)888The PMF is computed by (Fnz(i))=Fnz(i)/iFnz(i),i𝒦formulae-sequencesubscriptsuperscript𝐹𝑧𝑛𝑖subscriptsuperscript𝐹𝑧𝑛𝑖subscript𝑖subscriptsuperscript𝐹𝑧𝑛𝑖for-all𝑖𝒦\mathbb{P}(F^{z}_{n}(i))=F^{z}_{n}(i)/\sum_{i}F^{z}_{n}(i),\ \forall i\in% \mathcal{K}blackboard_P ( italic_F start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_i ) ) = italic_F start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_i ) / ∑ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_F start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_i ) , ∀ italic_i ∈ caligraphic_K. of Fnzsubscriptsuperscript𝐹𝑧𝑛F^{z}_{n}italic_F start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT and Fnz1subscriptsuperscript𝐹𝑧1𝑛F^{z-1}_{n}italic_F start_POSTSUPERSCRIPT italic_z - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT is regarded as a criterion to adjust the parameter ϵitalic-ϵ\epsilonitalic_ϵ. Thus, we have the following lemma.

Lemma 1

For the n𝑛nitalic_n-th player, given z>1𝑧1z>1italic_z > 1, the parameter of the ϵitalic-ϵ\epsilonitalic_ϵ-greedy algorithm in the first phase can be chosen according to

ϵmin{1,DWD((Fnz)||(Fnz1))},\epsilon\triangleq\min\{1,\text{D}_{\small{\text{WD}}}\left(\mathbb{P}(F^{z}_{% n})\ ||\ \mathbb{P}(F^{z-1}_{n})\right)\},italic_ϵ ≜ roman_min { 1 , D start_POSTSUBSCRIPT WD end_POSTSUBSCRIPT ( blackboard_P ( italic_F start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) | | blackboard_P ( italic_F start_POSTSUPERSCRIPT italic_z - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) ) } ,

where DWD(||)\text{D}_{\text{WD}}\left(\cdot\ ||\ \cdot\right)D start_POSTSUBSCRIPT WD end_POSTSUBSCRIPT ( ⋅ | | ⋅ ) is the calculation of the WD in [39]. Terms (Fnz)subscriptsuperscript𝐹𝑧𝑛\mathbb{P}(F^{z}_{n})blackboard_P ( italic_F start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) and (Fnz1)subscriptsuperscript𝐹𝑧1𝑛\mathbb{P}(F^{z-1}_{n})blackboard_P ( italic_F start_POSTSUPERSCRIPT italic_z - 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) are the PMFs of (25) at epochs z𝑧zitalic_z and z1𝑧1z-1italic_z - 1 of the second phase, respectively.

Thompson Sampling EE Phase: The last phase has a length of ν32zsubscript𝜈3superscript2𝑧\nu_{3}2^{z}italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT rounds where ν3>0subscript𝜈30\nu_{3}>0italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT > 0 is a constant. The objective is to find the best SF for each player based on the busy or idle state of the determined RIS in the second phase. Note that there are two types of resource allocations in this phase for transmission Pattern I and Pattern II. The main difference between the two allocations is that, in Pattern I, it requires jointly estimate θk,cmnsubscriptsuperscript𝜃𝑛𝑘subscript𝑐𝑚\theta^{n}_{k,c_{m}}italic_θ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT and θcmnsubscriptsuperscript𝜃𝑛subscript𝑐𝑚\theta^{n}_{c_{m}}italic_θ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT at each epoch; while in Pattern II, it only needs to estimate the θcmnsubscriptsuperscript𝜃𝑛subscript𝑐𝑚\theta^{n}_{c_{m}}italic_θ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT at each time slot.

As mentioned before, the second-stage MPMAB problem can be regarded as a single-player MAB problem. Therefore, we can adopt the TS algorithm [29] to solve the second-stage MPMAB problem to track the Bernoulli distribution rewards (i.e., transmission success or failure). The TS algorithm first maintains a Beta prior distribution999Beta(α,β)𝛼𝛽(\alpha,\beta)( italic_α , italic_β ) is the beta distribution with probability density function (pdf): fα,β(y)=yα1(1y)β1B(α,β),y[0,1],whereB(α,β)=Γ(α)Γ(β)Γ(α+β).formulae-sequencesubscript𝑓𝛼𝛽𝑦superscript𝑦𝛼1superscript1𝑦𝛽1superscript𝐵𝛼𝛽formulae-sequence𝑦01wheresuperscript𝐵𝛼𝛽Γ𝛼Γ𝛽Γ𝛼𝛽f_{\alpha,\beta}(y)=\frac{y^{\alpha-1}(1-y)^{\beta-1}}{B^{\prime}(\alpha,\beta% )},y\in[0,1],\ \mathrm{where}\ B^{\prime}(\alpha,\beta)=\frac{\Gamma(\alpha)% \Gamma(\beta)}{\Gamma(\alpha+\beta)}.italic_f start_POSTSUBSCRIPT italic_α , italic_β end_POSTSUBSCRIPT ( italic_y ) = divide start_ARG italic_y start_POSTSUPERSCRIPT italic_α - 1 end_POSTSUPERSCRIPT ( 1 - italic_y ) start_POSTSUPERSCRIPT italic_β - 1 end_POSTSUPERSCRIPT end_ARG start_ARG italic_B start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_α , italic_β ) end_ARG , italic_y ∈ [ 0 , 1 ] , roman_where italic_B start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ( italic_α , italic_β ) = divide start_ARG roman_Γ ( italic_α ) roman_Γ ( italic_β ) end_ARG start_ARG roman_Γ ( italic_α + italic_β ) end_ARG . for each SF. Thus, the objective of this phase is equivalent to estimating the parameter in the Beta distribution, which will converge to the true value of θk,cmnsubscriptsuperscript𝜃𝑛𝑘subscript𝑐𝑚\theta^{n}_{k,c_{m}}italic_θ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT or θcmnsubscriptsuperscript𝜃𝑛subscript𝑐𝑚\theta^{n}_{c_{m}}italic_θ start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT. Based on the transmission feedback, TS algorithm is able to update the posterior distribution by: α=α+1𝛼𝛼1\alpha=\alpha+1italic_α = italic_α + 1 if transmission is successful, otherwise β=β+1𝛽𝛽1\beta=\beta+1italic_β = italic_β + 1. Notice that the value function (i.e., cmθ^n,msubscript𝑐𝑚subscript^𝜃𝑛𝑚c_{m}\hat{\theta}_{n,m}italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT) is the current data rate, instead of the successful or failed transmission feedback. At the end of this phase, it can determine the current best SF for each IoT device by using the estimated parameters in the Beta distribution, i.e.,

cn=argmaxmcmαn,mαn,m+βn,m.superscriptsubscript𝑐𝑛subscript𝑚subscript𝑐𝑚subscript𝛼𝑛𝑚subscript𝛼𝑛𝑚subscript𝛽𝑛𝑚c_{n}^{\ast}=\arg\max\limits_{m\in\mathcal{M}}\ \frac{c_{m}\alpha_{n,m}}{% \alpha_{n,m}+\beta_{n,m}}.italic_c start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = roman_arg roman_max start_POSTSUBSCRIPT italic_m ∈ caligraphic_M end_POSTSUBSCRIPT divide start_ARG italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT italic_α start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT end_ARG start_ARG italic_α start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT + italic_β start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT end_ARG . (27)

V-B Complexity and Feasibility Analysis

We first give a brief discussion on the computational complexity of the proposed algorithm. In Algorithm 1, the computational complexity of the first phase is 𝒪(ν1zδLED)𝒪subscript𝜈1superscript𝑧𝛿subscript𝐿ED\mathcal{O}(\nu_{1}z^{\delta}L_{\mathrm{ED}})caligraphic_O ( italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_z start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT italic_L start_POSTSUBSCRIPT roman_ED end_POSTSUBSCRIPT ), where LEDsubscript𝐿EDL_{\mathrm{ED}}italic_L start_POSTSUBSCRIPT roman_ED end_POSTSUBSCRIPT is the length of samples in the energy detector of the spectrum sensing operation. Meanwhile, the computational complexity of the second phase is 𝒪(ν2zδ+zKlog2K)𝒪subscript𝜈2superscript𝑧𝛿𝑧𝐾subscript2𝐾\mathcal{O}(\nu_{2}z^{\delta}+zK\log_{2}K)caligraphic_O ( italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_z start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT + italic_z italic_K roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_K ), where the second term comes from the WD in the calculation of parameter ϵitalic-ϵ\epsilonitalic_ϵ [27]. In addition, the computational complexity of the third phase is 𝒪(ν32zMlog2M)𝒪subscript𝜈3superscript2𝑧𝑀subscript2𝑀\mathcal{O}(\nu_{3}2^{z}M\log_{2}M)caligraphic_O ( italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT italic_M roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_M ), where the complexity comes from the ‘argument maximum’ operation in the TS algorithm [29]. Therefore, the total computational complexity is 𝒪(ν1zδLED+ν2zδ+zKlog2K+ν32zMlog2M)𝒪subscript𝜈1superscript𝑧𝛿subscript𝐿EDsubscript𝜈2superscript𝑧𝛿𝑧𝐾subscript2𝐾subscript𝜈3superscript2𝑧𝑀subscript2𝑀\mathcal{O}(\nu_{1}z^{\delta}L_{\mathrm{ED}}+\nu_{2}z^{\delta}+zK\log_{2}K+\nu% _{3}2^{z}M\log_{2}M)caligraphic_O ( italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_z start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT italic_L start_POSTSUBSCRIPT roman_ED end_POSTSUBSCRIPT + italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_z start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT + italic_z italic_K roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_K + italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT italic_M roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_M ), which increases linearly logarithmically with the number of RISs K𝐾Kitalic_K and SFs M𝑀Mitalic_M. As the time epoch z𝑧zitalic_z increases, the complexity of the third phase will become the dominant factor in the total complexity. Therefore, the total complexity of Algorithm 1 is about 𝒪(ν32zMlog2M)𝒪subscript𝜈3superscript2𝑧𝑀subscript2𝑀\mathcal{O}(\nu_{3}2^{z}M\log_{2}M)caligraphic_O ( italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT italic_M roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_M ).

Next, we discuss the feasibility of the proposed algorithm in practical applications, e.g., B5G/6G networks. First, the proposed algorithm performs in real time and automatically converges to the optimal solution (i.e., the online learning feature). Second, its distributed feature can reduce the communication overhead and make it easy to apply to other network scenarios. Third, the complexity of the proposed algorithm increases linearly logarithmically with the number of RISs K𝐾Kitalic_K and the SFs M𝑀Mitalic_M. These features demonstrate that the proposed algorithm has great potential for application in B5G/6G networks with different requirements for rate, delay, scalability, and reliability.

In Algorithm 1, we only consider the case that the number of IoT devices is less than the RISs, i.e., KN𝐾𝑁K\geq Nitalic_K ≥ italic_N. However, the proposed algorithm can also handle the case of K<N𝐾𝑁K<Nitalic_K < italic_N by dividing N𝑁Nitalic_N IoT devices into K𝐾Kitalic_K clusters using the k𝑘kitalic_k-means clustering method [40] according to their geographic locations. We assume that the IoT devices in the same cluster prefer the same RIS and communicate with the BS using the round-robin method. Therefore, one of the clustering IoT devices can transmit on the RIS; while others directly transmit data to the BS at each time slot. In other words, Algorithm 1 still works in the case of K<N𝐾𝑁K<Nitalic_K < italic_N by allocating the optimal RIS to each cluster rather than each IoT device.

Algorithm 2 Modified E2Boost Algorithm Run by Player n𝑛nitalic_n for the case K<N𝐾𝑁K<Nitalic_K < italic_N
1:Initialize: Parameters in Algorithm 1
2:for each time slot t=1,2,𝑡12t=1,2,\cdotsitalic_t = 1 , 2 , ⋯, T𝑇Titalic_T do
3:     Check the round-robin flag
4:     If the flag is equal to 1111, run the E2Boost algorithm in Algorithm 1
5:     Otherwise, run the TS algorithm in the third phase of Algorithm 1
6:end for

Therefore, we give a modified E2Boost algorithm to handle the case of K<N𝐾𝑁K<Nitalic_K < italic_N, as shown in Algorithm 2. At the beginning of each time slot, the IoT device n𝑛nitalic_n first checks its round-robin flag to determine its transmission patterns: If the flag is equal to 1111, the IoT device runs the E2Boost algorithm in Algorithm 1 to find the optimal RIS and SF; otherwise, it runs the TS algorithm in the third phase of Algorithm 1 to find the optimal SF.

V-C An Upper Pseudo-Regret Bound

We derive an upper pseudo-regret bound for the E2Boost algorithm. Since each IoT device has two transmission patterns, the pseudo-regret also consists of the RIS-enabled regret and the non-RIS-enabled regret parts.

For the RIS-enabled regret part, the performance analysis of the RIS-enabled regret is mainly built on [23] since the E2Boost algorithm shares the same architecture of the GoT algorithm. On the other hand, the Bernoulli reward processing in this work strictly meets the key condition of Definition 1 in [23]. However, compared with the GoT algorithm, the proposed algorithm has the following features. First, it is a two-stage MPMAB framework that has a small arm space (i.e., 𝒦𝒦\mathcal{K}caligraphic_K) to explore. Its total pseudo-regret only depends on the number of RISs, instead of the whole arm space (i.e., 𝒦tensor-product𝒦\mathcal{K}\otimes\mathcal{M}caligraphic_K ⊗ caligraphic_M) as that in the GoT algorithm. Second, we embed the ϵitalic-ϵ\epsilonitalic_ϵ-greedy algorithm into the first phase to further trade off the EE dilemma. Thus, the accumulated regrets of this phase will trend to 00 when all players agree on their optimal RISs. Third, we incorporate the TS algorithm into the third phase to determine the best SF. Similarly, only a few accumulated regrets will be accrued in this phase when the optimal RIS is determined. Therefore, these features enable us to achieve a tighter pseudo-regret bound than the GoT algorithm.

For the non-RIS-assisted regret part, the E2Boost algorithm only has the third phase, i.e., the second-stage MPMAB problem. Since two or more players that select the same arm (or SF) will not collide, this MPMAB problem is reviewed as a single stochastic MAB problem. In Algorithm 1, we use the modified TS (MTS) algorithm [36] to solve this single stochastic MAB problem. Therefore, we adopt the theoretical results in [36] to derive the non-RIS-enabled regret.

To conclude, we have the following theorem.

Theorem 1

Let Γmax=maxn,iμn,isubscriptΓsubscript𝑛𝑖subscript𝜇𝑛𝑖\Gamma_{\max}=\max_{n,i}\mu_{n,i}roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT = roman_max start_POSTSUBSCRIPT italic_n , italic_i end_POSTSUBSCRIPT italic_μ start_POSTSUBSCRIPT italic_n , italic_i end_POSTSUBSCRIPT be the maximum real expected rewards among all players’ arms. For any hybrid uplink network, given ν1>0,ν2>0,ν3>0formulae-sequencesubscript𝜈10formulae-sequencesubscript𝜈20subscript𝜈30\nu_{1}>0,\nu_{2}>0,\nu_{3}>0italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT > 0 , italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT > 0 , italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT > 0, δ0𝛿0\delta\geq 0italic_δ ≥ 0, 0<ϖ<10italic-ϖ10<\varpi<10 < italic_ϖ < 1 and a small enough ε𝜀\varepsilonitalic_ε, the total upper pseudo-regret bound obtained by the E2Boost algorithm is

Reg¯NΓmax(1Pa)(2(ν1+ν2)log21+δ(Tν3+2)+(6NK+1)ν3log2(Tν3+2))+Pa(1+ϖ)n=1Nanlog2TDKL(an,an)Δn,an,¯𝑅𝑒𝑔𝑁subscriptΓ1subscript𝑃𝑎2subscript𝜈1subscript𝜈2subscriptsuperscript1𝛿2𝑇subscript𝜈326𝑁𝐾1subscript𝜈3subscript2𝑇subscript𝜈32subscript𝑃𝑎1italic-ϖsuperscriptsubscript𝑛1𝑁subscriptsubscript𝑎𝑛subscript2𝑇subscriptDKLsubscript𝑎𝑛superscriptsubscript𝑎𝑛subscriptΔ𝑛subscript𝑎𝑛\begin{split}\overline{Reg}\leq&N\Gamma_{\max}(1-P_{a})\left(2(\nu_{1}+\nu_{2}% )\log^{1+\delta}_{2}\left(\frac{T}{\nu_{3}}+2\right)\right.\\ &\left.+(6NK+1)\nu_{3}\log_{2}\left(\frac{T}{\nu_{3}}+2\right)\right)\\ &+P_{a}(1+\varpi)\sum_{n=1}^{N}\sum_{a_{n}\in\mathcal{M}}\frac{\log_{2}T}{% \mathrm{D}_{\mathrm{KL}}(a_{n},a_{n}^{\ast})}\Delta_{n,a_{n}},\end{split}start_ROW start_CELL over¯ start_ARG italic_R italic_e italic_g end_ARG ≤ end_CELL start_CELL italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( 1 - italic_P start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ) ( 2 ( italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) roman_log start_POSTSUPERSCRIPT 1 + italic_δ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( divide start_ARG italic_T end_ARG start_ARG italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG + 2 ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + ( 6 italic_N italic_K + 1 ) italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( divide start_ARG italic_T end_ARG start_ARG italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG + 2 ) ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + italic_P start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ( 1 + italic_ϖ ) ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ caligraphic_M end_POSTSUBSCRIPT divide start_ARG roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_T end_ARG start_ARG roman_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) end_ARG roman_Δ start_POSTSUBSCRIPT italic_n , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT , end_CELL end_ROW (28)

where log21+δ(T/ν3+2)superscriptsubscript21𝛿𝑇subscript𝜈32\log_{2}^{1+\delta}\left(T/\nu_{3}+2\right)roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 + italic_δ end_POSTSUPERSCRIPT ( italic_T / italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT + 2 ) denotes log2(T/ν3+2)subscript2𝑇subscript𝜈32\log_{2}\left(T/\nu_{3}+2\right)roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_T / italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT + 2 ) to the power of (1+δ)1𝛿(1+\delta)( 1 + italic_δ ) and DKL()subscriptDKL\mathrm{D}_{\mathrm{KL}}(\cdot)roman_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( ⋅ ) is the Kukkback-Leibler divergence. Term Pasubscript𝑃𝑎P_{a}italic_P start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT is the active probability of the UE.

Proof 1

See Appendix A.

Remark 1

The first two terms of the upper pseudo-regret bound account for the RIS-assisted regret part; while the third term is the non-RIS-assisted regret part. We can see that the weights of these two parts rely on the active probability of the UE.

Remark 2

The total upper pseudo-regret bound increases logarithmically with T𝑇Titalic_T, i.e., Reg¯=𝒪(log21+δT)¯𝑅𝑒𝑔𝒪subscriptsuperscript1𝛿2𝑇\overline{Reg}=\mathcal{O}(\log^{1+\delta}_{2}T)over¯ start_ARG italic_R italic_e italic_g end_ARG = caligraphic_O ( roman_log start_POSTSUPERSCRIPT 1 + italic_δ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_T ), indicating that Algorithm 1 will converge and the per-round regret approaches zero when T𝑇Titalic_T is sufficiently large.

Remark 3

The total upper pseudo-regret bound in the E2Boost algorithm is much tighter than the GoT algorithm. According to [23], the total upper pseudo-regret bound of the GoT algorithm is

Reg¯GoT4NΓmax(ν1+ν2)log21+δ(Tν3+2)+NΓmax(6NKM+1)ν3log2(Tν3+2)=𝒪(log21+δT).subscript¯𝑅𝑒𝑔GoT4𝑁subscriptΓsubscript𝜈1subscript𝜈2subscriptsuperscript1𝛿2𝑇subscript𝜈32𝑁subscriptΓ6𝑁𝐾𝑀1subscript𝜈3subscript2𝑇subscript𝜈32𝒪subscriptsuperscript1𝛿2𝑇\begin{split}\overline{Reg}_{\mathrm{GoT}}\leq&4N\Gamma_{\max}(\nu_{1}+\nu_{2}% )\log^{1+\delta}_{2}\left(\frac{T}{\nu_{3}}+2\right)\\ &+N\Gamma_{\max}(6NKM+1)\nu_{3}\log_{2}\left(\frac{T}{\nu_{3}}+2\right)\\ &=\mathcal{O}(\log^{1+\delta}_{2}T).\end{split}start_ROW start_CELL over¯ start_ARG italic_R italic_e italic_g end_ARG start_POSTSUBSCRIPT roman_GoT end_POSTSUBSCRIPT ≤ end_CELL start_CELL 4 italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) roman_log start_POSTSUPERSCRIPT 1 + italic_δ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( divide start_ARG italic_T end_ARG start_ARG italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG + 2 ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( 6 italic_N italic_K italic_M + 1 ) italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( divide start_ARG italic_T end_ARG start_ARG italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG + 2 ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = caligraphic_O ( roman_log start_POSTSUPERSCRIPT 1 + italic_δ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_T ) . end_CELL end_ROW (29)

For example, when ν1=ν2=ν3subscript𝜈1subscript𝜈2subscript𝜈3\nu_{1}=\nu_{2}=\nu_{3}italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT, δ=0𝛿0\delta=0italic_δ = 0, and Pa=0subscript𝑃𝑎0P_{a}=0italic_P start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT = 0, we have

Reg¯(5+6NK)ν1NΓmaxlog2(Tν3+2),¯𝑅𝑒𝑔56𝑁𝐾subscript𝜈1𝑁subscriptΓsubscript2𝑇subscript𝜈32\begin{split}\overline{Reg}\leq\left(5+6NK\right)\nu_{1}N\Gamma_{\max}\log_{2}% \left(\frac{T}{\nu_{3}}+2\right),\end{split}start_ROW start_CELL over¯ start_ARG italic_R italic_e italic_g end_ARG ≤ ( 5 + 6 italic_N italic_K ) italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( divide start_ARG italic_T end_ARG start_ARG italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG + 2 ) , end_CELL end_ROW (30)

and

Reg¯GoT(9+6NKM)ν1NΓmaxlog2(Tν3+2).subscript¯𝑅𝑒𝑔GoT96𝑁𝐾𝑀subscript𝜈1𝑁subscriptΓsubscript2𝑇subscript𝜈32\begin{split}\overline{Reg}_{\mathrm{GoT}}\leq\left(9+6NKM\right)\nu_{1}N% \Gamma_{\max}\log_{2}\left(\frac{T}{\nu_{3}}+2\right).\end{split}start_ROW start_CELL over¯ start_ARG italic_R italic_e italic_g end_ARG start_POSTSUBSCRIPT roman_GoT end_POSTSUBSCRIPT ≤ ( 9 + 6 italic_N italic_K italic_M ) italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( divide start_ARG italic_T end_ARG start_ARG italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG + 2 ) . end_CELL end_ROW (31)

It can be seen that Reg¯¯𝑅𝑒𝑔\overline{Reg}over¯ start_ARG italic_R italic_e italic_g end_ARG is about M𝑀Mitalic_M times lower than Reg¯GoTsubscript¯𝑅𝑒𝑔GoT\overline{Reg}_{\mathrm{GoT}}over¯ start_ARG italic_R italic_e italic_g end_ARG start_POSTSUBSCRIPT roman_GoT end_POSTSUBSCRIPT. This observation can be verified by the numerical results in the following section.

VI Simulation Results

We conduct extensive simulations to evaluate the performance of the proposed algorithms. The simulation parameters are chosen according to the 3GPP standard [32] and refs. [4, 5]. All results are obtained from 103superscript10310^{3}10 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT Monte Carlo (MC) trials.

VI-A Parameter Configuration and Baseline Algorithms

Parameter Configuration: The transmit power at each IoT device is Ωn=20subscriptΩ𝑛20\Omega_{n}=20roman_Ω start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 20 dBm, n𝒩for-all𝑛𝒩\forall n\in\mathcal{N}∀ italic_n ∈ caligraphic_N. The background noise plus interference power is 9595-95- 95 dBm, and the wavelength λ𝜆\lambdaitalic_λ is set according to the central carrier frequency 5.95.95.95.9 GHz. The bandwidth B𝐵Bitalic_B is 40404040 MHz. The Rician factor is ζ=4𝜁4\zeta=4italic_ζ = 4, and the antenna gain G𝐺Gitalic_G is set to 1111. Each IoT device has 6666 SFs to choose from, as shown in TABLE I. The data rates are determined by (9) and the thresholds are the minimum required SINR to demodulate the received signal. Assume that the active probability of each RIS (i.e., occupied by the legal UEs) is Pak=0.2subscriptsuperscript𝑃𝑘𝑎0.2P^{k}_{a}=0.2italic_P start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT = 0.2. In addition, we adopt the UMa model [32] to describe the path loss of both LoS and NLoS components.

TABLE I: The transmission parameters for the C-IoT device
SF 7777 8888 9999 10101010 11111111 12121212
Data Rate (Mbps) 1.091.091.091.09 0.630.630.630.63 0.350.350.350.35 0.200.200.200.20 0.110.110.110.11 0.060.060.060.06
Threshold (×103absentsuperscript103\times 10^{3}× 10 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT) 4.54.54.54.5 4444 3.53.53.53.5 3333 2.52.52.52.5 2222

The RIS is placed perpendicular to the ground, and the number of elements is 101×101101101101\times 101101 × 101. The direction of the RIS in the XY𝑋𝑌XYitalic_X italic_Y-plane is shown in Fig. 3. The angle φ𝜑\angle\varphi∠ italic_φ and all elements’ locations in RIS are determined according to Appendix B. Each element contains b=8𝑏8b=8italic_b = 8 PIN diodes with the refection amplitude A=1𝐴1A=1italic_A = 1. We consider two types of phase shift settings, i.e., the optimal phase shift and the constant phase shift. For the optimal phase shift setting, each RIS’s phase shifts are set to be optimal to the UEs (see Proposition 2 of [4]), i.e.,

τl1,l2=(Π2πλLl1,l2k)2b2π2π2b,subscript𝜏subscript𝑙1subscript𝑙2Π2𝜋𝜆subscriptsuperscript𝐿𝑘subscript𝑙1subscript𝑙2superscript2𝑏2𝜋2𝜋superscript2𝑏\begin{split}\tau_{l_{1},l_{2}}=\left\lfloor\left(\Pi-\frac{2\pi}{\lambda}L^{k% }_{l_{1},l_{2}}\right)\frac{2^{b}}{2\pi}\right\rfloor\frac{2\pi}{2^{b}},\end{split}start_ROW start_CELL italic_τ start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT = ⌊ ( roman_Π - divide start_ARG 2 italic_π end_ARG start_ARG italic_λ end_ARG italic_L start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) divide start_ARG 2 start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT end_ARG start_ARG 2 italic_π end_ARG ⌋ divide start_ARG 2 italic_π end_ARG start_ARG 2 start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT end_ARG , end_CELL end_ROW (32)

where ΠΠ\Piroman_Π is an arbitrary constant and Ll1,l2ksubscriptsuperscript𝐿𝑘subscript𝑙1subscript𝑙2L^{k}_{l_{1},l_{2}}italic_L start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT is the distance between the BS, and the UEs through the k𝑘kitalic_k-th RIS’s (l1,l2)subscript𝑙1subscript𝑙2(l_{1},l_{2})( italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) element. For the constant phase shift setting, we assume that the phase shifts on all RISs’ elements are equal. That is, all integers ρl1,l2subscript𝜌subscript𝑙1subscript𝑙2\rho_{l_{1},l_{2}}italic_ρ start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT in (1) are simply set to a constant (we set to 170170170170 in the following simulations). Note that ρl1,l2subscript𝜌subscript𝑙1subscript𝑙2\rho_{l_{1},l_{2}}italic_ρ start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT can be an arbitrary integer in the range of [0,2b1]0superscript2𝑏1[0,2^{b}-1][ 0 , 2 start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT - 1 ].

Refer to caption
Figure 3: A schematic diagram of the UE-RIS-BS link (top view).

Baseline Algorithms: We compare the E2Boost algorithm with the optimal solution, GoT algorithm, Q-learning method, random selection method, E2Boost without TS algorithm, and E2Boost without WD algorithm. Next, we introduce these baseline algorithms in detail.

  • Optimal Solution: The optimal solution is obtained by solving the two-stage MPMAB problem in a centralized form. In the Pattern I, it allocates the optimal RIS and SF to each IoT device by using the Hungarian algorithm [15], where the only required information is θk,cmnsuperscriptsubscript𝜃𝑘subscript𝑐𝑚𝑛\theta_{k,c_{m}}^{n}italic_θ start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT and θcmnsuperscriptsubscript𝜃subscript𝑐𝑚𝑛\theta_{c_{m}}^{n}italic_θ start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT. In the simulation, we obtain this information by recording the received SINR γn,ksubscript𝛾𝑛𝑘\gamma_{n,k}italic_γ start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT and γnsubscript𝛾𝑛\gamma_{n}italic_γ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT with the above simulation parameters over 105superscript10510^{5}10 start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT MC trails. Then, θ^k,cmnsuperscriptsubscript^𝜃𝑘subscript𝑐𝑚𝑛\hat{\theta}_{k,c_{m}}^{n}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT and θ^cmnsuperscriptsubscript^𝜃subscript𝑐𝑚𝑛\hat{\theta}_{c_{m}}^{n}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT can be estimated by comparing these SINRs with a given threshold ΨmsubscriptΨ𝑚\Psi_{m}roman_Ψ start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT. Note that θ^k,cmnsuperscriptsubscript^𝜃𝑘subscript𝑐𝑚𝑛\hat{\theta}_{k,c_{m}}^{n}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT and θ^cmnsuperscriptsubscript^𝜃subscript𝑐𝑚𝑛\hat{\theta}_{c_{m}}^{n}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT can approach the true values of θk,cmnsuperscriptsubscript𝜃𝑘subscript𝑐𝑚𝑛\theta_{k,c_{m}}^{n}italic_θ start_POSTSUBSCRIPT italic_k , italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT and θcmnsuperscriptsubscript𝜃subscript𝑐𝑚𝑛\theta_{c_{m}}^{n}italic_θ start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT arbitrarily as long as the number of MC trials is sufficiently large. Based on this information and the data rate cmsubscript𝑐𝑚c_{m}italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT in Table I, the optimal RIS and SF of each IoT device can be obtained by using the Hungarian algorithm (i.e., the munkres function in Matlab). In the Pattern II, we determine the optimal SF for each IoT device using the genie-aided solution (i.e., from God’s perspective) as the θ^cmnsuperscriptsubscript^𝜃subscript𝑐𝑚𝑛\hat{\theta}_{c_{m}}^{n}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT and cmsubscript𝑐𝑚c_{m}italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT are known.

  • GoT Algorithm: The GoT algorithm in [23] is a fully distributed algorithm to solve the decentralized resource allocation problems. It has the same architecture as the proposed algorithm. However, it lacks the ϵitalic-ϵ\epsilonitalic_ϵ-greedy algorithm and the TS algorithm in the first and third phases to further balance the EE dilemma. In addition, it needs to explore the combinations of RISs and SFs; while the proposed algorithm explores the RISs and SFs separately.

  • Q-learning method: For the Q-learning method, the state is the target RIS’s busy or idle state. The state transition probability is the RIS’s active or passive probability Paksubscriptsuperscript𝑃𝑘𝑎P^{k}_{a}italic_P start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT or 1Pak1subscriptsuperscript𝑃𝑘𝑎1-P^{k}_{a}1 - italic_P start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT. The actions are the set of SFs \mathcal{M}caligraphic_M if the target RIS is in a busy state; otherwise, the actions are the combinations of SFs and RISs, i.e., 𝒦tensor-product𝒦\mathcal{K}\otimes\mathcal{M}caligraphic_K ⊗ caligraphic_M.

  • Random Selection: For the random selection method, each IoT device uniformly chooses an arm from the arm space 𝒦tensor-product𝒦\mathcal{K}\otimes\mathcal{M}caligraphic_K ⊗ caligraphic_M in Pattern I or the arm space \mathcal{M}caligraphic_M in Pattern II at each time slot. There is no EE mechanism inside.

  • E2Boost without TS: Compared with the E2Boost algorithm, it removes the TS algorithm from the third phase. Moreover, it requires exploring the combinations of the RISs and SFs in the first phase with the ϵitalic-ϵ\epsilonitalic_ϵ-greedy algorithm.

  • E2Boost without WD: Compared with the E2Boost algorithm, the only difference is that it maintains a constant exploration rate ϵitalic-ϵ\epsilonitalic_ϵ for the ϵitalic-ϵ\epsilonitalic_ϵ-greedy algorithm, rather than adaptively adjusting ϵitalic-ϵ\epsilonitalic_ϵ in the E2Boost algorithm.

It is worth noting that there is a performance gap between the solutions of the two-stage MPMAB problem and the original problem (11). Specifically, the solution of problem (11) is to allocate the optimal available RIS and SF to each IoT device at each time slot t𝑡titalic_t; while the solution (i.e., the optimal solution) of the two-stage MPMAB problem is to assign the optimal RIS and SF to each IoT device average over the time horizon T𝑇Titalic_T. As a result, the performance of the two-stage MPMAB problem is slightly poorer than that of problem (11). However, Theorem 1 shows that the proposed algorithm can converge to the optimal solution when T𝑇Titalic_T is sufficiently large. The following simulation results will also verify this.

VI-B Fixed Network Scenario

We first consider a fixed network scenario in a 200200200200 mm\mathrm{m}roman_m ×\times× 200200200200 mm\mathrm{m}roman_m square area, as shown in Fig. 4, where N=3𝑁3N=3italic_N = 3 cellular IoT devices are located in a 45454545 mm\mathrm{m}roman_m ×\times× 45454545 mm\mathrm{m}roman_m circle area. For simplicity, we assume that all UEs are centered in the point (x,y)=(150,150)𝑥𝑦150150(x,y)=(150,150)( italic_x , italic_y ) = ( 150 , 150 ) mm\mathrm{m}roman_m. Outside this circle are the BS and K=3𝐾3K=3italic_K = 3 RISs. The distances between the BS and the center of the RIS, as well as the IoT device and the center of the RIS, are calculated by the Euclidean distances w.r.t. their locations (i.e., Dli,ljsubscript𝐷subscript𝑙𝑖subscript𝑙𝑗D_{l_{i},l_{j}}italic_D start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT and dli,ljsubscript𝑑subscript𝑙𝑖subscript𝑙𝑗d_{l_{i},l_{j}}italic_d start_POSTSUBSCRIPT italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT in Fig. 3), respectively. The RIS and the BS heights are 10101010 m and 20202020 m, respectively.

Refer to caption
Figure 4: A fixed network scenario in a 200×200200200200\times 200200 × 200 m square area with K=3,N=3formulae-sequence𝐾3𝑁3K=3,N=3italic_K = 3 , italic_N = 3 (top view).

Fig. 5 shows the allocation results of the E2Boost algorithm for the optimal phase shift setting. The simulation parameters for the E2Boost algorithm are ν=1.4𝜈1.4\nu=1.4italic_ν = 1.4, δ=0𝛿0\delta=0italic_δ = 0, ε=0.01𝜀0.01\varepsilon=0.01italic_ε = 0.01, Z=10𝑍10Z=10italic_Z = 10, ν1=ν2=1000subscript𝜈1subscript𝜈21000\nu_{1}=\nu_{2}=1000italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 1000 and ν3=100subscript𝜈3100\nu_{3}=100italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT = 100. The average throughput is computed by 1tt=1Tμn,In,t1𝑡superscriptsubscript𝑡1𝑇subscript𝜇𝑛subscript𝐼𝑛𝑡\frac{1}{t}\sum_{t=1}^{T}\mu_{n,I_{n,t}}divide start_ARG 1 end_ARG start_ARG italic_t end_ARG ∑ start_POSTSUBSCRIPT italic_t = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_μ start_POSTSUBSCRIPT italic_n , italic_I start_POSTSUBSCRIPT italic_n , italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT. It can be seen from Fig. 5 (b-d) that each player will converge to its own optimal SF and RIS, i.e., player 1 \rightarrow (RIS3, SF1), player 2 \rightarrow (RIS1, SF1) and player 3 \rightarrow (RIS2, SF1). All players prefer SF1 with the highest data rate of 1.091.091.091.09 Mbps in Table I. The average throughput of all players is 2.38592.38592.38592.3859 Mbps, which is slightly less than the optimal solution’s 2.43152.43152.43152.4315 Mbps. In addition, IoT device 3 accounts for the lowest average throughput by 0.49710.49710.49710.4971 Mbps due to the long transmission distance between IoT 3-RIS2-BS links. The IoT device 3 does not choose RIS3 because the direction (or the phase shifts) of RIS2 is more suitable for IoT 3 than RIS3, i.e., RIS2-UEs-IoT3 in a line.

Similarly, Fig. 6 shows the allocation results of the E2Boost algorithm for the constant phase shift setting. We can see that the average throughput of all players is just about 0.57820.57820.57820.5782, which is much lower than the optimal phase shift setting. In addition, player 1 and player 2 disagree on the optimal RIS since there is a collision between them. The reason is that the time horizon of the second phase in Algorithm 1 is too short (i.e., ν2=1,000subscript𝜈21000\nu_{2}=1,000italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 1 , 000) to resolve this collision. As a result, the highest SF for Pattern II is chosen frequently, resulting in low average throughput. To conclude, Figs. 5 and 6 demonstrate that the channel gains between the IoT device and the BS not only rely on the path-loss gain but also depend on the settings of phase shifts and direction in the RIS.

Refer to caption
(a) Throughput
Refer to caption
(b) IoT device 1 (Player 1)
Refer to caption
(c) IoT device 2 (Player 2)
Refer to caption
(d) IoT device 3 (Player 3)
Figure 5: (a) Average throughput of three IoT devices, (b-d) The number of selected times at each arm for each player, by running the E2Boost algorithm with optimal phase shift setting in the fixed network scenario Fig. 4.
Refer to caption
(a) Throughput
Refer to caption
(b) IoT device 1 (Player 1)
Refer to caption
(c) IoT device 2 (Player 2)
Refer to caption
(d) IoT device 3 (Player 3)
Figure 6: (a) Average throughput of three IoT devices, (b-d) The number of selected times at each arm for each player, by running the E2Boost algorithm with constant phase shift setting in the fixed network scenario Fig. 4.

Fig. 7 depicts the total pseudo-regret of the E2Boost algorithm, the E2Boost algorithm without TS, and the GoT algorithm in the cases of ν1=ν2=1,000formulae-sequencesubscript𝜈1subscript𝜈21000\nu_{1}=\nu_{2}=1,000italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 1 , 000 and ν1=ν2=2,000formulae-sequencesubscript𝜈1subscript𝜈22000\nu_{1}=\nu_{2}=2,000italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 2 , 000, when Z=10𝑍10Z=10italic_Z = 10 under the optimal phase shift setting. Other parameters are the same as those in Fig. 5. We can see that the proposed algorithm has the lowest expected total pseudo-regret in both cases since it has a small arm space (i.e., due to the two-stage allocation mechanism) to explore. In addition, the total pseudo-regrets of all algorithms in the case of ν1=ν2=1,000formulae-sequencesubscript𝜈1subscript𝜈21000\nu_{1}=\nu_{2}=1,000italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 1 , 000 are lower than those in the ν1=ν2=2,000formulae-sequencesubscript𝜈1subscript𝜈22000\nu_{1}=\nu_{2}=2,000italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 2 , 000 case. The reason is that a larger value of ν1subscript𝜈1\nu_{1}italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and ν2subscript𝜈2\nu_{2}italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT indicates that a longer time is needed to explore all arms, leading to more performance loss. However, when ν1=ν2=1,000formulae-sequencesubscript𝜈1subscript𝜈21000\nu_{1}=\nu_{2}=1,000italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 1 , 000, the GoT algorithm and the E2Boost algorithm without TS will not converge since the value of ν2subscript𝜈2\nu_{2}italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is too small for the second phase to resolve the collisions among IoT devices. More importantly, Fig. 7 validates our theoretical analysis of Theorem 1, where the total pseudo-regret of the E2Boost algorithm increases logarithmically w.r.t. the time horizon T𝑇Titalic_T and is about four times better than the GoT algorithm.

Refer to caption
Figure 7: The total pseudo-regret via time slot in the cases of ν1=ν2=1,000formulae-sequencesubscript𝜈1subscript𝜈21000\nu_{1}=\nu_{2}=1,000italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 1 , 000 and ν1=ν2=2,000formulae-sequencesubscript𝜈1subscript𝜈22000\nu_{1}=\nu_{2}=2,000italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 2 , 000 with optimal phase shift setting in the fixed network scenario Fig. 4.

Fig. 8 compares the average total throughput of the E2Boost algorithm, the E2Boost algorithm with ϵ=0italic-ϵ0\epsilon=0italic_ϵ = 0 and ϵ=1italic-ϵ1\epsilon=1italic_ϵ = 1 (without WD), the E2Boost algorithm without TS, the GoT algorithm, and the random selection method in the optimal phase shift setting with ν1=ν2=2,000,Z=10formulae-sequencesubscript𝜈1subscript𝜈22000𝑍10\nu_{1}=\nu_{2}=2,000,Z=10italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 2 , 000 , italic_Z = 10. It can be seen that the proposed algorithm outperforms the other algorithms and is close to the optimal solution. By contrast, the proposed algorithm with ϵ=0italic-ϵ0\epsilon=0italic_ϵ = 0 accounts for the worst performance since there is no exploration in the first phase. Meanwhile, the E2Boost algorithm with ϵ=1italic-ϵ1\epsilon=1italic_ϵ = 1 and the E2Boost algorithm without TS is better than the GoT algorithm, indicating that the E2Boost algorithm with WD can effectively trade off the EE dilemma by sequentially optimizing the parameter ϵitalic-ϵ\epsilonitalic_ϵ. More importantly, since the two-stage allocation mechanism, the E2Boost algorithm has a faster convergence rate than the GoT algorithm and the E2Boost algorithm without TS.

Refer to caption
Figure 8: The performance of different algorithms versus time slot with optimal phase shift setting when ν1=ν2=2,000,Z=10formulae-sequencesubscript𝜈1subscript𝜈22000𝑍10\nu_{1}=\nu_{2}=2,000,Z=10italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 2 , 000 , italic_Z = 10 in a fixed network scenario of Fig. 4.

Next, we evaluate the impact of the RIS-enabled channel on the performance of the proposed algorithm. Fig. 9 depicts the performance of the E2Boost algorithm under the optimal and constant phase shift setting for different Rice factors (ζ=0.5,1,4,10𝜁0.51410\zeta=0.5,1,4,10italic_ζ = 0.5 , 1 , 4 , 10) when ν1=ν2=2,000,Z=10formulae-sequencesubscript𝜈1subscript𝜈22000𝑍10\nu_{1}=\nu_{2}=2,000,Z=10italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 2 , 000 , italic_Z = 10. We can see that the performance of the optimal phase shift setting is much better than the constant phase shift setting for different Rice factors. This is because the optimal phase shift is designed for the centralized UEs. Thus, IoT devices close to the UEs will also have better performance. On the other hand, a bigger ζ𝜁\zetaitalic_ζ will result in a higher average total throughput. This phenomenon can be explained by (3), where a big ζ𝜁\zetaitalic_ζ means that the channel gain is dominated by the LoS component, i.e., the directional reflection link of IoT-RIS-BS. Therefore, the channel gain is dominated by the RIS when ζ𝜁\zetaitalic_ζ trends to ++\infty+ ∞; while ζ𝜁\zetaitalic_ζ trends to 00 mean that the IoT device only transmits on Pattern II.

Refer to caption
Figure 9: The performance of E2Boost algorithm for different phase shift settings and Rice factors in a fixed network scenario of Fig. 4.

VI-C Random Network Scenario

In the following, we evaluate the proposed algorithm under the random network scenario. At each MC trial, we regenerate the locations of the IoT devices uniformly in the circle area of Fig. 4. Meanwhile, the distance of any two devices is subject to no less than 5555 m. The locations of RISs, UEs, and BS are set the same as those in Fig. 4.

Fig. 10 compares the average total throughput of different algorithms in the optimal phase shift setting with ν1=ν2=2,000,Z=11formulae-sequencesubscript𝜈1subscript𝜈22000𝑍11\nu_{1}=\nu_{2}=2,000,Z=11italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 2 , 000 , italic_Z = 11 over 103superscript10310^{3}10 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT random network scenarios. It can be observed that the performance of all algorithms except the random selection method increases with time slot t𝑡titalic_t. Again, the E2Boost algorithm has the best performance and a fast convergence rate. The Q-learning method also exhibits a fast convergence rate, but it suffers from some performance loss due to the lack of the non-cooperation game phase to resolve the collisions among players. Moreover, the gaps between the optimal solution and these algorithms increase compared with Fig. 8 in the fixed network case. The reason is that these algorithms fail to find the optimal RIS for each player under some extreme network scenarios with the constant parameter ν2subscript𝜈2\nu_{2}italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT and the limited time horizon T𝑇Titalic_T.

Refer to caption
Figure 10: The performance of different algorithms versus time slot with optimal phase shift setting when ν1=ν2=2000,Z=10formulae-sequencesubscript𝜈1subscript𝜈22000𝑍10\nu_{1}=\nu_{2}=2000,Z=10italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 2000 , italic_Z = 10 over 103superscript10310^{3}10 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT random network scenarios.
Refer to caption
(a) A random network scenario
Refer to caption
(b) Performance comparison at left network scenario
Figure 11: (a) A random network scenario in a 200×200200200200\times 200200 × 200 m square area with K=3,N=11formulae-sequence𝐾3𝑁11K=3,N=11italic_K = 3 , italic_N = 11. (b) The performance of different algorithms versus time slots with optimal phase shift setting over 103superscript10310^{3}10 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT random network scenarios.

Moreover, we study the performance of the proposed algorithm by considering the case that the number of IoT devices is larger than that of RISs, i.e., N>K𝑁𝐾N>Kitalic_N > italic_K. We first generate a new random network scenario, as shown in Fig. 11a, where N=11𝑁11N=11italic_N = 11, K=3𝐾3K=3italic_K = 3, and the other parameters are the same as those in Fig. 4. We can see from Fig. 11a that N=11𝑁11N=11italic_N = 11 IoT devices are divided into three clusters by using the k𝑘kitalic_k-means clustering method according to their geographic locations. Fig. 11b compares the performance of the modified E2Boost algorithm (i.e., Algorithm 2) with different settings of ν1=ν2={1000,2000,3000}subscript𝜈1subscript𝜈2100020003000\nu_{1}=\nu_{2}=\{1000,2000,3000\}italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = { 1000 , 2000 , 3000 }, and the random selection method in the optimal phase shift setting over 103superscript10310^{3}10 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT random network scenarios of Fig. 11a. It can be seen that the modified E2Boost algorithm with ν1=ν2=1,000formulae-sequencesubscript𝜈1subscript𝜈21000\nu_{1}=\nu_{2}=1,000italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 1 , 000 has the best performance, and all the algorithms except for the random selection method can converge to the optimal allocation. Compared with the results in Fig. 10, the average total throughput in the network scenario of Fig. 11a is about 2.43502.43502.43502.4350 Mbps, which is slightly better than 2.07842.07842.07842.0784 Mbps in Fig. 4. This demonstrates that, although the number of IoT devices increases, the performance gain from the non-RIS-assisted transmission pattern is insignificant.

At last, we investigate the influence of the number of IoT devices on the proposed algorithm. The number of RISs is set to 10101010 and is placed on a semicircle with a radius of 55555555 m from 3π/43𝜋43\pi/43 italic_π / 4 to 5π/45𝜋45\pi/45 italic_π / 4, as shown in Fig. 12a. The distances between two neighboring RISs are equal except for the two pairs that are located in the middle and both ends.

Refer to caption
(a) A random network scenario
Refer to caption
(b) Performance comparison at left network scenario
Figure 12: (a) A random network scenario in a 200200200200 m ×\times× 200200200200 m square area with K=10𝐾10K=10italic_K = 10. (b) The performance of the optimal solution, the E2Boost algorithm, the original GoT algorithm, and the random selection method versus the number of IoT devices at 103superscript10310^{3}10 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT random network scenarios of the left figure.

Fig. 12b shows the performance of the optimal solution, the E2Boost algorithm, the original GoT algorithm, and the random selection method versus the number of IoT devices in the optimal phase shift setting where ν1=ν2=2,000,K=10,Z=10formulae-sequencesubscript𝜈1subscript𝜈22000𝐾10𝑍10\nu_{1}=\nu_{2}=2,000,K=10,Z=10italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 2 , 000 , italic_K = 10 , italic_Z = 10 over 103superscript10310^{3}10 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT random network scenarios of Fig. 12a. It can be seen that the performance of these algorithms increases with the number of IoT devices. However, the proposed algorithm is better than the GoT algorithm and the random selection method since it has a small arm space to explore. In addition, the performance gaps between the optimal solution and these algorithms increase with the number of IoT devices. The reason is that collision probabilities among players increase with the number of IoT devices, resulting in more performance loss.

VII Conclusion and Discussion

This paper studied the resource allocation problem in a RIS-assisted hybrid uplink network where several IoT devices transmit data to the BS. The objective is to maximize the sum rates of all IoT devices by finding the optimal RIS and SF for each device. We modeled this problem as a two-stage MPMAB framework, where the first stage is to find the optimal RIS, and the second stage is to find the optimal SF. Then, we proposed an E2Boost algorithm to tackle this problem by combining the ϵitalic-ϵ\epsilonitalic_ϵ-greedy algorithm, TS algorithm, and non-cooperation game method. Therefore, it can efficiently balance the EE dilemma. Furthermore, we provided an upper regret bound for the E2Boost algorithm, i.e., 𝒪(log21+δT)𝒪subscriptsuperscript1𝛿2𝑇\mathcal{O}(\log^{1+\delta}_{2}T)caligraphic_O ( roman_log start_POSTSUPERSCRIPT 1 + italic_δ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_T ), indicating that the per-round regret will trend to 00 when T𝑇Titalic_T is sufficiently large. In addition, simulation results demonstrated the effectiveness of the proposed algorithm. More importantly, it is not sensitive to the joint arm space thanks to the two-stage allocation mechanism, which can benefit practical applications.

In the system model, we assume that different RISs use different frequencies, and one RIS can serve at most one IoT device. A more general scenario is that RIS can reuse these frequencies and serve multi-IoT devices. Then, two interesting problems are how to design a mechanism that allows the UE and IoT device signals to coexist in the same RIS and how to design the RIS-assisted multi-IoT system by estimating the exact information of the RIS and CSI. These are important yet challenging problems for future study.

Appendix A Proof of Theorem 1

At each time slot, IoT device transmits on either the Pattern I or the Pattern II. For the Pattern I, the total pseudo-regret term Reg¯(1)superscript¯𝑅𝑒𝑔1\overline{Reg}^{(1)}over¯ start_ARG italic_R italic_e italic_g end_ARG start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT can be expanded by investigating Reg¯zsubscript¯𝑅𝑒𝑔𝑧\overline{Reg}_{z}over¯ start_ARG italic_R italic_e italic_g end_ARG start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT, where z𝑧zitalic_z is the epoch. Thus, we begin to bound Reg¯zsubscript¯𝑅𝑒𝑔𝑧\overline{Reg}_{z}over¯ start_ARG italic_R italic_e italic_g end_ARG start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT by computing the probability of event Ezsubscript𝐸𝑧E_{z}italic_E start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT, which is the event that the optimal assignment 𝒂superscript𝒂\boldsymbol{a}^{\ast}bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT is not adopted in the third phase at epoch z𝑧zitalic_z. We have

Pr(Ez)=Pr(Ezk,Ezm)+Pr(Ezk,Ezm¯)+Pr(Ezk¯,Ezm)=Pr(Ezk)+Pr(Ezk¯,Ezm),Prsubscript𝐸𝑧Prsubscriptsuperscript𝐸superscript𝑘𝑧subscriptsuperscript𝐸superscript𝑚𝑧Prsubscriptsuperscript𝐸superscript𝑘𝑧¯subscriptsuperscript𝐸superscript𝑚𝑧Pr¯subscriptsuperscript𝐸superscript𝑘𝑧subscriptsuperscript𝐸superscript𝑚𝑧Prsubscriptsuperscript𝐸superscript𝑘𝑧Pr¯subscriptsuperscript𝐸superscript𝑘𝑧subscriptsuperscript𝐸superscript𝑚𝑧\small\begin{split}\mathrm{Pr}(E_{z})&=\mathrm{Pr}\left(E^{k^{\ast}}_{z},E^{m^% {\ast}}_{z}\right)+\mathrm{Pr}\left(E^{k^{\ast}}_{z},\overline{E^{m^{\ast}}_{z% }}\right)+\mathrm{Pr}\left(\overline{E^{k^{\ast}}_{z}},E^{m^{\ast}}_{z}\right)% \\ &=\mathrm{Pr}\left(E^{k^{\ast}}_{z}\right)+\mathrm{Pr}\left(\overline{E^{k^{% \ast}}_{z}},E^{m^{\ast}}_{z}\right),\end{split}start_ROW start_CELL roman_Pr ( italic_E start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ) end_CELL start_CELL = roman_Pr ( italic_E start_POSTSUPERSCRIPT italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , italic_E start_POSTSUPERSCRIPT italic_m start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ) + roman_Pr ( italic_E start_POSTSUPERSCRIPT italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , over¯ start_ARG italic_E start_POSTSUPERSCRIPT italic_m start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_ARG ) + roman_Pr ( over¯ start_ARG italic_E start_POSTSUPERSCRIPT italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_ARG , italic_E start_POSTSUPERSCRIPT italic_m start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = roman_Pr ( italic_E start_POSTSUPERSCRIPT italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ) + roman_Pr ( over¯ start_ARG italic_E start_POSTSUPERSCRIPT italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_ARG , italic_E start_POSTSUPERSCRIPT italic_m start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ) , end_CELL end_ROW (33)

where Ezksubscriptsuperscript𝐸superscript𝑘𝑧E^{k^{\ast}}_{z}italic_E start_POSTSUPERSCRIPT italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT is the event that the optimal RIS is not used at the end of the z𝑧zitalic_z-th epoch of the second phase, and Ezmsubscriptsuperscript𝐸superscript𝑚𝑧E^{m^{\ast}}_{z}italic_E start_POSTSUPERSCRIPT italic_m start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT is the event that the optimal SF is not used at the end of the z𝑧zitalic_z-th epoch of the third phase.

First, we bound the probability that event Ezksubscriptsuperscript𝐸superscript𝑘𝑧E^{k^{\ast}}_{z}italic_E start_POSTSUPERSCRIPT italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT holds, i.e.,

Pr(Ezk)Pr(j=0z2Pe,zj)+Pg,z,Prsubscriptsuperscript𝐸superscript𝑘𝑧Prsuperscriptsubscript𝑗0𝑧2subscript𝑃𝑒𝑧𝑗subscript𝑃𝑔𝑧\mathrm{Pr}\left(E^{k^{\ast}}_{z}\right)\leq\mathrm{Pr}\left(\bigcup_{j=0}^{% \lfloor\frac{z}{2}\rfloor}P_{e,z-j}\right)+P_{g,z},roman_Pr ( italic_E start_POSTSUPERSCRIPT italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ) ≤ roman_Pr ( ⋃ start_POSTSUBSCRIPT italic_j = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⌊ divide start_ARG italic_z end_ARG start_ARG 2 end_ARG ⌋ end_POSTSUPERSCRIPT italic_P start_POSTSUBSCRIPT italic_e , italic_z - italic_j end_POSTSUBSCRIPT ) + italic_P start_POSTSUBSCRIPT italic_g , italic_z end_POSTSUBSCRIPT , (34)

where Pe,zsubscript𝑃𝑒𝑧P_{e,z}italic_P start_POSTSUBSCRIPT italic_e , italic_z end_POSTSUBSCRIPT is the probability that the optimal assignment is different from 𝒂superscript𝒂\boldsymbol{a}^{\ast}bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT in the first phase at epoch z𝑧zitalic_z, and Pg,zsubscript𝑃𝑔𝑧P_{g,z}italic_P start_POSTSUBSCRIPT italic_g , italic_z end_POSTSUBSCRIPT is the probability that the frequently visited strategy profile is not 𝒂superscript𝒂\boldsymbol{a}^{\ast}bold_italic_a start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT in the last z/2+1𝑧21\lfloor z/2\rfloor+1⌊ italic_z / 2 ⌋ + 1 non-cooperation game phases. Then, we need to calculate the probabilities of Pe,zsubscript𝑃𝑒𝑧P_{e,z}italic_P start_POSTSUBSCRIPT italic_e , italic_z end_POSTSUBSCRIPT and Pg,zsubscript𝑃𝑔𝑧P_{g,z}italic_P start_POSTSUBSCRIPT italic_g , italic_z end_POSTSUBSCRIPT before bounding Reg¯zsubscript¯𝑅𝑒𝑔𝑧\overline{Reg}_{z}over¯ start_ARG italic_R italic_e italic_g end_ARG start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT. In the first phase, we estimate the average successful transmission probabilities θ^n,ksubscript^𝜃𝑛𝑘\hat{\theta}_{n,k}over^ start_ARG italic_θ end_ARG start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT of all RISs. Assume i.i.d. rewards Xn,ksubscript𝑋𝑛𝑘X_{n,k}italic_X start_POSTSUBSCRIPT italic_n , italic_k end_POSTSUBSCRIPT and each player uniformly explores all K𝐾Kitalic_K RISs when event Ezksubscriptsuperscript𝐸superscript𝑘𝑧E^{k^{\ast}}_{z}italic_E start_POSTSUPERSCRIPT italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT holds. By adopting the result in [23] (see Lemma 8), we have

Pe,z2NKewν1(z2)δz+NKeν1(z2)δ36K2z,subscript𝑃𝑒𝑧2𝑁𝐾superscript𝑒𝑤subscript𝜈1superscript𝑧2𝛿𝑧𝑁𝐾superscript𝑒subscript𝜈1superscript𝑧2𝛿36superscript𝐾2𝑧P_{e,z}\leq 2NKe^{-w\nu_{1}\left(\frac{z}{2}\right)^{\delta}z}+NKe^{-\frac{\nu% _{1}\left(\frac{z}{2}\right)^{\delta}}{36K^{2}}z},italic_P start_POSTSUBSCRIPT italic_e , italic_z end_POSTSUBSCRIPT ≤ 2 italic_N italic_K italic_e start_POSTSUPERSCRIPT - italic_w italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( divide start_ARG italic_z end_ARG start_ARG 2 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT + italic_N italic_K italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( divide start_ARG italic_z end_ARG start_ARG 2 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_ARG start_ARG 36 italic_K start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG italic_z end_POSTSUPERSCRIPT , (35)

where w𝑤witalic_w is a predefined positive constant. Therefore,

Pr(j=0z2Pe,zj)2NKew2ν1(z4)δz1ewν1(z4)δ+NKeν1(z4)δ72K2z1eν136K2(z2)δ.Prsuperscriptsubscript𝑗0𝑧2subscript𝑃𝑒𝑧𝑗2𝑁𝐾superscript𝑒𝑤2subscript𝜈1superscript𝑧4𝛿𝑧1superscript𝑒𝑤subscript𝜈1superscript𝑧4𝛿𝑁𝐾superscript𝑒subscript𝜈1superscript𝑧4𝛿72superscript𝐾2𝑧1superscript𝑒subscript𝜈136superscript𝐾2superscript𝑧2𝛿\small\begin{split}\mathrm{Pr}\left(\bigcup_{j=0}^{\lfloor\frac{z}{2}\rfloor}P% _{e,z-j}\right)\leq\frac{2NKe^{-\frac{w}{2}\nu_{1}\left(\frac{z}{4}\right)^{% \delta}z}}{1-e^{-w\nu_{1}\left(\frac{z}{4}\right)^{\delta}}}+\frac{NKe^{-\frac% {\nu_{1}\left(\frac{z}{4}\right)^{\delta}}{72K^{2}}z}}{1-e^{-\frac{\nu_{1}}{36% K^{2}}\left(\frac{z}{2}\right)^{\delta}}}.\end{split}start_ROW start_CELL roman_Pr ( ⋃ start_POSTSUBSCRIPT italic_j = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⌊ divide start_ARG italic_z end_ARG start_ARG 2 end_ARG ⌋ end_POSTSUPERSCRIPT italic_P start_POSTSUBSCRIPT italic_e , italic_z - italic_j end_POSTSUBSCRIPT ) ≤ divide start_ARG 2 italic_N italic_K italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_w end_ARG start_ARG 2 end_ARG italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( divide start_ARG italic_z end_ARG start_ARG 4 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_e start_POSTSUPERSCRIPT - italic_w italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( divide start_ARG italic_z end_ARG start_ARG 4 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT end_ARG + divide start_ARG italic_N italic_K italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( divide start_ARG italic_z end_ARG start_ARG 4 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_ARG start_ARG 72 italic_K start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG italic_z end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG 36 italic_K start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ( divide start_ARG italic_z end_ARG start_ARG 2 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT end_ARG . end_CELL end_ROW (36)

In the second phase, we investigate the probability that the optimal strategy profile is not visited frequently. Let vz=[𝒂k,CN]superscript𝑣𝑧superscript𝒂𝑘superscript𝐶𝑁v^{z\ast}=[\boldsymbol{a}^{k\ast},C^{N}]italic_v start_POSTSUPERSCRIPT italic_z ∗ end_POSTSUPERSCRIPT = [ bold_italic_a start_POSTSUPERSCRIPT italic_k ∗ end_POSTSUPERSCRIPT , italic_C start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ] be the optimal strategy profile in the z𝑧zitalic_z-th game phase and Fz(v)subscript𝐹𝑧superscript𝑣F_{z}(v^{\ast})italic_F start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ( italic_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) be the number of times the optimal strategy profile has been visited at the last z2+1𝑧21\lfloor\frac{z}{2}\rfloor+1⌊ divide start_ARG italic_z end_ARG start_ARG 2 end_ARG ⌋ + 1 game phases. According to [23] (see Lemma 16), we have

Fz(v)i=zz2zt𝒢z𝕀(v(t)=vi),k𝒦.formulae-sequencesubscript𝐹𝑧superscript𝑣superscriptsubscript𝑖𝑧𝑧2𝑧subscript𝑡subscript𝒢𝑧𝕀𝑣𝑡superscript𝑣𝑖for-all𝑘𝒦F_{z}(v^{\ast})\triangleq\sum_{i=z-\lfloor\frac{z}{2}\rfloor}^{z}\sum_{t\in% \mathcal{G}_{z}}\mathbb{I}\left(v(t)=v^{i\ast}\right),\ \forall k\in\mathcal{K}.italic_F start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ( italic_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) ≜ ∑ start_POSTSUBSCRIPT italic_i = italic_z - ⌊ divide start_ARG italic_z end_ARG start_ARG 2 end_ARG ⌋ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_t ∈ caligraphic_G start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_POSTSUBSCRIPT blackboard_I ( italic_v ( italic_t ) = italic_v start_POSTSUPERSCRIPT italic_i ∗ end_POSTSUPERSCRIPT ) , ∀ italic_k ∈ caligraphic_K . (37)

Denote the stationary distribution of the optimal strategy profile by πv=minzz2jzπvisubscript𝜋superscript𝑣subscript𝑧𝑧2𝑗𝑧subscript𝜋superscript𝑣𝑖\pi_{v^{\ast}}=\min\limits_{z-\lfloor\frac{z}{2}\rfloor\leq j\leq z}\pi_{v^{i% \ast}}italic_π start_POSTSUBSCRIPT italic_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT = roman_min start_POSTSUBSCRIPT italic_z - ⌊ divide start_ARG italic_z end_ARG start_ARG 2 end_ARG ⌋ ≤ italic_j ≤ italic_z end_POSTSUBSCRIPT italic_π start_POSTSUBSCRIPT italic_v start_POSTSUPERSCRIPT italic_i ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT. If 0<η<120𝜂120<\eta<\frac{1}{2}0 < italic_η < divide start_ARG 1 end_ARG start_ARG 2 end_ARG, then πv>12(1η)subscript𝜋superscript𝑣121𝜂\pi_{v^{\ast}}>\frac{1}{2(1-\eta)}italic_π start_POSTSUBSCRIPT italic_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT > divide start_ARG 1 end_ARG start_ARG 2 ( 1 - italic_η ) end_ARG for a sufficiently large z𝑧zitalic_z, we have

Pg,zPr(Fz(v)12i=zz2zν2iδ)(C0eν2η2144Tm(18)(πv12(1η))(z2)δ)z,subscript𝑃𝑔𝑧Prsubscript𝐹𝑧superscript𝑣12superscriptsubscript𝑖𝑧𝑧2𝑧subscript𝜈2superscript𝑖𝛿superscriptsubscript𝐶0superscript𝑒subscript𝜈2superscript𝜂2144subscript𝑇𝑚18subscript𝜋superscript𝑣121𝜂superscript𝑧2𝛿𝑧\begin{split}P_{g,z}&\triangleq\mathrm{Pr}\left(F_{z}(v^{\ast})\leq\frac{1}{2}% \sum_{i=z-\lfloor\frac{z}{2}\rfloor}^{z}\nu_{2}i^{\delta}\right)\\ &\leq\left(C_{0}e^{-\frac{\nu_{2}\eta^{2}}{144T_{m}(\frac{1}{8})}\left(\pi_{v^% {\ast}}-\frac{1}{2(1-\eta)}\right)\left(\frac{z}{2}\right)^{\delta}}\right)^{z% },\end{split}start_ROW start_CELL italic_P start_POSTSUBSCRIPT italic_g , italic_z end_POSTSUBSCRIPT end_CELL start_CELL ≜ roman_Pr ( italic_F start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ( italic_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) ≤ divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∑ start_POSTSUBSCRIPT italic_i = italic_z - ⌊ divide start_ARG italic_z end_ARG start_ARG 2 end_ARG ⌋ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_i start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL ≤ ( italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 144 italic_T start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( divide start_ARG 1 end_ARG start_ARG 8 end_ARG ) end_ARG ( italic_π start_POSTSUBSCRIPT italic_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 ( 1 - italic_η ) end_ARG ) ( divide start_ARG italic_z end_ARG start_ARG 2 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT , end_CELL end_ROW (38)

where C0subscript𝐶0C_{0}italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is a constant and independent of z,πv𝑧subscript𝜋superscript𝑣z,\pi_{v^{\ast}}italic_z , italic_π start_POSTSUBSCRIPT italic_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT and η𝜂\etaitalic_η.

Second, we bound the probability that event (Ezk¯,Ezm)¯subscriptsuperscript𝐸superscript𝑘𝑧subscriptsuperscript𝐸superscript𝑚𝑧\left(\overline{E^{k^{\ast}}_{z}},E^{m^{\ast}}_{z}\right)( over¯ start_ARG italic_E start_POSTSUPERSCRIPT italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_ARG , italic_E start_POSTSUPERSCRIPT italic_m start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ) holds. The method is based on the regret analysis of the TS algorithm [29]. Here, event Ezk¯¯subscriptsuperscript𝐸superscript𝑘𝑧\overline{E^{k^{\ast}}_{z}}over¯ start_ARG italic_E start_POSTSUPERSCRIPT italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_ARG means that the player found the optimal RIS at the end of the z𝑧zitalic_z-th game phase. Let Pt,znsubscriptsuperscript𝑃𝑛𝑡𝑧P^{n}_{t,z}italic_P start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t , italic_z end_POSTSUBSCRIPT be the probability that player n𝑛nitalic_n fails to find the best SF. Since players can find the optimal SF in the third phase only when event Ezk¯¯subscriptsuperscript𝐸superscript𝑘𝑧\overline{E^{k^{\ast}}_{z}}over¯ start_ARG italic_E start_POSTSUPERSCRIPT italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_ARG holds, we have

Pr(Ezm|Ezk¯)n=1NPr(m=1,mmMj=1zWn,mj12i=1zν32i)(a)n=1N2Dkl((1cmθn,mm=1Mcmθn,m)12)i=1zν32i(b)n=1N22(cmθn,mm=1Mcmθn,m12)2(2z+12)ν3(c)N2(M2)2(2z1)ν3M2,Prconditionalsubscriptsuperscript𝐸superscript𝑚𝑧¯subscriptsuperscript𝐸superscript𝑘𝑧superscriptsubscript𝑛1𝑁Prsuperscriptsubscriptformulae-sequence𝑚1𝑚superscript𝑚𝑀superscriptsubscript𝑗1𝑧subscriptsuperscript𝑊𝑗𝑛𝑚12superscriptsubscript𝑖1𝑧subscript𝜈3superscript2𝑖𝑎superscriptsubscript𝑛1𝑁superscript2subscript𝐷klconditional1superscriptsubscript𝑐𝑚subscriptsuperscript𝜃𝑛𝑚superscriptsubscript𝑚1𝑀subscript𝑐𝑚subscript𝜃𝑛𝑚12superscriptsubscript𝑖1𝑧subscript𝜈3superscript2𝑖𝑏superscriptsubscript𝑛1𝑁superscript22superscriptsuperscriptsubscript𝑐𝑚subscriptsuperscript𝜃𝑛𝑚superscriptsubscript𝑚1𝑀subscript𝑐𝑚subscript𝜃𝑛𝑚122superscript2𝑧12subscript𝜈3𝑐𝑁superscript2superscript𝑀22superscript2𝑧1subscript𝜈3superscript𝑀2\small\begin{split}\mathrm{Pr}\left(E^{m^{\ast}}_{z}|\overline{E^{k^{\ast}}_{z% }}\right)&\triangleq\sum_{n=1}^{N}\mathrm{Pr}\left(\sum_{m=1,m\neq m^{\ast}}^{% M}\sum_{j=1}^{z}W^{j}_{n,m}\geq\frac{1}{2}\sum_{i=1}^{z}\nu_{3}2^{i}\right)\\ &\overset{(a)}{\leq}\sum_{n=1}^{N}2^{-D_{\mathrm{kl}}\left(\left(1-\frac{c_{m}% ^{\ast}\theta^{\ast}_{n,m}}{\sum_{m=1}^{M}c_{m}\theta_{n,m}}\right)\|\frac{1}{% 2}\right)\sum_{i=1}^{z}\nu_{3}2^{i}}\\ &\overset{(b)}{\leq}\sum_{n=1}^{N}2^{-2\left(\frac{c_{m}^{\ast}\theta^{\ast}_{% n,m}}{\sum_{m=1}^{M}c_{m}\theta_{n,m}}-\frac{1}{2}\right)^{2}\left(2^{z+1}-2% \right)\nu_{3}}\\ &\overset{(c)}{\leq}N2^{-\frac{(M-2)^{2}\left(2^{z}-1\right)\nu_{3}}{M^{2}}},% \end{split}start_ROW start_CELL roman_Pr ( italic_E start_POSTSUPERSCRIPT italic_m start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT | over¯ start_ARG italic_E start_POSTSUPERSCRIPT italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_ARG ) end_CELL start_CELL ≜ ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT roman_Pr ( ∑ start_POSTSUBSCRIPT italic_m = 1 , italic_m ≠ italic_m start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT italic_W start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT ≥ divide start_ARG 1 end_ARG start_ARG 2 end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL start_OVERACCENT ( italic_a ) end_OVERACCENT start_ARG ≤ end_ARG ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT 2 start_POSTSUPERSCRIPT - italic_D start_POSTSUBSCRIPT roman_kl end_POSTSUBSCRIPT ( ( 1 - divide start_ARG italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT end_ARG ) ∥ divide start_ARG 1 end_ARG start_ARG 2 end_ARG ) ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_i end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL start_OVERACCENT ( italic_b ) end_OVERACCENT start_ARG ≤ end_ARG ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT 2 start_POSTSUPERSCRIPT - 2 ( divide start_ARG italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT end_ARG - divide start_ARG 1 end_ARG start_ARG 2 end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( 2 start_POSTSUPERSCRIPT italic_z + 1 end_POSTSUPERSCRIPT - 2 ) italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL start_OVERACCENT ( italic_c ) end_OVERACCENT start_ARG ≤ end_ARG italic_N 2 start_POSTSUPERSCRIPT - divide start_ARG ( italic_M - 2 ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( 2 start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT - 1 ) italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG start_ARG italic_M start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_POSTSUPERSCRIPT , end_CELL end_ROW (39)

where Dklsubscript𝐷klD_{\mathrm{kl}}italic_D start_POSTSUBSCRIPT roman_kl end_POSTSUBSCRIPT is the KL-divergence and Wn,mjsubscriptsuperscript𝑊𝑗𝑛superscript𝑚W^{j}_{n,m^{\ast}}italic_W start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_m start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT is the number of times that the m𝑚mitalic_m-th suboptimal SF has been selected by player n𝑛nitalic_n up to the j𝑗jitalic_j-th epoch. Inequality (a) holds by using the large deviation theory [41]. Inequality (b) follows from Pinsker’s inequality, i.e., Dkl(pq)2(pq)2subscript𝐷klconditional𝑝𝑞2superscript𝑝𝑞2D_{\mathrm{kl}}(p\|q)\geq 2(p-q)^{2}italic_D start_POSTSUBSCRIPT roman_kl end_POSTSUBSCRIPT ( italic_p ∥ italic_q ) ≥ 2 ( italic_p - italic_q ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT. Inequality (c) holds due to Mcmθn,mm=1Mcmθn,m𝑀superscriptsubscript𝑐𝑚subscriptsuperscript𝜃𝑛𝑚superscriptsubscript𝑚1𝑀subscript𝑐𝑚subscript𝜃𝑛𝑚Mc_{m}^{\ast}\theta^{\ast}_{n,m}\geq\sum_{m=1}^{M}c_{m}\theta_{n,m}italic_M italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT italic_θ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT ≥ ∑ start_POSTSUBSCRIPT italic_m = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_c start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT, considering the worst case that each SF has the same probability of being selected. Therefore, by using Pr(Ezk¯,Ezm)=Pr(Ezm|Ezk¯)Pr(Ezk¯)Pr¯subscriptsuperscript𝐸superscript𝑘𝑧subscriptsuperscript𝐸superscript𝑚𝑧Prconditionalsubscriptsuperscript𝐸superscript𝑚𝑧¯subscriptsuperscript𝐸superscript𝑘𝑧Pr¯subscriptsuperscript𝐸superscript𝑘𝑧\mathrm{Pr}\left(\overline{E^{k^{\ast}}_{z}},E^{m^{\ast}}_{z}\right)=\mathrm{% Pr}\left(E^{m^{\ast}}_{z}|\overline{E^{k^{\ast}}_{z}}\right)\mathrm{Pr}\left(% \overline{E^{k^{\ast}}_{z}}\right)roman_Pr ( over¯ start_ARG italic_E start_POSTSUPERSCRIPT italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_ARG , italic_E start_POSTSUPERSCRIPT italic_m start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ) = roman_Pr ( italic_E start_POSTSUPERSCRIPT italic_m start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT | over¯ start_ARG italic_E start_POSTSUPERSCRIPT italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_ARG ) roman_Pr ( over¯ start_ARG italic_E start_POSTSUPERSCRIPT italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_ARG ), we have (40) which is given in the top of next page.

Pr(Ezk¯,Ezm)N2(M2)2(2z1)ν3M2(12NKew2ν1(z4)δz1ewν1(z4)δ+NKeν1(z4)δ72K2z1eν136K2(z2)δ+(C0eν2η2144Tm(18)(πv12(1η))(z2)δ)z).Pr¯subscriptsuperscript𝐸superscript𝑘𝑧subscriptsuperscript𝐸superscript𝑚𝑧𝑁superscript2superscript𝑀22superscript2𝑧1subscript𝜈3superscript𝑀212𝑁𝐾superscript𝑒𝑤2subscript𝜈1superscript𝑧4𝛿𝑧1superscript𝑒𝑤subscript𝜈1superscript𝑧4𝛿𝑁𝐾superscript𝑒subscript𝜈1superscript𝑧4𝛿72superscript𝐾2𝑧1superscript𝑒subscript𝜈136superscript𝐾2superscript𝑧2𝛿superscriptsubscript𝐶0superscript𝑒subscript𝜈2superscript𝜂2144subscript𝑇𝑚18subscript𝜋superscript𝑣121𝜂superscript𝑧2𝛿𝑧\small\begin{split}\mathrm{Pr}\left(\overline{E^{k^{\ast}}_{z}},E^{m^{\ast}}_{% z}\right)\leq N2^{-\frac{(M-2)^{2}\left(2^{z}-1\right)\nu_{3}}{M^{2}}}\left(1-% \frac{2NKe^{-\frac{w}{2}\nu_{1}\left(\frac{z}{4}\right)^{\delta}z}}{1-e^{-w\nu% _{1}\left(\frac{z}{4}\right)^{\delta}}}+\frac{NKe^{-\frac{\nu_{1}\left(\frac{z% }{4}\right)^{\delta}}{72K^{2}}z}}{1-e^{-\frac{\nu_{1}}{36K^{2}}\left(\frac{z}{% 2}\right)^{\delta}}}+\left(C_{0}e^{-\frac{\nu_{2}\eta^{2}}{144T_{m}(\frac{1}{8% })}\left(\pi_{v^{\ast}}-\frac{1}{2(1-\eta)}\right)\left(\frac{z}{2}\right)^{% \delta}}\right)^{z}\right).\end{split}start_ROW start_CELL roman_Pr ( over¯ start_ARG italic_E start_POSTSUPERSCRIPT italic_k start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT end_ARG , italic_E start_POSTSUPERSCRIPT italic_m start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ) ≤ italic_N 2 start_POSTSUPERSCRIPT - divide start_ARG ( italic_M - 2 ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( 2 start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT - 1 ) italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG start_ARG italic_M start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_POSTSUPERSCRIPT ( 1 - divide start_ARG 2 italic_N italic_K italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_w end_ARG start_ARG 2 end_ARG italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( divide start_ARG italic_z end_ARG start_ARG 4 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_e start_POSTSUPERSCRIPT - italic_w italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( divide start_ARG italic_z end_ARG start_ARG 4 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT end_ARG + divide start_ARG italic_N italic_K italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( divide start_ARG italic_z end_ARG start_ARG 4 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_ARG start_ARG 72 italic_K start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG italic_z end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG 36 italic_K start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ( divide start_ARG italic_z end_ARG start_ARG 2 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT end_ARG + ( italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 144 italic_T start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( divide start_ARG 1 end_ARG start_ARG 8 end_ARG ) end_ARG ( italic_π start_POSTSUBSCRIPT italic_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 ( 1 - italic_η ) end_ARG ) ( divide start_ARG italic_z end_ARG start_ARG 2 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT ) . end_CELL end_ROW (40)

Then, we continue to bound Reg¯zsubscript¯𝑅𝑒𝑔𝑧\overline{Reg}_{z}over¯ start_ARG italic_R italic_e italic_g end_ARG start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT based on (36), (38) and (40). For z>z0𝑧subscript𝑧0z>z_{0}italic_z > italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, we have

Reg¯zNΓmaxν2zδ+Pr(Ez)NΓmaxν1zδ+Pr(Ez)NΓmaxν32zNΓmaxν2zδ+NΓmax(ν1zδ+ν32z)(2NKew2ν1(z4)δz1ewν1(z4)δ+NKeν1(z4)δ72K2z1eν136K2(z2)δ+(C0eν2η2144Tm(18)(πv12(1η))(z2)δ)z)+N2Γmax(ν1zδ+ν32z)2(M2)2(2z1)ν3M2(12NKew2ν1(z4)δz1ewν1(z4)δ+NKeν1(z4)δ72K2z1eν136K2(z2)δ+(C0eν2η2144Tm(18)(πv12(1η))(z2)δ)z)NΓmax(ν12+3NK+ν2)zδ+NΓmax(6NK+1)ν3+N2Γmax(ν1zδ+ν32z)2ν3(2z1).subscript¯𝑅𝑒𝑔𝑧𝑁subscriptΓsubscript𝜈2superscript𝑧𝛿Prsubscript𝐸𝑧𝑁subscriptΓsubscript𝜈1superscript𝑧𝛿Prsubscript𝐸𝑧𝑁subscriptΓsubscript𝜈3superscript2𝑧𝑁subscriptΓsubscript𝜈2superscript𝑧𝛿𝑁subscriptΓsubscript𝜈1superscript𝑧𝛿subscript𝜈3superscript2𝑧2𝑁𝐾superscript𝑒𝑤2subscript𝜈1superscript𝑧4𝛿𝑧1superscript𝑒𝑤subscript𝜈1superscript𝑧4𝛿𝑁𝐾superscript𝑒subscript𝜈1superscript𝑧4𝛿72superscript𝐾2𝑧1superscript𝑒subscript𝜈136superscript𝐾2superscript𝑧2𝛿superscriptsubscript𝐶0superscript𝑒subscript𝜈2superscript𝜂2144subscript𝑇𝑚18subscript𝜋superscript𝑣121𝜂superscript𝑧2𝛿𝑧superscript𝑁2subscriptΓsubscript𝜈1superscript𝑧𝛿subscript𝜈3superscript2𝑧superscript2superscript𝑀22superscript2𝑧1subscript𝜈3superscript𝑀212𝑁𝐾superscript𝑒𝑤2subscript𝜈1superscript𝑧4𝛿𝑧1superscript𝑒𝑤subscript𝜈1superscript𝑧4𝛿𝑁𝐾superscript𝑒subscript𝜈1superscript𝑧4𝛿72superscript𝐾2𝑧1superscript𝑒subscript𝜈136superscript𝐾2superscript𝑧2𝛿superscriptsubscript𝐶0superscript𝑒subscript𝜈2superscript𝜂2144subscript𝑇𝑚18subscript𝜋superscript𝑣121𝜂superscript𝑧2𝛿𝑧𝑁subscriptΓsubscript𝜈123𝑁𝐾subscript𝜈2superscript𝑧𝛿𝑁subscriptΓ6𝑁𝐾1subscript𝜈3superscript𝑁2subscriptΓsubscript𝜈1superscript𝑧𝛿subscript𝜈3superscript2𝑧superscript2subscript𝜈3superscript2𝑧1\footnotesize\begin{split}&\overline{Reg}_{z}\leq N\Gamma_{\max}\nu_{2}z^{% \delta}+\mathrm{Pr}(E_{z})N\Gamma_{\max}\nu_{1}z^{\delta}+\mathrm{Pr}(E_{z})N% \Gamma_{\max}\nu_{3}2^{z}\\ &\leq N\Gamma_{\max}\nu_{2}z^{\delta}+N\Gamma_{\max}\left(\nu_{1}z^{\delta}+% \nu_{3}2^{z}\right)\left(\frac{2NKe^{-\frac{w}{2}\nu_{1}\left(\frac{z}{4}% \right)^{\delta}z}}{1-e^{-w\nu_{1}\left(\frac{z}{4}\right)^{\delta}}}\right.\\ &\left.+\frac{NKe^{-\frac{\nu_{1}\left(\frac{z}{4}\right)^{\delta}}{72K^{2}}z}% }{1-e^{-\frac{\nu_{1}}{36K^{2}}\left(\frac{z}{2}\right)^{\delta}}}+\left(C_{0}% e^{-\frac{\nu_{2}\eta^{2}}{144T_{m}(\frac{1}{8})}\left(\pi_{v^{\ast}}-\frac{1}% {2(1-\eta)}\right)\left(\frac{z}{2}\right)^{\delta}}\right)^{z}\right)+\\ &N^{2}\Gamma_{\max}\left(\nu_{1}z^{\delta}+\nu_{3}2^{z}\right)2^{-\frac{(M-2)^% {2}\left(2^{z}-1\right)\nu_{3}}{M^{2}}}\left(1-\frac{2NKe^{-\frac{w}{2}\nu_{1}% \left(\frac{z}{4}\right)^{\delta}z}}{1-e^{-w\nu_{1}\left(\frac{z}{4}\right)^{% \delta}}}\right.\\ &\left.+\frac{NKe^{-\frac{\nu_{1}\left(\frac{z}{4}\right)^{\delta}}{72K^{2}}z}% }{1-e^{-\frac{\nu_{1}}{36K^{2}}\left(\frac{z}{2}\right)^{\delta}}}+\left(C_{0}% e^{-\frac{\nu_{2}\eta^{2}}{144T_{m}(\frac{1}{8})}\left(\pi_{v^{\ast}}-\frac{1}% {2(1-\eta)}\right)\left(\frac{z}{2}\right)^{\delta}}\right)^{z}\right)\\ &\leq N\Gamma_{\max}\left(\frac{\nu_{1}}{2}+3NK+\nu_{2}\right)z^{\delta}+N% \Gamma_{\max}(6NK+1)\nu_{3}\\ &+N^{2}\Gamma_{\max}\left(\nu_{1}z^{\delta}+\nu_{3}2^{z}\right)2^{-\nu_{3}% \left(2^{z}-1\right)}.\end{split}start_ROW start_CELL end_CELL start_CELL over¯ start_ARG italic_R italic_e italic_g end_ARG start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ≤ italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_z start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT + roman_Pr ( italic_E start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ) italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_z start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT + roman_Pr ( italic_E start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ) italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL ≤ italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_z start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT + italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_z start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT + italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT ) ( divide start_ARG 2 italic_N italic_K italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_w end_ARG start_ARG 2 end_ARG italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( divide start_ARG italic_z end_ARG start_ARG 4 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_e start_POSTSUPERSCRIPT - italic_w italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( divide start_ARG italic_z end_ARG start_ARG 4 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT end_ARG end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + divide start_ARG italic_N italic_K italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( divide start_ARG italic_z end_ARG start_ARG 4 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_ARG start_ARG 72 italic_K start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG italic_z end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG 36 italic_K start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ( divide start_ARG italic_z end_ARG start_ARG 2 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT end_ARG + ( italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 144 italic_T start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( divide start_ARG 1 end_ARG start_ARG 8 end_ARG ) end_ARG ( italic_π start_POSTSUBSCRIPT italic_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 ( 1 - italic_η ) end_ARG ) ( divide start_ARG italic_z end_ARG start_ARG 2 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT ) + end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_z start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT + italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT ) 2 start_POSTSUPERSCRIPT - divide start_ARG ( italic_M - 2 ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( 2 start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT - 1 ) italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG start_ARG italic_M start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_POSTSUPERSCRIPT ( 1 - divide start_ARG 2 italic_N italic_K italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_w end_ARG start_ARG 2 end_ARG italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( divide start_ARG italic_z end_ARG start_ARG 4 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_e start_POSTSUPERSCRIPT - italic_w italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( divide start_ARG italic_z end_ARG start_ARG 4 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT end_ARG end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + divide start_ARG italic_N italic_K italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( divide start_ARG italic_z end_ARG start_ARG 4 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_ARG start_ARG 72 italic_K start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG italic_z end_POSTSUPERSCRIPT end_ARG start_ARG 1 - italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG 36 italic_K start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ( divide start_ARG italic_z end_ARG start_ARG 2 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT end_ARG + ( italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 144 italic_T start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( divide start_ARG 1 end_ARG start_ARG 8 end_ARG ) end_ARG ( italic_π start_POSTSUBSCRIPT italic_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 ( 1 - italic_η ) end_ARG ) ( divide start_ARG italic_z end_ARG start_ARG 2 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL ≤ italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( divide start_ARG italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG 2 end_ARG + 3 italic_N italic_K + italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) italic_z start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT + italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( 6 italic_N italic_K + 1 ) italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_z start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT + italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT ) 2 start_POSTSUPERSCRIPT - italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ( 2 start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT - 1 ) end_POSTSUPERSCRIPT . end_CELL end_ROW (41)

where Γmax=maxn,iμn,isubscriptΓsubscript𝑛𝑖subscript𝜇𝑛𝑖\Gamma_{\max}=\max_{n,i}\mu_{n,i}roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT = roman_max start_POSTSUBSCRIPT italic_n , italic_i end_POSTSUBSCRIPT italic_μ start_POSTSUBSCRIPT italic_n , italic_i end_POSTSUBSCRIPT is the maximum real expected reward among all players’ arms. The first inequality holds since we consider the worst case that each player contributes the maximum regret ΓmaxsubscriptΓ\Gamma_{\max}roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT. The second inequality follows by using (36) and (38). The last inequality establishes on the facts that, for z>z0𝑧subscript𝑧0z>z_{0}italic_z > italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT,

max{C0eν2η2144Tm(18)(πv12(1η))(z2)δ,ew2ν1(z4)δ,eν1(z4)δ72K2}<12subscript𝐶0superscript𝑒subscript𝜈2superscript𝜂2144subscript𝑇𝑚18subscript𝜋superscript𝑣121𝜂superscript𝑧2𝛿superscript𝑒𝑤2subscript𝜈1superscript𝑧4𝛿superscript𝑒subscript𝜈1superscript𝑧4𝛿72superscript𝐾212\small\begin{split}&\max\left\{C_{0}e^{-\frac{\nu_{2}\eta^{2}}{144T_{m}(\frac{% 1}{8})}\left(\pi_{v^{\ast}}-\frac{1}{2(1-\eta)}\right)\left(\frac{z}{2}\right)% ^{\delta}},e^{-\frac{w}{2}\nu_{1}\left(\frac{z}{4}\right)^{\delta}},e^{-\frac{% \nu_{1}\left(\frac{z}{4}\right)^{\delta}}{72K^{2}}}\right\}\\ &<\frac{1}{2}\end{split}start_ROW start_CELL end_CELL start_CELL roman_max { italic_C start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_η start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG 144 italic_T start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT ( divide start_ARG 1 end_ARG start_ARG 8 end_ARG ) end_ARG ( italic_π start_POSTSUBSCRIPT italic_v start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT - divide start_ARG 1 end_ARG start_ARG 2 ( 1 - italic_η ) end_ARG ) ( divide start_ARG italic_z end_ARG start_ARG 2 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT , italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_w end_ARG start_ARG 2 end_ARG italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( divide start_ARG italic_z end_ARG start_ARG 4 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT , italic_e start_POSTSUPERSCRIPT - divide start_ARG italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( divide start_ARG italic_z end_ARG start_ARG 4 end_ARG ) start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_ARG start_ARG 72 italic_K start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_POSTSUPERSCRIPT } end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL < divide start_ARG 1 end_ARG start_ARG 2 end_ARG end_CELL end_ROW (42)

and

2(M2)2(2z1)ν3M22ν3(2z1).superscript2superscript𝑀22superscript2𝑧1subscript𝜈3superscript𝑀2superscript2subscript𝜈3superscript2𝑧12^{-\frac{(M-2)^{2}\left(2^{z}-1\right)\nu_{3}}{M^{2}}}\leq 2^{-\nu_{3}\left(2% ^{z}-1\right)}.2 start_POSTSUPERSCRIPT - divide start_ARG ( italic_M - 2 ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ( 2 start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT - 1 ) italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG start_ARG italic_M start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG end_POSTSUPERSCRIPT ≤ 2 start_POSTSUPERSCRIPT - italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ( 2 start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT - 1 ) end_POSTSUPERSCRIPT . (43)

Finally, let Z𝑍Zitalic_Z be the total number of epochs. The total pseudo-regret Reg¯(1)superscript¯𝑅𝑒𝑔1\overline{Reg}^{(1)}over¯ start_ARG italic_R italic_e italic_g end_ARG start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT in Pattern I can be bounded as

Reg¯(2)(a)z=1ZReg¯z(b)NΓmaxz=1z0((ν1+ν2)zδ+ν32z)+NΓmaxz=z0+1Z(NΓmax(ν12+3NK+ν2)zδ+NΓmax(6NK+1)ν3+N2Γmax(ν1zδ+ν32z)2ν3(2z1))(c)NΓmaxz=1z0ν32z+NΓmaxz=1Z(ν1+ν2)zδ+ZNΓmax(6NK+1)ν3(d)NΓmax(ν1+ν2)log21+δ(Tν3+2)+NΓmaxν32z0+1+NΓmax(6NK+1)ν3log2(Tν3+2)=O(log21+δT),superscript¯𝑅𝑒𝑔2𝑎superscriptsubscript𝑧1𝑍subscript¯𝑅𝑒𝑔𝑧𝑏𝑁subscriptΓsuperscriptsubscript𝑧1subscript𝑧0subscript𝜈1subscript𝜈2superscript𝑧𝛿subscript𝜈3superscript2𝑧𝑁subscriptΓsuperscriptsubscript𝑧subscript𝑧01𝑍𝑁subscriptΓsubscript𝜈123𝑁𝐾subscript𝜈2superscript𝑧𝛿𝑁subscriptΓ6𝑁𝐾1subscript𝜈3superscript𝑁2subscriptΓsubscript𝜈1superscript𝑧𝛿subscript𝜈3superscript2𝑧superscript2subscript𝜈3superscript2𝑧1𝑐𝑁subscriptΓsuperscriptsubscript𝑧1subscript𝑧0subscript𝜈3superscript2𝑧𝑁subscriptΓsuperscriptsubscript𝑧1𝑍subscript𝜈1subscript𝜈2superscript𝑧𝛿𝑍𝑁subscriptΓ6𝑁𝐾1subscript𝜈3𝑑𝑁subscriptΓsubscript𝜈1subscript𝜈2subscriptsuperscript1𝛿2𝑇subscript𝜈32𝑁subscriptΓsubscript𝜈3superscript2subscript𝑧01𝑁subscriptΓ6𝑁𝐾1subscript𝜈3subscript2𝑇subscript𝜈32𝑂subscriptsuperscript1𝛿2𝑇\small\begin{split}&\overline{Reg}^{(2)}\overset{(a)}{\leq}\sum_{z=1}^{Z}% \overline{Reg}_{z}\overset{(b)}{\leq}N\Gamma_{\max}\sum_{z=1}^{z_{0}}\left((% \nu_{1}+\nu_{2})z^{\delta}+\nu_{3}2^{z}\right)\\ &+N\Gamma_{\max}\sum_{z=z_{0}+1}^{Z}\left(N\Gamma_{\max}\left(\frac{\nu_{1}}{2% }+3NK+\nu_{2}\right)z^{\delta}\right.\\ &\left.+N\Gamma_{\max}(6NK+1)\nu_{3}+N^{2}\Gamma_{\max}\left(\nu_{1}z^{\delta}% +\nu_{3}2^{z}\right)2^{-\nu_{3}\left(2^{z}-1\right)}\right)\\ &\overset{(c)}{\leq}N\Gamma_{\max}\sum_{z=1}^{z_{0}}\nu_{3}2^{z}+N\Gamma_{\max% }\sum_{z=1}^{Z}(\nu_{1}+\nu_{2})z^{\delta}+ZN\Gamma_{\max}(6NK+1)\nu_{3}\\ &\overset{(d)}{\leq}N\Gamma_{\max}(\nu_{1}+\nu_{2})\log^{1+\delta}_{2}\left(% \frac{T}{\nu_{3}}+2\right)+N\Gamma_{\max}\nu_{3}2^{z_{0}+1}\\ &+N\Gamma_{\max}(6NK+1)\nu_{3}\log_{2}\left(\frac{T}{\nu_{3}}+2\right)=O(\log^% {1+\delta}_{2}T),\end{split}start_ROW start_CELL end_CELL start_CELL over¯ start_ARG italic_R italic_e italic_g end_ARG start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT start_OVERACCENT ( italic_a ) end_OVERACCENT start_ARG ≤ end_ARG ∑ start_POSTSUBSCRIPT italic_z = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_Z end_POSTSUPERSCRIPT over¯ start_ARG italic_R italic_e italic_g end_ARG start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT start_OVERACCENT ( italic_b ) end_OVERACCENT start_ARG ≤ end_ARG italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_z = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ( ( italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) italic_z start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT + italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_z = italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_Z end_POSTSUPERSCRIPT ( italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( divide start_ARG italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG 2 end_ARG + 3 italic_N italic_K + italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) italic_z start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( 6 italic_N italic_K + 1 ) italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT + italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_z start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT + italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT ) 2 start_POSTSUPERSCRIPT - italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ( 2 start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT - 1 ) end_POSTSUPERSCRIPT ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL start_OVERACCENT ( italic_c ) end_OVERACCENT start_ARG ≤ end_ARG italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_z = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_POSTSUPERSCRIPT italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT + italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT italic_z = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_Z end_POSTSUPERSCRIPT ( italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) italic_z start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT + italic_Z italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( 6 italic_N italic_K + 1 ) italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL start_OVERACCENT ( italic_d ) end_OVERACCENT start_ARG ≤ end_ARG italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) roman_log start_POSTSUPERSCRIPT 1 + italic_δ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( divide start_ARG italic_T end_ARG start_ARG italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG + 2 ) + italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + 1 end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( 6 italic_N italic_K + 1 ) italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( divide start_ARG italic_T end_ARG start_ARG italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG + 2 ) = italic_O ( roman_log start_POSTSUPERSCRIPT 1 + italic_δ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_T ) , end_CELL end_ROW (44)

where (b) holds since (41) for z>z0𝑧subscript𝑧0z>z_{0}italic_z > italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and the worst case of ΓmaxsubscriptΓ\Gamma_{\max}roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT pre-round regret for zz0𝑧subscript𝑧0z\leq z_{0}italic_z ≤ italic_z start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. Inequality (d) follows from z=1ZzδZ1+δsuperscriptsubscript𝑧1𝑍superscript𝑧𝛿superscript𝑍1𝛿\sum_{z=1}^{Z}z^{\delta}\leq Z^{1+\delta}∑ start_POSTSUBSCRIPT italic_z = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_Z end_POSTSUPERSCRIPT italic_z start_POSTSUPERSCRIPT italic_δ end_POSTSUPERSCRIPT ≤ italic_Z start_POSTSUPERSCRIPT 1 + italic_δ end_POSTSUPERSCRIPT and the fact that Tz=1Z1ν32zν3(2Z2)𝑇superscriptsubscript𝑧1𝑍1subscript𝜈3superscript2𝑧subscript𝜈3superscript2𝑍2T\geq\sum_{z=1}^{Z-1}\nu_{3}2^{z}\geq\nu_{3}(2^{Z}-2)italic_T ≥ ∑ start_POSTSUBSCRIPT italic_z = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_Z - 1 end_POSTSUPERSCRIPT italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT italic_z end_POSTSUPERSCRIPT ≥ italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ( 2 start_POSTSUPERSCRIPT italic_Z end_POSTSUPERSCRIPT - 2 ), which gives Z1+δlog21+δ(T/ν3+2)superscript𝑍1𝛿subscriptsuperscript1𝛿2𝑇subscript𝜈32Z^{1+\delta}\leq\log^{1+\delta}_{2}\left({T}/{\nu_{3}}+2\right)italic_Z start_POSTSUPERSCRIPT 1 + italic_δ end_POSTSUPERSCRIPT ≤ roman_log start_POSTSUPERSCRIPT 1 + italic_δ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_T / italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT + 2 ).

For the Pattern II, the total pseudo-regret Reg¯(2)superscript¯𝑅𝑒𝑔2\overline{Reg}^{(2)}over¯ start_ARG italic_R italic_e italic_g end_ARG start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT can be bounded according to the regret analysis of the MTS algorithm in [28], i.e.,

Reg¯(2)Pa(1+ϖ)n=1Nanlog2TDKL(an,an)Δn,an,superscript¯𝑅𝑒𝑔2subscript𝑃𝑎1italic-ϖsuperscriptsubscript𝑛1𝑁subscriptsubscript𝑎𝑛subscript2𝑇subscriptDKLsubscript𝑎𝑛superscriptsubscript𝑎𝑛subscriptΔ𝑛subscript𝑎𝑛\overline{Reg}^{(2)}\leq P_{a}(1+\varpi)\sum_{n=1}^{N}\sum_{a_{n}\in\mathcal{M% }}\frac{\log_{2}T}{\mathrm{D}_{\mathrm{KL}}(a_{n},a_{n}^{\ast})}\Delta_{n,a_{n% }},over¯ start_ARG italic_R italic_e italic_g end_ARG start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT ≤ italic_P start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ( 1 + italic_ϖ ) ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ caligraphic_M end_POSTSUBSCRIPT divide start_ARG roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_T end_ARG start_ARG roman_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) end_ARG roman_Δ start_POSTSUBSCRIPT italic_n , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT , (45)

where ϖ(0,1)italic-ϖ01\varpi\in(0,1)italic_ϖ ∈ ( 0 , 1 ) and DKL()subscriptDKL\mathrm{D}_{\mathrm{KL}}(\cdot)roman_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( ⋅ ) is the KL-divergence. Term Pasubscript𝑃𝑎P_{a}italic_P start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT is the active probability of the UE.

To sum up, the total pseudo-regret Reg¯¯𝑅𝑒𝑔\overline{Reg}over¯ start_ARG italic_R italic_e italic_g end_ARG of Algorithm 1 is given by

Reg¯=Reg¯(1)+Reg¯(2)NΓmax(1Pa)(2(ν1+ν2)log21+δ(Tν3+2)+(6NK+1)ν3log2(Tν3+2))+Pa(1+ϖ)n=1Nanlog2TDKL(an,an)Δn,an,¯𝑅𝑒𝑔superscript¯𝑅𝑒𝑔1superscript¯𝑅𝑒𝑔2𝑁subscriptΓ1subscript𝑃𝑎2subscript𝜈1subscript𝜈2subscriptsuperscript1𝛿2𝑇subscript𝜈326𝑁𝐾1subscript𝜈3subscript2𝑇subscript𝜈32subscript𝑃𝑎1italic-ϖsuperscriptsubscript𝑛1𝑁subscriptsubscript𝑎𝑛subscript2𝑇subscriptDKLsubscript𝑎𝑛superscriptsubscript𝑎𝑛subscriptΔ𝑛subscript𝑎𝑛\begin{split}\overline{Reg}=&\overline{Reg}^{(1)}+\overline{Reg}^{(2)}\\ \leq&N\Gamma_{\max}(1-P_{a})\left(2(\nu_{1}+\nu_{2})\log^{1+\delta}_{2}\left(% \frac{T}{\nu_{3}}+2\right)\right.\\ &\left.+(6NK+1)\nu_{3}\log_{2}\left(\frac{T}{\nu_{3}}+2\right)\right)\\ &+P_{a}(1+\varpi)\sum_{n=1}^{N}\sum_{a_{n}\in\mathcal{M}}\frac{\log_{2}T}{% \mathrm{D}_{\mathrm{KL}}(a_{n},a_{n}^{\ast})}\Delta_{n,a_{n}},\end{split}start_ROW start_CELL over¯ start_ARG italic_R italic_e italic_g end_ARG = end_CELL start_CELL over¯ start_ARG italic_R italic_e italic_g end_ARG start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT + over¯ start_ARG italic_R italic_e italic_g end_ARG start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL ≤ end_CELL start_CELL italic_N roman_Γ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ( 1 - italic_P start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ) ( 2 ( italic_ν start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_ν start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) roman_log start_POSTSUPERSCRIPT 1 + italic_δ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( divide start_ARG italic_T end_ARG start_ARG italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG + 2 ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + ( 6 italic_N italic_K + 1 ) italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( divide start_ARG italic_T end_ARG start_ARG italic_ν start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG + 2 ) ) end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL + italic_P start_POSTSUBSCRIPT italic_a end_POSTSUBSCRIPT ( 1 + italic_ϖ ) ∑ start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∈ caligraphic_M end_POSTSUBSCRIPT divide start_ARG roman_log start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_T end_ARG start_ARG roman_D start_POSTSUBSCRIPT roman_KL end_POSTSUBSCRIPT ( italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) end_ARG roman_Δ start_POSTSUBSCRIPT italic_n , italic_a start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT , end_CELL end_ROW (46)

Appendix B RIS’s Direction and Element’s Location

We first determine the direction of the RIS in XY𝑋𝑌XYitalic_X italic_Y-plane by computing the angle φ𝜑\angle\varphi∠ italic_φ between X𝑋Xitalic_X-axis and the RIS, as shown in Fig. 3. Given the coordinates of B=(xB,yB)Bsubscript𝑥Bsubscript𝑦B\mathrm{B}=\left(x_{\mathrm{B}},y_{\mathrm{B}}\right)roman_B = ( italic_x start_POSTSUBSCRIPT roman_B end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT roman_B end_POSTSUBSCRIPT ), R=(xR,yR)Rsubscript𝑥Rsubscript𝑦R\mathrm{R}=\left(x_{\mathrm{R}},y_{\mathrm{R}}\right)roman_R = ( italic_x start_POSTSUBSCRIPT roman_R end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT roman_R end_POSTSUBSCRIPT ), U=(xU,yU)Usubscript𝑥Usubscript𝑦U\mathrm{U}=\left(x_{\mathrm{U}},y_{\mathrm{U}}\right)roman_U = ( italic_x start_POSTSUBSCRIPT roman_U end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT roman_U end_POSTSUBSCRIPT ), we have two vectors RB=(xBxR,yByR)RBsubscript𝑥Bsubscript𝑥Rsubscript𝑦Bsubscript𝑦R\overrightarrow{\mathrm{RB}}=\left(x_{\mathrm{B}}-x_{\mathrm{R}},y_{\mathrm{B}% }-y_{\mathrm{R}}\right)over→ start_ARG roman_RB end_ARG = ( italic_x start_POSTSUBSCRIPT roman_B end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT roman_R end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT roman_B end_POSTSUBSCRIPT - italic_y start_POSTSUBSCRIPT roman_R end_POSTSUBSCRIPT ) and RU=(xUxR,yUyR)RUsubscript𝑥Usubscript𝑥Rsubscript𝑦Usubscript𝑦R\overrightarrow{\mathrm{RU}}=\left(x_{\mathrm{U}}-x_{\mathrm{R}},y_{\mathrm{U}% }-y_{\mathrm{R}}\right)over→ start_ARG roman_RU end_ARG = ( italic_x start_POSTSUBSCRIPT roman_U end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT roman_R end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT roman_U end_POSTSUBSCRIPT - italic_y start_POSTSUBSCRIPT roman_R end_POSTSUBSCRIPT ). According to the plane analytical geometry theory, we can obtain the direction vector RCRC\overrightarrow{\mathrm{RC}}over→ start_ARG roman_RC end_ARG, i.e., the bisector of angle BRUBRU\angle\text{BRU}∠ BRU, as

  1. 1.

    If cosRB,RU0RBRU0\cos\langle\overrightarrow{\mathrm{RB}},\overrightarrow{\mathrm{RU}}\rangle\geq 0roman_cos ⟨ over→ start_ARG roman_RB end_ARG , over→ start_ARG roman_RU end_ARG ⟩ ≥ 0, then

    RC=(xRC,yRC)=RB|RB|+RU|RU|=(xRxB|RB|+xUxR|RU|,yRyB|RB|+yUyR|RU|);RCsubscript𝑥RCsubscript𝑦RCRBRBRURUsubscript𝑥Rsubscript𝑥BRBsubscript𝑥Usubscript𝑥RRUsubscript𝑦Rsubscript𝑦BRBsubscript𝑦Usubscript𝑦RRU\begin{split}\overrightarrow{\mathrm{RC}}&=\left(x_{\mathrm{RC}},y_{\mathrm{RC% }}\right)=-\frac{\overrightarrow{\mathrm{RB}}}{|\overrightarrow{\mathrm{RB}}|}% +\frac{\overrightarrow{\mathrm{RU}}}{|\overrightarrow{\mathrm{RU}}|}\\ &=\left(\frac{x_{\mathrm{R}}-x_{\mathrm{B}}}{|\overrightarrow{\mathrm{RB}}|}+% \frac{x_{\mathrm{U}}-x_{\mathrm{R}}}{|\overrightarrow{\mathrm{RU}}|},\frac{y_{% \mathrm{R}}-y_{\mathrm{B}}}{|\overrightarrow{\mathrm{RB}}|}+\frac{y_{\mathrm{U% }}-y_{\mathrm{R}}}{|\overrightarrow{\mathrm{RU}}|}\right);\end{split}start_ROW start_CELL over→ start_ARG roman_RC end_ARG end_CELL start_CELL = ( italic_x start_POSTSUBSCRIPT roman_RC end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT roman_RC end_POSTSUBSCRIPT ) = - divide start_ARG over→ start_ARG roman_RB end_ARG end_ARG start_ARG | over→ start_ARG roman_RB end_ARG | end_ARG + divide start_ARG over→ start_ARG roman_RU end_ARG end_ARG start_ARG | over→ start_ARG roman_RU end_ARG | end_ARG end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = ( divide start_ARG italic_x start_POSTSUBSCRIPT roman_R end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT roman_B end_POSTSUBSCRIPT end_ARG start_ARG | over→ start_ARG roman_RB end_ARG | end_ARG + divide start_ARG italic_x start_POSTSUBSCRIPT roman_U end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT roman_R end_POSTSUBSCRIPT end_ARG start_ARG | over→ start_ARG roman_RU end_ARG | end_ARG , divide start_ARG italic_y start_POSTSUBSCRIPT roman_R end_POSTSUBSCRIPT - italic_y start_POSTSUBSCRIPT roman_B end_POSTSUBSCRIPT end_ARG start_ARG | over→ start_ARG roman_RB end_ARG | end_ARG + divide start_ARG italic_y start_POSTSUBSCRIPT roman_U end_POSTSUBSCRIPT - italic_y start_POSTSUBSCRIPT roman_R end_POSTSUBSCRIPT end_ARG start_ARG | over→ start_ARG roman_RU end_ARG | end_ARG ) ; end_CELL end_ROW
  2. 2.

    If cosRB,RU<0RBRU0\cos\langle\overrightarrow{\mathrm{RB}},\overrightarrow{\mathrm{RU}}\rangle<0roman_cos ⟨ over→ start_ARG roman_RB end_ARG , over→ start_ARG roman_RU end_ARG ⟩ < 0, then

    RC=(xRC,yRC)=RB|RB|+RU|RU|=(xBxR|RB|+xUxR|RU|,yByR|RB|+yUyR|RU|).RCsubscript𝑥RCsubscript𝑦RCRBRBRURUsubscript𝑥Bsubscript𝑥RRBsubscript𝑥Usubscript𝑥RRUsubscript𝑦Bsubscript𝑦RRBsubscript𝑦Usubscript𝑦RRU\begin{split}\overrightarrow{\mathrm{RC}}&=\left(x_{\mathrm{RC}},y_{\mathrm{RC% }}\right)=\frac{\overrightarrow{\mathrm{RB}}}{|\overrightarrow{\mathrm{RB}}|}+% \frac{\overrightarrow{\mathrm{RU}}}{|\overrightarrow{\mathrm{RU}}|}\\ &=\left(\frac{x_{\mathrm{B}}-x_{\mathrm{R}}}{|\overrightarrow{\mathrm{RB}}|}+% \frac{x_{\mathrm{U}}-x_{\mathrm{R}}}{|\overrightarrow{\mathrm{RU}}|},\frac{y_{% \mathrm{B}}-y_{\mathrm{R}}}{|\overrightarrow{\mathrm{RB}}|}+\frac{y_{\mathrm{U% }}-y_{\mathrm{R}}}{|\overrightarrow{\mathrm{RU}}|}\right).\end{split}start_ROW start_CELL over→ start_ARG roman_RC end_ARG end_CELL start_CELL = ( italic_x start_POSTSUBSCRIPT roman_RC end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT roman_RC end_POSTSUBSCRIPT ) = divide start_ARG over→ start_ARG roman_RB end_ARG end_ARG start_ARG | over→ start_ARG roman_RB end_ARG | end_ARG + divide start_ARG over→ start_ARG roman_RU end_ARG end_ARG start_ARG | over→ start_ARG roman_RU end_ARG | end_ARG end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL = ( divide start_ARG italic_x start_POSTSUBSCRIPT roman_B end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT roman_R end_POSTSUBSCRIPT end_ARG start_ARG | over→ start_ARG roman_RB end_ARG | end_ARG + divide start_ARG italic_x start_POSTSUBSCRIPT roman_U end_POSTSUBSCRIPT - italic_x start_POSTSUBSCRIPT roman_R end_POSTSUBSCRIPT end_ARG start_ARG | over→ start_ARG roman_RU end_ARG | end_ARG , divide start_ARG italic_y start_POSTSUBSCRIPT roman_B end_POSTSUBSCRIPT - italic_y start_POSTSUBSCRIPT roman_R end_POSTSUBSCRIPT end_ARG start_ARG | over→ start_ARG roman_RB end_ARG | end_ARG + divide start_ARG italic_y start_POSTSUBSCRIPT roman_U end_POSTSUBSCRIPT - italic_y start_POSTSUBSCRIPT roman_R end_POSTSUBSCRIPT end_ARG start_ARG | over→ start_ARG roman_RU end_ARG | end_ARG ) . end_CELL end_ROW

Thus, the direction of the RIS in XY𝑋𝑌XYitalic_X italic_Y-plane (i.e., the normal vector ARAR\overrightarrow{\mathrm{AR}}over→ start_ARG roman_AR end_ARG of line RC) is AR=(yRC,xRC)ARsubscript𝑦RCsubscript𝑥RC\overrightarrow{\mathrm{AR}}=\left(-y_{\mathrm{RC}},x_{\mathrm{RC}}\right)over→ start_ARG roman_AR end_ARG = ( - italic_y start_POSTSUBSCRIPT roman_RC end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT roman_RC end_POSTSUBSCRIPT ). It is easy to obtain the angle φ𝜑\angle\varphi∠ italic_φ by

φ=arctan(xRCyRC).𝜑subscript𝑥RCsubscript𝑦RC\angle\varphi=-\arctan\left(\frac{x_{\mathrm{RC}}}{y_{\mathrm{RC}}}\right).∠ italic_φ = - roman_arctan ( divide start_ARG italic_x start_POSTSUBSCRIPT roman_RC end_POSTSUBSCRIPT end_ARG start_ARG italic_y start_POSTSUBSCRIPT roman_RC end_POSTSUBSCRIPT end_ARG ) . (47)

Next, based on the angle φ𝜑\angle\varphi∠ italic_φ, we can compute the location of each element in the RIS, i.e.,

{x(l1,l2)=(l151)dvcosφ+xR,y(l1,l2)=(l151)dvsinφ+yR,z(l1,l2)=(l251)dh+10,cases𝑥subscript𝑙1subscript𝑙2subscript𝑙151subscript𝑑𝑣𝜑subscript𝑥Rmissing-subexpression𝑦subscript𝑙1subscript𝑙2subscript𝑙151subscript𝑑𝑣𝜑subscript𝑦Rmissing-subexpression𝑧subscript𝑙1subscript𝑙2subscript𝑙251subscript𝑑10missing-subexpression\begin{split}\left\{\begin{array}[]{ll}x(l_{1},l_{2})=\left(l_{1}-51\right)d_{% v}\cos\angle\varphi+x_{\mathrm{R}},\\ y(l_{1},l_{2})=\left(l_{1}-51\right)d_{v}\sin\angle\varphi+y_{\mathrm{R}},\\ z(l_{1},l_{2})=\left(l_{2}-51\right)d_{h}+10,\end{array}\right.\end{split}start_ROW start_CELL { start_ARRAY start_ROW start_CELL italic_x ( italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) = ( italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - 51 ) italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT roman_cos ∠ italic_φ + italic_x start_POSTSUBSCRIPT roman_R end_POSTSUBSCRIPT , end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL italic_y ( italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) = ( italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT - 51 ) italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT roman_sin ∠ italic_φ + italic_y start_POSTSUBSCRIPT roman_R end_POSTSUBSCRIPT , end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL italic_z ( italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) = ( italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT - 51 ) italic_d start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT + 10 , end_CELL start_CELL end_CELL end_ROW end_ARRAY end_CELL end_ROW (48)

where dv=dh=0.01subscript𝑑𝑣subscript𝑑0.01d_{v}=d_{h}=0.01italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT = italic_d start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT = 0.01 are the offsets in RIS’s horizontal and vertical planes, respectively. Constant 51515151 is the 51515151-th row or column elements in the RIS and constant 10101010 is the height of the RIS. Symbol (l1,l2)subscript𝑙1subscript𝑙2(l_{1},l_{2})( italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) are the integers in [0,101]0101[0,101][ 0 , 101 ], standing for the index ceil of the RIS elements’ matrix.

References

  • [1] Q. Wu and R. Zhang, “Towards smart and reconfigurable environment: Intelligent reflecting surface aided wireless network,” IEEE Commun. Mag., vol. 58, no. 1, pp. 106–112, Feb. 2019.
  • [2] M. Di Renzo, A. Zappone, M. Debbah, M.-S. Alouini, C. Yuen, J. De Rosny, and S. Tretyakov, “Smart radio environments empowered by reconfigurable intelligent surfaces: How it works, state of research, and the road ahead,” IEEE J. Select. Areas in Commun., vol. 38, no. 11, pp. 2450–2525, Nov. 2020.
  • [3] M. A. ElMossallamy, H. Zhang, L. Song, K. G. Seddik, Z. Han, and G. Y. Li, “Reconfigurable intelligent surfaces for wireless communications: Principles, challenges, and opportunities,” IEEE Trans. on Cogn. Commun. and Net., vol. 6, no. 3, pp. 990–1002, Mar. 2020.
  • [4] H. Zhang, B. Di, L. Song, and Z. Han, “Reconfigurable intelligent surfaces assisted communications with limited phase shifts: How many phase shifts are enough?” IEEE Trans. Veh. Technol., vol. 64, no. 4, pp. 4498–4502, Apr. 2020.
  • [5] B. Di, H. Zhang, L. Song, Y. Li, Z. Han, and H. V. Poor, “Hybrid beamforming for reconfigurable intelligent surface based multi-user communications: Achievable rates with limited discrete phase shifts,” vol. 38, no. 8, pp. 1809–1822, Aug. 2020.
  • [6] S. Li, B. Duo, X. Yuan, Y.-C. Liang, and M. Di Renzo, “Reconfigurable intelligent surface assisted UAV communication: Joint trajectory design and passive beamforming,” vol. 9, no. 5, pp. 716–720, May 2020.
  • [7] H. Zhang, B. Di, L. Song, and Z. Han, Reconfigurable Intelligent Surface-Empowered 6G.   Springer, 2021.
  • [8] O. Liberg, M. Sundberg, E. Wang, J. Bergman, and J. Sachs, Cellular Internet of things: technologies, standards, and performance.   Academic Press, 2017.
  • [9] Q. Qi and X. Chen, “Wireless powered massive access for cellular Internet of Things with imperfect SIC and nonlinear EH,” IEEE Internet of Things J., vol. 6, no. 2, pp. 3110–3120, Feb. 2018.
  • [10] S. Dama, V. Sathya, K. Kuchi, and T. V. Pasca, “A feasible cellular Internet of Things: Enabling edge computing and the iot in dense futuristic cellular networks,” IEEE Consumer Electron. Mag., vol. 6, no. 1, pp. 66–72, Jan. 2016.
  • [11] M. Elsaadany, A. Ali, and W. Hamouda, “Cellular LTE-A technologies for the future Internet-of-Things: Physical layer features and challenges,” IEEE Commun. surveys Tuts., vol. 19, no. 4, pp. 2544–2572, Apr. 2017.
  • [12] A. Waret, M. Kaneko, A. Guitton, and N. El Rachkidy, “LoRa throughput analysis with imperfect spreading factor orthogonality,” IEEE Wireless Commun. Lett., vol. 8, no. 2, pp. 408–411, Oct. 2018.
  • [13] J. Lyu, D. Yu, and L. Fu, “Achieving max-min throughput in LoRa networks,” in Int. Conf. on Comput., Net. and Commun., Big Island, HI, Feb. 2020.
  • [14] S. Boyd, S. P. Boyd, and L. Vandenberghe, Convex optimization.   Cambridge University Press, 2004.
  • [15] C. H. Papadimitriou and K. Steiglitz, Combinatorial optimization: algorithms and complexity.   Courier Corporation, 1998.
  • [16] Y. S. Nasir and D. Guo, “Multi-agent deep reinforcement learning for dynamic power allocation in wireless networks,” IEEE J. Select. Areas in Commun., vol. 37, no. 10, pp. 2239–2250, Oct. 2019.
  • [17] B. Gu, X. Zhang, Z. Lin, and M. Alazab, “Deep multiagent reinforcement-learning-based resource allocation for Internet of controllable things,” IEEE Int. of Things J., vol. 8, no. 5, pp. 3066–3074, May 2020.
  • [18] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT Press, 2018.
  • [19] S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Foundations and Trends® in Mach. Learning, vol. 5, no. 1, pp. 1–122, Dec. 2012.
  • [20] D.-T. Ta, K. Khawam, S. Lahoud, C. Adjih, and S. Martin, “LoRa-MAB: Toward an intelligent resource allocation approach for LoRaWAN,” in Proc. IEEE Glob. Telecom. Conf., Waikoloa, HI, Dec. 2019.
  • [21] H. Tibrewal, S. Patchala, M. K. Hanawal, and S. J. Darak, “Distributed learning and optimal assignment in multiplayer heterogeneous networks,” in Proc. IEEE INFOCOM, Pairs, France, Jun. 2019.
  • [22] S. Zafaruddin, I. Bistritz, A. Leshem, and D. Niyato, “Distributed learning for channel allocation over a shared spectrum,” IEEE J. Select. Areas in Commun., vol. 37, no. 10, pp. 2337–2349, Aug. 2019.
  • [23] I. Bistritz and A. Leshem, “Distributed multi-player bandits-a game of thrones approach,” in Advances in Neural Inform. Process. Syst., Montréal, Canada, Dec. 2018.
  • [24] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “The nonstochastic multi-armed bandit problem,” SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, Jan. 2002.
  • [25] R. Jonker and T. Volgenant, “Improving the Hungarian assignment algorithm,” Operations Res. Letters, vol. 5, no. 4, pp. 171–175, Apr. 1986.
  • [26] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Mach. Learning, vol. 47, no. 2-3, pp. 235–256, Feb. 2002.
  • [27] B. Arras, E. Azmoodeh, G. Poly, and Y. Swan, “A bound on the Wasserstein-2 distance between linear combinations of independent random variables,” Stochast. Processes and Their Appl., vol. 129, no. 7, pp. 2341–2375, Jul. 2019.
  • [28] H. Gupta, A. Eryilmaz, and R. Srikant, “Low-complexity, low-regret link rate selection in rapidly-varying wireless channels,” in Proc. IEEE INFOCOM, Honolulu, HI, Apr. 2018.
  • [29] S. Agrawal and N. Goyal, “Further optimal regret bounds for Thompson sampling,” in Artificial Intell. and Statist., Scottsdale, AZ, Apr. 2013.
  • [30] L. You, J. Xiong, Y. Huang, D. W. K. Ng, C. Pan, W. Wang, and X. Gao, “Reconfigurable intelligent surfaces-assisted multiuser MIMO uplink transmission with partial CSI,” IEEE Trans. Wireless Commun., vol. 20, no. 9, pp. 5613–5627, Sep. 2017.
  • [31] S. Abeywickrama, R. Zhang, Q. Wu, and C. Yuen, “Intelligent reflecting surface: Practical phase shift model and beamforming optimization,” IEEE Trans. on Commun., vol. 68, no. 9, pp. 5849–5863, Sep. 2020.
  • [32] 3GPP TR 38.901, “Study on channel model for frequencies from 0.5 to 100 GHz (release 14),” 3GPP, Jan. 2018.
  • [33] J. Tong, M. Jin, Q. Guo, and Y. Li, “Cooperative spectrum sensing: A blind and soft fusion detector,” IEEE Trans. Wireless Commun., vol. 17, no. 4, pp. 2726–2737, Apr. 2018.
  • [34] J. Tong, M. Jin, Q. Guo, and L. Qu, “Energy detection under interference power uncertainty,” IEEE Commun. Lett., vol. 21, no. 8, pp. 1887–1890, Aug. 2017.
  • [35] J. Choi, C. Joo, J. Zhang, and N. B. Shroff, “Distributed link scheduling under SINR model in multihop wireless networks,” IEEE/ACM Trans. Networking., vol. 22, no. 4, pp. 1204–1217, Aug. 2014.
  • [36] H. Gupta, A. Eryilmaz, and R. Srikant, “Low-complexity, low-regret link rate selection in rapidly-varying wireless channels,” in Proc. IEEE INFOCOM, Honolulu, HI, Apr. 2018.
  • [37] A. Menon and J. S. Baras, “Convergence guarantees for a decentralized algorithm achieving Pareto optimality,” in American Control Conf., Washington, DC, Jun. 2013.
  • [38] J. R. Marden, H. P. Young, and L. Y. Pao, “Achieving Pareto optimality through distributed learning,” SIAM J. on Control and Optimization, vol. 52, no. 5, pp. 2753–2770, May 2014.
  • [39] Y. Rubner, C. Tomasi, and L. J. Guibas, “The earth mover’s distance as a metric for image retrieval,” Int. J. of Comput. Vision, vol. 40, no. 2, pp. 99–121, Nov. 2000.
  • [40] A. Likas, N. Vlassis, and J. J. Verbeek, “The global k𝑘kitalic_k-means clustering algorithm,” Pattern Recognition, vol. 36, no. 2, pp. 451–461, Feb. 2003.
  • [41] T. M. Cover and J. A. Thomas, Elements of information theory.   John Wiley & Sons, 2012.