HiBid: A Cross-Channel Constrained Bidding System with Budget Allocation by Hierarchical Offline Deep Reinforcement Learning

Hao Wang*, Bo Tang*, Chi Harold Liu, Shangqin Mao, Jiahong Zhou, Zipeng Dai, Yaqi Sun, Qianlong Xie, Xingxing Wang, Dong Wang

H. Wang and B. Tang contributed equally to this work. H. Wang, C. H. Liu (Corresponding Author), and Z. Dai are with the School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China. Email: liuchi02@gmail.com. B. Tang is with Meituan and Institute for Advanced Algorithms Research, Shanghai, China. S. Mao, J. Zhou, Y. Sun, Q. Xie, X. Wang and D. Wang are with Meituan, Beijing, China.

Abstract

Online display advertising platforms serve numerous advertisers by providing real-time bidding (RTB) for billions of ad requests every day. The bidding strategy handles ad requests across multiple channels to maximize the number of clicks under the set financial constraints, such as total budget and cost-per-click (CPC). Different from existing works that mainly focus on single-channel bidding, we explicitly consider cross-channel constrained bidding with budget allocation. Specifically, we propose a hierarchical offline deep reinforcement learning (DRL) framework called “HiBid”, consisting of a high-level planner equipped with an auxiliary loss for non-competitive budget allocation, and a data-augmentation-enhanced low-level executor that adapts its bidding strategy to the allocated budgets. Additionally, a CPC-guided action selection mechanism is introduced to satisfy the cross-channel CPC constraint. Through extensive experiments on both large-scale log data and online A/B testing, we confirm that HiBid outperforms six baselines in terms of the number of clicks, CPC satisfactory ratio, and return-on-investment (ROI). We have also deployed HiBid on the Meituan advertising platform, where it already serves tens of thousands of advertisers every day.

Index Terms:

Real-time Bidding Systems; Cross-Channel Bidding; Deep Reinforcement Learning;

I Introduction

Real-time systems are witnessing an ever-growing significance in our society, including Industrial Internet of Things (IIoT) systems employed for crowdsensing [1], online cloud auction systems that facilitate seamless streaming services [2, 3], and real-time bidding (RTB) systems that have emerged as an indispensable component in modern online E-commerce, by offering real-time bidding services [4] for millions of advertisers participating in online display advertising. RTB creates an advertisement (ad) auction that varies depending on the platform, to allow interested advertisers to bid simultaneously for ad impressions. Different auction mechanisms (e.g., generalized first price in Google AdSense and Vickrey-Clarke-Groves in Facebook) and pricing mechanisms (e.g., cost-per-mille and cost-per-time) are chosen based on advertising and service format on the platform. One common type is the Generalized Second Price (GSP [5]) auction with cost-per-click (CPC [6]) pricing, which charges advertisers the second-highest bidding price if a user clicks on their ad. It provides a fair and flexible advertising format with an efficient evaluation of advertising performance for advertisers, which plays an important role in the online advertising industry.
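To make the GSP-with-CPC rule above concrete, the following is a minimal sketch for a single ad slot. It is a deliberate simplification: the function names are hypothetical, and reserve prices, quality scores and multiple slots (which real platforms may use) are ignored.

```python
from typing import Dict, Optional, Tuple


def run_gsp_auction(bids: Dict[str, float]) -> Optional[Tuple[str, float]]:
    """Single-slot second-price auction: return (winner, price_per_click).

    Simplification: no reserve price and no quality scores; the winner's
    per-click price is the second-highest bid (0.0 if there is no rival).
    """
    if not bids:
        return None
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else 0.0
    return winner, price


def charge(price_per_click: float, clicked: bool) -> float:
    """Under CPC pricing the advertiser pays only when the ad is clicked."""
    return price_per_click if clicked else 0.0


winner, price = run_gsp_auction({"adv_a": 1.20, "adv_b": 0.90, "adv_c": 0.50})
print(winner, price, charge(price, clicked=True), charge(price, clicked=False))
```

Here the highest bidder wins the impression but is charged the runner-up's bid, and only if a click occurs.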

Figure 1: Overview of the considered cross-channel constrained bidding (c³-bidding) scenario.

As shown in Fig. 1, for better use of limited budget within a financial constraint like CPC, more advertising platforms have begun to offer automated bidding services across various ad channels, such as recommendation ads (i.e., ads that recommend products to a potential user), search ads (i.e., ads shown in response to user search content), etc. Different ad channels have varying numbers of ad impressions and users on the basis of their behavioral habits. Some high-quality channels, which are reflected by higher conversion rate (CVR) and click-through rate (CTR), could bring better advertising effectiveness and thus revenue as a return.

It is essential to allocate budget to different channels for three reasons: (a) from the advertisers’ perspective, because peak times and ad request volumes differ across channels, investing more budget in the channels that suit them best can avoid excessive consumption in other channels, leading to higher ROI; (b) from the platform’s perspective, because ad space is limited, appropriately allocating advertisers’ budgets helps them pinpoint their target users more precisely across channels, thereby enhancing ad relevance and effectiveness. More effective ads encourage advertisers to increase their campaign investments, which in turn grows the platform’s revenue, a mutually beneficial outcome; and (c) from a market perspective, because high-quality channels are more cost-effective for all advertisers, distributing the total budget across channels in a reasonable way may prevent advertisers from competing for those high-quality channels simultaneously.

Extensive works have been conducted on constrained bidding. Most of them focused on improving the bidding strategy in a single channel under the set budget [7, 8], but did not adjust their bidding or budget allocation strategy across channels. Hence, they cannot scale well to cross-channel bidding problems. Previous works on budget allocation [9, 10] derived the optimal strategy using a prediction model to forecast the expected return of the bidding strategy. However, in these works the bidding and allocation strategies are “stand-alone”, even though many dynamic factors affect the performance of the RTB system and cause both the budget allocation and the bidding price to fluctuate. Therefore, our insight is that integrating the two allows them to exchange feedback and bring better advertising results to advertisers.

In this paper, we explicitly consider the problem of cross-channel constrained bidding with budget allocation, “c³-bidding”, where the key challenges are: (a) advertisers may allocate most of their budgets to high-quality channels, leading to possible contention and a decrease in overall performance; (b) the advertisers’ daily budget and the platform-allocated budget on each channel are highly likely to change over time, and therefore the bidding strategy should dynamically adapt to the changing budget constraint; and (c) since each channel is bid independently, it is challenging to ensure the budget constraint for individual channels while guaranteeing the cross-channel CPC constraints.

In practice, billions of ad requests arrive sequentially for tens of thousands of advertisers, so the solution space of the considered c³-bidding problem is huge and cannot be solved directly by optimization methods [11]. Meanwhile, the rapid advancement of offline deep reinforcement learning (DRL) demonstrates its potential for learning, from large-scale data, an optimal policy that also satisfies the given constraints. Furthermore, we observe that there is a hierarchy of budget allocation and constrained bidding in RTB. That is, the former assigns a percentage of the budget to each channel from a market perspective, while the latter cares more about the suitable bidding price to win ad impression opportunities under the allocated budget. Thus, we model the considered c³-bidding problem as a hierarchical Constrained Markov Decision Process (CMDP [12]). Existing hierarchical approaches [13, 14] neither address the cross-channel CPC constraints nor consider the decline in performance due to inappropriate allocation. To this end, we propose a novel hierarchical offline DRL framework called “HiBid”, using the state-of-the-art offline DRL approach MCQ [15] as the starting point of our design. Our contribution is three-fold:

1. We propose “HiBid”, a hierarchical DRL framework for the c³-bidding problem, which maintains a high-level planner for budget allocation and a low-level executor for cross-channel constrained bidding.

2. We introduce a batch loss [16] for budget allocation to prevent over-allocation on specific channels, and $\lambda$-generalization [17] for constrained bidding to adaptively respond to changing budgets. We then propose a CPC-guided action selection mechanism to significantly improve the cross-channel CPC satisfactory ratio, which is also applicable to other metrics.

3. We conduct extensive experiments on a large-scale real dataset from the Meituan advertising platform. Results show that HiBid outperforms six baselines in terms of number of clicks, CPC satisfactory ratio and ROI. We also deployed HiBid and performed online A/B testing to validate its effectiveness.

The rest of the paper is organized as follows. We review the related work in Section II and present the system model in Section III. We introduce preliminaries in Section IV. We propose HiBid in Section V, followed by the experimental results in Section VI. Finally, we conclude the paper in Section VII.

II Related Work

II-A Real-Time Bidding (RTB) systems

RTB attracts much attention and has been widely studied for various applications [18]. Some efforts have been devoted to designing bidding mechanisms to enhance the effectiveness and fairness of advertising auctions from the platform perspective. For example, Zhou et al. in [19] introduced a novel deep distribution network for optimal bidding in both open and closed online first-price auctions. Zhang et al. in [20] proposed a succinct and effective bid shading algorithm without parametric assumptions for the win distribution. Ren et al. in [21] proposed a comprehensive framework to jointly optimize user response prediction and bid landscape forecasting. Furthermore, there have been studies that approach the optimization of bidding strategies from the advertiser perspective, to improve the effectiveness of their ads during auctions. For example, Wu et al. in [22] developed a model-free DRL framework “DRLB” for constrained bidding to cope with the volatility of the auction environment. Yang et al. in [23] abstracted the essential demand of advertisers in RTB and proposed an effective linear programming solution. Those works focused on optimizing bidding prices under the given constraint for a single ad channel, to adapt to the unpredictability of the auction environment and satisfy advertisers’ requirements. However, there exist multiple advertising channels with significant quality differences in practical deployment. Also, some studies developed ways to allocate budget across multiple ad channels given a total budget constraint, e.g., some works [9, 10] formalized it as well-stated optimization problems. These methods required accurate estimation of outcome distributions (e.g., the expected number of clicks from choosing a particular budget), which is impracticable in a dynamic auction environment.

Different from the above works, this paper explicitly focuses on simultaneous budget allocation and constrained bidding for multiple ad channels, to maximize the advertising effectiveness for advertisers while ensuring platform revenue.

II-B Deep Reinforcement Learning (DRL)

DRL has been widely applied in real-time systems, including user-item recommendation [24], ad-slots allocation [25], and real-time bidding for ad impression auctions [26, 7, 8, 27]. Cai et al. in [26] and Zhao et al. in [28] utilized DRL to learn the optimal bid for a single ad in display advertising and sponsored search, respectively. He et al. in [27] formulated the budget and financial constraints simultaneously and leveraged DRL to find a unified optimal bidding function on behalf of an advertiser. Unfortunately, the optimal bidding function may not yield optimal results in uncertain auction markets. Wang et al. in [8] proposed a curriculum-guided Bayesian DRL method to generalize to highly dynamic ad markets with ROI constraints. However, the mentioned DRL-based works only focus on single ad channel bidding and have not considered joint modeling across multiple channels. Due to the joint constraint settings of the advertiser’s financial requirements across channels, bidding individually cannot yield optimal results. It is insightful to model the relationship between channels jointly for bidding through a unified approach. Therefore, we leverage hierarchical reinforcement learning to jointly model cross-channel bidding and adjust bidding strategies through high-level budget allocation, which is one motivation of our work.

DRL provides a promising approach to address the c³-bidding problem by interacting with the environment and updating policy iteratively. However, it is not suitable for training the agent in an online setting due to the potential financial risk involved. In offline DRL, the agent learns from a fixed dataset of past interactions, rather than learning online in real-time. The main challenge of offline DRL is the distribution shift [29] of state-action visitation between the learned policy and behavior policy. Recent work [30] utilized distributional penalties to regularize the learned policy to stay close to the behavior policy. Other methods [31] used generative models to approximate the behavior distribution to stay within the support of offline data during the value back up. Ajay et al. in [13] proposed a hierarchical offline reinforcement learning method with unsupervised primitive extraction. However, directly applying existing offline DRL algorithms may not effectively solve our considered c³-bidding problem, as there is no effective method to solve the cross-channel CPC constraint and the changing allocated budget.

Constrained DRL focuses on designing efficient algorithms to find optimal policies for CMDP problems under the given constraints [32]. Some works converted the CMDP problem into a Lagrangian dual problem [33], and then found an optimal Lagrangian multiplier $\lambda$ together with the corresponding policy that satisfies the constraint. Here $\lambda$ can be manually adjusted as a hyperparameter, but it is policy-sensitive and hard to fine-tune. In recent works, gradient descent [34] or bisection search [35] was used to obtain the optimal value of $\lambda$. Unfortunately, the policy needs to be retrained every time the value of $\lambda$ changes, until the constraint is satisfied. Such iterative training is unacceptable in the c³-bidding problem, because the frequently changing budget would bring huge computational overhead. Thus, in this paper we adopt a $\lambda$-generalization [17] method to learn diversified bidding strategies that can dynamically respond to the changing budget constraint. Nevertheless, the cross-channel CPC constraint requires a more appropriate treatment, which we provide with the CPC-guided action selection in Section V-C, a key contribution of this paper.

TABLE I: Important notations used in this paper.

Notation	Explanation
$P,p$	The total number, index of channels
$M,m$	The total number, index of advertisers
$I_{p},i$	The total number, index of ad requests on channel $p$
$B_{m},CPC_{m}^{set}$	Total budget and CPC constraint set by advertiser $m$
$Click(\cdot),Cost(\cdot)$	The number of clicks and actual cost
$a_{m,p,i}$	Bidding price offered by advertiser $m$ for request $i$ on channel $p$
$s_{p}^{h},a_{p}^{h},r_{p}^{h},c_{p}^{h}$	State, action, reward and cost for channel $p$ in high-level MDP
$s_{i}^{l},a_{i}^{l},r_{i}^{l},c_{i}^{l}$	State, action, reward and cost of ad request $i$ in low-level MDP
$Q^{(\cdot)}_{\theta},Q^{(\cdot)}_{\theta^{\prime}},\hat{Q}^{u}_{\eta},\hat{Q}^{c}_{\phi}$	Q-network, target network and evaluation networks
$\lambda,\lambda^{*}_{i}$	Lagrangian multiplier in bidding strategy, optimal $\lambda$ for ad request $i$
$N,N_{p}$	Data repetition times in offline training and sample number of $\lambda$ in online prediction
$\mathcal{L}_{(\cdot)},w_{1},w_{2},w_{b}$	Loss functions, weight of Q-function loss, OOD action loss, and batch loss
$Impr,Click$	The total number of impressions and clicks
$ROI,CPC,CSR$	Return-on-investment, cost-per-click, and CPC satisfactory ratio

III System Model

In this paper, our overall objective is to maximize the total number of ad clicks while satisfying the budget and CPC constraints set by all advertisers and keeping the platform’s revenue within an acceptable range. Without loss of generality, we consider an advertising platform serving $M$ advertisers across $P$ ad channels, with $I_{p}$ ad requests on channel $p$ in a day. Each advertiser $m$ sets a daily budget $B_{m}$ (i.e., the maximum amount of money they are willing to spend on their advertising campaign) and an expected cost-per-click $CPC_{m}^{set}$. The overall objective is then formulated as:

\begin{align}
\underset{a_{m,p,i}}{\mathrm{maximize}}\quad & \sum_{m=1}^{M}\sum_{p=1}^{P}\sum_{i=1}^{I_{p}}Click(a_{m,p,i}) \tag{1}\\
\mathrm{subject\,to:}\quad & \left\|\sum_{m=1}^{M}Cost(a_{m,p,i})-\kappa_{p}\right\|\leq\epsilon,\quad\forall p\in\{1,\dots,P\}, \tag{2}\\
& \frac{\sum_{p=1}^{P}\sum_{i=1}^{I_{p}}Cost(a_{m,p,i})}{\sum_{p=1}^{P}\sum_{i=1}^{I_{p}}Click(a_{m,p,i})}\leq CPC_{m}^{set}, \tag{3}\\
& \sum_{p=1}^{P}\sum_{i=1}^{I_{p}}Cost(a_{m,p,i})\leq B_{m},\quad\forall m\in\{1,\dots,M\}, \tag{4}
\end{align}

where $a_{m,p,i}$ represents the bidding price offered by advertiser $m$ for request $i$ on channel $p$, i.e., the amount advertiser $m$ is willing to pay to display their ad in response to request $i$ on that particular channel. $Cost(a_{m,p,i})$ is the actual expense incurred by advertiser $m$ when their bid for request $i$ is successful. $Click(a_{m,p,i})$ indicates whether or not a user clicks on the ad after it is displayed. If the offered bidding price does not win the auction, both $Cost(a_{m,p,i})$ and $Click(a_{m,p,i})$ are set to $0$. The constraint in Eqn. (2) is added to prevent the channels’ revenue from fluctuating too much. $\epsilon>0$ is a constant representing the fluctuation range acceptable to the platform. $\kappa_{p}$ is calculated by multiplying the historical CTR, CPC and impression count, and represents the expected consumption capacity of channel $p$ in an ideal situation.
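As an illustration of how the objective and the constraints in Eqns. (1)-(4) could be checked on logged outcomes, here is a minimal sketch. The record layout, the function names and the toy numbers are assumptions made for this example only; in particular, the per-channel cost is aggregated over all logged requests of that channel, which is one reading of Eqn. (2).

```python
from typing import Dict, List, Tuple

# Hypothetical log record: (advertiser m, channel p, cost, click) per request.
Record = Tuple[int, int, float, int]


def expected_capacity(ctr: float, cpc: float, impressions: int) -> float:
    """kappa_p: historical CTR * CPC * impression count."""
    return ctr * cpc * impressions


def evaluate(log: List[Record], budgets: Dict[int, float],
             cpc_set: Dict[int, float], kappa: Dict[int, float],
             eps: float) -> Dict[str, object]:
    cost_m = {m: 0.0 for m in budgets}
    click_m = {m: 0 for m in budgets}
    cost_p = {p: 0.0 for p in kappa}
    for m, p, cost, click in log:
        cost_m[m] += cost
        click_m[m] += click
        cost_p[p] += cost
    return {
        # Objective (1): total number of clicks.
        "clicks": sum(click_m.values()),
        # Constraint (2): channel revenue stays within eps of kappa_p.
        "revenue_ok": {p: abs(cost_p[p] - kappa[p]) <= eps for p in kappa},
        # Constraint (3), written as cost <= CPC_set * clicks to avoid 0/0.
        "cpc_ok": {m: cost_m[m] <= cpc_set[m] * click_m[m] for m in budgets},
        # Constraint (4): total spend within the daily budget.
        "budget_ok": {m: cost_m[m] <= budgets[m] for m in budgets},
    }


log = [(0, 0, 0.8, 1), (0, 1, 0.0, 0), (0, 1, 1.0, 1), (1, 0, 1.5, 1), (1, 1, 0.0, 0)]
kappa = {0: expected_capacity(0.05, 1.0, 50), 1: expected_capacity(0.02, 1.0, 50)}
result = evaluate(log, budgets={0: 5.0, 1: 1.0}, cpc_set={0: 1.0, 1: 2.0},
                  kappa=kappa, eps=0.5)
print(result)
```

In this toy log, advertiser 1 spends 1.5 against a budget of 1.0, so only the budget constraint (4) is violated.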

In practice, the incoming ad requests for all advertisements are not known a priori. This makes it hard to employ traditional combinatorial optimization methods to solve the considered c³-bidding problem. Considering its inherent hierarchical nature, we can allocate budgets to all the advertisers from the platform’s perspective. For each advertiser, bids can then be made based on the allocated budget and the set financial constraints. The budget allocation for advertisers and the decision-making of bidding prices for sequentially incoming ad requests both exhibit Markovian properties. Therefore, we model the c³-bidding problem as a hierarchical CMDP where the high-level and low-level MDPs are executed on different timescales. As shown in Figure 2, the high-level MDP is responsible for allocating the budget at intervals, while the low-level MDP bids for each ad request according to the allocated budget.

III-A High-level MDP for Budget Allocation

The high-level planner needs to allocate the budget to maximize the number of user clicks while ensuring the revenue on each channel stays within an acceptable range. Thus the objective of the high-level planner is:

\begin{align}
\underset{a_{m,p}^{h}}{\mathrm{maximize}}\quad & \sum_{m=1}^{M}\sum_{p=1}^{P}Click^{h}(a^{h}_{m,p}) \tag{5}\\
\mathrm{subject\,to:}\quad & \sum_{p=1}^{P}a^{h}_{m,p}\leq B_{m},\quad\forall m\in\{1,\dots,M\}, \tag{6}\\
& \left\|\sum_{m=1}^{M}Cost^{h}(a^{h}_{m,p})-\kappa_{p}\right\|\leq\epsilon,\quad\forall p\in\{1,\dots,P\}, \tag{7}
\end{align}

where $a_{m,p}^{h}$ denotes the allocated budget on channel $p$ for advertiser $m$ , $Click^{h}(a_{m,p}^{h})$ and $Cost^{h}(a_{m,p}^{h})$ represent the number of clicks and cost given the budget $a_{m,p}^{h}$ within the interval, respectively. Meanwhile, the revenue constraint in Eqn. (7) is also regarded as a channel-capacity constraint that prevents advertisers from engaging in severe competition for high-quality channels.

We formulate the high-level MDP for each advertiser $m$ as a tuple $(\mathcal{S}^{h},\mathcal{A}^{h},\mathcal{T}^{h},\gamma^{h},\mathcal{R}^{h},\mathcal{C}^{h})$ . That is, the high-level planner allocates the channel-level budget at each interval according to the advertiser’s budget requirements, while each channel allocation is regarded as a decision step. The details are defined as follows:

State $\mathcal{S}^{h}$ : Let $\mathcal{S}^{h}$ denote the high-level state space, consisting of the allocated budget and advertiser-level historical statistics (e.g., the average CTR and CVR).

Action $\mathcal{A}^{h}$ : The action $a_{p}^{h}\in\mathcal{A}^{h}$ is the budget assigned to channel $p$ . We discretize the action by using the percentage of the budget and mask invalid actions that go beyond the total budget.

Reward $\mathcal{R}^{h}$ : Reward $r^{h}_{p}\in\mathcal{R}^{h}$ is defined as the sum of number of user clicks in channel $p$ within the interval.

Constraint $\mathcal{C}^{h}$ : The total budget constraint and the channel-capacity constraint are given in Eqn. (6) and Eqn. (7), respectively.
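The discretized, masked action space described above can be sketched as follows. This is only an illustration under stated assumptions: the fraction grid `FRACTIONS`, the helper names and the Q-values are hypothetical, and channels are assumed to be allocated one after another so that the remaining budget shrinks at each decision step.

```python
from typing import List, Sequence

# Hypothetical grid: each action allocates this fraction of the total budget.
FRACTIONS = [0.0, 0.25, 0.5, 0.75, 1.0]


def valid_action_mask(remaining: float, fractions: Sequence[float]) -> List[bool]:
    """Mask out actions that would exceed the remaining budget fraction."""
    return [f <= remaining + 1e-9 for f in fractions]


def masked_argmax(q_values: Sequence[float], mask: Sequence[bool]) -> int:
    """Pick the valid action with the highest Q-value."""
    candidates = [i for i, ok in enumerate(mask) if ok]
    if not candidates:
        raise ValueError("no valid action under the current budget")
    return max(candidates, key=lambda i: q_values[i])


# Suppose 75% of the budget was already assigned to a previous channel.
remaining = 1.0 - 0.75
mask = valid_action_mask(remaining, FRACTIONS)
choice = masked_argmax([0.1, 0.4, 0.9, 0.3, 0.2], mask)
print(mask, FRACTIONS[choice])
```

Although the unmasked Q-values favor a 50% allocation, the mask restricts the choice to at most the remaining 25%.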

III-B Low-level MDP for Cross-Channel Constrained Bidding

After receiving the allocated budget $a^{h}_{p}$ on each channel, the low-level executor aims to maximize the number of clicks while satisfying both the budget and CPC constraints. For each advertiser $m$ , the objective function of the low-level executor can be formulated as:

\begin{align}
\underset{a^{l}_{p,i}}{\mathrm{maximize}}\quad & \sum_{p=1}^{P}\sum_{i=1}^{I_{p}}Click(a^{l}_{p,i}) \tag{8}\\
\mathrm{subject\,to:}\quad & \sum_{i=1}^{I_{p}}Cost(a^{l}_{p,i})\leq a^{h}_{p},\quad\forall p\in\{1,\dots,P\}, \tag{9}\\
& \frac{\sum_{p=1}^{P}\sum_{i=1}^{I_{p}}Cost(a^{l}_{p,i})}{\sum_{p=1}^{P}\sum_{i=1}^{I_{p}}Click(a^{l}_{p,i})}\leq CPC_{m}^{set}, \tag{10}
\end{align}

where $Cost(a^{l}_{p,i})$ and $Click(a^{l}_{p,i})$ denote the actual cost and whether the user clicks the ad after the bidding price $a^{l}_{p,i}$ is given, respectively. Each request $i$ comes from a specific channel and only affects the cost of that channel, and thus we model each channel individually. Then, the CMDP for channel $p$ can be formulated as a tuple $(\mathcal{S}^{l},\mathcal{A}^{l},\mathcal{T}^{l},\gamma^{l},\mathcal{R}^{l},\mathcal{C}^{l})$ , which is defined as follows:

State $\mathcal{S}^{l}$ : The low-level state space is a collection of allocated budgets $a^{h}_{p}$ , request and advertiser level information. The request-level information includes time, and current advertising status (e.g., budget consumption rate and financial constraints satisfactory ratio). The advertiser-level information is identical to the high-level planner state.

Action $\mathcal{A}^{l}$ : Following [27], an action $a_{i}^{l}\in[a^{l}_{\min},a^{l}_{\max}]$ represents the bidding ratio and the final bidding price is calculated by $a_{i}^{l}\cdot CPC_{m}^{set}$ .

Reward $\mathcal{R}^{l}$ : For each request $i$ , reward $r_{i}^{l}\in\{0,1\}$ is set to $1$ if the bidding is successful and the user eventually clicks the ad, and to $0$ otherwise.

Constraint $\mathcal{C}^{l}$ : Single-channel budget constraints and cross-channel CPC constraints are given in Eqn. (9) and Eqn. (10), respectively.
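The low-level action and reward definitions can be sketched for a single request as below. The ratio bounds, the `rival_bid` argument and the win rule (bid at least the highest rival bid, pay that rival bid on a click) are assumptions chosen for illustration; the paper does not fix these details in this section.

```python
from typing import Tuple


def bid_price(ratio: float, cpc_set: float,
              ratio_min: float = 0.5, ratio_max: float = 2.0) -> float:
    """Final bid = bidding ratio (clipped to [ratio_min, ratio_max]) * CPC_set."""
    clipped = min(max(ratio, ratio_min), ratio_max)
    return clipped * cpc_set


def step_outcome(bid: float, rival_bid: float, clicked: bool) -> Tuple[int, float]:
    """Return (reward, cost) for one request.

    Assumption: the bid wins when it is at least the highest rival bid, and
    the advertiser pays that rival bid only if the user clicks.
    """
    won = bid >= rival_bid
    if won and clicked:
        return 1, rival_bid
    return 0, 0.0


price = bid_price(ratio=1.2, cpc_set=2.0)
print(price, step_outcome(price, rival_bid=1.5, clicked=True))
```

Expressing the action as a ratio of $CPC_{m}^{set}$ keeps the action range comparable across advertisers whose CPC targets differ in scale.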

IV Preliminary

In order to apply offline DRL methods to the considered c³-bidding problem, the key is to maintain a conservative value estimation (i.e., to eliminate possible over-estimation). Recall that the Q-network $Q_{\theta}(s,a)$ , parameterized by $\theta$ , measures the cumulative discounted reward starting from the state-action pair $(s,a)$ . $Q_{\theta}(s,a)$ can be improved by minimizing the temporal-difference error [36]:

$$\mathcal{L}_{Q}(\theta)=\mathbb{E}_{(s,a,r,s^{\prime})\sim\mathcal{D}}\left[\big(r+\gamma Q_{\theta^{\prime}}(s^{\prime},a^{\prime})-Q_{\theta}(s,a)\big)^{2}\right], \tag{11}$$

where $a^{\prime}=\mathop{\arg\max}_{a^{\prime}}Q_{\theta^{\prime}}(s^{\prime},a^{\prime})$ and $Q_{\theta^{\prime}}$ is a target network for learning stability. The out-of-distribution (OOD) state-action pairs bring extrapolation error [31] during the offline training, resulting in a severely overestimated value function and an aggressive bidding strategy. This strategy may result in a catastrophic financial loss in c³-bidding because it is inclined to give higher bidding prices, leading to vicious competition and significant risks. Thus it is important to keep conservatism in value estimation which can help prevent the learned policy from taking risky actions.
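As a small worked instance of the temporal-difference loss in Eqn. (11), the sketch below uses lookup tables in place of the neural networks $Q_{\theta}$ and $Q_{\theta^{\prime}}$. The tabular representation, the names and the numbers are illustrative assumptions, and terminal states are not handled.

```python
from typing import Dict, List, Sequence, Tuple

QTable = Dict[Tuple[int, int], float]     # (state, action) -> value
Transition = Tuple[int, int, float, int]  # (s, a, r, s')


def td_loss(q: QTable, q_target: QTable, batch: List[Transition],
            actions: Sequence[int], gamma: float) -> float:
    """Mean squared TD error with the max taken over the target table."""
    total = 0.0
    for s, a, r, s_next in batch:
        target = r + gamma * max(q_target[(s_next, a2)] for a2 in actions)
        total += (target - q[(s, a)]) ** 2
    return total / len(batch)


q = {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 1.0, (1, 1): 0.0}
q_target = dict(q)  # frozen copy standing in for the target network
loss = td_loss(q, q_target, batch=[(0, 0, 1.0, 1)], actions=[0, 1], gamma=0.9)
print(loss)  # target = 1.0 + 0.9 * 1.0 = 1.9, error = 1.4, loss is about 1.96
```

Because the target takes a maximum over all actions, any overestimated value for an action never seen in the dataset propagates directly into the target, which is the extrapolation problem discussed above.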

In this paper, we adopt Mildly Conservative Q-learning (MCQ [15]) as the starting point of the design for both the high-level planner and the low-level executor, where OOD state-action pairs are actively trained by assigning proper pseudo Q values. In the considered c³-bidding problem, the policy distribution within the dataset exhibits a multi-modal pattern due to the highly non-stationary external market. A simple parameterized approach (e.g., using an MLP with cross-entropy) cannot work well, as it focuses on mapping input to output and neglects the full distribution. Instead, we utilize a conditional variational autoencoder (CVAE [37]) to model the distribution of the behavior policy $\mu$ . Given log data, the objective of the CVAE is to reconstruct actions conditioned on the states, such that the generated actions come from the same distribution as the actions in the log. The utilized CVAE is denoted as $G_{\omega}(s)$ , parameterized by $\omega$ , and consists of an encoder $G^{E}_{\omega_{1}}(s,a)$ and a decoder $G^{D}_{\omega_{2}}(s,z)$ . We optimize its variational lower bound by:

$$\mathcal{L}_{CVAE}(\omega)=\mathbb{E}\left[\big(a-G^{D}_{\omega_{2}}(s,z)\big)^{2}+KL\big(G^{E}_{\omega_{1}}(s,a),\mathcal{N}(0,\mathbf{I})\big)\right], \tag{12}$$

where hidden state $z=G^{E}_{\omega_{1}}(s,a)$ , $KL(\cdot)$ denotes the KL-divergence, $\mathcal{N}$ is multivariate normal distribution, and $\mathbf{I}$ is the identity matrix.
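For intuition, the two terms of Eqn. (12) can be evaluated by hand for a single sample. The sketch below assumes the encoder outputs a diagonal Gaussian given by a mean vector `mu` and a log-variance vector `log_var` (a common CVAE parameterization, not stated explicitly in the text), in which case the KL term has a closed form.

```python
import math
from typing import Sequence


def cvae_loss(action: Sequence[float], reconstruction: Sequence[float],
              mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Squared reconstruction error + KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    recon = sum((a - r) ** 2 for a, r in zip(action, reconstruction))
    kl = 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))
    return recon + kl


# Perfect reconstruction with a standard-normal posterior gives zero loss.
print(cvae_loss([1.0], [1.0], mu=[0.0], log_var=[0.0]))
# Reconstruction error 0.25 plus KL 0.5 for a unit-variance posterior at mu = 1.
print(cvae_loss([1.0], [0.5], mu=[1.0], log_var=[0.0]))
```

The reconstruction term pulls decoded actions toward the logged ones, while the KL term keeps the latent code close to the prior so that sampling $z$ from the prior at generation time yields in-distribution actions.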

Then, given a state $s$ , we generate several in-distribution actions $a_{\mu}$ by CVAE $G_{\omega}(s)$ , and the auxiliary loss for OOD actions is calculated by:

$$\mathcal{L}_{OOD}(\theta)=\mathbb{E}_{s\sim\mathcal{D}}\Big[\big(\max_{a_{\mu}\sim G_{\omega}(s)}Q_{\theta}(s,a_{\mu})-Q_{\theta}(s,a_{\pi})\big)^{2}\Big]. \tag{13}$$

If $a_{\pi}$ (generated by current policy $\pi$ ) is an OOD action, the auxiliary loss will limit the corresponding value estimation below the maximum value of in-distribution action $a_{\mu}$ . In this way, we help the value estimator stay conservative such that OOD actions will not be severely overestimated. However, MCQ cannot guarantee the budget allocation and bidding strategy to satisfy user requirements, hence we explicitly propose three key modules.
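The auxiliary OOD loss in Eqn. (13) can be sketched with plain callables standing in for the Q-network, the CVAE sampler and the current policy. All three stand-ins below (`q`, `sample_behavior`, `policy`) are hypothetical toy functions chosen so the numbers are easy to follow.

```python
from typing import Callable, List, Sequence


def ood_loss(q: Callable[[float, float], float],
             states: Sequence[float],
             sample_behavior: Callable[[float], List[float]],
             policy: Callable[[float], float]) -> float:
    """Mean squared gap between the best in-distribution Q-value and Q(s, a_pi)."""
    total = 0.0
    for s in states:
        q_in = max(q(s, a) for a in sample_behavior(s))  # pseudo target value
        total += (q_in - q(s, policy(s))) ** 2
    return total / len(states)


def q(s: float, a: float) -> float:   # toy Q-function peaking at a = s
    return -(a - s) ** 2


def sample_behavior(s: float) -> List[float]:  # stand-in for CVAE samples
    return [s - 0.1, s + 0.2]


def policy(s: float) -> float:         # policy action far from the logged data
    return s + 1.0


loss = ood_loss(q, [0.0, 1.0], sample_behavior, policy)
print(loss)  # each state contributes (-0.01 - (-1.0))^2, about 0.9801
```

Minimizing this term moves the value of the policy's action toward the best value among sampled in-distribution actions.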

V Our solution: HiBid

Due to the large solution space and multiple constraints of the considered c³-bidding problem, we propose a hierarchical offline DRL framework called HiBid, as shown in Figure 3. It comprises three modules: an auxiliary batch loss [16] to prevent over-allocating budget on specific channels (see Section V-A), $\lambda$-generalization [17] for constrained bidding to adaptively respond to changing budgets (see Section V-B), and a CPC-guided action selection (CPC-AS) scheme for satisfying the cross-channel CPC constraint (see Section V-C).

V-A Auxiliary Batch Loss for Non-competitive Budget Allocation

Recall that the high-level planner’s objective is to maximize the number of user clicks under the advertiser’s budget while ensuring the revenue on each channel stays within an acceptable range. However, the number of requests on each channel is limited and cannot accommodate all advertisers. If all of them compete for the requests on high-quality channels, it will certainly lift the bidding price and reduce advertisers’ CPC satisfactory ratio. A well-designed planner should allocate different budgets to each channel based on advertiser preference, while preventing budget over-allocation on specific channels.

A common solution is to leverage an advertiser-level constraint to limit the amount of budget allocated to a channel. However, this may reduce overall revenue, since the bidding abilities of advertisers are quite different and thus one cannot allocate an equal budget to all of them. Meanwhile, our design faces a key challenge: how to restrict the updated policy to satisfy the channel-capacity constraint, given that it is hard to design a suitable reward function. Inspired by [16], we design a batch loss for the high-level planner to ensure that the budget allocated to each channel fluctuates within an acceptable range. Consider that we sample a batch of experiences containing multiple tuples $(s_{p}^{h},a_{p}^{h},r_{p}^{h},c_{p}^{h},s_{p+1}^{h})$ from the high-level dataset $\mathcal{D}^{h}$ , then the batch loss is calculated by:

$$\mathcal{L}_{Batch}(\theta)=\sum_{p=1}^{P}\bigg(\sum_{s_{p}^{h}}Cost\big(\mathop{\arg\max}_{a_{p}^{h}\in\mathcal{A}^{h}}Q_{\theta}^{h}(s_{p}^{h},a_{p}^{h})\big)-\kappa_{p}\bigg)^{2}, \tag{14}$$

where $\kappa_{p}$ is calculated by multiplying historical CTR, CPC and impression count based on all advertisers in the batch, and a weight $w_{b}$ (see Table I) is applied to the batch loss for learning stability. Since the argmax function is not differentiable, a soft version is applied:

$$Cost\big(\mathop{\arg\max}_{a_{p}^{h}\in\mathcal{A}^{h}}Q_{\theta}^{h}(s_{p}^{h},a_{p}^{h})\big)\approx\sum_{n=1}^{|\mathcal{A}^{h}|}\frac{1}{Z}\exp\left[\beta Q_{\theta}^{h}(s_{p}^{h},a_{p,n}^{h})\right]Cost(a_{p,n}^{h}), \tag{15}$$

where $\beta$ is the temperature coefficient and $Z=\sum_{n=1}^{|\mathcal{A}^{h}|}\exp[\beta Q^{h}_{\theta}(s_{p}^{h},a_{p,n}^{h})]$ is the normalization factor. While the high-level planner updates its strategy using randomly sampled batch experiences to maximize the number of clicks, the batch loss encourages it to reallocate budgets across different channels based on historical statistics within the batch experiences. For example, it prompts some high-impact advertisers (i.e., those with higher conversion rates) to reduce their budget allocation on high-quality channels, thereby allowing low-impact advertisers to benefit from the superior advertising effects that these high-quality channels provide. This approach ensures that revenue on the channels does not fluctuate dramatically. Additionally, it prevents the high-level policy from consistently favoring higher investments in high-quality channels, which could lead to detrimental competition.
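A minimal numerical sketch of the batch loss follows. It assumes that the soft argmax weights each action's cost by the softmax probability $\exp[\beta Q]/Z$, as implied by the normalization factor $Z$ above; the batch layout, the per-action cost values and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def soft_argmax_cost(q_values: Sequence[float], action_costs: Sequence[float],
                     beta: float) -> float:
    """Softmax-weighted cost: sum_n exp(beta * Q_n) / Z * Cost(a_n)."""
    q_max = max(q_values)  # subtracting the max leaves the weights unchanged
    weights = [math.exp(beta * (qv - q_max)) for qv in q_values]
    z = sum(weights)
    return sum(w / z * c for w, c in zip(weights, action_costs))


def batch_loss(batch: List[Tuple[int, Sequence[float], Sequence[float]]],
               kappa: Dict[int, float], beta: float) -> float:
    """Sum over channels of (soft expected spend in the batch - kappa_p)^2."""
    spend = {p: 0.0 for p in kappa}
    for p, q_values, action_costs in batch:
        spend[p] += soft_argmax_cost(q_values, action_costs, beta)
    return sum((spend[p] - kappa[p]) ** 2 for p in kappa)


# Two advertisers' states on channel 0, each with two candidate allocations.
batch = [(0, [0.0, 0.0], [1.0, 3.0]), (0, [0.0, 0.0], [2.0, 4.0])]
print(batch_loss(batch, kappa={0: 5.0}, beta=1.0))  # spend 2 + 3 matches kappa
print(batch_loss(batch, kappa={0: 4.0}, beta=1.0))  # spend exceeds kappa by 1
print(soft_argmax_cost([0.0, 1.0], [1.0, 3.0], beta=50.0))  # close to 3.0
```

With a large $\beta$ the soft version approaches the cost of the greedy action, while a smaller $\beta$ spreads weight over more actions and smooths the gradient.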

V-B Offline Data Augmentation by $\lambda$-Generation and Optimal $\lambda$-Selection for Online Adaptive Bidding

As the advertiser’s total budget and the platform’s allocated budget may change over time, we need an adaptive bidding strategy that can dynamically respond to the budget changes. However, under the offline DRL training paradigm, the low-level executor learns the bidding strategy from the low-level dataset $\mathcal{D}^{l}$ which cannot be generalized to unseen budget cases, reducing the effectiveness of the high-level budget allocation. To this end, we adopt “$\lambda$-generalization” [17] to learn an adaptive bidding strategy for dynamically changing budgets.

In order to learn a bidding strategy that satisfies the budget constraint on each channel individually, the problem of c³-bidding without CPC constraint can be converted into its Lagrangian dual problem [38] as:

$$\begin{aligned}
&\min_{\lambda}\max_{a^{l}_{i}}\ \sum_{i=1}^{I}Click(a^{l}_{i})-\lambda\bigg(\sum_{i=1}^{I}Cost(a^{l}_{i})-a_{p}^{h}\bigg)\\
\Rightarrow\ &\min_{\lambda}\max_{a^{l}_{i}}\ \sum_{i=1}^{I}\bigg(Click(a_{i}^{l})-\lambda Cost(a_{i}^{l})\bigg)+\lambda a_{p}^{h}\quad \mathrm{s.t.}\quad\lambda\geq 0,
\end{aligned} \tag{16}$$

where $\lambda$ is the Lagrangian multiplier that controls how much the bidding strategy spends. Thus, we take it as part of the input to the Q-network and modify the low-level reward function by:

r^{l,\lambda}_{i}=r_{i}^{l}-\lambda c_{i}^{l}.

(17)

Then the low-level policy can be formulated as:

\pi^{l}(s_{i}^{l},\lambda)=\mathop{\arg\max}_{a_{i}^{l}\in\mathcal{A}^{l}}Q_{\theta}^{l}(s_{i}^{l},a_{i}^{l},\lambda).

(18)

Therefore, the low-level executor can adaptively respond to changing budgets by selecting the optimal $\lambda^{*}$. Two issues remain: how to train the policy offline under different values of $\lambda$, and how to obtain an optimal $\lambda^{*}$ under the budget constraint during online prediction.

During offline training, we perform data augmentation by $\lambda$-generation, allowing the policy to learn how to bid under different $\lambda$. Given a fixed training dataset consisting of multiple tuples, we extend it by enlarging each tuple $(s_{i}^{l},a_{i}^{l},r_{i}^{l},c_{i}^{l},s_{i+1}^{l})$ into $\{(s_{i}^{l},a_{i}^{l},r_{i}^{l,\lambda_{i}^{n}},c_{i}^{l},s_{i+1}^{l},\lambda_{i}^{n})\}_{n=1}^{N}$, where $N$ is the data repetition times and $\lambda_{i}^{n}$ is a value sampled uniformly from $[0,\lambda_{\max}]$. The range $[0,\lambda_{\max}]$ guarantees that the cost of the learned policy falls within a controllable range [17]. We also construct two additional evaluation networks $\hat{Q}^{c}_{\phi}$ and $\hat{Q}^{u}_{\eta}$ to accurately evaluate the expected cost and number of clicks under the low-level policy $\pi^{l}$. When computing the target value in the two evaluation network updates, we use the action from $\pi^{l}$ instead of the $\max$ operator:

\displaystyle\mathcal{L}_{\hat{Q}}(\phi)=\mathbb{E}_{\tau^{l}\sim\mathcal{D}^{l}}\left[c_{i}^{l}+\gamma^{l}\hat{Q}^{c}_{\phi}\big(s_{i+1}^{l},\pi^{l}(s_{i+1}^{l},\lambda_{i}^{n}),\lambda_{i}^{n}\big)-\hat{Q}^{c}_{\phi}(s_{i}^{l},a_{i}^{l},\lambda_{i}^{n})\right],

(19)

where $\tau^{l}$ denotes trajectories sampled from $\mathcal{D}^{l}$ and $c_{i}^{l}$ denotes the budget consumption under action $a_{i}^{l}$. We use the same loss function $\mathcal{L}_{\hat{Q}}(\eta)$ for $\hat{Q}^{u}_{\eta}$, with the number of clicks in place of the cost $c_{i}^{l}$.
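The $\lambda$-generation step described above can be sketched as follows. This is an illustrative sketch only: the function name `lambda_generation`, the tuple layout, and the toy transitions are assumptions, and the states are placeholders rather than real request-level features.

```python
import random


def lambda_generation(transitions, n_repeat, lambda_max, rng=None):
    """Enlarge each (s, a, r, c, s_next) tuple into n_repeat lambda-conditioned tuples.

    For every copy, a lambda is sampled uniformly from [0, lambda_max] and the
    reward is replaced by r - lambda * c, following Eqn. (17).
    """
    rng = rng or random.Random(0)  # fixed seed only for reproducibility of the sketch
    augmented = []
    for s, a, r, c, s_next in transitions:
        for _ in range(n_repeat):
            lam = rng.uniform(0.0, lambda_max)
            r_lam = r - lam * c  # lambda-penalized reward
            augmented.append((s, a, r_lam, c, s_next, lam))
    return augmented


if __name__ == "__main__":
    # Toy transitions: (state, action, click reward, cost, next state).
    data = [("s0", 1.0, 1.0, 2.0, "s1"), ("s1", 0.8, 0.0, 0.5, "s2")]
    for row in lambda_generation(data, n_repeat=3, lambda_max=1.45):
        print(row)
```

The dataset grows by a factor of $N$, which is consistent with the training time being proportional to $N$.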

During online prediction, the two evaluation networks are used to perform $\lambda$-selection, which ensures that the low-level policy does not exceed the budget allocated by the high-level planner. We uniformly sample $\{\lambda^{n}\}_{n=1}^{N_{p}}$ for each request $i$ within the range $[0,\lambda_{\max}]$, and select the $\lambda_{i}^{*}$ that satisfies the allocated budget $a_{p}^{h}$ while maximizing the estimated number of clicks:

\lambda_{i}^{*}=\mathop{\arg\max}_{\lambda\in\{\lambda^{n}\,|\,\hat{Q}^{c}_{\phi}(s_{i}^{l},\pi^{l}(s_{i}^{l},\lambda^{n}),\lambda^{n})\leq a_{p}^{h}\}_{n=1}^{N_{p}}}\hat{Q}^{u}_{\eta}(s_{i}^{l},\pi^{l}(s_{i}^{l},\lambda),\lambda),\quad\forall i.

(20)

In this way, the low-level executor can adaptively respond to the allocated budget $a_{p}^{h}$ and ensure the effectiveness of the budget allocation made by the high-level planner.
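A minimal sketch of the $\lambda$-selection in Eqn. (20) is given below. The callables `cost_estimate` and `click_estimate` are hypothetical stand-ins for $\hat{Q}^{c}_{\phi}$ and $\hat{Q}^{u}_{\eta}$ evaluated at the current state under $\pi^{l}$, and the fallback used when no sampled $\lambda$ is feasible is an assumption, since the text does not specify that case.

```python
import random


def select_lambda(cost_estimate, click_estimate, budget, lambda_max, n_samples, rng=None):
    """Pick, among uniformly sampled lambdas, the feasible one with the most expected clicks.

    cost_estimate(lam) and click_estimate(lam) are hypothetical callables that
    return the expected cost and expected clicks for a given lambda.
    """
    rng = rng or random.Random(0)
    candidates = [rng.uniform(0.0, lambda_max) for _ in range(n_samples)]
    feasible = [lam for lam in candidates if cost_estimate(lam) <= budget]
    if not feasible:
        # Assumption: fall back to the largest sampled lambda, which penalizes
        # cost the most in the lambda-modified reward.
        return max(candidates)
    return max(feasible, key=click_estimate)


if __name__ == "__main__":
    # Toy estimators: both cost and clicks shrink as lambda grows.
    cost = lambda lam: 100.0 / (1.0 + lam)
    click = lambda lam: 50.0 / (1.0 + lam)
    lam_star = select_lambda(cost, click, budget=60.0, lambda_max=1.45, n_samples=50)
    print(lam_star, cost(lam_star), click(lam_star))
```

In this toy setting the selected value is the smallest sampled $\lambda$ whose estimated cost still fits the budget, because expected clicks decrease with $\lambda$.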

V-C CPC-guided Action Selection for Cross-channel Constraint Satisfaction

Due to the varying target users and advertising quality, the competition situation differs among channels, resulting in a significant discrepancy in terms of CPC between channels (i.e., high-quality channels have higher CPC). Therefore, we cannot simplify the cross-channel CPC constraint by setting the same target CPC for each channel. When we use Lagrangian relaxation to deal with both budget and CPC constraints simultaneously, it is impossible to find an effective pair of Lagrangian multipliers to satisfy them due to the explosion of the solution space. We design a CPC-guided action selection (CPC-AS) scheme to help the low-level executor choose the action that satisfies the CPC constraint by considering both the past and the future.

When making a decision for a request $i$ , the final CPC for an advertiser $m$ is divided into two parts:

$\displaystyle CPC_{m}^{real}=\frac{\sum_{p=1}^{P}\sum_{i=1}^{I}Cost(a^{l}_{p,i})}{\sum_{p=1}^{P}\sum_{i=1}^{I}Click(a^{l}_{p,i})}$

$\displaystyle\quad=\frac{Cost_{m,t}+\sum_{p=1}^{P}\sum_{i=t}^{I}Cost(a^{l}_{p,i})}{Click_{m,t}+\sum_{p=1}^{P}\sum_{i=t}^{I}Click(a^{l}_{p,i})},$

(21)

where $Cost_{m,t}$ and $Click_{m,t}$ denote the cost and the number of clicks already accumulated up to the current request, respectively. As described in Section V-B, two evaluation networks $\hat{Q}^{c}_{\phi}$ and $\hat{Q}^{u}_{\eta}$ are developed to estimate the expected discounted cost and number of clicks, respectively. Given the current state-action pair $(s^{l}_{i},a^{l}_{i})$, we approximate the expected cost and number of clicks by $\hat{Q}^{c}_{\phi}(s^{l}_{i},a^{l}_{i},\lambda)$ and $\hat{Q}^{u}_{\eta}(s^{l}_{i},a^{l}_{i},\lambda)$. Therefore, we define $CPC_{m}^{pred}(a_{i}^{l})$ by combining the past and the expected future:

$\displaystyle CPC_{m}^{pred}(a_{i}^{l})=\frac{Cost_{m,t}+\sum_{p=1}^{P}\sum_{i=t}^{I}(\gamma^{l})^{i-1}Cost(a^{l}_{p,i})}{Click_{m,t}+\sum_{p=1}^{P}\sum_{i=t}^{I}(\gamma^{l})^{i-1}Click(a^{l}_{p,i})}$

$\displaystyle\quad=\frac{Cost_{m,t}+\sum_{p=1}^{P}\hat{Q}^{c}_{\phi}(s^{l}_{p,i},a^{l}_{p,i},\lambda)}{Click_{m,t}+\sum_{p=1}^{P}\hat{Q}^{u}_{\eta}(s^{l}_{p,i},a^{l}_{p,i},\lambda)}.$

(22)

Since ad requests from all channels arrive in sequence, and each request belongs to only one channel by definition, we cannot estimate the future of the other channels at the current request, so we save their most recent estimations and use them as the input. Together with $CPC_{m}^{pred}(a_{i}^{l})$, the low-level policy becomes:

\pi^{l}(s^{l}_{i},\lambda)=\mathop{\arg\max}_{a\in\{a_{i}^{l}\in\mathcal{A}^{l}\,|\,CPC_{m}^{pred}(a_{i}^{l})\leq CPC_{m}^{set}\}}Q^{l}_{\theta}(s^{l}_{i},a,\lambda).

(23)

Note that there is a slight bias between $CPC_{m}^{real}$ and $CPC_{m}^{pred}(a_{i}^{l})$ when $\gamma^{l}<1$. In Section VI-G, we show experimentally that the bias shrinks as $\gamma^{l}$ approaches $1$ and can be ignored in practice.
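The CPC-guided action selection of Eqns. (22) and (23) can be illustrated with the sketch below. All names are hypothetical: `future_cost[k]` and `future_click[k]` stand in for the evaluation-network estimates summed over channels for the $k$-th candidate action, and returning `None` when no action satisfies the constraint is an assumption, since the text does not cover that case.

```python
def cpc_guided_action(actions, q_values, future_cost, future_click,
                      past_cost, past_click, cpc_target):
    """Return the action with the highest Q-value among those whose predicted CPC
    does not exceed the advertiser's target CPC.

    The predicted CPC combines the realized cost and clicks with the estimated
    future cost and clicks of each candidate action.
    """
    best_action, best_q = None, float("-inf")
    for a, q, fc, fk in zip(actions, q_values, future_cost, future_click):
        total_clicks = past_click + fk
        if total_clicks <= 0:
            continue  # assumption: skip actions whose predicted CPC is undefined
        cpc_pred = (past_cost + fc) / total_clicks
        if cpc_pred <= cpc_target and q > best_q:
            best_action, best_q = a, q
    return best_action  # None when no candidate satisfies the constraint


if __name__ == "__main__":
    chosen = cpc_guided_action(
        actions=[0.5, 1.0, 1.5],
        q_values=[1.0, 2.0, 3.0],
        future_cost=[10.0, 30.0, 80.0],
        future_click=[10.0, 20.0, 30.0],
        past_cost=100.0, past_click=50.0, cpc_target=2.0,
    )
    print(chosen)
```

In this toy example the action with the largest Q-value has a predicted CPC of 2.25, which violates the target of 2.0, so the second-best action is chosen instead.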

Input: Log data $\mathcal{D}$, high-level CVAE $G^{h}_{\omega}$, Q-network $Q^{h}_{\theta}$ and target network $Q^{h}_{\theta^{\prime}}$, low-level CVAE $G^{l}_{\omega}$, Q-network $Q^{l}_{\theta}$ and target network $Q^{l}_{\theta^{\prime}}$, evaluation networks $\hat{Q}^{u}_{\eta}$ and $\hat{Q}^{c}_{\phi}$
1 Initialize all parameterized networks;
2 Process log data into high-level dataset $\mathcal{D}^{h}$ and low-level dataset $\mathcal{D}^{l}$;
3 for High-level update iteration $=1,2,3,\dots$ do
4   Sample a batch containing $J^{h}$ tuples $\{(s_{j}^{h},a_{j}^{h},r_{j}^{h},c_{j}^{h},s_{j+1}^{h})\}_{j=1}^{J^{h}}$ from $\mathcal{D}^{h}$;
5   Update high-level CVAE by minimizing Eqn. (12);
6   Calculate $\mathcal{L}_{Q}(\theta)$, $\mathcal{L}_{OOD}(\theta)$ and the batch loss $\mathcal{L}_{Batch}(\theta)$ by Eqn. (11), (13) and (14);
7   Update Q-network $Q_{\theta}^{h}$ by minimizing the weighted loss $w_{1}\mathcal{L}_{Q}(\theta)+w_{2}\mathcal{L}_{OOD}(\theta)+w_{b}\mathcal{L}_{Batch}(\theta)$;
8   Every $N_{target}$ iterations synchronize $Q^{h}_{\theta^{\prime}}\leftarrow Q^{h}_{\theta}$;
9 for Low-level update iteration $=1,2,3,\dots$ do
10   Sample a batch of experiences containing $J^{l}$ tuples $\{(s_{j}^{l},a_{j}^{l},r_{j}^{l},c_{j}^{l},s_{j+1}^{l})\}_{j=1}^{J^{l}}$ from $\mathcal{D}^{l}$;
11   Update low-level CVAE by minimizing Eqn. (12);
12   for $j=1,\dots,J^{l}$ do
13     Get the allocated budget $a^{h}_{p}$ using $Q^{h}_{\theta}$;
14     Uniformly sample $\{\lambda^{n}\}_{n=1}^{N}$ in the range $[0,\lambda_{\max}]$;
15     Calculate the reward $r_{j}^{l,\lambda_{j}^{n}}$ by Eqn. (17);
16     Include $\lambda^{n}$ and $a^{h}_{p}$ in $s_{j}^{l}$, then enlarge each tuple into $\{(s_{j}^{l},a_{j}^{l},r_{j}^{l,\lambda^{n}},c_{j}^{l},s_{j+1}^{l},\lambda_{j}^{n})\}_{n=1}^{N}$;
17   Calculate $\mathcal{L}_{Q}(\theta)$ and $\mathcal{L}_{OOD}(\theta)$ by Eqn. (11) and (13);
18   Update Q-network $Q_{\theta}^{l}$ by minimizing the weighted loss $w_{1}\mathcal{L}_{Q}(\theta)+w_{2}\mathcal{L}_{OOD}(\theta)$;
19   Update evaluation networks $\hat{Q}^{u}_{\eta}$ and $\hat{Q}^{c}_{\phi}$ by minimizing $\mathcal{L}_{\hat{Q}}(\eta)$ and $\mathcal{L}_{\hat{Q}}(\phi)$ in Eqn. (19);
20   Every $N_{target}$ iterations synchronize $Q^{l}_{\theta^{\prime}}\leftarrow Q^{l}_{\theta}$;

Algorithm 1 Offline Training

Input: Trained high-level Q-network $Q_{\theta}^{h}$, low-level Q-network $Q_{\theta}^{l}$, and evaluation networks $\hat{Q}^{u}_{\eta}$ and $\hat{Q}^{c}_{\phi}$
1 while Incoming ad request $i$ do
2   Get advertiser-level feature $s_{p}^{h}$ and request-level feature $s_{i}^{l}$ from the platform;
3   Allocate budget $a^{h}_{p}=\mathop{\arg\max}_{a_{p}^{h}\in\mathcal{A}^{h}}Q_{\theta}^{h}(s_{p}^{h},a_{p}^{h})$;
4   Find the optimal $\lambda_{i}^{*}$ according to $a_{p}^{h}$ by Eqn. (20);
5   Calculate $\{CPC_{m}^{pred}(a_{i}^{l})|a_{i}^{l}\in\mathcal{A}^{l}\}$ by Eqn. (22);
6   Calculate bidding action $a_{i}^{l}$ by Eqn. (23);

Algorithm 2 Online Prediction

V-D Algorithm Description

V-D1 Offline Training

We first show the pseudo-code for offline training in Algorithm 1. At the beginning, we initialize all parameterized networks (Line 1). Then we process the data for the high-level and low-level training individually (Line 2), to use the available log data efficiently. For the high-level planner, after sampling a batch of experiences from $\mathcal{D}^{h}$ (Line 4), we first update the CVAE by Eqn. (12) (Line 5). Using the sampled experiences, we calculate $\mathcal{L}_{Q}(\theta)$, $\mathcal{L}_{OOD}(\theta)$ and $\mathcal{L}_{Batch}(\theta)$ by Eqn. (11), (13) and (14), and then update the high-level Q-network with the weighted loss (Lines 6-7). Finally, the target network $Q^{h}_{\theta^{\prime}}$ is synchronized periodically (Line 8). For the low-level executor, we sample a batch of experiences from $\mathcal{D}^{l}$ and then update the low-level CVAE by Eqn. (12) (Lines 10-11). For each sampled experience, we use the trained high-level Q-network to determine the allocated budget (Line 13). Then, we augment the original tuple by sampling multiple $\{\lambda^{n}\}_{n=1}^{N}$ and modifying the reward (Lines 14-15). The sampled $\lambda^{n}$ and the allocated budget $a^{h}_{p}$ are incorporated into the state $s_{j}^{l}$ (Line 16). Using the augmented tuples, we update the low-level Q-network with the weighted loss by Eqn. (11) and (13) (Lines 17-18). The two evaluation networks are updated by minimizing $\mathcal{L}_{\hat{Q}}(\eta)$ and $\mathcal{L}_{\hat{Q}}(\phi)$ in Eqn. (19) (Line 19), and the low-level target network is synchronized periodically (Line 20).

V-D2 Online Prediction

The pseudo-code for online prediction is given in Algorithm 2. For each ad request $i$, HiBid gets the advertiser-level feature $s_{p}^{h}$ and the request-level feature $s_{i}^{l}$ as in Section $3$ from the advertising platform (Line 2). Then, the high-level planner allocates budget $a_{p}^{h}$ by taking the advertiser-level information as the input of $Q_{\theta}^{h}$ (Line 3). Together with the two evaluation networks $\hat{Q}^{u}_{\eta}$ and $\hat{Q}^{c}_{\phi}$, the low-level executor finds the optimal $\lambda_{i}^{*}$ according to Eqn. (20) (Line 4). To satisfy the CPC constraint, $CPC_{m}^{pred}(a_{i}^{l})$ is calculated for each action by Eqn. (22) (Line 5). Considering both the output of the Q-network $Q_{\theta}^{l}(s_{i}^{l},\lambda_{i}^{*})$ and $\{CPC_{m}^{pred}(a_{i}^{l})|a_{i}^{l}\in\mathcal{A}^{l}\}$, the low-level executor gives the final bidding action based on Eqn. (23) (Line 6).

VI Experiment

VI-A Setup

VI-A1 Dataset

We use large-scale log data collected from the Meituan advertising system (which is deployed online and serves real-time traffic) for offline training and performance evaluation. The data contains $28$ days of bidding logs and is divided into two parts, i.e., $21$ days for training and $7$ days for evaluation. On average, we sample $64,272$ advertisers and $70$ million ad requests on $4$ channels per day. We follow the common setting in [15] for most hyper-parameters (leaving two key ones for hyper-parameter tuning in Section VI-B) and list them in Table II.

VI-A2 Offline Evaluation System

Deploying a model to an online system without evaluating its potential effects is risky, since it may lead to significant loss of revenue. To avoid this, we design an offline evaluation system for cross-channel bidding scenarios, which includes two modules: advertising system simulator and user feedback predictor. The former simulates Meituan’s real online advertising platform, including the process of retrieval, bidding, ranking and pricing. The latter is used to predict the user’s feedback on certain advertisements and provide the advertising results for evaluation.

VI-A3 Evaluation Metrics

We introduce five metrics to evaluate the performance of HiBid in the considered c³-bidding problem: (a) total impression counts ($Impr$), (b) total number of clicks ($Click$), (c) average CPC ($CPC$), (d) average CPC satisfactory ratio ($CSR$), and (e) average ROI ($ROI$) of all advertisers. In particular, $CSR$ is the average of $\mathds{1}_{CPC_{m}^{real}\leq CPC_{m}^{set}}$ over all advertisers, where $\mathds{1}$ is the indicator function. To accurately evaluate the effectiveness of various methods and eliminate the influence of specific application scenarios, we report each metric as a normalized score with respect to the statistical result obtained from the log data (offline performance evaluation) or the online solution R-BCQ [17] (online A/B testing).
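As an illustration of the $CSR$ definition above, the following sketch averages the indicator over advertisers. The function name and the inputs are hypothetical, and skipping advertisers with zero clicks (whose realized CPC is undefined) is an assumption not stated in the text.

```python
def cpc_satisfactory_ratio(costs, clicks, cpc_targets):
    """Fraction of advertisers whose realized CPC does not exceed their target CPC.

    costs, clicks, cpc_targets: per-advertiser totals and targets (hypothetical inputs).
    """
    satisfied = 0
    counted = 0
    for cost, click, target in zip(costs, clicks, cpc_targets):
        if click == 0:
            continue  # assumption: realized CPC is undefined, so the advertiser is skipped
        counted += 1
        if cost / click <= target:
            satisfied += 1
    return satisfied / counted if counted else 0.0


if __name__ == "__main__":
    # Advertiser 1 meets its target (CPC 2.0 <= 2.0), advertiser 2 does not (3.0 > 2.5),
    # and advertiser 3 has no clicks and is skipped.
    print(cpc_satisfactory_ratio([10.0, 30.0, 5.0], [5, 10, 0], [2.0, 2.5, 1.0]))
```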

VI-A4 System Setup

We implement HiBid with TensorFlow 1.15 and use $4$ NVIDIA A100 GPUs for offline training. The training time is directly proportional to the choice of $N$, and training takes about $29$ hours when $N=30$. For online prediction, we deploy HiBid on $233$ servers, each of which is equipped with an Intel(R) Xeon(R) Platinum 8352Y CPU @ 2.20GHz and an NVIDIA A30 GPU. The maximum number of concurrent requests is about $18,899$ (during the business peak period) and the inference time is shown in Table V.

TABLE II: Key hyper-parameters in HiBid

Hyper-parameter	Value
high-level and low-level batch sizes	4096, 1024
high-level and low-level learning rates	1e-5, 1e-5
discount factors $\gamma^{h},\gamma^{l}$	$1,0.999$
loss weights $w_{1},w_{2},w_{b}$	$1,0.05,0.1$
high-level decision interval	1 day
$\lambda$ sampling range $\lambda_{\max}$	1.45
data repetition times $N$	$30$
number of sampled $\lambda$ in online prediction $N_{p}$	$50$
low-level action range $[a^{l}_{\min},a^{l}_{\max}]$	$[0.5,1.5]$

VI-B Hyper-parameter Tuning

We first show the results of hyper-parameter tuning in HiBid for the loss weight $w_{b}$ (in the batch loss; see Section V-A) and the data repetition times $N$ (in $\lambda$-generalization; see Section V-B). We tune $w_{b}\in\{0,0.02,0.04,\dots,0.2\}$ to investigate the effect of the weighted batch loss on budget allocation, and $N\in\{0,1,5,10,20,30,40,50,60\}$ to study the impact of sample efficiency on the bidding strategy.

During practical deployment, the platform requires that the fluctuation of revenue does not exceed $1\%$. Thus, $\epsilon$ is set to $0.01\kappa_{p}$ for each channel $p$. From Figure 4(a), we see that the capacity satisfactory ratio increases as $w_{b}$ increases. However, $Click$ peaks at $w_{b}=0.1$ and then gradually decreases. This is because updating the budget allocation strategy with a larger $w_{b}$ prevents the over-allocation issue on high-quality channels, but when $w_{b}$ is too large, the batch loss has a negative impact on the policy updating process, resulting in a poor budget allocation strategy and poor advertising results. As shown in Figure 4(b), as the data repetition times $N$ increases, the low-level executor satisfies the allocated budget more accurately, since more experiences help the bidding strategy generalize to unseen budget cases. However, overuse of the training experiences may result in poor training efficiency. Therefore, we choose batch loss weight $w_{b}=0.1$ and data repetition times $N=30$ for the performance comparison hereafter.

TABLE III: Ablation Study

Method		Impr	Click	ROI	CPC	CSR
HiBid		0.03%	10.93%	4.53%	-6.14%	8.94%
High level	.. w/o batch loss	-4.50%	-1.28%	3.44%	-0.03%	2.76%
High level	.. w/o budget allocation	-5.93%	-1.97%	4.32%	-0.88%	2.02%
Low level	.. w/o $\lambda$ -generalization	-3.35%	-1.14%	5.21%	-1.71%	4.67%
	.. w/o CPC-AS	-2.70%	3.99%	-2.88%	6.47%	-6.54%
	.. w/o $\lambda$ -generalization & CPC-AS	-6.44%	8.43%	-7.90%	12.26%	-9.87%

VI-C Ablation Study

We gradually remove four key components from the high-level planner and the low-level executor in HiBid, to verify the benefits brought by each component. The results are shown in Table III.

We first show the benefits of the batch loss and budget allocation in the high-level planner. When the batch loss module is removed, the $CPC$ score rises by $6.11$ percentage points (from $-6.14\%$ to $-0.03\%$), and $CSR$ drops significantly from $8.94\%$ to $2.76\%$. We also present the specific changes in $Impr$ and $CPC$ across the four channels in Table IV. We observe that when the batch loss module is removed, due to limited supply from high-quality channels (i.e., Channel 1), advertisers exhibit a stronger investment intention (seen as an increased value of $Impr$), which results in increased advertising cost (higher $CPC$) on high-quality channels. However, the summed $Impr$ over the four channels decreases, and the average $CPC$ increases overall, which is unfavorable for both the advertisers’ investment and the platform itself. $Click$ decreases from $10.93\%$ to $-1.97\%$ when we remove the entire budget allocation module. In this case, the low-level executor bids for each incoming request without a budget, leading unsuitable advertisers to take away and waste the click opportunities on that channel.

TABLE IV: Impact of batch loss

Metric	Method	Channel 1	Channel 2	Channel 3	Channel 4
$Impr$	HiBid	0.01%	0.03%	0.03%	0.04%
$Impr$	.. w/o batch loss	7.64%	-7.28%	-8.85%	8.56%
$CPC$	HiBid	-3.65%	-5.88%	-7.29%	-10.86%
$CPC$	.. w/o batch loss	6.76%	-1.67%	-2.24%	-1.65%

We further examine the impact of $\lambda$-generalization and CPC-AS on the low-level bidding strategy. Compared to HiBid without $\lambda$-generalization, the $Click$ and $CSR$ scores of HiBid are higher by $12.07$ and $4.27$ percentage points, respectively. This is because the low-level executor accurately adapts its bidding strategy according to the allocated budget, and avoids taking away the budget originally assigned to other channels, which improves advertising performance. When we remove CPC-AS, $CSR$ drops significantly from $8.94\%$ to $-6.54\%$, since the higher bids result in more clicks but have a negative impact on the advertisers’ expectations. When both $\lambda$-generalization and CPC-AS are removed, $Impr$ and $ROI$ drop drastically, which confirms the benefits of introducing $\lambda$-generalization and CPC-AS as contributions of this paper.

VI-D Offline Performance Comparison

We compare HiBid with six baselines:

1. CBRL [8]: A curriculum-guided Bayesian reinforcement learning (CBRL) framework with an indicator-augmented reward function that adaptively controls the constraint-objective trade-off for ROI-constrained single-channel bidding. It is considered the state-of-the-art approach. For a fair comparison, we adapt CBRL to the c³-bidding problem by replacing the ROI constraint with the CPC constraint while keeping the original training process.

2. OPAL [13]: Its high-level agent is trained by unsupervised learning, which provides a temporal abstraction for the low-level agent to improve offline policy optimization. It is considered the state-of-the-art hierarchical offline DRL method. We adapt it by training a CVAE for budget allocation with the same network structure as HiBid.

3. MCQ [15]: It alleviates the value overestimation that occurs for OOD actions by actively training and correcting their Q values. We consider it the state-of-the-art approach in offline DRL.

4. PRDC [39]: Another offline DRL method with a policy regularization mechanism. It employs dataset constraints with nearest-neighbor retrieval to allow the policy to choose better actions that do not appear in the dataset, while still maintaining sufficient conservatism for OOD actions.

5. CEM [40]: The Cross-Entropy Method (CEM) is a popular evolutionary algorithm. We treat the c³-bidding problem as a black-box optimization problem that aims to maximize the number of clicks under the total budget and CPC constraints. Note that we always choose the policy set that is close to $CPC_{m}^{set}$ for policy updating in a CEM iteration.

6. PID [23]: The Proportional-Integral-Derivative (PID) controller is a classical, widely adopted feedback controller that performs well in unknown environments. We keep the advertiser’s current CPC close to the target $CPC_{m}^{set}$ to satisfy the advertiser’s cross-channel CPC constraint.

Since OPAL, MCQ and PRDC do not handle the CPC and budget constraints by themselves, we explicitly account for these two constraints by designing a reward function, based on Meituan’s past online service experience, that aims to maximize $Click$:

\displaystyle\big(1+w_{x}*\overline{CPC}*(1+CPC_{m}^{set})\big)*r_{i}^{l}-w_{y}*c_{i}^{l},

(24)

where $\overline{CPC}$ is the calculated average CPC, and $w_{x}=1.35$ and $w_{y}=1.31$ are manually set weights. Results are shown in Figure 5 and Figure 6. By combining budget allocation and constrained bidding, HiBid enhances the matching efficacy between ads and users, thereby improving $CPC$ and $CSR$ without taking extra impressions away from others. $Impr$ is influenced by the average bidding price decided by the proposed low-level executor, since there are other advertising products competing for these ad requests. The slight fluctuation in HiBid’s $Impr$ indicates that the average bidding price from HiBid is close to the baseline. This is attributed to the proposed batch loss, which constrains the bidding expenditure of advertisers on different channels, as demonstrated in Table IV. The second-best method is CBRL, with $3.62\%$ and $3.24\%$ improvements over the online solution R-BCQ in terms of $Click$ and $CSR$, respectively. CBRL gradually learns a bidding strategy that satisfies the constraints through curriculum learning, but its performance is still worse than HiBid due to the lack of accurate budget allocation. MCQ actively updates OOD state-action pairs using the experiences from the dataset, and the learned conservative strategy might miss some high-revenue actions. Therefore, as the number of model training steps increases, $CSR$ slightly improves, but $ROI$ remains essentially unchanged. PRDC performs marginally below MCQ, as the nearest-neighbor-based regularization cannot effectively guide policy iterations in the highly non-stationary markets presented in the log. OPAL achieves a $0.66\%$ improvement in $Impr$ but its $Click$ still declines. The reason is that its budget allocation strategy is learned through supervised learning and thus lacks proper adjustments for undesirable budget allocation experiences. Due to limitations in policy representation capability, CEM cannot accommodate advertisers’ requirements, resulting in a decrease of $CSR$ to $-5.32\%$. Although PID dynamically adjusts bidding prices based on the current CPC, it does not take into account the inconsistent average CPC across channels and forces all channels to achieve the same level. Therefore, PID obtains a $10.15\%$ improvement in $CSR$ but performs poorly on all other metrics.

TABLE V: Online A/B Testing

Method	$Impr$	$Click$	$ROI$	$CPC$	$CSR$	TP999 (ms)
HiBid	0.15%	11.20%	5.15%	-6.63%	9.02%	33.8
CBRL [8]	0.21%	3.15%	2.16%	-1.89%	3.79%	29.1
MCQ [15]	0.92%	2.67%	-0.30%	0.52%	1.89%	28.4
PRDC [39]	0.75%	2.39%	-0.56%	0.50%	1.91%	27.9
OPAL [13]	0.66%	-0.58%	-5.98%	-4.70%	-2.01%	31.1
CEM [40]	-0.03%	2.49%	-6.61%	-1.76%	-4.39%	-
PID [41]	-0.82%	-14.45%	-2.07%	0.63%	9.38%	-

VI-E Online A/B Testing on Meituan Advertising Platform

HiBid and the six baselines are validated on the Meituan advertising platform, whose system architecture is shown in Figure 7. Multiple ad requests are generated when users use Meituan apps, and each request goes through four modules in the advertising platform:

Retrieval module. Each ad request represents a browsing event from a user, so the advertising platform retrieves a set of advertisers for this ad based on the current user behavior and historical statistics. The selected advertisers have the chance to participate in the ad request auction.

Bid module. The advertising platform sends remote procedure call (RPC) requests to the RTB system to obtain the bidding price of each selected advertiser for this ad. Then, the RTB system constructs request-level features and advertiser-level features for online prediction and returns the bidding price to the advertising platform.

Rank module. When all the advertisers’ bidding prices are obtained, the advertising platform ranks the bidding advertisers in descending order of their bidding price. The advertiser with the highest bid wins this ad and has the chance to display it.

Price module. The simulator deployed a GSP auction with CPC pricing schemes identical to the online advertising platform, which means that only if the user clicks the ad, the displayed advertiser will be charged the price of the second-highest bid in this ad auction.
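The rank and price modules described above can be sketched for a single ad slot as follows. This is a simplified illustration, not the platform's implementation: the function names are hypothetical, ranking uses the bidding price only, and charging zero when there is no second bidder (i.e., no reserve price) is an assumption.

```python
def run_gsp_auction(bids):
    """Rank advertisers by bid and return (winner, price charged if the ad is clicked).

    bids: dict mapping advertiser id -> bidding price (hypothetical input format).
    """
    if not bids:
        return None, 0.0
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    # Second-price rule: the winner pays the second-highest bid.
    # Assumption: with a single bidder and no reserve price, the price is zero.
    price = ranked[1][1] if len(ranked) > 1 else 0.0
    return winner, price


def charge(price, clicked):
    """CPC pricing: the advertiser is charged only when the user clicks the ad."""
    return price if clicked else 0.0


if __name__ == "__main__":
    winner, price = run_gsp_auction({"adv_a": 1.2, "adv_b": 0.9, "adv_c": 1.5})
    print(winner, price, charge(price, clicked=True), charge(price, clicked=False))
```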

Subsequently, the advertising platform will display ads to users based on the auction results, and record user feedback (e.g., whether they click on the ad or make an order) in Meituan’s Data Center. The daily bidding logs are automatically processed and stored in the Feature Center for high-level and low-level training. After the offline training, the trained model is synchronized to the RTB system for bidding services. This allows the high-level planner and low-level executor of our proposed HiBid to rapidly iterate and adapt to the changing environment, leading to performance improvement in large-scale advertising systems in practice.

Online experiments are conducted using A/B testing for two weeks, and the results are shown in Table V. HiBid outperforms all baselines on most metrics, with improvements of $11.20\%$ in $Click$ and $5.15\%$ in $ROI$ over the online solution R-BCQ. PID achieves a slightly higher $CSR$, but it performs poorly on the other metrics. In addition, we show the TP999 (completion time for $99.9\%$ of requests) for each model in Table V. Due to the introduced optimal $\lambda$-selection and CPC-AS mechanism, the inference time of HiBid is slightly increased, but it is still acceptable.

VI-F Synthetic Dataset Validation

Currently, RTB-related research efforts have not offered a publicly available dataset for performance comparison. Thus, to ease fair algorithm comparisons and code reproduction for the research community, we develop a cross-channel constrained bidding simulator to generate the synthetic dataset and make it available online. It emulates the ad display process of the advertising platform depicted in Fig. 7. We implement the core modules as a simplified version of the advertising system, including retrieval, bid, rank, and price. Furthermore, we simulate ad requests and user feedback based on the distribution statistics obtained from Meituan’s online production system, as:

TABLE VI: Synthetic Dataset Validation

Method	$Impr$	$Click$	$ROI$	$CPC$	$CSR$
HiBid	0%	15.25%	6.38%	-8.13%	14.53%
CBRL	0%	5.65%	1.80%	-6.16%	8.9%
MCQ	0%	3.95%	0.86%	-5.67%	7.84%
PRDC	0%	3.65%	0.78%	-5.35%	7.41%
OPAL	0%	1.13%	-3.75%	-0.74%	3.61%

Request and advertiser simulation. The simulator first randomly generates a few advertisers, who are categorized into several types representing the varying business conditions of advertisers in the online statistics. Each advertiser possesses a total budget, expected CPC, historical CTR, CVR, GMV, etc., which are sampled from a Gaussian distribution based on the advertiser’s category. Then, the simulator initializes the total ad requests, each of which belongs to one of $P$ channels. The number of ad requests and the arrival distribution of each channel are kept relatively consistent with the online platform. The simulator replays all the ad requests and uses a base bidding strategy (i.e., a CTR-based strategy that bids the predicted CTR multiplied by $CPC_{m}^{set}$) to conduct auctions.

User feedback simulation. After the ad auction is finished, the simulator samples the user feedback on the displayed ads from multiple Gaussian distributions, including whether the user clicks, whether the user places an order, and the order amount. Then we update the advertiser’s daily statistics (i.e., real-time CTR, CVR, budget consumption, the number of clicks, etc.) to simulate the real-time features on the platform. Both the user feedback and the auction process are recorded in the dataset for offline model training.
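The two simulation steps above can be sketched as follows. This is only an illustration under stated assumptions: the category parameters, the field names, and the clipping of sampled attributes at zero are hypothetical and are not taken from the released simulator.

```python
import random


def sample_advertiser(category_params, rng):
    """Sample advertiser attributes from per-category Gaussian distributions.

    category_params: dict mapping attribute name -> (mean, std); values are hypothetical.
    Negative draws are clipped to zero (an assumption of this sketch).
    """
    return {name: max(0.0, rng.gauss(mean, std))
            for name, (mean, std) in category_params.items()}


def base_bid(predicted_ctr, cpc_set):
    """CTR-based base strategy: bid the predicted CTR multiplied by the target CPC."""
    return predicted_ctr * cpc_set


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical category: (mean, std) for each advertiser attribute.
    small_merchant = {"budget": (200.0, 50.0), "cpc_set": (1.5, 0.3), "ctr": (0.05, 0.01)}
    advertiser = sample_advertiser(small_merchant, rng)
    print(advertiser, base_bid(advertiser["ctr"], advertiser["cpc_set"]))
```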

Our synthetic dataset is publicly available online (https://drive.google.com/drive/folders/11TmSXZFtwiXhy1kyQdvEvzc5Mu-cHI7S) and includes 5,000 advertisers bidding for 3,500,000 ad requests over 7 days. During model evaluation, we first restore and replay all ad requests. Then we use the RL-based bidding strategy instead of the base strategy to participate in the auctions, and summarize the bidding results to evaluate the model performance. Results are shown in Table VI. Since there is no external competition in the simulator, the total impressions remain unchanged. We see that HiBid consistently outperforms the other baselines, improving $CSR$ while bringing more clicks for advertisers.

VI-G HiBid Performance Verification Upon Optimality

VI-G1 Bias Incurred during CPC-AS Derivation

To verify the bias between our proposed $CPC_{m}^{pred}(a_{i}^{l})$ in Eqn. (22) and $CPC_{m}^{real}$ , we sample $13,633,825$ trajectories from the low-level dataset $\mathcal{D}^{l}$ and calculate the bias with different $\gamma^{l}\in[0,1]$ . The mean absolute percentage error $\mathrm{MAPE}(\gamma^{l})$ is then calculated by:

\mathrm{MAPE}(\gamma^{l})=\mathrm{average}\bigg(\frac{|CPC_{m}^{pred}(\gamma^{l},a_{i}^{l})-CPC_{m}^{real}|}{CPC_{m}^{real}}\bigg|_{\tau^{l}}\bigg),

(25)

and the obtained curve is shown in Figure 9. We observe that $\mathrm{MAPE}(\gamma^{l})$ decreases rapidly as $\gamma^{l}$ increases; at $\gamma^{l}=0.999$ the difference is less than $0.001$ and can thus be ignored.
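The bias measurement in Eqn. (25) can be illustrated with a simplified sketch. It assumes each trajectory starts with no accumulated history (so the past terms $Cost_{m,t}$ and $Click_{m,t}$ are zero), discounts step $k$ by $(\gamma^{l})^{k}$ with $k$ starting at zero, and uses the true per-step costs and clicks instead of evaluation-network outputs; the function names and the toy trajectory are hypothetical.

```python
def cpc_real(costs, clicks):
    """Realized CPC of one trajectory: total cost divided by total clicks."""
    return sum(costs) / sum(clicks)


def cpc_pred(costs, clicks, gamma):
    """Discounted counterpart of the realized CPC (simplified: no past terms)."""
    disc_cost = sum(gamma ** k * c for k, c in enumerate(costs))
    disc_click = sum(gamma ** k * u for k, u in enumerate(clicks))
    return disc_cost / disc_click


def mape(trajectories, gamma):
    """Mean absolute percentage error between predicted and realized CPC.

    trajectories: list of (costs, clicks) pairs, each a list of per-step values.
    """
    errors = []
    for costs, clicks in trajectories:
        real = cpc_real(costs, clicks)
        errors.append(abs(cpc_pred(costs, clicks, gamma) - real) / real)
    return sum(errors) / len(errors)


if __name__ == "__main__":
    toy = [([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0])]
    for g in (0.5, 0.9, 0.999, 1.0):
        print(g, mape(toy, g))
```

On this toy trajectory the error vanishes at $\gamma^{l}=1$ and shrinks as $\gamma^{l}$ approaches $1$, which matches the trend reported above.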

VI-G2 Optimality Performance of HiBid in Small Scale Dataset

Because millions of advertisers bid for billions of ad requests every day, the solution space explodes exponentially, and it is impossible to obtain the whole day’s information about incoming requests in advance, which makes it difficult to obtain an optimal solution to the considered c³-bidding problem. However, the high-level budget allocation problem can be solved by optimization methods if the number of advertisers is not too large and the information of all participating advertisers is known a priori. To verify the performance gap between the budget allocation strategy obtained by HiBid and the optimum, we construct a small-scale dataset containing $2,000$ advertisers and obtain the optimal budget allocation through a solver [42]. The distributions of the allocated budget are shown in Figure 8, and the MAPE and RMSE between HiBid and the optimum are $4.76\%$ and $1.09$, respectively. This small gap indicates that HiBid’s budget allocation is close to optimal, which enables it to cope with complex and dynamic real-world bidding environments.

VII Concluding Remark

In this paper, we propose HiBid, a hierarchical offline DRL framework for cross-channel bidding with budget allocation. HiBid builds on the state-of-the-art offline DRL approach MCQ with three contributions: (a) an auxiliary batch loss to alleviate advertiser competition in high-quality channels, (b) $\lambda$-generalization for an adaptive constrained bidding strategy in response to changing budgets, and (c) a CPC-guided action selection scheme for improving the cross-channel CPC satisfactory ratio. Both offline experiments and online A/B testing on the Meituan advertising platform show that HiBid outperforms six baselines. HiBid has been deployed online and already serves tens of thousands of advertisers every day. Furthermore, some works [43, 44] have introduced auction mechanisms for resource allocation (e.g., VMs) in cloud computing. In such auction-based allocations, a hierarchical architecture can also be employed to enhance resource distribution efficiency, benefiting both the platform and its users. While the primary application of our work is ad display, the proposed hierarchical architecture has the potential to support other applications that require resource allocation through auctions.

References

[1] Q. Feng, D. He, M. Luo, X. Huang, and K.-K. R. Choo, “Eprice: An efficient and privacy-preserving real-time incentive system for crowdsensing in industrial internet of things,” IEEE Transactions on Computers, pp. 1–15, 2023.
[2] H. Zhang, H. Jiang, B. Li, F. Liu, A. V. Vasilakos, and J. Liu, “A framework for truthful online auctions in cloud computing with heterogeneous user demands,” IEEE Transactions on Computers, vol. 65, no. 3, pp. 805–818, 2016.
[3] Z. Zheng, Y. Gui, F. Wu, and G. Chen, “Star: Strategy-proof double auctions for multi-cloud, multi-tenant bandwidth reservation,” IEEE Transactions on Computers, vol. 64, no. 7, pp. 2071–2083, 2015.
[4] Y. Chen, P. Berkhin, B. Anderson, and N. R. Devanur, “Real-time bidding algorithms for performance-based display ad allocation,” in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011, pp. 1307–1315.
[5] H. R. Varian, “Online ad auctions,” American Economic Review, vol. 99, no. 2, pp. 430–34, May 2009.
[6] H. Zhu, J. Jin, C. Tan, F. Pan, Y. Zeng, H. Li, and K. Gai, “Optimized cost per click in taobao display advertising,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 2191–2200.
[7] H. Cai, K. Ren, W. Zhang, K. Malialis, J. Wang, Y. Yu, and D. Guo, “Real-time bidding by reinforcement learning in display advertising,” in Proceedings of the tenth ACM international conference on web search and data mining, 2017, pp. 661–670.
[8] H. Wang, C. Du, P. Fang, S. Yuan, X. He, L. Wang, and B. Zheng, “Roi-constrained bidding via curriculum-guided bayesian reinforcement learning,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 4021–4031.
[9] S. Xiao, L. Guo, Z. Jiang, L. Lv, Y. Chen, J. Zhu, and S. Yang, “Model-based constrained mdp for budget allocation in sequential incentive marketing,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 971–980.
[10] N. Alon, I. Gamzu, and M. Tennenholtz, “Optimizing budget allocation among channels and influencers,” in Proceedings of the 21st international conference on World Wide Web, 2012, pp. 381–388.
[11] H. A. Taha, Operations research: an introduction. Prentice Hall, Upper Saddle River, NJ, 2003, vol. 7.
[12] E. Altman, Constrained Markov decision processes: stochastic modeling. Routledge, 1999.
[13] A. Ajay, A. Kumar, P. Agrawal, S. Levine, and O. Nachum, “Opal: Offline primitive discovery for accelerating offline reinforcement learning,” in International Conference on Learning Representations, 2021.
[14] J. Gehring, G. Synnaeve, A. Krause, and N. Usunier, “Hierarchical skills for efficient exploration,” Advances in Neural Information Processing Systems, vol. 34, pp. 11 553–11 564, 2021.
[15] J. Lyu, X. Ma, X. Li, and Z. Lu, “Mildly conservative q-learning for offline reinforcement learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 1711–1724, 2022.
[16] G. Liao, Z. Wang, X. Wu, X. Shi, C. Zhang, Y. Wang, X. Wang, and D. Wang, “Cross dqn: Cross deep q network for ads allocation in feed,” in Proceedings of the ACM Web Conference, 2022, pp. 401–409.
[17] Y. Zhang, B. Tang, Q. Yang, D. An, H. Tang, C. Xi, X. Li, and F. Xiong, “Bcorle ( $\lambda$ ): An offline reinforcement learning and evaluation framework for coupons allocation in e-commerce market,” Advances in Neural Information Processing Systems, vol. 34, pp. 20 410–20 422, 2021.
[18] W. Zhang, S. Yuan, and J. Wang, “Optimal real-time bidding for display advertising,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 1077–1086.
[19] T. Zhou, H. He, S. Pan, N. Karlsson, B. Shetty, B. Kitts, D. Gligorijevic, S. Gultekin, T. Mao, J. Pan, J. Zhang, and A. Flores, “An efficient deep distribution network for bid shading in first-price auctions,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2021, pp. 3996–4004.
[20] W. Zhang, B. Kitts, Y. Han, Z. Zhou, T. Mao, H. He, S. Pan, A. Flores, S. Gultekin, and T. Weissman, “Meow: A space-efficient nonparametric bid shading algorithm,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2021, pp. 3928–3936.
[21] K. Ren, W. Zhang, K. Chang, Y. Rong, Y. Yu, and J. Wang, “Bidding machine: Learning to bid for directly optimizing profits in display advertising,” IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 4, pp. 645–659, 2018.
[22] D. Wu, X. Chen, X. Yang, H. Wang, Q. Tan, X. Zhang, J. Xu, and K. Gai, “Budget constrained bidding by model-free reinforcement learning in display advertising,” in Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 1443–1451.
[23] X. Yang, Y. Li, H. Wang, D. Wu, Q. Tan, J. Xu, and K. Gai, “Bid optimization by multivariable control in display advertising,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery and data mining, 2019, pp. 1966–1974.
[24] R. Gao, H. Xia, J. Li, D. Liu, S. Chen, and G. Chun, “Drcgr: Deep reinforcement learning framework incorporating cnn and gan-based for interactive recommendation,” in IEEE International Conference on Data Mining, 2019, pp. 1048–1053.
[25] R. Xie, S. Zhang, R. Wang et al., “Hierarchical reinforcement learning for integrated recommendation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 5, 2021, pp. 4521–4528.
[26] C.-C. Lin, K.-T. Chuang, W. C.-H. Wu, and M.-S. Chen, “Combining powers of two predictors in optimizing real-time bidding strategy under constrained budget,” in Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 2016, pp. 2143–2148.
[27] Y. He, X. Chen, D. Wu, J. Pan, Q. Tan, C. Yu, J. Xu, and X. Zhu, “A unified solution to constrained bidding in online display advertising,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2021, pp. 2993–3001.
[28] J. Zhao, G. Qiu, Z. Guan, W. Zhao, and X. He, “Deep reinforcement learning for sponsored search real-time bidding,” in Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and data mining, 2018, pp. 1021–1030.
[29] A. Kumar, J. Fu, M. Soh, G. Tucker, and S. Levine, “Stabilizing off-policy q-learning via bootstrapping error reduction,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[30] N. Jaques, A. Ghandeharioun, J. H. Shen, C. Ferguson, A. Lapedriza, N. Jones, S. Gu, and R. Picard, “Way off-policy batch deep reinforcement learning of implicit human preferences in dialog,” arXiv preprint arXiv:1907.00456, 2019.
[31] S. Fujimoto, D. Meger, and D. Precup, “Off-policy deep reinforcement learning without exploration,” in International conference on machine learning, 2019, pp. 2052–2062.
[32] J. García and F. Fernández, “A comprehensive survey on safe reinforcement learning,” Journal of Machine Learning Research, vol. 16, no. 42, pp. 1437–1480, 2015.
[33] D. Ding, K. Zhang, T. Basar, and M. Jovanovic, “Natural policy gradient primal-dual method for constrained markov decision processes,” Advances in Neural Information Processing Systems, vol. 33, pp. 8378–8390, 2020.
[34] H. Le, C. Voloshin, and Y. Yue, “Batch policy learning under constraints,” in International Conference on Machine Learning, 2019, pp. 3703–3712.
[35] C. Tessler, D. J. Mankowitz, and S. Mannor, “Reward constrained policy optimization,” in International Conference on Learning Representations, 2019.
[36] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[37] K. Sohn, H. Lee, and X. Yan, “Learning structured output representation using deep conditional generative models,” Advances in neural information processing systems, vol. 28, 2015.
[38] S. Paternain, L. Chamon, M. Calvo-Fullana, and A. Ribeiro, “Constrained reinforcement learning has zero duality gap,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[39] Y. Ran, Y.-C. Li, F. Zhang, Z. Zhang, and Y. Yu, “Policy regularization with dataset constraint for offline reinforcement learning,” arXiv preprint arXiv:2306.06569, 2023.
[40] P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, “A tutorial on the cross-entropy method,” Annals of operations research, vol. 134, no. 1, pp. 19–67, 2005.
[41] K. H. Ang, G. Chong, and Y. Li, “Pid control system analysis, design, and technology,” IEEE Transactions on Control Systems Technology, vol. 13, no. 4, pp. 559–576, 2005.
[42] MindOpt, “Mindopt studio,” 2022. [Online]. Available: https://opt.aliyun.com
[43] L. Mashayekhy, M. M. Nejad, D. Grosu, and A. V. Vasilakos, “An online mechanism for resource allocation and pricing in clouds,” IEEE Transactions on Computers, vol. 65, no. 4, pp. 1172–1184, 2016.
[44] X. Zhu, C. Chen, L. T. Yang, and Y. Xiang, “Angel: Agent-based scheduling for real-time tasks in virtualized clouds,” IEEE Transactions on Computers, vol. 64, no. 12, pp. 3389–3403, 2015.