HiBid: A Cross-Channel Constrained Bidding System with Budget Allocation by Hierarchical Offline Deep Reinforcement Learning

Hao Wang*, Bo Tang*, Chi Harold Liu, Shangqin Mao, Jiahong Zhou, Zipeng Dai, Yaqi Sun, Qianlong Xie, Xingxing Wang, Dong Wang

H. Wang and B. Tang contributed equally to this work. H. Wang, C. H. Liu (Corresponding Author), and Z. Dai are with School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China. Email: liuchi02@gmail.com. B. Tang is with Meituan and Institute for Advanced Algorithms Research, Shanghai, China. S. Mao, J. Zhou, Y. Sun, Q. Xie, X. Wang and D. Wang are with Meituan, Beijing, China.

Abstract

Online display advertising platforms serve numerous advertisers by providing real-time bidding (RTB) for billions of ad requests every day. The bidding strategy handles ad requests across multiple channels to maximize the number of clicks under the set financial constraints, i.e., total budget and cost-per-click (CPC). Unlike existing works, which mainly focus on single-channel bidding, we explicitly consider cross-channel constrained bidding with budget allocation. Specifically, we propose a hierarchical offline deep reinforcement learning (DRL) framework called “HiBid”, consisting of a high-level planner equipped with an auxiliary loss for non-competitive budget allocation, and a data-augmentation-enhanced low-level executor that adapts its bidding strategy to the allocated budgets. Additionally, a CPC-guided action selection mechanism is introduced to satisfy the cross-channel CPC constraint. Through extensive experiments on both large-scale log data and online A/B testing, we confirm that HiBid outperforms six baselines in terms of the number of clicks, CPC satisfactory ratio, and return-on-investment (ROI). We have also deployed HiBid on the Meituan advertising platform, where it already serves tens of thousands of advertisers every day.

Index Terms:
Real-time Bidding Systems; Cross-Channel Bidding; Deep Reinforcement Learning;

I Introduction

Real-time systems are witnessing an ever-growing significance in our society, including Industrial Internet of Things (IIoT) systems employed for crowdsensing [1], online cloud auction systems that facilitate seamless streaming services [2, 3], and real-time bidding (RTB) systems that have emerged as an indispensable component in modern online E-commerce, by offering real-time bidding services [4] for millions of advertisers participating in online display advertising. RTB creates an advertisement (ad) auction that varies depending on the platform, to allow interested advertisers to bid simultaneously for ad impressions. Different auction mechanisms (e.g., generalized first price in Google AdSense and Vickrey-Clarke-Groves in Facebook) and pricing mechanisms (e.g., cost-per-mille and cost-per-time) are chosen based on advertising and service format on the platform. One common type is the Generalized Second Price (GSP [5]) auction with cost-per-click (CPC [6]) pricing, which charges advertisers the second-highest bidding price if a user clicks on their ad. It provides a fair and flexible advertising format with an efficient evaluation of advertising performance for advertisers, which plays an important role in the online advertising industry.
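To make the pricing rule concrete, the following is a minimal sketch of a second-price auction with CPC charging. It is a simplification under stated assumptions: a single ad slot, ranking by raw bid only (no quality scores or reserve price), and a hypothetical function name `second_price_cpc`. It is not the platform's actual auction logic.

```python
from typing import List, Optional, Tuple


def second_price_cpc(bids: List[float], clicked: bool) -> Tuple[Optional[int], float]:
    """Single-slot second-price auction with CPC pricing (simplified).

    Returns (index of the winning bid, amount charged to the winner).
    """
    if not bids:
        return None, 0.0
    # Rank bids from highest to lowest; ties go to the earlier bidder
    # because Python's sort is stable.
    order = sorted(range(len(bids)), key=lambda k: bids[k], reverse=True)
    winner = order[0]
    # The winner pays the runner-up's bid, and only if the user clicks.
    second_price = bids[order[1]] if len(bids) > 1 else 0.0
    return winner, (second_price if clicked else 0.0)
```

With bids of 1.0, 2.5 and 2.0, the second advertiser wins and is charged 2.0 on a click and nothing otherwise, which is why cost and clicks are tied together in the CPC constraint considered later.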

Figure 1: Overview of the considered cross-channel constrained bidding (c3-bidding) scenario.

As shown in Fig. 1, for better use of limited budget within a financial constraint like CPC, more advertising platforms have begun to offer automated bidding services across various ad channels, such as recommendation ads (i.e., ads that recommend products to a potential user), search ads (i.e., ads shown in response to user search content), etc. Different ad channels have varying numbers of ad impressions and users on the basis of their behavioral habits. Some high-quality channels, which are reflected by higher conversion rate (CVR) and click-through rate (CTR), could bring better advertising effectiveness and thus revenue as a return.

It is essential to allocate budget to different channels for three reasons. (a) From the advertisers’ perspective, because peak times and ad request volumes differ across channels, investing more budget in the channels that suit them best can avoid excessive consumption in other channels, leading to higher ROI. (b) From the platform’s perspective, because available ad space is limited, appropriately allocating advertisers’ budgets helps advertisers pinpoint their target users more precisely across different channels, thereby enhancing ad relevance and effectiveness. More effective advertisements encourage advertisers to increase their investments in ad campaigns, which grows the platform’s revenue, a mutually beneficial outcome. (c) From a market perspective, because high-quality channels are more cost-effective for all advertisers, distributing the total budget across channels in a reasonable way may prevent advertisers from competing for those high-quality channels simultaneously.

Extensive works have been conducted on constrained bidding. Most of them focused on improving the bidding strategy in a single channel under the set budget [7, 8] but did not adjust their bidding or budget allocation strategy across all channels. Hence, they cannot scale well to cross-channel bidding problems. Previous works on budget allocation [9, 10] derived the optimal strategy using a prediction model to forecast the expected return of the bidding strategy. However, the bidding and allocation strategies in these works are “stand-alone”, even though many dynamic factors affect the performance of the RTB system and cause fluctuations in both budget allocation and bidding price. Our insight is therefore that integrating the two lets them exchange feedback and may bring better advertising results to advertisers.

In this paper, we explicitly consider the problem of cross-channel constrained bidding with budget allocation, “c3-bidding”, where the key challenges are: (a) advertisers may allocate most of their budgets to high-quality channels, leading to possible contention and a decrease in overall performance; (b) the advertisers’ daily budgets and the platform-allocated budget on each channel are highly likely to change over time, and therefore the bidding strategy should dynamically adapt to the changing budget constraint; and (c) since each channel is bid on independently, it is challenging to ensure the budget constraint for individual channels while guaranteeing the cross-channel CPC constraints.

In practice, billions of ad requests arrive sequentially for tens of thousands of advertisers, and thus the solution space of the considered c3-bidding problem is huge and cannot be solved by conventional optimization methods [11]. Meanwhile, the rapid advancement of offline deep reinforcement learning (DRL) demonstrates its potential for learning an optimal policy that also satisfies the given constraints from large-scale data. Furthermore, we observe that there is a hierarchy of budget allocation and constrained bidding in RTB. That is, the former assigns a percentage of the budget to each channel from a market perspective, while the latter cares more about a suitable bidding price to win ad impression opportunities under the allocated budget. Thus, we model the considered c3-bidding problem as a hierarchical Constrained Markov Decision Process (CMDP [12]). Existing hierarchical approaches [13, 14] neither address the cross-channel CPC constraints nor consider the decline in performance due to inappropriate allocation. To this end, we propose a novel hierarchical offline DRL framework called “HiBid”, using the state-of-the-art offline DRL approach MCQ [15] as the starting point of our design. Our contribution is three-fold:

  1. We propose “HiBid”, a hierarchical DRL framework for the c3-bidding problem, which maintains a high-level planner for budget allocation and a low-level executor for cross-channel constrained bidding.

  2. We introduce a batch loss [16] for budget allocation to prevent over-allocation on specific channels, and $\lambda$-generalization [17] for constrained bidding to adaptively respond to changing budgets. We then propose a CPC-guided action selection mechanism that significantly improves the cross-channel CPC satisfactory ratio and is also applicable to other metrics.

  3. We conduct extensive experiments on a large-scale real dataset from the Meituan advertising platform. Results show that HiBid outperforms six baselines in terms of number of clicks, CPC satisfactory ratio and ROI. We also deployed HiBid and performed online A/B testing to validate its effectiveness.

The rest of the paper is organized as follows. We review the related work in Section II and present the system model in Section III. We introduce preliminaries in Section IV. We propose HiBid in Section V, followed by the experimental results in Section VI. Finally, we conclude the paper in Section VII.

II Related Work

II-A Real-Time Bidding (RTB) systems

RTB attracts much attention and has been widely studied for various applications [18]. Some efforts have been devoted to designing bidding mechanisms to enhance the effectiveness and fairness of advertising auctions from the platform perspective. For example, Zhou et al. in [19] introduced a novel deep distribution network for optimal bidding in both open and closed online first-price auctions. Zhang et al. in [20] proposed a succinct and effective bid shading algorithm without parametric assumptions for the win distribution. Ren et al. in [21] proposed a comprehensive framework to jointly optimize user response prediction and bid landscape forecasting. Furthermore, there have been studies that approach the optimization of bidding strategies from the advertiser perspective, to improve the effectiveness of their ads during auctions. For example, Wu et al. in [22] developed a model-free DRL framework “DRLB” for constrained bidding to cope with the volatility of the auction environment. Yang et al. in [23] abstracted the essential demand of advertisers in RTB and proposed an effective linear programming solution. Those works focused on optimizing bidding prices under the given constraint for a single ad channel, to adapt to the unpredictability of the auction environment and satisfy advertisers’ requirements. However, there exist multiple advertising channels with significant quality differences in practical deployment. Also, some studies developed ways to allocate budget across multiple ad channels given a total budget constraint, e.g., some works [9, 10] formalized it as well-stated optimization problems. These methods required accurate estimation of outcome distributions (e.g., the expected number of clicks from choosing a particular budget), which is impracticable in a dynamic auction environment.

Different from the above works, this paper explicitly focuses on simultaneous budget allocation and constrained bidding for multiple ad channels, to maximize the advertising effectiveness for advertisers while ensuring platform revenue.

II-B Deep Reinforcement Learning (DRL)

DRL has been widely applied in real-time systems, including user-item recommendation [24], ad-slots allocation [25], and real-time bidding for ad impression auctions [26, 7, 8, 27]. Cai et al. in [26] and Zhao et al. in [28] utilized DRL to learn the optimal bid for a single ad in display advertising and sponsored search, respectively. He et al. in [27] formulated the budget and financial constraints simultaneously and leveraged DRL to find a unified optimal bidding function on behalf of an advertiser. Unfortunately, the optimal bidding function may not yield optimal results in uncertain auction markets. Wang et al. in [8] proposed a curriculum-guided Bayesian DRL method to generalize to highly dynamic ad markets with ROI constraints. However, the mentioned DRL-based works only focus on single ad channel bidding and have not considered joint modeling across multiple channels. Due to the joint constraint settings of the advertiser’s financial requirements across channels, bidding individually cannot yield optimal results. It is insightful to model the relationship between channels jointly for bidding through a unified approach. Therefore, we leverage hierarchical reinforcement learning to jointly model cross-channel bidding and adjust bidding strategies through high-level budget allocation, which is one motivation of our work.

DRL provides a promising approach to address the c3-bidding problem by interacting with the environment and updating the policy iteratively. However, training the agent online is not suitable due to the potential financial risk involved. In offline DRL, the agent learns from a fixed dataset of past interactions rather than online in real time. The main challenge of offline DRL is the distribution shift [29] of state-action visitation between the learned policy and the behavior policy. Recent work [30] utilized distributional penalties to regularize the learned policy to stay close to the behavior policy. Other methods [31] used generative models to approximate the behavior distribution so as to stay within the support of the offline data during the value backup. Ajay et al. in [13] proposed a hierarchical offline reinforcement learning method with unsupervised primitive extraction. However, directly applying existing offline DRL algorithms may not effectively solve our considered c3-bidding problem, as none of them handles the cross-channel CPC constraint and the changing allocated budget.

Constrained DRL focuses on designing efficient algorithms to find optimal policies for CMDP problems under the given constraints [32]. Some works converted the CMDP problem into a Lagrangian dual problem [33], and then found an optimal Lagrangian multiplier $\lambda$ together with the corresponding policy that satisfies the constraint. Here $\lambda$ can be manually adjusted as a hyperparameter, but it is policy-sensitive and hard to fine-tune. In recent works, gradient descent [34] or bisection search [35] was developed to obtain the optimal value of $\lambda$. Unfortunately, the policy needs to be retrained every time the value of $\lambda$ changes, until the constraint is satisfied. This iterative training process is unacceptable in the c3-bidding problem, because the frequently changing budget would incur huge computational overhead. Thus, in this paper we adopt a $\lambda$-generalization [17] method to learn diversified bidding strategies that can dynamically respond to the changed budget constraint. The cross-channel CPC constraint, however, requires a more appropriate treatment, which is a key contribution of this paper: the CPC-guided action selection in Section V-C.
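As an illustration of why searching for the multiplier is costly, the sketch below runs a bisection search on a toy Lagrangian relaxation. Everything in it is an assumption made for illustration: requests are hypothetical `(value, price)` pairs, a request is bought when `value - lam * price > 0`, and `spend_at` is a cheap stand-in for what would, in a constrained DRL setting, be a full retraining and evaluation of the policy at each candidate multiplier.

```python
from typing import List, Tuple

Request = Tuple[float, float]  # (value of winning, price paid); toy data


def spend_at(lam: float, requests: List[Request]) -> float:
    """Total spend when buying every request whose relaxed gain is positive."""
    return sum(price for value, price in requests if value - lam * price > 0)


def bisect_lambda(requests: List[Request], budget: float,
                  lo: float = 0.0, hi: float = 100.0, iters: int = 50) -> float:
    """Approximate the smallest multiplier in [lo, hi] whose spend fits the budget.

    Assumes spend_at(hi) <= budget, i.e. hi is large enough to be feasible.
    Spend is non-increasing in lam, so bisection applies.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if spend_at(mid, requests) > budget:
            lo = mid  # still overspending: penalize cost more
        else:
            hi = mid  # feasible: try a smaller penalty
    return hi
```

Each loop iteration calls `spend_at` once. When that call stands for retraining a policy, dozens of iterations per budget change become impractical, which is the overhead a multiplier-conditioned strategy is meant to avoid.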

TABLE I: Important notations used in this paper.
| Notation | Explanation |
|---|---|
| $P, p$ | The total number, index of channels |
| $M, m$ | The total number, index of advertisers |
| $I_p, i$ | The total number, index of ad requests on channel $p$ |
| $B_m, CPC_m^{set}$ | Total budget and CPC constraint set by advertiser $m$ |
| $Click(\cdot), Cost(\cdot)$ | The number of clicks and actual cost |
| $a_{m,p,i}$ | Bidding price offered by advertiser $m$ for request $i$ on channel $p$ |
| $s_p^h, a_p^h, r_p^h, c_p^h$ | State, action, reward and cost for channel $p$ in high-level MDP |
| $s_i^l, a_i^l, r_i^l, c_i^l$ | State, action, reward and cost of ad request $i$ in low-level MDP |
| $Q^{(\cdot)}_{\theta}, Q^{(\cdot)}_{\theta'}, \hat{Q}^{u}_{\eta}, \hat{Q}^{c}_{\phi}$ | Q-network, target network and evaluation networks |
| $\lambda, \lambda^{*}_{i}$ | Lagrangian multiplier in bidding strategy, optimal $\lambda$ for ad request $i$ |
| $N, N_p$ | Data repetition times in offline training and sample number of $\lambda$ in online prediction |
| $\mathcal{L}_{(\cdot)}, w_1, w_2, w_b$ | Loss functions, weight of Q-function loss, OOD action loss, and batch loss |
| $Impr, Click$ | The total number of impressions and clicks |
| $ROI, CPC, CSR$ | Return-on-investment, cost-per-click, and CPC satisfactory ratio |

III System Model

In this paper, our overall objective is to maximize the total number of ad clicks while satisfying every advertiser’s budget and CPC constraints and keeping the platform’s revenue within an acceptable range. Without loss of generality, we consider an advertising platform serving $M$ advertisers across $P$ ad channels, with $I_p$ ad requests on channel $p$ in a day. Each advertiser $m$ sets a daily budget $B_m$ (i.e., the maximum amount of money they are willing to spend on their advertising campaign) and an expected cost-per-click $CPC_m^{set}$. The overall objective is then formulated as:

$$
\begin{aligned}
\underset{a_{m,p,i}}{\mathrm{maximize}} \quad & \sum_{m=1}^{M}\sum_{p=1}^{P}\sum_{i=1}^{I_p} Click(a_{m,p,i}) && (1)\\
\mathrm{subject\ to:} \quad & \left|\sum_{m=1}^{M} Cost(a_{m,p,i}) - \kappa_p\right| \leq \epsilon, \quad \forall p \in \{1,\dots,P\}, && (2)\\
& \frac{\sum_{p=1}^{P}\sum_{i=1}^{I_p} Cost(a_{m,p,i})}{\sum_{p=1}^{P}\sum_{i=1}^{I_p} Click(a_{m,p,i})} \leq CPC_m^{set}, && (3)\\
& \sum_{p=1}^{P}\sum_{i=1}^{I_p} Cost(a_{m,p,i}) \leq B_m, \quad \forall m \in \{1,\dots,M\}, && (4)
\end{aligned}
$$

where $a_{m,p,i}$ represents the bidding price offered by advertiser $m$ for request $i$ on channel $p$. This bidding price indicates the amount advertiser $m$ is willing to pay to display their ad in response to request $i$ on that particular channel. $Cost(a_{m,p,i})$ corresponds to the actual expense incurred by advertiser $m$ when their bid for request $i$ is successful. $Click(a_{m,p,i})$ indicates whether or not a user clicks on the ad after it is displayed. If the offered bidding price does not win the auction, both $Cost(a_{m,p,i})$ and $Click(a_{m,p,i})$ are set to $0$. The constraint in Eqn. (2) is added to prevent each channel’s revenue from fluctuating too much. $\epsilon > 0$ is a constant representing the platform’s acceptable fluctuation range. $\kappa_p$ is calculated by multiplying the historical CTR, CPC and impression count, and represents the expected consumption capacity in an ideal situation.
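The objective and constraints in Eqns. (1)-(4) can be checked on logged outcomes with a few lines of code. The sketch below is illustrative only: the nested-list layout `outcomes[m][p]`, the function names, and the toy numbers are assumptions, and Eqn. (2) is read here as summing the cost over all advertisers and all requests on channel $p$.

```python
from typing import Dict, List, Tuple

# outcomes[m][p] holds one (cost, click) pair per ad request on channel p
# for advertiser m; both entries are 0 when the bid loses the auction.
Outcomes = List[List[List[Tuple[float, int]]]]


def expected_capacity(ctr: float, cpc: float, impressions: int) -> float:
    """kappa_p as the product of historical CTR, CPC and impression count."""
    return ctr * cpc * impressions


def check_plan(outcomes: Outcomes, budgets: List[float], cpc_set: List[float],
               kappa: List[float], eps: float) -> Dict[str, object]:
    # Objective (1): total number of clicks over advertisers, channels, requests.
    clicks = sum(k for adv in outcomes for ch in adv for _, k in ch)
    # Constraint (2): per-channel revenue stays within eps of kappa_p.
    channel_cost = [sum(c for adv in outcomes for c, _ in adv[p])
                    for p in range(len(kappa))]
    revenue_ok = all(abs(c - k) <= eps for c, k in zip(channel_cost, kappa))
    cpc_ok, budget_ok = True, True
    for adv, budget, cpc_limit in zip(outcomes, budgets, cpc_set):
        cost = sum(c for ch in adv for c, _ in ch)
        n_clicks = sum(k for ch in adv for _, k in ch)
        # Constraint (3), written as cost <= CPC * clicks to avoid dividing by 0.
        cpc_ok = cpc_ok and cost <= cpc_limit * n_clicks
        # Constraint (4): total spend within the daily budget.
        budget_ok = budget_ok and cost <= budget
    return {"clicks": clicks, "revenue_ok": revenue_ok,
            "cpc_ok": cpc_ok, "budget_ok": budget_ok}
```

Note that constraint (3) couples all channels of one advertiser through a single ratio, so a channel with cheap clicks can offset an expensive one. This coupling is what makes the CPC constraint cross-channel.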

In practice, the incoming ad requests of all advertisements are not known a priori. This makes it hard to employ traditional combinatorial optimization methods to solve the considered c3-bidding problem. Considering its inherent hierarchical nature, we can allocate budgets to all the advertisers from the platform’s perspective. For each advertiser, bids can then be made based on the allocated budget and the set financial constraints. The budget allocation for advertisers and the decision-making of bidding prices for sequentially incoming ad requests both exhibit Markovian properties. Therefore, we model the c3-bidding problem as a hierarchical CMDP where the high-level and low-level MDPs are executed on different timescales. As shown in Figure 2, the high-level MDP is responsible for allocating the budget at intervals, while the low-level MDP bids for each ad request according to the allocated budget.
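The two timescales can be pictured as a nested loop: one allocation decision per interval, then one bid per incoming request. The sketch below is a toy under stated assumptions, not the HiBid implementation: `plan` and `bid` are hypothetical callables standing in for the high-level planner and low-level executor, each request is a `(market_price, clicked)` pair, and a winning bid pays the market price only on a click.

```python
from typing import Callable, List, Tuple

Request = Tuple[float, bool]  # (market price to beat, whether the user clicks)


def run_interval(total_budget: float,
                 channel_requests: List[List[Request]],
                 plan: Callable[[float, int], List[float]],
                 bid: Callable[[float], float]) -> Tuple[int, float]:
    """Run one allocation interval and return (clicks, total spend)."""
    # High-level step: split the budget across channels once per interval.
    allocation = plan(total_budget, len(channel_requests))
    clicks, spent = 0, 0.0
    for budget_p, requests in zip(allocation, channel_requests):
        remaining = budget_p
        # Low-level steps: one bidding decision per incoming ad request.
        for market_price, clicked in requests:
            price = min(bid(remaining), remaining)  # never bid above what is left
            if price < market_price:
                continue  # auction lost: no cost, no click
            if clicked:
                clicks += 1
                remaining -= market_price
                spent += market_price
    return clicks, spent
```

Because every bid is capped by the remaining channel budget, spending on a channel can never exceed what the high-level step allocated to it. This is the sense in which the low-level bidding is conditioned on the allocation.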

III-A High-level MDP for Budget Allocation

The high-level planner needs to allocate the budget to maximize the number of user clicks while ensuring the revenue on each channel stays within an acceptable range. Thus the objective of the high-level planner is:

$$
\begin{aligned}
\underset{a_{m,p}^{h}}{\mathrm{maximize}} \quad & \sum_{m=1}^{M}\sum_{p=1}^{P} Click^{h}(a^{h}_{m,p}) && (5)\\
\mathrm{subject\ to:} \quad & \sum_{p=1}^{P} a^{h}_{m,p} \leq B_m, \quad \forall m \in \{1,\dots,M\}, && (6)\\
& \left|\sum_{m=1}^{M} Cost^{h}(a^{h}_{m,p}) - \kappa_p\right| \leq \epsilon, \quad \forall p \in \{1,\dots,P\}, && (7)
\end{aligned}
$$

where $a_{m,p}^{h}$ denotes the allocated budget on channel $p$ for advertiser $m$, and $Click^{h}(a_{m,p}^{h})$ and $Cost^{h}(a_{m,p}^{h})$ represent the number of clicks and the cost given the budget $a_{m,p}^{h}$ within the interval, respectively. Meanwhile, the revenue constraint in Eqn. (7) is also regarded as a channel-capacity constraint that prevents advertisers from engaging in severe competition for high-quality channels.

Figure 2: Hierarchical CMDP modeling for c3-bidding.

We formulate the high-level MDP for each advertiser $m$ as a tuple $(\mathcal{S}^{h}, \mathcal{A}^{h}, \mathcal{T}^{h}, \gamma^{h}, \mathcal{R}^{h}, \mathcal{C}^{h})$. That is, the high-level planner allocates the channel-level budget at each interval according to the advertiser’s budget requirements, while each channel allocation is regarded as a decision step. The details are defined as follows:

State 𝒮hsuperscript𝒮\mathcal{S}^{h}caligraphic_S start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT: Let 𝒮hsuperscript𝒮\mathcal{S}^{h}caligraphic_S start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT denote the higher-level state space, consisting of the allocated budget and the historical statistics on the advertiser-level (e.g., the average CTR and CVR).

Action 𝒜hsuperscript𝒜\mathcal{A}^{h}caligraphic_A start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT: The action aph𝒜hsuperscriptsubscript𝑎𝑝superscript𝒜a_{p}^{h}\in\mathcal{A}^{h}italic_a start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT ∈ caligraphic_A start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT is the budget assigned to channel p𝑝pitalic_p. We discretize the action by using the percentage of the budget and mask invalid actions that go beyond the total budget.

Reward $\mathcal{R}^{h}$: The reward $r^{h}_{p}\in\mathcal{R}^{h}$ is defined as the total number of user clicks in channel $p$ within the interval.

Constraint $\mathcal{C}^{h}$: The channel-capacity constraint and total budget constraint are given in Eqn. (6) and Eqn. (7), respectively.
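The action discretization and masking described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the percentage grid, the function names `valid_action_mask` and `masked_greedy_action`, and the toy Q-values are hypothetical and are not taken from the HiBid implementation.

```python
from typing import List


def valid_action_mask(percentages: List[float], total_budget: float,
                      already_allocated: float) -> List[bool]:
    """Mark an action valid if allocating pct * total_budget fits the remaining budget."""
    remaining = total_budget - already_allocated
    # A small tolerance guards against floating-point rounding at the boundary.
    return [pct * total_budget <= remaining + 1e-9 for pct in percentages]


def masked_greedy_action(q_values: List[float], mask: List[bool]) -> int:
    """Return the index of the highest-valued action among the valid ones."""
    candidates = [i for i, ok in enumerate(mask) if ok]
    if not candidates:
        raise ValueError("no valid action: the budget is already exceeded")
    return max(candidates, key=lambda i: q_values[i])


# Hypothetical example: five discrete actions, 600 of 1000 budget units already used.
percentages = [0.0, 0.25, 0.5, 0.75, 1.0]
mask = valid_action_mask(percentages, total_budget=1000.0, already_allocated=600.0)
q_values = [0.1, 0.4, 0.9, 1.2, 0.3]  # toy Q-values, one per action
best = masked_greedy_action(q_values, mask)
```

In this toy case only the 0% and 25% actions remain valid, so the greedy choice is restricted to those two even though a masked action has a larger Q-value.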

III-B Low-level MDP for Cross-Channel Constrained Bidding

After receiving the allocated budget $a^{h}_{p}$ on each channel, the low-level executor aims to maximize the number of clicks while simultaneously satisfying that budget and the CPC constraint. For each advertiser $m$, the objective function of the low-level executor can be formulated as:

$$\underset{a^{l}_{p,i}}{\mathrm{maximize}}\quad \sum_{p=1}^{P}\sum_{i=1}^{I_{p}}Click(a^{l}_{p,i}) \tag{8}$$
$$\mathrm{subject\ to:}\quad \sum_{i=1}^{I_{p}}Cost(a^{l}_{p,i})\leq a^{h}_{p},\quad\forall p\in\{1,\dots,P\}, \tag{9}$$
$$\frac{\sum_{p=1}^{P}\sum_{i=1}^{I_{p}}Cost(a^{l}_{p,i})}{\sum_{p=1}^{P}\sum_{i=1}^{I_{p}}Click(a^{l}_{p,i})}\leq CPC_{m}^{set}, \tag{10}$$

where $Cost(a^{l}_{p,i})$ and $Click(a^{l}_{p,i})$ denote the real cost and whether the user clicks the ad, respectively, after the bidding price $a^{l}_{p,i}$ is given. Each request $i$ comes from a specific channel and only affects the cost of that channel, and thus we model each channel individually. Then, the CMDP for channel $p$ can be formulated as a tuple $(\mathcal{S}^{l},\mathcal{A}^{l},\mathcal{T}^{l},\gamma^{l},\mathcal{R}^{l},\mathcal{C}^{l})$, which is defined as follows:

State $\mathcal{S}^{l}$: The low-level state space consists of the allocated budget $a^{h}_{p}$ together with request-level and advertiser-level information. The request-level information includes the time and the current advertising status (e.g., budget consumption rate and financial constraints satisfactory ratio). The advertiser-level information is identical to the high-level planner state.

Action $\mathcal{A}^{l}$: Following [27], an action $a_{i}^{l}\in[a^{l}_{\min},a^{l}_{\max}]$ represents the bidding ratio, and the final bidding price is calculated as $a_{i}^{l}\cdot CPC_{m}^{set}$.

Reward $\mathcal{R}^{l}$: For each request $i$, the reward $r_{i}^{l}\in\{0,1\}$ is set to $1$ if the bidding is successful and the user eventually clicks the ad, and to $0$ otherwise.

Constraint $\mathcal{C}^{l}$: The single-channel budget constraint and the cross-channel CPC constraint are given in Eqn. (9) and Eqn. (10), respectively.
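To make the low-level action and constraints concrete, the sketch below computes a bid from a bidding ratio and checks the per-channel budget constraint of Eqn. (9) and the cross-channel CPC constraint of Eqn. (10) on logged outcomes. The ratio bounds, channel names, log layout, and the handling of the zero-click case are illustrative assumptions rather than details given in the paper.

```python
from typing import Dict, List, Tuple


def bid_price(bid_ratio: float, cpc_set: float,
              ratio_min: float = 0.0, ratio_max: float = 2.0) -> float:
    """Final bid = bidding ratio (clipped to hypothetical bounds) times the CPC target."""
    clipped = min(max(bid_ratio, ratio_min), ratio_max)
    return clipped * cpc_set


def check_constraints(logs: Dict[str, List[Tuple[float, int]]],
                      budgets: Dict[str, float],
                      cpc_set: float) -> Tuple[Dict[str, bool], bool]:
    """logs maps channel -> list of (cost, click) pairs, one pair per request."""
    # Eqn. (9): the cost summed over requests of a channel must not exceed its budget.
    budget_ok = {ch: sum(c for c, _ in recs) <= budgets[ch]
                 for ch, recs in logs.items()}
    # Eqn. (10): total cost over total clicks, pooled across all channels.
    total_cost = sum(c for recs in logs.values() for c, _ in recs)
    total_clicks = sum(k for recs in logs.values() for _, k in recs)
    if total_clicks == 0:
        # Assumption: with no clicks, the CPC constraint holds only if nothing was spent.
        cpc_ok = total_cost == 0
    else:
        cpc_ok = total_cost / total_clicks <= cpc_set
    return budget_ok, cpc_ok


# Hypothetical logs for two channels.
logs = {"feed": [(1.0, 1), (0.5, 0), (2.0, 1)], "search": [(3.0, 1), (1.0, 0)]}
budgets = {"feed": 4.0, "search": 3.5}
budget_ok, cpc_ok = check_constraints(logs, budgets, cpc_set=3.0)
```

Here the "search" channel overspends its allocated budget while the pooled CPC (7.5 / 3 = 2.5) still meets the target, which illustrates that the two constraints are evaluated at different granularities.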

IV Preliminary

In order to apply offline DRL methods to the considered c3-bidding problem, the key is to maintain a conservative value estimation (i.e., to eliminate possible over-estimation). Recall that the Q-network $Q_{\theta}(s,a)$, parameterized by $\theta$, measures the accumulative discounted reward starting from the state-action pair $(s,a)$. $Q_{\theta}(s,a)$ can be improved by minimizing the temporal difference [36] as:

$$\mathcal{L}_{Q}(\theta)=\mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}\left[\big(r+\gamma Q_{\theta'}(s',a')-Q_{\theta}(s,a)\big)^{2}\right], \tag{11}$$

where $a'=\mathop{\arg\max}_{a'}Q_{\theta'}(s',a')$ and $Q_{\theta'}$ is a target network for learning stability. Out-of-distribution (OOD) state-action pairs introduce extrapolation error [31] during offline training, resulting in a severely overestimated value function and an aggressive bidding strategy. Such a strategy may cause catastrophic financial loss in c3-bidding because it is inclined to give higher bidding prices, leading to vicious competition and significant risks. Thus, it is important to keep the value estimation conservative, which helps prevent the learned policy from taking risky actions.
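The temporal-difference objective of Eqn. (11) can be illustrated with a small tabular sketch, where dictionaries stand in for the Q-network and the target network. The state names, values, and the function name `td_loss` are hypothetical; terminal states are not handled in this simplified version.

```python
from typing import Dict, List, Tuple

Transition = Tuple[str, int, float, str]  # (s, a, r, s_next)


def td_loss(q: Dict[str, List[float]], q_target: Dict[str, List[float]],
            batch: List[Transition], gamma: float) -> float:
    """Mean squared TD error; the bootstrap action is the argmax of the target network."""
    total = 0.0
    for s, a, r, s_next in batch:
        next_values = q_target[s_next]
        a_next = max(range(len(next_values)), key=lambda j: next_values[j])
        target = r + gamma * next_values[a_next]
        total += (target - q[s][a]) ** 2
    return total / len(batch)


# Toy tables: two states, two actions each.
q = {"s0": [0.5, 1.0], "s1": [0.0, 2.0]}
q_target = {"s0": [0.4, 0.9], "s1": [0.1, 1.5]}
batch = [("s0", 1, 1.0, "s1"), ("s1", 0, 0.0, "s0")]
loss = td_loss(q, q_target, batch, gamma=0.9)
```

The first transition has target 1.0 + 0.9 * 1.5 = 2.35 against an estimate of 1.0, and the second has target 0.9 * 0.9 = 0.81 against 0.0, so the mean squared error is (1.8225 + 0.6561) / 2.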

In this paper, we adopt Mildly Conservative Q-learning (MCQ [15]) as the starting point of the design for both the high-level planner and the low-level executor, where OOD state-action pairs are actively trained by assigning proper pseudo Q values. In the considered c3-bidding problem, the policy distribution within the dataset exhibits a multi-modal pattern due to the highly non-stationary external market. A simple parameterized approach (e.g., an MLP trained with cross-entropy) cannot work well, as it focuses on mapping input to output and neglects the full distribution. Instead, we utilize a conditional variational autoencoder (CVAE [37]) to model the distribution of the behavior policy $\mu$. Given log data, the objective of the CVAE is to reconstruct actions conditioned on the states, such that the generated actions come from the same distribution as the actions in the log. The CVAE is denoted as $G_{\omega}(s)$, parameterized by $\omega$, and consists of an encoder $G^{E}_{\omega_{1}}(s,a)$ and a decoder $G^{D}_{\omega_{2}}(s,z)$. We optimize its variational lower bound by:

$$\mathcal{L}_{CVAE}(\omega)=\mathbb{E}\left[\big(a-G^{D}_{\omega_{2}}(s,z)\big)^{2}+KL\big(G^{E}_{\omega_{1}}(s,a),\mathcal{N}(0,\mathbf{I})\big)\right], \tag{12}$$

where the hidden state $z=G^{E}_{\omega_{1}}(s,a)$, $KL(\cdot)$ denotes the KL-divergence, $\mathcal{N}$ is the multivariate normal distribution, and $\mathbf{I}$ is the identity matrix.
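For a single sample, the two terms of Eqn. (12) can be sketched as below. We assume, as is common for variational autoencoders but not stated in the text, that the encoder outputs the mean and log-variance of a diagonal Gaussian and that $z$ is drawn with the reparameterization trick. The decoder itself is not modeled here; its output is passed in as `reconstructed`, and all names are hypothetical.

```python
import math
import random
from typing import List


def kl_to_standard_normal(mu: List[float], log_var: List[float]) -> float:
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def reparameterize(mu: List[float], log_var: List[float],
                   rng: random.Random) -> List[float]:
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def cvae_loss(action: float, reconstructed: float,
              mu: List[float], log_var: List[float]) -> float:
    """Squared reconstruction error plus the KL regularizer, for one (s, a) sample."""
    return (action - reconstructed) ** 2 + kl_to_standard_normal(mu, log_var)


# Toy encoder output for a two-dimensional latent variable.
rng = random.Random(0)
mu, log_var = [0.0, 1.0], [0.0, 0.0]
z = reparameterize(mu, log_var, rng)           # latent sample fed to the decoder
loss = cvae_loss(action=0.8, reconstructed=0.5, mu=mu, log_var=log_var)
```

With unit variances, the KL term reduces to half the squared norm of the mean (0.5 here), and the reconstruction term contributes 0.09.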

Then, given a state $s$, we generate several in-distribution actions $a_{\mu}$ with the CVAE $G_{\omega}(s)$, and the auxiliary loss for OOD actions is calculated by:

$$\mathcal{L}_{OOD}(\theta)=\mathbb{E}_{s\sim\mathcal{D}}\left[\big(\max_{a_{\mu}\sim G_{\omega}(s)}Q_{\theta}(s,a_{\mu})-Q_{\theta}(s,a_{\pi})\big)^{2}\right]. \tag{13}$$

If $a_{\pi}$ (generated by the current policy $\pi$) is an OOD action, the auxiliary loss keeps the corresponding value estimate from exceeding the maximum value of the in-distribution actions $a_{\mu}$. In this way, we help the value estimator stay conservative such that OOD actions will not be severely overestimated. However, MCQ cannot guarantee that the budget allocation and bidding strategy satisfy user requirements; hence, we propose three key modules.
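The auxiliary loss of Eqn. (13) can be sketched with plain callables standing in for the Q-network, the CVAE sampler, and the current policy. All three stand-ins and the toy numbers are hypothetical. In a gradient-based implementation the pseudo target would typically be treated as a constant, which is an assumption here rather than something stated above.

```python
from typing import Callable, List


def ood_loss(states: List[float],
             q_fn: Callable[[float, float], float],
             sample_behavior_actions: Callable[[float], List[float]],
             policy_action: Callable[[float], float]) -> float:
    """Mean squared gap between the best in-distribution Q-value and Q(s, a_pi)."""
    total = 0.0
    for s in states:
        in_dist_actions = sample_behavior_actions(s)   # stand-in for CVAE samples
        pseudo_target = max(q_fn(s, a) for a in in_dist_actions)
        total += (pseudo_target - q_fn(s, policy_action(s))) ** 2
    return total / len(states)


# Toy setup: Q grows with the action, so a large unseen action looks attractive.
def q_fn(s: float, a: float) -> float:
    return s + a


def behavior(s: float) -> List[float]:
    return [0.2, 0.4, 0.6]


states = [0.0, 1.0]
loss_ood_action = ood_loss(states, q_fn, behavior, lambda s: 1.0)  # action outside the data
loss_in_dist = ood_loss(states, q_fn, behavior, lambda s: 0.6)     # best logged action
```

The loss is positive when the policy picks the unseen action 1.0, whose value exceeds every in-distribution value, and vanishes when the policy picks the best in-distribution action.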

Figure 3: The proposed framework: HiBid

V Our solution: HiBid

Due to the large solution space and multiple constraints of the considered c3-bidding problem, we propose a hierarchical offline DRL framework called HiBid, as shown in Figure 3. It introduces three modules: an auxiliary batch loss [16] to prevent over-allocating budget to specific channels (see Section V-A), $\lambda$-generalization [17] for constrained bidding to adaptively respond to changing budgets (see Section V-B), and a CPC-guided action selection (CPC-AS) scheme for satisfying the cross-channel CPC constraint (see Section V-C).

V-A Auxiliary Batch Loss for Non-competitive Budget Allocation

Recall that the high-level planner’s objective is to maximize the number of user clicks under the advertiser’s budget while ensuring that the revenue on each channel stays within an acceptable range. However, the number of requests on each channel is limited and cannot accommodate all advertisers. If all of them compete for requests on high-quality channels, the bidding price will rise and advertisers’ CPC satisfactory ratio will drop. A well-designed planner should allocate different budgets to each channel based on the advertiser’s preference, while preventing budget over-allocation on specific channels.

A common solution is to leverage an advertiser-level constraint to limit the amount of budget allocated to a channel. This may reduce overall revenue, since the bidding abilities of advertisers differ considerably and thus one cannot allocate an equal budget to all of them. Meanwhile, our design also faces a key challenge: how to restrict the updated policy to satisfy the channel-capacity constraint, given that it is hard to design a suitable reward function. Inspired by [16], we design a batch loss for the high-level planner to ensure that the budget allocated to each channel fluctuates within an acceptable range. Suppose that we sample a batch of experiences containing multiple tuples $(s_{p}^{h},a_{p}^{h},r_{p}^{h},c_{p}^{h},s_{p+1}^{h})$ from the high-level dataset $\mathcal{D}^{h}$; then the batch loss is calculated by:

$$\mathcal{L}_{Batch}(\theta)=\sum_{p=1}^{P}\bigg(\sum_{s_{p}^{h}}Cost\big(\mathop{\arg\max}_{a_{p}^{h}\in\mathcal{A}^{h}}Q_{\theta}^{h}(s_{p}^{h},a_{p}^{h})\big)-\kappa_{p}\bigg)^{2}, \tag{14}$$

where $\kappa_{p}$ is calculated by multiplying the historical CTR, CPC, and impression count over all advertisers in the batch, and a weight $w_{p}$ is applied to the batch loss for learning stability. Since the argmax function is not differentiable, a soft version is applied:

$$Cost\big(\mathop{\arg\max}_{a_{p}^{h}\in\mathcal{A}^{h}}Q_{\theta}^{h}(s_{p}^{h},a_{p}^{h})\big)\approx\sum_{n=1}^{|\mathcal{A}^{h}|}\frac{1}{Z}\exp\left[\beta Q_{\theta}^{h}(s_{p}^{h},a_{p,n}^{h})\right]Cost(a_{p,n}^{h}), \tag{15}$$

where $\beta$ is the temperature coefficient and $Z=\sum_{n=1}^{|\mathcal{A}^{h}|}\exp[\beta Q^{h}_{\theta}(s_{p}^{h},a_{p,n}^{h})]$ is the normalization factor. While the high-level planner updates its strategy using randomly sampled batch experiences to maximize the number of clicks, the batch loss encourages it to reallocate budgets across channels based on historical statistics within the batch. For example, it prompts some high-impact advertisers (i.e., those with higher conversion rates) to reduce their budget allocation on high-quality channels, thereby allowing low-impact advertisers to benefit from the superior advertising effects that these channels provide. This approach keeps revenue on the channels from fluctuating dramatically. Additionally, it prevents the high-level policy from consistently favoring higher investments in high-quality channels, which could lead to detrimental competition.
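A minimal sketch of the batch loss follows. It reads the soft argmax of Eqn. (15) as a softmax-weighted average of the per-action cost, with weights $\exp[\beta Q]/Z$; this reading, the per-action cost vector, and the toy numbers are assumptions made for illustration. The weight $w_{p}$ is omitted for brevity.

```python
import math
from typing import List


def soft_argmax_cost(q_values: List[float], costs: List[float], beta: float) -> float:
    """Softmax-weighted cost, a differentiable surrogate for Cost(argmax_a Q(s, a))."""
    q_max = max(q_values)  # subtracting the maximum keeps exp() numerically stable
    weights = [math.exp(beta * (q - q_max)) for q in q_values]
    z = sum(weights)
    return sum(w / z * c for w, c in zip(weights, costs))


def batch_loss(q_by_channel: List[List[List[float]]], costs: List[float],
               kappa: List[float], beta: float) -> float:
    """q_by_channel[p] holds one Q-value vector per sampled state on channel p."""
    loss = 0.0
    for p, q_rows in enumerate(q_by_channel):
        expected_cost = sum(soft_argmax_cost(q, costs, beta) for q in q_rows)
        loss += (expected_cost - kappa[p]) ** 2
    return loss


costs = [10.0, 30.0]                       # hypothetical cost of each discrete action
uniform = soft_argmax_cost([0.0, 0.0], costs, beta=1.0)
sharp = soft_argmax_cost([1.0, 2.0], costs, beta=50.0)
loss = batch_loss([[[0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]], costs,
                  kappa=[20.0, 35.0], beta=1.0)
```

With equal Q-values the surrogate returns the mean cost, and with a large $\beta$ it approaches the cost of the greedy action, which is the behavior expected of a temperature-controlled soft argmax.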

V-B Offline Data Augmentation by $\lambda$-Generation and Optimal $\lambda$-Selection for Online Adaptive Bidding

As the advertiser’s total budget and the platform’s allocated budget may change over time, we need an adaptive bidding strategy that can dynamically respond to budget changes. However, under the offline DRL training paradigm, the low-level executor learns the bidding strategy from the low-level dataset $\mathcal{D}^{l}$, and this strategy cannot generalize to unseen budget cases, which reduces the effectiveness of the high-level budget allocation. To this end, we adopt “$\lambda$-generalization” [17] to learn an adaptive bidding strategy for dynamically changing budgets.

In order to learn a bidding strategy that satisfies the budget constraint on each channel individually, the c3-bidding problem without the CPC constraint can be converted into its Lagrangian dual problem [38] as:

$$\min_{\lambda}\max_{a^{l}_{i}}\ \sum_{i=1}^{I}Click(a^{l}_{i})-\lambda\bigg(\sum_{i=1}^{I}Cost(a^{l}_{i})-a_{p}^{h}\bigg) \tag{16}$$
$$\Rightarrow\ \min_{\lambda}\max_{a^{l}_{i}}\ \sum_{i=1}^{I}\bigg(Click(a_{i}^{l})-\lambda Cost(a_{i}^{l})\bigg)+\lambda a_{p}^{h}\quad s.t.\quad\lambda\geq 0,$$

where $\lambda$ is the Lagrangian multiplier that controls how much the bidding strategy spends. Thus, we take it as part of the input to the Q-network and modify the low-level reward function, with $c_{i}^{l}$ denoting the cost of request $i$, as:

$$r^{l,\lambda}_{i}=r_{i}^{l}-\lambda c_{i}^{l}. \tag{17}$$

Then the low-level policy can be formulated as:

$$\pi^{l}(s_{i}^{l},\lambda)=\mathop{\arg\max}_{a_{i}^{l}\in\mathcal{A}^{l}}Q_{\theta}^{l}(s_{i}^{l},a_{i}^{l},\lambda). \tag{18}$$

Therefore, the low-level executor can adaptively respond to changing budgets by selecting the optimal $\lambda^{*}$. Next, we need to address two tasks: training the corresponding policy under different values of $\lambda$ offline, and obtaining an optimal $\lambda^{*}$ under the budget constraint during online prediction.
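The $\lambda$-conditioned reward of Eqn. (17) and the greedy policy of Eqn. (18) can be illustrated as follows. The toy Q-function (expected clicks growing as the square root of the bidding ratio, expected cost growing linearly) and the candidate ratio grid are purely hypothetical and serve only to show how $\lambda$ shifts the chosen bid.

```python
import math
from typing import Callable, List


def shaped_reward(click: float, cost: float, lam: float) -> float:
    """Eqn. (17): the click reward penalized by lambda times the request cost."""
    return click - lam * cost


def greedy_bid_ratio(q_fn: Callable[[object, float, float], float],
                     state: object, lam: float,
                     candidate_ratios: List[float]) -> float:
    """Eqn. (18): pick the bidding ratio with the highest lambda-conditioned Q-value."""
    return max(candidate_ratios, key=lambda a: q_fn(state, a, lam))


def toy_q(state: object, ratio: float, lam: float) -> float:
    """Hypothetical Q-value: clicks ~ sqrt(ratio), cost ~ ratio; the state is ignored."""
    return math.sqrt(ratio) - lam * ratio


ratios = [0.25, 0.5, 1.0, 2.0]
loose = greedy_bid_ratio(toy_q, None, lam=0.1, candidate_ratios=ratios)
medium = greedy_bid_ratio(toy_q, None, lam=0.5, candidate_ratios=ratios)
tight = greedy_bid_ratio(toy_q, None, lam=1.0, candidate_ratios=ratios)
```

In this toy model a larger $\lambda$ penalizes spending more heavily and therefore selects a lower bidding ratio, which is the mechanism that lets a single network respond to different budgets.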

During the offline training, we perform data augmentation by λ𝜆\lambdaitalic_λ-generation, allowing the policy to learn how to bid under different λ𝜆\lambdaitalic_λ. Given a fixed training dataset consisted of multiple tuples, we extend it by enlarging each tuple (sil,ail,ril,cil,si+1l)superscriptsubscript𝑠𝑖𝑙superscriptsubscript𝑎𝑖𝑙superscriptsubscript𝑟𝑖𝑙superscriptsubscript𝑐𝑖𝑙superscriptsubscript𝑠𝑖1𝑙(s_{i}^{l},a_{i}^{l},r_{i}^{l},c_{i}^{l},s_{i+1}^{l})( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT , italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT , italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT , italic_s start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT ) into {(sil,ail,ril,λn,cil,si+1l,λin)}n=1Nsuperscriptsubscriptsuperscriptsubscript𝑠𝑖𝑙superscriptsubscript𝑎𝑖𝑙superscriptsubscript𝑟𝑖𝑙subscript𝜆𝑛superscriptsubscript𝑐𝑖𝑙superscriptsubscript𝑠𝑖1𝑙superscriptsubscript𝜆𝑖𝑛𝑛1𝑁\{(s_{i}^{l},a_{i}^{l},r_{i}^{l,\lambda_{n}},c_{i}^{l},s_{i+1}^{l},\lambda_{i}% ^{n})\}_{n=1}^{N}{ ( italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT , italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT , italic_r start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l , italic_λ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT end_POSTSUPERSCRIPT , italic_c start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT , italic_s start_POSTSUBSCRIPT italic_i + 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT , italic_λ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT 
start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ) } start_POSTSUBSCRIPT italic_n = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT, where N𝑁Nitalic_N is data repetition times and λnsubscript𝜆𝑛\lambda_{n}italic_λ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT is a uniformly sampled value from [0,λmax]0subscript𝜆[0,\lambda_{\max}][ 0 , italic_λ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ]. The range [0,λmax]0subscript𝜆[0,\lambda_{\max}][ 0 , italic_λ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT ] can guarantee that the cost of the learned policy falls within a controllable range [17]. We also construct two additional evaluation networks Q^ϕcsubscriptsuperscript^𝑄𝑐italic-ϕ\hat{Q}^{c}_{\phi}over^ start_ARG italic_Q end_ARG start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT and Q^ηusubscriptsuperscript^𝑄𝑢𝜂\hat{Q}^{u}_{\eta}over^ start_ARG italic_Q end_ARG start_POSTSUPERSCRIPT italic_u end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_η end_POSTSUBSCRIPT to accurately evaluate the expected cost and number of clicks under the low-level policy πlsuperscript𝜋𝑙\pi^{l}italic_π start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT. We use the action from πlsuperscript𝜋𝑙\pi^{l}italic_π start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT instead of max\maxroman_max operator when computing the target value in two evaluation network updates by:

$$\mathcal{L}_{\hat{Q}}(\phi)=\mathbb{E}_{\tau^{l}\sim\mathcal{D}^{l}}\left[c_{i}^{l}+\gamma^{l}\hat{Q}^{c}_{\phi}\big(s_{i+1}^{l},\pi^{l}(s_{i+1}^{l},\lambda_{i}^{n}),\lambda_{i}^{n}\big)-\hat{Q}^{c}_{\phi}(s_{i}^{l},a_{i}^{l},\lambda_{i}^{n})\right], \tag{19}$$

where $\tau^l$ denotes trajectories sampled from $\mathcal{D}^l$ and $c_i^l$ denotes the budget consumption under action $a_i^l$. Note that we use the same loss function $\mathcal{L}_{\hat{Q}}(\eta)$ for $\hat{Q}^u_{\eta}$, with the number of clicks taking the place of $c_i^l$.
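The $\lambda$-generation step and the TD error inside Eqn. (19) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the dictionary layout of a batch, and in particular the Lagrangian-style reward $r - \lambda c$ used in place of Eqn. (17) are assumptions made only for this example.

```python
import numpy as np

def lambda_generation(batch, n_repeat, lam_max, rng):
    """Expand each transition into n_repeat copies, each with its own sampled lambda."""
    s, a, r, c, s_next = (
        np.asarray(batch[k], dtype=float) for k in ("s", "a", "r", "c", "s_next")
    )
    # One uniformly sampled lambda per (transition, repetition) pair.
    lam = rng.uniform(0.0, lam_max, size=(len(r), n_repeat))
    # ASSUMPTION: a Lagrangian-style reward r - lambda * c stands in for Eqn. (17).
    r_lam = r[:, None] - lam * c[:, None]

    def rep(x):
        # Repeat every transition n_repeat times along the batch axis.
        return np.repeat(x, n_repeat, axis=0)

    return {
        "s": rep(s), "a": rep(a), "r_lam": r_lam.reshape(-1),
        "c": rep(c), "s_next": rep(s_next), "lam": lam.reshape(-1),
    }

def evaluation_td_error(q_hat, policy, signal, s, a, s_next, lam, gamma):
    """TD error of Eqn. (19): the next action comes from the policy, not from a max."""
    a_next = policy(s_next, lam)
    return signal + gamma * q_hat(s_next, a_next, lam) - q_hat(s, a, lam)
```

Passing the per-step cost as `signal` corresponds to the cost network $\hat{Q}^c_{\phi}$; passing the per-step click count corresponds to $\hat{Q}^u_{\eta}$. Here `q_hat` and `policy` are hypothetical callables standing in for the trained networks.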

During online prediction, the two evaluation networks are used to perform $\lambda$-selection, which ensures that the low-level policy does not exceed the budget allocated by the high-level planner. For each request $i$, we uniformly sample $\{\lambda^n\}_{n=1}^{N_p}$ within the range $[0, \lambda_{\max}]$ and select the $\lambda_i^*$ that satisfies the allocated budget $a_p^h$ while maximizing the number of clicks:

$$\lambda_{i}^{*}=\mathop{\arg\max}_{\lambda\in\{\hat{Q}^{c}_{\phi}(s_{i}^{l},\pi(s_{i}^{l},\lambda^{n}),\lambda^{n})\leq a_{p}^{h}|_{n=1}^{N_{p}}\}}\hat{Q}^{u}_{\eta}(s_{i}^{l},\pi^{l}(s_{i}^{l},\lambda),\lambda),\quad\forall i. \tag{20}$$

In this way, the low-level executor can adaptively respond to the allocated budget $a_p^h$ and ensure the effectiveness of the budget allocation made by the high-level planner.
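The $\lambda$-selection rule of Eqn. (20) reduces to a constrained argmax over the sampled candidates. The sketch below assumes the evaluation networks have already been queried for each candidate $\lambda^n$; the function name `select_lambda` and the fallback used when no candidate fits the budget are illustrative assumptions, since the text does not specify that case.

```python
import numpy as np

def select_lambda(lams, est_cost, est_clicks, budget):
    """Pick the candidate lambda with the most estimated clicks among those within budget."""
    lams = np.asarray(lams, dtype=float)
    est_cost = np.asarray(est_cost, dtype=float)
    est_clicks = np.asarray(est_clicks, dtype=float)

    feasible = est_cost <= budget
    if not feasible.any():
        # ASSUMPTION: fall back to the cheapest candidate when nothing fits the budget.
        return float(lams[np.argmin(est_cost)])

    # Infeasible candidates are masked out before taking the argmax over clicks.
    masked_clicks = np.where(feasible, est_clicks, -np.inf)
    return float(lams[np.argmax(masked_clicks)])
```

In this sketch, `est_cost[n]` plays the role of $\hat{Q}^c_{\phi}(s_i^l, \pi^l(s_i^l, \lambda^n), \lambda^n)$ and `est_clicks[n]` the role of $\hat{Q}^u_{\eta}(s_i^l, \pi^l(s_i^l, \lambda^n), \lambda^n)$.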

V-C CPC-guided Action Selection for Cross-channel Constraint Satisfaction

Due to the varying target users and advertising quality, the competition situation differs among channels, resulting in a significant discrepancy in terms of CPC between channels (i.e., high-quality channels have higher CPC). Therefore, we cannot simplify the cross-channel CPC constraint by setting the same target CPC for each channel. When we use Lagrangian relaxation to deal with both budget and CPC constraints simultaneously, it is impossible to find an effective pair of Lagrangian multipliers to satisfy them due to the explosion of the solution space. We design a CPC-guided action selection (CPC-AS) scheme to help the low-level executor choose the action that satisfies the CPC constraint by considering both the past and the future.

When making a decision for a request $i$, the final CPC for an advertiser $m$ is divided into two parts:

$$\begin{aligned}
CPC_{m}^{real} &= \frac{\sum_{p=1}^{P}\sum_{i=1}^{I}Cost(a^{l}_{p,i})}{\sum_{p=1}^{P}\sum_{i=1}^{I}Click(a^{l}_{p,i})} \\
&= \frac{Cost_{m,t}+\sum_{p=1}^{P}\sum_{i=t}^{I}Cost(a^{l}_{p,i})}{Click_{m,t}+\sum_{p=1}^{P}\sum_{i=t}^{I}Click(a^{l}_{p,i})},
\end{aligned} \tag{21}$$

where $Cost_{m,t}$ and $Click_{m,t}$ denote the cost and the number of clicks accumulated so far, respectively. As described in Section V-B, the two evaluation networks $\hat{Q}^c_{\phi}$ and $\hat{Q}^u_{\eta}$ estimate the expected discounted cost and the expected discounted number of clicks, respectively. Given the current state-action pair $(s^l_i, a^l_i)$, we approximate the expected cost and number of clicks by $\hat{Q}^c_{\phi}(s^l_i, a^l_i, \lambda)$ and $\hat{Q}^u_{\eta}(s^l_i, a^l_i, \lambda)$. We therefore define $CPC_m^{pred}(a_i^l)$ by combining the past with the expected future:

$$\begin{aligned}
CPC_{m}^{pred}(a_{i}^{l}) &= \frac{Cost_{m,t}+\sum_{p=1}^{P}\sum_{i=t}^{I}(\gamma^{l})^{i-1}Cost(a^{l}_{p,i})}{Click_{m,t}+\sum_{p=1}^{P}\sum_{i=t}^{I}(\gamma^{l})^{i-1}Click(a^{l}_{p,i})} \\
&= \frac{Cost_{m,t}+\sum_{p=1}^{P}\hat{Q}^{c}_{\phi}(s^{l}_{p,i},a^{l}_{p,i},\lambda)}{Click_{m,t}+\sum_{p=1}^{P}\hat{Q}^{u}_{\eta}(s^{l}_{p,i},a^{l}_{p,i},\lambda)}.
\end{aligned} \tag{22}$$

Since ad requests from all channels arrive in sequence, and each request belongs to exactly one channel by definition, we cannot estimate the future of the other channels at decision time; instead, we store their most recent estimates and use them as input. With $CPC_m^{pred}(a_i^l)$, the low-level policy becomes:

$$\pi^{l}(s^{l}_{i},\lambda)=\mathop{\arg\max}_{a\in\{CPC_{m}^{pred}(a_{i}^{l})\leq CPC_{m}^{set}\,|\,a_{i}^{l}\in\mathcal{A}^{l}\}}Q^{l}_{\theta}(s^{l}_{i},a,\lambda). \tag{23}$$

Note that there is a slight bias between $CPC_m^{real}$ and $CPC_m^{pred}(a_i^l)$ when $\gamma^l < 1$. In Section VI-G, we show experimentally that this bias becomes smaller as $\gamma^l$ approaches $1$ and can be ignored in practice.
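To make Eqns. (22) and (23) concrete, the sketch below computes the predicted CPC for every candidate bid of the current channel and restricts the argmax to the actions that meet the target. It is an illustration under stated assumptions rather than the deployed logic: the function name, the scalars `other_cost` and `other_clicks` (the stored most recent estimates summed over the other channels), the small `eps` guard against a zero denominator, and the fallback when no action is feasible are all hypothetical.

```python
import numpy as np

def cpc_guided_action(q_values, q_cost, q_clicks, cost_so_far, clicks_so_far,
                      other_cost, other_clicks, cpc_target, eps=1e-8):
    """Return (action index, predicted CPC per action) following Eqns. (22)-(23)."""
    q_values = np.asarray(q_values, dtype=float)   # Q^l_theta(s, a, lambda) per action
    q_cost = np.asarray(q_cost, dtype=float)       # estimated future cost per action
    q_clicks = np.asarray(q_clicks, dtype=float)   # estimated future clicks per action

    # Past totals + cached estimates of other channels + this channel's estimate.
    cpc_pred = (cost_so_far + other_cost + q_cost) / (
        clicks_so_far + other_clicks + q_clicks + eps
    )

    feasible = cpc_pred <= cpc_target
    if not feasible.any():
        # ASSUMPTION: if every action violates the target, take the lowest predicted CPC.
        return int(np.argmin(cpc_pred)), cpc_pred

    # Mask infeasible actions, then maximize the low-level Q-value.
    return int(np.argmax(np.where(feasible, q_values, -np.inf))), cpc_pred
```

For example, with a target CPC of 1.1, an action whose predicted CPC is about 1.11 is excluded even if it has a higher Q-value than a feasible alternative.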

Input: Log data $\mathcal{D}$, high-level CVAE $G^h_{\omega}$, Q-network $Q^h_{\theta}$ and target network $Q^h_{\theta'}$, low-level CVAE $G^l_{\omega}$, Q-network $Q^l_{\theta}$ and target network $Q^l_{\theta'}$, evaluation networks $\hat{Q}^u_{\eta}$ and $\hat{Q}^c_{\phi}$.
1 Initialize all parameterized networks;
2 Process log data into high-level dataset $\mathcal{D}^h$ and low-level dataset $\mathcal{D}^l$;
3 for high-level update iteration $= 1, 2, 3, \dots$ do
4       Sample a batch containing $J^h$ tuples $\{(s_j^h, a_j^h, r_j^h, c_j^h, s_{j+1}^h)\}_{j=1}^{J^h}$ from $\mathcal{D}^h$;
5       Update high-level CVAE by minimizing Eqn. (12);
6       Calculate $\mathcal{L}_{Q}(\theta)$, $\mathcal{L}_{OOD}(\theta)$ and the batch loss $\mathcal{L}_{Batch}(\theta)$ by Eqn. (11), (13) and (14);
7       Update Q-network $Q_{\theta}^{h}$ by minimizing the weighted loss $w_1\mathcal{L}_{Q}(\theta) + w_2\mathcal{L}_{OOD}(\theta) + w_b\mathcal{L}_{Batch}(\theta)$;
8       Every $N_{target}$ iterations synchronize $Q^h_{\theta'} \leftarrow Q^h_{\theta}$;
9 for low-level update iteration $= 1, 2, 3, \dots$ do
10       Sample a batch of experiences containing $J^l$ tuples $\{(s_j^l, a_j^l, r_j^l, c_j^l, s_{j+1}^l)\}_{j=1}^{J^l}$ from $\mathcal{D}^l$;
11       Update low-level CVAE by minimizing Eqn. (12);
12       for $j = 1, \dots, J^l$ do
13             Get the allocated budget $a^h_p$ using $Q^h_{\theta}$;
14             Uniformly sample $\{\lambda^n\}_{n=1}^{N}$ in the range $[0, \lambda_{\max}]$;
15             Calculate the reward $r_j^{l,\lambda_j^n}$ by Eqn. (17);
16             Include $\lambda^n$ and $a^h_p$ in $s_j^l$, then enlarge each tuple into $\{(s_j^l, a_j^l, r_j^{l,\lambda^n}, c_j^l, s_{j+1}^l, \lambda_j^n)\}_{n=1}^{N}$;
17       Calculate $\mathcal{L}_{Q}(\theta)$ and $\mathcal{L}_{OOD}(\theta)$ by Eqn. (11) and (13);
18       Update Q-network $Q_{\theta}^{l}$ by minimizing the weighted loss $w_1\mathcal{L}_{Q}(\theta) + w_2\mathcal{L}_{OOD}(\theta)$;
19       Update evaluation networks $\hat{Q}^u_{\eta}$ and $\hat{Q}^c_{\phi}$ by minimizing $\mathcal{L}_{\hat{Q}}(\eta)$ and $\mathcal{L}_{\hat{Q}}(\phi)$ in Eqn. (19);
20       Every $N_{target}$ iterations synchronize $Q^l_{\theta'} \leftarrow Q^l_{\theta}$;
Algorithm 1 Offline Training
Input: Trained high-level Q-network $Q_{\theta}^{h}$, low-level Q-network $Q_{\theta}^{l}$, and evaluation networks $\hat{Q}^{u}_{\eta}$ and $\hat{Q}^{c}_{\phi}$.
1 while Incoming ad request $i$ do
2       Get advertiser-level feature $s_{p}^{h}$ and request-level feature $s_{i}^{l}$ from the platform;
3       Allocate budget $a^{h}_{p}$ by $\mathop{\arg\max}_{a_{p}^{h}\in\mathcal{A}^{h}} Q_{\theta}^{h}(s_{p}^{h},a_{p}^{h})$;
4       Find the optimal $\lambda_{i}^{*}$ according to $a_{p}^{h}$ by Eqn. (20);
5       Calculate $\{CPC_{m}^{pred}(a_{i}^{l}) \mid a_{i}^{l}\in\mathcal{A}^{l}\}$ by Eqn. (22);
6       Calculate bidding action $a_{i}^{l}$ by Eqn. (23);
Algorithm 2 Online Prediction

V-D Algorithm Description

V-D1 Offline Training

We first show the pseudo-code for offline training in Algorithm 1. At the beginning, we initialize all parameterized networks (Line 1). Then we process the data for the high-level and low-level training individually (Line 2), to use the available log data efficiently. For the high-level planner, after sampling a batch of experiences from $\mathcal{D}^{h}$ (Line 4), we first update the CVAE by Eqn. (12) (Line 5). Using the sampled experiences, we calculate $\mathcal{L}_{Q}(\theta)$, $\mathcal{L}_{OOD}(\theta)$ and $L_{Batch}(\theta)$ by Eqns. (11), (13) and (14), then update the high-level Q-network with the weighted loss (Lines 6-7). Finally, the target network $\theta'$ is synchronized periodically (Line 8). For the low-level executor, we sample a batch of experiences from $\mathcal{D}^{l}$ and then update the low-level CVAE by Eqn. (12) (Lines 10-11). For each sampled experience, we leverage the trained high-level Q-network to determine the allocated budget (Line 13). Then, we augment the original tuple by sampling multiple $\{\lambda^{n}\}_{n=1}^{N}$ and modifying the reward (Lines 14-15). The sampled $\lambda^{n}$ and the allocated budget $a^{h}_{p}$ are incorporated into the state $s_{j}^{l}$ (Line 16). Using the augmented tuple, we update the low-level Q-network with the weighted loss by Eqns. (11) and (13) (Lines 17-18). The two evaluation networks are updated by minimizing $\mathcal{L}_{\hat{Q}}(\eta)$ and $\mathcal{L}_{\hat{Q}}(\phi)$ in Eqn. (19) (Line 19), and the low-level target network is synchronized every $N_{target}$ iterations (Line 20).
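The augmentation step (Lines 14-16 of Algorithm 1) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Transition` container, the uniform sampling over $[0, \lambda_{\max}]$, and the Lagrangian-style reward `reward - lam * cost` are assumptions made here for illustration, since the actual reward modification is defined outside this section. The defaults `n=30` and `lam_max=1.45` follow Table II.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Transition:
    state: Tuple[float, ...]  # request-level features s_j^l
    action: float             # bidding action a_j^l
    reward: float             # original reward from the log
    cost: float               # spend recorded for this request


def augment(t: Transition, budget: float, n: int = 30, lam_max: float = 1.45,
            rng: Optional[random.Random] = None) -> List[Transition]:
    """Return n copies of t, each conditioned on a freshly sampled lambda."""
    rng = rng or random.Random()
    out = []
    for _ in range(n):
        # ASSUMPTION: lambda is drawn uniformly from [0, lam_max].
        lam = rng.uniform(0.0, lam_max)
        # ASSUMPTION: Lagrangian-style reward used only as a stand-in.
        new_reward = t.reward - lam * t.cost
        # The sampled lambda and the allocated budget are appended to the state.
        new_state = t.state + (lam, budget)
        out.append(Transition(new_state, t.action, new_reward, t.cost))
    return out
```

Each logged tuple thus yields $N$ training tuples that differ only in the sampled $\lambda$, which is what lets the low-level Q-network see the same request under many budget pressures.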

V-D2 Online Prediction

The pseudo-code for online prediction is given in Algorithm 2. For each ad request $i$, HiBid gets the advertiser-level feature $s_{p}^{h}$ and the request-level feature $s_{i}^{l}$ as in Section 3 from the advertising platform (Line 2). Then, the high-level planner allocates budget $a_{p}^{h}$ by taking advertiser-level information as the input of $Q_{\theta}^{h}$ (Line 3). Together with the two evaluation networks $\hat{Q}^{u}_{\eta}$ and $\hat{Q}^{c}_{\phi}$, the low-level executor finds the optimal $\lambda_{i}^{*}$ according to Eqn. (20) (Line 4). To satisfy the CPC constraint, $CPC_{m}^{pred}(a_{i}^{l})$ is calculated for each action by Eqn. (22) (Line 5). Considering both the output of the Q-network $Q_{\theta}^{l}(s_{i}^{l},\lambda_{i}^{*})$ and $\{CPC_{m}^{pred}(a_{i}^{l}) \mid a_{i}^{l}\in\mathcal{A}^{l}\}$, the low-level executor gives the final bidding action based on Eqn. (23) (Line 6).
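As a rough illustration of Lines 3 and 6 of Algorithm 2, the Python sketch below shows a budget argmax and one plausible CPC-guided selection rule. Eqns. (20), (22) and (23) are not reproduced in this section, so `select_bid` is only a stand-in: it assumes that the rule picks the highest-Q action whose predicted CPC does not exceed the target, and falls back to the lowest predicted CPC otherwise. The function names and the dictionary-based inputs are hypothetical.

```python
from typing import Callable, Dict, Sequence


def allocate_budget(q_high: Callable[[Sequence[float], float], float],
                    s_h: Sequence[float],
                    budget_actions: Sequence[float]) -> float:
    """Line 3: argmax of the high-level Q-value over the budget action set."""
    return max(budget_actions, key=lambda a: q_high(s_h, a))


def select_bid(q_values: Dict[float, float],
               cpc_pred: Dict[float, float],
               cpc_set: float) -> float:
    """Stand-in for Eqn. (23): best Q-value among CPC-feasible actions."""
    feasible = [a for a in q_values if cpc_pred[a] <= cpc_set]
    if feasible:
        return max(feasible, key=lambda a: q_values[a])
    # ASSUMPTION: with no feasible action, take the lowest predicted CPC.
    return min(q_values, key=lambda a: cpc_pred[a])
```

For example, with Q-values `{0.5: 1.0, 1.0: 2.0, 1.5: 3.0}`, predicted CPCs `{0.5: 0.8, 1.0: 1.0, 1.5: 1.4}` and a target of 1.1, this rule returns the action 1.0 rather than the unconstrained argmax 1.5.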

VI Experiment

VI-A Setup

VI-A1 Dataset

We use large-scale log data collected from the Meituan advertising system (which is deployed online and runs real-time services) for offline training and performance evaluation. The data contains 28 days of bidding logs and is divided into two parts: 21 days for training and 7 days for evaluation. On average, we sampled 64,272 advertisers for 70 million ad requests on 4 channels per day. We follow the common setting in [15] for most hyper-parameters (leaving two key ones for hyper-parameter tuning in Section VI-B) and list them in Table II.

VI-A2 Offline Evaluation System

Deploying a model to an online system without evaluating its potential effects is risky, since it may lead to significant loss of revenue. To avoid this, we design an offline evaluation system for cross-channel bidding scenarios, which includes two modules: advertising system simulator and user feedback predictor. The former simulates Meituan’s real online advertising platform, including the process of retrieval, bidding, ranking and pricing. The latter is used to predict the user’s feedback on certain advertisements and provide the advertising results for evaluation.

VI-A3 Evaluation Metrics

We introduce five metrics to quantitatively evaluate the performance of HiBid in the considered c3-bidding problem: (a) total impression counts ($Impr$), (b) total number of clicks ($Click$), (c) average CPC ($CPC$), (d) average CPC satisfactory ratio ($CSR$), and (e) average ROI ($ROI$) of all advertisers. In particular, $CSR$ is the average of $\mathds{1}_{CPC_{m}^{real}\leq CPC_{m}^{set}}$ over all advertisers, where $\mathds{1}$ is the indicator function. To accurately evaluate the effectiveness of the various methods and eliminate the influence of specific application scenarios, we report each metric as a normalized score with respect to the statistical result obtained from the log data (offline performance evaluation) or the online solution R-BCQ [17] (online A/B testing).
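The $CSR$ definition above translates directly into code. The following Python sketch is illustrative only; the function name and the list-based inputs are choices made here, not part of the paper.

```python
from typing import Sequence


def cpc_satisfactory_ratio(cpc_real: Sequence[float],
                           cpc_set: Sequence[float]) -> float:
    """Fraction of advertisers m with CPC_m^real <= CPC_m^set."""
    if len(cpc_real) != len(cpc_set) or not cpc_real:
        raise ValueError("need one realized and one target CPC per advertiser")
    hits = sum(1 for real, target in zip(cpc_real, cpc_set) if real <= target)
    return hits / len(cpc_real)
```

For instance, with realized CPCs `[1.0, 2.0, 3.0, 0.5]` and targets `[1.0, 1.5, 3.5, 0.4]`, two of the four advertisers meet their constraint, so the ratio is 0.5.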

VI-A4 System Setup

We implement HiBid with TensorFlow 1.15 and use 4 NVIDIA A100 GPUs for offline training. The training time grows in proportion to $N$ and is about 29 hours when $N=30$. For online prediction, we deployed HiBid on 233 servers, each equipped with an Intel(R) Xeon(R) Platinum 8352Y CPU @ 2.20GHz and an NVIDIA A30 GPU. The maximum number of concurrent requests is about 18,899 (during the business peak period), and the inference time is shown in Table V.

TABLE II: Key hyper-parameters in HiBid
| Hyper-parameter | Value |
|---|---|
| high-level and low-level batch sizes | 4096, 1024 |
| high-level and low-level learning rates | 1e-5, 1e-5 |
| discounted factors $\gamma^{h},\gamma^{l}$ | 1, 0.999 |
| loss weights $w_{1},w_{2},w_{b}$ | 1, 0.05, 0.1 |
| high-level decision interval | 1 day |
| $\lambda$ sampling range $\lambda_{\max}$ | 1.45 |
| data repetition times $N$ | 30 |
| number of sampled $\lambda$ in online prediction $N_{p}$ | 50 |
| low-level action range $[a^{l}_{\min},a^{l}_{\max}]$ | [0.5, 1.5] |

VI-B Hyper-parameter Tuning

We first show the results of hyper-parameter tuning in HiBid for two parameters: the loss weight $w_{b}$ (in the batch loss; see Section V-A) and the data repetition times $N$ (in $\lambda$-generalization; see Section V-B). We tune $w_{b}\in\{0,0.02,0.04,\dots,0.2\}$ to investigate the effect of the weighted batch loss on budget allocation, and $N\in\{0,1,5,10,20,30,40,50,60\}$ to study the impact of sample efficiency on the bidding strategy.

During practical deployment, the platform requires that the fluctuation of revenue does not exceed 1%. Thus, $\epsilon$ is set to $0.01\kappa_{p}$ for each channel $p$. From Figure 4(a), we see that the capacity satisfactory ratio increases as $w_{b}$ increases, whereas $Click$ peaks at $w_{b}=0.1$ and then gradually decreases. The reason is that a larger $w_{b}$ prevents the over-allocation issue on high-quality channels, but when $w_{b}$ is too large, the batch loss negatively affects the policy updating process, resulting in a poor budget allocation strategy and poor advertising results. As shown in Figure 4(b), as the data repetition times $N$ increases, the low-level executor satisfies the allocated budget more accurately, since more experiences help the bidding strategy generalize to unseen budget cases. However, overuse of the training experiences may result in poor training efficiency. Therefore, the batch loss weight $w_{b}=0.1$ and the data repetition times $N=30$ are chosen for the performance comparison hereafter.
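The two ratios plotted in Figure 4 follow the indicator definitions given in its caption. A minimal Python sketch is shown below, assuming per-channel total costs and per-pair spends have already been aggregated; the function names and the flat list inputs are hypothetical, and `tol=0.01` encodes $\epsilon = 0.01\kappa_{p}$.

```python
from typing import Sequence


def capacity_satisfactory_ratio(channel_cost: Sequence[float],
                                capacity: Sequence[float],
                                tol: float = 0.01) -> float:
    """Share of channels p with |total cost_p - kappa_p| <= tol * kappa_p."""
    ok = sum(1 for cost, kappa in zip(channel_cost, capacity)
             if abs(cost - kappa) <= tol * kappa)
    return ok / len(capacity)


def budget_satisfactory_ratio(spent: Sequence[float],
                              allocated: Sequence[float]) -> float:
    """Share of entries whose realized spend stays within the allocated a_p^h."""
    ok = sum(1 for cost, budget in zip(spent, allocated) if cost <= budget)
    return ok / len(allocated)
```

Note that the capacity check is two-sided (under-use of a channel also counts as a miss), while the budget check is one-sided.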

Figure 4: Hyper-parameter tuning. (a) Impact of $w_{b}$. (b) Impact of $N$. The capacity satisfactory ratio is calculated as the average of $\mathds{1}_{\left|\sum_{m=1}^{M}Cost(a^{h}_{m,p})-\kappa_{p}\right|\leq\epsilon}$ over all channels, and the budget satisfactory ratio is calculated as the average of $\mathds{1}_{\sum_{i=1}^{I}Cost(a^{l}_{p,i})\leq a^{h}_{p}}$ over all advertisers.
TABLE III: Ablation Study
| Level | Method | Impr | Click | ROI | CPC | CSR |
|---|---|---|---|---|---|---|
| - | HiBid | 0.03% | 10.93% | 4.53% | -6.14% | 8.94% |
| High level | w/o batch loss | -4.50% | -1.28% | 3.44% | -0.03% | 2.76% |
| High level | w/o budget allocation | -5.93% | -1.97% | 4.32% | -0.88% | 2.02% |
| Low level | w/o $\lambda$-generalization | -3.35% | -1.14% | 5.21% | -1.71% | 4.67% |
| Low level | w/o CPC-AS | -2.70% | 3.99% | -2.88% | 6.47% | -6.54% |
| Low level | w/o $\lambda$-generalization & CPC-AS | -6.44% | 8.43% | -7.90% | 12.26% | -9.87% |
Figure 5: Offline performance comparison with 5 baselines in terms of $Click$ and $CSR$ improvements. Panels: (a) $Click$; (b) $CSR$.
Figure 6: Offline performance comparison with 5 baselines in terms of $ROI$, $Impr$ and $CPC$ improvements. Panels: (a) $ROI$; (b) $Impr$; (c) $CPC$.

VI-C Ablation Study

We gradually remove four key components from the high-level planner and the low-level executor in HiBid, to verify the benefits brought by each component. The results are shown in Table III.

We first show the benefits of batch loss and budget allocation in the high-level planner. When the batch loss module is removed, $CPC$ increases by 6.11 percentage points (from -6.14% to -0.03%), and $CSR$ drops significantly from 8.94% to 2.76%. We also present the specific changes in $Impr$ and $CPC$ across the four channels in Table IV. We observe that when the batch loss module is removed, due to the limited supply from high-quality channels (i.e., Channel 1), advertisers exhibit a stronger investment intention (seen as an increased $Impr$), which results in a higher advertising cost (higher $CPC$) on high-quality channels. However, the summed $Impr$ over the four channels decreases and the overall average $CPC$ increases, which is unfavorable for both the advertisers' investment and the platform itself. $Click$ decreases from 10.93% to -1.97% when the entire budget allocation module is removed. In this case, the low-level executor bids for each incoming request without a budget, leading unsuitable advertisers to take away and waste the click opportunities on that channel.

TABLE IV: Impact of batch loss
| Metric | Method | Channel 1 | Channel 2 | Channel 3 | Channel 4 |
|---|---|---|---|---|---|
| $Impr$ | HiBid | 0.01% | 0.03% | 0.03% | 0.04% |
| $Impr$ | w/o batch loss | 7.64% | -7.28% | -8.85% | 8.56% |
| $CPC$ | HiBid | -3.65% | -5.88% | -7.29% | -10.86% |
| $CPC$ | w/o batch loss | 6.76% | -1.67% | -2.24% | -1.65% |

We further observe the impact of $\lambda$-generalization and CPC-AS on the low-level bidding strategy. Compared to HiBid without $\lambda$-generalization, $Click$ and $CSR$ of HiBid improve by 12.07 and 4.27 percentage points, respectively (10.93% vs. -1.14% and 8.94% vs. 4.67% in Table III). This is because the low-level executor accurately adapts its bidding strategy to the allocated budget and avoids taking away the budget originally assigned to other channels, which improves advertising performance. When CPC-AS is removed, $CSR$ drops significantly from 8.94% to -6.54%, since the higher bids result in more clicks but have a negative impact on the advertisers' expectations. When both $\lambda$-generalization and CPC-AS are removed, $Impr$ and $ROI$ drop drastically, which confirms the benefits of introducing $\lambda$-generalization and CPC-AS as contributions of this paper.

Figure 7: HiBid deployment in Meituan advertising platform.

VI-D Offline Performance Comparison

We compare HiBid with six baselines:

  1. CBRL [8]: A curriculum-guided Bayesian reinforcement learning (CBRL) framework with an indicator-augmented reward function that adaptively controls the constraint-objective trade-off for ROI-constrained single-channel bidding. It is considered the state-of-the-art approach. For a fair comparison, we adapt CBRL to the c3-bidding problem by replacing the ROI constraint with the CPC constraint while keeping the original training process.

  2. OPAL [13]: Its high-level agent is trained by unsupervised learning, which provides a temporal abstraction for the low-level agent to improve offline policy optimization. It is considered the state-of-the-art hierarchical offline DRL method. We adapt it by training a CVAE for budget allocation with the same network structure as HiBid.

  3. MCQ [15]: It alleviates the value overestimation that occurs for OOD actions by actively training and correcting their Q values. We consider it the state-of-the-art approach in offline DRL.

  4. PRDC [39]: Another offline DRL method with a policy regularization mechanism. It employs dataset constraints with nearest-neighbor retrieval, which allow the policy to choose better actions that do not appear in the dataset while still maintaining sufficient conservatism for OOD actions.

  5. CEM [40]: The Cross-Entropy Method (CEM) is a popular evolutionary algorithm. We treat the c3-bidding problem as a black-box optimization problem that aims to maximize the number of clicks under the total budget and CPC constraints. Note that we always choose the policy set that is close to $CPC_{m}^{set}$ for policy updating in a CEM iteration.

  6. PID [23]: The Proportional-Integral-Derivative (PID) controller is a classical, widely adopted feedback controller that performs well in unknown environments. We keep the advertiser's current CPC close to the target $CPC_{m}^{set}$ to satisfy the advertiser's cross-channel CPC constraint.
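To make the PID baseline concrete, the sketch below shows a textbook discrete PID controller driven by the gap between the target and the realized CPC. It is a generic illustration rather than the tuned controller used in the experiments; the class name `CpcPid`, the gains, and the interpretation of the output as an additive bid adjustment are assumptions.

```python
class CpcPid:
    """Textbook discrete PID on the error between target and realized CPC."""

    def __init__(self, kp: float, ki: float, kd: float) -> None:
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, cpc_set: float, cpc_now: float) -> float:
        """Return a bid adjustment; a negative value means bid lower."""
        error = cpc_set - cpc_now
        self.integral += error                 # accumulated error (I term)
        derivative = error - self.prev_error   # error change (D term)
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

When the realized CPC exceeds the target, the error is negative and the controller pushes the bid down. Because a single target is tracked, such a controller cannot by itself account for CPC levels that differ across channels.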

Since OPAL, MCQ and PRDC do not by themselves enforce the CPC and budget constraints, we account for both explicitly through a reward function, derived from Meituan's past online service experience, that maximizes $Click$:

$$\big(1+w_{x}*\overline{CPC}*(1+CPC_{m}^{set})\big)*r_{i}^{l}-w_{y}*c^{i}_{l}, \qquad (24)$$

where $\overline{CPC}$ is the calculated average CPC, and $w_{x}=1.35$ and $w_{y}=1.31$ are manually set weights. Results are shown in Figure 5 and Figure 6. By combining budget allocation and constrained bidding, HiBid enhances the matching efficacy between ads and users, thereby improving $CPC$ and $CSR$ without taking extra impressions away from others. $Impr$ is influenced by the average bidding price decided by the proposed low-level executor, since other advertising products compete for these ad requests. The slight fluctuation in HiBid's $Impr$ indicates that the average bidding price from HiBid stays close to the baseline. This is attributed to the proposed batch loss, which constrains the bidding expenditure of advertisers on different channels, as demonstrated in Table IV. The second-best method is CBRL, with 3.62% and 3.24% improvements over the online solution R-BCQ in terms of $Click$ and $CSR$, respectively. CBRL gradually learns a bidding strategy that satisfies the constraints through curriculum learning, but it still performs worse than HiBid due to the lack of accurate budget allocation. MCQ actively updates OOD state-action pairs using the experiences from the dataset, and the learned conservative strategy might miss out on some high-revenue actions. Therefore, as the number of model training steps increases, $CSR$ slightly improves, but $ROI$ remains essentially unchanged. PRDC performs marginally below MCQ, as the nearest-neighbor-based regularization cannot effectively guide policy iterations in the highly non-stationary markets presented in the log. OPAL achieves a 0.66% improvement in $Impr$, but its $Click$ still declines. The reason is that its budget allocation strategy is learned through supervised learning and thus lacks proper adjustments for undesirable budget allocation experiences. Due to limitations in policy representation capability, CEM cannot accommodate advertisers' requirements, and its $CSR$ decreases to -5.32%. Although PID dynamically adjusts bidding prices based on the current CPC, it does not take into account the inconsistent average CPC across channels and forces all channels to the same level. Therefore, PID obtains a 10.15% improvement in $CSR$ but performs poorly in all other metrics.
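For reference, the baseline reward of Eqn. (24) can be written as a one-line function. The Python sketch below mirrors the equation term by term; the function and argument names are chosen here for readability, with `r` standing for $r_{i}^{l}$ and `cost` for $c^{i}_{l}$.

```python
def shaped_reward(r: float, cost: float, avg_cpc: float, cpc_set: float,
                  w_x: float = 1.35, w_y: float = 1.31) -> float:
    """Eqn. (24): (1 + w_x * avg_cpc * (1 + cpc_set)) * r - w_y * cost."""
    return (1.0 + w_x * avg_cpc * (1.0 + cpc_set)) * r - w_y * cost
```

With illustrative values `r=1.0`, `cost=0.5`, `avg_cpc=0.2` and `cpc_set=1.0`, the multiplier is $1 + 1.35 \cdot 0.2 \cdot 2 = 1.54$ and the penalty is $1.31 \cdot 0.5 = 0.655$, giving a reward of 0.885.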

Figure 8: Distribution of allocated budget between HiBid and optimum in four channels.
TABLE V: Online A/B Testing
| Method | $Impr$ | $Click$ | $ROI$ | $CPC$ | $CSR$ | TP999 (ms) |
|---|---|---|---|---|---|---|
| HiBid | 0.15% | 11.20% | 5.15% | -6.63% | 9.02% | 33.8 |
| CBRL [8] | 0.21% | 3.15% | 2.16% | -1.89% | 3.79% | 29.1 |
| MCQ [15] | 0.92% | 2.67% | -0.30% | 0.52% | 1.89% | 28.4 |
| PRDC [39] | 0.75% | 2.39% | -0.56% | 0.50% | 1.91% | 27.9 |
| OPAL [13] | 0.66% | -0.58% | -5.98% | -4.70% | -2.01% | 31.1 |
| CEM [40] | -0.03% | 2.49% | -6.61% | -1.76% | -4.39% | - |
| PID [41] | -0.82% | -14.45% | -2.07% | 0.63% | 9.38% | - |

VI-E Online A/B Testing on Meituan Advertising Platform

HiBid and the six baselines are validated on the Meituan advertising platform, whose system architecture is shown in Figure 7. Ad requests are generated when users use Meituan apps, and each request goes through four modules in the advertising platform:

Retrieval module. Each ad request corresponds to a browsing event from a user, so the advertising platform retrieves a set of advertisers for this request based on the current user behavior and historical statistics. The selected advertisers have the chance to participate in the ad request auction.

Bid module. The advertising platform sends remote procedure call (RPC) requests to the RTB system to obtain the bidding price of each selected advertiser for this ad. The RTB system then constructs request-level and advertiser-level features for online prediction and returns the bidding price to the advertising platform.

Rank module. When all the advertisers' bidding prices are obtained, the advertising platform ranks the bidding advertisers in descending order of their bidding price. The advertiser with the highest bid wins this ad and has the chance to display it.

Price module. The simulator deploys a GSP auction with a CPC pricing scheme identical to that of the online advertising platform: the displayed advertiser is charged the second-highest bid in this ad auction, and only if the user clicks the ad.

Subsequently, the advertising platform will display ads to users based on the auction results, and record user feedback (e.g., whether they click on the ad or make an order) in Meituan’s Data Center. The daily bidding logs are automatically processed and stored in the Feature Center for high-level and low-level training. After the offline training, the trained model is synchronized to the RTB system for bidding services. This allows the high-level planner and low-level executor of our proposed HiBid to rapidly iterate and adapt to the changing environment, leading to performance improvement in large-scale advertising systems in practice.
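The rank and price modules described above can be sketched in a few lines of Python. This is a simplified illustration with a single ad slot; the function names are hypothetical, and charging 0.0 when only one advertiser bids is an assumption made here (the text does not state how that case is priced).

```python
from typing import Dict, Optional, Tuple


def run_auction(bids: Dict[str, float]) -> Optional[Tuple[str, float]]:
    """Rank by bid (descending); the winner's price is the second-highest bid."""
    if not bids:
        return None
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    # ASSUMPTION: with a single bidder the price falls back to 0.0.
    price = ranked[1][1] if len(ranked) > 1 else 0.0
    return winner, price


def charge(price: float, clicked: bool) -> float:
    """CPC pricing: the advertiser pays only when the user clicks."""
    return price if clicked else 0.0
```

For bids `{"a": 1.2, "b": 2.0, "c": 0.7}`, advertiser `b` wins and would pay 1.2 on a click, and nothing if the impression is not clicked.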

Online experiments are conducted using A/B testing for two weeks, and the results are shown in Table V. HiBid outperforms all other baselines on most metrics, with improvements of 11.20% in $Click$ and 5.15% in $ROI$. PID achieves a slightly higher $CSR$, but it performs poorly in the other metrics. In addition, we show the TP999 (completion time for 99.9% of requests) of each model in Table V. The optimal $\lambda$-selection and CPC-AS mechanisms slightly increase the inference time of HiBid, which remains acceptable.

VI-F Synthetic Dataset Validation

Currently, RTB-related research efforts have not offered a publicly available dataset for performance comparison. Thus, to ease fair algorithm comparisons and code reproduction for the research community, we develop a cross-channel constrained bidding simulator to generate the synthetic dataset and make it available online. It emulates the ad display process of the advertising platform depicted in Fig. 7. We implement the core modules as a simplified version of the advertising system, including retrieval, bid, rank, and price. Furthermore, we simulate ad requests and user feedback based on the distribution statistics obtained from Meituan’s online production system, as:

TABLE VI: Synthetic Dataset Validation

| Method | $Impr$ | $Click$ | $ROI$ | $CPC$ | $CSR$ |
|---|---|---|---|---|---|
| HiBid | 0% | 15.25% | 6.38% | -8.13% | 14.53% |
| CBRL | 0% | 5.65% | 1.80% | -6.16% | 8.9% |
| MCQ | 0% | 3.95% | 0.86% | -5.67% | 7.84% |
| PRDC | 0% | 3.65% | 0.78% | -5.35% | 7.41% |
| OPAL | 0% | 1.13% | -3.75% | -0.74% | 3.61% |

Request and advertiser simulation. The simulator first randomly generates a set of advertisers, which are categorized into several types that represent the varying business conditions observed in the online statistics. Each advertiser has a total budget, expected CPC, historical CTR, CVR, GMV, etc., which are sampled from a Gaussian distribution determined by its category. Then, the simulator initializes the full set of ad requests, each of which belongs to one of $P$ channels. The number of ad requests and the arrival distribution of each channel are kept relatively consistent with the online platform. The simulator replays all the ad requests and uses the base bidding strategy (i.e., a CTR-based strategy that bids the predicted CTR multiplied by $CPC_{m}^{set}$) to conduct auctions.
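The following minimal sketch illustrates the advertiser sampling and the CTR-based base bid described above. It is not the released simulator: the field names, the Gaussian parameters, the use of historical CTR as the predicted CTR, and the highest-bid-wins rule (with no retrieval or pricing module) are all simplifying assumptions.

```python
import random


def make_advertisers(n, rng):
    """Sample hypothetical advertisers; all distribution parameters are placeholders."""
    advertisers = []
    for m in range(n):
        advertisers.append({
            "id": m,
            "budget": max(1.0, rng.gauss(100.0, 20.0)),
            "cpc_set": max(0.1, rng.gauss(1.0, 0.2)),
            # Historical CTR doubles as the predicted CTR in this sketch.
            "ctr": min(0.5, max(0.001, rng.gauss(0.05, 0.01))),
        })
    return advertisers


def base_bid(adv):
    """CTR-based base strategy: bid = predicted CTR x CPC_m^set."""
    return adv["ctr"] * adv["cpc_set"]


def run_auction(advertisers, spent):
    """Return the highest base bidder among advertisers with budget left, or None."""
    eligible = [a for a in advertisers if spent[a["id"]] < a["budget"]]
    if not eligible:
        return None
    return max(eligible, key=base_bid)


rng = random.Random(0)
advertisers = make_advertisers(5, rng)
spent = {a["id"]: 0.0 for a in advertisers}
num_channels = 3  # plays the role of P
requests = [rng.randrange(num_channels) for _ in range(10)]  # channel of each request
winners = [run_auction(advertisers, spent) for _ in requests]
```

Because no budget is consumed in this sketch, the same advertiser wins every request; a fuller version would charge the winner and update `spent` after each auction.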

User feedback simulation. After the ad auction finishes, the simulator samples the user feedback on the displayed ads from multiple Gaussian distributions, including whether the user clicks, whether the user places an order, and the order amount. We then update the advertiser’s daily statistics (i.e., real-time CTR, CVR, budget consumption, the number of clicks, etc.) to simulate the real-time features on the platform. Both the user feedback and the auction process are recorded in the dataset for offline model training.
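A hedged sketch of the feedback step follows. Note one substitution: the binary click and order events are drawn here as Bernoulli trials with the advertiser's CTR and CVR as probabilities, and only the order amount is Gaussian, whereas the paper only says feedback is sampled from Gaussian distributions. Charging the cost only on a click, and every name and parameter below, are likewise assumptions.

```python
import random


def sample_feedback(ctr, cvr, mean_amount, rng):
    """Sample (clicked, ordered, amount) for one displayed ad."""
    clicked = rng.random() < ctr
    ordered = clicked and rng.random() < cvr
    # Order amount: Gaussian around a hypothetical mean, floored at zero.
    amount = max(0.0, rng.gauss(mean_amount, 0.1 * mean_amount)) if ordered else 0.0
    return clicked, ordered, amount


def update_stats(stats, clicked, ordered, amount, cost_per_click):
    """Update an advertiser's running daily statistics in place."""
    stats["impressions"] += 1
    stats["clicks"] += int(clicked)
    stats["orders"] += int(ordered)
    stats["gmv"] += amount
    stats["spend"] += cost_per_click if clicked else 0.0  # pay-per-click assumption
    stats["ctr"] = stats["clicks"] / stats["impressions"]
    stats["cvr"] = stats["orders"] / stats["clicks"] if stats["clicks"] else 0.0
    return stats


def empty_stats():
    return {"impressions": 0, "clicks": 0, "orders": 0,
            "gmv": 0.0, "spend": 0.0, "ctr": 0.0, "cvr": 0.0}


rng = random.Random(0)
stats = empty_stats()
for _ in range(1000):
    clicked, ordered, amount = sample_feedback(0.05, 0.2, 30.0, rng)
    update_stats(stats, clicked, ordered, amount, cost_per_click=1.0)
```

The running `ctr` and `cvr` fields stand in for the real-time features that the bidding model would read at the next request.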

Our synthetic dataset is publicly available online (https://drive.google.com/drive/folders/11TmSXZFtwiXhy1kyQdvEvzc5Mu-cHI7S) and includes 5,000 advertisers bidding for 3,500,000 ad requests over 7 days. During model evaluation, we first restore and replay all ad requests. We then use the RL-based bidding strategy instead of the base strategy to participate in the auctions and summarize the bidding results to evaluate model performance. Results are shown in Table VI. Since there is no external competition in the simulator, the total number of impressions remains unchanged. HiBid consistently outperforms the other baselines, improving $CSR$ while bringing more clicks for advertisers.
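To make the evaluation summary concrete, the sketch below computes a CPC satisfactory ratio and a relative change against a reference strategy from replayed results. It assumes that an advertiser counts as satisfied when its realized CPC (cost divided by clicks) does not exceed its set CPC, and that advertisers without clicks are excluded; neither convention is confirmed by this section, and the input numbers are made up.

```python
def cpc_satisfactory_ratio(results):
    """Fraction of advertisers whose realized CPC is within their set CPC.

    results: iterable of (cost, clicks, cpc_set) tuples, one per advertiser.
    Advertisers with zero clicks are skipped (an assumption of this sketch).
    """
    with_clicks = [(cost, clicks, cpc_set)
                   for cost, clicks, cpc_set in results if clicks > 0]
    if not with_clicks:
        return 0.0
    satisfied = sum(1 for cost, clicks, cpc_set in with_clicks
                    if cost / clicks <= cpc_set)
    return satisfied / len(with_clicks)


def relative_change(new, base):
    """Relative change of a metric with respect to a reference value."""
    if base == 0:
        raise ValueError("reference value must be non-zero")
    return (new - base) / base


# Hypothetical per-advertiser outcomes: (cost, clicks, cpc_set).
replayed = [(10.0, 10, 1.5), (30.0, 10, 2.0), (0.0, 0, 1.0)]
csr = cpc_satisfactory_ratio(replayed)
click_lift = relative_change(120.0, 100.0)
```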

VI-G HiBid Performance Verification Upon Optimality

VI-G1 Bias Incurred during CPC-AS Derivation

To verify the bias between our proposed $CPC_{m}^{pred}(a_{i}^{l})$ in Eqn. (22) and $CPC_{m}^{real}$, we sample 13,633,825 trajectories from the low-level dataset $\mathcal{D}^{l}$ and calculate the bias with different $\gamma^{l}\in[0,1]$. The mean absolute percentage error $\mathrm{MAPE}(\gamma^{l})$ is then calculated by:

$$\mathrm{MAPE}(\gamma^{l})=\mathrm{average}\bigg(\frac{|CPC_{m}^{pred}(\gamma^{l},a_{i}^{l})-CPC_{m}^{real}|}{CPC_{m}^{real}}\bigg|_{\tau^{l}}\bigg), \tag{25}$$

and the obtained curve is shown in Figure 9. We observe that $\mathrm{MAPE}(\gamma^{l})$ decreases rapidly as $\gamma^{l}$ increases, and the difference is less than $0.001$ and thus can be ignored when $\gamma^{l}=0.999$.
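The averaging in Eqn. (25) can be sketched as below, assuming the predicted and real CPC values have already been computed per trajectory for a fixed $\gamma^{l}$. The prediction itself (Eqn. (22)) is not reproduced here, and the function name and sample values are hypothetical.

```python
def mape(predicted, real):
    """Mean absolute percentage error over paired per-trajectory CPC values."""
    if len(predicted) != len(real) or not real:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(r == 0 for r in real):
        raise ValueError("real CPC must be non-zero for a percentage error")
    return sum(abs(p - r) / r for p, r in zip(predicted, real)) / len(real)


# Hypothetical CPC pairs for two trajectories at one value of gamma^l.
cpc_pred = [1.1, 1.8]
cpc_real = [1.0, 2.0]
error = mape(cpc_pred, cpc_real)  # each trajectory is off by about 10%
```

Repeating this computation over a grid of $\gamma^{l}$ values would yield a curve of the kind plotted in Figure 9.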

Figure 9: MAPE between $CPC_{m}^{pred}(a_{i}^{l})$ and $CPC_{m}^{real}$ in Eqn. (25).

VI-G2 Optimality Performance of HiBid in Small Scale Dataset

Because millions of advertisers bid for billions of ad requests every day, the solution space grows exponentially, and it is impossible to obtain a whole day’s information about incoming requests in advance, which makes it difficult to obtain an optimal solution to the considered c3-bidding problem. However, the high-level budget allocation problem can be solved by optimization methods if the number of advertisers is not too large and the information of all participating advertisers is known a priori. To measure the performance gap between the budget allocation strategy obtained by HiBid and the optimum, we construct a small-scale dataset containing 2,000 advertisers and obtain the optimal budget allocation through a solver [42]. The distributions of the allocated budgets are shown in Figure 8, and the MAPE and RMSE between HiBid and the optimum are $4.76\%$ and $1.09$, respectively. HiBid’s allocation is therefore close to the optimum on this dataset, which supports its use in complex and dynamic real-world bidding environments.
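As an illustration of how the two gap metrics relate, the sketch below compares a learned allocation with a reference allocation entry by entry. It assumes both are flat lists of budgets in the same order and units; the granularity of the comparison in the paper (for example, per advertiser or per advertiser-channel pair) is not stated here, and the numbers are invented.

```python
import math


def allocation_gap(learned, optimal):
    """Return (MAPE, RMSE) between two budget allocations of equal length."""
    if len(learned) != len(optimal) or not optimal:
        raise ValueError("allocations must be non-empty and of equal length")
    if any(o == 0 for o in optimal):
        raise ValueError("reference budgets must be non-zero for MAPE")
    n = len(optimal)
    # MAPE is relative to the reference; RMSE stays in budget units.
    mape = sum(abs(l - o) / o for l, o in zip(learned, optimal)) / n
    rmse = math.sqrt(sum((l - o) ** 2 for l, o in zip(learned, optimal)) / n)
    return mape, rmse


# Hypothetical budgets for two allocation entries.
gap_mape, gap_rmse = allocation_gap([90.0, 110.0], [100.0, 100.0])
```

MAPE is scale-free, while RMSE depends on the budget unit, so the two numbers are best read together.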

VII Concluding Remark

In this paper, we propose HiBid, a hierarchical offline DRL framework for cross-channel bidding with budget allocation. HiBid makes three contributions on top of the state-of-the-art offline DRL approach MCQ: (a) an auxiliary batch loss to alleviate advertiser competition in high-quality channels, (b) $\lambda$-generalization for an adaptive constrained bidding strategy in response to changing budgets, and (c) a CPC-guided action selection scheme for improving the cross-channel CPC satisfactory ratio. Both offline experiments and online A/B testing on the Meituan advertising platform show that HiBid outperforms six baselines. HiBid has been deployed online and already serves tens of thousands of advertisers every day. Furthermore, some works [43, 44] have introduced auction mechanisms for resource allocation (e.g., VMs) in cloud computing. In such auction-based allocation, a hierarchical architecture can also be employed to improve resource distribution efficiency, benefiting both the platform and its users. While the primary application of our work is ad display, the proposed hierarchical architecture has the potential to serve other applications that require resource allocation through auctions.

References

  • [1] Q. Feng, D. He, M. Luo, X. Huang, and K.-K. R. Choo, “Eprice: An efficient and privacy-preserving real-time incentive system for crowdsensing in industrial internet of things,” IEEE Transactions on Computers, pp. 1–15, 2023.
  • [2] H. Zhang, H. Jiang, B. Li, F. Liu, A. V. Vasilakos, and J. Liu, “A framework for truthful online auctions in cloud computing with heterogeneous user demands,” IEEE Transactions on Computers, vol. 65, no. 3, pp. 805–818, 2016.
  • [3] Z. Zheng, Y. Gui, F. Wu, and G. Chen, “Star: Strategy-proof double auctions for multi-cloud, multi-tenant bandwidth reservation,” IEEE Transactions on Computers, vol. 64, no. 7, pp. 2071–2083, 2015.
  • [4] Y. Chen, P. Berkhin, B. Anderson, and N. R. Devanur, “Real-time bidding algorithms for performance-based display ad allocation,” in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011, pp. 1307–1315.
  • [5] H. R. Varian, “Online ad auctions,” American Economic Review, vol. 99, no. 2, pp. 430–34, May 2009.
  • [6] H. Zhu, J. Jin, C. Tan, F. Pan, Y. Zeng, H. Li, and K. Gai, “Optimized cost per click in taobao display advertising,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 2191–2200.
  • [7] H. Cai, K. Ren, W. Zhang, K. Malialis, J. Wang, Y. Yu, and D. Guo, “Real-time bidding by reinforcement learning in display advertising,” in Proceedings of the tenth ACM international conference on web search and data mining, 2017, pp. 661–670.
  • [8] H. Wang, C. Du, P. Fang, S. Yuan, X. He, L. Wang, and B. Zheng, “Roi-constrained bidding via curriculum-guided bayesian reinforcement learning,” in Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 4021–4031.
  • [9] S. Xiao, L. Guo, Z. Jiang, L. Lv, Y. Chen, J. Zhu, and S. Yang, “Model-based constrained mdp for budget allocation in sequential incentive marketing,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 971–980.
  • [10] N. Alon, I. Gamzu, and M. Tennenholtz, “Optimizing budget allocation among channels and influencers,” in Proceedings of the 21st international conference on World Wide Web, 2012, pp. 381–388.
  • [11] H. A. Taha, Operations research: an introduction.   Prentice Hall, Upper Saddle River, NJ, 2003, vol. 7.
  • [12] E. Altman, Constrained Markov decision processes: stochastic modeling.   Routledge, 1999.
  • [13] A. Ajay, A. Kumar, P. Agrawal, S. Levine, and O. Nachum, “Opal: Offline primitive discovery for accelerating offline reinforcement learning,” in International Conference on Learning Representations, 2021.
  • [14] J. Gehring, G. Synnaeve, A. Krause, and N. Usunier, “Hierarchical skills for efficient exploration,” Advances in Neural Information Processing Systems, vol. 34, pp. 11553–11564, 2021.
  • [15] J. Lyu, X. Ma, X. Li, and Z. Lu, “Mildly conservative q-learning for offline reinforcement learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 1711–1724, 2022.
  • [16] G. Liao, Z. Wang, X. Wu, X. Shi, C. Zhang, Y. Wang, X. Wang, and D. Wang, “Cross dqn: Cross deep q network for ads allocation in feed,” in Proceedings of the ACM Web Conference, 2022, pp. 401–409.
  • [17] Y. Zhang, B. Tang, Q. Yang, D. An, H. Tang, C. Xi, X. Li, and F. Xiong, “Bcorle ($\lambda$): An offline reinforcement learning and evaluation framework for coupons allocation in e-commerce market,” Advances in Neural Information Processing Systems, vol. 34, pp. 20410–20422, 2021.
  • [18] W. Zhang, S. Yuan, and J. Wang, “Optimal real-time bidding for display advertising,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 1077–1086.
  • [19] T. Zhou, H. He, S. Pan, N. Karlsson, B. Shetty, B. Kitts, D. Gligorijevic, S. Gultekin, T. Mao, J. Pan, J. Zhang, and A. Flores, “An efficient deep distribution network for bid shading in first-price auctions,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2021, pp. 3996–4004.
  • [20] W. Zhang, B. Kitts, Y. Han, Z. Zhou, T. Mao, H. He, S. Pan, A. Flores, S. Gultekin, and T. Weissman, “Meow: A space-efficient nonparametric bid shading algorithm,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2021, pp. 3928–3936.
  • [21] K. Ren, W. Zhang, K. Chang, Y. Rong, Y. Yu, and J. Wang, “Bidding machine: Learning to bid for directly optimizing profits in display advertising,” IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 4, pp. 645–659, 2018.
  • [22] D. Wu, X. Chen, X. Yang, H. Wang, Q. Tan, X. Zhang, J. Xu, and K. Gai, “Budget constrained bidding by model-free reinforcement learning in display advertising,” in Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 1443–1451.
  • [23] X. Yang, Y. Li, H. Wang, D. Wu, Q. Tan, J. Xu, and K. Gai, “Bid optimization by multivariable control in display advertising,” in Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery and data mining, 2019, pp. 1966–1974.
  • [24] R. Gao, H. Xia, J. Li, D. Liu, S. Chen, and G. Chun, “Drcgr: Deep reinforcement learning framework incorporating cnn and gan-based for interactive recommendation,” in IEEE International Conference on Data Mining, 2019, pp. 1048–1053.
  • [25] R. Xie, S. Zhang, R. Wang et al., “Hierarchical reinforcement learning for integrated recommendation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 5, 2021, pp. 4521–4528.
  • [26] C.-C. Lin, K.-T. Chuang, W. C.-H. Wu, and M.-S. Chen, “Combining powers of two predictors in optimizing real-time bidding strategy under constrained budget,” in Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 2016, pp. 2143–2148.
  • [27] Y. He, X. Chen, D. Wu, J. Pan, Q. Tan, C. Yu, J. Xu, and X. Zhu, “A unified solution to constrained bidding in online display advertising,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2021, pp. 2993–3001.
  • [28] J. Zhao, G. Qiu, Z. Guan, W. Zhao, and X. He, “Deep reinforcement learning for sponsored search real-time bidding,” in Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and data mining, 2018, pp. 1021–1030.
  • [29] A. Kumar, J. Fu, M. Soh, G. Tucker, and S. Levine, “Stabilizing off-policy q-learning via bootstrapping error reduction,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [30] N. Jaques, A. Ghandeharioun, J. H. Shen, C. Ferguson, A. Lapedriza, N. Jones, S. Gu, and R. Picard, “Way off-policy batch deep reinforcement learning of implicit human preferences in dialog,” arXiv preprint arXiv:1907.00456, 2019.
  • [31] S. Fujimoto, D. Meger, and D. Precup, “Off-policy deep reinforcement learning without exploration,” in International conference on machine learning, 2019, pp. 2052–2062.
  • [32] J. García and F. Fernández, “A comprehensive survey on safe reinforcement learning,” Journal of Machine Learning Research, vol. 16, no. 42, pp. 1437–1480, 2015.
  • [33] D. Ding, K. Zhang, T. Basar, and M. Jovanovic, “Natural policy gradient primal-dual method for constrained markov decision processes,” Advances in Neural Information Processing Systems, vol. 33, pp. 8378–8390, 2020.
  • [34] H. Le, C. Voloshin, and Y. Yue, “Batch policy learning under constraints,” in International Conference on Machine Learning, 2019, pp. 3703–3712.
  • [35] C. Tessler, D. J. Mankowitz, and S. Mannor, “Reward constrained policy optimization,” in International Conference on Learning Representations, 2019.
  • [36] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [37] K. Sohn, H. Lee, and X. Yan, “Learning structured output representation using deep conditional generative models,” Advances in neural information processing systems, vol. 28, 2015.
  • [38] S. Paternain, L. Chamon, M. Calvo-Fullana, and A. Ribeiro, “Constrained reinforcement learning has zero duality gap,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [39] Y. Ran, Y.-C. Li, F. Zhang, Z. Zhang, and Y. Yu, “Policy regularization with dataset constraint for offline reinforcement learning,” arXiv preprint arXiv:2306.06569, 2023.
  • [40] P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, “A tutorial on the cross-entropy method,” Annals of operations research, vol. 134, no. 1, pp. 19–67, 2005.
  • [41] K. H. Ang, G. Chong, and Y. Li, “Pid control system analysis, design, and technology,” IEEE Transactions on Control Systems Technology, vol. 13, no. 4, pp. 559–576, 2005.
  • [42] MindOpt, “Mindopt studio,” 2022. [Online]. Available: https://opt.aliyun.com
  • [43] L. Mashayekhy, M. M. Nejad, D. Grosu, and A. V. Vasilakos, “An online mechanism for resource allocation and pricing in clouds,” IEEE Transactions on Computers, vol. 65, no. 4, pp. 1172–1184, 2016.
  • [44] X. Zhu, C. Chen, L. T. Yang, and Y. Xiang, “Angel: Agent-based scheduling for real-time tasks in virtualized clouds,” IEEE Transactions on Computers, vol. 64, no. 12, pp. 3389–3403, 2015.