1 Introduction
Web search engines usually rank results in descending order of their relevance scores. Assuming that users browse search results sequentially from top to bottom on search engine result pages (SERPs), ranking the relevant results at the top positions reduces users’ efforts in locating useful information. For example, learning-to-rank (LTR) methods can adopt either relevance annotations or user click feedback to train a ranking model. Although this established approach has achieved much success in improving search ranking performance, it faces two important challenges as search techniques evolve.
First, the sequential browsing hypothesis no longer holds in today’s heterogeneous search scenarios, where images, videos, news, and even applications are aggregated with the traditional 10 blue links in a unified result list. Previous studies have revealed that users’ attention may be drawn toward non-textual information items [
66] and vertical results [
15,
71], which alters users’ behavior and leads to a non-sequential examination sequence. Second, the interactions among search results are largely ignored by existing methods. The selection and ranking of search results depend not only on their own relevance but also on the context, such as the complementarity and redundancy of search results [
81].
Regarding the preceding two issues, some existing research efforts formulate the ranking problem as a
Markov decision process (MDP) and solve it with the
reinforcement learning (RL) paradigm [
60,
67,
75,
76,
81]. The RL framework is well suited to modeling the interaction effects among search results: although more complex than linear, relevance-based ranking methods, it optimizes ranking performance globally while taking the context into account.
However, training RL agents remains a major challenge. The main problem is how to evaluate the re-ranked result list and design reward functions for the RL agents. Both offline and online evaluation methods have been adopted to measure the performance of information systems. Offline evaluation is based on relevance annotations and usually suffers from expensive annotation costs and search intent mismatches between assessors and practical search engine users. Online evaluation treats user feedback, such as clicks, as an indicator of satisfaction. However, exposing users to poor-quality ranking lists harms the user experience, and it takes time to collect enough trials and rewards from users to evaluate the ranking policy.
To avoid harming online system performance and to train the RL agent more efficiently, a number of existing works have adopted a simulation approach to online evaluation [
67]. To model user interactions with practical search systems, click models such as UBM [
25], DCM [
32], and DBN [
13] represent user behaviors as sequences of observable and hidden events. These works are mainly based on the
probabilistic graphical model (PGM) framework and handcrafted behavior assumptions. However, most click models have been designed for homogeneous result lists. User interactions with search systems are much more complex in heterogeneous search scenarios. In this work, we propose two novel user simulators to mimic fine-grained user behavior:
Context-aware Click Simulator (CCS) and
Fine-grained User Behavior Simulator with GAN (UserGAN).
The first simulator, CCS, is based on a hierarchical structure comprising the session level and the result level. It predicts click probability according not only to the multimodal contents of the search result but also to the previous click and the SERP context. The interactions among search results can be implicitly learned through the sequential prediction process. CCS is simple in structure and efficient to train, but it can only predict the click/skip signal of the user browsing process. However, user behavior on SERPs is closely correlated with time and space. The click event triggered by a user is associated with both temporal signals (e.g., the click time since query submission) and spatial signals (e.g., the position and area of the search result clicked). Existing studies have shown that the time between user browsing actions is an important signal to indicate users’ satisfaction in Web search [
4].
To take temporal information into account, we propose the second simulator, UserGAN, to simulate fine-grained user behavior. We use the multidimensional spatio-temporal signal (described in Section
1) associated with the click event to provide a comprehensive understanding of user behavior patterns. We assume that user behaviors obey an implicit distribution. The user simulator aims to learn and generate an indistinguishable user behavior distribution from the genuine one. As GANs have shown great success in a wide range of areas, including computer vision [
46,
63,
83], natural language processing [
26,
78], and
information retrieval (IR) tasks [
35,
49,
72], we choose GAN to generate a simulated user behavior distribution through adversarial learning. To represent the multidimensional spatio-temporal signals of user behavior, we also design a modulation-demodulation process. The modulator encodes the spatial and temporal signals with a Gaussian distribution, while the demodulator recovers the fine-grained user interaction signals with existing constrained optimization algorithms. UserGAN can simulate click times along with clicks, which enables accurate online evaluation for heterogeneous Web search systems.
Based on the user simulators, we propose a
User Behavior Simulation for Reinforcement Learning (UBS4RL) framework for result re-ranking in search engines. As shown in Figure
1, UBS4RL consists of three modules. First, a feature extractor (JRE [
80] or TreeNN [
48]) fuses the visual, textual, and structural information of vertical results to obtain a dense representation. The extracted result feature is delivered to the user simulator and the ranking agent for subsequent processing. In this work, we adopt TreeNN as the feature extractor. It embeds the visual and textual information into an HTML parse tree, which can better leverage the structural information. Second, a user simulator (CCS [
81] or UserGAN) exploits user behavior patterns from search logs and mimics fine-grained user feedback including click and time information. The user simulator interacts with the ranking agent and provides rewards to guide the updating of the ranking policy. Third, a ranking agent (RLRanker [
81]) formulates search result ranking as an MDP. At each step
\(t\), the ranking agent chooses a result from the candidates (
actions) to place at position
\(t\) in the ranking list. By regarding the already placed results as the current
state and setting a proper immediate
reward, the MDP ranking model can take the interactions among search results into consideration. In this work, the MDP problem is solved with the policy gradient algorithm called
REINFORCE [
68].
During training, the ranking agent generates re-ranked result lists and obtains feedback from the simulation environment (user simulator). Through a large number of trials, the ranking agent can ultimately learn an optimal ranking policy. For testing, the optimal ranking agent is used to iteratively fill results into an empty list from top to bottom for the re-ranking task. The user simulator is not used in this stage, as we do not need rewards to update the ranking agent.
Specifically, the contributions of the proposed UBS4RL framework can be summarized as follows:
•
Two different user simulators are constructed with historical search logs to mimic fine-grained user behaviors. The user simulators provide reliable online evaluation for ranking optimization, which enables low-cost and efficient online training of RL agents for ranking tasks.
•
A unified UBS4RL re-ranking framework is proposed to take the interactions among search results into consideration for re-ranking. It involves three subtasks: feature extraction of heterogeneous search results, ranking evaluation by simulated users, and ranking policy optimization with RL agents. The framework has the potential to be further extended to a wide range of optimization tasks for different information systems.
•
By conducting experiments on both synthetic and practical datasets, we show significant improvements in search re-ranking performance, which verifies the effectiveness of the proposed re-ranking framework.
The rest of the paper is organized as follows. We describe related work in Section
2. Section
3 formally introduces the two user simulators: CCS and UserGAN, while Section
4 introduces the ranking agent. The experiment settings and results are presented in Sections
5 and
6, respectively. Finally, we conclude this paper and discuss future work in Section
7. The notations used in this paper are summarized in Table
1.
4 UBS4RL: Ranking Agent
To model the interactions among search results, we recast the ranking problem as a sequential filling game (as shown in Figure
1). At the initial timestep, the result list is empty and all of the results are in the candidate set. Based on the query information, one search result is chosen for the first position. Conditioned on the results already placed in the ranking list, we iteratively compare the candidate results and choose the most suitable one for the next position.
The ranking problem can be formulated as an MDP, which is described by a tuple \(\lbrace \mathcal {S}, \mathcal {A}, \mathcal {T}, \mathcal {R}, \gamma \rbrace\). \(\mathcal {S}\) denotes the state space, and \(\mathcal {A}\) denotes the action space. \(\mathcal {T}: \mathcal {S} \times \mathcal {A} \rightarrow \mathcal {S}\) is the transition function \(\mathcal {T}(s_{t+1}|s_t, a_t)\) that generates the next state \(s_{t+1}\) from the current state \(s_t\) and action \(a_t\). \(\mathcal {R}: \mathcal {S} \times \mathcal {A} \rightarrow \mathbb {R}\) is the reward function, and the reward at the \(t\)th timestep is \(r_{t}=\mathcal {R}(s_t, a_t)\). \(\gamma \in [0,1]\) is the discount factor for future rewards. Formally, the MDP components are specified with the following definitions:
State \(s\) is the global information of the search results already ranked in the list. At step \(t\), \(s_t = \lbrace q, o_{i_j}|j=1,2,\ldots ,t-1\rbrace\), where \(o_{i_j}\) is the search result ranked at position \(j\) by the policy. At the initial time, \(s_0=\lbrace q\rbrace\) is the query information.
Action \(a\) is a search result that the ranking agent chooses for the next position in the ranking list. At step \(t\), \(a_t=o_{i_t}\), where \(o_{i_t}\) is the search result ranked at position \(t\) by the policy.
Transition \(\mathcal {T}\) changes the state of the ranking list, adding one search result to the end of the list at each step.
Reward \(\mathcal {R}\) is the reward function that can be the online users’ feedback. In this work, the reward is given by the simulation environment based on simulated user feedback. The reward function is introduced in Section
4.2 in detail.
In this article, the MDP problem is solved with the policy gradient algorithm of REINFORCE [
68]. At each step
\(t\), the policy
\(\pi (a_t|s_t)\) defines the probability of the sampling action
\(a_t \in \mathcal {A}\) in state
\(s_t \in \mathcal {S}\). The sampling strategy can be either choosing an action randomly or the one with the highest probability. These two strategies are denoted as “RandomSample” and “MaxSample,” respectively. The aim of RL is to learn an optimal policy
\(\pi ^*\) by maximizing the expected cumulative reward
\(R_t = \mathbb {E}[\sum _{k=0}^{\infty }{\gamma ^{k}r_{t+k}}]\).
Formally, let \(S=(q, o_1, o_2, \ldots , o_N)\) denote a query session, where \(q\) is the query, \(o_j\) is the \(j\)th search result in the original ranking list of the search engine, and \(N\) is the number of results in the session. The ranking agent re-ranks the result list to \(\hat{S}=(q, o_{i_1}, o_{i_2}, \ldots , o_{i_N})\), where \((i_1,i_2,\ldots , i_N)\) is a permutation of \((1,2,\ldots , N)\). The user simulator samples interaction feedback \(U = (u_{i_1}, u_{i_2}, \ldots , u_{i_N})\) on the re-ranked list \(\hat{S}\). For CCS, \(u_i=\lbrace c_i\rbrace\), whereas for UserGAN, \(u_i=\lbrace c_i, \tau _i\rbrace\), where \(c_i\) is the click behavior of the \(i\)th search result (1 for click and 0 for skip) and \(\tau _i\) is the click time of the \(i\)th search result. The time starts from the moment the user enters the session. The reward at each step is given by \(r_t = \mathcal {R}(U)\).
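The following minimal sketch illustrates one such re-ranking episode under the MDP above. The `policy`, `user_simulator`, and `reward_fn` interfaces are placeholders standing in for the policy network (Section 4.1), CCS/UserGAN, and the reward functions (Section 4.2); they are not the paper's actual APIs.

```python
def run_episode(query, results, policy, user_simulator, reward_fn, gamma=1.0):
    """Sketch of one re-ranking episode: sequentially fill the list, then
    query the (simulated) user for feedback and compute per-step returns."""
    candidates = list(results)     # action space at step 0
    ranked = []                    # state s_t: the results already placed (plus the query)
    trajectory = []                # (state, action) pairs, later used by REINFORCE
    while candidates:
        action = policy.choose(query, ranked, candidates)  # pick one result for the next position
        trajectory.append((list(ranked), action))
        ranked.append(action)                              # transition: append to the end of the list
        candidates.remove(action)
    feedback = user_simulator.simulate(query, ranked)      # clicks (and click times for UserGAN)
    rewards = [reward_fn(feedback, t) for t in range(len(ranked))]
    returns, g = [], 0.0                                   # R_t = sum_k gamma^k * r_{t+k}
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    return trajectory, returns
```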
4.1 Policy Network
The structure of the policy network is shown in Figure
5. The deep neural network architecture can learn policies from high-dimensional raw input data in a complex RL environment. Specifically, the state feature is extracted by a single-direction GRU. The search results that have been placed at proper positions are input to the GRU sequentially according to their rankings. The last hidden state of the GRU is adopted as the state feature for RL. The state feature is then concatenated with the candidate result features and input to a
multi-layer perceptron (MLP) to assess the probability that each candidate action is chosen at this step. Then, the action is sampled according to different strategies (RandomSample or MaxSample) as the next result added to the ranking list. At step
\(t\), there are
\(t\) and
\(N-t\) results in the ranking list and candidate set, respectively, where
\(N\) is the number of results in the query session. The policy network can be formulated as follows:
where
\(\lbrace i_1,i_2,\ldots , i_N\rbrace\) is a permutation of
\(\lbrace 1,2,\ldots , N\rbrace\), MLP denotes the multi-layer perceptron, the sampling strategy can be RandomSample or MaxSample as described earlier,
\(a_t\) is the chosen action in timestep
\(t\), and
\(h_0\) is the query feature
\(f_q\).
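A minimal sketch of this policy network is given below, assuming the query feature dimension equals the GRU hidden size; the layer sizes and names are ours, not the paper's.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Sketch of the described policy: a single-direction GRU encodes the
    already-ranked results into a state feature, which is concatenated with
    each candidate feature and scored by an MLP; a softmax over the candidates
    gives the action distribution pi(a_t | s_t)."""
    def __init__(self, feat_dim=128, hidden_dim=128):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim + feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1))

    def forward(self, query_feat, ranked_feats, candidate_feats):
        # h_0 is the query feature f_q, as stated in the text.
        h = query_feat.view(1, 1, -1)
        if ranked_feats.numel() > 0:                       # (t, feat_dim) results placed so far
            _, h = self.gru(ranked_feats.unsqueeze(0), h)
        state = h.squeeze(0).expand(candidate_feats.size(0), -1)
        scores = self.mlp(torch.cat([state, candidate_feats], dim=-1)).squeeze(-1)
        return torch.softmax(scores, dim=-1)               # distribution over the N - t candidates

def sample_action(probs, strategy="RandomSample"):
    """RandomSample draws from the distribution; MaxSample takes the argmax."""
    if strategy == "MaxSample":
        return int(torch.argmax(probs))
    return int(torch.multinomial(probs, 1))
```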
4.2 Reward Design
The user simulator serves as the simulation environment for the ranking agent. Given the re-ranked list of search results, the user simulator predicts user interaction feedback, such as clicks and click times. Numerous online evaluation metrics can be designed as the rewards based on the user feedback. In this work, for user clicks, we adopt a typical online metric (CTR) and two kinds of typical offline evaluation metrics (MRR and DCG (discounted cumulative gain)) [
64] for reward design. All of these metrics are modified to utilize the sequential clicks as reward signals and are capable of giving rewards at each RL step. The rewards used in the experiments include CTR@3, CTR@5, CTR@10, DCG@3, DCG@5, DCG@10, and MRR. The rewards based on user clicks at timestep
\(t\) are defined as follows:
For user click times, we adopt FCT (time to first click), LCT (time to last click), and ACT (average click time) [
18] as the rewards. The rewards based on user click times at timestep
\(t\) are defined as follows:
Here,
\(N\) is the number of results in a query session,
and \(I(\cdot)\) is the indicator function, with \(I(c_i=1)=1\) if \(c_i=1\) and 0 otherwise.
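The paper's step-wise reward formulas are not reproduced above; as a hedged reference, the following sketch shows the corresponding list-level metrics computed from a simulated click sequence (and click times), using common conventions that may differ in detail from the paper's exact definitions.

```python
import math

def ctr_at_k(clicks, k):
    """Fraction of clicked results among the top k (clicks are 0/1)."""
    return sum(clicks[:k]) / k

def dcg_at_k(clicks, k):
    """DCG over the top k with binary clicks as gains (one common convention)."""
    return sum(c / math.log2(i + 2) for i, c in enumerate(clicks[:k]))

def mrr(clicks):
    """Reciprocal rank of the first click; 0 if there is no click."""
    for i, c in enumerate(clicks):
        if c:
            return 1.0 / (i + 1)
    return 0.0

def fct_lct_act(clicks, click_times):
    """Time to first click, time to last click, and average click time."""
    times = [t for c, t in zip(clicks, click_times) if c]
    if not times:
        return None, None, None
    return min(times), max(times), sum(times) / len(times)
```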
4.3 RL Training
A trajectory
\(\tau = (s_0, a_0, s_1, a_1, \ldots , s_N, a_N)\) is sampled according to the policy
\(\pi\) in each episode. The episode terminates when the ranking list is filled with results. The objective of the training process is to maximize
\(J(\theta)\).
Equation (
52) can be approximated by a Monte Carlo estimator [
41]:
where
\(\theta\) represents the parameters of the policy network,
\(M\) is the number of samples, and
\(N\) is the number of results in a session.
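The Monte Carlo estimator referred to above is not reproduced here; for reference, the standard REINFORCE form consistent with the surrounding definitions is \(\nabla _\theta J(\theta) \approx \frac{1}{M}\sum _{m=1}^{M}\sum _{t=1}^{N} R_t^{(m)}\, \nabla _\theta \log \pi _\theta (a_t^{(m)} \mid s_t^{(m)})\), where \(R_t^{(m)} = \sum _{k \ge 0}{\gamma ^{k} r_{t+k}^{(m)}}\) is the cumulative reward of the \(m\)th sampled trajectory.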
Pretraining of the policy network. Training an RL model from scratch is known to be ineffective. As the original ranking list has already achieved generally satisfactory performance, we pretrain the policy network to output the original rankings. At step \(t\), \(s_t=\lbrace o_i | i=1,2,\ldots ,t-1 \rbrace , a_t = o_t\). The training objective is to maximize \(\pi _\theta (a_t|s_t)\). The policy network is trained in the form of supervised learning.
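One way to write this supervised objective in the notation above (ours, not reproduced from the paper) is \(\mathcal {L}_{\mathsf {pretrain}}(\theta) = -\sum _{t=1}^{N} \log \pi _\theta (a_t = o_t \mid s_t)\), i.e., the cross-entropy between the policy’s distribution over candidates and the original ranking order.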
When training the ranking agent, the parameters of the feature extractor and the user simulator are all fixed to provide a stable environment. In this manner, we can make sure that the training of the ranking agent does not change the simulation environment or the result representations. This is important because if we do not fix the simulation environment, the performance improvement may be caused by the user simulator rather than the ranking agent.
6 Experiments
The proposed UBS4RL framework aims to improve the ranking performance by adopting a user simulator as the simulation environment for the ranking agent. To demonstrate the effectiveness of this framework, we conduct a series of experiments to address the following research questions:
•
RQ1: Can user simulators capture the important properties of practical search engine users?
•
RQ2: Can the ranking framework optimize online evaluation metrics effectively in the simulation environment?
•
RQ3: How does the ranking framework perform on a practical data environment?
6.1 Evaluating User Simulators (RQ1)
6.1.1 Click Simulation Performance.
Click probability prediction. To evaluate how well different models predict user click probabilities, we compare them in terms of perplexity and log-likelihood, as in most click model studies [
13,
21,
25,
32,
50]. A lower perplexity and a higher log-likelihood indicate that the user simulator can predict users’ click behavior more accurately. The click models and the two user simulators (CCS and UserGAN) are trained on the training set of Real 2017 and tested on the testing sets of Real 2017 and Real 2018. From the results in Table
3, we can note the following.
First, conventional click models perform well in click probability prediction. However, for newly emerging search requests in Real 2018, there is a sharp decline in performance because these models can only deal with repeated query-document pairs. This limits the practical application of click models in commercial search engines: millions of new search requests are issued every day, and it is not practical to retrain click models on large-scale search logs frequently.
Second, CCS achieves performance comparable to the click models. CCS predicts user click probabilities based on the context information, the previous click, and the result contents. For emerging query-document pairs in Real 2018, CCS can predict the click probability based on the multimodal content information of the search results, which gives it stronger generalization ability.
Third, UserGAN learns the implicit click pattern distribution of users from search logs and exhibits strong generalization ability. Because the click pattern distribution of users is consistent across different queries and documents, UserGAN performs well on emerging query-document pairs in Real 2018 with only a slight performance decline. We can also see that UserGAN cannot outperform the click models in predicting click probabilities. This may be because UserGAN learns both the temporal and spatial signals of click events, which is much more difficult than learning only users’ click/skip behavior. It needs to predict not only how likely a user is to click a search result but also when the click will occur. As the temporal, spatial, and click probability information are modulated in a single carrier wave using Equation (25), errors in click time prediction also affect the click probability prediction and vice versa.
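For reference, a minimal sketch of how per-position click perplexity is commonly computed in click-model evaluation (the standard convention from the cited click-model literature; the function and variable names are ours):

```python
import math

def click_perplexity(predictions, clicks):
    """predictions[s][i]: predicted click probability of the result at rank i in session s;
    clicks[s][i]: observed click (1) or skip (0).  Returns one perplexity value per rank
    position, averaged over sessions; lower is better."""
    n_pos = len(predictions[0])          # assumes all sessions have the same length
    perplexities = []
    for i in range(n_pos):
        log_likelihood = 0.0
        for preds, obs in zip(predictions, clicks):
            p = min(max(preds[i], 1e-10), 1 - 1e-10)    # clip for numerical safety
            log_likelihood += math.log2(p) if obs[i] else math.log2(1 - p)
        perplexities.append(2 ** (-log_likelihood / len(predictions)))
    return perplexities
```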
Click sequence prediction. As revealed in previous studies [
18], online evaluation metrics align well with user satisfaction in today’s heterogeneous search scenarios. To measure how closely different click models and the two user simulators can simulate users’ actual clicks, we evaluate them in terms of online evaluation metrics such as CTR, DCG, and MRR. We train the models on Real 2017 and evaluate them on Real 2018. For each session in the search logs, we sample a click sequence with click models or the user simulator, then compute the online metrics. The
root mean squared error (RMSE) of the predicted click-related online metrics compared with the search logs is shown in Table
4. Lower RMSE values indicate better performance. From the results, we can note the following.
First, the performance of click models varies considerably. Some models, such as DCM and UBM, perform well on click sequence prediction, whereas others, like MCM and NCM, deviate substantially from the search logs. Second, CCS performs the best on MRR and achieves performance comparable to the click models on the other online metrics, such as CTR and DCG. Third, UserGAN performs the best in terms of CTR@3,5,10, which are among the most widely used online metrics. On DCG@3,5,10, although UserGAN does not make the best prediction, its performance is very close to that of the best click model, DCM, and ranks second among all of the baselines. On MRR, UserGAN also performs close to the best models, CCS and DCM.
Overall, the online metrics predicted by UserGAN are close to those of the search logs, which indicates its capability of simulating user click behavior. As we use different online metrics as the rewards to train the ranking agent, we adopt UserGAN as the user simulator in the following experiments.
6.1.2 Time Simulation Performance.
Click time is an important aspect of users’ fine-grained behaviors, as described in the work of Chen et al. [
18]. Among all of the click models and the two user simulators, UserGAN is the only one that can simulate user click time corresponding to each click event. We compare UserGAN with CATM (a context-aware time model) [
5] (see brief introduction in Section
5.3). As the Weibull probability density function best describes time distributions according to Borisov et al. [5], we adopt the Weibull distribution in CATM.
The performance of click time simulation is evaluated in terms of three click time related online metrics: FCT, LCT, and ACT [
18]. The RMSE of the predicted click time related online metrics compared with the search logs is shown in Table
5. Lower RMSE values indicate better performance. From the results, we can note that the click time simulated by UserGAN is close to that of the search logs. On both the Real 2017 and Real 2018 datasets, UserGAN outperforms CATM in terms of FCT, LCT, and ACT. The results show that besides click prediction, UserGAN can also accurately predict the time corresponding to each click event. We will show in the following experiments that the predicted time information helps considerably in both ranking evaluation and optimization.
6.1.3 Online Evaluation with Click Time.
For the online evaluation, the user simulator should be able to judge which ranking of search results is better to guide the ranking algorithm optimization. In this section, we analyze whether the click time can help to provide a more accurate online evaluation.
Five click models and UserGAN are trained on Real 2017 and tested on Real 2018. There are a total of 1,239 queries in Real 2018. For each query and the corresponding result list in Real 2018, we assess three different ranking strategies: the original ranking of the search engine (“Original”), the reversed ranking (“Reverse”), and the randomly shuffled ranking (“Random”). For each query, we compare three pairs of ranking lists: Original with Reverse, Original with Random, and Random with Reverse. Preferences with regard to ranking list pairs are assessed through a commercial crowdsourcing company. Given two different ranking lists (with the same search results) of a query side by side, the assessors grade the ranking pair with a five-level preference criterion (from –2 to +2, where a higher absolute value indicates a stronger preference). For the sake of fairness, result lists are randomly positioned on the two sides. Each ranking list pair has three assessments. The average score is regarded as the final preference (a positive score indicates the right ranking list is better, a negative score indicates the left one is preferred, and zero indicates no preference). The Cohen’s
\(\kappa\) [
23] of preference assessments is 0.6281.
We sample 100 click sequences with UserGAN or the click models for each ranking list and compute different online evaluation metrics, including CTR@3,5,10, DCG@3,5,10, and MRR, based on the predicted clicks. For UserGAN, each click has a corresponding click time signal. Clicks that occur early may be more indicative of user preference. Thus, click time is adopted as a threshold to filter out less important clicks for UserGAN. The ranking list with a higher predicted online evaluation metric is regarded as the better one. The accuracy of the predicted preference relative to the annotated one is shown in Figure
6. From the results, we can note the following.
First, UserGAN outperforms the click models on Original with Reverse, Original with Random, and Random with Reverse ranking pairs by a large margin in most cases. UserGAN is sensitive to different ranking strategies and can well predict which one users prefer. The click behavior simulated by UserGAN is more differentiated compared to the other click models and thus gives stronger evidence for preference judgment in online evaluation.
Second, the difficulty of judging different ranking strategy pairs differs. The highest accuracy achieved by UserGAN on Original with Reverse is nearly 0.6, whereas it is between 0.56 and 0.58 for Original with Random. The accuracy on Random with Reverse is the lowest, approximately 0.45 to 0.48. This shows that it is easier to judge which ranking is better for Original with Reverse pairs because there is a greater difference between them. The preference between Random with Reverse pairs is harder to predict, which may be because both rankings are of low quality.
Third, UserGAN is more stable on different evaluation metrics for preference judgments. Click models that perform well based on one evaluation metric may perform poorly on others. For example, UBM-Layout performs the best on CTR@10 among click models. However, it does not achieve the best performance on the other metrics. In contrast, UserGAN is reliable on all metrics in most cases.
Fourth, by jointly considering click/skip behavior and click time, the performance of UserGAN can be improved substantially. For CTR@3,5,10, the accuracy achieved by using an appropriate time threshold compared to using all clicks (click time threshold = 600 seconds) can be improved by \(2.88\%\) to \(15.50\%\). For MRR, the best performance is achieved when all the clicks are considered. Different online evaluation metrics have different sensitivity to the click time threshold. This verifies that user clicks are not equally important for online metric evaluation. With click time information, fine-grained user behavior can better indicate user preferences, which has great potential for more accurate online evaluation.
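As an illustration of the click time thresholding described above, the following sketch compares two rankings of the same results by a time-filtered CTR@k; the threshold sweep and the 600-second "keep all clicks" setting follow the description above, while the function and variable names are ours.

```python
def preference_by_ctr(clicks_a, times_a, clicks_b, times_b, k=3, time_threshold=600.0):
    """Compare two rankings of the same results by CTR@k, counting only clicks
    that occur no later than the time threshold (seconds); 600 keeps all clicks."""
    def filtered_ctr(clicks, times):
        kept = [c for c, t in zip(clicks[:k], times[:k])
                if c and t is not None and t <= time_threshold]
        return len(kept) / k
    ctr_a, ctr_b = filtered_ctr(clicks_a, times_a), filtered_ctr(clicks_b, times_b)
    if ctr_a > ctr_b:
        return "A"
    if ctr_b > ctr_a:
        return "B"
    return "tie"
```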
6.1.4 Qualitative Analysis of UserGAN.
We randomly sample 10,000 carrier waves from the original search logs and the corresponding simulated carrier waves generated by UserGAN for the same search sessions. The distributions of carrier waves are shown in Figure
7. The genuine and simulated carrier waves show similar characteristics. First, the Gaussian distributions are steeper in the top positions of the SERP and gentler toward the bottom. In Equation (
25), a smaller click time
\({\tau }\) makes the wave steeper, whereas a larger click time makes it gentler. The distribution indicates that users tend to click the top-ranked search results early and the bottom-ranked ones later. This is consistent with previous findings that there exists position bias [
29,
30] in the user browsing process. Second, the height of the wave crest decreases with position, which indicates descending click probabilities.
The consistency in click times and click probabilities between the genuine and simulated carrier waves shows that UserGAN has effectively captured the hidden click pattern in the user behavior and can generate similar user behavior based on specific SERP information.
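To make this qualitative description concrete, the following is an illustrative rendering of such a carrier wave (not the paper's Equation (25)): one Gaussian bump per result whose crest height encodes the click probability and whose width grows with the click time, so early clicks at top positions yield steep crests. The width-time mapping below is an assumption made for illustration.

```python
import numpy as np

def carrier_wave(click_probs, click_times, x):
    """Illustrative carrier wave over SERP positions 1..N: one Gaussian bump per
    result; crest height = click probability, width grows with click time."""
    wave = np.zeros_like(x, dtype=float)
    for pos, (p, tau) in enumerate(zip(click_probs, click_times), start=1):
        sigma = 0.1 * (1.0 + np.log1p(tau))   # assumed monotone width-time mapping
        wave += p * np.exp(-(x - pos) ** 2 / (2 * sigma ** 2))
    return wave

# Example: steep, high crests at the top; gentler, lower crests toward the bottom.
x = np.linspace(0, 11, 1100)
wave = carrier_wave([0.6, 0.4, 0.2], [3.0, 10.0, 40.0], x)
```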
6.1.5 Discussion about Click Models.
The traditional click models, such as UBM, DCM, and DBN, have several major limitations. First, traditional click models are mainly based on the PGM framework and handcrafted search behavior hypotheses. They use a binary random variable to indicate whether the user clicks or skips a search result and can only predict user click behavior. It is a non-trivial task to incorporate additional random variables for other user behaviors, such as the click time predicted by UserGAN, the viewport time, and user scroll actions. Second, traditional click models can only deal with repeated query-result pairs. For an unseen query-result pair, they fall back to default values for prediction. For example, the relevance is set to 0.5 (ranging from 0 to 1) for search results that never appear in historical logs. This makes traditional click models impractical for real search systems, as new search results emerge on the Web every moment; for this large volume of new results, traditional click models lose efficacy. Third, the contents of search results have changed from pure texts to multimodal information sources in modern search engines, such as texts, images, videos, and applications. Multimodal contents largely affect user behavior on SERPs. To predict user clicks or click times, it is important to take result contents into consideration. However, PGM-based click models are incapable of exploiting result contents for prediction, whereas neural models can handle the multimodal contents of search results well.
6.1.6 Summary for Research Question 1.
From the preceding experiments, we can see that the proposed user simulators UserGAN and CCS can capture the important properties of practical search engine users in the following ways:
•
For click simulation, UserGAN and CCS achieve performance comparable to click models in click probability prediction. UserGAN achieves superior performance in simulating click sequences, producing values close to the search logs on different click-related online evaluation metrics.
•
For time simulation, UserGAN can simulate user click times along with clicks at the same time. The simulated click time can help to more accurately distinguish between valuable and noisy click events. By considering time information, we can better judge different ranking strategies, which substantially helps ranking evaluation and optimization.
•
By exploiting the consistent distribution of user behaviors from search logs, UserGAN shows strong generalization ability on emerging search requests. This is valuable for search engines dealing with ever-changing information on the Web.
6.2 Evaluation with a Synthetic Dataset (RQ2)
To verify that the proposed UBS4RL framework can optimize a range of online evaluation metrics, we conduct a simulation study with synthetic users, similar to those conducted in other works [
2,
75]. A simulation rule is designed according to prior knowledge and assumptions to generate simulated user feedback based on relevance annotations, vertical types, and statistics of historical search logs. In this way, we create a synthetic user that can (1) generate simulated search logs for original ranking lists and (2) produce
ground truth user feedback on re-ranked lists to evaluate the effectiveness of the ranking policy.
6.2.1 Simulation Rule.
In this work, we extend the simulation rule in the work of Ai et al. [
2] to simulate user clicks as well as click times at the same time for heterogeneous search scenarios. The extended rule consists of two parts: click simulation and click time simulation.
Click simulation. As the vertical bias substantially affects the users’ browsing process, we incorporate it into the click simulation rule presented in the work of Ai et al. [
2]. The extended assumption is that users click a search result
\(o\) belonging to query
\(q\) (
\(c_q^o = 1\)) only when it is both examined (
\(e_q^o=1\)) and perceived as relevant (
\(r_q^o=1\)) and, at the same time, its vertical type is one that users prefer to click (
\(v_q^o=1\)).
\(e_q^o, r_q^o, v_q^o\), and
\(c_q^o\) are all Bernoulli random variables, and we sample these variables by the following formulations.
\(e_q^o\). The position bias
\({\eta }\), estimated through eye-tracking experiments presented in the work of Joachims et al. [
36], is adopted to sample
\(e_q^o\).
\(\alpha \in [0, +\infty)\) controls the severity of the position biases; it is set to 2.0 in our experiment.
\(i\) denotes the ranking position of the search result.
\(r_q^o\). The relevance annotations are utilized to sample
\(r_q^o\).
\(y \in [0, 3]\) is the four-level relevance label for result
\(o,\) and
\(y_\mathsf {max}\) is 3 in our dataset. The parameter
\(\epsilon\) introduces click noise into the click decision process: a search result that is not very relevant to the query still has a small but positive probability of being clicked.
\(\epsilon\) is set to 0.2 in our experiment.
\(v_q^o\). The vertical bias
\({\omega }\) is estimated on a large number of practical search logs. There are 19 different result types in
SRR. The average CTR for each result type is computed as
\(\omega =\mathsf {AverageCTR}(v)\), where
\(v\) denotes the vertical type of the search result. The statistical results show that the variances of CTR values belonging to the same type across different queries are small (most of them are less than 0.01). This indicates that the vertical bias holds for a wide variety of searches.
\(\mu \in [0, +\infty)\) controls the severity of the vertical bias, which is set to 0.5 in our experiment.
\(c_q^o\). The final clicks are sampled according to the product of the probabilities of
\(e_q^o, r_q^o, v_q^o\). To control the severity of position bias and vertical bias, we need to adjust the hyperparameters
\(\alpha\) and
\(\mu\), which can cause small values for
\(P(e_q^o=1)\) and
\(P(v_q^o=1)\). Thus,
\(P(e_q^o=1)\) and
\(P(v_q^o=1)\) are linearly mapped to
\([0.3,1]\) in our experiment. The click probability of result
\(o\) is given by
Time simulation. To simulate the click time for each click event, we first analyze click time distributions in historical search logs. For each ranking position, click times
\(\mathbf {\tau }\) in search logs are first mapped to
\(\mathbf {\tau ^{\prime }} \in [0,1]\) by
\(\mathbf {\tau ^{\prime }} = \frac{\log (\mathbf {\tau })-\log (\tau _\mathsf {min})}{\log (\tau _\mathsf {max})-\log (\tau _\mathsf {min})}\). Then, they are fitted with a Beta distribution [
3]. The Beta distribution for ranking position
\(i\) is denoted as
\(\mathcal {B}_i\), which is shown in Figure
8. For a search result
\(o\) ranked at position
\(i\), if
\(c_q^{o}=1\), we then randomly sample a click time from
\(\mathcal {B}_i\) as
\(\tau _q^o\).
With Equations (
59) and (
60), we can sample clicks and corresponding click times at the same time for any given result list. The simulation rule is regarded as a synthetic user, who can give feedback on result lists and generate synthetic search logs.
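A minimal sketch of such a synthetic user follows. The relevance-click form follows the convention of Ai et al. [2]; the position-bias values, Beta parameters, and normalization range are placeholders (the paper takes them from eye-tracking data [36], per-vertical CTR statistics, and the fitted distributions in Figure 8), and the exact equations are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder constants for illustration (the paper uses alpha=2.0, epsilon=0.2,
# mu=0.5; eta from eye-tracking [36]; omega from per-vertical average CTR; and
# one Beta distribution fitted per rank position).
ETA = np.array([0.68, 0.61, 0.48, 0.39, 0.30, 0.27, 0.24, 0.21, 0.19, 0.18])
ALPHA, EPSILON, MU, Y_MAX = 2.0, 0.2, 0.5, 3
TAU_MIN, TAU_MAX = 1.0, 600.0            # assumed click-time normalization range (seconds)
BETA_PARAMS = [(2.0, 5.0)] * 10          # placeholder Beta(a, b) per rank position

def rescale(p, lo=0.3, hi=1.0):
    """Linear map to [0.3, 1], as described for P(e) and P(v)."""
    return lo + (hi - lo) * p

def synthetic_feedback(labels, vertical_ctr):
    """Sample clicks and click times for one ranked list (up to 10 results).
    labels: 4-level relevance labels y in [0, 3]; vertical_ctr: omega per result."""
    clicks, times = [], []
    for i, (y, omega) in enumerate(zip(labels, vertical_ctr)):
        p_e = rescale(ETA[i] ** ALPHA)                                   # position bias
        p_r = EPSILON + (1 - EPSILON) * (2 ** y - 1) / (2 ** Y_MAX - 1)  # relevance (Ai et al. form)
        p_v = rescale(omega ** MU)                                       # vertical bias
        c = int(rng.random() < p_e * p_r * p_v)
        clicks.append(c)
        if c:
            a, b = BETA_PARAMS[i]
            t_norm = rng.beta(a, b)                                      # tau' in [0, 1]
            tau = np.exp(t_norm * (np.log(TAU_MAX) - np.log(TAU_MIN)) + np.log(TAU_MIN))
            times.append(float(tau))
        else:
            times.append(None)
    return clicks, times
```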
6.2.2 Online Click Metric Optimization.
As the search engine needs to handle both repeated queries and emerging queries that have never appeared in search logs, we train the ranking framework UBS4RL on the Seen Set and subsequently evaluate it on both the previously
seen queries (Seen Set) and
unseen queries (Unseen Set) as described in Section
5.1. The rule-based synthetic user introduced in Section
6.2.1 is regarded as the practical search engine user. The synthetic user interacts with the re-ranked result list and evaluates the ranking performance in terms of a collection of online metrics to simulate actual online testing.
We also adopt the synthetic user as the simulation environment to simulate actual online training. The UBS4RL ranking framework using the synthetic user for training is denoted as \(\mbox{UBS4RL}_\mathsf {Oracle}\). The experimental results of \(\mbox{UBS4RL}_\mathsf {Oracle}\) indicate the upper bound of ranking performance achievable with any environment (actual online or simulated) under the adopted RL algorithm. In addition, the number of different rankings of a result list containing \(N\) search results is \(N!\). By traversing all possible rankings of a result list and comparing the feedback (online evaluation metrics) provided by the synthetic user, we can find the optimal ranking of the search results. The optimal ranking performance is denoted as Oracle. The Oracle result indicates the upper bound of ranking performance achievable by any method.
Table
6 shows the performance of the ranking framework when optimizing online metrics on the Seen and Unseen datasets. The first row (denoted as Log) shows the performance of the original ranking lists with clicks sampled by the synthetic user (the click labels of the Simulation datasets).
\(\mbox{UBS4RL}_\mathsf {CCS}\) denotes the ranking performance using CCS as the simulation environment, whereas
\(\mbox{UBS4RL}_\mathsf {UserGAN}\) denotes the same for UserGAN. From the results, we can note the following.
First, \(\mbox{UBS4RL}_\mathsf {CCS}\) and \(\mbox{UBS4RL}_\mathsf {UserGAN}\) can both improve the online evaluation metrics over the original rankings by using the target online metric as the reward, including CTR@3,5,10, DCG@3,5,10, and MRR. This shows the potential capability of the UBS4RL ranking framework to optimize a wide range of objectives for commercial search engines as long as a proper reward is designed. \(\mbox{UBS4RL}_\mathsf {UserGAN}\) is slightly better than \(\mbox{UBS4RL}_\mathsf {CCS}\) on the Seen Set and substantially better on the Unseen Set. The results show that UserGAN has a stronger generalization ability than CCS. This is because UserGAN directly exploits the click pattern distribution of user behaviors, which is quite consistent across different search scenarios.
Second, \(\mbox{UBS4RL}_\mathsf {CCS}\) and \(\mbox{UBS4RL}_\mathsf {UserGAN}\) achieve performance close to that of \(\mbox{UBS4RL}_\mathsf {Oracle}\). The results show that training the ranking agent with the user simulator is comparable to training with practical online users. This verifies the effectiveness of the proposed user simulators serving as the simulation environment for RL agent training. We also note that \(\mbox{UBS4RL}_\mathsf {UserGAN}\) and \(\mbox{UBS4RL}_\mathsf {CCS}\) sometimes outperform \(\mbox{UBS4RL}_\mathsf {Oracle}\). Although \(\mbox{UBS4RL}_\mathsf {Oracle}\) is better than \(\mbox{UBS4RL}_\mathsf {UserGAN}\) and \(\mbox{UBS4RL}_\mathsf {CCS}\) in theory, the variability in the training and prediction processes introduces some uncertainty.
Third, the Oracle value is much better than \(\mbox{UBS4RL}_\mathsf {Oracle}\) as well as \(\mbox{UBS4RL}_\mathsf {CCS}\) and \(\mbox{UBS4RL}_\mathsf {UserGAN}\). Given that the user simulator already compares favorably with practical users for RL training, the results show that there is still large room for improving the RL algorithm itself. We leave the exploration of more effective RL algorithms within the ranking framework to future work.
6.2.3 Online Time Metric Optimization.
UserGAN can simulate user clicks as well as click times. In this section, we optimize some click time related online evaluation metrics, including FCT, LCT, and ACT. The performance is also evaluated by the synthetic user. Figure
9 shows that the click time metrics can be decreased steadily during training on both Seen Set and Unseen Set. Table
7 shows the performance of optimizing FCT, LCT, and ACT. We can see that on Seen Set and Unseen Set, the metrics can all be decreased by approximately 1 second. Smaller FCT values indicate that users tend to make the first click earlier, as they can find the most desired search result more easily. Smaller LCT values indicate that users spend less time on the SERP, as they can obtain the useful information in a relatively short time. Smaller ACT values also show that the browsing process is accelerated, which makes information seeking more efficient. The results show that by optimizing click time related online metrics, the ranking performance is indeed improved.
6.2.4 Summary for Research Question 2.
The preceding experiments show that the proposed UBS4RL framework can optimize different online evaluation metrics effectively in the simulation environment:
•
UBS4RL can significantly improve click related online evaluation metrics over the original rankings by using the objective evaluation metric as the reward. This shows the potential capability of the UBS4RL ranking framework to optimize a wide range of objectives for commercial search engines as long as a proper reward is designed.
•
By adopting UserGAN as the simulation environment, we can directly optimize click time related online evaluation metrics, such as FCT, LCT, and ACT. The click time metrics can be decreased steadily during training, which makes the user information seeking process more efficient and effective.
•
Optimizing UBS4RL in the simulation environment achieves performance close to that of training with the synthetic user, which can be regarded as practical online learning. This shows the effectiveness of the simulation approach for offline training. However, there still exists a gap between UBS4RL and the Oracle performance, which is the upper bound over all possible ranking methods. This indicates that RL algorithms that are more effective and efficient than REINFORCE could be explored to further improve the performance of UBS4RL.
6.3 Evaluation with a Practical Dataset (RQ3)
As we have demonstrated that the proposed UBS4RL framework works well to optimize online evaluation metrics in the simulation study, we further investigate how it performs in practical data environments, and whether the ranking framework trained in the simulation environment actually helps to improve the offline evaluation metrics. We evaluate the ranking framework on datasets constructed at different times to show its generalizability, which is an important concern for online search systems. The experimental results on the Real 2017 and Real 2018 datasets are shown in Table
8. From the results, we can note the following.
First, click models that directly utilize the debiased query-result relevance for ranking perform the best among the baselines. The counterfactual LTR methods REM, DLA, and IPW, which learn a propensity weight to remove the bias in click logs, can hardly surpass the click models’ performance. JRE achieves performance close to that of UBM-Layout, from which its weak relevance labels are obtained.
Second, although UBS4RL aims to optimize online evaluation metrics, the learned policy also performs well on offline metrics. Under all of the reward functions (CTR@3, CTR@5, CTR@10, DCG@3, DCG@5, DCG@10, and MRR), \(\mbox{UBS4RL}_\mathsf {CCS}\) and \(\mbox{UBS4RL}_\mathsf {UserGAN}\) outperform all baselines in terms of NDCG@3,5,10. This shows the robustness of the ranking framework with respect to optimization objectives. The versatility of the ranking framework is potentially important for supporting heterogeneous search scenarios. \(\mbox{UBS4RL}_\mathsf {CCS}\) and \(\mbox{UBS4RL}_\mathsf {UserGAN}\) achieve close performance when optimizing click-related online metrics, whereas \(\mbox{UBS4RL}_\mathsf {UserGAN}\) is slightly better on Real 2018. This also verifies that learning the implicit user behavior distribution from search logs makes the user simulator more generalizable.
Third, UserGAN is the only user simulator that can mimic user click times. From the results, we can see that optimizing click time related online metrics (FCT, LCT, ACT) yields better performance than optimizing click-related online metrics (CTR, DCG, MRR). Optimizing the FCT metric achieves the best performance on Real 2017, and optimizing the ACT metric achieves the best performance on Real 2018. This shows that user click time is an important signal for ranking evaluation, which is usually ignored by previous works. UserGAN simulates fine-grained user behavior, including clicks and click times, and performs the best as the simulation environment.
Fourth, UBS4RL has stronger generalization ability than other methods based on search logs. As shown in Table
8, UBS4RL performs much better than DLA, JRE, and the click models on Real 2018. The ranking framework trained on historical data performs well on emerging queries and documents, whereas the baselines exhibit a sharp decline in performance. Generalization ability is an important concern for search engines, as large amounts of Web documents and multimedia content become available online every moment, and it is infeasible to retrain models constantly for new content. The proposed ranking framework UBS4RL is thus more suitable for practical search engines.
The overall experimental results show that the simulation environment captures important properties of practical users. It serves as a reliable reward provider and performance judge for the ranking agent. The UBS4RL framework can effectively leverage the implicit relevance signals in search logs, even without making any explicit assumptions about user behavior. UBS4RL can substantially improve the ranking performance by optimizing a wide range of online metrics, including click-related and click time related metrics. Click time information is especially effective for ranking evaluation and optimization.
6.3.1 Summary for Research Question 3.
By training the user simulators on practical search logs, the proposed UBS4RL framework can effectively improve ranking performance in terms of offline evaluation metrics:
•
Although the UBS4RL framework is trained to optimize online evaluation metrics, it also performs well on offline evaluation. UBS4RL outperforms all of the baselines, including the click models, the neural ranking model, and the counterfactual LTR methods.
•
When adopting UserGAN as the simulation environment, UBS4RL achieves stronger generalization ability than the baselines. This verifies that the learned implicit user behavior distribution is consistent over time and can thus be transferred to emerging search requests.
•
By optimizing click time related online metrics, the UBS4RL framework can achieve even better ranking performance than by optimizing click-related online metrics. This shows that the click time information in users’ search process, which is usually ignored by previous works, is vital for ranking evaluation and optimization.
7 Conclusion
Ranking heterogeneous search results with complex interactions is an emerging and critical problem for modern search engines. In this work, we propose UBS4RL, a ranking framework based on user simulation that optimizes online evaluation metrics. The framework consists of three modules: a feature extractor, a user simulator, and a ranking agent. Different user simulators are constructed to mimic fine-grained user feedback and serve as the simulation environment for offline training of the RL agent. Offline training avoids the time and resource consumption inherent in online training for practical search engines; online training can also be built upon the policy learned offline with search logs and the simulation environment, which is more efficient and brings little harm to the online system. Through extensive experiments on both synthetic and practical data, we demonstrate the effectiveness of the framework in improving ranking performance in terms of both online and offline evaluation metrics. We also find that click time information during users’ search process, which is usually ignored by previous works, is vital for ranking evaluation and optimization. The ranking framework generalizes well over time, which is valuable for search engines dealing with ever-changing information on the Web.
The framework has the potential to be extended to a wide range of optimization tasks for different information systems, such as mobile search, product search, image search, and recommendation systems. It can be further improved in three respects. First, the feature extractor can incorporate the information of landing pages, which contain more detailed information about search results, in addition to the multimedia contents of SERPs. Meanwhile, the components of the ranking framework can utilize different feature extractors in practical applications. For example, the user simulator can be trained offline with computationally expensive deep neural features to achieve better performance, whereas the ranking agent, which ranks results online, can utilize more cost-efficient handcrafted features. Second, the user simulator can adopt other generative models, such as flow-based generative models and VAEs, for user feedback simulation. Additional fine-grained spatial and temporal signals of user behavior can also be explored, such as fixations and mouse movements during the information seeking process. Third, the ranking agent can leverage a wide range of RL algorithms that have shown effectiveness and efficiency in other research fields, such as games and robotics, to improve the ranking performance. In addition, different online metrics can be jointly optimized, and user satisfaction can be directly improved through tailored reward designs. The list-wise ranking framework also has the potential to be extended to the two-dimensional whole-page optimization problem, which takes the results on the right side of SERPs, such as knowledge cards and related searches, into consideration.