5.1 Context Extraction
We extracted context from digital activity data for the following sources: (1) full context, (2) non-Web context, (3) Web context, (4) prior page context, (5) search session context, and (6) random context.
Figure 2 illustrates the various information sources used for modeling. Depending on the source type, activity sequences were extracted and segmented into slices \((d_n,d_{n+1})\) for training, validating, and testing. Consider the following example sequence of user activities: \(d_1, d_2, d_3, SERP_1, d_4, d_5, SERP_2, d_6, SERP_3, \ldots, d_9, SERP_4\). A short code sketch of the slicing step is given after the list.
– Full context consisted of the whole activity sequence. We considered 4-document sequence slices: (\(d_1,d_2,d_3,SERP_1\)), (\(d_2,d_3,SERP_1,d_4\)), (\(d_3,SERP_1,d_4,d_5\)), (\(SERP_1,d_4,d_5,SERP_2\)), ...
– Non-Web context consisted only of the non-Web activity sequence; we applied a constraint that sequence slices accepted only non-Web documents. We also considered 4-document sequence slices but excluded Web documents.
– Web context consisted only of the Web activity sequence; we applied a constraint that sequence slices accepted only Web documents. We also considered 4-document sequence slices but excluded non-Web documents.
– Prior page context consisted of 2-document sequence slices, each comprising a SERP and the document visited immediately before it: (\(d_3,SERP_1\)), (\(d_5,SERP_2\)), (\(d_6,SERP_3\)), (\(d_9,SERP_4\)).
– Search session context consisted of sequences of SERPs: (\(SERP_1,SERP_2,SERP_3\)), ... Single-query sessions such as \(SERP_4\) were excluded (the time between \(SERP_3\) and \(SERP_4\) was longer than 30 minutes, so \(SERP_4\) formed a session of its own).
– Random context consisted of a sequence of documents randomly selected from the user's activity history. We also considered 4-document sequence slices, but the first 3 documents in each slice were randomly selected from the activity history: (\(d_2,d_3,d_1,SERP_1\)), (\(d_1,SERP_1,d_2,d_4\)), (\(SERP_1,d_2,d_4,d_5\)), (\(d_3,d_1,d_5,SERP_2\)), ...
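The slicing step can be summarized with the short Python sketch below; the helper names and the convention of prefixing SERP identifiers with "SERP" are hypothetical, and the actual preprocessing pipeline may differ.

```python
from typing import Callable, List, Sequence, Tuple

def slice_sequence(activities: Sequence[str],
                   window: int = 4,
                   keep: Callable[[str], bool] = lambda d: True) -> List[Tuple[str, ...]]:
    """Return all consecutive `window`-document slices after applying a document filter."""
    filtered = [d for d in activities if keep(d)]
    return [tuple(filtered[i:i + window]) for i in range(len(filtered) - window + 1)]

def prior_page_slices(activities: Sequence[str]) -> List[Tuple[str, str]]:
    """Return (prior document, SERP) pairs, one per SERP in the sequence."""
    return [(activities[i - 1], activities[i])
            for i in range(1, len(activities))
            if activities[i].startswith("SERP")]

# Toy activity sequence matching the examples above.
seq = ["d1", "d2", "d3", "SERP_1", "d4", "d5", "SERP_2"]
full_slices = slice_sequence(seq)     # ('d1', 'd2', 'd3', 'SERP_1'), ('d2', 'd3', 'SERP_1', 'd4'), ...
prior_pages = prior_page_slices(seq)  # ('d3', 'SERP_1'), ('d5', 'SERP_2')
```

The non-Web, Web, and random contexts follow the same pattern, with a different `keep` filter or a random draw of the first 3 documents.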
5.2 Model Training
For each participant, the data was split into training, validation, and test sets, and the model was trained on the sequence of documents. The training set consisted of the data from the first 8 days, the validation set of the data from the 2 subsequent days, and the test set of the data from the remaining 4 days. This approach ensured that the queries and sessions in the training and validation sets did not appear in the test set. The data used for each context model was as follows (a sketch of the day-based split is given after the list):
– Full context model: There were 26,225 documents in the training set, 6,565 documents in the validation set, and 14,055 documents in the test set.
– Non-Web context model: There were 13,265 documents in the training set, 3,325 documents in the validation set, and 7,115 documents in the test set.
– Web context model: There were 15,705 documents in the training set, 3,943 documents in the validation set, and 8,412 documents in the test set.
– Prior page context model: There were 4,301 documents in the training set, 1,081 documents in the validation set, and 2,260 documents in the test set.
– Search session context model: There were 540 sessions in the training set, 140 sessions in the validation set, and 260 sessions in the test set.
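A minimal sketch of this day-based split, assuming each activity record carries a 1-based `day_index` field (the field name is an assumption about the data schema):

```python
def split_by_days(records):
    """Split one participant's records into train (days 1-8), validation (9-10), and test (11-14)."""
    train = [r for r in records if r["day_index"] <= 8]
    valid = [r for r in records if 9 <= r["day_index"] <= 10]
    test  = [r for r in records if r["day_index"] >= 11]
    return train, valid, test
```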
For the evaluation, we set the sequence length to 4 documents for the Full context, Non-Web context, and Web context models. We used a 4-document sequence slice because, in the pilot study, we found that having more than 4 documents in the sequence did not significantly improve prediction performance. This setting also made fine-tuning the model faster and easier (an analysis of sequence length is reported in Section 8.2). We fine-tuned BART on the context data for 20 epochs and used gradual unfreezing for the text generation.
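A minimal sketch of this fine-tuning setup, assuming a standard Hugging Face BART checkpoint and a one-layer-per-epoch unfreezing schedule (both are assumptions; the exact schedule is not detailed here):

```python
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")  # assumed checkpoint

# Start with everything frozen except the LM head.
for p in model.parameters():
    p.requires_grad = False
for p in model.lm_head.parameters():
    p.requires_grad = True

# Gradual unfreezing: release one more transformer layer per epoch, top of the decoder first.
layers = list(reversed(model.model.decoder.layers)) + list(reversed(model.model.encoder.layers))

def unfreeze_for_epoch(epoch: int) -> None:
    if epoch < len(layers):
        for p in layers[epoch].parameters():
            p.requires_grad = True
```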
5.3 Baselines
We reproduced an encoder-decoder model based on the LSTM-RNN architecture proposed by Dehghani et al. [16] and used it as a baseline. Originally, this model was designed for translation tasks [4] and was adapted for query prediction by Dehghani et al. [16]. The model architecture consisted of an encoder that learned the representation of the source sequence and a decoder that generated the target sequence. In this research, the source sequence was the text concatenated from the 3 previous documents, and the target sequence was the text of the target document. This modeling setup was the same as for our transformer model using full context, in which the model leveraged a 4-document window and treated the text of the 3 prior documents as the source sequence.
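As a rough illustration, a minimal PyTorch sketch of such an encoder-decoder is given below; the embedding and hidden sizes are placeholder values, and the original baseline [16] may differ in its details.

```python
import torch.nn as nn

class Seq2SeqLSTM(nn.Module):
    """Encoder-decoder over token ids: encode the 3 prior documents, decode the target document."""
    def __init__(self, vocab_size: int, emb_dim: int = 256, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.embed(src_ids))            # summarize the source sequence
        dec_out, _ = self.decoder(self.embed(tgt_ids), state)   # condition the decoder on it
        return self.out(dec_out)                                # logits over the vocabulary
```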
Query prediction & QAC. Prior page context, search session context, random context, and LSTM-RNN were considered comparison baselines. This way, we could examine how using the full context source, which goes beyond the context captured by the prior page and search session contexts, affects prediction performance.
Selected search result prediction. Prior page context, random context, and LSTM-RNN were considered comparison baselines.
Web search re-ranking. Prior page context, random context, and LSTM-RNN were considered comparison baselines. In addition, we considered a non-contextual ranking as a baseline. To produce the non-contextual ranking, the actual query issued by a participant was used: BM25 was applied to re-rank the top 10 search result documents using the content of the actual query.
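A minimal sketch of this non-contextual baseline using the rank_bm25 package; the specific BM25 implementation and the whitespace tokenization are assumptions.

```python
from rank_bm25 import BM25Okapi

def bm25_rerank(query: str, results: list[str]) -> list[str]:
    """Re-rank the top-10 result documents by BM25 score against the actual query."""
    tokenized = [doc.lower().split() for doc in results]
    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores(query.lower().split())
    order = sorted(range(len(results)), key=lambda i: scores[i], reverse=True)
    return [results[i] for i in order]
```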
5.4 Evaluation Measures
We used the actual queries and actual clicked documents in the test sessions as the ground truth for the evaluation.
Query prediction and QAC. We first used Mean Reciprocal Rank (MRR) and Partial-matching MRR (PMRR), which are often-used metrics to evaluate query prediction and QAC [46]. These measures are considered to be useful in information retrieval research [53] despite some discussions regarding their effectiveness [19].
\[
\mathrm{MRR} = \frac{1}{|Q|}\sum_{q \in Q}\frac{1}{r_q}, \qquad
\mathrm{PMRR} = \frac{1}{|Q|}\sum_{q \in Q}\frac{1}{pr_q},
\]
where \(|Q|\) is the number of all queries, \(r_q\) is the rank of the original query among the candidates, and \(pr_q\) is the rank of the first candidate that partially matches the original query [46].
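A minimal sketch of how both measures can be computed; the partial-match test used here (shared terms between candidate and original query) is a simplification of the definition in [46].

```python
def mrr_and_pmrr(queries, candidate_lists):
    """queries: original queries; candidate_lists: ranked candidate lists, one per query."""
    mrr_sum = pmrr_sum = 0.0
    for q, cands in zip(queries, candidate_lists):
        exact = next((i + 1 for i, c in enumerate(cands) if c == q), None)
        partial = next((i + 1 for i, c in enumerate(cands)
                        if set(c.split()) & set(q.split())), None)
        mrr_sum += 1.0 / exact if exact else 0.0
        pmrr_sum += 1.0 / partial if partial else 0.0
    return mrr_sum / len(queries), pmrr_sum / len(queries)
```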
Because MRR was too harsh (this metric considers only an exact match of the original query), we also used the classical metric BLEU, which corresponded to the rate of generated n-grams that were present in the target query. We referred to BLEU1, BLEU2, and BLEU3 for 1-grams, 2-grams, and 3-grams, respectively.
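A minimal sketch using NLTK's sentence-level BLEU; the individual n-gram weights and the smoothing function are assumptions made here to handle short queries.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_1_2_3(target: str, generated: str):
    """Return (BLEU1, BLEU2, BLEU3) for a generated query against the target query."""
    ref, hyp = [target.split()], generated.split()
    smooth = SmoothingFunction().method1
    return tuple(
        sentence_bleu(ref, hyp, weights=w, smoothing_function=smooth)
        for w in [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0)]  # 1-gram, 2-gram, 3-gram
    )
```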
Sim Extrema, which computed the cosine similarity between the representation of the candidate query and that of the target query, was also used. The representation of a query is the component-wise maximum of the representations of the words making up the query (we used GoogleNews embeddings). The extrema vector method has the advantage of emphasizing words that carry information rather than the common words of the queries.
We also computed Sim Pairwise as the mean value of the maximum cosine similarity between each term of the target query and all the terms of the generated one.
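A minimal numpy sketch of both similarities, where `candidate_vecs` and `target_vecs` are the GoogleNews word vectors of the generated and target query terms (the embedding lookup is omitted):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def sim_extrema(candidate_vecs, target_vecs):
    # Component-wise maximum over the word vectors of each query, then cosine similarity.
    return cosine(np.max(candidate_vecs, axis=0), np.max(target_vecs, axis=0))

def sim_pairwise(candidate_vecs, target_vecs):
    # For each target term, take its best match among generated terms, then average.
    return float(np.mean([max(cosine(t, c) for c in candidate_vecs) for t in target_vecs]))
```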
Finally, for each metric, we averaged the performance over all prefixes for QAC and query prediction. In this article, we considered prefixes of 0–8 characters.
Selected search result prediction. We considered Sim Pairwise, Sim Extrema, BLEU1, BLEU2, and BLEU3. We evaluated the generated documents against the original clicked search result document. For BLEU1, BLEU2, and BLEU3, we compared the generated content to the title of the original clicked search result documents.
For each model, we first generated (through beam search with K = 20) 10 documents to suggest to the user, given the context sequence. The reported value for each metric is then the maximum score over the top 10 generated queries or selected search result documents. This approach has been used in earlier work to assess the performance of a probabilistic model [36] and corresponds to a fair evaluation of models that try to find a good balance between quality and diversity.
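With a Hugging Face-style generation API, this step can be sketched as follows; `input_ids`, `tokenizer`, `metric`, and `target` are placeholders, and `max_length` is an assumption.

```python
# input_ids: encoded context sequence; model/tokenizer as in the BART sketch above.
outputs = model.generate(
    input_ids,
    num_beams=20,              # beam search with K = 20
    num_return_sequences=10,   # keep 10 candidates per context
    max_length=64,             # assumed cap on generated length
    early_stopping=True,
)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
best = max(metric(target, c) for c in candidates)  # report the maximum over the top 10
```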
Web search re-ranking. We considered MRR and \(Hitrate@k\), based on the ranks of the selected search result documents in the re-ranked list. \(Hitrate@k\) denotes the percentage of cases in which the selected search result document appears among the top-\(k\) ranked documents. Here, we considered \(Hitrate@1\), \(@2\), and \(@3\).
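A minimal sketch of \(Hitrate@k\), assuming `ranks` holds the 1-based position of each selected document in its re-ranked list (rank extraction is assumed to happen upstream):

```python
def hitrate_at_k(ranks: list[int], k: int) -> float:
    """Fraction of cases where the selected document appears in the top-k re-ranked results."""
    return sum(1 for r in ranks if r <= k) / len(ranks)
```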