4.1 RQ1: Can We Provide Effective Cached Answers?
The results of the experiments conducted on the three
CAsT datasets with the
no-caching baseline, static-
CACHE, and dynamic-
CACHE are reported in Table
1. For each dataset and for both the static and dynamic versions of CACHE, we vary the value of the cache cutoff \(k_c\) as discussed in Section 3.2, and we highlight with the symbol \(\blacktriangledown\) the statistically significant differences (two-sample t-test with \(p\lt 0.01\)) w.r.t. the no-caching baseline. The best results for each dataset and effectiveness metric are shown in bold.
By looking at the figures in the table, we see that static-CACHE returns worse results than no-caching for all datasets and for most of the metrics and cache cutoffs \(k_c\) considered. However, in a few cases, the differences are not statistically significant. For example, we observe that static-CACHE on CAsT 2019 with \(k_c=10\text{K}\) does not differ statistically from no-caching on all metrics but MAP@200. The reuse of the embeddings retrieved for the first queries of CAsT 2019 conversations is thus so high that even the simple heuristic of statically caching the top \(10\text{K}\) embeddings of the first query allows the following queries to be answered effectively without further interactions with the back-end. As expected, by increasing the number \(k_c\) of statically cached embeddings from \(1\text{K}\) to \(10\text{K}\), we improve the quality for all datasets and metrics. Interestingly, we observe that static-CACHE performs relatively better at small query cutoffs, since in column P@1, in 5 cases out of 12, the results are not statistically different from those of no-caching. We explain this behavior by looking again at Figure 3: when an incoming query \(q_b\) is close to a previously cached one, i.e., \(\hat{r}_b \ge 0\), it is likely that the relevant documents for \(q_b\) present in the cache are those most similar to \(q_b\) among all those in \(\mathcal{D}\). The larger the query cutoff \(k\), the lower the probability that the least similar documents among those in NN\((q_b, k)\) reside in the cache.
When considering dynamic-CACHE, based on the heuristic update policy discussed earlier, effectiveness improves remarkably. Independently of the dataset and the value of \(k_c\), we achieve performance figures that are not statistically different from those measured with no-caching for all metrics but MAP@200. Indeed, the metrics measured at small query cutoffs are in some cases even slightly better than those of the baseline, although the improvements are not statistically significant: since the embeddings relevant for a conversation are tightly clustered, retrieving them from the cache rather than from the whole index in some cases reduces noise and yields higher accuracy. MAP@200 is the only metric for which some configurations of dynamic-CACHE perform worse than no-caching. This behavior stems from tuning the threshold \(\epsilon\) with a focus on small query cutoffs, i.e., those commonly considered most important for conversational search tasks [4].
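A minimal Python sketch of this hit/miss control flow is shown below. It is our own illustration under stated assumptions, not the actual CACHE implementation: the helper callables for back-end retrieval and for computing the self-assessment score \(\hat{r}_b\), as well as all names and signatures, are ours.

```python
import numpy as np

class DynamicCache:
    """Minimal sketch of the dynamic-CACHE update policy; names and
    signatures are illustrative assumptions, not the paper's code."""

    def __init__(self, backend_search, quality_estimate, epsilon=0.04, k_c=10_000):
        self.backend_search = backend_search      # q -> (k_c, d) embeddings from the remote index
        self.quality_estimate = quality_estimate  # (q, cached) -> the score r_hat_b
        self.epsilon, self.k_c = epsilon, k_c
        self.embeddings = None                    # locally cached document embeddings

    def answer(self, q, k=3):
        # Cache miss: the heuristic predicts low coverage, so refresh the cache.
        if self.embeddings is None or self.quality_estimate(q, self.embeddings) < self.epsilon:
            new = self.backend_search(q, self.k_c)
            self.embeddings = new if self.embeddings is None else np.vstack([self.embeddings, new])
        # In both cases the answer is computed locally on the cached embeddings.
        scores = self.embeddings @ q
        return np.argsort(-scores)[:k]

# Toy usage with random data, only to exercise the control flow.
rng = np.random.default_rng(0)
cache = DynamicCache(backend_search=lambda q, kc: rng.standard_normal((kc, 768)),
                     quality_estimate=lambda q, emb: 0.05)
print(cache.answer(rng.standard_normal(768), k=3))
```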
RQ1.A: Effectiveness of the quality assessment heuristic. The performance exhibited by dynamic-CACHE demonstrates that the quality assessment heuristic used to determine cache updates is highly effective. To further corroborate this claim, the \(\textsf{cov}_{10}\) column of Table 1 reports, for static-CACHE and dynamic-CACHE, the mean coverage for \(k=10\) measured by averaging Equation (5) over all the conversational queries in the datasets. We recall that this measure counts the cardinality of the intersection between the top 10 elements retrieved from the cache and the exact top 10 elements retrieved from the whole index, divided by 10. While the \(\textsf{cov}_{10}\) values for static-CACHE range between 0.35 and 0.62, justifying the quality degradation captured by the metrics reported in the table, with dynamic-CACHE we measure values between 0.89 and 0.96, showing that, consistently across different datasets and cache configurations, the proposed update heuristic successfully triggers a refresh of the cache content when it is needed to answer a new topic introduced in the conversation.
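For clarity, the coverage measure of Equation (5) can be computed as in the following sketch; the function and variable names are ours, chosen for illustration only.

```python
def cov_k(cache_ids, exact_ids, k=10):
    """Fraction of the exact top-k (from the whole index) that also
    appears in the top-k retrieved from the cache, as in Eq. (5)."""
    return len(set(cache_ids[:k]) & set(exact_ids[:k])) / k

print(cov_k([3, 7, 9, 1], [3, 9, 5, 2], k=4))  # 0.5: two of the four exact results are cached

# The cov_10 column of Table 1 averages this value over all conversational
# queries, pairing each cache run with the corresponding exact run.
```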
To gain further insights about RQ1.A, we conducted additional experiments aimed at understanding whether the hyperparameter \(\epsilon\) driving the dynamic-CACHE updates can be fine-tuned for a specific query cutoff. Our investigation is motivated by the MAP@200 results reported in Table 1, which are slightly lower than the baseline for 5 of the 12 dynamic-CACHE configurations. We ask whether it is possible to tune the value of \(\epsilon\) to achieve MAP@200 results statistically equivalent to those of no-caching without losing all the efficiency advantages of our client-side cache.
Similar to Figure 4, the plot in Figure 5 shows the correlation between the value of \(\hat{r}_b\) and \(\textsf{cov}_{200}(q)\) for the CAsT 2019 train queries with static-CACHE, \(k=200\), and \(k_c=1\text{K}\). Even at query cutoff 200, we observe a strong correlation between \(\hat{r}_b\) and the coverage metric of Equation (5): most of the train queries with coverage \(\textsf{cov}_{200} \le 0.3\) have a value of \(\hat{r}_b\) smaller than 0.07, with a single query for which this rule of thumb does not strictly hold. Hence, we set \(\epsilon = 0.07\) and re-run our experiments with dynamic-CACHE by varying the cache cutoff \(k_c\) in \(\lbrace 1\text{K}, 2\text{K}, 5\text{K}, 10\text{K}\rbrace\). The results of these experiments, conducted on the CAsT 2019 dataset, are reported in Table 2. As we can see from the figures reported in the table, increasing the value of \(\epsilon\) from 0.04 to 0.07 improves the quality of the results returned by the cache at large cutoffs. Now dynamic-CACHE returns results that are always, even for MAP@200, statistically equivalent to the ones retrieved from the whole index by the no-caching baseline (according to a two-sample t-test with \(p\lt 0.01\)). The improved quality at cutoff 200 is of course paid for with a decrease in efficiency. While for \(\epsilon = 0.04\) (see Table 1) we measured on CAsT 2019 hit rates ranging from 67.82% to 75.29%, by setting \(\epsilon = 0.07\) we strengthen the constraint on cache content quality and correspondingly increase the number of cache updates performed. Consequently, the hit rate now ranges from 46.55% to 58.05%, which nonetheless still represents a substantial efficiency gain with respect to the no-caching baseline.
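The tuning procedure described above can be made concrete as in the following sketch; this is our own illustrative reformulation, with hypothetical names, that picks the smallest threshold flagging as misses all low-coverage train queries except for a tolerated number of outliers (one, in the case of Figure 5).

```python
import numpy as np

def pick_epsilon(r_hat, cov_200, low_cov=0.3, outliers_allowed=1, margin=1e-3):
    """Illustrative threshold selection: epsilon is set just above the
    r_hat scores of (almost) all train queries with coverage <= low_cov."""
    r_hat, cov_200 = np.asarray(r_hat), np.asarray(cov_200)
    low = np.sort(r_hat[cov_200 <= low_cov])       # scores of poorly covered train queries
    if len(low) == 0:
        return 0.0
    if outliers_allowed and len(low) > outliers_allowed:
        low = low[:-outliers_allowed]              # tolerate a few outliers, as in Figure 5
    return float(low.max()) + margin
```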
RQ1.B: Impact of CACHE on client-server interactions. The last column of Table 1 reports the cache hit rate, i.e., the percentage of conversational queries answered with the cached embeddings without interacting with the conversational search back-end. Of course, static-CACHE results in a trivial 100% hit rate, since all the queries in a conversation are answered with the embeddings initially retrieved for the first query. This lowest possible workload on the back-end is however paid for with a significant effectiveness drop with respect to the no-caching baseline. With dynamic-CACHE, instead, we achieve high hit rates together with the optimal answer quality discussed earlier. As expected, the greater the value of \(k_c\), the larger the number of cached embeddings and the higher the hit rate. With \(k_c=1\text{K}\), hit rates range from \(56.02\%\) to \(67.82\%\), meaning that even with the smallest cache cutoff tested, more than half of the conversational queries in the three datasets are answered directly by the cache, without forwarding the query to the back-end. For \(k_c=10\text{K}\), the hit rate lies in the interval \([63.87\%\text{--}75.29\%]\), with more than \(3/4\) of the queries in the CAsT 2019 dataset answered directly by the cache. If we consider the hit rate as a measure of the temporal locality present in the CAsT conversations, the 2019 dataset exhibits the highest locality: on this dataset, dynamic-CACHE with \(k_c=1\text{K}\) achieves a hit rate higher than those measured for the \(k_c=10\text{K}\) configurations on CAsT 2020 and 2021.
RQ1.C: Worst-case CACHE memory requirements. The memory occupancy of static-CACHE is limited, fixed, and known in advance. The worst-case amount of memory required by dynamic-CACHE depends instead on the value of \(k_c\) and on the number of cache updates performed during a conversation. The parameter \(k_c\) establishes the number of embeddings added to the cache after every cache miss. Limiting the value of \(k_c\) may be necessary to respect memory constraints on the client hosting the cache. However, the larger \(k_c\), the better the performance of dynamic-CACHE, thanks to the increased likelihood that upcoming queries in the conversation will be answered directly, without querying the back-end index. In our experiments, we varied \(k_c\) in \(\lbrace 1\text{K}, 2\text{K}, 5\text{K}, 10\text{K}\rbrace\), always obtaining optimal retrieval performance thanks to the effectiveness and robustness of the cache-update heuristic.
Regarding the number of cache updates performed, we consider as exemplary cases the most difficult conversations for our caching strategy in the three CAsT datasets, namely, topic 77, topic 104, and topic 117 for CAsT 2019, 2020, and 2021, respectively. These conversations require the highest number of cache updates: 6, 7, and 6 for \(k_c=1\text{K}\) and 5, 6, and 5 for \(k_c=10\text{K}\), respectively. Consider topic 104 of CAsT 2020, the toughest conversation for the memory requirements of dynamic-CACHE. At its maximum occupancy, after the last cache update, dynamic-CACHE stores at most \(8 \cdot 1\text{K} = 8\text{K}\) embeddings for \(k_c=1\text{K}\) and \(7 \cdot 10\text{K} = 70\text{K}\) embeddings for \(k_c=10\text{K}\). In fact, at a given time, dynamic-CACHE stores the \(k_c\) embeddings retrieved for the first query in the conversation plus \(k_c\) new embeddings for every cache update performed. In practice, the total number is lower due to embeddings retrieved multiple times from the back-end index. The actual number of cached embeddings for the case considered is \(7.5\text{K}\) and \(64\text{K}\) for \(k_c=1\text{K}\) and \(k_c=10\text{K}\), respectively. Since each embedding is represented with 769 floating point values, the maximum memory occupation for our largest cache is \(64\text{K} \times 769 \times 4\) bytes \(\approx 188\) MB. Note that if we consider dynamic-CACHE with \(k_c=1\text{K}\), which achieves the same optimal performance as dynamic-CACHE with \(k_c=10\text{K}\) on CAsT 2020 topic 104, the maximum occupancy of the cache decreases dramatically to about 28 MB.
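The worst-case occupancy can be reproduced with the following back-of-the-envelope script; the update counts and embedding size are those reported above, and the script is only a restatement of that arithmetic.

```python
dims, bytes_per_float = 769, 4

# Worst case: one batch of k_c embeddings for the first query plus one
# batch per cache update (7 updates for k_c=1K, 6 for k_c=10K on topic 104).
for k_c, updates in ((1_000, 7), (10_000, 6)):
    print(f"k_c={k_c}: at most {(1 + updates) * k_c} cached embeddings")

# Footprint of the largest cache actually observed (about 64K embeddings).
print(f"{64_000 * dims * bytes_per_float / 2**20:.0f} MB")  # ~188 MB
```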
4.2 RQ2: How Much Does CACHE Expedite the Conversational Search Process?
We now answer RQ2 by assessing the efficiency of the conversational search process in the presence of cache misses (RQ2.A) or cache hits (RQ2.B).
RQ2.A: What is the impact of the cache cutoff \(k_c\) on the efficiency of the system in case of cache misses? We first conduct experiments to understand the impact of \(k_c\) on the latency of nearest-neighbor queries performed on the remote back-end. To this end, we do not consider the costs of client-server communications, but only the retrieval time measured for answering a query on the remote index. Our aim is to understand whether the value of \(k_c\) significantly impacts the retrieval cost. In fact, when we answer the first query in the conversation or when dynamic-CACHE updates the cache after a miss (lines 1–3 of Algorithm 1), we retrieve from the remote index a large set of \(k_c\) embeddings to increase the likelihood of storing in the cache documents relevant for successive queries. However, the query cutoff \(k\) commonly used for answering conversational queries is very small, e.g., \(1, 3, 5\), and \(k \ll k_c\). Our caching approach can improve efficiency only if the cost of retrieving \(k_c\) embeddings from the remote index is comparable to that of retrieving a much smaller set of \(k\) elements. Otherwise, even if we remarkably reduce the number of accesses to the back-end, every retrieval of a large number of results for filling or updating the cache would jeopardize its efficiency benefits.
We conduct the experiment on the CAsT 2020 dataset by reporting the average latency (in milliseconds, ms) of performing NN\((q, k_c)\) queries on the remote index. Due to the peculiarities of the FAISS library implementation previously discussed, the response time is measured by retrieving the top-\(k_c\) results for a batch of 216 queries, i.e., the CAsT 2020 test utterances, and averaging the total response time (Table 3). Experimental results show that the back-end query response time is approximately 1 second and is almost unaffected by the value of \(k_c\). This is expected, as exhaustive nearest-neighbor search requires computing the distance between the query and every indexed document, plus the negligible cost of maintaining the top-\(k_c\) closest documents in a min-heap. The result thus confirms that large \(k_c\) values do not jeopardize the efficiency of the whole system when cache misses occur.
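A measurement of this kind can be sketched with FAISS as follows. The collection here is random data of a placeholder size and dimensionality, while the real back-end indexes the full CAsT corpus; the sketch only illustrates how the batch latency varies with \(k_c\).

```python
import time
import numpy as np
import faiss

# Exhaustive (flat) inner-product search, timed for a batch of 216 queries.
d, n_docs, n_queries = 768, 1_000_000, 216     # placeholder sizes, not the real corpus
rng = np.random.default_rng(0)
index = faiss.IndexFlatIP(d)
index.add(rng.standard_normal((n_docs, d), dtype=np.float32))
queries = rng.standard_normal((n_queries, d), dtype=np.float32)

for k_c in (1_000, 2_000, 5_000, 10_000):
    start = time.perf_counter()
    _scores, _ids = index.search(queries, k_c)
    elapsed = time.perf_counter() - start
    print(f"k_c={k_c}: {1000 * elapsed / n_queries:.1f} ms per query")
```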
RQ2.B: How much faster is answering a query from the local cache rather than from the remote index? The second experiment measures the average retrieval time for querying the client-side cache (line 4 of Algorithm 1) in the case of a hit. We run the experiment for the two caches proposed, i.e., static-CACHE and dynamic-CACHE. While the former stores a fixed number of documents, the latter employs cache updates that add document embeddings to the cache during the conversation. We report, in the last two rows of Table 3, the average response time of top-3 nearest-neighbor queries resulting in cache hits for different configurations of static-CACHE and dynamic-CACHE. As before, latencies are measured on batches of 216 queries, i.e., the CAsT 2020 test utterances, by averaging the total response time. The results show that, in the case of a hit, querying the cache requires on average less than 4 ms, more than 250 times less than querying the back-end. We observe that, as expected, the hit time increases linearly with the size of static-CACHE. We also note that dynamic-CACHE shows slightly higher latency than static-CACHE; this is due to the cache updates performed during the conversation, which add embeddings to the cache and thus increase its size. Overall, using a cache in conversational search yields a speedup of up to four orders of magnitude, i.e., from seconds to a few tenths of a millisecond, when querying the local cache instead of the remote index.
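A comparable measurement for the cache-hit path can be sketched with a plain in-memory matrix of cached embeddings; again the data are random placeholders, and the exact top-3 search mirrors what a hit on a 10K-embedding cache entails.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
cache = rng.standard_normal((10_000, 768), dtype=np.float32)    # cached document embeddings
queries = rng.standard_normal((216, 768), dtype=np.float32)     # a CAsT 2020-sized batch

start = time.perf_counter()
scores = queries @ cache.T                                      # exact similarities to the cache
top3 = np.argpartition(-scores, 3, axis=1)[:, :3]               # unordered top-3 per query
elapsed = time.perf_counter() - start
print(f"{1000 * elapsed / len(queries):.2f} ms per cache hit")
```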
We can now finally answer RQ2 (how much does CACHE expedite the conversational search process?) by computing the average overall speedup achieved by our caching techniques on an entire conversation. Assuming that the average conversation is composed of 10 utterances, the no-caching baseline that always queries the back-end leads to a total response time of about \(10 \times 1.06 = 10.6\) s. Instead, with static-CACHE, we perform only one retrieval from the remote index for the first utterance, while the remaining queries are resolved by the cache. Assuming the use of static-CACHE with 10K embeddings, i.e., the configuration with the highest hit latency, the total response time for the whole conversation is \(1.06 + (9 \cdot 0.00159) = 1.074\) s, with an overall speedup of about \(9.87\times\) over no-caching. Finally, the use of dynamic-CACHE implies possible cache updates that may increase the number of queries answered using the remote index. In detail, dynamic-CACHE with 10K embeddings obtains a hit rate of about 64% on CAsT 2020 (see Table 1). This means that, on average, we forward \(1 + (9 \cdot 0.36) = 4.24\) queries to the back-end, which cost in total \(4.24 \cdot 1.06 \approx 4.49\) s. The remaining cost comes from cache hits: hits are on average 5.76 per conversation and require \(5.76 \cdot 0.00348 \approx 0.02\) s, accounting for a total response time for the whole conversation of about 4.51 s. This leads to a speedup of about \(2.3\times\) with respect to the no-caching solution.
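These estimates follow from simple arithmetic on the measured latencies, as the following script shows; the latencies, hit rate, and 10-utterance conversation length are the values stated above, and the script merely re-derives the totals.

```python
backend_s, n_utterances = 1.06, 10                 # back-end latency (s) and conversation length
static_hit_s, dynamic_hit_s = 0.00159, 0.00348     # cache-hit latencies (s) from Table 3
hit_rate = 0.64                                    # dynamic-CACHE, k_c=10K, CAsT 2020

no_caching = n_utterances * backend_s                                   # ~10.6 s
static = backend_s + (n_utterances - 1) * static_hit_s                  # 1 remote query, 9 hits
misses = 1 + (n_utterances - 1) * (1 - hit_rate)                        # ~4.24 remote queries
dynamic = misses * backend_s + (n_utterances - misses) * dynamic_hit_s  # ~4.51 s

for name, total in (("static-CACHE", static), ("dynamic-CACHE", dynamic)):
    print(f"{name}: {total:.2f} s, speedup {no_caching / total:.1f}x over no-caching")
```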
The above figures confirm the feasibility and the computational performance advantages of our client-server solution for caching historical embeddings for conversational search.