1 Introduction

Task-oriented conversational agents or chatbots are designed to provide users with specific predetermined services, such as booking a flight or reserving an event ticket. Such a chatbot must identify the intent (e.g., find restaurant) from the user utterances while also extracting necessary pieces of information, referred to as slots (e.g., restaurant cuisine). The end goal is to fill all the required slots and complete the user-intended task, such as finding a restaurant specializing in a specific cuisine. These systems are often backed by a database, which is a collection of items with their attributes (slots) (Chen et al. 2020).

Existing work on task-oriented chatbots assumes a closed schema, that is, the interaction of users with the chatbot is limited to the information stored in the underlying database (Chen et al. 2020). However, in the real world, users often ask questions that the system cannot answer, because this information is out-of-schema, that is, it is not available for the items in the database. For example, in Fig. 1 the system's goal is to reserve a table at a restaurant for the user. The user proceeds with the conversation by providing a series of slot values to corresponding questions or suggestions from the chatbot. The user may also ask questions, such as "Are there any diabetes-friendly desserts?". The system may be unable to answer this question as the information is absent from the service schema. We refer to such user queries as out-of-schema questions.

Fig. 1 An example conversation for reserving a table at a restaurant where the user drops out when the system fails to answer an out-of-schema user query

Not handling out-of-schema questions is costly, as the user may decide to drop out of the conversation, as shown in Fig. 1. User dropouts have a significant negative impact on applications, for example decreased revenue in e-commerce chatbots.

Previous work has shown that users drop out of chatbot conversations more often when they ask the bot a question than when they respond to the bot by providing information (Li et al. 2020). That study found that about 45% of users abandon the chatbot after just one "non-progress" event, that is, when the bot does not understand their intent or cannot handle it. Previous studies have also shown that consumers are much more likely to repeat the same questions when researching a product, specifically questions about a product's determinant attributes (Fernando and Aw 2023). Periodically analyzing the chat log and answering these common questions offline to improve the effectiveness of a conversational AI system is common practice. For example, Ponnusamy et al. (2020) periodically analyze recent Alexa chat logs to discover unclear or error-prone user queries and retrain the system. A key challenge is that users may ask a large number of out-of-schema questions, and it may be infeasible or too expensive for a domain expert to answer all of them.

Fig. 2 A high-level overview of the two-stage pipeline for ranking 6 input dropped-out conversations. First, we detect which ones have out-of-schema questions and then we compute the benefit of answering these questions

This paper formulates the problem of detecting and selecting out-of-schema questions in a conversational system, and studies how to identify the most critical out-of-schema questions to answer, given a chat log, in order to maximize the expected success rate of the chatbot. We define a conversation as successful if the corresponding task is completed, for example, the user finds a suitable restaurant or books a flight.

The overview of our two-stage pipeline is shown in Fig. 2. In the first stage, we identify the out-of-schema questions from the dropped-out conversations using in-context learning, where a large language model makes predictions based on the task instruction and examples in the prompt. This approach requires no fine-tuning, that is, no parameter updates and no annotated training data. It can handle out-of-schema questions in a new domain given only a few examples and without re-training, making it flexible and scalable. In the second stage, we calculate the potential benefit of answering each out-of-schema question and pick the ones with the highest benefit. Given a collection of conversations, identifying the best out-of-schema questions to answer is non-trivial, as different questions may have a different impact on the success rate of the system. The impact depends not only on the frequency of a question but also on how close the question is to the success state of the conversation.

We propose two approaches to select the most critical out-of-schema questions: a naive approach of counting the frequency of each question, and a probabilistic Markov Chain-based approach. We refer to these approaches as ‘OQS algorithms’. The Frequency-based selection method picks the user questions that are asked in the highest number of failed conversations. The Markov Chain approach intuitively selects questions closer to the success state. By selecting such questions, the Markov Chain approach reduces the number of potential dropout paths, thereby increasing the likelihood of reaching the success state.

To evaluate the proposed OQS algorithms, we created two new datasets. Existing datasets such as MultiWoz (Budzianowski et al. 2018), SNIPS (Coucke et al. 2018), ATIS (Liu et al. 2019), SGD (Rastogi et al. 2020), and DST9 Track1 (Kim et al. 2020) are all grounded in a closed schema or external knowledge sources. None of these datasets contains user drop-outs due to the lack of available information. The only exception is the dataset of Maqbool et al. (2022), which includes "off-script" (similar to our "out-of-schema") user queries at the end of each dialog. However, all of its out-of-schema questions appear after all required slots have been filled, which is not realistic. Our new datasets overcome these limitations; they contain task-oriented conversations with out-of-schema user questions appearing at different positions.

The contributions of this paper are:

(1) We formulate the problem of detecting and selecting out-of-schema questions in a conversational system, and propose a two-stage pipeline to solve these problems.

(2) We propose an in-context learning (ICL) approach for detecting out-of-schema questions in a conversation.

(3) We propose two out-of-schema question selection (OQS) algorithms: Frequency-based and Markov Chain-based.

(4) We create and publish two datasets for the problem.

(5) We evaluate our algorithms to solve the out-of-schema question detection and selection problems through both quantitative methods, including SOTA supervised and unsupervised models, and a realistic simulation approach. We achieve up to 48% improvement over the state-of-the-art for out-of-schema question detection, and 78% improvement in the OQS problem over simpler methods.

2 Background and problem formulation

2.1 Previous work

2.1.1 User drop-out from chatbot

Most slot-filling models (Kim et al. 2014; Abro et al. 2022; Larson and Leach 2022) ignore the possibility of a user dropping out of a conversation, and little work has studied this behavior. Even the metrics for evaluating task-oriented dialog systems, such as user satisfaction modeling, do not take the impact of user dropouts on the system's success into consideration (Siro et al. 2022; Pan et al. 2022; Deng et al. 2022). Li et al. (2020) studied different types of "Non-Progress" events in a banking chatbot, such as when a user has to reformulate their question. A key difference from our setting is that we have a clearly defined success state, whereas they try to learn when the user is dissatisfied with the chatbot based on the Non-Progress events. Their work is descriptive, that is, they do not propose a method to decide what new information to ingest into the chatbot to improve its future performance. Kim et al. (2020) published a dataset containing 'out of scope' questions grounded in external knowledge beyond the domain APIs, built on top of the MultiWoz (Budzianowski et al. 2018) dataset. Their work relies on a large amount of training data to fine-tune a BERT model (Devlin et al. 2018) for classifying the out-of-scope questions in their proposed dataset. Their approach is therefore less flexible, since the model needs to be retrained when new domains are introduced or when not enough training data is available for a domain. To address these challenges, we propose an in-context learning approach to detect out-of-schema questions that is flexible, easy to use, and scalable. We compare our out-of-schema question detection module's performance to their approach in Sect. 4.2. Moreover, while their focus lies in searching for answers within external knowledge sources, our approach is significantly different from theirs. In this paper, we assume that only a human expert can answer these questions, so our focus is on ranking the questions to better utilize the expert's time in selecting and answering them. We recognize the limitations of human bandwidth and prioritize efficiency by strategically identifying and addressing the most critical questions. This is important for systems where frequent updates are essential, emphasizing the significance of human involvement in addressing high-priority inquiries.

2.1.2 In-context learning in task-oriented dialog systems

Dialogue state tracking (DST) is an important module in many task-oriented dialogue systems. Hu et al. (2022) proposed an in-context learning framework for DST, unlike previous studies that explored zero/few-shot DST by fine-tuning pre-trained language models. They reformulated DST as a text-to-SQL problem and directly predicted the dialogue states without any parameter updates. Inspired by their work, we formulate our out-of-schema question detection problem as a question-answering task, utilizing few-shot examples.

2.2 Problem formulation

Fig. 3 The in-context learning context contains slot descriptions in the instructions, one of the two prompt choices, and a test dialog. <Answer3> for Prompt1 and <Answer> for Prompt2 determine if the last question is out-of-schema or not. For D-MultiWoz, the slot names are: pricerange, area, food, name, bookday, bookpeople and booktime (the complete prompt is shown in Appendix A)

A task-oriented chat log D comprises conversations \(c_1\), \(c_2\), \(\ldots\), \(c_m\). A conversation \(c_i\) consists of a sequence of turns, denoted as \(t_1,t_2,\ldots ,t_z\), and a boolean outcome value of SUCCESS or DROP, indicating whether the conversation completed the desired user task or the user dropped out, respectively. Each turn \(t_j\) consists of an utterance, from the user or the chatbot, and a subset of the slots whose values are either requested or provided, as shown in Fig. 1.

There are two types of slots: the required slots (\(S=\{s_1,s_2,\ldots ,s_n\}\)) and the supporting slots. The system's goal is to fill the n required slots with values extracted from the user utterances and then perform the relevant task for the user (Jurafsky and Martin 2009), that is, reach the SUCCESS state. The supporting slots, which are the target of user questions, can provide additional information about an item and guide both the user and the system to reach SUCCESS. For example, the required slots may be 'food' and 'area', and the supporting slots may be 'parking' and 'phone'. Supporting slots are further split into in-schema and out-of-schema. Supporting in-schema slots are not important in this paper, as the system can always provide this information when the user asks about them.

Consider the example shown in Fig. 1. We see a conversation, \(c_i \in D\) which consists of 6 turns (\(z = 6\)). The user provided three required slot values by the end of turn \(t_4\). The user then asks an out-of-schema question, Q at turn \(t_5\), requesting the value of an out-of-schema slot (‘diabetes-friendly’). The system cannot answer Q and the user drops out.

2.2.1 Problem definition

The input to the problem is a chat log D of dropped-out conversations, and an integer k. The expected output is the set of k out-of-schema questions that, if answered, will maximally increase the expected number of successful conversations in D.

We assume that the domain expert answers out-of-schema questions offline, that is, we extract out-of-schema questions from a past chat log, and select some of them to answer. We also assume that many of these questions will be repeated in future chats, so their answers will increase the future success rate.

3 Proposed pipeline

In this section, we discuss our two-stage pipeline for detecting and selecting out-of-schema questions. The first stage of our pipeline detects out-of-schema conversations from a set of dropped-out conversations, D, and the second stage selects the most critical out-of-schema questions from these conversations using our OQS algorithms.

3.1 Out-of-schema question detection

Large language models (LLMs) have demonstrated impressive performance across a wide spectrum of question-answering (QA) domains (Kojima et al. 2022; Wang et al. 2023; Zhao et al. 2023). Leveraging an in-context learning few-shot setting (Brown et al. 2020), we reformulate the task as a question-answering task. The prompt templates are shown in Fig. 3. We consider two different question formats to ensure the robustness of our method. The prompt consists of a general instruction containing the task and slot descriptions associated with the task ontology, one of the two question formats, a set of examples, and finally a test dialog ending with a user turn. The first format (Prompt1) poses 3 consecutive questions to elicit chain-of-thought reasoning (Wei et al. 2022), while the second (Prompt2) directly poses a single question asking whether the last user turn is out-of-schema or in-schema. Each example consists of a dropped-out dialog concluding with a user turn. The task of the LLM is to answer the question(s) given the dialog history. The chain-of-thought prompt (Prompt1) identifies an out-of-schema question step by step, first checking if the last user turn contains any slot values, then checking if the last turn is a question about a slot, and finally checking if the last turn is a question about undefined information. The response to the third question determines whether the last user turn constitutes an out-of-schema question. The problem can be formulated as:

$$\begin{aligned} f(c_i) = {\left\{ \begin{array}{ll} (c_i, Q), &{} \text {if } <Answer3> = Yes ~\text {or}~ <Answer> = Yes\\ NONE, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$

where \(f(c_i)\) returns the conversation \(c_i\) and the out-of-schema question Q associated with it (the last user turn), if the last user turn is an out-of-schema question. The final output of this stage is the set of conversations having out-of-schema questions, \(D_o\subseteq D\), and the set of out-of-schema questions, \(\mathbb {Q}\). For example, in Fig. 2, the output from the first stage is \(D_o= \{c_1, c_2, c_3, c_4, c_6\}\) and \(\mathbb {Q} = \{Q1, Q2, Q3\}\).

Note that posing the task as a question-answering task and providing good examples gives the large language model a better understanding of the task and examples. We also experimented with different prompting techniques such as posing the task as a classification task, or providing different sets of instructions. The setting shown in Fig. 3 performed the best in different contexts for all the datasets. A detailed example prompt is shown in Appendix A.
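To make this stage concrete, the sketch below shows one way to assemble a Prompt2-style few-shot prompt and map the model's yes/no answer to \(f(c_i)\). It is a minimal illustration: the instruction text, the dictionary fields on conversations, and the generic `llm` callable are placeholders we introduce, not the exact prompt of Fig. 3 or a specific LLM API.

```python
# Minimal sketch of the ICL detection stage (Prompt2 style). The instruction,
# the conversation fields, and the `llm` callable are illustrative placeholders,
# not the exact prompt of Fig. 3 or a specific model API.

INSTRUCTION = (
    "You are given a task-oriented dialog about restaurant booking. "
    "The schema slots are: pricerange, area, food, name, bookday, "
    "bookpeople, booktime. Answer Yes if the last user turn asks about "
    "information outside these slots (out-of-schema), otherwise answer No."
)

def build_prompt(examples, test_dialog):
    """Concatenate the instruction, few-shot examples, and the test dialog."""
    parts = [INSTRUCTION, ""]
    for dialog, label in examples:                  # label is "Yes" or "No"
        parts += [dialog, f"Is the last user turn out-of-schema? {label}", ""]
    parts += [test_dialog, "Is the last user turn out-of-schema?"]
    return "\n".join(parts)

def detect_out_of_schema(conversation, examples, llm):
    """Implements f(c_i): return (c_i, Q) when the LLM judges the last user
    turn to be an out-of-schema question Q, otherwise None."""
    answer = llm(build_prompt(examples, conversation["text"]))
    if answer.strip().lower().startswith("yes"):
        return conversation, conversation["last_user_turn"]
    return None
```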

3.2 OQS algorithms

In this stage, our two OQS algorithms select the k most critical out-of-schema questions from the collection of out-of-schema questions, \(\mathbb {Q}\) (present in the chat log \(D_o\)) returned by the out-of-schema question detection stage. This stage involves modeling each conversation first and then calculating the benefit of each out-of-schema question within the modeled conversations.

The proposed OQS algorithms compute the total benefit B(Q) for each out-of-schema question \(Q \in \mathbb {Q}\), and select the set of k questions I with the highest benefit, that is, \(I=\arg \max _{I\subseteq \mathbb {Q}:|I|=k}\sum _{Q\in I}B(Q)\), where \(B(Q) = \sum _{i=1}^{|D_o|} benefit(Q,c_i)\).

Here \(benefit(Q,c_i)\) is the benefit of answering Q for a single conversation \(c_i\). As we will see, the proposed algorithms use different ways to model the conversations and compute \(benefit(Q,c_i)\).
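Because B(Q) decomposes into a sum of per-conversation benefits, selecting I reduces to ranking questions by B(Q). A minimal sketch, assuming a `benefit(q, c)` callable supplied by whichever OQS algorithm is in use:

```python
from collections import defaultdict

def select_top_k(questions, conversations, benefit, k):
    """Select the k questions with the highest total benefit
    B(Q) = sum_i benefit(Q, c_i). The `benefit` callable is supplied by
    the chosen OQS algorithm (Frequency-based or Markov Chain-based)."""
    total = defaultdict(float)
    for q in questions:
        for c in conversations:
            total[q] += benefit(q, c)
    return sorted(questions, key=lambda q: total[q], reverse=True)[:k]
```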

3.2.1 Frequency-based selection

Fig. 4 a Markov chain modeling of a conversation, where each state represents a user action: \(s_i\) means the user has provided \(i-1\) slot values so far, \(q_i\) means that the user asks a question, and SUCCESS and DROP are the two absorbing states. b Shows the state transitions in the dotted box of a in more detail. \(r_i\) and \(w_i\) mean that the system knows or does not know the answer to \(q_i\), respectively

Table 1 Transition probabilities for Markov Chain in Fig. 4

The Frequency-based Selection method models a conversation as a set of out-of-schema questions, and picks the most frequent ones across all conversations. The benefit of a question Q with respect to a conversation \(c_i\) in the out-of-schema conversations \(D_o\) is defined as: \(benefit(Q,c_i) = 1\) if \(Q \in c_i\), and 0 otherwise.
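Under the simplifying assumption (true for our datasets) that each dropped-out conversation ends with exactly one out-of-schema question, stored here in a `last_question` field we introduce for illustration, this benefit is a one-liner; plugged into the selection sketch above, it reduces to counting how many failed conversations ask each question.

```python
def frequency_benefit(q, conversation):
    """benefit(Q, c_i) for Frequency-based selection: 1 if conversation c_i
    contains question Q, else 0. Each dropped-out conversation in our
    datasets ends with exactly one out-of-schema question."""
    return 1.0 if conversation["last_question"] == q else 0.0
```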

3.2.2 Markov chain-based selection

Fig. 5 a Markov chain representation of conversation \(c_i\) from Fig. 1, where slots 'food', 'area' and 'price_range' have been filled; b and c are replays of conversation \(c_i\) shown in a when Q is answered and not answered, respectively. \(S_Q^a(c_i)\) and \(S_Q^b(c_i)\) (the success probabilities of replaying conversation \(c_i\) after and before Q is answered, respectively) are calculated using Algorithm 1. Note that once the user decides to go back to state \(s_4\), irrespective of Q being answered or not, they may or may not ask a new out-of-schema question, leading to transitions to either \(q_4\) or \(s_5\). The \(\ldots\) refer to the states between \(s_4\) and \(s_n\)

To address the main limitation of the Frequency-based Selection, namely that it ignores the position of a question (that is, how far from the success state it is asked), we model the user behavior when interacting with a chatbot system. Previous studies have shown the effectiveness of Markov Chains in modeling and explaining user query behavior (Jansen et al. 2009), in modeling query reformulation during interactions with systems via absorbing random walks for both web search and virtual assistants (Wang et al. 2015; Ponnusamy et al. 2020), and in modeling query utility (Zhu et al. 2012). These approaches have proven to be highly scalable and interpretable due to the intuitive definitions of their transition probabilities. In our approach, we first model the interactions between a user and the system in a task-oriented dialog as a Markov chain, where the states and transitions are determined by the user actions. Then, we use this model to measure the benefit of answering a question, which accurately estimates the impact of answering that question on the overall system success rate.

3.2.3 Conversation modeling

The overall modeling flow is illustrated in Fig. 4. In each turn of the conversation, the user either provides a slot value \(s_i\) or asks a question Q (we assume one at a time for simplicity), leading to state transitions \(s_i \rightarrow s_{i+1}\) or \(s_i \rightarrow q_i\) (with the corresponding substates \(r_i\) or \(w_i\) shown in Fig. 4b). The states \(s_i\) indicate that the system is waiting for the value of slot \(s_i\), while states \(q_i\) represent the user asking a question. If an in-schema question is asked, the system responds correctly (substate \(r_i\)) and transitions back to slot-filling (state \(s_i\)). In the case of an out-of-schema question, the system has no answer (substate \(w_i\)), and the user may either drop out (state DROP) or continue (returning to state \(s_i\)). The diagram includes the absorbing states DROP and SUCCESS, indicating the end of the conversation. There is also an "initial" state \(s_0\) for the start of the Markov chain. Each conversation can be represented as a path in Fig. 4a, terminating in either SUCCESS or DROP. The transition probabilities are specified in Table 1. Figure 5a demonstrates the modeling of a conversation, where \(n=7\) required slots named restaurant_food, restaurant_area, restaurant_price_range, restaurant_name, restaurant_bookday, restaurant_bookpeople and restaurant_booktime need to be filled to reserve a restaurant table.
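The chain of Fig. 4 can be written down directly as a row-stochastic transition matrix. The sketch below is our own encoding, not the paper's released code: it folds the intermediate question state \(q_i\) into its substates \(r_i\)/\(w_i\), omits the initial state \(s_0\), and uses the probabilities \(Pr^{sq}\) (user asks a question instead of providing a slot value), \(Pr^{qr}\) (system answers the question) and \(Pr^{wd}\) (user drops out after an unanswered question) described in the text, since Table 1 is not reproduced here.

```python
import numpy as np

def build_chain(n, pr_sq, pr_qr, pr_wd):
    """Transition matrix of the Markov chain in Fig. 4 for n remaining
    required slots. States: s_1..s_n (slot filling), r_i / w_i (question
    answered / unanswered), plus the absorbing SUCCESS and DROP states.
    The q_i states are folded into their r_i / w_i substates."""
    idx = {}
    for i in range(1, n + 1):
        idx[f"s{i}"] = len(idx)
        idx[f"r{i}"] = len(idx)
        idx[f"w{i}"] = len(idx)
    idx["SUCCESS"] = len(idx)
    idx["DROP"] = len(idx)

    P = np.zeros((len(idx), len(idx)))
    for i in range(1, n + 1):
        nxt = "SUCCESS" if i == n else f"s{i + 1}"
        # From s_i: provide the slot value, or ask a question (answered or not).
        P[idx[f"s{i}"], idx[nxt]] = 1 - pr_sq
        P[idx[f"s{i}"], idx[f"r{i}"]] = pr_sq * pr_qr
        P[idx[f"s{i}"], idx[f"w{i}"]] = pr_sq * (1 - pr_qr)
        # Answered question: the user returns to slot filling.
        P[idx[f"r{i}"], idx[f"s{i}"]] = 1.0
        # Unanswered question: the user drops out or continues.
        P[idx[f"w{i}"], idx["DROP"]] = pr_wd
        P[idx[f"w{i}"], idx[f"s{i}"]] = 1 - pr_wd
    # Absorbing states.
    P[idx["SUCCESS"], idx["SUCCESS"]] = 1.0
    P[idx["DROP"], idx["DROP"]] = 1.0
    return P, idx
```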

3.2.4 Benefit score calculation

Algorithm 1 Calculating the benefit of an out-of-schema question

In this section, we discuss how to compute the benefit of answering an out-of-schema question, Q concerning a chat log of dropped-out conversations \(D_o\), given the conversation modeling in Sect. 3.2.3.

The benefit of a question Q with respect to a single conversation \(c_i\) is defined as the increase in the success rate in chat log \(D_o\): \(benefit(Q,c_i) = S_Q^a(c_i) - S_Q^b(c_i)\) if \(Q \in c_i\), and 0 otherwise,

where \(S_Q^a(c_i)\), \(S_Q^b(c_i)\) are the success probabilities of replaying conversation \(c_i\) after and before Q is answered, respectively.

We define a replay of a conversation \(c_i\) as a hypothetical future conversation that is identical to \(c_i\) up to the point when the out-of-schema question Q is asked. Figure 5b and c represent the replays of the conversation model shown in Fig. 5a, after and before the question Q is answered, respectively. The probability of a random user reaching the SUCCESS state is calculated for both replays. If Q is not answered by the system, the user will drop out with probability \(Pr^{qd} = (1-Pr^{qr}) \cdot Pr^{wd}\); and with probability \(Pr^{qs} = Pr^{qr} + (1-Pr^{qr}) \cdot (1-Pr^{wd})\), will go back to the last slot state \(s_i\) and continue the conversation (following Fig. 4b).

The benefit calculation algorithm (Algorithm 1) takes an out-of-schema question Q, a chat log of dropped-out conversations \(D_o\), a set of required slots \(S = \{s_1,\ldots ,s_n\}\), and the transition probabilities \(\tau\) of Table 1 as inputs. It outputs the total benefit B(Q) of question Q over \(D_o\). The algorithm checks for each conversation \(c_i\) in \(D_o\) whether question Q has been asked. If not, the algorithm proceeds to the next conversation \(c_{i+1}\). If question \(Q \in c_i\), the algorithm determines the remaining required slots (\(R\subseteq S\)) that need to be filled to reach SUCCESS. If a random user repeats conversation \(c_i\) up to the point when question Q is asked, and by this time \(i-1\) slots have been filled, these remaining slots comprise the states \(s_i,\ldots , s_{n}\) in the Markov chain representation of the replay. The algorithm creates two adjacency matrices representing two Markov chains, \(M_a\) and \(M_b\), for replaying conversation \(c_i\) when question Q is and is not answered, respectively. The only difference between the two representations is that \(M_a\) starts at \(start =\) substate \(r_i\), assuming that the system answers the question correctly, while \(M_b\) starts at \(start =\) substate \(w_i\), assuming that the system failed to answer question Q. The PageRank formula (Brin and Page 1998) is then used to calculate the probability of reaching the SUCCESS state in both replays. We use a damping factor of 0.99 (\(\approx 1\)), which intuitively means the random user almost always follows a transition edge, while at the same time guaranteeing that the walk ends in one of the absorbing states. We calculate the PageRanks of all states of the Markov chain until the results converge. The benefit of answering Q for conversation \(c_i\) is the increase in the PageRank of the SUCCESS state when Q is answered (\(S_{Q}^a\)) compared to when Q is not answered (\(S_{Q}^b\)). The total benefit score B(Q) is obtained by summing \(benefit(Q,c_i)\) over all conversations.
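A sketch of the benefit computation for a single conversation, reusing `build_chain` from the sketch in Sect. 3.2.3. For simplicity it replaces the damped (0.99) PageRank iteration with plain power iteration of the absorbing chain, which converges to the same absorption probability of the SUCCESS state; the conversation fields (`last_question`, `remaining_required_slots`) are the placeholders introduced earlier, and we assume at least one required slot is still unfilled when Q is asked.

```python
import numpy as np  # build_chain from the sketch in Sect. 3.2.3 is reused here

def success_probability(P, idx, start_state, iters=10_000, tol=1e-12):
    """Probability that a replay starting in `start_state` is eventually
    absorbed in SUCCESS, computed by power iteration of the chain (P, idx).
    This stands in for the damped PageRank iteration described above."""
    dist = np.zeros(P.shape[0])
    dist[idx[start_state]] = 1.0
    for _ in range(iters):
        new = dist @ P
        if np.abs(new - dist).sum() < tol:
            break
        dist = new
    return dist[idx["SUCCESS"]]

def markov_benefit(q, conversation, pr_sq, pr_qr, pr_wd):
    """benefit(Q, c_i) of Algorithm 1: increase in the success probability of
    the replay when Q is answered (start at r_1, chain M_a) versus not
    answered (start at w_1, chain M_b). The chain is built only over the
    slots still unfilled when Q was asked, so the replay starts at index 1."""
    if conversation["last_question"] != q:
        return 0.0
    P, idx = build_chain(conversation["remaining_required_slots"],
                         pr_sq, pr_qr, pr_wd)
    s_after = success_probability(P, idx, "r1")   # M_a: question answered
    s_before = success_probability(P, idx, "w1")  # M_b: question unanswered
    return s_after - s_before
```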

4 Empirical evaluation

In this section, we first present the datasets we built for the OQS task. Then we present our experimental setup and results.

4.1 OQS datasets

Fig. 6 Transformation of an original conversation from the MultiWoz dataset into one conversation of our D-MultiWoz dataset; the dotted red box is the part of the original conversation that is removed in our D-MultiWoz conversation, and turn \(t_5\) now becomes the out-of-schema question

Table 2 Datasets summary

The existing datasets mentioned in Sect. 1 are widely used for dialog state tracking tasks. However, none of them contains out-of-schema user questions leading to users dropping out without achieving the goal they had in mind. Maqbool et al. (2022) built a dataset including off-script user questions seeking additional information, but these questions always appear after all the required slots have been filled.

For our OQS task, we need a dataset of multi-turn conversations in which out-of-schema user questions can be asked at any point of the conversation, leading users to drop out. We extended two existing datasets (Budzianowski et al. 2018; Maqbool et al. 2022). A summary of our proposed datasets is given in Table 2. Our first dataset, D-MultiWoz (D stands for "Drop-out"), extends the latest version of MultiWoz (2.2) (Budzianowski et al. 2018). We chose the "restaurant" domain and inserted five categories of unique out-of-schema questions into the existing conversations. Our second dataset, D-SGD, is a modified version of the dataset provided by Maqbool et al. (2022). They built their dataset by adding Amazon Mechanical Turk queries to the SGD (Rastogi et al. 2020) dataset. However, their queries always appear after all the required slots have been filled and the system has successfully booked an event ticket. D-SGD contains the same conversations with out-of-schema questions appearing at different points of the conversations, after different numbers of filled slots.

4.1.1 D-multiwoz dataset

4.1.1.1 Limitations of the MultiWoz dataset

MultiWoz (Budzianowski et al. 2018) is a large-scale Wizard-of-Oz multi-turn conversational corpus spanning eight domains. Each conversation consists of a goal and a set of user and system turns. In each turn, a user either requests a slot value or provides one. An out-of-schema question is never asked, and users never drop out. We overcame these limitations by injecting out-of-schema questions and user dropouts into the conversations of MultiWoz.

4.1.1.2 Modifications in D-MultiWoz

We refined the MultiWoz dataset to include out-of-schema questions leading to user dropouts. We define five categories of questions (initial, area, food, price_range and restaurant), which may be asked depending on the slots that have already been filled, as shown in Table 3. We create at least one new conversation (see below for the case where we add more than one) by inserting an out-of-schema question into each conversation of MultiWoz. The modification steps are shown for an example conversation in Fig. 6. The modification involves selecting a random user turn (t), inserting a question chosen following Table 3, responding with "I don't know" from the system, and truncating the dialogue thereafter (a sketch of this step is shown below). If the very first turn is selected (\(t=1\)), the turn is re-sampled so that the very first turn is never used.
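A minimal sketch of this modification step. The turn representation, the `eligible_questions` helper (which would encode the category conditions of Table 3), and the field names are simplifications we introduce for illustration, not the released dataset-construction code.

```python
import random

def inject_dropout(turns, eligible_questions, rng=random):
    """Turn one MultiWoz conversation into a D-MultiWoz-style conversation:
    keep the dialog up to a randomly chosen user turn (never the very first
    one), insert an out-of-schema question whose category is allowed by the
    slots filled so far (Table 3), add the system's "I don't know", and drop
    the rest. `eligible_questions` maps a set of filled slots to candidate
    questions; `turns` is a list of {"speaker", "text", "slots"} dicts and is
    assumed to have at least three turns."""
    # Re-sampling when t = 1, as described in the text, is equivalent to
    # never picking the very first turn.
    t = rng.randrange(2, len(turns))
    filled = {s for turn in turns[:t] for s in turn.get("slots", [])}
    question = rng.choice(eligible_questions(filled))
    return turns[:t] + [
        {"speaker": "user", "text": question},
        {"speaker": "system", "text": "I don't know."},
    ]
```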

Table 3 Conditions for an out-of-schema question category to qualify and number of questions belonging to a particular category

Note that the MultiWoz dataset contains data from eight domains. We chose the 'Restaurant' domain due to its broader scope for inserting various categories of out-of-schema questions after different slots. Our new dataset, D-MultiWoz, contains 523 conversations, each having one of 240 unique out-of-schema questions and ending in a user dropout. Some questions are repeated across multiple conversations in the dataset.

4.1.2 D-SGD dataset

4.1.2.1 Limitations of the augmented SGD dataset in Maqbool et al. (2022)

Maqbool et al. curated the SGD dataset by adding self-contained off-script questions, focusing on anaphora resolution. Their questions were mostly asked after filling all the required slots and completing key tasks, such as booking tickets. This limits insights into how question timing affects a question's value in task completion, leaving a gap in understanding how a question's position impacts the success state, which is what we need to evaluate OQS algorithms.

4.1.2.2 Modifications in D-SGD

The construction of the D-SGD dataset aims to explore the handling of out-of-schema questions by including them at different conversation stages. We move a self-contained question from the end of each conversation to an earlier user turn, replacing the original utterance (e.g., moving \(t_7\) to \(t_5\) in Fig. 7) and removing the remainder of the conversation. Our dataset specifically focuses on the intent of booking event tickets, chosen for its increased complexity compared to finding events, which offers a greater challenge for the OQS problem.

Fig. 7 Transformation of an original conversation from Maqbool et al.'s dataset (Maqbool et al. 2022) into one conversation of our D-SGD dataset; the dotted red box is the part of the original conversation that is removed in our D-SGD conversation, and turn \(t_7\) now becomes \(t_5\)

4.2 Experimental setup

4.2.1 Out-of-schema question detection experimental details

We implement our ICL approach on instruction-tuned open-source LLMs, including Flan-t5-xl (Chung et al. 2022) and Llama-2-7B (Touvron et al. 2023), and on closed-source larger LLMs: GPT-4 (OpenAI 2023) and GPT-3.5 (Brown et al. 2020), following the prompts shown in Fig. 3 with 5 examples. We evaluate our out-of-schema question detection method on our two datasets, D-MultiWoz and D-SGD, along with the 'restaurant' domain (to keep it consistent with the D-MultiWoz evaluation) of the 'DST9 Track1' dataset of external knowledge-grounded 'out of scope' questions provided by Kim et al. (2020). We use Kim et al.'s BERT-based (Devlin et al. 2018) neural classifier, fine-tuned for out-of-domain question detection, as one of our baselines to compare with our ICL approach. As we formulate our task as yes/no question answering, we also compare against the supervised QA models T5-base, Roberta-base and BERT-large fine-tuned on boolean QA (Clark et al. 2019) as baselines (more details in Appendix B.4). We present the results in Sect. 4.3.

4.2.1.1 OQS algorithms experimental details

We evaluate our OQS algorithms in three different ways. First, we measure how much each algorithm brings the conversations closer to the success state, that is, how much it reduces the distance in hops (slots) from the current state to the success state (Sect. 4.3). Second, we perform a simulation of conversations, based on realistic parameters, to measure the increase in the success rate after applying the proposed algorithms (Sect. 4.3). The simulation uses parameters computed from the datasets (following Table 1) where possible. Third, we analyze the performance of our OQS algorithms using few-shot prompting in GPT-4, where GPT-4 estimates the probability of completion of each conversation based on the information collected so far (Sect. 4.3). Hackl et al. (2023) demonstrated GPT-4's consistent human-level accuracy in automated evaluation tasks across multiple iterations, time spans, and stylistic variations.

Table 4 Comparison of out-of-schema question detection methods

We evaluate the two OQS algorithms on the D-MultiWoz and D-SGD datasets. The questions in the DST9 Track1 dataset (Kim et al. 2020) are inherently grounded in existing knowledge resources. This characteristic contradicts the primary objective of our work, which is to identify and prioritize, for the system admins, the most critical questions whose answers are not known. We therefore exclude this dataset from the evaluation of the OQS algorithms.

Table 5 Transition probabilities for Markov-based experimental evaluation

To define the SUCCESS state, we establish a set of required slots S for each dataset. The D-SGD dataset has four predefined slots, while the D-MultiWoz dataset lacks a specific set of required slots. For D-MultiWoz, we individually determine the required slots for each conversation based on the slots filled in the original MultiWoz conversation before reaching success. As such, different conversations have varying numbers and sets of required slots, which are all subsets of this set of slots: restaurant-food, restaurant-area, restaurant-pricerange, restaurant-name, restaurant-bookday, restaurant-bookpeople, and restaurant-booktime. Transition probabilities from Table 5 are used for the Markov-based evaluation, with \(Pr^{sq}\) computed from the datasets following Table 1. We selected the values for \(Pr^{qr}\) (probability of the system responding correctly) and \(Pr^{wd}\) (probability of dropout) to ensure a diverse set of experiments, as these values are constant in our datasets (all out-of-schema questions lead to dropout, so \(Pr^{qr}=0\) and \(Pr^{wd}=1\)). Four values of \(Pr^{wd}\) were used to model users with varying degrees of patience when their questions are not answered. Additionally, alongside our proposed methods, we include a Random method that selects k random out-of-schema questions, as there are no established baselines for the problem.

Fig. 8 a Number of conversations where the top-ranked questions appear up to 1, 2 and 3 hops away from SUCCESS for D-SGD (top) and D-MultiWoz (bottom), with \(Pr^{qr} = 0.2\); b the same for D-MultiWoz only, with \(Pr^{qr} = 0.5\)

4.3 Experimental results

4.3.1 Evaluation of Out-of-schema Question detection framework

Table 4 shows that GPT-4 achieves the overall best performance on all three datasets for all three metrics for both prompt choices. We believe that the superior performance of GPT-4 is due to its stronger reasoning capabilities. Bang et al. (2023) also showed evidence of GPT-4's superior ability against other LLMs and GPT-3.5 for varied reasoning tasks, including classification and question answering. Our results highlight the QA reasoning limitations of the supervised models and smaller open-source LLMs in comparison with GPT-4. However, all the LLMs achieve comparable or better performance than the supervised baselines on unseen datasets in at least one of the prompt settings. The authors of Kim et al. (2020) mention that their BERT classifier works well when restricted to their training dataset or similar data. Both of our datasets have some common slots, such as 'location', 'date' and 'number of people', which are similar to their training data. Our analysis suggests that this resemblance contributes to the high precision of the baseline on D-MultiWoz and D-SGD. However, the baseline can only identify around 35% of the total out-of-schema questions, whereas GPT-4 achieves up to 48% improvement on D-MultiWoz and D-SGD, and up to 10% improvement on their test set, over the baseline in detecting out-of-schema questions.

Our ICL approaches achieve comparable or better performance on all three datasets, which have different slots and goals, without any training cost or data. The supervised BERT-large model tends to have very high recall, but its precision is low, indicating a tendency to mark all utterances as 'out-of-schema' without properly learning the differences. Flan-t5-xl performs very poorly on Prompt1 and comparably to Llama2-7B on Prompt2, indicating its limitations in following longer, step-by-step task specifications. On the contrary, step-by-step reasoning helps improve the performance of Llama2 and GPT-3.5. However, GPT-4 outperforms all baselines and LLMs with a substantial performance gap in all contexts and scenarios (more experiments in Appendix B.1).

4.3.2 Evaluation of OQS algorithms using distance from SUCCESS state

In this experiment, we assume that the questions selected by the OQS algorithms are answered, for each conversation in a dataset denoted as \(\mathbb {D}\). Let \(D' \subseteq \mathbb {D}\) be the set of out-of-schema conversations ending in question Q. For a conversation \(c_i\) in \(D'\) with \(x_i\) filled slots out of a total of n required slots, the conversation is considered \(n - x_i\) hops away from the SUCCESS state. We count the number of conversations in \(D'\) that are 1, 2, or 3 hops away, considering different values of \(Pr^{wd}\), \(Pr^{qr}\), and k for both the D-SGD and D-MultiWoz datasets, as sketched below.
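A sketch of this count, using the placeholder conversation fields from the earlier sketches (`last_question` and `remaining_required_slots`, the latter being exactly the hop distance \(n - x_i\)):

```python
from collections import Counter

def hop_histogram(selected_questions, conversations, max_hops=3):
    """Count how many conversations ending in one of the selected questions
    are 1, 2, ..., max_hops unfilled required slots (hops) away from SUCCESS."""
    hist = Counter()
    for c in conversations:
        if c["last_question"] in selected_questions:
            hops = c["remaining_required_slots"]   # n - x_i
            if 1 <= hops <= max_hops:
                hist[hops] += 1
    return hist
```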

The results in Fig. 8 show that the Markov-based Top k question selection brings a significantly higher number of conversations closer to the success state across different values of \(Pr^{wd}\), with \(Pr^{qr}\) set to 0.2. The Markov-based algorithm consistently outperforms the Frequency-based and Random approaches, with a more evident advantage for larger values of k. Note that the Frequency-based method performs poorly with larger k due to ties in frequencies, resulting in random selection from questions with the same frequency. For brevity, we show only one graph for \(Pr^{qr} = 0.5\), as similar trends are observed in the other cases (Fig. 8b).

In the above discussion, we view each question as unique. However, in practice, different questions may have similar or identical meanings. For this reason, we also study OQS-Cluster, where we first cluster the questions based on their semantics and then pick the Top k clusters to answer. The details of this setting are given in Appendix B.2.

4.3.3 Evaluation of OQS algorithms using Success Rate for Simulated Users

We evaluated our algorithms by running simulated conversations in which the system has been taught the answers to the top-k questions selected by the corresponding OQS algorithm. The evaluation measures the increase in success rate achieved by answering the Top k questions, compared to the scenario where the system cannot answer these questions.

We use conversation replays from \(D_o\), as defined in Sect. 3.2.4, up to the point when an out-of-schema question Q was asked. If Q is one of the Top k questions selected by the OQS algorithm, then we set \(Pr^{qr}\) to 1 (as in Fig. 5b), otherwise we set it to 0 (as in Fig. 5c), for that particular path. Each simulated conversation continues until reaching a SUCCESS or DROP state, following the transition probabilities shown in Table 1 and choosing a path according to Fig. 4. We performed 100 such simulations for each conversation in the chat log, considering four distinct user profiles with varying degrees of patience (different values of \(Pr^{wd}\) as shown in Table 5).

Our metric for evaluation is the percentage increase in success rate, calculated as the relative difference between the number of successful runs with and without teaching the system the top-k questions. Specifically, let \(S_k\) (and \(S_0\)) be the total number of successful simulated runs when the system has been taught (and not been taught) the answers to the top k questions. Then, our evaluation metric is defined as: \(\%~increase~in~success~rate = \frac{S_k - S_0}{S_0} \times 100\%\).
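A sketch of one simulated replay and of the metric above. It follows the transitions of Fig. 4 from the point where the out-of-schema question was asked; `answered` is True when that question is among the selected Top k. Parameter names and the simplified state handling are our own, illustrative of the procedure rather than the exact simulator used in the experiments.

```python
import random

def simulate_replay(n_remaining, answered, pr_sq, pr_qr, pr_wd, rng=random):
    """Simulate one replay of a dropped-out conversation from the point where
    its out-of-schema question was asked, following Fig. 4. `answered` is True
    when the question is among the Top k (Pr^qr = 1 for this first question),
    False otherwise (Pr^qr = 0). Returns True for SUCCESS, False for DROP."""
    # Resolve the pending out-of-schema question first.
    if not answered and rng.random() < pr_wd:
        return False                          # user drops out immediately
    slots_left = n_remaining
    while slots_left > 0:
        if rng.random() < pr_sq:              # user asks another question
            if rng.random() >= pr_qr:         # system has no answer
                if rng.random() < pr_wd:
                    return False              # user drops out
            continue                          # user returns to slot filling
        slots_left -= 1                       # user provides the next slot value
    return True                               # all required slots filled

def pct_increase_in_success_rate(s_k, s_0):
    """Percentage increase in success rate, (S_k - S_0) / S_0 * 100."""
    return (s_k - s_0) / s_0 * 100.0
```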

We present our simulation results in Table 6. The values of \(S_0\) are shown in the last two columns of Table 6. We show the comparisons among the three approaches for four different user profiles and for four different values of k, when \(Pr^{qr}=0.2\).

Table 6 Percentage increase in success rate of simulated runs when the system has been taught the answers to the Top k out-of-schema questions for 3 different approaches

The results demonstrate that the Markov-based approach is superior to the other two, showing up to a 78% improvement compared to the Frequency-based approach and up to a 313% improvement compared to the Random approach across different user profiles and values of k. The improvement is particularly large when the likelihood of user dropout (\(Pr^{wd}\)) is higher, emphasizing the value of considering question position, especially in scenarios where dropouts are more frequent.

4.3.4 Evaluation of OQS algorithms using GPT-4 Chat completion estimation

Table 7 GPT-4 Estimation of Chat Completion when Top k questions are answered

In addition to our standard evaluations, we extended the performance analysis of the OQS algorithms in a few-shot setting using GPT-4. Specifically, given a conversation that dropped out because an out-of-schema question was asked, we prompted GPT-4 to estimate the likelihood of the system successfully gathering all required information and making a reservation. We provided two few-shot examples, one with a very low chance of chat completion and another with a very high chance. The detailed prompt for this setting is available in Appendix B.1. We then compute the average GPT-4 score of the conversations containing the Top-k questions, and compare these scores for the Top-k out-of-schema questions ranked by the two OQS algorithms.

Table 7 shows that all 4 variants of the Markov-based ranking score higher than the Frequency-based ranking on both datasets. This indicates that the Markov-based method selects questions from conversations that are more likely to be completed successfully. As an additional verification measure, we manually inspected a random sample of 50 GPT-4 evaluations and confirmed that the obtained scores are reasonable.

5 Conclusion

We address the significant challenge of handling out-of-schema questions in task-oriented dialog systems, which, if unaddressed, can lead to user dropouts. We highlight the limited bandwidth of human experts to manually respond to the potentially vast number of these questions, and introduce a novel two-stage pipeline for detecting and selecting the most significant out-of-schema questions, aimed at enhancing system success rates. We publish two datasets specifically designed to include out-of-schema questions and user dropouts. Our experimental evaluation, utilizing both quantitative and simulation-based analysis, shows that our few-shot ICL approach achieves a new state of the art in detecting out-of-schema/scope questions, and that our OQS algorithms, especially the Markov-based method, significantly increase the success rate of task-oriented dialog systems. We release our datasets publicly to encourage future research in this area.

5.1 Limitations

In our work, we assume a uniform dropout probability for all unanswered questions, due to the unavailability of a dataset that classifies the importance of a user question based on its effect on conversation dropout. In future work, we will explore the semantics of out-of-schema questions to predict the dropout likelihood. Our analysis also does not consider the varying costs of answering different questions. For example, obtaining the address of a restaurant from public sources might be easier than verifying its vegan options, which could be more labor-intensive. Integrating the cost of answers with their benefits in our OQS algorithms is an area for future exploration. Further, we focus on situations where the system lacks the information to answer a question, rather than failures in understanding the question's intent, which has been covered by existing research.