
The 2nd FutureDial Challenge: Dialog Systems with Retrieval Augmented Generation (FutureDial-RAG)

Yucheng Cai (Tsinghua University)
Shi Chen (China Mobile Research Institute)
Yi Huang (China Mobile Research Institute)
Junlan Feng (China Mobile Research Institute, fengjunlan@chinamobile.com)
Zhijian Ou∗† (Tsinghua University, ozj@tsinghua.edu.cn)
∗ Supported by National Science and Technology Major Project (2023ZD0121401). † Corresponding author.

1 Overview

Developing intelligent dialog systems has been one of the longest running goals in AI. In recent years, significant progress has been made in building dialog systems with the breakthrough of deep learning methods and the large amount of conversational data being made available for system development (Budzianowski et al., 2018; Ou et al., 2022a; Ouyang et al., 2022; Achiam et al., 2023).

Many challenges remain on the way toward future dialog systems. The first FutureDial challenge focused on building semi-supervised and reinforced task-oriented dialog systems (FutureDial-SereTOD) (Ou et al., 2022a; b), and was successfully held at the EMNLP 2022 SereTOD workshop (http://seretod.org/). ChatGPT (Ouyang et al., 2022), a generative dialog system that emerged at the end of 2022, marked another major advance in engaging users in open-domain dialogs. However, problems like hallucination and fabrication (Alkaissi & McFarlane, 2023) still hinder the use of such systems in real-life applications like customer service systems, which require pinpoint accuracy. Retrieval augmented generation (RAG) (Lewis et al., 2020; Guu et al., 2020) has been introduced to enhance dialog systems with information retrieved from external knowledge bases and has attracted increasing interest. RAG has been shown to help dialog systems reply with higher accuracy and factuality, providing more informative and grounded responses (Humeau et al., 2020; Izacard & Grave, 2021; Shuster et al., 2022b; Glass et al., 2022; Izacard et al., 2022a; b; Shuster et al., 2022a; Cai et al., 2023). However, challenges remain for RAG-based dialog systems, such as designing retrievers that can retrieve knowledge from multiple knowledge sources and building RAG-based dialog systems that can effectively utilize available tools and API calls for retrieval (Schick et al., 2023; Yao et al., 2023).

To further promote the study of how to empower dialog systems with RAG, we release a new dataset, called MobileCS2 (Mobile Customer Service), that aims to benchmark and stimulate related research in this area. The dataset originates from real-life customer-service logs from China Mobile. Relevant knowledge bases and ground-truth retrieval results are annotated so that dialog systems with RAG can be trained on the data. Successfully fulfilling a customer-service request usually requires calling specialized domain knowledge and/or APIs to retrieve relevant information, so this dataset is well suited to benchmarking dialog systems with RAG. MobileCS2 contains multiple types of knowledge bases, such as user profiles, product information and an FAQ manual, which makes the retrieval task in RAG challenging. Moreover, the dataset contains around 3,000 sessions of unlabeled dialogs along with a similar number of sessions of labeled dialogs, which facilitates the study of semi-supervised RAG-based dialog systems (Zhang et al., 2020; Liu et al., 2022a; Cai et al., 2022; Liu et al., 2023; Cai et al., 2023).

Following the success of the 1st FutureDial challenge, the 2nd FutureDial challenge (http://futuredial.org/), co-located with SLT 2024, aims to benchmark and stimulate research in building dialog systems with RAG, with the newly released dialog dataset, MobileCS2, as overviewed in Figure 1. We aim to create a forum to discuss key challenges in the field and share findings from real-world applications.

2 Topics of interest for the challenge

Topics of interest that are relevant to the challenge include, but are not limited to, the following:

  • Retrieval augmented dialog systems

  • Information retrieval

  • Grounded dialog systems with unstructured knowledge sources

  • Large language model based dialog systems

  • Holistic AI technologies and applications

  • Semi-supervised dialog systems

  • Reinforced dialog systems

  • Evaluation of dialog systems

  • Dialog-related datasets and language resources

  • General topics for dialog systems

3 Shared task

Figure 1: Overview of the FutureDial-RAG Challenge: Dialog Systems with Retrieval Augmented Generation.

The 1st FutureDial challenge at EMNLP 2022 (Ou et al., 2022a; b; Liu et al., 2022b) focused on building semi-supervised and reinforced task-oriented dialog systems (FutureDial-SereTOD), and released a large-scale human-human dialog dataset MobileCS1 (Mobile Customer Service).

The 2nd FutureDial challenge focuses on building dialog systems with RAG, with the following features:

  • We release a new dataset from the China Mobile customer-service logs (MobileCS2) that contains both labeled and unlabeled data, which encourages the study of semi-supervised RAG-based dialog systems.

  • The dataset enables the study of building dialog systems with knowledge base queries and API calls.

  • The dataset is available in both Chinese and English versions to the public, so that researchers around the world can experiment with this dataset.

To enable a RAG-based dialog system to provide appropriate answers and services to users, it is essential for the system to utilize knowledge relevant to the conversation context. Therefore, the 2nd challenge examines how dialog systems can retrieve the most appropriate knowledge pieces from the knowledge base and generate grounded and faithful responses to user requests, using the newly released knowledge-grounded dialog dataset, MobileCS2. The information needed should be retrieved from a given database or API call, which returns specific feedback closely related to real customer service scenarios, such as bill inquiry and package change. Accordingly, the following two tracks are proposed, which concern information retrieval over dialog data and the construction of RAG-based dialog systems in the customer service scenario, respectively:

  • Track 1: Information retrieval based on knowledge bases and dialog context

  • Track 2: Dialog systems with retrieval augmented generation

Given the context in a dialog, the most relevant knowledge snippet in the multi-source databases should be retrieved by a retrieval model, so Track 1 aims to build the retrieval model for the dialog system. Based on the retrieved knowledge, Track 2 aims to build a retrieval-augmented dialog system in the customer service scenario. The system should generate informative responses leveraging the retrieved results. Offline corpus-based evaluation will be conducted to test the performance of the submitted systems.

4 The MobileCS2 Dataset

The MobileCS2 dataset is derived from real-world conversational scenarios at China Mobile and comprises around 6,000 processed dialog logs (nearly 3,000 carefully annotated) between customers and customer service staff. It can serve research aims such as the development of conversational models, colloquial human-to-human dialog systems, and data-driven systematic dialog analysis.

Table 1: Detailed description for Api_query annotation. The Chinese version can be seen in Appendix.
Main_class | Api_query | Description
QA | [QA] | Consult the FAQ manual, which includes a collection of commonly asked questions such as recent promotional packages and general business regulations.
NULL | - | Based on the contextual information, customer service personnel can successfully complete the conversation without the need for additional inquiries.
API-Inquiry | Search for products information | Inquire about the current business information of the China Mobile, such as specific packages, data plans, etc.
API-Inquiry | Search for user information | Inquire about the services that the user currently possesses, including the current package, current monthly fee, and current data usage.
API-Inquiry | Search for other information | Inquire about other key information used to complete the dialog. For example, inquiring about text messages regarding excessive data usage alerts sent by the China Mobile 10086 in the historical trajectory, querying the address of the business hall, etc.
API-Cancel | Cancel business | Revoke a certain service currently possessed by the user.
API-Handle | Handle business | Process a new service for the user.
API-Verification | Verify identity | Send verification codes, passwords, or other related customer service verification operations to the user.

4.1 Annotation Details

In the customer service scenario, the customer service agent needs to obtain certain knowledge or information from knowledge bases (KBs) in order to respond correctly to the user. Therefore, to annotate the necessary knowledge or information, the annotators should imagine themselves as customer service agents. When presented with a dialog, annotators are required to identify the agent’s intent at each turn. If the intent is to query the KBs to seek external information and the response contains specific details, the annotator should perform a retrospective analysis based on the information provided in the response and annotate the corresponding query result. Specifically, Table 1 contains the set of intents (annotated as Api_query) and an explanation of each intent, which are provided to annotators for reference during the annotation process. For example, given the user utterance “Help me check my package”, the annotator needs to identify the intent “Search for user information” and then annotate the package that appears in the customer service agent’s response as the query result.

We recruited 6 China Mobile customer service staff members for the annotation, divided into 2 teams. The annotation is conducted dialog by dialog, with each dialog assigned to an arbitrary annotator, and the whole annotation process takes about a week. To ensure the quality of the dataset, cross-validation is conducted between the 2 teams: every day, 100 annotated dialogs are checked by the other team. The cross-validation agreement rate is 97 percent, which indicates that the annotation is of high quality. After annotation, the dataset is desensitized to remove sensitive personal information such as individual names, IDs, and phone numbers.

In the final dataset, each sample represents a dialog. At each turn of the dialog, two types of information are annotated: the customer service intent and the customer service query results. An example of an annotated dialog is shown in Figure 2.

Figure 2: An example of annotated dialogs. The Chinese version can be seen in Appendix.

4.2 Post-processing

Based on the annotation data, it is possible to aggregate the information in the dataset and simulate the information that the agents can access in real-world services. For turns annotated with the inquiry [QA], the information can be aggregated into an FAQ (Frequently Asked Questions) handbook across the entire dataset. Turns labeled as “Search for user information” can be consolidated into a user database (local_kb) within a single dialog. Meanwhile, turns labeled as “Search for products information” can be aggregated into a product database (global_kb) across the entire dataset. These three databases largely emulate the channels through which the agents acquire knowledge in real-world settings.
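
The aggregation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (a dialog with an `id` and a list of `turns`, each carrying `api_query` and `api_result` fields) and the function name `build_knowledge_bases` are hypothetical and are not taken from the released data format. Only the three intent labels and the names local_kb and global_kb come from the text.

```python
def add_unique(items, entry):
    """Append entry to items unless an identical entry is already present."""
    if entry not in items:
        items.append(entry)


def build_knowledge_bases(dialogs):
    """Aggregate annotated query results into three knowledge sources.

    Assumed (hypothetical) input layout:
      {"id": str, "turns": [{"api_query": str, "api_result": str}, ...]}
    """
    faq = []        # FAQ handbook, shared across the entire dataset
    global_kb = []  # product database, shared across the entire dataset
    local_kb = {}   # user database, one list per dialog

    for dialog in dialogs:
        user_entries = []
        for turn in dialog["turns"]:
            query = turn.get("api_query")
            result = turn.get("api_result")
            if not result:
                continue  # nothing was retrieved at this turn
            if query == "[QA]":
                add_unique(faq, result)
            elif query == "Search for products information":
                add_unique(global_kb, result)
            elif query == "Search for user information":
                add_unique(user_entries, result)
        local_kb[dialog["id"]] = user_entries

    return faq, global_kb, local_kb
```

Keeping local_kb keyed by dialog while the other two sources are shared mirrors the distinction drawn in the text between per-dialog user information and dataset-wide product and FAQ information.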

5 The Baseline System and metrics

5.1 The Baseline System

We use RAG-based (Lewis et al., 2020; Cai et al., 2023) methods to build our baseline system. RAG-based dialog systems aim to retrieve relevant knowledge pieces given the dialog context and generate the system response using the retrieved knowledge. For MobileCS2, we take several important settings into consideration, such as adding the unique user profile of each user to the knowledge base and allowing multiple knowledge pieces to be relevant for a given context. RAG over MobileCS2 targets real-life scenarios, which differs from prior work on knowledge-grounded dialog systems (Lewis et al., 2020; Cai et al., 2023).

To introduce the RAG-based baseline system on MobileCS2, we make the following definitions. Assume we have a dialog $X$ with $T$ turns of user utterances and system responses, denoted by $u_1, r_1, \cdots, u_T, r_T$ respectively. For each dialog, we assume that there is a knowledge base that is necessary for the system to respond correctly. In MobileCS2, the knowledge base is made up of the user information, which is unique for each dialog, the product information list, and the FAQ list for commonly asked questions. Therefore, for the dialog $X$, the knowledge base $KB_X$ can be denoted as: $KB_X \triangleq KB_{user} \cup KB_{FAQ} \cup KB_{product}$.

At turn $t$ of a dialog $X$, based on the dialog context $c_t \triangleq u_1 \oplus r_1 \oplus \cdots \oplus u_{t-1} \oplus r_{t-1} \oplus u_t$ ($\oplus$ means sequence concatenation) and the knowledge base $KB_X$, the system uses a retriever to get the relevant knowledge $h_t$ from the knowledge base and generates appropriate responses with the generator $p_\theta(r_t \mid c_t, h_t)$.

To train the retrieval model, we consider each knowledge piece $z_i~(i=1,2,\cdots,K)$ in $KB_X$ and model the retrieval distribution $p_\eta(z_i \mid c_t)$ as in (Lewis et al., 2020):

$$p_\eta(z_i \mid c_t) \propto \exp\left(\operatorname{Encoder}_p(z_i)^\top \operatorname{Encoder}_c(c_t)\right) \tag{1}$$

$\operatorname{Encoder}_p$ and $\operatorname{Encoder}_c$ are both initialized with a BERT-based pretrained model (Devlin et al., 2019). The probability is optimized with the standard cross entropy loss, with the positive pieces $z \in Z_+$ labeled in the dataset:

$$\mathcal{L}_{ret} = -\frac{1}{|Z_+|} \sum_{z \in Z_+} \log \frac{p_\eta(z \mid c_t)}{p_\eta(z \mid c_t) + \sum_{i=1, z_i \neq z}^{K} p_\eta(z_i \mid c_t)} \tag{2}$$

The knowledge piece encoder $\operatorname{Encoder}_p$ is fixed during training, and the context encoder $\operatorname{Encoder}_c$ is trained with the loss in Eq. 2, following the setting in Karpukhin et al. (2020).
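
To make Eq. 1 and Eq. 2 concrete, the following minimal Python sketch computes the retrieval distribution and the retrieval loss for one dialog context. It is an illustration, not the baseline implementation: plain lists of floats stand in for the outputs of the two encoders, and the function names are hypothetical. Because the ratio inside the logarithm of Eq. 2 is the unnormalized score of $z$ divided by the sum over all $K$ pieces, it equals a softmax probability, which is what the code computes.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def log_retrieval_probs(piece_vecs, context_vec):
    """Log of p(z_i | c_t) from Eq. 1, normalized over the K pieces.

    piece_vecs stands in for Encoder_p(z_i); context_vec for Encoder_c(c_t).
    """
    scores = [dot(z, context_vec) for z in piece_vecs]
    top = max(scores)  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_norm for s in scores]


def retrieval_loss(piece_vecs, context_vec, positive_ids):
    """Eq. 2: mean negative log-probability of the labeled positive pieces."""
    log_p = log_retrieval_probs(piece_vecs, context_vec)
    return -sum(log_p[i] for i in positive_ids) / len(positive_ids)
```

For example, with pieces `[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]` and context `[2.0, 0.0]`, labeling the first piece as positive yields a smaller loss than labeling the third, since the first piece has the largest inner product with the context.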

To train the dialog system $p_\theta(r_t \mid c_t, h_t)$, we use the standard auto-regressive loss to optimize the generation probability:

$$p_\theta(r_t \mid c_t, h_t) = \prod_{l=1}^{|r_t|} p_\theta(y^l \mid c_t, h_t, y^1, \ldots, y^{l-1}) \tag{3}$$

where $|\cdot|$ denotes the length in tokens, $y^l$ is the $l$-th token of $r_t$, and $p_\theta$ is initialized with a GPT-based pretrained language model (Radford et al., 2019).
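
The factorization in Eq. 3 can be illustrated with a small Python sketch that accumulates token-level log-probabilities along the response; the auto-regressive training loss is the negative of this quantity. The callable `next_token_prob` is a hypothetical stand-in for the generator $p_\theta$, and `uniform_model` is a toy model used only to make the example runnable.

```python
import math


def response_log_prob(next_token_prob, context, knowledge, response):
    """log p(r_t | c_t, h_t) computed with the chain rule of Eq. 3.

    next_token_prob(prefix, token) is a hypothetical callable returning the
    probability of `token` given all tokens in `prefix`.
    """
    prefix = list(context) + list(knowledge)  # condition on c_t and h_t
    total = 0.0
    for token in response:
        total += math.log(next_token_prob(prefix, token))
        prefix.append(token)  # the next token also conditions on y^1..y^l
    return total


def uniform_model(prefix, token, vocab_size=4):
    """Toy model: every token is equally likely regardless of the prefix."""
    return 1.0 / vocab_size
```

With the toy model, a three-token response has log-probability `3 * math.log(0.25)`, independent of the context and the retrieved knowledge; a trained generator would instead assign higher probability to tokens supported by them.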

5.2 Metrics and evaluation

Given a dialog $X$ and its knowledge base $KB_X$, the retrieval system needs to score each knowledge piece in $KB_X$ and rank the pieces by relevance. We use the commonly used recall metrics to assess the retrieval system. To compute recall@k, we check whether the ground-truth knowledge piece is among the top-k retrieved knowledge pieces. To comprehensively evaluate the retrieval quality of the system, we take the sum of the recall values for $k = 1, 5, 20$ as the final score: $score_{retriever} = recall@1 + recall@5 + recall@20$.
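
As a sketch of how this score could be computed, the Python snippet below evaluates recall@k over a set of turns and sums the three values. It assumes, for simplicity, exactly one ground-truth knowledge piece per turn and a ranked list of piece identifiers per turn; the function names are illustrative and do not come from the official evaluation script.

```python
def recall_at_k(ranked_lists, gold_ids, k):
    """Fraction of turns whose ground-truth piece is in the top-k results."""
    hits = sum(
        1 for ranked, gold in zip(ranked_lists, gold_ids) if gold in ranked[:k]
    )
    return hits / len(gold_ids)


def retriever_score(ranked_lists, gold_ids, ks=(1, 5, 20)):
    """Sum of recall@k over the cutoffs used in the challenge (1, 5, 20)."""
    return sum(recall_at_k(ranked_lists, gold_ids, k) for k in ks)
```

For two turns with rankings `["a", "b", "c"]` and `["b", "a", "c"]` and ground truth `"a"` in both, recall@1 is 0.5 and recall@5 and recall@20 are both 1.0, giving a retriever score of 2.5. Under this definition the score ranges from 0 to 3.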

To generate a suitable system response, relevant knowledge pieces are first retrieved using the retrieval system. Given the retrieved knowledge pieces, the generator then generates a response based on the retrieved knowledge. The generated response is evaluated by measuring its similarity to the ground-truth response (BLEU and BERTScore) and by whether the system correctly provides the information requested by the user (Inform Rate). BLEU is used to measure the fluency of the generated responses by analyzing the amount of n-gram overlap between the real responses and the generated responses. BERTScore (Zhang et al., 2019) is used to measure the semantic similarity between the generated responses and the oracle responses by using a pretrained BERT model. Inform Rate refers to how often the system response covers the information requested by the user. The final score of the generator is computed as $score_{generator} = 0.5 \times (BLEU + BERTScore) + Inform$.
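
The combination rule is simple enough to state as a one-line function. The sketch below assumes the three metrics have already been computed and are reported on the same scale (for instance, all in [0, 1]); the text does not specify the scale, so this is an assumption, and the function name is illustrative.

```python
def generator_score(bleu, bert_score, inform_rate):
    """Final generator score: 0.5 * (BLEU + BERTScore) + Inform.

    All three inputs are assumed to share the same scale.
    """
    return 0.5 * (bleu + bert_score) + inform_rate
```

Under this formula Inform Rate carries as much weight as the two similarity metrics combined, so a system that covers the requested information is favored over one that only matches the surface form of the reference response.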

6 Challenge Rules

  • The challenge website is http://futuredial.org/. Teams should submit the registration form to FutureDialRAG@gmail.com, which will be reviewed by the organizers.

  • Teams are required to sign an Agreement for Challenge Participation and Data Usage. Data will be provided to approved teams.

  • For teams that participate in Track 1, the scores will be ranked according to the performance for Track 1. The teams can choose to participate only in Track 1.

  • For teams that participate in Track 2, they can use the baseline system provided by the organizers or use the system developed by themselves for Track 1. The ranking is based on the performance for Track 2.

  • Participants need to strictly follow the Submission Guidelines as described below. Participants are allowed to use any external (publicly available) or internal (proprietary) datasets, resources and pre-trained models.

  • The evaluation data will not be released to the teams for their own evaluation. The organizers will run the submitted systems for evaluation. The evaluation data will be shared with the eligible teams after evaluation results are announced. Only teams who strictly follow the Submission Guidelines are viewed as eligible.

  • In publishing the results, all teams will be identified by team IDs (e.g. team1, team2, etc). The organizers will verbally indicate the identities of all teams at the Workshop when communicating results. Participants may identify their own team label (e.g. team5) and report their own results in publications or presentations, if they desire.

7 Submission Guidelines

  • Each team needs to submit a package via email to FutureDialRAG@gmail.com before the Entry Submission Deadline. The package should contain a clear README documentation for running the system over the evaluation data. The submitted system should be in one of the following two forms. In either form, the system’s processing speed should be no less than 10 tokens per second.

    • The submission package contains the system executable with the model, for example, in a Docker image. All dependencies are contained in the submission package. The organizers run the system over a server with Nvidia A100*4 hardware, evaluate, and calculate the running time over the evaluation data.

    • The system is encapsulated as a callable web service. The organizers will run the script submitted by the team, call the web service to evaluate, and calculate the running time over the evaluation data.

  • The submission should provide a System Description Document (SDD), introducing the submitted system. Teams are also encouraged to submit papers to SLT 2024. See important dates and instructions at SLT 2024 website https://2024.ieeeslt.org/.

  • Before the Entry Submission Deadline, each team can submit multiple times for each track. The last entry from each team will be used for the evaluation.

8 Important Dates

  • April 9, 2024: Registration opening for the challenge

  • April 29, 2024: Training data release

  • June 10, 2024: Entry submission deadline

  • June 20, 2024: Evaluation results announced

  • June 20, 2024: SLT paper submission deadline

  • June 27, 2024: SLT paper update deadline

  • August 30, 2024: Notification of paper acceptance

  • December 2-5, 2024: SLT 2024 Workshop Date (in-person)

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Alkaissi & McFarlane (2023) Hussam Alkaissi and Samy I McFarlane. Artificial hallucinations in ChatGPT: Implications in scientific writing, 2023.
  • Budzianowski et al. (2018) Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. Multiwoz–a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278, 2018.
  • Cai et al. (2022) Yucheng Cai, Hong Liu, Zhijian Ou, Yi Huang, and Junlan Feng. Advancing semi-supervised task oriented dialog systems by JSA learning of discrete latent variable models. In Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp.  456–467, 2022.
  • Cai et al. (2023) Yucheng Cai, Hong Liu, Zhijian Ou, Yi Huang, and Junlan Feng. Knowledge-retrieval task-oriented dialog systems with semi-supervision. In INTERSPEECH, 2023.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proc. of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.
  • Glass et al. (2022) Michael Glass, Gaetano Rossiello, Md Faisal Mahbub Chowdhury, Ankita Naik, Pengshan Cai, and Alfio Gliozzo. Re2g: Retrieve, rerank, generate. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp.  2701–2715, 2022.
  • Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: retrieval-augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, pp.  3929–3938, 2020.
  • Humeau et al. (2020) Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. In International Conference on Learning Representations, 2020.
  • Izacard & Grave (2021) Gautier Izacard and Édouard Grave. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp.  874–880, 2021.
  • Izacard et al. (2022a) Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research, 2022a.
  • Izacard et al. (2022b) Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with retrieval augmented language models. arXiv e-prints, pp.  arXiv–2208, 2022b.
  • Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.  6769–6781, 2020.
  • Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
  • Liu et al. (2022a) Hong Liu, Yucheng Cai, Zhijian Ou, Yi Huang, and Junlan Feng. Building Markovian generative architectures over pretrained LM backbones for efficient task-oriented dialog systems. In IEEE Spoken Language Technology Workshop, 2022a.
  • Liu et al. (2022b) Hong Liu, Hao Peng, Zhijian Ou, Juanzi Li, Yi Huang, and Junlan Feng. Information extraction and human-robot dialogue towards real-life tasks: A baseline study with the mobilecs dataset. In EMNLP 2022 SereTOD Workshop, 2022b.
  • Liu et al. (2023) Hong Liu, Yucheng Cai, Zhenru Lin, Zhijian Ou, Yi Huang, and Junlan Feng. Variational latent-state GPT for semi-supervised task-oriented dialog systems. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  • Ou et al. (2022a) Zhijian Ou, Junlan Feng, and Juanzi Li. Proceedings of the towards semi-supervised and reinforced task-oriented dialog systems (seretod). In Proceedings of the Towards Semi-Supervised and Reinforced Task-Oriented Dialog Systems (SereTOD), 2022a.
  • Ou et al. (2022b) Zhijian Ou, Junlan Feng, Juanzi Li, Yakun Li, Hong Liu, Hao Peng, Yi Huang, and Jiangjiang Zhao. A challenge on semi-supervised and reinforced task-oriented dialog systems. arXiv preprint arXiv:2207.02657, 2022b.
  • Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
  • Schick et al. (2023) Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • Shuster et al. (2022a) Kurt Shuster, Mojtaba Komeili, Leonard Adolphs, Stephen Roller, Arthur Szlam, and Jason Weston. Language models that seek for knowledge: Modular search & generation for dialogue and prompt completion. arXiv preprint arXiv:2203.13224, 2022a.
  • Shuster et al. (2022b) Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Eric Michael Smith, Stephen Roller, Megan Ung, Moya Chen, Kushal Arora, Joshua Lane, et al. Blenderbot 3: a deployed conversational agent that continually learns to responsibly engage. arXiv preprint arXiv:2208.03188, 2022b.
  • Yao et al. (2023) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.
  • Zhang et al. (2019) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations, 2019.
  • Zhang et al. (2020) Yichi Zhang, Zhijian Ou, Min Hu, and Junlan Feng. A probabilistic end-to-end task-oriented dialog model with latent belief states towards semi-supervised learning. In Empirical Methods in Natural Language Processing (EMNLP), 2020.

Appendix A Appendix

Table 2: Chinese description for Api_query annotation.
主类 | Api_query | 解释
QA类 | [QA] | 查询FAQ手册,包含一些常用问题,如最近优惠的套餐、普遍的业务规则等。
置空类 | - | 根据上下文信息,客服人员无需进行额外的查询便能顺利的完成对话。
API-查询类 | 查询特定业务信息 | 查询移动当前有的业务信息,如特定的套餐、流量包等。
API-查询类 | 查询用户已办理的业务 | 查询用户当前已经拥有的业务,包括当前套餐、当前月租、当前流量等。
API-查询类 | 查询其他信息(例如:查询流量短信) | 查询其他用于完成对话的关键信息。比如:查询历史轨迹中移动10086给用户发送的超出流量提醒短信、查询营业厅地址等
API-取消类 | 取消 | 取消用户当前拥有的某个业务
API-办理类 | 办理 | 为用户办理某个新的业务
API-验证类 | 验证 | 向用户发送验证码、密码等相关的客服验证操作
Figure 3: An example of annotated Chinese dialogs.