The 2nd FutureDial Challenge: Dialog Systems with Retrieval Augmented Generation (FutureDial-RAG)
1 Overview
Developing intelligent dialog systems has been one of the longest-running goals in AI. In recent years, significant progress has been made in building dialog systems, driven by breakthroughs in deep learning methods and the large amount of conversational data made available for system development (Budzianowski et al., 2018; Ou et al., 2022a; Ouyang et al., 2022; Achiam et al., 2023).
Many challenges remain in building future dialog systems. The first FutureDial challenge focused on building semi-supervised and reinforced task-oriented dialog systems (FutureDial-SereTOD) (Ou et al., 2022a; b), and was successfully held at the EMNLP 2022 SereTOD workshop (http://seretod.org/). ChatGPT (Ouyang et al., 2022), a generative dialog system that emerged at the end of 2022, marked another remarkable advance in engaging users in open-domain dialogs. However, problems such as hallucination and fabrication (Alkaissi & McFarlane, 2023) still hinder the use of such systems in real-life applications like customer service, which require pinpoint accuracy. Retrieval augmented generation (RAG) (Lewis et al., 2020; Guu et al., 2020) has been introduced to enhance dialog systems with information retrieved from external knowledge bases and has attracted increasing interest. RAG has been shown to help dialog systems reply with higher accuracy and factuality, providing more informative and grounded responses (Humeau et al., 2020; Izacard & Grave, 2021; Shuster et al., 2022b; Glass et al., 2022; Izacard et al., 2022a; b; Shuster et al., 2022a; Cai et al., 2023). However, challenges remain for RAG-based dialog systems, such as designing retrievers that can retrieve knowledge from multiple knowledge sources and building RAG-based dialog systems that can effectively utilize available tools and API calls for retrieval (Schick et al., 2023; Yao et al., 2023).
To further promote the study of how to empower dialog systems with RAG, we release a new dataset, called MobileCS2 (Mobile Customer Service), which aims to benchmark and stimulate related research in this area. The dataset originates from real-life customer-service logs from China Mobile. Relevant knowledge bases and ground-truth retrieval results are annotated so that dialog systems with RAG can be trained over such data. Note that successfully fulfilling customer-service requests usually requires calling specialized domain knowledge and/or APIs to retrieve relevant information, which makes this dataset well suited to benchmarking dialog systems with RAG. MobileCS2 contains multiple types of knowledge bases, such as user profiles, product information, and an FAQ manual, which pose challenges for the retrieval task in RAG. Moreover, the dataset contains around 3,000 sessions of unlabeled dialogs along with the same number of sessions of labeled dialogs, which facilitates the study of semi-supervised RAG-based dialog systems (Zhang et al., 2020; Liu et al., 2022a; Cai et al., 2022; Liu et al., 2023; Cai et al., 2023).
Following the success of the 1st FutureDial challenge, the 2nd FutureDial challenge (http://futuredial.org/), co-located with SLT 2024, aims to benchmark and stimulate research in building dialog systems with RAG, with the newly released dialog dataset, MobileCS2, as overviewed in Figure 1. We aim to create a forum to discuss key challenges in the field and share findings from real-world applications.
2 Topics of interest for the challenge
Topics of interest that are relevant to the challenge include, but are not limited to, the following:
- Retrieval augmented dialog systems
- Information retrieval
- Grounded dialog systems with unstructured knowledge sources
- Large language model based dialog systems
- Holistic AI technologies and applications
- Semi-supervised dialog systems
- Reinforced dialog systems
- Evaluation of dialog systems
- Dialog-related datasets and language resources
- General topics for dialog systems
3 Shared task
The 1st FutureDial challenge at EMNLP 2022 (Ou et al., 2022a; b; Liu et al., 2022b) focused on building semi-supervised and reinforced task-oriented dialog systems (FutureDial-SereTOD), and released a large-scale human-human dialog dataset MobileCS1 (Mobile Customer Service).
The 2nd FutureDial challenge focuses on building dialog systems with RAG, with the following features:
- We release a new dataset from the China Mobile customer-service logs (MobileCS2) that contains both labeled and unlabeled data, which encourages the study of semi-supervised RAG-based dialog systems.
- The dataset enables the study of building dialog systems with knowledge base queries and API calls.
- The dataset is available in both Chinese and English versions to the public, so that researchers around the world can experiment with this dataset.
To enable a RAG-based dialog system to provide appropriate answers and services to users, it is essential for the system to utilize knowledge relevant to the conversation context. Therefore, the 2nd challenge examines how dialog systems can retrieve the most appropriate knowledge pieces from the knowledge base and generate grounded and faithful responses to user requests, using the newly released knowledge-grounded dialog dataset, MobileCS2. The information needed should be retrieved from a given database or API call, which returns specific feedback closely related to real customer service scenarios, such as bill inquiry and package change. Accordingly, the following two tracks are proposed, which address information retrieval over dialog data and the construction of RAG-based dialog systems in the customer service scenario, respectively:
- Track 1: Information retrieval based on knowledge bases and dialog context
- Track 2: Dialog systems with retrieval augmented generation
Given the context in a dialog, the most relevant knowledge snippet in the multi-source databases should be retrieved by a retrieval model. Track 1 therefore aims to build the retrieval model for the dialog system. Based on the retrieved knowledge, Track 2 aims to build a retrieval-augmented dialog system in the customer service scenario. The system should generate informative responses that leverage the retrieved results. Offline corpus-based evaluation will be conducted to test the performance of the submitted systems.
4 The MobileCS2 Dataset
The MobileCS2 dataset is derived from real-world China Mobile conversational scenarios and comprises around 6,000 processed dialog logs (nearly 3,000 carefully annotated) between customers and customer service staff. It can serve research aims such as the development of conversational models, colloquial human-to-human dialog systems, and data-driven systematic dialog analysis.
Table 1: The set of intents (annotated as Api_query) and their explanations.
Main_class | Api_query | Description |
---|---|---|
QA | [QA] | Consult the FAQ manual, which includes a collection of commonly asked questions such as recent promotional packages and general business regulations. |
NULL | - | Based on the contextual information, customer service personnel can successfully complete the conversation without the need for additional inquiries. |
API-Inquiry | Search for products information | Inquire about the current business information of China Mobile, such as specific packages, data plans, etc. |
API-Inquiry | Search for user information | Inquire about the services that the user currently possesses, including the current package, current monthly fee, and current data usage. |
API-Inquiry | Search for other information | Inquire about other key information used to complete the dialog. For example, inquiring about text messages regarding excessive data usage alerts sent by China Mobile 10086 in the historical trajectory, querying the address of the business hall, etc. |
API-Cancel | Cancel business | Revoke a certain service currently possessed by the user. |
API-Handle | Handle business | Process a new service for the user. |
API-Verification | Verify identity | Send verification codes, passwords, or other related customer service verification operations to the user. |
4.1 Annotation Details
In the customer service scenario, there is knowledge or information that the customer service agent needs to obtain from knowledge bases (KBs) in order to respond correctly to the user. Therefore, to annotate the necessary knowledge or information, the annotators should imagine themselves as customer service agents. When presented with a dialog, annotators are required to identify the agent’s intent at each turn. If the intent is to query the KBs for external information and the response contains specific details, the annotator should perform a retrospective analysis based on the information provided in the response and annotate the corresponding query result. Specifically, Table 1 contains the set of intents (annotated as Api_query) and the explanations of each intent, which are provided to annotators for reference during the annotation process. For example, given the user utterance “Help me check my package”, the annotator needs to identify the intent “Search for user information” and then annotate the package that appears in the customer service agent’s response as the query result.
We recruited 6 China Mobile customer service staff members for the annotation, divided into 2 teams. The annotation is conducted dialog by dialog: the labeling task for each dialog is assigned to an arbitrary annotator, and the whole annotation process takes about a week. To ensure the quality of the dataset, cross-validation is conducted between the 2 teams, with 100 annotated dialogs checked by the other team every day. The cross-validation agreement rate is 97 percent, which indicates that the dataset is of high quality. After annotation, the dataset is desensitized to remove sensitive personal information such as individual names, IDs, and phone numbers.
In the final dataset, each sample represents a dialog. At each turn of the dialog, two types of information are annotated: the customer service intent and the customer service query result. An example of the annotated dialog data is shown in Figure 2.
4.2 Post-processing
Based on the annotation data, it is possible to aggregate the information in the dataset and simulate the information that the agents can access in real-world services. For turns annotated with the inquiry [QA], the information can be aggregated into an FAQ (Frequently Asked Questions) handbook across the entire dataset. Turns labeled as “Search for user information” can be consolidated into a user database (local_kb) within a single dialog. Meanwhile, turns labeled as “Search for products information” can be aggregated into a product database (global_kb) across the entire dataset. These three databases largely emulate the channels through which the agents acquire knowledge in real-world settings.
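The aggregation described above can be sketched as follows. This is a minimal illustration and not the organizers' processing script: the field names `id`, `turns`, `api_query`, and `api_result` are hypothetical, and the released data may use a different schema.

```python
def build_knowledge_bases(dialogs):
    """Aggregate annotated turns into an FAQ list, a global product KB,
    and one local user KB per dialog (field names are hypothetical)."""
    faq = []         # [QA] results, shared across the whole dataset
    global_kb = []   # product information, shared across the whole dataset
    local_kbs = {}   # user information, kept separately for each dialog
    for dialog in dialogs:
        local_kb = []
        for turn in dialog["turns"]:
            query = turn.get("api_query")
            result = turn.get("api_result")
            if not result:
                continue  # turns without a query result add no knowledge
            if query == "[QA]" and result not in faq:
                faq.append(result)
            elif query == "Search for products information" and result not in global_kb:
                global_kb.append(result)
            elif query == "Search for user information" and result not in local_kb:
                local_kb.append(result)
        local_kbs[dialog["id"]] = local_kb
    return faq, global_kb, local_kbs
```

Keeping `local_kbs` keyed by dialog reflects the statement that user information is consolidated within a single dialog, while the FAQ and product lists are shared across dialogs.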
5 The Baseline System and metrics
5.1 The Baseline System
We use RAG-based (Lewis et al., 2020; Cai et al., 2023) methods to build our baseline system. RAG-based dialog systems aim to retrieve relevant knowledge pieces given the dialog context and generate the system response using the retrieved knowledge. For MobileCS2, we take into consideration several important settings, such as adding the unique user profile of each user to the knowledge base and allowing multiple knowledge pieces to be relevant for a given context. RAG over MobileCS2 targets real-life scenarios, which distinguishes it from prior work on knowledge-grounded dialog systems (Lewis et al., 2020; Cai et al., 2023).
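To introduce the RAG-based baseline system on MobileCS2, we make the following definitions. Assume we have a dialog $X$ with $T$ turns of user utterances and system responses, denoted by $u_1, r_1, \cdots, u_T, r_T$ respectively. For each dialog, we assume that there is a knowledge base that is necessary for the system to respond correctly. In MobileCS2, the knowledge base is made up of the user information, which is unique for each dialog, the product information list, and the FAQ list for commonly asked questions. Therefore, for the dialog $X$, the knowledge base $KB_X$ can be denoted as: $KB_X \triangleq KB_{user} \cup KB_{product} \cup KB_{FAQ}$.
At turn $t$ of a dialog $X$, based on the dialog context $c_t \triangleq u_1 \oplus r_1 \oplus \cdots \oplus u_{t-1} \oplus r_{t-1} \oplus u_t$ ($\oplus$ means sequence concatenation) and the knowledge base $KB_X$, the system uses a retriever $p_\eta(h_t \mid c_t, KB_X)$ to get the relevant knowledge $h_t$ from the knowledge base and generates appropriate responses with the generator $p_\theta(r_t \mid c_t, h_t)$.
To train the retrieval model, we consider each knowledge piece $z_i$ ($i = 1, 2, \cdots, K$) in $KB_X$ and model the retrieval distribution $p_\eta(z_i \mid c_t)$ as in (Lewis et al., 2020):
$$p_\eta(z_i \mid c_t) \propto \exp\left(\mathrm{Encoder}_p(z_i)^\top \mathrm{Encoder}_c(c_t)\right) \tag{1}$$
$\mathrm{Encoder}_p$ and $\mathrm{Encoder}_c$ are both initialized with a BERT-based pretrained model (Devlin et al., 2019). The probability is optimized with the standard cross entropy loss, with the positive pieces $z_i \in Z_+$ labeled in the dataset:
$$\mathcal{L}_{ret} = -\frac{1}{|Z_+|} \sum_{z_i \in Z_+} \log \frac{\exp\left(\mathrm{Encoder}_p(z_i)^\top \mathrm{Encoder}_c(c_t)\right)}{\sum_{j=1}^{K} \exp\left(\mathrm{Encoder}_p(z_j)^\top \mathrm{Encoder}_c(c_t)\right)} \tag{2}$$
The knowledge piece encoder $\mathrm{Encoder}_p$ is fixed during the training, and the context encoder $\mathrm{Encoder}_c$ is trained with the loss in Eq. 2, following the setting in Karpukhin et al. (2020).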
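As a rough illustration of Eq. 1 and Eq. 2, the sketch below scores knowledge pieces by the inner product between a context embedding and each piece embedding, normalizes the scores with a softmax, and computes the cross entropy on the labeled positive pieces. The embeddings are plain lists of floats standing in for encoder outputs, and averaging the loss over the positive pieces is an assumption of this sketch rather than a confirmed detail of the baseline.

```python
import math


def retrieval_distribution(context_vec, piece_vecs):
    """Softmax over inner products between the context embedding
    and each knowledge-piece embedding."""
    scores = [sum(c * p for c, p in zip(context_vec, piece)) for piece in piece_vecs]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieval_loss(probs, positive_indices):
    """Cross entropy: average negative log-probability of the positive pieces."""
    return -sum(math.log(probs[i]) for i in positive_indices) / len(positive_indices)


# Toy embeddings (hypothetical values, not produced by a real encoder).
context = [1.0, 0.0]
pieces = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = retrieval_distribution(context, pieces)
loss = retrieval_loss(probs, positive_indices=[0])
```

In a real system the piece embeddings could be computed once and cached, since the knowledge piece encoder is kept fixed and only the context encoder receives gradients.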
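To train the dialog system $p_\theta(r_t \mid c_t, h_t)$, which generates the response $r_t$ given the dialog context $c_t$ and the retrieved knowledge $h_t$, we use the standard auto-regressive loss to optimize the generation probability:
$$p_\theta(r_t \mid c_t, h_t) = \prod_{l=1}^{|r_t|} p_\theta(y^l \mid c_t, h_t, y^1, \ldots, y^{l-1}) \tag{3}$$
where $|\cdot|$ denotes the length in tokens, $y^l$ is the $l$-th token of $r_t$, and $p_\theta$ is initialized with a GPT-based pretrained language model (Radford et al., 2019).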
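The auto-regressive factorization in Eq. 3 can be illustrated with a small sketch. Here `next_token_dist` is a hypothetical stand-in for the generator: it takes the tokens generated so far and returns a probability for every vocabulary token, with the conditioning on the dialog context and the retrieved knowledge assumed to be handled inside that function.

```python
import math


def response_log_prob(next_token_dist, response_tokens):
    """Sum of log-probabilities of each response token given its prefix."""
    log_prob = 0.0
    for position, token in enumerate(response_tokens):
        prefix = response_tokens[:position]
        dist = next_token_dist(prefix)  # maps each vocabulary token to a probability
        log_prob += math.log(dist[token])
    return log_prob


def generation_loss(next_token_dist, response_tokens):
    """Negative log-likelihood of the ground-truth response."""
    return -response_log_prob(next_token_dist, response_tokens)


# Toy stand-in for a language model: uniform over a four-word vocabulary.
VOCAB = ["your", "plan", "is", "A"]


def uniform_model(prefix):
    return {token: 1.0 / len(VOCAB) for token in VOCAB}


loss = generation_loss(uniform_model, ["your", "plan", "is", "A"])
```

With the uniform toy model every token costs the same, so the loss grows linearly with the response length; a trained generator would assign higher probability to the ground-truth tokens and thus a lower loss.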
5.2 Metrics and evaluation
Given a dialog and its knowledge base, the retrieval system needs to rank the knowledge pieces in the knowledge base by relevance score. We use the commonly used recall metrics to assess the retrieval system. To compute recall@k, we check whether the ground-truth knowledge piece is among the top-k retrieved knowledge pieces. To comprehensively evaluate the retrieval quality of the system, the final score is the sum of recall@k over several values of k.
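A minimal sketch of the recall@k computation is given below. It assumes that a turn counts as a hit when any ground-truth piece appears in the top-k list, and the cutoffs passed as `ks` in the example are purely illustrative, not the cutoffs used by the challenge.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """1.0 if any ground-truth piece is among the top-k ranked pieces, else 0.0."""
    return float(any(piece_id in gold_ids for piece_id in ranked_ids[:k]))


def mean_recall_at_k(examples, k):
    """Average recall@k over (ranked_ids, gold_ids) pairs."""
    return sum(recall_at_k(ranked, gold, k) for ranked, gold in examples) / len(examples)


def retrieval_score(examples, ks):
    """Sum of mean recall@k over the chosen cutoffs."""
    return sum(mean_recall_at_k(examples, k) for k in ks)


# Two toy turns: ranked piece ids and the set of ground-truth ids.
examples = [
    (["a", "b", "c"], {"b"}),
    (["x", "y", "z"], {"x"}),
]
score = retrieval_score(examples, ks=(1, 2))  # illustrative cutoffs only
```

Summing over several cutoffs rewards systems that place the correct piece at the very top while still giving partial credit when it appears slightly lower in the ranking.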
To generate a suitable system response, relevant knowledge pieces are first retrieved using the retrieval system. The generator then produces a response based on the retrieved knowledge. The generated response is evaluated by its similarity to the ground-truth response (BLEU and BERTScore) and by whether the system correctly provides the information requested by the user (Inform Rate). BLEU measures the fluency of the generated responses through the amount of n-gram overlap between the ground-truth responses and the generated responses. BERTScore (Zhang et al., 2019) measures the semantic similarity between the generated responses and the ground-truth responses using a pretrained BERT model. Inform Rate measures how often the system response covers the information requested by the user. The final score of the generator is computed from these three metrics.
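To make the n-gram overlap underlying BLEU concrete, the sketch below computes the clipped n-gram precision of a generated response against a single reference. This is only one ingredient of BLEU (the full metric combines several n-gram orders and a brevity penalty), and whitespace tokenization is an assumption made for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference,
    with each n-gram counted at most as often as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / total


generated = "your current package is plan A".split()
reference = "your package is plan A".split()
unigram_precision = clipped_ngram_precision(generated, reference, 1)
bigram_precision = clipped_ngram_precision(generated, reference, 2)
```

In practice an established BLEU implementation should be preferred over a hand-written one, so that scores remain comparable across systems.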
6 Challenge Rules
- The challenge website is http://futuredial.org/. Teams should submit the registration form to FutureDialRAG@gmail.com, which will be reviewed by the organizers.
- Teams are required to sign an Agreement for Challenge Participation and Data Usage. Data will be provided to approved teams.
- For teams that participate in Track 1, the scores will be ranked according to the performance for Track 1. The teams can choose to participate only in Track 1.
- For teams that participate in Track 2, they can use the baseline system provided by the organizers or use the system developed by themselves for Track 1. The ranking is based on the performance for Track 2.
- Participants need to strictly follow the Submission Guidelines as described below. Participants are allowed to use any external (publicly available) or internal (proprietary) datasets, resources and pre-trained models.
- The evaluation data will not be released to the teams for their own evaluation. The organizers will run the submitted systems for evaluation. The evaluation data will be shared with the eligible teams after evaluation results are announced. Only teams who strictly follow the Submission Guidelines are viewed as eligible.
- In publishing the results, all teams will be identified by team IDs (e.g. team1, team2, etc). The organizers will verbally indicate the identities of all teams at the Workshop for communicating results. Participants may identify their own team label (e.g. team5) and report their own result, in publications or presentations, if they desire.
7 Submission Guidelines
- Each team needs to submit a package via email to FutureDialRAG@gmail.com before the Entry Submission Deadline. The package should contain clear README documentation for running the system over the evaluation data. The submitted system should be in one of the following two forms. In either form, the system’s processing speed should be no less than 10 tokens per second.
  - The submission package contains the system executable with the model, for example, in a Docker image. All dependencies are contained in the submission package. The organizers run the system on a server with 4 Nvidia A100 GPUs, evaluate it, and calculate the running time over the evaluation data.
  - The system is encapsulated as a callable web service. The organizers will run the script submitted by the team, call the web service to evaluate, and calculate the running time over the evaluation data.
- The submission should provide a System Description Document (SDD), introducing the submitted system. Teams are also encouraged to submit papers to SLT 2024. See important dates and instructions at the SLT 2024 website https://2024.ieeeslt.org/.
- Before the Entry Submission Deadline, each team can submit multiple times for each track. The last entry from each team will be used for the evaluation.
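Teams may want to check the processing-speed requirement locally before submitting. The sketch below is one possible self-check under stated assumptions: `generate` is a hypothetical function that returns the list of generated tokens for one input, and speed is taken as generated tokens divided by wall-clock time. The organizers' actual measurement procedure may differ.

```python
import time

MIN_TOKENS_PER_SECOND = 10  # threshold stated in the guidelines above


def measure_tokens_per_second(generate, inputs, clock=time.perf_counter):
    """Run `generate` on every input and return generated tokens per second."""
    start = clock()
    total_tokens = sum(len(generate(x)) for x in inputs)
    elapsed = clock() - start
    return total_tokens / elapsed if elapsed > 0 else float("inf")


def meets_speed_requirement(tokens_per_second, minimum=MIN_TOKENS_PER_SECOND):
    """True if the measured throughput reaches the required minimum."""
    return tokens_per_second >= minimum
```

The `clock` parameter is only there so that the timing can be replaced with a deterministic clock when testing the helper itself.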
8 Important Dates
- April 9, 2024: Registration opening for the challenge
- April 29, 2024: Training data release
- June 10, 2024: Entry submission deadline
- June 20, 2024: Evaluation results announced
- June 20, 2024: SLT paper submission deadline
- June 27, 2024: SLT paper update deadline
- August 30, 2024: Notification of paper acceptance
- December 2-5, 2024: SLT 2024 Workshop Date (in-person)
References
- Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Alkaissi & McFarlane (2023) Hussam Alkaissi and Samy I McFarlane. Artificial hallucinations in ChatGPT: Implications in scientific writing, 2023.
- Budzianowski et al. (2018) Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. Multiwoz–a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278, 2018.
- Cai et al. (2022) Yucheng Cai, Hong Liu, Zhijian Ou, Yi Huang, and Junlan Feng. Advancing semi-supervised task oriented dialog systems by JSA learning of discrete latent variable models. In Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 456–467, 2022.
- Cai et al. (2023) Yucheng Cai, Hong Liu, Zhijian Ou, Yi Huang, and Junlan Feng. Knowledge-retrieval task-oriented dialog systems with semi-supervision. In INTERSPEECH, 2023.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proc. of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.
- Glass et al. (2022) Michael Glass, Gaetano Rossiello, Md Faisal Mahbub Chowdhury, Ankita Naik, Pengshan Cai, and Alfio Gliozzo. Re2g: Retrieve, rerank, generate. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2701–2715, 2022.
- Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: retrieval-augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, pp. 3929–3938, 2020.
- Humeau et al. (2020) Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. In International Conference on Learning Representations, 2020.
- Izacard & Grave (2021) Gautier Izacard and Édouard Grave. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 874–880, 2021.
- Izacard et al. (2022a) Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research, 2022a.
- Izacard et al. (2022b) Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Few-shot learning with retrieval augmented language models. arXiv e-prints, pp. arXiv–2208, 2022b.
- Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781, 2020.
- Lewis et al. (2020) Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
- Liu et al. (2022a) Hong Liu, Yucheng Cai, Zhijian Ou, Yi Huang, and Junlan Feng. Building Markovian generative architectures over pretrained LM backbones for efficient task-oriented dialog systems. In IEEE Spoken Language Technology Workshop, 2022a.
- Liu et al. (2022b) Hong Liu, Hao Peng, Zhijian Ou, Juanzi Li, Yi Huang, and Junlan Feng. Information extraction and human-robot dialogue towards real-life tasks: A baseline study with the mobilecs dataset. In EMNLP 2022 SereTOD Workshop, 2022b.
- Liu et al. (2023) Hong Liu, Yucheng Cai, Zhenru Lin, Zhijian Ou, Yi Huang, and Junlan Feng. Variational latent-state GPT for semi-supervised task-oriented dialog systems. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
- Ou et al. (2022a) Zhijian Ou, Junlan Feng, and Juanzi Li. Proceedings of the towards semi-supervised and reinforced task-oriented dialog systems (seretod). In Proceedings of the Towards Semi-Supervised and Reinforced Task-Oriented Dialog Systems (SereTOD), 2022a.
- Ou et al. (2022b) Zhijian Ou, Junlan Feng, Juanzi Li, Yakun Li, Hong Liu, Hao Peng, Yi Huang, and Jiangjiang Zhao. A challenge on semi-supervised and reinforced task-oriented dialog systems. arXiv preprint arXiv:2207.02657, 2022b.
- Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
- Schick et al. (2023) Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
- Shuster et al. (2022a) Kurt Shuster, Mojtaba Komeili, Leonard Adolphs, Stephen Roller, Arthur Szlam, and Jason Weston. Language models that seek for knowledge: Modular search & generation for dialogue and prompt completion. arXiv preprint arXiv:2203.13224, 2022a.
- Shuster et al. (2022b) Kurt Shuster, Jing Xu, Mojtaba Komeili, Da Ju, Eric Michael Smith, Stephen Roller, Megan Ung, Moya Chen, Kushal Arora, Joshua Lane, et al. Blenderbot 3: a deployed conversational agent that continually learns to responsibly engage. arXiv preprint arXiv:2208.03188, 2022b.
- Yao et al. (2023) Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.
- Zhang et al. (2019) Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations, 2019.
- Zhang et al. (2020) Yichi Zhang, Zhijian Ou, Min Hu, and Junlan Feng. A probabilistic end-to-end task-oriented dialog model with latent belief states towards semi-supervised learning. In Empirical Methods in Natural Language Processing (EMNLP), 2020.
Appendix A Appendix
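Main class | Api_query | Explanation |
---|---|---|
QA | [QA] | Query the FAQ manual, which contains commonly asked questions such as recent promotional packages and general business rules. |
NULL | - | Based on the contextual information, the customer service agent can complete the dialog smoothly without any additional query. |
API-Inquiry | Query specific business information | Query the business information currently offered by China Mobile, such as specific packages and data plans. |
API-Inquiry | Query the services the user has subscribed to | Query the services the user currently has, including the current package, current monthly fee, and current data usage. |
API-Inquiry | Query other information (e.g., data-usage text messages) | Query other key information used to complete the dialog, for example the data-overage alert text messages sent to the user by China Mobile 10086 in the historical trajectory, or the address of a business hall. |
API-Cancel | Cancel | Cancel a service the user currently has. |
API-Handle | Handle | Handle a new service for the user. |
API-Verification | Verify | Send verification codes, passwords, or other related customer service verification operations to the user. |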