
License: CC BY 4.0
arXiv:2401.06541v1 [cs.CL] 12 Jan 2024

Medical Dialogue Generation via Intuitive-then-Analytical Differential Diagnosis

Kaishuai Xu1, Wenjun Hou1,2, Yi Cheng1, Jian Wang1, Wenjie Li1
Abstract

Medical dialogue systems have attracted growing research attention as they have the potential to provide rapid diagnoses, treatment plans, and health consultations. In medical dialogues, a proper diagnosis is crucial as it establishes the foundation for future consultations. Clinicians typically employ both intuitive and analytic reasoning to formulate a differential diagnosis. This reasoning process hypothesizes and verifies a variety of possible diseases and strives to generate a comprehensive and rigorous diagnosis. However, recent studies on medical dialogue generation have overlooked the significance of modeling a differential diagnosis, which hinders the practical application of these systems. To address the above issue, we propose a medical dialogue generation framework with the Intuitive-then-Analytic Differential Diagnosis (IADDx). Our method starts with a differential diagnosis via retrieval-based intuitive association and subsequently refines it through a graph-enhanced analytic procedure. The resulting differential diagnosis is then used to retrieve medical knowledge and guide response generation. Experimental results on two datasets validate the efficacy of our method. Besides, we demonstrate how our framework assists both clinicians and patients in understanding the diagnostic process, for instance, by producing intermediate results and graph-based diagnosis paths.

Introduction

Figure 1: An example of differential diagnosis in a medical dialogue, which contains intuitive and analytic reasoning.

Medical dialogue systems (MDS) endeavor to provide diverse medical services such as diagnosis, treatment plans, and health consultations (Yan et al. 2022; Xu et al. 2023; Chen et al. 2022b). These systems have garnered increasing research attention in recent years due to their potential to assist clinicians in diagnosing and prescribing (Shi et al. 2023; Liu et al. 2022b; Zeng et al. 2020; Zhou et al. 2021; He et al. 2022; Xu et al. 2019).

In medical dialogues, diagnosis is a crucial process as the results of diagnosis establish an essential foundation for subsequent consultations (Maynard and Heritage 2005; Silverman, Kurtz, and Draper 2016). However, previous studies on medical dialogue generation using pre-trained language models neglected to explicitly model the diagnostic process (Xu et al. 2023; Zhao et al. 2022; Li et al. 2021; Liu et al. 2021; Lin et al. 2021; Liu et al. 2022b). One significant issue with these methods is that although responses benefiting from pre-trained models may appear coherent, they usually lack an interpretation grounded in meticulous medical diagnosis. It is challenging for clinicians or patients to accept responses from MDSs without a clear and interpretable diagnostic basis (Kurtz, Draper, and Silverman 2017).

In practice, clinicians typically employ both intuitive and analytic reasoning during the dialogue to formulate a differential diagnosis, i.e., a set of potential diseases guiding how the subsequent dialogue unfolds (Croskerry 2009; Tchango et al. 2022; Silverman, Kurtz, and Draper 2016). Intuitive reasoning forms a rough disease list through a quick review of extensive clinical experience, while analytic reasoning cautiously verifies some diseases via a systematic analysis of body systems, organs, and symptoms. As the example in Figure 1 shows, if a patient is diagnosed with a high possibility of gastritis but may still have chronic cholecystitis, the physician will first ask if any tests have been done to rule out chronic cholecystitis and then inquire about more gastritis-related symptoms to prescribe medications. Prior studies often overlook the importance of differential diagnosis (Li et al. 2021; Zhao et al. 2022; Xu et al. 2019; Wei et al. 2018; Kao, Tang, and Chang 2018). In our work, we argue that modeling a differential diagnosis with intuitive and analytic reasoning is crucial, and that generating responses conditioned on potential diseases can improve the reliability and accuracy of medical dialogue generation.

To address the above issues, we propose a medical dialogue generation framework with Intuitive-then-Analytic Differential Diagnosis (IADDx), which first produces a differential diagnosis and then utilizes potential diseases to guide response generation. For differential diagnosis, we draw inspiration from the diagnostic reasoning research (Croskerry 2009) and design a two-stage (i.e., intuitive-then-analytical) differential diagnosis method. In the intuitive stage, we extract patients’ conditions from the dialogue and use them to retrieve previous cases and disease documents that present similar situations. A preliminary list of potential diseases can be concluded from the cases and documents. In the analytical stage, we first create a diagnosis-oriented entity graph that contains body systems, organs, diseases, and symptoms. Then, we employ ConceptTransformer (Rigotti et al. 2022) to incorporate the constructed graph and build a multi-disease classifier to discriminate multiple diseases, thereby assisting in refining the preliminary list. Our analytical stage achieves a multi-disease classification and provides a faithful and plausible interpretation represented by entities on the graph. For response generation, we utilize refined potential diseases to retrieve medical knowledge and generate responses conditioned on the knowledge.

Our main contributions are summarized as follows:

  • We propose a medical dialogue generation framework, IADDx, which explicitly models a differential diagnosis with intuitive-then-analytic reasoning and incorporates diagnosis to guide response generation.

  • We build a diagnosis-oriented entity graph composed of systems, organs, diseases, and symptoms and apply the graph to enhance and interpret the diagnostic process in conversations.

  • Experimental results on two medical datasets show the effectiveness and interpretability of our IADDx.

Related Work

Medical dialogue systems (MDS) strive to offer healthcare services to patients. Initial research concentrated on automated diagnosis through task-oriented dialogue systems, emphasizing the swift identification of latent symptoms and providing a final diagnosis (Liao et al. 2020; Lin et al. 2019; Chen et al. 2022a; Liu et al. 2022a). Wei et al. (2018) introduced a dataset with symptom annotations and developed a medical dialogue system using reinforcement learning. Xu et al. (2019) integrated a medical knowledge graph into MDS to manage the order of inquired symptoms. Tchango et al. (2022) further improved system reliability by outputting a differential diagnosis, using the exploration-confirmation method, and prioritizing serious diseases. However, these systems conclude with diagnostic results without providing treatment plans or consultations.

The advent of large-scale medical dialogue datasets like MedDialog (Zeng et al. 2020), MedDG (Liu et al. 2022b), and KaMed (Li et al. 2021), together with pre-trained language models (Lewis et al. 2020; Radford et al. 2019), has amplified interest in medical dialogue generation with multiple services. Liu et al. (2022b) approached medical dialogue generation by focusing on entity prediction coupled with entity-centric response generation. Moreover, Liu et al. (2021) enhanced dialogue understanding and entity reasoning using a unified heterogeneous graph. Similarly, Zhao et al. (2022) constructed a dialogue graph to leverage medical relationships implied in the context. Li et al. (2021) treated medical entities within both patient and doctor utterances as states and actions, introducing a semi-supervised variational reasoning system complemented by a patient state tracker and a physician action network. Xu et al. (2023) proposed a dual flow (i.e., dialogue act and entity flows) modeling method to improve dialogue understanding and used acts and entities to guide response generation. Lin et al. (2021) explored transferring diagnostic experience from high-resource diseases to low-resource ones.

Although previous studies on medical dialogue generation have attempted to enhance dialogue understanding and guide response generation by incorporating predicted dialogue acts and medical entities, they ignore modeling the diagnostic process, which provides interpretation for generated responses. Besides, few works focus on differential diagnosis and apply it to instruct response generation. Our framework aims to model the process of differential diagnosis and generate responses with diagnosis hints.

Preliminary

Figure 2: Left: The architecture of Differential Diagnosis, which includes the intuitive association stage and the analytic refinement stage. The multi-disease classifier in Stage 2 generates a refined diagnosis to guide response generation. Right: The structure of Response Generation. The diagnosis, combined with the dialogue acts, is used to retrieve relevant knowledge.

Problem Formulation

A medical dialogue is denoted as $U = \{(U^{P}_{k}, U^{D}_{k})\}_{k=1}^{T}$, where utterances from patients and doctors are represented by $U^{P}$ and $U^{D}$, respectively. Each dialogue is annotated with several possible diseases $D = \{d_i\}_{i=1}^{n_d}$, and each doctor utterance is annotated with multiple dialogue acts $A = \{a_i\}_{i=1}^{n_a}$. Given the historical dialogue sequence $U_t = \{U^{P}_{1}, U^{D}_{1}, \dots, U^{P}_{t}\}$, the objective is to produce the $t$-th doctor utterance $U^{D}_{t}$.

SOAP-based Content Structuring

In each round of a medical dialogue $(U^{D}, U^{P})$, we first extract medical-related segments in accordance with the SOAP Notes (Krishna et al. 2021), i.e., Subjective personal reports $\{S^{S}_{i}\}$, Objective quantifiable data $\{S^{O}_{i}\}$, an Assessment (or diagnosis results) $\{S^{A}_{i}\}$, and subsequent Plans $\{S^{P}_{i}\}$ for patient care. SOAP notes serve as a documentation method used by medical specialists to structure patient information. For example, a subjective segment can be “vomited three times”. The extraction filters out irrelevant details, and segments from the subjective and objective sections are instrumental for diagnosis. We employ pre-trained large language models (LLMs), such as GPT-4 (OpenAI 2023), to extract segments in a few-shot manner. The input prompts for LLMs contain SOAP instructions and some examples of extracted segments. Details are described in the Appendix.

Diagnosis-Oriented Graph Construction

We construct a Diagnosis-Oriented Graph (DOG) $G = \{e_i\}$ inspired by the problem-specific framework used in differential diagnosis (Stern, Cifu, and Altkorn 2020). This framework aids in pinpointing medical issues and inferring a differential diagnosis. The graph we develop encompasses entities such as body systems ($e^{\text{Sys}}$), organs ($e^{\text{Org}}$), diseases ($e^{\text{Dis}}$), and symptoms ($e^{\text{Sym}}$). Notably, “System $\rightarrow$ Organ $\rightarrow$ Disease $\rightarrow$ Symptom” can be a diagnostic path. In detail, systems and organs serve to categorize disease types and help understand the potential impact, while symptoms provide the basis for confirming or ruling out a particular disease. Based on findings from anatomic studies (Tortora and Derrickson 2018), we incorporate major body systems, such as the digestive and endocrine systems, and associated organs like the stomach and thyroid. Besides, we select diseases and symptoms from an online medical encyclopedia website (https://www.baikemy.com/) that is edited and reviewed by medical specialists. The relations between body systems and their respective organs, as well as between diseases and symptoms, are inherent. To establish connections between diseases and the organs they affect, we utilize pre-trained LLMs to associate diseases with specific organs based on their pathological manifestations.
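To make the graph structure concrete, the following minimal sketch stores a toy graph as typed edges and enumerates the System, Organ, Disease, Symptom paths that pass through one disease. The entity names, the `EDGES` list, and the helper names `build_graph` and `diagnostic_paths` are hypothetical illustrations, not the data or code used in the paper.

```python
from collections import defaultdict

# Hypothetical toy edges (parent, child); the real graph is built from
# anatomy references, an online encyclopedia, and LLM-based linking.
EDGES = [
    ("digestive system", "stomach"),                 # system  -> organ
    ("digestive system", "gallbladder"),             # system  -> organ
    ("stomach", "gastritis"),                        # organ   -> disease
    ("gallbladder", "chronic cholecystitis"),        # organ   -> disease
    ("gastritis", "stomach ache"),                   # disease -> symptom
    ("gastritis", "nausea"),                         # disease -> symptom
    ("chronic cholecystitis", "nausea"),             # disease -> symptom
]


def build_graph(edges):
    """Return child and parent adjacency maps for the directed edges."""
    children = defaultdict(set)
    parents = defaultdict(set)
    for src, dst in edges:
        children[src].add(dst)
        parents[dst].add(src)
    return children, parents


def diagnostic_paths(disease, children, parents):
    """Enumerate (system, organ, disease, symptom) paths through one disease."""
    paths = []
    for organ in sorted(parents[disease]):
        for system in sorted(parents[organ]):
            for symptom in sorted(children[disease]):
                paths.append((system, organ, disease, symptom))
    return paths


children, parents = build_graph(EDGES)
for path in diagnostic_paths("gastritis", children, parents):
    print(" -> ".join(path))
```

Symptoms such as "nausea" may be shared by several diseases, which is what lets a symptom support or weaken more than one candidate in a differential diagnosis.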

Method

Our proposed IADDx framework comprises two main components, as illustrated in Figure 2. The Intuitive-then-Analytic Differential Diagnosis component makes a differential diagnosis through intuitive and analytic reasoning. Subsequently, the Diagnosis-guided Response Generation component utilizes diagnosis results to guide knowledge retrieval and generate an appropriate response with the retrieved knowledge.

Intuitive-then-Analytic Differential Diagnosis

We model the differential diagnosis process in two stages: intuitive association and analytic refinement. The first stage draws upon clinical experience to make a rough diagnosis. We use patient case and disease document retrieval to generate a preliminary list of diseases. The disease document retriever assists in identifying diseases that have not been previously encountered. Then, the second stage further refines the diagnosis through a more detailed problem analysis, enhancing both diagnostic accuracy and interpretability. We construct a multi-disease classifier with the aid of our diagnosis-oriented graph for this stage.

Stage 1: Intuitive Association.

The objective of this stage is to make a rough diagnosis. Given the dialogue history $U_t$, we extract SOAP segments $\{S^{S}_{i}\}$, $\{S^{O}_{i}\}$, $\{S^{A}_{i}\}$, and $\{S^{P}_{i}\}$ using a pre-trained LLM and use them to retrieve knowledge for disease list generation. These segments contain information (i.e., patient symptoms, signs, and medical history) needed for differential diagnosis. We concatenate the $\{S^{S}_{i}\}$, $\{S^{O}_{i}\}$, and $\{S^{A}_{i}\}$ segments as the query for subsequent retrieval. Here, the plan segments are not considered since their content is presented after diagnosis. We use the aforementioned query to retrieve two types of knowledge: (1) patient cases (in the format of SOAP) from the training corpus exhibiting conditions similar to those of the current patient and (2) disease documents (i.e., etiology and symptoms) from a disease corpus with descriptions that align with the current patient’s conditions. The query and patient case (or disease document) sequences with a “[CLS]” token inserted at the front are separately input to a BERT encoder (Devlin et al. 2019). We select hidden states of “[CLS]” tokens for each sequence as their representations $\mathbf{S} \in \mathbb{R}^{d}$ and $\mathbf{S}^{\text{Case}} \in \mathbb{R}^{d}$ (or $\mathbf{S}^{\text{Doc}} \in \mathbb{R}^{d}$) and calculate relevance scores as:

$$s^{\text{Case}} = \langle \mathbf{S}, \mathbf{S}^{\text{Case}} \rangle, \quad s^{\text{Doc}} = \langle \mathbf{S}, \mathbf{S}^{\text{Doc}} \rangle, \tag{1}$$

where $\langle \cdot, \cdot \rangle$ represents a similarity function. After retrieving patient cases, each case is assigned a relevance score, which is then applied to all diseases associated with that case. Thus, each disease can be assigned a group of scores $\{s^{\text{Case}}_{j}\}_{j \in \mathcal{C}_i}$ from different patient cases, where $\mathcal{C}_i$ denotes cases diagnosed with disease $i$. We select the maximum as the relevance score for each disease $s^{\text{Case}*}_{i}$. After retrieving disease documents, each disease can also be directly assigned a relevance score $s^{\text{Doc}}_{i}$. We average these two scores as the final disease relevance score, $s_i = (s^{\text{Case}*}_{i} + s^{\text{Doc}}_{i})/2$, and select top-$K$ diseases to generate a preliminary disease list $\{d_i\}_{i=1}^{K}$.
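The score fusion described above can be sketched in a few lines: take the maximum case score per disease, average it with the document score, and keep the top-K diseases. The function name `rank_diseases`, the dictionary layout, and the example scores are hypothetical. The text does not say how a disease with only one of the two scores is handled, so this sketch simply skips such diseases, which is an assumption.

```python
def rank_diseases(case_scores, case_diseases, doc_scores, top_k):
    """Fuse case-based and document-based relevance into a top-K disease list.

    case_scores:   {case_id: relevance score of the retrieved case}
    case_diseases: {case_id: [diseases diagnosed in that case]}
    doc_scores:    {disease: relevance score of its disease document}
    """
    # s_i^{Case*}: maximum score over all retrieved cases diagnosed with disease i.
    best_case = {}
    for case_id, score in case_scores.items():
        for disease in case_diseases[case_id]:
            best_case[disease] = max(score, best_case.get(disease, float("-inf")))

    # s_i = (s_i^{Case*} + s_i^{Doc}) / 2.
    # Assumption: diseases missing either score are skipped.
    fused = {
        d: (best_case[d] + doc_scores[d]) / 2
        for d in best_case
        if d in doc_scores
    }

    # Sort by descending score; break ties by name for determinism.
    ranked = sorted(fused, key=lambda d: (-fused[d], d))
    return ranked[:top_k], fused


case_scores = {"c1": 0.9, "c2": 0.4, "c3": 0.7}
case_diseases = {
    "c1": ["gastritis"],
    "c2": ["gastritis", "enteritis"],
    "c3": ["chronic cholecystitis"],
}
doc_scores = {"gastritis": 0.5, "enteritis": 0.2, "chronic cholecystitis": 0.6}

preliminary, fused = rank_diseases(case_scores, case_diseases, doc_scores, top_k=2)
print(preliminary)
```

Taking the maximum rather than the mean over cases means a single closely matching case is enough to surface a disease, which fits the "quick recall of experience" reading of the intuitive stage.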

We adopt Contrastive Learning (Chen et al. 2020) to optimize these two retrievers. The loss function for the patient case retriever is defined as:

$$\mathcal{L}_{\text{Case}} = -\log \frac{\exp(\langle \mathbf{S}, \mathbf{S}^{\text{Case}+}_{t} \rangle)}{\sum_{\mathbf{S}^{\text{Case}-}_{t} \in \mathcal{B}} \exp(\langle \mathbf{S}, \mathbf{S}^{\text{Case}-}_{t} \rangle)}, \tag{2}$$

where $\mathbf{S}^{\text{Case}+}_{t}$ denotes representations from positive cases that share at least one diagnosed disease with the current dialogue, and $\mathbf{S}^{\text{Case}-}_{t}$ denotes those from negative cases that do not coincide with diseases discussed in the current dialogue. We use negative cases from input batches $\mathcal{B}$ for training. The loss function $\mathcal{L}_{\text{Doc}}$ for the disease document retriever is defined in the same way.
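As an illustration of Equation (2), the sketch below computes the loss for one query with one positive case and a set of in-batch negatives. It assumes the similarity function is an inner product, which the text leaves unspecified, and it follows the equation as written, with only negatives in the denominator (many contrastive-learning implementations also include the positive there). The names `dot` and `case_contrastive_loss` and the toy vectors are hypothetical.

```python
import math


def dot(u, v):
    """Inner product, used here as the (assumed) similarity function."""
    return sum(a * b for a, b in zip(u, v))


def case_contrastive_loss(query, positive, negatives):
    """Loss of Eq. (2): -log( exp(<S, S+>) / sum_neg exp(<S, S->) )."""
    pos_score = dot(query, positive)
    neg_scores = [dot(query, n) for n in negatives]
    # Numerically stable log-sum-exp over the negative scores.
    m = max(neg_scores)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in neg_scores))
    return -(pos_score - log_denominator)


query = [1.0, 0.0]
positive = [2.0, 0.0]                    # shares a disease with the query dialogue
negatives = [[0.0, 1.0], [1.0, 0.0]]     # in-batch cases with no shared disease

print(case_contrastive_loss(query, positive, negatives))
```

Minimizing this quantity pushes the query closer to cases that share a diagnosis and away from in-batch cases that do not.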

Stage 2: Analytic Refinement.

This stage further refines the diagnosis via specific problem analysis, improving diagnostic accuracy and interpretability. We construct a multi-disease classifier inspired by the ConceptTransformer (Rigotti et al. 2022) model for this stage, which leverages domain knowledge to improve classification accuracy and provide concept-based interpretations. The diagnosis-oriented graph is employed to augment and interpret the classification.

As depicted on the left of Figure 2, the inputs of the classifier are the SOAP segments extracted in the first stage and the diagnosis-oriented graph. Specifically, SOAP segments are encoded into representations $\{\mathbf{S}_i\}_{i=1}^{n_s}$ through a BERT encoder, each obtained by averaging the hidden states of the tokens in one segment; $n_s$ is the number of distinct segments up to the current turn. We incorporate entities that are involved in the diagnostic paths of diseases identified in the preliminary diagnosis and obtain the sub-graph $G_t$. For all entities in the sub-graph, we use the same encoder as for SOAP segments to get entity embeddings. The average token embedding of each entity is used as its raw embedding, denoted as $\mathbf{e}^{0} \in \mathbb{R}^{d}$. We employ a Graph Attention Network (GAT) (Velickovic et al. 2018) to merge neighboring information for each entity:

$$\alpha^{k}_{ij} = \frac{\exp\left(\sigma_1\left(\mathbf{a}^{\mathsf{T}} [\mathbf{W}^{k} \mathbf{e}_i^{0} \,\|\, \mathbf{W}^{k} \mathbf{e}_j^{0}]\right)\right)}{\sum_{\mu \in \mathcal{N}_i} \exp\left(\sigma_1\left(\mathbf{a}^{\mathsf{T}} [\mathbf{W}^{k} \mathbf{e}_i^{0} \,\|\, \mathbf{W}^{k} \mathbf{e}_\mu^{0}]\right)\right)}, \tag{3}$$

$$\mathbf{e}_i = \left[\sigma_2\left(\sum_{j \in \mathcal{N}_i} \alpha^{k}_{ij} \mathbf{W}^{k} \mathbf{e}_j^{0}\right)\right]_{k=1}^{h}, \tag{4}$$

where $\mathbf{e}_i \in \mathbb{R}^{d}$ represents the updated embedding, $\mathbf{a} \in \mathbb{R}^{2d}$ and $\mathbf{W}^{k} \in \mathbb{R}^{d_h \times d}$ are learnable parameters, $\sigma_1$ and $\sigma_2$ denote activation functions, $\mathcal{N}_i$ is the set of neighboring entities that connect to entity $i$, and $h$ is the number of heads.
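A small NumPy sketch of Equations (3) and (4) may help fix the shapes. Several choices here are assumptions rather than details from the text: $\sigma_1$ is taken to be LeakyReLU (as in the original GAT), $\sigma_2$ is taken to be tanh, the attention vector is shared across heads, and it has length $2 d_h$ so that it matches the concatenation of two projected embeddings. The names `gat_head` and `gat_layer` are hypothetical.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)


def gat_head(E0, neighbors, W, a):
    """One attention head.

    E0:        (n, d) raw entity embeddings e^0
    neighbors: list of neighbor index lists, one non-empty list per entity
    W:         (d_h, d) projection W^k
    a:         (2 * d_h,) attention vector (length is an assumption)
    """
    H = E0 @ W.T                                   # (n, d_h): W^k e^0 for every entity
    out = np.zeros_like(H)
    for i, nbrs in enumerate(neighbors):
        # Eq. (3): unnormalized scores for each neighbor j of entity i.
        logits = np.array(
            [leaky_relu(a @ np.concatenate([H[i], H[j]])) for j in nbrs]
        )
        alpha = np.exp(logits - logits.max())      # stable softmax over neighbors
        alpha /= alpha.sum()
        # Eq. (4), inner part: weighted sum of projected neighbors, then sigma_2.
        out[i] = np.tanh(alpha @ H[nbrs])
    return out


def gat_layer(E0, neighbors, Ws, a):
    """Concatenate h heads; with d_h = d / h the output is again (n, d)."""
    return np.concatenate([gat_head(E0, neighbors, W, a) for W in Ws], axis=1)


rng = np.random.default_rng(0)
E0 = rng.normal(size=(4, 6))                       # 4 entities, d = 6
neighbors = [[1, 2], [0, 3], [0], [1]]
Ws = [rng.normal(size=(3, 6)) for _ in range(2)]   # h = 2 heads, d_h = 3
a = rng.normal(size=6)

E = gat_layer(E0, neighbors, Ws, a)
print(E.shape)
```

An entity with a single neighbor receives attention weight 1 on that neighbor, so its updated embedding is just the activated projection of that neighbor.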

The multi-disease classifier incorporates a cross-attention layer to facilitate interaction between segment representations and entity embeddings. Each segment representation integrates relevant entity information from the graph, and this enriched representation is then transformed via linear mapping to estimate probabilities for multiple diseases. The attention matrix A and the probability pjdsubscriptsuperscript𝑝𝑑𝑗p^{d}_{j}italic_p start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT for each disease are calculated as follows:

A =softmax(𝐐𝐊𝖳d),absentsoftmaxsuperscript𝐐𝐊𝖳𝑑\displaystyle=\text{softmax}(\frac{{\textbf{Q}}{\textbf{K}}^{\mathsf{T}}}{% \sqrt{d}}),= softmax ( divide start_ARG bold_Q bold_K start_POSTSUPERSCRIPT sansserif_T end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG italic_d end_ARG end_ARG ) , (5)
pjdsubscriptsuperscript𝑝𝑑𝑗\displaystyle p^{d}_{j}italic_p start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT =sigmoid(i=1ns[𝐀𝐕𝐎]ij),j=1,,n,formulae-sequenceabsentsigmoidsubscriptsuperscriptsubscript𝑛𝑠𝑖1subscriptdelimited-[]𝐀𝐕𝐎𝑖𝑗𝑗1𝑛\displaystyle=\text{sigmoid}(\sum^{n_{s}}_{i=1}[\textbf{A}{\textbf{V}}\textbf{% O}]_{ij}),j=1,\dots,n,= sigmoid ( ∑ start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT [ bold_A bold_V bold_O ] start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ) , italic_j = 1 , … , italic_n , (6)

where 𝐊nGt×d𝐊superscriptsubscript𝑛subscript𝐺𝑡𝑑\textbf{K}\in\mathbb{R}^{n_{G_{t}}\times d}K ∈ blackboard_R start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_G start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT × italic_d end_POSTSUPERSCRIPT and 𝐕nGt×d𝐕superscriptsubscript𝑛subscript𝐺𝑡𝑑\textbf{V}\in\mathbb{R}^{n_{G_{t}}\times d}V ∈ blackboard_R start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_G start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT × italic_d end_POSTSUPERSCRIPT are the linear projected matrix based on the concatenation of entity embeddings {𝐞i}iGtsubscriptsubscript𝐞𝑖𝑖subscript𝐺𝑡\{\textbf{e}_{i}\}_{i\in G_{t}}{ e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i ∈ italic_G start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT, and 𝐐ns×d𝐐superscriptsubscript𝑛𝑠𝑑\textbf{Q}\in\mathbb{R}^{n_{s}\times d}Q ∈ blackboard_R start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT × italic_d end_POSTSUPERSCRIPT is based on the segment representations {𝐒i}i=1nssuperscriptsubscriptsubscript𝐒𝑖𝑖1subscript𝑛𝑠\{\textbf{S}_{i}\}_{i=1}^{n_{s}}{ S start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT end_POSTSUPERSCRIPT. 𝐎d×n𝐎superscript𝑑𝑛\textbf{O}\in\mathbb{R}^{d\times n}O ∈ blackboard_R start_POSTSUPERSCRIPT italic_d × italic_n end_POSTSUPERSCRIPT denotes an output projection matrix, and n𝑛nitalic_n is the number of total diseases in our corpus. 
We select the probabilities of the diseases in the preliminary list and apply a threshold (see Experiments) to predict multiple diseases, which form the refined differential diagnosis $\{d'_i\}_{i=1}^{n_d}$.
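As a rough illustration of Eq. (6) and the thresholding step, the following pure-Python sketch computes disease probabilities from toy matrices. It assumes that the attention matrix A is a row-wise softmax of scaled dot products between Q and K, since A is defined outside this section. The function names, the toy values, and the strict `>` comparison are illustrative assumptions, not the authors' implementation; the 0.8 default mirrors the threshold reported in the implementation details.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def disease_probabilities(Q, K, V, O):
    """Eq. (6): p_j = sigmoid(sum_i [A V O]_ij).

    Q is n_s x d, K and V are n_G x d, O is d x n.
    ASSUMPTION: A = softmax(Q K^T / sqrt(d)), applied row-wise.
    """
    d = len(Q[0])
    K_t = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_t)                                   # n_s x n_G
    A = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    AVO = matmul(matmul(A, V), O)                             # n_s x n
    return [sigmoid(sum(col)) for col in zip(*AVO)]           # sum over segments i

def refine(probs, preliminary, threshold=0.8):
    """Keep diseases from the preliminary list whose probability exceeds the threshold."""
    return [j for j in preliminary if probs[j] > threshold]

# Toy example: 2 segments, 3 graph entities, d = 2, n = 3 diseases (all values made up).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
O = [[2.0, -2.0, 0.0], [2.0, -2.0, 0.0]]
probs = disease_probabilities(Q, K, V, O)
print(refine(probs, preliminary=[0, 2]))  # prints [0]
```

Only diseases in the preliminary list are considered, so disease 1 is never returned here even if its probability were high.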

We optimize the multi-disease prediction by minimizing a binary cross-entropy loss. The loss function $\mathcal{L}_d$ is calculated as follows:

$$\mathcal{L}_d = -\frac{1}{K}\sum \big[y^{d}_{j} \cdot \log p^{d}_{j} + (1 - y^{d}_{j}) \cdot \log(1 - p^{d}_{j})\big], \tag{7}$$

where $y^{d}_{j}$ is the label of the $j$-th disease, and $K$ denotes the number of diseases in the preliminary list. We optimize only the disease probabilities within each list, and the lists vary across dialogues. Besides, we add an explanation loss (Rigotti et al. 2022) to supervise the attention weights. This loss can guide the attention heads to attend to entities that are beneficial for disease classification. The loss $\mathcal{L}_{expl}$ is defined as:

$$\mathcal{L}_{expl} = \|\mathbf{A} - \mathbf{A}'\|_F^2, \tag{8}$$

where $\|\cdot\|_F$ denotes the Frobenius norm and $\mathbf{A}'$ is the desired distribution of attention. The final loss used for training multi-disease classification is defined as follows:

$$\mathcal{L} = \alpha\mathcal{L}_d + \beta\mathcal{L}_{expl}, \tag{9}$$

where $\alpha$ and $\beta$ are weights for balancing these two losses.
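The three losses in Eqs. (7) to (9) can be sketched in a few lines of pure Python. This is a minimal illustration on plain lists, not the authors' training code: the function names and the clamping constant `eps` are assumptions added for numerical safety, and the defaults `alpha=1.0`, `beta=0.5` follow the weights reported in the implementation details.

```python
import math

def bce_loss(probs, labels, preliminary):
    """Eq. (7): mean binary cross-entropy over the K diseases in the preliminary list."""
    eps = 1e-12  # numerical guard against log(0); not part of the equation
    total = 0.0
    for j in preliminary:
        p = min(max(probs[j], eps), 1.0 - eps)
        total += labels[j] * math.log(p) + (1 - labels[j]) * math.log(1.0 - p)
    return -total / len(preliminary)

def explanation_loss(A, A_target):
    """Eq. (8): squared Frobenius norm of the difference between A and the desired A'."""
    return sum((a - t) ** 2 for row, t_row in zip(A, A_target) for a, t in zip(row, t_row))

def total_loss(probs, labels, preliminary, A, A_target, alpha=1.0, beta=0.5):
    """Eq. (9): weighted sum of the classification and explanation losses."""
    return alpha * bce_loss(probs, labels, preliminary) + beta * explanation_loss(A, A_target)

# Toy values (made up): two diseases in the preliminary list, one attention row.
loss = total_loss([0.5, 0.5], [1, 0], [0, 1], A=[[1.0, 0.0]], A_target=[[0.5, 0.5]])
```

Because the sum in `bce_loss` runs only over `preliminary`, diseases outside the list do not contribute to the gradient, which matches the statement that probabilities are optimized within each list.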

Diagnosis-guided Response Generation

In this section, we utilize the refined diagnosis to guide medical knowledge retrieval, thereby enhancing response generation with the aid of retrieved disease-related knowledge. The diagnostic results help to select knowledge more accurately and guide the dialogue around related diseases. A doctor dialogue act predictor is introduced since acts help to select a specific aspect of disease knowledge (e.g., clinical manifestations or examinations) and guide the flow of the dialogue (Xu et al. 2023). The detailed architecture is shown on the right of Figure 2.

Dialogue Act Prediction.

We develop a multi-act predictor to assist knowledge retrieval and manage the dialogue. The input of the predictor consists of the SOAP segments and the dialogue history. We prepend a “[CLS]” token to the segment sequence and encode this sequence using a BERT encoder; the same procedure is applied to the dialogue history. The final hidden states of the “[CLS]” tokens in these two encodings are selected as the representations of the segments $\mathbf{H}_t^s \in \mathbb{R}^{d}$ and the dialogue history $\mathbf{H}_t \in \mathbb{R}^{d}$. We merge the structured patient information and dialogue content to predict dialogue acts. The act probability is calculated as follows:

$$p^{a}_{i} = \text{sigmoid}(\mathbf{W}^{a}[\mathbf{H}_t^s; \mathbf{H}_t]), \quad i = 1, \dots, m, \tag{10}$$

where $\mathbf{W}^{a} \in \mathbb{R}^{m \times 2d}$ is a trainable parameter matrix, $[;]$ denotes the concatenation operation, and $m$ is the number of candidate acts. The predicted dialogue acts $\{a_i\}_{i=1}^{n_a}$ are selected by applying a threshold (see Experiments).

We apply a binary cross-entropy loss to optimize the multi-act prediction. The loss function is denoted as follows:

$$\mathcal{L}_a = -\frac{1}{m}\sum \big[y^{a}_{i} \cdot \log p^{a}_{i} + (1 - y^{a}_{i}) \cdot \log(1 - p^{a}_{i})\big], \tag{11}$$

where $y^{a}_{i}$ is the label of the $i$-th dialogue act.
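The act predictor in Eq. (10) reduces to a single linear layer over two concatenated [CLS] vectors. The sketch below shows this with plain lists standing in for the BERT outputs. The function name, the toy weights, and the use of one threshold per act with a strict `>` comparison are assumptions for illustration only.

```python
import math

def predict_acts(W, h_seg, h_hist, thresholds):
    """Eq. (10): p_i = sigmoid(row_i(W^a) . [H^s_t ; H_t]), then per-act thresholding.

    W has m rows of length 2d; h_seg and h_hist each have length d.
    """
    features = h_seg + h_hist  # concatenation [H^s_t ; H_t], length 2d
    probs = [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(row, features)))) for row in W]
    acts = [i for i, (p, t) in enumerate(zip(probs, thresholds)) if p > t]
    return probs, acts

# Toy example with d = 2 and m = 2 candidate acts (all values made up).
W = [[1.0, 1.0, 1.0, 1.0], [-1.0, -1.0, -1.0, -1.0]]
act_probs, acts = predict_acts(W, h_seg=[1.0, 0.0], h_hist=[0.0, 1.0], thresholds=[0.5, 0.5])
```

Since each act has its own sigmoid output, several acts can be selected for the same turn.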

Disease-guided Retrieval.

To augment response generation, we retrieve disease-related passages based on the refined differential diagnosis and predicted dialogue acts. These passages come from a medical encyclopedia website and provide external knowledge for response generation. For dialogue acts directly related to a specific aspect of medicine, we choose the corresponding passages without requiring retrieval. For example, responses with the act “Inquire about present illness” are closely related to the passage describing the clinical manifestations of a disease. For non-medical dialogue acts, we use the current dialogue history to retrieve relevant passages from a disease corpus. We remove stop words and punctuation from the dialogue history sequence and retrieve the top-$k$ passages through the BM25 algorithm (Schütze, Manning, and Raghavan 2008).
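The retrieval step for non-medical acts can be sketched with a small from-scratch BM25 scorer. This is only an illustration under several assumptions: the stop-word list is a tiny English placeholder and tokenization is by whitespace, whereas the actual corpora are Chinese and would need a proper word segmenter. The parameters `k1=1.5` and `b=0.75` are common defaults rather than values stated here, and the IDF uses the smoothed variant with `1 +` inside the logarithm.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "i", "is", "of", "and", "in", "my"}  # illustrative only

def tokenize(text):
    """Lowercase, replace punctuation with spaces, and drop stop words."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

def bm25_top_k(query, passages, k=5, k1=1.5, b=0.75):
    """Return indices of the k passages with the highest BM25 score for the query."""
    docs = [tokenize(p) for p in passages]
    n_docs = len(docs)
    avg_len = (sum(len(d) for d in docs) / n_docs) or 1.0
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = tokenize(query)
    scored = []
    for idx, doc in enumerate(docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1.0 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scored.append((score, idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by index
    return [idx for _, idx in scored[:k]]

passages = [
    "Acid reflux causes heartburn and chest discomfort.",
    "Gastritis is inflammation of the stomach lining.",
    "Bland diet advice for recovery.",
]
top = bm25_top_k("I have heartburn and acid reflux", passages, k=2)
```

In this toy corpus only the first passage shares terms with the query, so it ranks first and the remaining slot is filled by index order.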

Response Generation.

After retrieving medical knowledge and predicting dialogue acts, we incorporate these two pieces of information to guide response generation. We construct a generation model following the Fusion-in-Decoder (FiD) method (Izacard and Grave 2021; Shuster et al. 2021), which allows the decoder to attend to all encoding representations at the same time when generating a response. The input of the model contains a group of sequences, where the first sequence is the dialogue history sequence $U_t$ concatenated with unique tokens denoting dialogue acts $\{a_i\}$, and the other sequences are retrieved knowledge sequences $Z_t$. Each sequence is concatenated with a special token (“[U]” or “[K]”) at the front to represent the dialogue history or knowledge. Compared with concatenating the dialogue history with knowledge as a single input sequence, the above input can leverage a longer dialogue history.
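The input layout described above can be sketched as follows. The helper names and the bracketed act-token format are hypothetical, and the "encodings" are plain lists standing in for encoder outputs; in a real FiD model each sequence would be encoded independently and the resulting hidden states concatenated along the length axis before decoding.

```python
def build_fid_inputs(history, acts, passages):
    """Return one '[U]' sequence (history plus act tokens) and one '[K]' sequence per passage."""
    act_tokens = " ".join(f"[{act}]" for act in acts)  # hypothetical act-token format
    first = f"[U] {history} {act_tokens}".strip()
    return [first] + [f"[K] {passage}" for passage in passages]

def fuse_encodings(encoded_sequences):
    """Concatenate per-sequence encoder outputs so the decoder can attend to all of them."""
    return [vec for seq in encoded_sequences for vec in seq]

inputs = build_fid_inputs(
    history="Patient: I have heartburn.",
    acts=["inquire_present_illness"],
    passages=["Clinical manifestations of reflux ...", "Dietary advice ..."],
)
fused = fuse_encodings([[1, 2], [3]])  # stand-in vectors; yields [1, 2, 3]
```

Because the history occupies its own sequence, its length budget is not shared with the knowledge passages, which is the advantage the paragraph above points to.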

We train the generation model with a negative log-likelihood loss. The loss function is defined as:

$$\mathcal{L}_g = -\sum_{i=1}^{N} \log p(U^{D}_{t,i}), \tag{12}$$

where $p(U^{D}_{t,i})$ is the probability of the $i$-th response token. During training, we use ground truth acts as part of the input and apply ground truth diseases to guide knowledge retrieval. During inference, we apply the predicted acts and diseases as guidance.

Experiments

| Setting | Methods | B-1 | B-2 | B-4 | R-1 | R-2 | E-P | E-R | E-F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| w/o Pre-training | Seq2Seq | 28.55 | 22.85 | 15.45 | 25.61 | 11.24 | 16.79 | 10.44 | 12.88 |
| | Seq2Seq-Entity | 29.13 | 23.22 | 15.66 | 25.79 | 11.42 | 23.79 | 15.89 | 19.06 |
| | HRED | 31.61 | 25.22 | 17.05 | 24.17 | 9.79 | 15.56 | 10.12 | 12.26 |
| | HRED-Entity | 32.84 | 26.12 | 17.63 | 24.26 | 9.76 | 21.75 | 15.33 | 17.98 |
| | VRBot | 29.69 | 23.90 | 16.34 | 24.69 | 11.23 | 18.67 | 9.72 | 12.78 |
| w/ Pre-trained LM | GPT-2 | 35.27 | 28.19 | 19.16 | 28.74 | 13.61 | 18.29 | 14.45 | 16.14 |
| | GPT-2-Entity | 34.56 | 27.56 | 18.71 | 28.78 | 13.62 | 21.27 | 17.10 | 18.96 |
| | BART | 34.94 | 27.99 | 19.06 | 29.03 | 14.40 | 19.97 | 14.29 | 16.66 |
| | BART-Entity | 34.14 | 27.19 | 18.42 | 28.52 | 13.67 | 23.49 | 16.90 | 19.66 |
| | DFMed | 42.56 | 33.34 | 22.53 | 29.31 | 14.21 | 22.48 | 22.84 | 22.66 |
| | IADDx (Ours) | 43.17† | 34.09† | 23.33† | 29.60† | 14.37 | 21.81 | 22.90† | 22.34 |

Table 1: Automatic evaluation results on MedDG. † denotes statistically significant differences ($p$ = 0.05).

Datasets

We perform our experiments using two medical dialogue datasets: MedDG (Liu et al. 2022b) and KaMed (Li et al. 2021). MedDG encompasses 17K dialogues, primarily centered on 12 diseases within the gastroenterology department. We partition the dataset into 14862, 1999, and 999 dialogues for training, validation, and testing, respectively. KaMed offers a comprehensive collection of over 63K dialogues spanning nearly 100 hospital departments. To address privacy concerns (see the Appendix), we exclude certain dialogues from KaMed, resulting in 29159, 1532, and 1539 dialogues for training, validation, and testing, respectively.

Baseline Models

We compare IADDx with six baseline models. Non-pretrained models: (1) Seq2Seq (Sutskever, Vinyals, and Le 2014) employs an RNN for sequence-to-sequence generation enhanced by an attention layer. (2) HRED (Serban et al. 2016) leverages a multi-level RNN design to encode dialogues at both the token and utterance levels. (3) VRBot (Li et al. 2021) is designed for medical dialogue generation, emphasizing the tracking and prediction of patient and doctor entities. Pretrained models: (1) GPT-2 (Radford et al. 2019) is a transformer decoder-based language model. (2) BART (Lewis et al. 2020) is a transformer-based encoder-decoder model. (3) DFMed (Xu et al. 2023) is a medical dialogue generation model that learns entity and dialogue act flows. For our experiments on the MedDG dataset, we supplement models with entity hints as described by Liu et al. (2022b). This involves appending extracted medical entities to the end of the input sequence.

Evaluation Metrics

Automatic Evaluation.

BLEU (Papineni et al. 2002) and ROUGE (Lin 2004) scores across varied n-grams (specifically, B-1, B-2, B-4, R-1, and R-2) are utilized as metrics to evaluate the quality of generated responses. Additionally, in alignment with Liu et al. (2022b), we calculate the precision, recall, and F1 of entities mentioned in the responses, denoted as E-P, E-R, and E-F1 respectively. We evaluate the accuracy of the differential diagnosis using the disease F1 score, denoted as D-F1.
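To make the entity metrics concrete, here is a minimal sketch of entity-level precision, recall, and F1. It assumes that entities have already been extracted from each response as a set and that counts are pooled over all responses (micro-averaging); the exact protocol of Liu et al. (2022b) may differ, and the function name and toy entities are illustrative.

```python
def entity_prf(predicted, gold):
    """Micro-averaged entity precision, recall, and F1.

    predicted, gold: lists of entity sets, one set per response (ASSUMED format).
    """
    true_pos = sum(len(p & g) for p, g in zip(predicted, gold))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example (made up): two responses, three predicted and three gold entities.
e_p, e_r, e_f1 = entity_prf(
    predicted=[{"GERD", "omeprazole"}, {"gastritis"}],
    gold=[{"GERD", "mosapride"}, {"gastritis"}],
)
```

Here two of the three predicted entities match the gold entities, so precision, recall, and F1 all equal 2/3.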

| Methods | B-1 | B-2 | B-4 | R-1 | R-2 |
| --- | --- | --- | --- | --- | --- |
| Seq2Seq | 23.52 | 18.56 | 12.13 | 23.56 | 8.67 |
| HRED | 26.75 | 21.08 | 13.91 | 22.93 | 7.80 |
| VRBot | 30.04 | 23.76 | 16.36 | 18.71 | 7.28 |
| GPT-2 | 33.76 | 26.58 | 17.82 | 26.80 | 10.56 |
| BART | 33.62 | 26.43 | 17.64 | 27.91 | 11.43 |
| DFMed | 40.20 | 30.97 | 20.76 | 28.28 | 11.54 |
| IADDx (Ours) | 40.98† | 31.69† | 21.35† | 28.31† | 11.67† |

Table 2: Automatic evaluation results on KaMed. † denotes statistically significant differences ($p$ = 0.05).

Human Evaluation.

We randomly selected 100 cases and engaged three physicians for manual evaluation. We compare IADDx with several baseline models. Drawing upon previous studies (Liu et al. 2022b; Li et al. 2021), we assess the generated responses using three metrics: sentence fluency (FLU), knowledge accuracy (KC), and overall quality (EQ). Each metric is rated on a 5-point Likert scale, ranging from 1 (poorest) to 5 (excellent).

Implementation Details

We apply MedBERT (https://github.com/trueto/medbert), a BERT-base model pre-trained on Chinese medical documents, as the backbone for the encoders in the differential diagnosis component. The disease corpus used for the intuitive association is extracted from an online medical encyclopedia website named baikemy (https://www.baikemy.com/), which contains medical knowledge certified by specialists. In the intuitive association stage, we select the top 50 diseases as the preliminary diagnosis results. In the analytic refinement stage, we predict multiple diseases from this list. We set the weights for $\mathcal{L}_d$ and $\mathcal{L}_{expl}$ to 1 and 0.5, respectively. The threshold for predicting multiple diseases is set to 0.8 based on the disease F1 scores on the validation set. In the response generation component, we retrieve knowledge from the same corpus and select the top 5 passages for generating responses. The dialogue act predictor is also based on MedBERT. We predict from 10 candidate acts and tune a separate threshold for each act to maximize its F1 score. The generator is a pre-trained BART-base model (https://huggingface.co/fnlp/bart-base-chinese) with a six-layer encoder and a six-layer decoder. We adopt the AdamW optimizer (Loshchilov and Hutter 2019) to train the above models and run all experiments on a single RTX 3090 GPU. Further training details are provided in the Appendix.
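Threshold selection on validation data, as used for the disease and act predictors, can be sketched as a simple grid search. The candidate grid, the micro-averaged F1, and the function names below are assumptions, since the text only states that thresholds were chosen by validation F1.

```python
def micro_f1(prob_rows, label_rows, threshold):
    """Micro-averaged F1 of multi-label predictions obtained with one threshold."""
    tp = fp = fn = 0
    for probs, labels in zip(prob_rows, label_rows):
        for p, y in zip(probs, labels):
            predicted = p > threshold
            if predicted and y:
                tp += 1
            elif predicted and not y:
                fp += 1
            elif not predicted and y:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def select_threshold(prob_rows, label_rows, candidates=None):
    """Return the candidate threshold with the highest validation F1 (first one on ties)."""
    if candidates is None:
        candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95 (assumed grid)
    return max(candidates, key=lambda t: micro_f1(prob_rows, label_rows, t))

# Toy validation set (made up): two dialogues, three candidate diseases each.
val_probs = [[0.90, 0.62, 0.20], [0.87, 0.68, 0.10]]
val_labels = [[1, 0, 0], [1, 0, 0]]
best = select_threshold(val_probs, val_labels)
```

For per-act thresholds, the same search could be run once per act on the corresponding column of probabilities and labels.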

Evaluation of Dialogue Generation

| Methods | FLU | KC | EQ |
| --- | --- | --- | --- |
| BART | 3.77 | 1.87 | 3.12 |
| BART-Entity | 3.79 | 2.05 | 3.41 |
| DFMed | 3.91 | 2.26 | 3.59 |
| IADDx (Ours) | 4.08 | 2.41 | 3.83 |

Table 3: Human evaluation results on MedDG.

Automatic Evaluation.

The dialogue generation results on the MedDG dataset are presented in Table 1, and the results on the KaMed dataset are in Table 2. Our IADDx method outperforms the baseline models on most evaluation metrics. On the MedDG dataset, IADDx surpasses the state-of-the-art method DFMed by 0.61, 0.75, and 0.80 in B-1, B-2, and B-4, and by 0.29 and 0.16 in R-1 and R-2. Additionally, we achieve comparable entity accuracy even without employing entity flow learning as DFMed does. The reason is that we use the multiple diseases in the differential diagnosis to retrieve pertinent knowledge and subsequently leverage this knowledge to enhance response generation. Similar advantages are evident on the KaMed dataset, where IADDx outperforms DFMed by 0.78, 0.72, and 0.59 in B-1, B-2, and B-4. These results indicate that modeling and incorporating differential diagnosis aids in generating coherent, informative, and accurate responses.

Human Evaluation.

Table 3 shows the results of human evaluation on the MedDG dataset. Our method outperforms baselines in all metrics. This suggests that by explicitly modeling the differential diagnosis and using it to guide response generation, we can produce more informative and accurate responses. The Fleiss’ kappa (Fleiss 1971) score is 0.49, indicating a moderate level of inter-annotator agreement.
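For reference, Fleiss' kappa can be computed from a table of rating counts as sketched below. This is the standard formula; how the three physicians' Likert scores were arranged into categories for the reported value is not stated here, so the input layout and toy counts are assumptions.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item must have the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Observed agreement per item, then averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Expected agreement from the overall category proportions.
    p_cats = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    p_exp = sum(p * p for p in p_cats)
    return (p_bar - p_exp) / (1.0 - p_exp)

# Toy example (made up): 2 items, 3 raters, 2 categories, perfect agreement.
kappa_perfect = fleiss_kappa([[3, 0], [0, 3]])
```

The formula is undefined when all ratings fall into a single category, because the expected agreement is then 1 and the denominator is zero.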

Analysis of Intuitive-then-Analytic Differential Diagnosis

| Datasets | Methods | D-F1 | B-1 | B-4 | R-2 |
| --- | --- | --- | --- | --- | --- |
| MedDG | IADDx | 43.50 | 43.17 | 23.33 | 14.37 |
| | w/o DDx | - | 42.21 | 22.08 | 13.78 |
| | w/o Analytic | 37.01 | 43.02 | 22.98 | 14.09 |
| | w/o DOG | 42.61 | 43.13 | 23.22 | 14.23 |
| KaMed | IADDx | 50.23 | 40.98 | 21.35 | 11.67 |
| | w/o DDx | - | 39.03 | 19.54 | 10.22 |
| | w/o Analytic | 40.26 | 40.74 | 21.16 | 11.44 |
| | w/o DOG | 49.52 | 40.86 | 21.22 | 11.52 |

Table 4: Ablation results on two datasets.

To delve deeper into the efficacy of our approach, we examine multiple variants of our IADDx method as follows: (1) w/o DDx, which removes the entire differential diagnosis component and generates responses conditioned solely on the dialogue history and dialogue acts. (2) w/o Analytic, which removes the analytic refinement on the preliminary diagnosis and employs the top 5 ranked diseases to guide knowledge retrieval. (3) w/o DOG, which removes the diagnosis-oriented graph in the multi-disease classifier and solely adopts the mean representations of SOAP segments and a subsequent linear layer to classify diseases.

Table 4 displays the overall ablation results. Performance drops on all metrics for every ablation variant, which underlines the role of each module in our proposed method. Among these variants, the response quality of w/o DDx decreases the most (B-4 falls from 23.33 to 22.08 on MedDG and from 21.35 to 19.54 on KaMed) due to the absence of diagnosis-related knowledge. Such knowledge is crucial for providing essential disease information that enhances the informativeness and accuracy of the response. Besides, the differential diagnosis performance of w/o Analytic drops markedly compared with the complete model (D-F1 falls from 43.50 to 37.01 on MedDG and from 50.23 to 40.26 on KaMed). This is because the retrieval-based intuitive association provides a fixed number of potential diseases and cannot ensure that all relevant diseases rank high, inadvertently introducing unrelated diseases. The results of w/o DOG demonstrate that the diagnosis-oriented graph plays a role in the differential diagnosis and the subsequent diagnosis-guided response generation.

Case Study

Patient: Hello, doctor! Over the past week, I’ve been experiencing occasional subtle pain in the center of my upper abdomen, and sometimes this pain radiates to my back. Last night, I felt a tightness in my chest and a heavy sensation in my throat, almost like acid reflux. What could be the issue?

Doctor: Hello, are you currently experiencing acid reflux or heartburn? Do you burp often?

Patient: I haven’t had noticeable acid reflux, but there’s a slight burning sensation. I started burping since yesterday. After taking medicine today, the burping has become less frequent. Sometimes, there’s also a subtle burning feeling in my back.

Patient: I didn’t sleep well last night since I felt discomfort in my upper abdomen, and I woke up several times.

Gold Response: It’s likely that you have gastroesophageal reflux disease (GERD). I’d recommend trying Mosapride. For now, avoid greasy, fried, and spicy foods.

BART: I suspect it’s gastroesophageal reflux disease (GERD).

BART-Entity: Yes, it seems to be caused by an excessive secretion of stomach acid.

DFMed: I believe you might have weak gastrointestinal motility and poor digestive function. I recommend taking Omeprazole, Domperidone, and Mosapride for treatment. Stick to a bland diet and avoid spicy, irritating, and fried foods.

Differential Diagnosis: (1) reflux esophagitis, (2) gastroesophageal reflux, (3) bile reflux gastritis, (4) duodenogastric reflux

Attended Entities in DOG: [digestive system] → [stomach] → [reflux esophagitis / gastroesophageal reflux] → [upper abdominal pain / poststernal burning sensation / acid reflux]

IADDx (Ours): I suspect you have gastroesophageal reflux disease (GERD). I recommend taking Rabeprazole, Mosapride, and Magnesium aluminum carbonate for treatment. Over the next few days, be mindful of your diet. Focus on easily digestible foods, and try to avoid acidic, spicy, or cold foods.

Table 5: Case study. The responses are generated by different models, where diseases are underlined.

We present a case study of medical dialogue generation by different models in Table 5. We observe that the response generated by our IADDx method offers a more accurate diagnosis, along with a treatment plan tailored to that diagnosis, compared to the responses from the three baseline models. The prescribed medicines, Rabeprazole, Mosapride, and Magnesium aluminum carbonate, are suitable for treating the diagnosed disease, GERD. This suggests that IADDx performs a reasonable differential diagnosis and leverages it to enhance response generation. Besides, although four diseases are included in the differential diagnosis, IADDx can implicitly select the most likely disease to generate a response.

Interpretation of differential diagnosis.

Our IADDx offers a comprehensive list of differential diagnoses that can explain the generated response. In this dialogue, IADDx identifies four potential diseases that exhibit similar symptoms. The first and second diseases are both related to GERD and have been utilized to generate the response. Beyond these two, physicians still need to consider the potential of the other diseases and seek to confirm or rule them out in subsequent conversations. IADDx also provides a diagnosis path composed of systems, organs, diseases, and symptoms to interpret the differential diagnosis. We select entities with high attention weights to build the diagnosis path. We observe that the patient’s condition predominantly pertains to the digestive system and the stomach. Additionally, the attention to specific symptom entities lends further support to the potential diseases: reflux esophagitis and gastroesophageal reflux.
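The construction of a diagnosis path from attention weights can be sketched as a filter-and-group step. Everything specific below is hypothetical: the tuple format, the level names, the attention weights, and the cutoff `min_weight` are made up for illustration, and the entity names are taken from the case study.

```python
LEVELS = ["system", "organ", "disease", "symptom"]  # path order described in the text

def diagnosis_path(entities, min_weight=0.1):
    """Build a path string from (name, level, attention_weight) tuples.

    Entities below min_weight are dropped; the rest are grouped by level and
    sorted by descending attention weight within each level.
    """
    steps = []
    for level in LEVELS:
        kept = sorted(
            (e for e in entities if e[1] == level and e[2] >= min_weight),
            key=lambda e: -e[2],
        )
        if kept:
            steps.append(" / ".join(name for name, _, _ in kept))
    return " -> ".join(f"[{step}]" for step in steps)

# Hypothetical attention weights for entities from the case study.
entities = [
    ("digestive system", "system", 0.30),
    ("stomach", "organ", 0.25),
    ("liver", "organ", 0.02),
    ("reflux esophagitis", "disease", 0.20),
    ("gastroesophageal reflux", "disease", 0.15),
    ("acid reflux", "symptom", 0.12),
    ("upper abdominal pain", "symptom", 0.18),
]
path = diagnosis_path(entities)
```

With these made-up weights, the low-attention entity "liver" is filtered out and the remaining entities form a system, organ, disease, symptom chain similar to the one shown in Table 5.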

Conclusion

In this work, we propose a medical dialogue generation framework with a differential diagnosis, IADDx, which explicitly models the process of differential diagnosis through intuitive association followed by analytic refinement. Moreover, we devise a diagnosis-oriented graph to interpret the differential diagnosis. The diagnosis results are utilized to guide medical knowledge retrieval and response generation. Experiments on two datasets demonstrate the efficacy of the proposed framework. Additionally, we illustrate how our framework aids clinicians and patients in understanding the diagnostic procedure, such as by generating intermediate outcomes and graph-based diagnostic paths.

References

  • Chen et al. (2022a) Chen, J.; Li, D.; Chen, Q.; Zhou, W.; and Liu, X. 2022a. Diaformer: Automatic Diagnosis via Symptoms Sequence Generation. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, 4432–4440. AAAI Press.
  • Chen et al. (2020) Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. E. 2020. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, 1597–1607. PMLR.
  • Chen et al. (2022b) Chen, W.; Li, Z.; Fang, H.; Yao, Q.; Zhong, C.; Hao, J.; Zhang, Q.; Huang, X.; Peng, J.; and Wei, Z. 2022b. A benchmark for automatic medical consultation system: frameworks, tasks and datasets. Bioinformatics. Btac817.
  • Croskerry (2009) Croskerry, P. 2009. A universal model of diagnostic reasoning. Academic medicine, 84(8): 1022–1028.
  • Devlin et al. (2019) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Minneapolis, Minnesota: Association for Computational Linguistics.
  • Fleiss (1971) Fleiss, J. L. 1971. Measuring nominal scale agreement among many raters. Psychological bulletin, 76(5): 378.
  • He et al. (2022) He, Z.; Han, Y.; Ouyang, Z.; Gao, W.; Chen, H.; Xu, G.; and Wu, J. 2022. DialMed: A Dataset for Dialogue-based Medication Recommendation. In Proceedings of the 29th International Conference on Computational Linguistics, 721–733. Gyeongju, Republic of Korea: International Committee on Computational Linguistics.
  • Izacard and Grave (2021) Izacard, G.; and Grave, E. 2021. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. In Merlo, P.; Tiedemann, J.; and Tsarfaty, R., eds., Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19 - 23, 2021, 874–880. Association for Computational Linguistics.
  • Kao, Tang, and Chang (2018) Kao, H.; Tang, K.; and Chang, E. Y. 2018. Context-Aware Symptom Checking for Disease Diagnosis Using Hierarchical Reinforcement Learning. In McIlraith, S. A.; and Weinberger, K. Q., eds., Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, 2305–2313. AAAI Press.
  • Krishna et al. (2021) Krishna, K.; Khosla, S.; Bigham, J.; and Lipton, Z. C. 2021. Generating SOAP Notes from Doctor-Patient Conversations Using Modular Summarization Techniques. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 4958–4972. Online: Association for Computational Linguistics.
  • Kurtz, Draper, and Silverman (2017) Kurtz, S.; Draper, J.; and Silverman, J. 2017. Teaching and learning communication skills in medicine. CRC Press.
  • Lewis et al. (2020) Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Jurafsky, D.; Chai, J.; Schluter, N.; and Tetreault, J. R., eds., Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, 7871–7880. Association for Computational Linguistics.
  • Li et al. (2021) Li, D.; Ren, Z.; Ren, P.; Chen, Z.; Fan, M.; Ma, J.; and de Rijke, M. 2021. Semi-Supervised Variational Reasoning for Medical Dialogue Generation. In Diaz, F.; Shah, C.; Suel, T.; Castells, P.; Jones, R.; and Sakai, T., eds., SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, 544–554. ACM.
  • Liao et al. (2020) Liao, K.; Liu, Q.; Wei, Z.; Peng, B.; Chen, Q.; Sun, W.; and Huang, X. 2020. Task-oriented Dialogue System for Automatic Disease Diagnosis via Hierarchical Reinforcement Learning. ArXiv, abs/2004.14254.
  • Lin (2004) Lin, C.-Y. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, 74–81. Barcelona, Spain: Association for Computational Linguistics.
  • Lin et al. (2021) Lin, S.; Zhou, P.; Liang, X.; Tang, J.; Zhao, R.; Chen, Z.; and Lin, L. 2021. Graph-Evolving Meta-Learning for Low-Resource Medical Dialogue Generation. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, 13362–13370. AAAI Press.
  • Lin et al. (2019) Lin, X.; He, X.; Chen, Q.; Tou, H.; Wei, Z.; and Chen, T. 2019. Enhancing Dialogue Symptom Diagnosis with Global Attention and Symptom Graph. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 5033–5042. Hong Kong, China: Association for Computational Linguistics.
  • Liu et al. (2022a) Liu, W.; Cheng, Y.; Wang, H.; Tang, J.; Liu, Y.; Zhao, R.; Li, W.; Zheng, Y.; and Liang, X. 2022a. “My nose is running.” “Are you also coughing?”: Building A Medical Diagnosis Agent with Interpretable Inquiry Logics. ArXiv, abs/2204.13953.
  • Liu et al. (2022b) Liu, W.; Tang, J.; Cheng, Y.; Li, W.; Zheng, Y.; and Liang, X. 2022b. MedDG: An Entity-Centric Medical Consultation Dataset for Entity-Aware Medical Dialogue Generation. In Lu, W.; Huang, S.; Hong, Y.; and Zhou, X., eds., Natural Language Processing and Chinese Computing - 11th CCF International Conference, NLPCC 2022, Guilin, China, September 24-25, 2022, Proceedings, Part I, volume 13551 of Lecture Notes in Computer Science, 447–459. Springer.
  • Liu et al. (2021) Liu, W.; Tang, J.; Liang, X.; and Cai, Q. 2021. Heterogeneous graph reasoning for knowledge-grounded medical dialogue system. Neurocomputing, 442: 260–268.
  • Loshchilov and Hutter (2019) Loshchilov, I.; and Hutter, F. 2019. Decoupled Weight Decay Regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
  • Maynard and Heritage (2005) Maynard, D. W.; and Heritage, J. 2005. Conversation analysis, doctor–patient interaction and medical communication. Medical education, 39(4): 428–435.
  • OpenAI (2023) OpenAI. 2023. GPT-4 Technical Report. CoRR, abs/2303.08774.
  • Papineni et al. (2002) Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics.
  • Radford et al. (2019) Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I.; et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8): 9.
  • Rigotti et al. (2022) Rigotti, M.; Miksovic, C.; Giurgiu, I.; Gschwind, T.; and Scotton, P. 2022. Attention-based Interpretability with Concept Transformers. In International Conference on Learning Representations.
  • Schütze, Manning, and Raghavan (2008) Schütze, H.; Manning, C. D.; and Raghavan, P. 2008. Introduction to Information Retrieval, volume 39. Cambridge University Press.
  • Serban et al. (2016) Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A. C.; and Pineau, J. 2016. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In Schuurmans, D.; and Wellman, M. P., eds., Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, 3776–3784. AAAI Press.
  • Shi et al. (2023) Shi, X.; Liu, Z.; Wang, C.; Leng, H.; Xue, K.; Zhang, X.; and Zhang, S. 2023. MidMed: Towards Mixed-Type Dialogues for Medical Consultation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 8145–8157. Toronto, Canada: Association for Computational Linguistics.
  • Shuster et al. (2021) Shuster, K.; Poff, S.; Chen, M.; Kiela, D.; and Weston, J. 2021. Retrieval Augmentation Reduces Hallucination in Conversation. In Moens, M.; Huang, X.; Specia, L.; and Yih, S. W., eds., Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021, 3784–3803. Association for Computational Linguistics.
  • Silverman, Kurtz, and Draper (2016) Silverman, J.; Kurtz, S.; and Draper, J. 2016. Skills for Communicating with Patients. CRC Press.
  • Stern, Cifu, and Altkorn (2020) Stern, S. D.; Cifu, A. S.; and Altkorn, D. 2020. Symptom to Diagnosis: An Evidence-Based Guide, 4e. McGraw-Hill Education.
  • Sutskever, Vinyals, and Le (2014) Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to Sequence Learning with Neural Networks. In Ghahramani, Z.; Welling, M.; Cortes, C.; Lawrence, N. D.; and Weinberger, K. Q., eds., Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 3104–3112.
  • Tchango et al. (2022) Tchango, A. F.; Goel, R.; Martel, J.; Wen, Z.; Marceau-Caron, G.; and Ghosn, J. 2022. Towards Trustworthy Automatic Diagnosis Systems by Emulating Doctors’ Reasoning with Deep Reinforcement Learning. In NeurIPS.
  • Tortora and Derrickson (2018) Tortora, G. J.; and Derrickson, B. H. 2018. Principles of Anatomy and Physiology. John Wiley & Sons.
  • Velickovic et al. (2018) Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2018. Graph Attention Networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
  • Wei et al. (2018) Wei, Z.; Liu, Q.; Peng, B.; Tou, H.; Chen, T.; Huang, X.; Wong, K.-f.; and Dai, X. 2018. Task-oriented Dialogue System for Automatic Diagnosis. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 201–207. Melbourne, Australia: Association for Computational Linguistics.
  • Xu et al. (2023) Xu, K.; Hou, W.; Cheng, Y.; Wang, J.; and Li, W. 2023. Medical Dialogue Generation via Dual Flow Modeling. In Findings of the Association for Computational Linguistics: ACL 2023, 6771–6784. Toronto, Canada: Association for Computational Linguistics.
  • Xu et al. (2019) Xu, L.; Zhou, Q.; Gong, K.; Liang, X.; Tang, J.; and Lin, L. 2019. End-to-End Knowledge-Routed Relational Dialogue System for Automatic Diagnosis. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, 7346–7353. AAAI Press.
  • Yan et al. (2022) Yan, G.; Pei, J.; Ren, P.; Ren, Z.; Xin, X.; Liang, H.; de Rijke, M.; and Chen, Z. 2022. ReMeDi: Resources for Multi-domain, Multi-service, Medical Dialogues. In Amigó, E.; Castells, P.; Gonzalo, J.; Carterette, B.; Culpepper, J. S.; and Kazai, G., eds., SIGIR ’22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, Madrid, Spain, July 11 - 15, 2022, 3013–3024. ACM.
  • Zeng et al. (2020) Zeng, G.; Yang, W.; Ju, Z.; Yang, Y.; Wang, S.; Zhang, R.; Zhou, M.; Zeng, J.; Dong, X.; Zhang, R.; Fang, H.; Zhu, P.; Chen, S.; and Xie, P. 2020. MedDialog: Large-scale Medical Dialogue Datasets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 9241–9250. Online: Association for Computational Linguistics.
  • Zhao et al. (2022) Zhao, Y.; Li, Y.; Wu, Y.; Hu, B.; Chen, Q.; Wang, X.; Ding, Y.; and Zhang, M. 2022. Medical Dialogue Response Generation with Pivotal Information Recalling. In Zhang, A.; and Rangwala, H., eds., KDD ’22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14 - 18, 2022, 4763–4771. ACM.
  • Zhou et al. (2021) Zhou, M.; Li, Z.; Tan, B.; Zeng, G.; Yang, W.; He, X.; Ju, Z.; Chakravorty, S.; Chen, S.; Yang, X.; Zhang, Y.; Wu, Q.; Yu, Z.; Xu, K.; Xing, E.; and Xie, P. 2021. On the Generation of Medical Dialogs for COVID-19. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 886–896. Online: Association for Computational Linguistics.