Description-Driven Task-Oriented Dialog Modeling

Jeffrey Zhao, Raghav Gupta, Yuan Cao, Dian Yu, Mingqiu Wang,
Harrison Lee, Abhinav Rastogi, Izhak Shafran, Yonghui Wu
Google Research
{jeffreyzhao, raghavgupta, yuancao}@google.com

Abstract

Task-oriented dialogue (TOD) systems are required to identify key information from conversations for the completion of given tasks. Such information is conventionally specified in terms of intents and slots contained in task-specific ontology or schemata. Since these schemata are designed by system developers, the naming convention for slots and intents is not uniform across tasks, and may not convey their semantics effectively. This can lead to models memorizing arbitrary patterns in data, resulting in suboptimal performance and generalization. In this paper, we propose that schemata should be modified by replacing names or notations entirely with natural language descriptions. We show that a language description-driven system exhibits better understanding of task specifications, higher performance on state tracking, improved data efficiency, and effective zero-shot transfer to unseen tasks. Following this paradigm, we present a simple yet effective Description-Driven Dialog State Tracking (D3ST) model, which relies purely on schema descriptions and an "index-picking" mechanism. We demonstrate the superiority in quality, data efficiency and robustness of our approach as measured on the MultiWOZ (Budzianowski et al., 2018), SGD (Rastogi et al., 2020), and the recent SGD-X (Lee et al., 2021b) benchmarks.

1 Introduction

The design of a task-oriented dialog (TOD) system conventionally starts with defining a schema specifying the information required to complete its tasks — usually, a list of relevant slots and intents. These slots and intents often appear as abbreviated notations, such as train-leaveat and hotel-internet, to indicate the domain of a task and the information it captures.

Models trained using these schemata will be heavily dependent on these abbreviations. This is especially true for decoder-only or sequence-to-sequence (seq2seq) TOD models, which are often trained with supervision to predict dialogue belief states as sequences of these notations. For example, a sequence such as train-leaveat=3:00pm, hotel-internet=no, or sequences of similar structure, are the target output for the TOD models described by Hosseini-Asl et al. (2020) and Zhao et al. (2021).

This has several disadvantages. First, the element notations convey little (and possibly ambiguous) semantic meaning about the requirements of the slot (Du et al., 2021), potentially harming language understanding. Second, task-specific abstract schema notations make it easy for a model to overfit on observed tasks and fail to transfer to unseen ones, even if there is sufficient semantic similarity between the two. Finally, creating notations for each slot and intent complicates the schema design process.

In this paper, we advocate for TOD schemata with intuitive, human-readable, and semantically-rich natural language descriptions, rather than the abbreviated notations that have become customary when designing TOD models. Instead of "hotel-internet", we assert that it is more natural to describe this slot as "whether the hotel has internet". This is easier for the designer of the TOD system when specifying the task ontology, and we also argue that it plays an important role in improving model quality and data efficiency.

To this end, we present a simple yet effective approach: Description-Driven Dialog State Tracking (D3ST). Here, schema descriptions are indexed and concatenated as prefixes to a seq2seq model, which then learns to predict active schema element indices and corresponding values. In addition, an index-picking mechanism reduces the chance of the model overfitting to specific schema descriptions. We demonstrate its superior performance
measured on benchmarks including MultiWOZ (Budzianowski et al., 2018; Zang et al., 2020; Han et al., 2021; Ye et al., 2021) and Schema-Guided Dialogue (SGD; Rastogi et al., 2020), as well as strong few- and zero-shot transfer capability to unseen tasks. We also show evidence that, under this very general setting, natural language descriptions lead to better quality over abbreviated notations.

2 Related Work

In recent years, there has been increasing interest in leveraging language prompts for data efficiency and quality improvement in dialogue modelling.

Inclusion of task descriptions: One line of research focuses on providing descriptions or instructions related to the dialogue tasks. Shah et al. (2019) utilized both slot descriptions and a small number of examples of slot values for learning slot representations for spoken language understanding. Similar to our work, Lin et al. (2021b) and Lee et al. (2021a) provided slot descriptions as extra inputs to the model and have shown quality improvement as well as zero-shot transferability. Mi et al. (2021) extended the descriptions to a more detailed format by including task instructions, constraints and prompts altogether, demonstrating the advantages of providing more sophisticated instructions to the model. However, unlike our approach, they predict slot values one by one in turn, which becomes increasingly inefficient as the number of slots increases, and is also prone to oversampling slot values since most slots are inactive at any stage during a dialogue. In contrast, our work predicts all states in a single pass, and is hence more efficient.

Prompting language models: Powerful language models like GPT (Radford et al., 2019; Brown et al., 2020) demonstrated impressive few-shot learning ability even without fine-tuning. It is therefore natural to consider leveraging these models for few-shot dialogue modeling. Madotto et al. (2020) applied GPT-2 by priming the model with examples for language understanding, state tracking, dialogue policy and language generation tasks respectively, and in Madotto et al. (2021) this approach was extended to systematically evaluate on a set of diversified tasks using GPT-3 as the backbone. Unlike these works, in which the language models are frozen, we finetune the models on downstream tasks. Budzianowski and Vulić (2019) and Peng et al. (2020), on the other hand, applied GPT-2 for few-shot and transferable response generation with given actions, whereas our work focuses mainly on state tracking.

Describe task with questions: Another line of research casts state tracking as a question answering (QA) or machine reading (MR) problem (Gao et al., 2020; Namazifar et al., 2020; Li et al., 2021; Lin et al., 2021a), in which models are provided questions about each slot and their values are predicted as answers to these questions. The models are often finetuned on extractive QA or MR datasets, and by converting slot prediction into QA pairs the models are able to perform zero-shot state tracking on dialogue datasets. Their question generation procedure, however, is more costly than using schema descriptions, which we adopt in our work.

3 Methodology

D3ST uses a seq2seq model for dialogue state tracking, and relies purely on descriptions of schema items to instruct the model.

3.1 Model

We choose to use seq2seq for modeling for the following reasons: first, seq2seq is a general and versatile architecture that can easily handle different formats of language instructions; second, seq2seq has been shown to be an effective approach for DST (Zhao et al., 2021); and third, seq2seq as a generic model architecture can be easily initialized from a publicly available pretrained checkpoint. For D3ST, we used the T5 (Raffel et al., 2020) model and the associated pretrained checkpoints of different sizes: Base (220M parameters), Large (770M parameters), and XXL (11B parameters).

3.2 Description-Driven Modeling

D3ST relies solely on schema descriptions for dialogue state tracking. An example of D3ST is provided in Figure 1.

Given a set of descriptions corresponding to slots and intents specified by a schema, let d_i^slot, i = 1 . . . N and d_j^intent, j = 1 . . . M be the descriptions for slots and intents respectively, where N and M are the numbers of slots and intents. Let u_t^usr and u_t^sys be the user and system utterances at turn t respectively.

Input The input to the encoder consists of the slot descriptions, intent descriptions, and conversation context concatenated into a single string. The slot descriptions have the following format:

    0: d_0^slot  ...  I: d_I^slot
Similarly, the intent descriptions have the following format:

    i0: d_0^intent  ...  iJ: d_J^intent

Note that 0 . . . I and i0 . . . iJ are the indices we assign to each of the slot and intent descriptions respectively. Here, "i" is a literal character to differentiate intent indices from those for slots. To prevent the model from memorizing the association between a specific index:description pair, we randomize the assignment of indices to descriptions for each example during training. Such a dynamic construction forces the model to consider descriptions rather than treating inputs as constant strings, in order to make generalizable predictions. The conversation context consists of all turns of the conversation concatenated together, with leading [user] and [sys] tokens before each user and system utterance, signalling the speaker of each utterance:

    [usr] u_0^usr [sys] u_0^sys  ...  [usr] u_T^usr [sys] u_T^sys

Output The decoder generates a sequence of dialogue states in the format

    [states] a_0^s: v_0^s  ...  a_M^s: v_M^s  [intents] a_0^i  ...  a_N^i

where a_m^s is the index of the mth active slot (there are M active slots in all), v_m^s is its corresponding value, a_n^i is the index of the nth active intent, and N is the number of active intents. This way the model learns to identify active schema elements with abstract indices, as we randomize the element order during training. Note that inactive elements are not generated.

Handling categorical slots Some slots are categorical, that is, they have pre-defined candidate values for the model to choose from. For example, "whether the hotel provides free wifi or not" could have the categorical values "yes" and "no". To improve categorical slot prediction accuracy, we enumerate possible values together with their slot descriptions. That is, assuming the ith slot is categorical and has k values v_a . . . v_k, its corresponding input format is

    i: d_i^slot ia) v_a  ...  ik) v_k

in which ia) . . . ik) are indices assigned to each of the values.[1] Assuming this slot is active with its third value (v_c) being mentioned, the corresponding prediction has the format i: ic).

[1] One may also adopt a) . . . k) as value indices or even completely discard indexing for categorical values; however, we found this shared indexing across categorical slots can sometimes cause selection ambiguity when some values (like "true" or "false") are shared by multiple categorical slots. We therefore apply slot-specific indices ia) . . . ik) to constrain index-picking within the ith slot's value range.
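To make the serialization above concrete, the following is a minimal sketch of how one training example could be assembled under this format. It is illustrative only: the class and function names (Slot, build_example) are ours, not from the paper's released code, and details such as whitespace, separator handling and the exact speaker-token spelling are assumptions.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Slot:
    description: str  # natural language description, e.g. "whether the hotel has internet"
    categorical_values: List[str] = field(default_factory=list)  # e.g. ["yes", "no"]; empty if free-form


def build_example(slots: Dict[str, Slot],
                  intents: Dict[str, str],
                  turns: List[Tuple[str, str]],     # (speaker, utterance), speaker in {"user", "system"}
                  gold_state: Dict[str, str],       # active slot name -> value
                  gold_intents: List[str],
                  rng: random.Random) -> Tuple[str, str]:
    """Serialize one dialogue turn into an (input string, target string) pair."""
    # Randomize the index assigned to each slot/intent so the model cannot
    # memorize a fixed index:description association (Section 3.2).
    slot_names = list(slots)
    intent_names = list(intents)
    rng.shuffle(slot_names)
    rng.shuffle(intent_names)

    prefix_parts = []
    value_index = {}  # (slot name, value) -> slot-specific categorical index such as "2c"
    for i, name in enumerate(slot_names):
        part = f"{i}:{slots[name].description}"
        # Categorical slots enumerate their candidate values with slot-specific sub-indices.
        for j, value in enumerate(slots[name].categorical_values):
            sub = f"{i}{chr(ord('a') + j)}"   # e.g. "0a", "0b"
            part += f" {sub}) {value}"
            value_index[(name, value)] = sub
        prefix_parts.append(part)
    for j, name in enumerate(intent_names):
        prefix_parts.append(f"i{j}:{intents[name]}")

    context = " ".join(f"[{spk}] {utt}" for spk, utt in turns)
    model_input = " ".join(prefix_parts) + " " + context

    # Target: only active slots and intents, referenced by their (randomized) indices.
    state_parts = []
    for name, value in gold_state.items():
        idx = slot_names.index(name)
        if (name, value) in value_index:          # categorical -> pick the value index
            state_parts.append(f"{idx}:{value_index[(name, value)]}")
        else:                                      # free-form -> copy the value
            state_parts.append(f"{idx}:{value}")
    intent_parts = [f"i{intent_names.index(name)}" for name in gold_intents]
    target = "[states] " + " ".join(state_parts) + " [intents] " + " ".join(intent_parts)
    return model_input, target
```

The inverse of the per-example index assignment, kept alongside each input, is what resolves a decoded string such as "[states] 2:2c [intents] i1" back to concrete schema elements.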
3.3 Properties

From the formulation described in Section 3.2, we expect our proposed approach to have the following properties. First, the model relies fully on the understanding of schema descriptions for the identification of active slots and intents. Second, the model learns to pick indices corresponding to the active slots, intents or categorical values, instead of generating these schema elements. This "index-picking" mechanism, based on schema description understanding, reduces the chance of the model memorizing training schemata and makes it easier for the model to zero-shot transfer to unseen tasks. Finally, unlike previous work which also takes advantage of schema descriptions (for example Lin et al., 2021b; Lee et al., 2021a) but generates values for each slot in turn (even if a slot is inactive), our approach predicts multiple active (and only active) slot-value pairs, together with intents, in a single decoding pass, making the inference procedure more efficient.

We also note that the sequence of schema descriptions prepended to the conversation context plays a similar role to instructions for specific tasks (Wei et al., 2021; Mishra et al., 2021). Providing more detailed human-readable descriptions enables the language model to understand task requirements better, and leads to improved few-shot performance, as will be seen in the experimental results.

4 Experiments

We design our experiments to answer the following questions:

1. What is the quality of the D3ST model when all training data is available?

2. How does the description type for schema definition, including human-readable natural descriptions, abbreviated or even random notations, affect model quality?

3. How data-efficient is D3ST in the low-resource or zero-shot regimes, and how do different description types affect efficiency?

4. How robust is the model to different wordings of the human-readable descriptions?
Model input (schema prefix + conversation context):
0:departure location of train 1:destination location of train 2:day of the train 2a) monday 2b) tuesday 2c) wednesday i1:look for a train i2:change ticket [user] i need to find a spot on a train on wednesday, can you help me find one? [system] yes i can. where are you going and what time would like to arrive or depart? [user] i'm leaving from london kings cross and going to cambridge. could you choose a train and give me the station it leaves from?

Model output (state prediction):
[states] 0:london kings cross 1:cambridge 2:2c [intents] i1

Figure 1: An example of D3ST. Red: indexed schema description sequence as prefix; Blue: conversation context; Green: state prediction sequence. See Section 3 for details. Best viewed in color.

4.1 Setup

Datasets We conduct experiments on the MultiWOZ 2.1-2.4 (Budzianowski et al., 2018; Zang et al., 2020; Han et al., 2021; Ye et al., 2021) and SGD (Rastogi et al., 2020) datasets. The MultiWOZ dataset is known to contain annotation errors in multiple places, and previous work adopted different data pre-processing procedures, so we follow the recommended procedure[2] of using the TRADE (Wu et al., 2019) script to pre-process MultiWOZ 2.1. However, we do not apply any pre-processing to 2.2-2.4 for reproducibility and fair comparison with existing results. We use Joint Goal Accuracy (JGA) as the evaluation metric, which measures the percentage of turns across all conversations for which all states are correctly predicted by the model.

[2] https://github.com/budzianowski/multiwoz#dialog-state-tracking
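JGA as described here admits a very small reference implementation. The sketch below uses our own function name and assumes each turn's belief state is represented as a mapping from slot name to value; it is not the evaluation script used in the paper and ignores dataset-specific value normalization.

```python
from typing import Dict, List


def joint_goal_accuracy(predictions: List[Dict[str, str]],
                        references: List[Dict[str, str]]) -> float:
    """Fraction of turns whose predicted belief state matches the reference exactly.

    Each list element is one dialogue turn's full state, e.g. {"hotel-internet": "yes"}.
    """
    assert len(predictions) == len(references)
    correct = sum(1 for pred, ref in zip(predictions, references) if pred == ref)
    return correct / len(references) if references else 0.0
```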
Training setup We use the open-source T5 code base[3] and the associated T5 1.1 checkpoints.[4] We consider models of the size base (250M parameters), large (800M) and XXL (11B) initialized from the corresponding pretrained checkpoints, and ran each experiment on 64 TPU v3 chips (Jouppi et al., 2017). For fine-tuning, we use batch size 32 and a constant learning rate of 1e-4 across all experiments. The input and output sequence lengths are 1024 and 512 tokens, respectively.

[3] https://github.com/google-research/text-to-text-transfer-transformer
[4] https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md
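The experiments above were run with the Google T5 codebase on TPUs. Purely as an illustration of the stated hyperparameters (batch size 32, constant learning rate 1e-4, 1024/512 input/output lengths), a roughly equivalent fine-tuning step using the public Hugging Face T5 1.1 checkpoints might look like the sketch below; the checkpoint name is illustrative, the batch construction is left out, and this is not the setup used in the paper.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-base")   # Base; Large/XXL analogous
model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-base")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)          # constant learning rate


def train_step(batch_inputs, batch_targets):
    """One gradient step on a batch of (input string, target string) pairs."""
    enc = tokenizer(batch_inputs, max_length=1024, truncation=True,
                    padding=True, return_tensors="pt")
    labels = tokenizer(batch_targets, max_length=512, truncation=True,
                       padding=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100                  # ignore padding in the loss
    loss = model(input_ids=enc.input_ids,
                 attention_mask=enc.attention_mask,
                 labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```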
Descriptions We use the slot and intent descriptions included in the original MultiWOZ and SGD datasets as inputs to the model (the d_i^slot and d_i^intent described in Section 3.2). For MultiWOZ, we include schema descriptions across all domains as the model prefix and set the input length limit to 2048. To avoid ambiguity between descriptions from different domains, we also add domain names as part of the descriptions. For example, for the hotel-parking slot, the description is "hotel-parking facility at the hotel". For SGD, we include descriptions from domains relevant to each turn, as suggested by the standard evaluation.

4.2 Main Results

Table 1 gives the model quality when the entire training datasets are used for fine-tuning. We show that D3ST is close to, or at, the state of the art across all benchmarks, illustrating the effectiveness of the proposed approach. We also see that increasing the model size significantly improves the quality.

Note however that not all results are directly comparable, and we discuss some notable incongruities. The best result on SGD is from paDST, but this model has significant advantages: paDST uses a data augmentation procedure by back-translating between English and Chinese, as well as special handcrafted rules for model predictions. In contrast, our models only train on the default SGD dataset, and do not apply any handcrafted rules whatsoever. While paDST has significantly higher JGA compared to the similarly-sized D3ST Large, D3ST XXL is on par, making up for its lack of data augmentation and handcrafted rules with a much larger model.

One other notable comparison can be made. DaP also relies on slot descriptions and is finetuned from a T5 Base model, making it directly comparable to our D3ST Base model, which exhibits better performance on SGD and MultiWOZ. One additional advantage of D3ST is that it predicts all slots at
Model Pretrain. Model (# Params.) MW2.1 MW2.2 MW2.3 MW2.4
Transformer-DST (Zeng and Nie, 2021) BERT Base (110M) 55.35 - - -
SOM-DST (Kim et al., 2020) BERT Base (110M) 51.2 - 55.5 66.8
TripPy (Heck et al., 2020) BERT Base (110M) 55.3 - 63.0 59.6
SAVN (Wang et al., 2020) BERT Base (110M) 54.5 - 58.0 60.1
SimpleTOD^H (Hosseini-Asl et al., 2020) DistilGPT-2 (82M) 50.3/55.7 - 51.3 -
Seq2seq (Zhao et al., 2021) T5 Base (220M) 52.8 57.6 59.3 67.1
DaP (seq) (Lee et al., 2021a) T5 Base (220M) - 51.2 - -
DaP (ind) (Lee et al., 2021a) T5 Base (220M) 56.7 57.6 - -
D3ST (Base) T5 Base (220M) 54.2 56.1 59.1 72.1
D3ST (Large) T5 Large (770M) 54.5 54.2 58.6 70.8
D3ST (XXL) T5 XXL (11B) 57.8 58.7 60.8 75.9
(a) JGA on MultiWOZ 2.1-2.4.

Model Pretrain. Model (# Params.) JGA Intent Req slot


SGD baseline (Rastogi et al., 2020) BERT Base (110M) 25.4 90.6 96.5
DaP (ind) (Lee et al., 2021a) T5 Base (220M) 71.8 90.2 97.8
SGP-DST (Ruan et al., 2020) T5 Base (220M) 72.2 91.8 99.0
paDST^n (Ma et al., 2020) XLNet Large (340M) 86.5 94.8 98.5
D3ST (Base) T5 Base (220M) 72.9 97.2 98.9
D3ST (Large) T5 Large (770M) 80.0 97.1 99.1
D3ST (XXL) T5 XXL (11B) 86.4 98.8 99.4
(b) JGA, active intent accuracy and requested slot F1 on SGD.

Table 1: Results on MultiWOZ and SGD datasets with full training data. "-" indicates no public number is available. Best results are marked in bold. ^H: SimpleTOD results are retrieved from the MultiWOZ 2.3 website (https://github.com/lexmen318/MultiWOZ-coref), in which two numbers are reported for 2.1 (one produced by the 2.3 author, the other by the original SimpleTOD paper). ^F: No data pre-processing applied for MultiWOZ 2.1. ^n: Data augmentation and special rules applied.

once in a single inference pass. In contrast, the independent (ind) decoding variant of DaP does inference once for every slot, similar to most other baselines, and is thus far less efficient. This is not scalable in TOD, especially with schemata becoming increasingly large in terms of the number of slots, intents, and domains.

4.3 Comparison of Description Types

We now study whether the quality of D3ST is sensitive to the schema description type. For this, we run the same experiment as in Section 4.2 with D3ST Large and XXL, but using three different types of descriptions: human-readable language descriptions, schema element names (abbreviations) as defined in the original schema, and random strings. The random string descriptions are generated by simply randomly permuting the character sequences of the original element names. This experiment is designed to check how a model with only memorization capability, without any understanding of schema element semantics, does on seen and unseen schemas. An example of all three description types can be found in Appendix A.
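As a concrete reference for how the three variants relate to one another, the sketch below derives them from a single schema element. The function name is ours; the only step actually specified above is the character permutation used for the random variant.

```python
import random


def description_variants(element_name: str, language_description: str,
                         rng: random.Random) -> dict:
    """Return the three prompt variants compared in Table 2."""
    return {
        "Language": language_description,   # e.g. "whether the hotel has internet"
        "Name": element_name,                # e.g. "hotel-internet"
        # Random: a permutation of the characters of the element name,
        # so it carries no recoverable semantics.
        "Random": "".join(rng.sample(element_name, len(element_name))),
    }
```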

Type      Model  M2.1  M2.2  M2.3  M2.4  SGD
Language  large  54.5  55.9  58.6  70.8  80.0
Language  XXL    57.8  58.7  60.8  75.9  86.4
Name      large  55.1  55.8  59.6  72.2  73.7
Name      XXL    57.5  57.9  60.4  75.4  79.7
Random    large  20.1   9.0  12.1  16.9  37.4
Random    XXL    57.6  56.1  59.3  73.6  64.8

Table 2: Comparison between D3ST models using different types of descriptions on MultiWOZ and SGD. "Language", "Name" and "Random" correspond to using a detailed language description, the schema element name, and random strings respectively. Each type contains two rows, corresponding to the results given by "large" and "XXL" models. Note that the "Random" experiments for "large" models had trouble converging, and we instead report their JGA at 85k steps.

Table 2 compares the performance with different description types. It can be seen that using language descriptions consistently outperforms other types, aligned with our expectation that natural and human-readable descriptions contain richer semantics and are aligned with the pretraining objective, enabling the LM to perform better. Element names are less readable than full descriptions, but still retain some semantics: they perform well but fall short of full descriptions. On the other hand, using random strings performs worst on average, even on MultiWOZ where the training and test schemas are the same (and the model is allowed to memorize descriptions from training). With random strings, there is the extra challenge of identifying the correct slot id for each value to predict, since each example has a random shuffling of the slot ids. Indeed, we observed that training "large" models on random names was hard to converge, and instead of reporting their final results, we stopped these experiments early and reported their JGA at 85k steps. The XXL models did not encounter the same issue; we suspect that it was easier for larger models to memorize slot name permutations.

In contrast to MultiWOZ, SGD requires models to generalize to unseen tasks and domains in the evaluation datasets. Here, using random strings undermined quality significantly. In general, meaningless inputs hurt performance and lead to less generalization. We therefore suggest instructing the model with semantically rich representations, in particular, language descriptions.

One more observation we make is that, on large MultiWOZ models, using element names had better JGA than using a full language description. This trend does not hold on SGD, and also reverses when trained with XXL. We hypothesize that this is a result of input sequence length: on MultiWOZ we feed slot descriptions from all domains as prefix, and when the full language description is utilized, the input sequence becomes excessively long. Using element names shortens the length, making a moderate-size model easier to learn. In contrast, input sequence lengths on SGD are lower than those on MultiWOZ, since only active domains are provided as part of the input.

4.4 Data Efficiency

Properly designed prefixes or prompts have been shown to significantly improve an LM's data efficiency (Radford et al., 2019; Liu et al., 2021; Wei et al., 2021). We investigate how different types of description prefixes vary in performance in low-resource regimes by running experiments with large and XXL models on SGD with 0.16% (10-shot), 1%, and 10% of the training data. For the 0.16% experiment, we randomly select 10 samples from each training domain to increase the domain diversity, totalling 260 examples. For other experiments the samples are uniformly sampled across the entire training set. We sample from three random seeds for each experiment.

Type      Model  0.18%        1%           10%
Language  large   6.1 ± 0.7   36.7 ± 2.0   73.1 ± 0.2
Language  XXL    51.0 ± 0.2   79.4 ± 0.4   83.0 ± 0.1
Name      large   5.0 ± 0.2   28.0 ± 2.7   69.7 ± 0.3
Name      XXL    47.7 ± 0.5   74.9 ± 1.4   78.6 ± 0.7

Table 3: Data efficiency of D3ST using natural language and element name descriptions, trained and evaluated on SGD. Each description type contains two rows, corresponding to the results given by "large" and "XXL" models. The metric is JGA.

The results are given in Table 3. From the table we have the following observations:

• Using human-readable language descriptions consistently outperforms other types of representations, indicating better data efficiency with semantically-rich descriptions.

• With just 0.18% of the data, XXL models can already reach more than half of their full quality (from Table 1). At 1%, we observe quality close to using 100% data. Increasing to 10% only yielded marginal gains.

• Larger models are much more data efficient than smaller ones, as can be seen from the big gap between "large" and "XXL" models.

4.5 Zero-shot Transfer to Unseen Tasks

To assess our approach's zero-shot transfer ability to unseen tasks, we conduct the following set of experiments:

MultiWOZ cross-domain transfer Following a setup similar to TransferQA (Lin et al., 2021a) and T5DST (Lin et al., 2021b), we run the "leave-one-out" cross-domain zero-shot transfer evaluation on MultiWOZ 2.1.[5] For each domain, we train a model on examples excluding that domain, and evaluate it on examples including it. Table 4a shows our results in comparison with the baselines.[6] It can be seen that our approach achieves the best cross-domain transfer performance, with significant gains across almost all domains.

[5] For zero-shot evaluation, Lin et al. (2021a) and Lin et al. (2021b) experimented on MultiWOZ 2.1 and 2.0 respectively. While our models are trained and evaluated on MultiWOZ 2.1, we include results from both of them for comparison.

[6] When skipping the train domain, we postprocess predictions for the slots train-departure and train-destination by ignoring the suffix "train station". This is semantically correct and improves JGA.
SGD unseen service transfer The SGD benchmark contains numerous services, and some domains are only present in the test set. We present the results for zero-shot transfer to these domains and services in Table 4b. Note that D3ST Base has worse JGA on unseen domains when fairly compared to DaP and SGP-DST. However, D3ST has superlative JGA on seen domains, even better than paDST (with data augmentation and hand-crafted rules). In addition, increasing the size of D3ST further increases both seen and especially unseen JGA, indicating better generalization. At XXL, JGA on unseen domains is almost equal to paDST.

Cross-dataset transfer In this setup, we evaluate if a model trained on one dataset can be directly applied to another dataset. To this end, we train a model on SGD then directly evaluate on the MultiWOZ 2.4 test set, and vice versa.[7] In both cases we use the XXL model from Section 4.2, and report the numbers in Table 4c.

Despite obvious schema differences and domain mismatch between MultiWOZ and SGD, our model trained on MultiWOZ already achieves zero-shot quality on SGD close to the BERT baseline (Rastogi et al., 2020) with 25.4% JGA. Our model trained on SGD and evaluated on MultiWOZ shows similarly strong zero-shot results. Both results are much lower than the state of the art for both datasets, however, due to differing biases defined in schemata between the two datasets, and from latent knowledge that isn't captured from a schema alone.

[7] Note that the SGD dataset defines the services that will occur in each dialogue, whereas MultiWOZ expects models to be able to predict any of its domains for all dialogues. To make it compatible between SGD and MultiWOZ for cross-task zero-shot transfer, we limit the schema prefix for MultiWOZ to domains that appear in the current dialogue.

Qualitative Evaluation In addition to quantitatively evaluating zero-shot transfer, we qualitatively examined examples of D3ST transferring to novel domains. We handcrafted a few dialogues for domains very different from the ones seen in the SGD dataset (e.g. conference submission, internet provider, e-commerce retailer). We designed the dialogues to be as stylistically realistic as possible for customer service scenarios. We tasked the XXL model trained on SGD (from Table 1) with inferring their dialogue states, and share one example in Table 5. More examples can be found in Table A2 of Appendix B. We observe that the model performs surprisingly well across all of our handcrafted dialogues, even though the domains are very different from the training data.

(a) Cross-domain (leave-one-out) transfer on MultiWOZ (JGA).
Domain      D3ST  TransferQA  T5DST
Attraction  56.4  31.3        33.1
Hotel       21.8  22.7        21.2
Restaurant  38.2  26.3        21.7
Taxi        78.4  61.9        64.6
Train       38.7  36.7        35.4
Avg         46.7  35.8        35.2

(b) JGA on seen versus unseen services for SGD. ^s and ^n have the same meaning as in Table 1.
Model         Overall  Seen  Unseen
SGD Baseline  25.4     41.2  20.0
DaP (ind)     71.8     83.3  68.0
SGP-DST       72.2     87.9  66.9
Team14^s      77.3     90.0  73.0
paDST^n       86.5     92.4  84.6
D3ST (base)   72.9     92.5  66.4
D3ST (large)  80.0     93.8  75.4
D3ST (XXL)    86.4     95.8  83.3

(c) Cross-dataset transfer between SGD and MultiWOZ 2.4.
Transfer       JGA
SGD→MultiWOZ   28.9
MultiWOZ→SGD   23.1

Table 4: Zero-shot transfer evaluation results from three different setups.

4.6 Robustness to Variations of Descriptions

Since there are many ways to provide descriptions for a given schema, a natural question to raise about this approach is how robust the model is against different choices of descriptions. The recently proposed SGD-X benchmark (Lee et al., 2021b) is designed specifically for the study of this problem. SGD-X contains five variations of the original SGD, each one using a different set of schema descriptions provided by different crowd-source workers. To assess the robustness of D3ST, we use the large and XXL models evaluated in Section 4.2 and decode test sets from each of the five variants of SGD-X. A robust model is expected to have smaller fluctuations in predictions across schema variants for the same dialogue context, as measured by the Schema Sensitivity SS(JGA) defined in Lee et al. (2021b), which calculates the average variation coefficient of JGA at the turn level. A lower SS(JGA) value implies less fluctuation and more robustness.
Domain: Conference Submission

Input: 0:name of the conference 1:title of the paper 2:the first author of the paper 3:research areas for the paper 4:email for openreview account i1:submit a paper to a conference i2:check if a paper has been accepted [user] hi, i'd like to submit a paper for a conference [system] that's great. which conference would you like to submit to? [user] i'd like to submit to acl 2022 [system] ok. could you share the title of your paper and the name of your first author? [user] the paper is "description-driven task-oriented dialog modeling", and the first author is grace hopper [system] great, thank you. note that this year, we require all paper authors to be registered on openreview. could you give the email for your openreview account? [user] sure, its gracehopper@gmail.com

Prediction: [states] 0:acl 2022 1:description-driven task-oriented dialog modeling 2:grace hopper 4:gracehopper@gmail.com [intents] i1

Table 5: An example of D3ST performing zero-shot transfer to a hypothetical "Conference Submission" domain. The predicted dialogue state is entirely correct. Boldface and color were added for visual clarity.

We compare the robustness of models using different prompt types in Table 6. From the numbers we see that using the most human-readable natural language descriptions not only achieves the highest average accuracy over all SGD-X test set variants, but also enjoys the smallest SS(JGA) at the same model size. This indicates that description-driven models are more robust. On the other hand, using element names and random names gives progressively lower mean accuracy and higher sensitivity to schema changes.

(a) Natural language description
Size   Orig  v1    v2    v3    v4    v5    Avg v1-5  SS(JGA)
large  80.0  79.9  79.4  76.5  71.9  69.1  75.3      0.26
XXL    86.4  85.5  85.1  73.9  75.5  68.9  77.8      0.27

(b) Element name description
Size   Orig  v1    v2    v3    v4    v5    Avg v1-5  SS(JGA)
large  73.7  72    69.5  66.4  61.1  65.7  66.9      0.37
XXL    79.7  80.8  76.6  74.2  61.2  72.3  73.0      0.35

(c) Random description
Size   Orig  v1    v2    v3    v4    v5    Avg v1-5  SS(JGA)
large  37.4  29.3  34.6  28.0  25.2  25.0  28.4      0.74
XXL    64.8  67.8  68.8  72.9  58.1  68.1  67.1      0.51

Table 6: Robustness comparison for various description types. SS(JGA) refers to schema sensitivity for JGA.

5 Conclusion

We advocate using human-readable language descriptions in place of abbreviated or arbitrary notations for schema definition in TOD modeling. We believe this schema representation contains more meaningful information for a strong LM to leverage, leading to better performance and improved data efficiency. To this end, we propose a simple and effective DST model named "Description-Driven Dialogue State Tracking" (D3ST), which relies fully on schema descriptions and an index-picking mechanism to indicate active slots or intents. Our experiments verify the effectiveness of description-driven dialogue modeling in the following ways. First, D3ST achieves superior quality on MultiWOZ and SGD. Second, using language descriptions outperforms abbreviations or arbitrary notations. Third, the description-driven approach improves data efficiency and enables effective zero-shot transfer to unseen tasks and domains. Fourth, using language for schema description improves model robustness as measured by the SGD-X benchmark.

References

Baolin Peng, Chenguang Zhu, Chunyuan Li, Xiujun Li, Jinchao Li, Michael Zeng, and Jianfeng Gao. 2020. Few-shot natural language generation for task-oriented dialog.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.

Paweł Budzianowski and Ivan Vulić. 2019. Hello, it's GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 15–22, Hong Kong. Association for Computational Linguistics.

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026, Brussels, Belgium. Association for Computational Linguistics.
Xinya Du, Luheng He, Qi Li, Dian Yu, Panupong Pasupat, and Yuan Zhang. 2021. QA-driven zero-shot slot filling with weak supervision pretraining. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 654–664, Online. Association for Computational Linguistics.

Shuyang Gao, Sanchit Agarwal, Di Jin, Tagyoung Chung, and Dilek Hakkani-Tur. 2020. From machine reading comprehension to dialogue state tracking: Bridging the gap. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pages 79–89, Online. Association for Computational Linguistics.

Ting Han, Ximing Liu, Ryuichi Takanobu, Yixin Lian, Chongxuan Huang, Dazhen Wan, Wei Peng, and Minlie Huang. 2021. MultiWOZ 2.3: A multi-domain task-oriented dialogue dataset enhanced with annotation corrections and co-reference annotation.

Michael Heck, Carel van Niekerk, Nurul Lubis, Christian Geishauser, Hsien-Chin Lin, Marco Moresi, and Milica Gasic. 2020. TripPy: A triple copy strategy for value independent neural dialog state tracking. In Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 35–44, 1st virtual meeting. Association for Computational Linguistics.

Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, and Richard Socher. 2020. A simple language model for task-oriented dialogue.

Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, et al. 2017. In-datacenter performance analysis of a tensor processing unit. SIGARCH Comput. Archit. News, 45(2):1–12.

Sungdong Kim, Sohee Yang, Gyuwan Kim, and Sang-Woo Lee. 2020. Efficient dialogue state tracking by selectively overwriting memory.

Chia-Hsuan Lee, Hao Cheng, and Mari Ostendorf. 2021a. Dialogue state tracking with a language model using schema-driven prompting. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Harrison Lee, Raghav Gupta, Abhinav Rastogi, Yuan Cao, Bin Zhang, and Yonghui Wu. 2021b. SGD-X: A benchmark for robust generalization in schema-guided dialogue systems.

Shuyang Li, Jin Cao, Mukund Sridhar, Henghui Zhu, Shang-Wen Li, Wael Hamza, and Julian McAuley. 2021. Zero-shot generalization in dialog state tracking through generative question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1063–1074, Online. Association for Computational Linguistics.

Zhaojiang Lin, Bing Liu, Andrea Madotto, Seungwhan Moon, Paul Crook, Zhenpeng Zhou, Zhiguang Wang, Zhou Yu, Eunjoon Cho, Rajen Subba, and Pascale Fung. 2021a. Zero-shot dialogue state tracking via cross-task transfer.

Zhaojiang Lin, Bing Liu, Seungwhan Moon, Paul Crook, Zhenpeng Zhou, Zhiguang Wang, Zhou Yu, Andrea Madotto, Eunjoon Cho, and Rajen Subba. 2021b. Leveraging slot descriptions for zero-shot cross-domain dialogue state tracking. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5640–5648, Online. Association for Computational Linguistics.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing.

Yue Ma, Zengfeng Zeng, Dawei Zhu, Xuan Li, Yiying Yang, Xiaoyuan Yao, Kaijie Zhou, and Jianping Shen. 2020. An end-to-end dialogue state tracking system with machine reading comprehension and wide & deep classification.

Andrea Madotto, Zhaojiang Lin, Genta Indra Winata, and Pascale Fung. 2021. Few-shot bot: Prompt-based learning for dialogue systems.

Andrea Madotto, Zihan Liu, Zhaojiang Lin, and Pascale Fung. 2020. Language models as few-shot learner for task-oriented dialogue systems.

Fei Mi, Yitong Li, Yasheng Wang, Xin Jiang, and Qun Liu. 2021. CINS: Comprehensive instruction for few-shot learning in task-oriented dialog systems.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2021. Cross-task generalization via natural language crowdsourcing instructions.

Mahdi Namazifar, Alexandros Papangelis, Gokhan Tur, and Dilek Hakkani-Tür. 2020. Language model is all you need: Natural language understanding as question answering.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2020. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8689–8696.

Yu-Ping Ruan, Zhen-Hua Ling, Jia-Chen Gu, and Quan Liu. 2020. Fine-tuning BERT for schema-guided zero-shot dialogue state tracking.

Darsh Shah, Raghav Gupta, Amir Fayazi, and Dilek Hakkani-Tur. 2019. Robust zero-shot cross-domain slot filling with example values. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5484–5490, Florence, Italy. Association for Computational Linguistics.

Yexiang Wang, Yi Guo, and Siqi Zhu. 2020. Slot attention with value normalization for multi-domain dialogue state tracking. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3019–3028, Online. Association for Computational Linguistics.

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2021. Finetuned language models are zero-shot learners.

Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019. Transferable multi-domain state generator for task-oriented dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 808–819, Florence, Italy. Association for Computational Linguistics.

Fanghua Ye, Jarana Manotumruksa, and Emine Yilmaz. 2021. MultiWOZ 2.4: A multi-domain task-oriented dialogue dataset with essential annotation corrections to improve state tracking evaluation.

Xiaoxue Zang, Abhinav Rastogi, Srinivas Sunkara, Raghav Gupta, Jianguo Zhang, and Jindong Chen. 2020. MultiWOZ 2.2: A dialogue dataset with additional annotation corrections and state tracking baselines. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pages 109–117, Online. Association for Computational Linguistics.

Yan Zeng and Jian-Yun Nie. 2021. Jointly optimizing state operation prediction and value generation for dialogue state tracking.

Jeffrey Zhao, Mahdis Mahdieh, Ye Zhang, Yuan Cao, and Yonghui Wu. 2021. Effective sequence-to-sequence dialogue state tracking. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7486–7493, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

A Example of Description Types

An example of the different description types for a single example can be found in Table A1.

B Zero-shot Transfer to Novel Domains

Qualitative examples showcasing zero-shot transfer to novel domains can be found in Table A2.
Language:
0:playback device on which the song is to be played 0a) bedroom speaker 0b) tv 0c) kitchen speaker 1:name of the artist the song is performed by 2:name of the song 3:album the song belongs to 4:genre of the song i0:search for a song based on the name and optionally other attributes i1:play a song by its name and optionally artist [user] i want to find a movie. [system] what is your location. [user] santa rosa. i want to see it at 3rd street cinema. [system] i found 3 movies. does hellboy, how to train your dragon: the hidden world or the upside interest you? [user] how to train your dragon: the hidden world is perfect. can you find me some songs from the album summer anthems. [system] i found 1 song you may like. how about no other love from the album summer anthems by common kings? [user] that would be great. [system] play the song now? [user] play it on the bedroom device.

Name:
0:music_2-genre 1:music_2-playback_device 1a) bedroom speaker 1b) kitchen speaker 1c) tv 2:music_2-album 3:music_2-artist 4:music_2-song_name i0:music_2-playmedia i1:music_2-lookupmusic [user] i want to find a movie. [system] what is your location. [user] santa rosa. i want to see it at 3rd street cinema. [system] i found 3 movies. does hellboy, how to train your dragon: the hidden world or the upside interest you? [user] how to train your dragon: the hidden world is perfect. can you find me some songs from the album summer anthems. [system] i found 1 song you may like. how about no other love from the album summer anthems by common kings? [user] that would be great. [system] play the song now? [user] play it on the bedroom device.

Random:
0:e-e_ciugs2mrn 1:psuekc_l-2imceyibaca_dv 1a) bedroom speaker 1b) kitchen speaker 1c) tv 2:umm2uisc_bal- 3:satriti_2-sumc 4:_-onassng2_cemmui i0:aeusmmci2-adipl_y i1:miiu_2olosckucp-ums [user] i want to find a movie. [system] what is your location. [user] santa rosa. i want to see it at 3rd street cinema. [system] i found 3 movies. does hellboy, how to train your dragon: the hidden world or the upside interest you? [user] how to train your dragon: the hidden world is perfect. can you find me some songs from the album summer anthems. [system] i found 1 song you may like. how about no other love from the album summer anthems by common kings? [user] that would be great. [system] play the song now? [user] play it on the bedroom device.

States:
[states] 1:1a 2:summer anthems 4:no other love [intents] i0

Table A1: Examples of the same SGD dialogue with different description types. "Language" uses a detailed natural language description, "Name" uses the schema element name, and "Random" is generated from a random shuffling of the slot name. Note that the categorical slot value enumeration is unaffected in "Random", and that all three description types would have the same target slots and intents.
Domain: Internet Provider

Inputs: 0:email address of the account 1:whether professional help is needed for internet installation 1a) true 1b) false 2:whether to bundle services on the same plan 2a) true 2b) false 3:download speed of the internet plan 4:whether services are for residential or business use 4a) residential 4b) business 5:the address to provide services to i0:buy or change an internet plan i1:file a formal complaint [user] hi there - my internet contract is up for renewal, and i'm interested in exploring other plan options. [system] happy to help. is this for your home or for a business? [user] home [system] what's the email associated with your account? [user] noamchomsky@hotmail.com [system] thanks. your current plan is 25 mbps download speed for $53 / month. the two other plans are 50 mbps for $63 / month and 100 mbps for $73 / month. would you interested in either of those? [user] i'm interested in upgrading to the 50 mbps plan. [system] great. for $10 / month more, would you like to include our basic cable plan? [user] no thanks. i'll need to talk this over with my partner. thank you for your help.

States: [states] 0:noamchomsky@hotmail.com 3:50 mbps 4:4a [intents] i0

Domain: E-Commerce Retailer

Inputs: 0:phone number associated with the customer's account 1:a coupon code to apply to the purchase 2:the reason for the product return 2a) accidental purchase 2b) malfunction 2c) preference 3:the retail product to purchase or to be returned 4:date the product was purchased 5:identifier associated with the purchase i0:return a product i1:purchase a product [system] hi how can i help you today? [user] hello - i recently purchased a glow in the dark ball that i'd like to return. [system] no problem. i'm happy to help. can you provide the order number or date of purchase please? [user] 1ozdl3v260lkq, and i purchased it last week on nov 1, 2021 [system] thanks. and what's the reason for the return? [user] the ball seems to be broken. it doesn't actually glow in the dark. [system] sorry to hear about that. we'll process the return and you should receive a refund within 10 business days. is there anything else i can do for you? [user] no, thanks for your help!

States: [states] 2:2b 3:glow in the dark ball 4:nov 1, 2021 5:1ozdl3v260lkq [intents] i0

Table A2: Two more examples of D3ST trained on SGD performing zero-shot transfer to novel domains. The only error is in the "Internet Provider" example, where the model misses that the slot for "whether to bundle services on the same plan" should be false. We hypothesize that "bundle" is industry jargon that the model fails to associate with the dialogue context.
