2201.08904
Jeffrey Zhao, Raghav Gupta, Yuan Cao, Dian Yu, Mingqiu Wang,
Harrison Lee, Abhinav Rastogi, Izhak Shafran, Yonghui Wu
Google Research
{jeffreyzhao, raghavgupta, yuancao}@google.com
in which ia) . . . ik) are indices assigned to each of the values.1 Assuming this slot is active with its third value (vc) being mentioned, the corresponding prediction has the format i:ic).

4. How robust is the model to different wordings of the human-readable descriptions?

1 One may also adopt a) . . . k) as value indices, or even discard indexing for categorical values entirely; however, we found that sharing indices across categorical slots can cause selection ambiguity when some values (like "true" or "false") are shared by multiple categorical slots. We therefore apply slot-specific indices ia) . . . ik) to constrain index-picking within the ith slot's value range.
Schema description prefix: 0:departure location of train 1:destination location of train 2:day of the train 2a) monday 2b) tuesday 2c) wednesday i1:look for a train i2:change ticket
Conversation context: [user] i need to find a spot on a train on wednesday, can you help me find one? [system] yes i can. where are you going and what time would like to arrive or depart? [user] i'm leaving from london kings cross and going to cambridge. could you choose a train and give me the station it leaves from?
State prediction (seq2seq output): [states] 0:london kings cross 1:cambridge 2:2c [intents] i1

Figure 1: An example of D3ST. Red: indexed schema description sequence as prefix; Blue: conversation context; Green: state prediction sequence. See Section 3 for details. Best viewed in color.
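To make the input/target layout concrete, the following minimal sketch assembles the Figure 1 example programmatically. The slot, value, and intent text and the conversation come from the figure; the build_example helper and its argument layout are our own illustrative assumptions, not code released with the paper.

    # Illustrative sketch of the D3ST source/target format (Figure 1).
    def build_example(slots, intents, turns, states, active_intent):
        """slots: list of (description, categorical_values); slot indices are positional."""
        prefix = []
        for i, (desc, values) in enumerate(slots):
            item = f"{i}:{desc}"
            if values:  # categorical slot: values get slot-prefixed letter indices ia) ib) ic) ...
                item += " " + " ".join(f"{i}{chr(ord('a') + j)}) {v}" for j, v in enumerate(values))
            prefix.append(item)
        prefix += [f"i{j + 1}:{d}" for j, d in enumerate(intents)]  # intents indexed i1, i2, ...
        context = " ".join(f"[{speaker}] {utt}" for speaker, utt in turns)
        source = " ".join(prefix) + " " + context
        target = ("[states] " + " ".join(f"{i}:{v}" for i, v in states)
                  + f" [intents] {active_intent}")
        return source, target

    source, target = build_example(
        slots=[("departure location of train", []),
               ("destination location of train", []),
               ("day of the train", ["monday", "tuesday", "wednesday"])],
        intents=["look for a train", "change ticket"],
        turns=[("user", "i need to find a spot on a train on wednesday, can you help me find one?"),
               ("system", "yes i can. where are you going and what time would like to arrive or depart?"),
               ("user", "i'm leaving from london kings cross and going to cambridge. could you "
                        "choose a train and give me the station it leaves from?")],
        states=[(0, "london kings cross"), (1, "cambridge"), (2, "2c")],
        active_intent="i1",
    )
    # target == "[states] 0:london kings cross 1:cambridge 2:2c [intents] i1"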
4.1 Setup

Datasets We conduct experiments on the MultiWOZ 2.1-2.4 (Budzianowski et al., 2018; Zang et al., 2020; Han et al., 2021; Ye et al., 2021) and SGD (Rastogi et al., 2020) datasets. The MultiWOZ dataset is known to contain annotation errors in multiple places, and previous work adopted different data pre-processing procedures, so we follow the recommended procedure2 of using the TRADE (Wu et al., 2019) script to pre-process MultiWOZ 2.1. However, we do not apply any pre-processing to 2.2-2.4, for reproducibility and fair comparison with existing results. We use Joint Goal Accuracy (JGA) as the evaluation metric, which measures the percentage of turns across all conversations for which all states are correctly predicted by the model.

Training setup We use the open-source T5 code base3 and the associated T5 1.1 checkpoints.4 We consider models of size Base (250M parameters), Large (800M), and XXL (11B), initialized from the corresponding pretrained checkpoints, and ran each experiment on 64 TPU v3 chips (Jouppi et al., 2017). For fine-tuning, we use a batch size of 32 and a constant learning rate of 1e-4 across all experiments. The input and output sequence lengths are 1024 and 512 tokens, respectively.

Descriptions We use the slot and intent descriptions included in the original MultiWOZ and SGD datasets as inputs to the model (d^slot_i and d^int_i described in Section 3.2). For MultiWOZ, we include schema descriptions across all domains as the model prefix and set the input length limit to 2048. To avoid ambiguity between descriptions from different domains, we also add domain names as part of the descriptions; for example, for the hotel-parking slot, the description is "hotel-parking facility at the hotel". For SGD, we include descriptions from the domains relevant to each turn, as suggested by the standard evaluation.

4.2 Main Results

Table 1 gives model quality when the entire training datasets are used for fine-tuning. D3ST is close to, or at, the state of the art across all benchmarks, illustrating the effectiveness of the proposed approach. We also see that increasing the model size significantly improves quality.

Note, however, that not all results are directly comparable, and we discuss some notable incongruities. The best result on SGD is from paDST, but this model has significant advantages: paDST uses a data augmentation procedure based on back-translation between English and Chinese, as well as special handcrafted rules for model predictions. In contrast, our models train only on the default SGD dataset and do not apply any handcrafted rules whatsoever. While paDST has significantly higher JGA than the similarly-sized D3ST Large, D3ST XXL is on par, making up for its lack of data augmentation and handcrafted rules with a much larger model.

One other notable comparison can be made. DaP also relies on slot descriptions and is fine-tuned from a T5 Base model, making it directly comparable to our D3ST Base model, which exhibits better performance on SGD and MultiWOZ. One additional advantage of D3ST is that it predicts all slots at once.

2 https://github.com/budzianowski/multiwoz#dialog-state-tracking
3 https://github.com/google-research/text-to-text-transfer-transformer
4 https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md
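As a concrete reference for the JGA metric defined in Section 4.1: a turn counts as correct only if every slot in the predicted state exactly matches the reference. The following minimal sketch computes JGA assuming states are represented as slot-to-value dictionaries per turn; this representation is our own assumption, not the evaluation script used in the paper.

    def joint_goal_accuracy(predictions, references):
        """JGA: fraction of turns whose predicted state matches the reference exactly.

        predictions, references: lists (one entry per turn, across all conversations)
        of {slot_name: value} dicts. A turn scores 1 only if all slots agree.
        """
        assert len(predictions) == len(references)
        correct = sum(1 for pred, ref in zip(predictions, references) if pred == ref)
        return correct / len(references)

    # Example: 2 of 3 turns have every slot correct, so JGA = 2/3.
    preds = [{"train-day": "wednesday"},
             {"train-day": "wednesday", "train-destination": "cambridge"},
             {"train-day": "monday"}]
    refs = [{"train-day": "wednesday"},
            {"train-day": "wednesday", "train-destination": "cambridge"},
            {"train-day": "tuesday"}]
    print(joint_goal_accuracy(preds, refs))  # 0.666...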
Model | Pretrained Model (# Params.) | MW2.1 | MW2.2 | MW2.3 | MW2.4
Transformer-DST (Zeng and Nie, 2021) | BERT Base (110M) | 55.35 | - | - | -
SOM-DST (Kim et al., 2020) | BERT Base (110M) | 51.2 | - | 55.5 | 66.8
TripPy (Heck et al., 2020) | BERT Base (110M) | 55.3 | - | 63.0 | 59.6
SAVN (Wang et al., 2020) | BERT Base (110M) | 54.5 | - | 58.0 | 60.1
SimpleTOD^H (Hosseini-Asl et al., 2020) | DistilGPT-2 (82M) | 50.3/55.7 | - | 51.3 | -
Seq2seq (Zhao et al., 2021) | T5 Base (220M) | 52.8 | 57.6 | 59.3 | 67.1
DaP (seq) (Lee et al., 2021a) | T5 Base (220M) | - | 51.2 | - | -
DaP (ind) (Lee et al., 2021a) | T5 Base (220M) | 56.7 | 57.6 | - | -
D3ST (Base) | T5 Base (220M) | 54.2 | 56.1 | 59.1 | 72.1
D3ST (Large) | T5 Large (770M) | 54.5 | 54.2 | 58.6 | 70.8
D3ST (XXL) | T5 XXL (11B) | 57.8 | 58.7 | 60.8 | 75.9
(a) JGA on MultiWOZ 2.1-2.4.

Table 1: Results on MultiWOZ and SGD datasets with full training data. "-" indicates no public number is available. Best results are marked in bold. H: SimpleTOD results are retrieved from the 2.3 website https://github.com/lexmen318/MultiWOZ-coref, in which two numbers are reported for 2.1 (one produced by the 2.3 author, the other by the original SimpleTOD paper). F: No data pre-processing applied for MultiWOZ 2.1. n: Data augmentation and special rules applied.
Harrison Lee, Raghav Gupta, Abhinav Rastogi, Yuan Cao, Bin Zhang, and Yonghui Wu. 2021b. SGD-X: A benchmark for robust generalization in schema-guided dialogue systems.

Shuyang Li, Jin Cao, Mukund Sridhar, Henghui Zhu, Shang-Wen Li, Wael Hamza, and Julian McAuley. 2021. Zero-shot generalization in dialog state tracking through generative question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1-67.

Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2020. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8689-8696.

Yu-Ping Ruan, Zhen-Hua Ling, Jia-Chen Gu, and Quan Liu. 2020. Fine-tuning BERT for schema-guided zero-shot dialogue state tracking.

A Example of Description Types

An example of the different description types for a single example can be found in Table A1.

B Zero-shot Transfer to Novel Domains

Qualitative examples showcasing zero-shot transfer to novel domains can be found in Table A2.
Table A1: Examples of the same SGD dialogue with different description types. "Language" uses a detailed natural
language description, "Name" uses the schema element name, and "Random" is generated from a random shuffling
of the slot name. Note that the categorical slot value enumeration is unaffected in "Random", and that all three
description types would have the same target slots and intents.
Domain: Internet Provider
Inputs: 0:phone number associated with the customer's account 1:a coupon code to apply to the purchase 2:the reason for the product return 2a) accidental purchase 2b) malfunction 2c) preference 3:the retail product to purchase or to be returned 4:date the product was purchased 5:identifier associated with the purchase i0:return a product i1:purchase a product [system] hi how can i help you today? [user] hello - i recently purchased a glow in the dark ball that i'd like to return. [system] no problem. i'm happy to help. can you provide the order number or date of purchase please? [user] 1ozdl3v260lkq, and i purchased it last week on nov 1, 2021 [system] thanks. and what's the reason for the return? [user] the ball seems to be broken. it doesn't actually glow in the dark. [system] sorry to hear about that. we'll process the return and you should receive a refund within 10 business days. is there anything else i can do for you? [user] no, thanks for your help!
Table A2: Two more examples of D3ST trained on SGD performing zero-shot transfer to novel domains. The only
error is in the "Internet Provider" example, where the model misses that the slot for "whether to bundle services on
the same plan" should be false. We hypothesize that "bundle" is industry jargon that the model fails to associate
with the dialogue context.
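The zero-shot setup in Appendix B requires no model changes: an unseen domain is handled purely by writing a new indexed schema prefix and decoding as usual. Below is a minimal sketch that assembles the source string for the return-a-product example above, assuming the prompt layout of Figure 1; the assembly code is our own illustration, not code released with the paper.

    # Zero-shot transfer sketch: handling an unseen domain needs only a new schema
    # prefix. Slot, value, and intent text are taken from the Table A2 example.
    slots = [
        ("phone number associated with the customer's account", []),
        ("a coupon code to apply to the purchase", []),
        ("the reason for the product return",
         ["accidental purchase", "malfunction", "preference"]),
        ("the retail product to purchase or to be returned", []),
        ("date the product was purchased", []),
        ("identifier associated with the purchase", []),
    ]
    intents = ["return a product", "purchase a product"]

    prefix_parts = []
    for i, (desc, values) in enumerate(slots):
        part = f"{i}:{desc}"
        if values:  # categorical values indexed as 2a) 2b) 2c) ...
            part += " " + " ".join(f"{i}{chr(ord('a') + j)}) {v}" for j, v in enumerate(values))
        prefix_parts.append(part)
    prefix_parts += [f"i{j}:{d}" for j, d in enumerate(intents)]  # i0, i1 as in Table A2

    context = ("[system] hi how can i help you today? "
               "[user] hello - i recently purchased a glow in the dark ball that i'd like to return.")
    source = " ".join(prefix_parts) + " " + context
    # `source` is fed to the fine-tuned T5 model, which decodes the dialogue state
    # (slot index:value pairs and the active intent index) for this unseen domain.
    print(source)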