Question Answering in Context (QuAC) is a dataset for modeling, understanding, and participating in information-seeking dialog. Data instances consist of an interactive dialog between two crowd workers: (1) a student who poses a sequence of free-form questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts (spans) from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context.
QuAC is meant to be an academic resource and has significant limitations. Please read our detailed datasheet before considering it for any practical application.
QuAC shares many principles with SQuAD 2.0, such as span-based evaluation and unanswerable questions (including website design principles; big thanks for sharing the code!), but it incorporates a new dialog component. We expect models can be easily evaluated on both resources, and we have tried to make our evaluation protocol as similar as possible to theirs.
Download a copy of the dataset (distributed under the CC BY-SA 4.0 license):
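Once downloaded, the data follows a SQuAD-style JSON layout with extra dialog fields. As a minimal sketch, assuming the dev filename and the usual field names (the datasheet and the files themselves are authoritative), you can peek at one dialog like so:

```python
# Minimal sketch: peek at one dialog from the downloaded dev file.
# The filename and field names are assumptions based on QuAC's
# SQuAD-style JSON layout; consult the datasheet for the schema.
import json

with open("val_v0.2.json") as f:
    dialogs = json.load(f)["data"]

paragraph = dialogs[0]["paragraphs"][0]
print(paragraph["context"][:200], "...")    # the hidden Wikipedia text
for qa in paragraph["qas"]:
    print("Q:", qa["question"])
    print("A:", qa["orig_answer"]["text"])  # teacher's span, or CANNOTANSWER
```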
To evaluate your models, we have also made available the evaluation script we will use for official evaluation, along with a sample prediction file that the script will take as input. To run the evaluation, use:
python scorer.py --val_file <path_to_val> --model_output <path_to_predictions> --o eval.json
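The headline metric is word-level F1 between predicted and reference spans. Here is a minimal sketch of the core computation; the official scorer.py additionally normalizes text and takes the max over reference answers, so use it for any reported numbers:

```python
# Minimal sketch of the word-level F1 at the core of the evaluation:
# precision and recall over the bag-of-words overlap between the
# predicted span and a gold span.
from collections import Counter

def word_f1(prediction: str, gold: str) -> float:
    pred_toks, gold_toks = prediction.split(), gold.split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(word_f1("in the park", "the park"))  # 0.8
```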
Once you have built a model that works to your expectations on the dev set, you can submit it to get official scores on the dev set and a hidden test set. To preserve the integrity of test results, we do not release the test set to the public. Instead, we require you to submit your model so that we can run it on the test set for you. The submission process is very similar to SQuAD 2.0's:
Submission Tutorial

All baseline models are available through AllenNLP; specifically, the model is here and the configuration is here.
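As a rough sketch of loading the baseline archive with AllenNLP (the archive filename and the "dialog_qa" predictor name are assumptions; the linked model and configuration are authoritative):

```python
# Hedged sketch: load the released baseline archive with AllenNLP.
# "model.tar.gz" and the "dialog_qa" predictor name are assumptions;
# check the linked model and configuration for the exact setup.
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path("model.tar.gz", predictor_name="dialog_qa")
```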
Want to put the QuAC duck in your paper? First, download the duck:
The Duck
Then, put this macro in your LaTeX preamble:
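% Requires \usepackage{graphicx} in your preamble; path_to_daffy is a
% placeholder for wherever you saved daffyhand.pdf.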
\newcommand{\daffy}[0]{\includegraphics[width=.04\textwidth]{path_to_daffy/daffyhand.pdf}}
Finally, enjoy the command \daffy in your paper!
Ask us questions at our google group or at:
eunsol@cs.washington.edu
hehe@stanford.edu
miyyer@cs.umass.edu
marky@allenai.org
There can be only one duck.
F1 measures word-level overlap with the reference answers; HEQQ and HEQD are human-equivalence scores: the percentage of questions (HEQQ) and of whole dialogs (HEQD) on which the model's F1 matches or exceeds human F1 (Choi et al. EMNLP '18).

Rank | Date | Model | F1 | HEQQ | HEQD
---|---|---|---|---|---
– | – | Human Performance (Choi et al. EMNLP '18) | 81.1 | 100 | 100
1 | Aug 24, 2022 | CKT-QA (ensemble) | 76.3 | 73.6 | 17.9
2 | Feb 2, 2022 | CDQ-DeBERTa (single model) | 75.8 | 73.1 | 15.9
3 | Nov 9, 2021 | AERmodel (ensemble) | 75.2 | 72.5 | 16.5
4 | Jan 27, 2021 | RoR (single model), JD AI Research | 74.9 | 72.2 | 16.4
5 | Sep 3, 2020 | EL-QA (single model), JD AI Research | 74.6 | 71.6 | 16.3
6 | Jul 29, 2020 | HistoryQA (single model), PAII Inc. | 74.2 | 71.5 | 13.9
7 | Dec 16, 2019 | TR-MT (ensemble), WeChat AI | 74.4 | 71.3 | 13.6
8 | Nov 11, 2019 | RoBERTa + DA (ensemble), Microsoft Dynamics 365 AI | 74.0 | 70.7 | 13.1
9 | Aug 9, 2022 | GHR_ELECTRA (single model), SUDA NLP & A*STAR | 73.7 | 69.9 | 13.7
10 | Jun 22, 2022 | MarCQAp (single model), Technion - Israel Institute of Technology & Google Research | 74.0 | 70.7 | 12.5
11 | Sep 15, 2019 | History-Attentive-TransBERT (single model), Alibaba AI Labs | 72.9 | 69.7 | 13.6
12 | Nov 11, 2019 | RoBERTa + DA (single model), Microsoft Dynamics 365 AI | 73.5 | 69.8 | 12.1
13 | Nov 25, 2019 | BertMT (ensemble), WeChat AI | 72.3 | 69.4 | 13.1
14 | Nov 1, 2019 | XLNet + Augmentation (single model), Xiaoming | 71.2 | 67.5 | 11.8
15 | Aug 31, 2019 | TransBERT (single model), Anonymous | 71.4 | 68.1 | 10.0
16 | Nov 22, 2019 | BertMT (single model), WeChat AI | 69.4 | 66.0 | 9.8
17 | Jun 13, 2019 | Context-Aware-BERT (single model), Anonymous | 69.6 | 65.7 | 8.1
18 | Sep 9, 2019 | BertInfoFlow (single model), PINGAN Omni-Sinitic | 69.3 | 65.2 | 8.5
19 | Mar 14, 2019 | ConvBERT (single model), Joint Laboratory of HIT and iFLYTEK Research | 68.0 | 63.5 | 9.1
20 | Aug 22, 2019 | zhiboBERT (single model), Anonymous | 67.0 | 63.5 | 8.6
21 | May 21, 2019 | HAM (single model), UMass Amherst, Alibaba PAI, Rutgers University | 65.4 | 61.8 | 6.7
22 | Jan 10, 2020 | Bert-FlowDelta (single model), National Taiwan University MiuLab (https://arxiv.org/abs/1908.05117) | 65.5 | 61.0 | 6.9
23 | Dec 4, 2019 | BERT w/ 4-conversation history (single model), Zhengzhou University | 64.5 | 60.2 | 6.7
24 | Mar 7, 2019 | BERT w/ 2-context (single model), NTT Media Intelligence Labs | 64.9 | 60.2 | 6.1
25 | Nov 27, 2019 | AHBert (single model), ZZU | 64.0 | 59.9 | 6.6
26 | Feb 21, 2019 | GraphFlow (single model), RPI-IBM | 64.9 | 60.3 | 5.1
27 | Sep 26, 2018 | FlowQA (single model), Allen Institute for AI | 64.1 | 59.6 | 5.8
28 | Aug 20, 2018 | BERT + History Answer Embedding (single model), UMass Amherst, Alibaba PAI, Rutgers University | 62.4 | 57.8 | 5.1
29 | Aug 20, 2018 | BiDAF++ w/ 2-Context (single model, baseline) | 60.1 | 54.8 | 4.0
30 | Aug 20, 2018 | BiDAF++ (single model, baseline) | 50.2 | 43.3 | 2.2
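For reference, a minimal sketch of how the HEQ columns are defined, following Choi et al. (EMNLP '18); the official scorer is authoritative:

```python
# Minimal sketch of the HEQ metrics from Choi et al. (EMNLP '18):
# HEQ-Q: fraction of questions where model F1 >= human F1.
# HEQ-D: fraction of dialogs where that holds for every question.
def heq_q(model_f1s, human_f1s):
    return sum(m >= h for m, h in zip(model_f1s, human_f1s)) / len(model_f1s)

def heq_d(dialogs):
    # dialogs: list of (model_f1s, human_f1s) pairs, one per dialog
    return sum(
        all(m >= h for m, h in zip(ms, hs)) for ms, hs in dialogs
    ) / len(dialogs)
```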