Question Answering in Context (QuAC) is a dataset for modeling, understanding, and participating in information-seeking dialog. Data instances consist of an interactive dialog between two crowd workers: (1) a student who poses a sequence of free-form questions to learn as much as possible about a hidden Wikipedia text, and (2) a teacher who answers the questions by providing short excerpts (spans) from the text. QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context.
QuAC is meant to be an academic resource and has significant limitations. Please read our detailed datasheet before considering it for any practical application.
QuAC shares many principles with SQuAD 2.0, such as span-based evaluation and unanswerable questions (including website design principles; big thanks for sharing the code!), but it incorporates a new dialog component. We expect models can be easily evaluated on both resources, and we have tried to make our evaluation protocol as similar as possible to theirs.
Download a copy of the dataset (distributed under the CC BY-SA 4.0 license):
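Once downloaded, the data follows a SQuAD-style JSON layout with extra dialog fields. As a minimal sketch, assuming the dev filename and the usual field names (the datasheet and the files themselves are authoritative), you can peek at one dialog like so:

```python
# Minimal sketch: peek at one dialog from the downloaded dev file.
# The filename and field names are assumptions based on QuAC's
# SQuAD-style JSON layout; consult the datasheet for the schema.
import json

with open("val_v0.2.json") as f:
    dialogs = json.load(f)["data"]

paragraph = dialogs[0]["paragraphs"][0]
print(paragraph["context"][:200], "...")    # the hidden Wikipedia text
for qa in paragraph["qas"]:
    print("Q:", qa["question"])
    print("A:", qa["orig_answer"]["text"])  # teacher's span, or CANNOTANSWER
```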
To evaluate your models, we have also made available the evaluation script we will use for official evaluation, along with a sample prediction file that the script will take as input. To run the evaluation, use:
python scorer.py --val_file <path_to_val> --model_output <path_to_predictions> --o eval.json
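The headline metric is word-level F1 between predicted and reference spans. Here is a minimal sketch of the core computation; the official scorer.py additionally normalizes text and takes the max over reference answers, so use it for any reported numbers:

```python
# Minimal sketch of the word-level F1 at the core of the evaluation:
# precision and recall over the bag-of-words overlap between the
# predicted span and a gold span.
from collections import Counter

def word_f1(prediction: str, gold: str) -> float:
    pred_toks, gold_toks = prediction.split(), gold.split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(word_f1("in the park", "the park"))  # 0.8
```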
Once you have built a model that works to your expectations on the dev set, you can submit it to get official scores on the dev set and a hidden test set. To preserve the integrity of test results, we do not release the test set to the public. Instead, we require you to submit your model so that we can run it on the test set for you. The submission process is very similar to SQuAD 2.0's:
Submission Tutorial

All baseline models are available through AllenNLP; specifically, the model is here and the configuration is here.
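As a rough sketch of loading the baseline archive with AllenNLP (the archive filename and the "dialog_qa" predictor name are assumptions; the linked model and configuration are authoritative):

```python
# Hedged sketch: load the released baseline archive with AllenNLP.
# "model.tar.gz" and the "dialog_qa" predictor name are assumptions;
# check the linked model and configuration for the exact setup.
from allennlp.predictors.predictor import Predictor

predictor = Predictor.from_path("model.tar.gz", predictor_name="dialog_qa")
```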
Want to put the QuAC duck in your paper? First, download the duck:
The Duck
Then, put this macro in your LaTeX preamble:
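% Requires \usepackage{graphicx} in your preamble; path_to_daffy is a
% placeholder for wherever you saved daffyhand.pdf.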
\newcommand{\daffy}[0]{\includegraphics[width=.04\textwidth]{path_to_daffy/daffyhand.pdf}}
Finally, enjoy the command \daffy in your paper!
Ask us questions at our google group or at:
eunsol@cs.washington.edu
hehe@stanford.edu
miyyer@cs.umass.edu
marky@allenai.org
There can be only one duck.
F1 measures word-level overlap with the reference answers; HEQQ and HEQD are human-equivalence scores: the percentage of questions (HEQQ) and of whole dialogs (HEQD) on which the model's F1 matches or exceeds human F1 (Choi et al. EMNLP '18).

Rank | Date | Model | F1 | HEQQ | HEQD
---|---|---|---|---|---
– | – | Human Performance (Choi et al. EMNLP '18) | 81.1 | 100 | 100
1 | Aug 24, 2022 | CKT-QA (ensemble) | 76.3 | 73.6 | 17.9
2 | Feb 2, 2022 | CDQ-DeBERTa (single model) | 75.8 | 73.1 | 15.9
3 | Nov 9, 2021 | AERmodel (ensemble) | 75.2 | 72.5 | 16.5
4 | Jan 27, 2021 | RoR (single model), JD AI Research | 74.9 | 72.2 | 16.4
5 | Sep 3, 2020 | EL-QA (single model), JD AI Research | 74.6 | 71.6 | 16.3
6 | Jul 29, 2020 | HistoryQA (single model), PAII Inc. | 74.2 | 71.5 | 13.9
7 | Dec 16, 2019 | TR-MT (ensemble), WeChat AI | 74.4 | 71.3 | 13.6
8 | Nov 11, 2019 | RoBERTa + DA (ensemble), Microsoft Dynamics 365 AI | 74.0 | 70.7 | 13.1
9 | Aug 9, 2022 | GHR_ELECTRA (single model), SUDA NLP & A*STAR | 73.7 | 69.9 | 13.7
10 | Jun 22, 2022 | MarCQAp (single model), Technion - Israel Institute of Technology & Google Research | 74.0 | 70.7 | 12.5
11 | Sep 15, 2019 | History-Attentive-TransBERT (single model), Alibaba AI Labs | 72.9 | 69.7 | 13.6
12 | Nov 11, 2019 | RoBERTa + DA (single model), Microsoft Dynamics 365 AI | 73.5 | 69.8 | 12.1
13 | Nov 25, 2019 | BertMT (ensemble), WeChat AI | 72.3 | 69.4 | 13.1
14 | Nov 1, 2019 | XLNet + Augmentation (single model), Xiaoming | 71.2 | 67.5 | 11.8
15 | Aug 31, 2019 | TransBERT (single model), Anonymous | 71.4 | 68.1 | 10.0
16 | Nov 22, 2019 | BertMT (single model), WeChat AI | 69.4 | 66.0 | 9.8
17 | Jun 13, 2019 | Context-Aware-BERT (single model), Anonymous | 69.6 | 65.7 | 8.1
18 | Sep 9, 2019 | BertInfoFlow (single model), PINGAN Omni-Sinitic | 69.3 | 65.2 | 8.5
19 | Mar 14, 2019 | ConvBERT (single model), Joint Laboratory of HIT and iFLYTEK Research | 68.0 | 63.5 | 9.1
20 | Aug 22, 2019 | zhiboBERT (single model), Anonymous | 67.0 | 63.5 | 8.6
21 | May 21, 2019 | HAM (single model), UMass Amherst, Alibaba PAI, Rutgers University | 65.4 | 61.8 | 6.7
22 | Jan 10, 2020 | Bert-FlowDelta (single model), National Taiwan University MiuLab (https://arxiv.org/abs/1908.05117) | 65.5 | 61.0 | 6.9
23 | Dec 4, 2019 | BERT w/ 4-conversation history (single model), Zhengzhou University | 64.5 | 60.2 | 6.7
24 | Mar 7, 2019 | BERT w/ 2-context (single model), NTT Media Intelligence Labs | 64.9 | 60.2 | 6.1
25 | Nov 27, 2019 | AHBert (single model), ZZU | 64.0 | 59.9 | 6.6
26 | Feb 21, 2019 | GraphFlow (single model), RPI-IBM | 64.9 | 60.3 | 5.1
27 | Sep 26, 2018 | FlowQA (single model), Allen Institute for AI | 64.1 | 59.6 | 5.8
28 | Aug 20, 2018 | BERT + History Answer Embedding (single model), UMass Amherst, Alibaba PAI, Rutgers University | 62.4 | 57.8 | 5.1
29 | Aug 20, 2018 | BiDAF++ w/ 2-Context (single model, baseline) | 60.1 | 54.8 | 4.0
30 | Aug 20, 2018 | BiDAF++ (single model, baseline) | 50.2 | 43.3 | 2.2
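For reference, a minimal sketch of how the HEQ columns are defined, following Choi et al. (EMNLP '18); the official scorer is authoritative:

```python
# Minimal sketch of the HEQ metrics from Choi et al. (EMNLP '18):
# HEQ-Q: fraction of questions where model F1 >= human F1.
# HEQ-D: fraction of dialogs where that holds for every question.
def heq_q(model_f1s, human_f1s):
    return sum(m >= h for m, h in zip(model_f1s, human_f1s)) / len(model_f1s)

def heq_d(dialogs):
    # dialogs: list of (model_f1s, human_f1s) pairs, one per dialog
    return sum(
        all(m >= h for m, h in zip(ms, hs)) for ms, hs in dialogs
    ) / len(dialogs)
```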