Neural Question Generation from Text: A Preliminary Study
Qingyu Zhou†∗  Nan Yang‡  Furu Wei‡  Chuanqi Tan♮  Hangbo Bao†  Ming Zhou‡
†Harbin Institute of Technology, Harbin, China
‡Microsoft Research, Beijing, China
♮Beihang University, Beijing, China
qyzhgm@gmail.com  {nanya, fuwei, mingzhou}@microsoft.com
tanchuanqi@nlsde.buaa.edu.cn  baohangbo@hit.edu.cn

∗ Contribution during internship at Microsoft Research.
Abstract

Automatic question generation aims to generate questions from a text passage where the generated questions can be answered by certain sub-spans of the given passage. Traditional methods mainly use rigid heuristic rules to transform a sentence into related questions. In this work, we propose to apply the neural encoder-decoder model to generate meaningful and diverse questions from natural language sentences. The encoder reads the input text and the answer position to produce an answer-aware input representation, which is fed to the decoder to generate an answer-focused question. We conduct a preliminary study on neural question generation from text with the SQuAD dataset, and the experiment results show that our method can produce fluent and diverse questions.

1 Introduction

Automatic question generation from natural language text aims to generate questions taking text as input, which has potential value for educational purposes (Heilman, 2011). As the reverse task of question answering, question generation also has the potential to provide a large-scale corpus of question-answer pairs.

Previous work on question generation mainly uses rigid heuristic rules to transform a sentence into related questions (Heilman, 2011; Chali and Hasan, 2015). However, these methods heavily rely on human-designed transformation and generation rules, which cannot be easily adapted to other domains. Instead of generating questions from texts, Serban et al. (2016) proposed a neural network model to generate factoid questions from structured data.

In this work we conduct a preliminary study on question generation from text with neural networks, which is denoted as the Neural Question Generation (NQG) framework, to generate natural language questions from text without pre-defined rules. The NQG framework extends the sequence-to-sequence model by enriching the encoder with answer and lexical features to generate answer-focused questions. Concretely, the encoder reads not only the input sentence, but also the answer position indicator and lexical features. The answer position feature denotes the answer span in the input sentence, which is essential for generating answer-relevant questions. The lexical features include part-of-speech (POS) and named entity (NER) tags, which help produce better sentence encodings. Lastly, the decoder with attention mechanism (Bahdanau et al., 2015) generates an answer-specific question for the sentence.

Large-scale manually annotated passage and question pairs play a crucial role in developing question generation systems. We propose to adapt the recently released Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) as the training and development data for the question generation task. In SQuAD, the answers are labeled as subsequences of the given sentences by crowdsourcing, and it contains more than 100K questions, which makes it feasible to train our neural network models. We conduct experiments on SQuAD, and the results show that the neural network models can produce fluent and diverse questions from text.

2 Approach

In this section, we introduce the NQG framework, which consists of a feature-rich encoder and an attention-based decoder. Figure 1 provides an overview of our NQG framework.
[Figure 1: Overview of the NQG framework. The feature-rich encoder reads the input words x1, ..., x6 together with their lexical feature and answer position feature embeddings, producing hidden states h1, ..., h6; the attention-based decoder (states st−1, st) generates the question words yt−1, yt.]

Answer Position Feature  To generate a question with respect to a specific answer in a sentence, we propose using the answer position feature to locate the target answer. In this work, the BIO tagging scheme is used to label the position of a target answer. In this scheme, tag B denotes the start of an answer, tag I continues the answer, and tag O marks words that do not form part of an answer. The BIO tags of the answer position are embedded into real-valued vectors and fed to the feature-rich encoder. With the BIO tagging feature, the answer position is encoded into the hidden vectors and used to generate answer-focused questions.
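As an illustration, the following minimal sketch labels a tokenized sentence with B/I/O tags given a hypothetical token-level answer span; the function and inputs are illustrative, not the released preprocessing code.

```python
def bio_answer_tags(sentence_tokens, answer_start, answer_len):
    """Label each token with B/I/O relative to the answer span.

    `answer_start` is the index of the first answer token and
    `answer_len` the number of answer tokens (a hypothetical
    pre-tokenized representation of a SQuAD answer span).
    """
    tags = ["O"] * len(sentence_tokens)
    if answer_len > 0:
        tags[answer_start] = "B"                      # start of the answer
        for i in range(answer_start + 1, answer_start + answer_len):
            tags[i] = "I"                             # inside the answer
    return tags

tokens = "genghis khan began a retaliatory attack on the tanguts .".split()
print(bio_answer_tags(tokens, 4, 2))
# ['O', 'O', 'O', 'O', 'B', 'I', 'O', 'O', 'O', 'O']
```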
Lexical Features  Besides the sentence words, we also feed other lexical features to the encoder. To encode more linguistic information, we select word case, POS and NER tags as the lexical features. As an intermediate layer of full parsing, the POS tag feature is important in many NLP tasks, such as information extraction and dependency parsing (Manning et al., 1999). Considering that SQuAD is constructed using Wikipedia articles, which contain many named entities, we add the NER feature to help detect them.

We then combine the previous word embedding w_{t-1}, the current context vector c_t, and the decoder state s_t to get the readout state r_t. The readout state is passed through a maxout hidden layer (Goodfellow et al., 2013) to predict the next word with a softmax layer over the decoder vocabulary:

r_t = W_r w_{t-1} + U_r c_t + V_r s_t    (6)
m_t = [max{r_{t,2j-1}, r_{t,2j}}]^T_{j=1,...,d}    (7)
p(y_t | y_1, ..., y_{t-1}) = softmax(W_o m_t)    (8)

where r_t is a 2d-dimensional vector.
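To make equations (6)-(8) concrete, the following NumPy sketch computes the readout state, the maxout pooling over adjacent pairs, and the output distribution for toy dimensions; all parameter shapes and values here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 4, 10                        # toy sizes; the paper uses larger ones

# Hypothetical toy parameters; shapes follow equations (6)-(8).
W_r = rng.normal(size=(2 * d, d))       # maps word embedding w_{t-1}
U_r = rng.normal(size=(2 * d, d))       # maps context vector c_t
V_r = rng.normal(size=(2 * d, d))       # maps decoder state s_t
W_o = rng.normal(size=(vocab, d))

w_prev, c_t, s_t = rng.normal(size=(3, d))

r_t = W_r @ w_prev + U_r @ c_t + V_r @ s_t       # eq. (6): 2d-dim readout
m_t = r_t.reshape(d, 2).max(axis=1)              # eq. (7): maxout over pairs
logits = W_o @ m_t                               # eq. (8) before softmax
p = np.exp(logits - logits.max())
p /= p.sum()                                     # softmax over the vocabulary
print(p.shape, p.sum())                          # (10,) ~1.0
```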
2.3 Copy Mechanism

To deal with the rare and unknown words problem, Gulcehre et al. (2016) propose using a pointing mechanism to copy rare words from the source sentence. We apply this pointing method in our NQG system. When decoding word t, the copy switch takes the current decoder state s_t and context vector c_t as input and generates the probability p of copying a word from the source sentence:

p = σ(W s_t + U c_t + b)    (9)

where σ is the sigmoid function. We reuse the attention probability in equation 4 to decide which word to copy.
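A schematic sketch of the copy decision at a single decoding step is shown below; the hard argmax choice and all names are illustrative simplifications rather than the released system.

```python
import numpy as np

def copy_switch(W, U, b, s_t, c_t):
    """Eq. (9): probability p of copying a source word at step t."""
    return 1.0 / (1.0 + np.exp(-(W @ s_t + U @ c_t + b)))  # sigmoid

def choose_output(p_copy, attn, p_vocab, source_tokens):
    """Hard selection for illustration: copy the most-attended source
    word when p_copy > 0.5, otherwise emit the best vocabulary word.
    (`attn` stands in for the reused attention distribution.)"""
    if p_copy > 0.5:
        return source_tokens[int(np.argmax(attn))]
    return int(np.argmax(p_vocab))

rng = np.random.default_rng(1)
d = 8
W, U, b = rng.normal(size=d), rng.normal(size=d), 0.0
p = copy_switch(W, U, b, rng.normal(size=d), rng.normal(size=d))
print(0.0 <= p <= 1.0)   # True: a valid copy probability
```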
3 Experiments and Results

We use the SQuAD dataset as our training data. SQuAD is composed of more than 100K questions posed by crowd workers on 536 Wikipedia articles. We extract sentence-answer-question triples to build the training, development and test sets¹. Since the test set is not publicly available, we randomly halve the development set to construct the new development and test sets. The extracted training, development and test sets contain 86,635, 8,965 and 8,964 triples respectively. We introduce the implementation details in the appendix.

We conduct several experiments and ablation tests as follows:

PCFG-Trans  The rule-based system¹ modified from the code released by Heilman (2011). We modified the code so that it can generate questions based on a given word span.
s2s+att  We implement a seq2seq model with attention as the baseline method.
NQG  We extend s2s+att with our feature-rich encoder to build the NQG system.
NQG+  Based on NQG, we incorporate the copy mechanism to deal with the rare words problem.
NQG+Pretrain  Based on NQG+, we initialize the word embedding matrix with pre-trained GloVe (Pennington et al., 2014) vectors.
NQG+STshare  Based on NQG+, we make the encoder and decoder share the same embedding matrix.
NQG++  Based on NQG+, we use both the pre-trained word embeddings and the STshare method to further improve the performance.
NQG−Answer  Ablation test: the answer position indicator is removed from the NQG model.
NQG−POS  Ablation test: the POS tag feature is removed from the NQG model.
NQG−NER  Ablation test: the NER feature is removed from the NQG model.
NQG−Case  Ablation test: the word case feature is removed from the NQG model.

¹We re-distribute the processed data split and PCFG-Trans baseline code at http://res.qyzhou.me

3.1 Results and Analysis

We report the BLEU-4 score (Papineni et al., 2002) as the evaluation metric of our NQG system.

Model           Dev set   Test set
PCFG-Trans         9.28       9.31
s2s+att            3.01       3.06
NQG               10.06      10.13
NQG+              12.30      12.18
NQG+Pretrain      12.80      12.69
NQG+STshare       12.92      12.80
NQG++             13.27      13.29
NQG−Answer         2.79       2.98
NQG−POS            9.83       9.87
NQG−NER            9.50       9.29
NQG−Case           9.91       9.89

Table 1: BLEU evaluation scores of baseline methods, different NQG framework configurations and some ablation tests.

Table 1 shows the BLEU-4 scores of the different settings. We report beam search results on both the development and test sets. Our NQG framework outperforms the PCFG-Trans and s2s+att baselines by a large margin, which shows that the lexical features and answer position indicator benefit question generation. With the help of the copy mechanism, NQG+ achieves a 2.05 BLEU improvement over NQG since it alleviates the rare words problem. The extended version, NQG++, gains a further 1.11 BLEU over NQG+, which shows that initializing with pre-trained word vectors and sharing them between the encoder and decoder help learn better word representations.
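As a reference for the metric itself, BLEU-4 over tokenized questions can be computed with NLTK's implementation; the snippet below is a sketch of the metric computation on toy data, not the evaluation script behind Table 1.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Toy data: one tokenized reference question per generated question.
references = [["in", "what", "year", "did", "genghis", "khan",
               "begin", "a", "retaliatory", "attack", "?"]]
hypotheses = [["in", "which", "year", "did", "genghis", "khan",
               "strike", "?"]]

# corpus_bleu expects a list of reference *lists* per hypothesis.
bleu4 = corpus_bleu([[r] for r in references], hypotheses,
                    weights=(0.25, 0.25, 0.25, 0.25),
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {100 * bleu4:.2f}")
```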
Human Evaluation  We evaluate the PCFG-Trans baseline and NQG++ with human judges. The rating scheme is: Good (3) - the question is meaningful and matches the sentence and answer very well; Borderline (2) - the question matches the sentence and answer, more or less; Bad (1) - the question either does not make sense or does not match the sentence and answer. We provide more detailed rating examples in the supplementary material. Three human raters labeled 200 questions sampled from the test set to judge whether the generated questions match the given sentence and answer span. The inter-rater agreement is measured with Fleiss' kappa (Fleiss, 1971).
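The kappa statistic can be computed directly from a question-by-category count matrix; the following is a minimal sketch of the standard Fleiss' kappa formula with illustrative toy counts, not our rating data.

```python
import numpy as np

def fleiss_kappa(ratings):
    """Fleiss' kappa for an (items x categories) count matrix.

    ratings[i, j] = number of raters who assigned item i to category j;
    every row must sum to the same number of raters n.
    """
    ratings = np.asarray(ratings, dtype=float)
    N, _ = ratings.shape
    n = ratings[0].sum()                     # raters per item
    p_j = ratings.sum(axis=0) / (N * n)      # category proportions
    P_i = ((ratings ** 2).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), (p_j ** 2).sum()
    return (P_bar - P_e) / (1 - P_e)

# Toy check: 3 raters, categories Bad/Borderline/Good, 4 questions.
counts = [[0, 1, 2], [3, 0, 0], [1, 1, 1], [0, 0, 3]]
print(round(fleiss_kappa(counts), 3))
```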
Model        AvgScore   Fleiss' kappa
PCFG-Trans     1.42        0.50
NQG++          2.18        0.46

Table 2: Human evaluation results.

Table 2 reports the human judgment results. The kappa scores show a moderate agreement between the human raters. Our NQG++ outperforms the PCFG-Trans baseline by 0.76, which shows that the questions generated by NQG++ are more related to the given sentence and answer span.

I: in 1226 , immediately after returning from the west , genghis khan began a retaliatory attack on the tanguts .
G: in which year did genghis khan strike against the tanguts ?
O: in what year did genghis khan begin a retaliatory attack on the tanguts ?

I: in week 10 , manning suffered a partial tear of the plantar fasciitis in his left foot .
G: in the 10th week of the 2015 season , what injury was peyton manning dealing with ?
O: what did manning suffer in his left foot ?

I: like the lombardi trophy , the “ 50 ” will be designed by tiffany & co. .
G: who designed the vince lombardi trophy ?
O: who designed the lombardi trophy ?

Table 3: Examples of generated questions. I is the input sentence, G is the gold question and O is the NQG++ generated question. The underlined words are the target answers.

²We treat questions containing ‘what country’, ‘what place’ and so on as WHERE type questions. Similarly, questions containing ‘what time’, ‘what year’ and so forth are counted as WHEN type questions.
ral network models. We propose to apply neu- of 3rd International Conference for Learning Repre-
ral encoder-decoder model to generate answer fo- sentations. San Diego.
cused questions based on natural language sen- Thang Luong, Hieu Pham, and Christopher D. Man-
tences. The proposed approach uses a feature- ning. 2015. Effective approaches to attention-
rich encoder to encode answer position, POS and based neural machine translation. In Proceedings of
EMNLP 2015. Association for Computational Lin-
NER tag information. Experiments show the ef-
guistics, Lisbon, Portugal, pages 1412–1421.
fectiveness of our NQG method. In future work,
we would like to investigate whether the auto- Christopher D Manning, Hinrich Schütze, et al. 1999.
Foundations of statistical natural language process-
matically generated questions can help to improve
ing, volume 999. MIT Press.
question answering systems.
Christopher D. Manning, Mihai Surdeanu, John Bauer,
Jenny Finkel, Steven J. Bethard, and David Mc-
References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations. San Diego.

Yllias Chali and Sadid A. Hasan. 2015. Towards topic-to-question generation. Computational Linguistics 41(1):1–20.

Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of EMNLP 2014. Association for Computational Linguistics, Doha, Qatar, pages 740–750.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP 2014. Doha, Qatar, pages 1724–1734.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin 76(5):378.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS. volume 9, pages 249–256.

Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron C. Courville, and Yoshua Bengio. 2013. Maxout networks. ICML (3) 28:1319–1327.

Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 140–149.

Michael Heilman. 2011. Automatic factual question generation from text. Ph.D. thesis, Carnegie Mellon University.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations. San Diego.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP 2015. Association for Computational Linguistics, Lisbon, Portugal, pages 1412–1421.

Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations. pages 55–60.

Ramesh Nallapati, Bowen Zhou, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pages 311–318.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. ICML (3) 28:1310–1318.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP). pages 1532–1543.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Iulian Vlad Serban, Alberto García-Durán, Caglar Gulcehre, Sungjin Ahn, Sarath Chandar, Aaron Courville, and Yoshua Bengio. 2016. Generating factoid questions with recurrent neural networks: The 30M factoid question-answer corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 588–598.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958.

Shuohang Wang and Jing Jiang. 2016. Machine comprehension using match-LSTM and answer pointer. arXiv preprint arXiv:1608.07905.
A Implementation Details

A.1 Model Parameters

We use the same vocabulary for both the encoder and the decoder. The vocabulary is collected from the training data and we keep the 20,000 most frequent words. We set the word embedding size to 300 and all GRU hidden state sizes to 512. The lexical and answer position features are embedded into 32-dimensional vectors. We use dropout (Srivastava et al., 2014) with probability p = 0.5. During testing, we use beam search with beam size 12.
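Assuming the word embedding and the feature embeddings are combined by concatenation (the combination method and the tag inventory sizes are our assumptions for this sketch), each per-token encoder input under these settings is a 300 + 4 × 32 = 428-dimensional vector:

```python
import numpy as np

# Toy embedding tables sized as in A.1 (row counts are hypothetical).
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(20000, 300))   # top-20,000 word vocabulary
case_emb = rng.normal(size=(2, 32))        # lower/upper word case
pos_emb  = rng.normal(size=(50, 32))       # POS tag inventory (size assumed)
ner_emb  = rng.normal(size=(10, 32))       # NER tag inventory (size assumed)
bio_emb  = rng.normal(size=(3, 32))        # answer position tags B/I/O

def encoder_input(word_id, case_id, pos_id, ner_id, bio_id):
    """Concatenate the word embedding with feature embeddings for one token."""
    return np.concatenate([word_emb[word_id], case_emb[case_id],
                           pos_emb[pos_id], ner_emb[ner_id], bio_emb[bio_id]])

x = encoder_input(42, 0, 7, 3, 2)
print(x.shape)   # (428,) = 300 + 4 * 32 per-token encoder input
```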
A.2 Lexical Feature Annotation

We use Stanford CoreNLP v3.7.0 (Manning et al., 2014) to annotate POS and NER tags in sentences with its default configuration and pre-trained models.
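One way to reproduce this annotation step is through the CoreNLP HTTP server; the sketch below assumes a server has already been started locally, and the port and annotator options are illustrative rather than our annotation pipeline.

```python
import json
import requests  # assumes a CoreNLP server is running, e.g. started with:
# java -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000

def annotate(sentence, url="http://localhost:9000"):
    """Fetch POS and NER tags from a local CoreNLP server (a sketch)."""
    props = {"annotators": "tokenize,ssplit,pos,ner", "outputFormat": "json"}
    resp = requests.post(url, params={"properties": json.dumps(props)},
                         data=sentence.encode("utf-8"))
    tokens = resp.json()["sentences"][0]["tokens"]
    return [(t["word"], t["pos"], t["ner"]) for t in tokens]

print(annotate("Genghis Khan began a retaliatory attack on the Tanguts."))
```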
A.3 Model Training

We initialize model parameters randomly using a Gaussian distribution with the Xavier scheme (Glorot and Bengio, 2010). We use a combination of Adam (Kingma and Ba, 2015) and simple SGD as our optimization algorithms. The training is separated into two phases: the first phase optimizes the loss function with Adam, and the second with simple SGD. For the Adam optimizer, we set the learning rate α = 0.001, the two momentum parameters β1 = 0.9 and β2 = 0.999, and ε = 10⁻⁸. We use the Adam optimizer until the BLEU score on the development set drops for six consecutive tests (we test the BLEU score on the development set every 1,000 batches). Then we switch to a simple SGD optimizer with initial learning rate α = 0.5 and halve it if the BLEU score on the development set drops for twelve consecutive tests. We also apply gradient clipping (Pascanu et al., 2013) with range [−5, 5] in both the Adam and SGD phases. To speed up training and convergence, we use mini-batch size 64, chosen by grid search.
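The two-phase schedule can be summarized by the following schematic sketch, in which every callable is a hypothetical placeholder rather than our training code.

```python
def train_schedule(adam_step, sgd_step, dev_bleu, max_tests=500):
    """Schematic Adam-to-SGD switch from A.3 (an illustrative sketch).

    `adam_step()` / `sgd_step(lr)` are assumed to run 1,000 training
    batches with the respective optimizer; `dev_bleu()` returns the
    BLEU score on the development set.
    """
    best, drops, lr, use_adam = -1.0, 0, 0.5, True
    for _ in range(max_tests):
        adam_step() if use_adam else sgd_step(lr)
        bleu = dev_bleu()                   # tested every 1,000 batches
        drops = 0 if bleu > best else drops + 1
        best = max(best, bleu)
        if use_adam and drops >= 6:         # Adam until 6 consecutive drops
            use_adam, drops = False, 0      # then plain SGD with lr = 0.5
        elif not use_adam and drops >= 12:  # halve the lr after 12 drops
            lr, drops = lr / 2.0, 0
```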
B Human Evaluation Examples

We evaluate the PCFG-Trans baseline and NQG++ with human judges. The rating scheme is provided in Table 4.

Score          Rating scheme
3: Good        The question is meaningful and matches the sentence and answer very well
2: Borderline  The question matches the sentence and answer, more or less
1: Bad         The question either does not make sense or does not match the sentence and answer

Table 4: Human rating scheme.

The human judges are asked to label the generated questions according to whether they match the given sentence and answer span, following the rating scheme and examples. We provide some example questions with different scores in Table 5. For the first score 3 example, the question makes sense and the target answer "reason" can be used to answer it given the input sentence. For the second score 2 example, the question is inadequate for answering the sentence since the answer is about prime numbers; however, given the sentence, a reasonable person would give the targeted answer of the question. For the third score 1 example, the question is totally wrong given the sentence and answer.

Score 3
I: -lsb- ... -rsb- for reason is the greatest enemy that faith has ; it never comes to the aid of spiritual things . ”
O: what is the biggest enemy that faith has have ?

Score 2
I: in all other rows -lrb- a = 1 , 2 , 4 , 5 , 7 , and 8 -rrb- there are infinitely many prime numbers .
O: how many numbers are in all other rows ?

Score 1
I: while genghis khan never conquered all of china , his grandson kublai khan completed that conquest and established the yuan dynasty that is often credited with re-uniting china .
O: who did kublai khan defeat that conquered all of china ?

Table 5: Human rating scheme examples. I is the input, O is the output; the underlined words are the target answers.