This paper presents a new selection-based question answering dataset, SelQA. The dataset consists of questions generated through crowdsourcing and sentence-length answers drawn from the ten most prevalent topics in the English Wikipedia. We introduce a corpus annotation scheme that enhances the generation of large, diverse, and challenging datasets by explicitly aiming to reduce word co-occurrences between questions and answers. Our annotation scheme is composed of a series of crowdsourcing tasks designed to utilize crowdsourcing more effectively in the creation of question answering datasets across domains. Several systems are compared on the tasks of answer sentence selection and answer triggering, providing strong baseline results for future work to improve upon.
SelQA: A New Benchmark for Selection-based Question Answering
1. SelQA: A New Benchmark for
Selection-based Question Answering
Tomasz Jurczyk*, Michael Zhai, Jinho D. Choi
https://github.com/emorynlp/question-answering/
ICTAI 2016
11/8/2016
2. Selection-based Question Answering
How many airports are in Vietnam?
- Vietnam operates 21 major civil airports, including three
international gateways: (...)
- Tan Son Nhat is the nation's largest airport, handling (...)
- According to a state-approved plan, Vietnam will have 10
international airports by 2015
- The planned Long Thanh International Airport will have an
annual service capacity of (...)
- (...)
● A ranking problem: select the exact answer among the candidates
● A single question may have more than one correct answer.
3. Tasks in Selection-based Question Answering
Answer Sentence Selection
● The original task in question answering
● It originated as a ranking problem
● At least one answer is guaranteed to be among the
candidates
● Measured by MAP and MRR scores.
Answer Triggering
● A recently proposed, more advanced version of the
answer sentence selection task
● The assumption of having at least one answer
no longer holds
● Thus, the task can no longer be treated as a ranking
problem
● Significantly more complex and difficult
● Measured by precision and recall.
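As a reference for the two evaluation settings, MAP and MRR over ranked candidate lists with binary relevance labels can be computed as below. This is a generic sketch of the standard metrics, not the paper's evaluation code:

```python
def mrr(rankings):
    """Mean Reciprocal Rank: average of 1/rank of the first correct answer."""
    total = 0.0
    for labels in rankings:  # labels: candidates in ranked order, 1 = correct
        for rank, rel in enumerate(labels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(rankings)

def average_precision(labels):
    """AP for one question: mean precision at each correct answer's rank."""
    hits, precisions = 0, []
    for rank, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def map_score(rankings):
    """Mean Average Precision over all questions."""
    return sum(average_precision(labels) for labels in rankings) / len(rankings)

# Example: two questions whose candidates are already sorted by model score.
ranked = [[0, 1, 0, 1], [1, 0, 0, 0]]
print(mrr(ranked))        # (1/2 + 1/1) / 2 = 0.75
print(map_score(ranked))  # ((1/2 + 2/4)/2 + 1/1) / 2 = 0.75
```

For answer triggering, where a question may have no correct candidate at all, these ranking metrics are replaced by precision and recall over the triggering decisions.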
4. SelQA - A New Benchmark for Question Answering Tasks
● A corpus based on documents of various topics drawn from Wikipedia
● An effective annotation scheme is proposed to create a large dataset
● Additional annotations of questions w.r.t. topics, types, and paraphrases are provided
● Two recent state-of-the-art systems based on convolutional and recurrent
neural networks are implemented to provide strong baselines
6. Task 1 & 2
● (Task 1) Given a section, annotators are asked to generate a question
● (Task 2) Given the same section with previously used sentences highlighted,
the annotators are asked to generate another question
● The annotators are provided with the instructions, the topic, the article title, the
section title, and the list of numbered sentences in the section
● The question should be supported by one or more sentences in the paragraph
Observation: annotators tend to generate questions with some lexical overlap
with the corresponding contexts.
7. Task 3
● Given the context and the previously generated questions, the annotators are
asked to paraphrase the question.
● A necessary step in creating a corpus that evaluates reading comprehension
rather than the ability to model word co-occurrences.
Observation: a significant drop in word co-occurrence between questions and contexts
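The overlap that paraphrasing reduces can be quantified with a simple token-overlap measure. The sketch below uses illustrative sentences and a naive whitespace tokenizer; the paper's own co-occurrence statistic may be defined differently:

```python
def lexical_overlap(question, sentence):
    """Fraction of question tokens that also appear in the candidate sentence."""
    q_tokens = set(question.lower().split())
    s_tokens = set(sentence.lower().split())
    return len(q_tokens & s_tokens) / len(q_tokens) if q_tokens else 0.0

# Illustrative example: a paraphrase shares fewer words with the context.
original = "How many major civil airports does Vietnam operate ?"
paraphrase = "What is the number of significant civilian airfields in Vietnam ?"
answer = "Vietnam operates 21 major civil airports , including three international gateways ."

print(lexical_overlap(original, answer))    # high overlap with the context
print(lexical_overlap(paraphrase, answer))  # lower overlap after paraphrasing
```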
8. Task 4
Observation: despite the high quality of the questions constructed in Tasks 1-3,
some questions can only be answered given additional context
Example: “How were the initial reviews?”
● Elasticsearch is used to select such suspicious questions
● The selected questions are sent back to Amazon Mechanical Turk
12. Answer Triggering Data Set - Task 5
● Automatically generated using the previously generated questions
● Elasticsearch is used to index the entire Wikipedia (~14 million sections)
and to query each question against the index
● For each question, the top 5 most relevant sections are selected, regardless of
whether they contain the answer
● As a result, 40.76% of the questions have corresponding answer contexts,
compared to 39.25% in the WikiQA data set.
13. Neural Network approaches used for evaluation
● Two systems, one using a convolutional neural network and one using a
recurrent neural network, are used for evaluation
● Additionally, we propose a subtree matching mechanism for measuring
contextual similarity between two sentences (applied with the ConvNet system)
1. Convolutional neural network model: a single convolution with max pooling,
used as a feature in logistic regression model with several lexical features
(including subtree matching features)
2. Recurrent neural network model: GRU-based bidirectional RNN with
attention.
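The single-convolution-plus-max-pooling step of the ConvNet model (item 1) can be sketched in NumPy as below; the filter count, filter width, and embedding dimension are illustrative values, not the paper's actual hyperparameters:

```python
import numpy as np

def sentence_vector(embeddings, filters, width=3):
    """Single 1-D convolution over word embeddings followed by max pooling.

    embeddings: (n_words, dim) matrix of word vectors
    filters:    (n_filters, width * dim) convolution filters
    Returns a fixed-size (n_filters,) sentence representation.
    """
    n_words, dim = embeddings.shape
    # Slide a window of `width` words and flatten each window into one vector.
    windows = [embeddings[i:i + width].reshape(-1)
               for i in range(n_words - width + 1)]
    conv = np.tanh(np.stack(windows) @ filters.T)  # (n_windows, n_filters)
    return conv.max(axis=0)                        # max pooling over positions

rng = np.random.default_rng(0)
words = rng.normal(size=(7, 50))        # a 7-word sentence, 50-dim embeddings
filters = rng.normal(size=(100, 150))   # 100 filters spanning 3 words
vec = sentence_vector(words, filters)
print(vec.shape)  # (100,)
```

The resulting fixed-size vector is what the system feeds, alongside lexical features, into the logistic regression classifier.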
15. The subtree matching mechanism
For every common word w_i between question q and sentence a, calculate a similarity score
based on the similarity of the words’ parent, sibling, and child nodes
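The paper defines the exact scoring function; the following is a simplified sketch, assuming dependency trees given as (word, head-index) lists and using Jaccard overlap of context words in place of the paper's similarity measure:

```python
def context(tree, i):
    """Words at the parent, sibling, and child positions of token i.
    tree: list of (word, head) pairs; the root's head index is -1."""
    _, head = tree[i]
    parent = [tree[head][0]] if head >= 0 else []
    children = [w for j, (w, h) in enumerate(tree) if h == i]
    siblings = [w for j, (w, h) in enumerate(tree)
                if j != i and h == head and h >= 0]
    return parent + siblings + children

def subtree_match(q_tree, a_tree):
    """For every word shared by question and sentence, score how similar the
    word's tree contexts are (Jaccard overlap of context words)."""
    a_index = {w: j for j, (w, _) in enumerate(a_tree)}
    scores = {}
    for i, (w, _) in enumerate(q_tree):
        if w in a_index:
            q_ctx = set(context(q_tree, i))
            a_ctx = set(context(a_tree, a_index[w]))
            union = q_ctx | a_ctx
            scores[w] = len(q_ctx & a_ctx) / len(union) if union else 0.0
    return scores

# Toy parses of "vietnam operates airports" / "vietnam operates many airports".
q_tree = [("vietnam", 1), ("operates", -1), ("airports", 1)]
a_tree = [("vietnam", 1), ("operates", -1), ("many", 3), ("airports", 1)]
scores = subtree_match(q_tree, a_tree)
```

Here "vietnam" gets a perfect score (identical contexts in both trees), while "airports" scores lower because it gains the extra child "many" in the sentence tree.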
20. Conclusion & Future Work
● A new benchmark for selection-based question answering is presented
● Several configurations of state-of-the-art neural network
models are used to analyze the introduced corpus
● Analyses of various aspects w.r.t. different models are shown
● More research on context-aware QA systems is needed
● We plan to continue working on large-scale corpora for open-domain question
answering.