---
title: Question Answering
published: true
taxonomy:
    category: [project]
    tag: [annotation,question-answer,nlp]
author: Daniel Hladek
---
# Question Answering
[Project repository](https://git.kemt.fei.tuke.sk/dano/annotation) (private)
## Project Description
- Create a clone of [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) in the Slovak language
- Set up annotation infrastructure with [Prodigy](https://prodi.gy/)
- Perform and evaluate annotations of [Wikipedia data](https://dumps.wikimedia.org/backup-index.html)
Auxiliary tasks:
- Consider using machine translation
- Train and evaluate a question answering model
## Tasks
### Raw Data Preparation
Input: Wikipedia
Output: a set of paragraphs
1. Obtain and parse a Wikipedia dump
1. Select feasible paragraphs
Done:
- Wiki parsing script (Daniel Hládek)
- PageRank script (Daniel Hládek)
- selection of paragraphs: select all good paragraphs and shuffle
To be done:
- fix minor errors
Notes:
- PageRank biases the selection toward geography articles; random selection might work best
- [75 best articles](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zoznam_najlep%C5%A1%C3%ADch_%C4%8Dl%C3%A1nkov)
- [167 good articles](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zoznam_dobr%C3%BDch_%C4%8Dl%C3%A1nkov)
- [Wiki Facts](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zauj%C3%ADmavosti)
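
The "select all good paragraphs and shuffle" step could look roughly like the sketch below. It is only an illustration: the function name `select_paragraphs` and the length thresholds are assumptions, not values taken from the project scripts, and "good" is reduced here to a simple length window.

```python
import random


def select_paragraphs(paragraphs, min_chars=200, max_chars=1500, seed=0):
    """Keep paragraphs inside a length window and return them shuffled.

    The thresholds are hypothetical defaults; tune them on real data.
    """
    good = [p.strip() for p in paragraphs if min_chars <= len(p.strip()) <= max_chars]
    rng = random.Random(seed)  # a fixed seed keeps the selection reproducible
    rng.shuffle(good)
    return good
```

Shuffling with a fixed seed gives a random selection, which avoids the topical bias noted above, while still letting the same dataset be regenerated later.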
### Question Annotation
An annotation recipe for Prodigy
Input: A set of paragraphs
Output: A question for each paragraph
Done:
- a data preparation script (Daniel Hládek)
- annotation recipe (Daniel Hládek)
- deployment at [question.tukekemt.xyz](http://question.tukekemt.xyz) (accessible only from TUKE) (Daniel Hládek)
- answer annotation together with question (Daniel Hládek)
To be done:
- prepare final input paragraphs (dataset)
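
Preparing the input paragraphs for a Prodigy recipe might be sketched as follows. Prodigy reads tasks as JSON lines with a `text` field and an optional `meta` dictionary; the helper names and the `source` and `paragraph_id` meta keys below are assumptions for illustration, not the project's actual data preparation script.

```python
import json


def paragraphs_to_tasks(paragraphs, source="skwiki"):
    """Yield one annotation task per paragraph (meta keys are hypothetical)."""
    for index, text in enumerate(paragraphs):
        yield {"text": text, "meta": {"source": source, "paragraph_id": index}}


def to_jsonl(tasks):
    """Serialize tasks as JSON lines, keeping Slovak diacritics readable."""
    return "".join(json.dumps(task, ensure_ascii=False) + "\n" for task in tasks)
```

Writing the returned string to a `.jsonl` file gives one task per line, which is the shape a custom recipe would load as its stream.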
### Answer Annotation
Input: A set of paragraphs and questions
Output: An answer for each paragraph and question
Done:
- a data preparation script (Daniel Hládek)
- annotation recipe (Daniel Hládek)
- deployment at [answer.tukekemt.xyz](http://answer.tukekemt.xyz) (accessible only from TUKE) (Daniel Hládek)
To be done:
- extract annotations from question annotation
- input paragraphs with questions (dataset)
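
Since the goal is a SQuAD 2.0 clone, each annotated answer eventually has to be stored in the SQuAD 2.0 JSON layout. The sketch below builds one article entry in that layout; the function name `to_squad_entry` is hypothetical, and locating the answer with `str.find` is a simplification that assumes the answer text occurs once in the paragraph. If the annotation tool records character offsets, those should be used instead.

```python
def to_squad_entry(title, context, question, answer_text, qid):
    """Build one SQuAD 2.0 style entry; answer_text=None marks an unanswerable question."""
    answers = []
    if answer_text is not None:
        start = context.find(answer_text)  # first occurrence only (simplification)
        if start < 0:
            raise ValueError("answer text not found in context")
        answers.append({"text": answer_text, "answer_start": start})
    return {
        "title": title,
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": qid,
                "question": question,
                "is_impossible": answer_text is None,
                "answers": answers,
            }],
        }],
    }
```

A full dataset file wraps a list of such entries as `{"version": "v2.0", "data": [...]}`.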
### Annotation Summary
Annotation work summary
Input: Database of annotations
Output: Summary of work performed by each annotator
Done:
- application template (Tomáš Kuchárik)
- Dockerfile (Daniel Hládek)
In progress:
- web application for annotation analysis (Tomáš Kuchárik, Flask)
- application deployment (Daniel Hládek)
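
The core of the per-annotator summary could be computed along these lines. This is a sketch under assumptions: it expects annotations already loaded as dictionaries, takes the annotator from a `_session_id` field (present in recent Prodigy versions, but this should be verified against the actual database), and groups by the `answer` field that Prodigy sets to `accept`, `reject` or `ignore`.

```python
from collections import Counter


def summarize_by_annotator(examples, annotator_key="_session_id"):
    """Count answers (accept/reject/ignore) per annotator.

    annotator_key is an assumption; adjust it to the field the deployment stores.
    """
    counts = {}
    for example in examples:
        annotator = example.get(annotator_key, "unknown")
        answer = example.get("answer", "unknown")
        counts.setdefault(annotator, Counter())[answer] += 1
    return counts
```

The resulting mapping can be passed directly to a Flask template to render one row per annotator.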
### Annotation Validation
Input: annotated questions and paragraphs
Output: good annotated questions
In Progress:
- Design validation recipe (Tomáš Kuchárik)
To do:
- Implement and deploy validation recipe (Tomáš Kuchárik)
### Annotation Manual
Output: Recommendations for annotators
To be done:
- Web Page for annotators (Tomáš Kuchárik)
In progress:
- Introductory text and references to annotation (Tomáš Kuchárik)
### Question Answering Model
Training the model with annotated data
Input: An annotated QA database
Output: An evaluated model for QA
To be done:
- Selecting an existing modelling approach
- Evaluation set selection
- Model evaluation
- Supporting the annotation with the model (pre-selecting answers)
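
For the model evaluation step, SQuAD-style evaluation usually reports exact match (EM) and token-level F1 between a predicted and a reference answer. The sketch below is a simplified version of those metrics, not the official SQuAD evaluation script: it lowercases, strips ASCII punctuation and collapses whitespace, and it deliberately skips the English article removal of the original, which does not apply to Slovak.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match (unanswerable question handled correctly).
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)
```

When a question has several reference answers, the usual convention is to take the maximum score over the references and then average over all questions.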
### Supporting activities
Output: More annotations
Organizing voluntary student challenges to support the annotation process
TBD
## Existing implementations
- https://github.com/facebookresearch/DrQA
- https://github.com/brmson/yodaqa
- https://github.com/5hirish/adam_qas
- https://github.com/WDAqua/Qanary - methodology and implementation of QA
## Bibliography
- Reading Wikipedia to Answer Open-Domain Questions, Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes (Facebook Research)
- SQuAD: 100,000+ Questions for Machine Comprehension of Text https://arxiv.org/abs/1606.05250
- [WDaqua](https://wdaqua.eu/our-work/) publications
## Existing Datasets
- [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/): The Stanford Question Answering Dataset (Rajpurkar et al., 2016)
- [WebQuestions](https://github.com/brmson/dataset-factoid-webquestions)
- [Freebase](https://en.wikipedia.org/wiki/Freebase)
## Intern tasks
Week 1: Intro
- Get acquainted with the project and the SQuAD dataset
- Download the database and study the bibliography
- Study the [Prodigy annotation](https://Prodi.gy) tool
- Read [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://arxiv.org/abs/1606.05250)
- Read [Know What You Don't Know: Unanswerable Questions for SQuAD](https://arxiv.org/abs/1806.03822)
Week 2 and 3: Web Application
- Analyze the SQL schema of Prodigy annotations
- Find out who annotated what.
- Make a web application that displays results.
- Extend the application to analyze more Prodigy instances (for both question and answer annotations)
- Improve the process of annotation.
Output: Web application (in Node.js or Python) and Dockerfile
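
A first step toward analyzing the schema could be a small inspection script like the one below. It assumes the Prodigy instance uses its default SQLite backend; the table and column names are not hard-coded, because the sketch simply lists whatever the database file contains.

```python
import sqlite3


def list_tables(connection):
    """Return the names of all tables in an SQLite database."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def list_columns(connection, table):
    """Return (column name, declared type) pairs for one table."""
    # PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk).
    rows = connection.execute(f'PRAGMA table_info("{table}")').fetchall()
    return [(row[1], row[2]) for row in rows]
```

Calling `list_tables(sqlite3.connect(path))` on each Prodigy database file (the path depends on the deployment) shows where examples and datasets are stored before any queries for "who annotated what" are written.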
Week 4-7: The model
Select and train a working question answering system
Output:
- a deployment script with comments for a selected question answering system
- a working training recipe (can use English data), a script with comments or Jupyter Notebook
- a trained model
- evaluation of the model (if possible)