---
title: Question Answering
published: true
taxonomy:
    category: [project]
    tag: [annotation,question-answer,nlp]
    author: Daniel Hladek
---

# Question Answering

- [Project repository](https://git.kemt.fei.tuke.sk/dano/annotation) (private)
- [Annotation Manual for question annotation](navod)
- [Annotation Manual for validations](validacie)
- [Annotation Manual for unanswerable questions](nezodpovedatelne)
- [Summary database application](https://app.question.tukekemt.xyz)

## Project Description

- Create a clone of [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) in the Slovak language.
- Set up annotation infrastructure with [Prodigy](https://prodi.gy/).
- Perform and evaluate annotations of [Wikipedia data](https://dumps.wikimedia.org/backup-index.html).

Auxiliary tasks:

- Consider using machine translation.
- Train and evaluate a Question Answering model.

## People

- Daniel Hládek (responsible researcher).
- Tomáš Kuchárik (student, help with web app).
- Ján Staš (BERT model).
- [Ondrej Megela](/students/2018/ondrej_megela), [Oleh Bilykh](/students/2018/oleh_bilykh), Matej Čarňanský (auxiliary tasks).
- other students and annotators (annotations).

## Finished Tasks

### Raw Data Preparation

Input: Wikipedia

Output: a set of paragraphs

1. Obtaining and parsing the Wikipedia dump
1. Selecting feasible paragraphs

Done:

- Wiki parsing script (Daniel Hládek)
- PageRank script (Daniel Hládek)
- Selection of paragraphs: select all good paragraphs and shuffle
- Fix minor errors

To be done:

- Select the largest articles (to be compatible with SQuAD).

Notes:

- PageRank causes a bias towards geography; random selection might be the best option.
- [75 best articles](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zoznam_najlep%C5%A1%C3%ADch_%C4%8Dl%C3%A1nkov)
- [167 good articles](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zoznam_dobr%C3%BDch_%C4%8Dl%C3%A1nkov)
- [Wiki Facts](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zauj%C3%ADmavosti)

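The paragraph selection step ("select all good paragraphs and shuffle") could be sketched roughly as follows. This is only an illustration: the length thresholds, the `is_feasible` helper, and the fixed random seed are assumptions, not the criteria used by the project's actual script.

```python
import random


def is_feasible(paragraph: str, min_chars: int = 200, max_chars: int = 2000) -> bool:
    """Hypothetical filter: keep prose paragraphs of a reasonable length."""
    text = paragraph.strip()
    if not (min_chars <= len(text) <= max_chars):
        return False
    # Skip list items, table rows and headings left over from wiki markup.
    return not text.startswith(("*", "#", "|", "{", "="))


def select_paragraphs(paragraphs, seed=0):
    """Keep every feasible paragraph and shuffle them reproducibly."""
    selected = [p.strip() for p in paragraphs if is_feasible(p)]
    random.Random(seed).shuffle(selected)
    return selected


if __name__ == "__main__":
    sample = [
        "Bratislava je hlavné mesto Slovenska. " * 10,
        "* položka zoznamu",
        "Krátky text.",
        "Košice sú druhé najväčšie mesto na Slovensku. " * 8,
    ]
    for paragraph in select_paragraphs(sample):
        print(len(paragraph), paragraph[:40])
```

Shuffling with a seeded generator keeps the selection random, which avoids the geography bias noted for PageRank, while still letting the same paragraph order be reproduced later.
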
### Annotation Manual

Output: Recommendations for annotators

Done:

- Web page for annotators (Daniel Hládek)
- Motivation video (Daniel Hládek)
- Video with instructions (Daniel Hládek)

bn application?

### Question Annotation

An annotation recipe for Prodigy

Input: A set of paragraphs

Output: 5 questions for each paragraph

Done:

- A data preparation script (Daniel Hládek)
- Annotation recipe for Prodigy (Daniel Hládek)
- Deployment at [question.tukekemt.xyz](http://question.tukekemt.xyz) (accessible only from TUKE) (Daniel Hládek)
- Answer annotation together with the question (Daniel Hládek)
- Prepare final input paragraphs (dataset)

### Annotation Web Application

A web application that summarizes the annotation work

Input: Database of annotations

Output: Summary of work performed by each annotator

Done:

- Application template (Tomáš Kuchárik)
- Dockerfile (Daniel Hládek)
- Web application for annotation analysis in Flask (Tomáš Kuchárik, Daniel Hládek)
- Application deployment (Daniel Hládek)
- Extraction of question annotations into SQuAD format (Daniel Hládek)

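The per-annotator summary could be computed along these lines. This is a minimal sketch: the record layout (the `annotator` and `answer` keys) is hypothetical, since the real annotation database schema is not described on this page.

```python
from collections import Counter, defaultdict


def summarize_by_annotator(records):
    """Count annotations per annotator, broken down by answer type."""
    summary = defaultdict(Counter)
    for record in records:
        annotator = record.get("annotator", "unknown")
        summary[annotator][record.get("answer", "unknown")] += 1
    # Convert to plain dicts so the result is easy to serialize or render.
    return {name: dict(counts) for name, counts in summary.items()}


if __name__ == "__main__":
    records = [
        {"annotator": "anna", "answer": "accept"},
        {"annotator": "anna", "answer": "reject"},
        {"annotator": "boris", "answer": "accept"},
        {"annotator": "anna", "answer": "accept"},
    ]
    for name, counts in sorted(summarize_by_annotator(records).items()):
        print(name, sum(counts.values()), counts)
```
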
### Annotation Validation

Input: annotated questions and paragraphs

Output: good annotated questions

Done:

- Recipe for validations (binary annotation for paragraphs, questions and answers; text fields for correction of the question and answer). (Daniel Hládek)
- Deployment

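One way to turn validation results into "good annotated questions" is sketched below. The field names (`paragraph_ok`, `question_ok`, `answer_ok`, `question_correction`, `answer_correction`) and the acceptance rule are assumptions made for illustration; the actual recipe may store and combine these judgements differently.

```python
def apply_validation(item, verdict):
    """Return a cleaned copy of `item`, or None if it should be dropped.

    Assumed rule: the paragraph must be marked good; a question or answer
    is kept when it is marked good or when the validator typed a correction.
    """
    if not verdict.get("paragraph_ok", False):
        return None
    cleaned = dict(item)
    for field in ("question", "answer"):
        correction = verdict.get(f"{field}_correction", "").strip()
        if correction:
            # A typed correction replaces the original text.
            cleaned[field] = correction
        elif not verdict.get(f"{field}_ok", False):
            return None
    return cleaned


if __name__ == "__main__":
    item = {"paragraph": "Bratislava je hlavné mesto Slovenska.",
            "question": "Čo je hlavné mesto Slovenska",
            "answer": "Bratislava"}
    verdict = {"paragraph_ok": True, "question_ok": False, "answer_ok": True,
               "question_correction": "Čo je hlavné mesto Slovenska?"}
    print(apply_validation(item, verdict))
```

Note that a corrected answer would also need its character offset in the paragraph recomputed before export; that step is left out of this sketch.
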
## Tasks in progress

### Unanswerable question annotation

Input: validated questions and answers

Output: Unanswerable questions and answers

Done:

- Annotation manual
- Annotation interface
- Database schema modifications
- Modification of the database application
- Export of validations

In progress:

- Annotation process optimization

### Final Data Export

Input: Validations and unanswerable questions

Output: Final database in SQuAD format

Done:

- Preliminary export script

To be done:

- Final export script
- Database web visualization
- Prepare development set

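For orientation, the sketch below groups flat question records into the nested SQuAD 2.0 JSON layout (`data` → `paragraphs` → `qas`, with `is_impossible` marking unanswerable questions). The input record keys and the `to_squad` function are hypothetical and do not reflect the project's export script; locating the answer with `str.find` is also a simplification, since it always picks the first occurrence rather than the annotated span.

```python
import json


def to_squad(records, version="v2.0"):
    """Group flat records into the nested SQuAD 2.0 JSON layout."""
    articles = {}
    for rec in records:
        paragraphs = articles.setdefault(rec["title"], {})
        qas = paragraphs.setdefault(rec["context"], [])
        impossible = rec.get("is_impossible", False)
        answers = []
        if not impossible:
            # Simplification: use the first occurrence of the answer text.
            start = rec["context"].find(rec["answer"])
            if start < 0:
                raise ValueError(f"answer not found in context: {rec['id']}")
            answers.append({"text": rec["answer"], "answer_start": start})
        qas.append({
            "id": rec["id"],
            "question": rec["question"],
            "answers": answers,
            "is_impossible": impossible,
        })
    data = [
        {"title": title,
         "paragraphs": [{"context": c, "qas": q} for c, q in paragraphs.items()]}
        for title, paragraphs in articles.items()
    ]
    return {"version": version, "data": data}


if __name__ == "__main__":
    context = "Bratislava je hlavné mesto Slovenska."
    records = [
        {"id": "1", "title": "Bratislava", "context": context,
         "question": "Čo je hlavné mesto Slovenska?", "answer": "Bratislava"},
        {"id": "2", "title": "Bratislava", "context": context,
         "question": "Koľko obyvateľov má Bratislava?", "is_impossible": True},
    ]
    print(json.dumps(to_squad(records), ensure_ascii=False, indent=2))
```
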
## Resources

### Bibliography

- Reading Wikipedia to Answer Open-Domain Questions, Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes, Facebook Research
- SQuAD: 100,000+ Questions for Machine Comprehension of Text https://arxiv.org/abs/1606.05250
- [WDaqua](https://wdaqua.eu/our-work/) publications

### Existing Datasets

- [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/): The Stanford Question Answering Dataset (Rajpurkar et al., 2016)
- [WebQuestions](https://github.com/brmson/dataset-factoid-webquestions)
- [Freebase](https://en.wikipedia.org/wiki/Freebase)

## Intern tasks

Week 1: Intro

- Get acquainted with the project and the SQuAD database
- Download the database and study the bibliography
- Study the [Prodigy annotation](https://Prodi.gy) tool
- Read [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://arxiv.org/abs/1606.05250)
- Read [Know What You Don't Know: Unanswerable Questions for SQuAD](https://arxiv.org/abs/1806.03822)

Output:

- Short report

Week 2-4: The System

Select and train a working question answering system

Output:

- A deployment script with comments for the selected question answering system

Week 5-7: The Model

Take a working training recipe (English data can be used), in the form of a commented script or a Jupyter Notebook

Output:

- A trained model
- Evaluation of the model (if possible)

### Question Answering Model

Training the model with annotated data

Input: An annotated QA database

Output: An evaluated model for QA

To be done:

- Selecting an existing modelling approach
- Evaluation set selection
- Model evaluation
- Supporting the annotation with the model (pre-selecting answers)

In progress:

- Preliminary model (Ján Staš and Matej Čarňanský)

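Model evaluation on SQuAD-style data is usually reported as exact match (EM) and token-level F1. The sketch below follows the general idea of the official SQuAD evaluation, under stated assumptions: the normalization only lowercases, strips ASCII punctuation, and collapses whitespace, and it omits the English article removal of the official script, which does not apply to Slovak. It is not the project's evaluation code.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Unanswerable questions: two empty answers count as a match.
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("Bratislava.", "bratislava"))
    print(f1_score("v Bratislave", "Bratislave"))
```

When a question has several reference answers, the usual practice is to take the maximum score over the references and then average over all questions.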