---
title: Question Answering
published: true
taxonomy:
    category: [project]
    tag: [annotation,question-answer,nlp]
    author: Daniel Hladek
---
# Question Answering

- [Project repository](https://git.kemt.fei.tuke.sk/dano/annotation) (private)
- [Annotation Manual for question annotation](navod)
- [Annotation Manual for validations](validacie)
- [Summary database application](https://app.question.tukekemt.xyz)


## Project Description

- Create a clone of [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) in the Slovak language
- Set up annotation infrastructure with [Prodigy](https://prodi.gy/)
- Perform and evaluate annotations of [Wikipedia data](https://dumps.wikimedia.org/backup-index.html).

Auxiliary tasks:

- Consider using machine translation 
- Train and evaluate a Question Answering model

## People

- Daniel Hládek (responsible researcher).
- Tomáš Kuchárik (student, help with web app).
- Ján Staš (BERT model).
- [Ondrej Megela](/students/2018/ondrej_megela), [Oleh Bilykh](/students/2018/oleh_bilykh), Matej Čarňanský (auxiliary tasks).
- other students and annotators (annotations).

## Tasks

### Raw Data Preparation

Input: Wikipedia

Output: a set of paragraphs

1. Obtain and parse the Wikipedia dump
1. Select feasible paragraphs

Done:

- Wiki parsing script (Daniel Hládek)
- PageRank script (Daniel Hládek)
- selection of paragraphs: select all good paragraphs and shuffle
- fix minor errors


To be done:

- Select the largest articles (to be compatible with SQuAD).

Notes:

- PageRank causes a bias toward geography; random selection might be the best option
- [75 best articles](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zoznam_najlep%C5%A1%C3%ADch_%C4%8Dl%C3%A1nkov)
- [167 good articles](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zoznam_dobr%C3%BDch_%C4%8Dl%C3%A1nkov)
- [Wiki Facts](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zauj%C3%ADmavosti)
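
The "select all good paragraphs and shuffle" step listed above could look roughly like the sketch below. The source does not say what makes a paragraph "good", so the length window (`min_chars`, `max_chars`) and the function name `select_paragraphs` are assumptions made only for illustration.

```python
import random


def select_paragraphs(paragraphs, min_chars=500, max_chars=2000, seed=0):
    """Keep paragraphs inside a length window, then shuffle them reproducibly.

    The length thresholds are hypothetical; the project's real selection
    criteria are not documented here.
    """
    good = [p for p in paragraphs if min_chars <= len(p.strip()) <= max_chars]
    # A dedicated Random instance keeps the shuffle independent of global state.
    rng = random.Random(seed)
    rng.shuffle(good)
    return good
```

A fixed seed makes the order repeatable, so the same batch of paragraphs can be regenerated if the annotation input has to be rebuilt.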

### Question Annotation

An annotation recipe for Prodigy

Input: A set of paragraphs

Output: 5 questions for each paragraph

Done: 

- a data preparation script (Daniel Hládek)
- annotation recipe for Prodigy (Daniel Hládek)
- deployment at [question.tukekemt.xyz](http://question.tukekemt.xyz) (accessible only from TUKE) (Daniel Hládek)
- answer annotation together with question (Daniel Hládek)
- prepare final input paragraphs (dataset)

In progress:

- More annotations (volunteers and workers).

To be done:

- Prepare development set 
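
One possible way to prepare the development set is to split at the article level, as SQuAD does, so that paragraphs from one article never appear in both sets. This is a minimal sketch: the record layout (a dict with a `title` key), the 10% default, and the name `split_by_article` are assumptions, not the project's actual code.

```python
import random


def split_by_article(records, dev_fraction=0.1, seed=0):
    """Split annotation records into train and dev sets by article title.

    Each record is assumed to be a dict with at least a "title" key.
    """
    titles = sorted({r["title"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(titles)
    # Hold out at least one article so the dev set is not empty.
    n_dev = max(1, int(len(titles) * dev_fraction))
    dev_titles = set(titles[:n_dev])
    train = [r for r in records if r["title"] not in dev_titles]
    dev = [r for r in records if r["title"] in dev_titles]
    return train, dev
```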


### Annotation Web Application

A web application that summarizes the annotation work

Input: Database of annotations

Output: Summary of work performed by each annotator

Done:

- application template (Tomáš Kuchárik)
- Dockerfile (Daniel Hládek)
- web application for annotation analysis in Flask (Tomáš Kuchárik, Daniel Hládek)
- application deployment (Daniel Hládek)
- extraction of annotations from question annotation in SQuAD format (Daniel Hládek)


To be done:

- review of validations
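
The SQuAD-format extraction listed under "Done" can be illustrated with the sketch below, which groups flat annotation records into the nested SQuAD 2.0 JSON layout. The input record fields (`title`, `context`, `question`, `answer`) and the function name `to_squad` are hypothetical. The sketch also locates the answer with `str.find`, which returns only the first occurrence; a real export would more likely use the character offset stored by the annotation tool.

```python
import json
from collections import defaultdict


def to_squad(records, version="v2.0"):
    """Group flat annotation records into the nested SQuAD 2.0 layout."""
    by_title = defaultdict(lambda: defaultdict(list))
    for i, r in enumerate(records):
        answer = r.get("answer")  # None marks an unanswerable question
        answers = []
        if answer is not None:
            start = r["context"].find(answer)
            if start < 0:
                continue  # skip answers that are not a span of the paragraph
            answers.append({"text": answer, "answer_start": start})
        by_title[r["title"]][r["context"]].append({
            "id": str(i),
            "question": r["question"],
            "answers": answers,
            "is_impossible": answer is None,
        })
    data = [
        {"title": title,
         "paragraphs": [{"context": c, "qas": qas} for c, qas in paras.items()]}
        for title, paras in by_title.items()
    ]
    return {"version": version, "data": data}
```

The result can be written out with `json.dump(to_squad(records), f, ensure_ascii=False)` so that Slovak diacritics stay readable in the file.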

### Annotation Validation

Input: annotated questions and paragraphs

Output: good annotated questions

Done:

- Recipe for validations (binary annotation for paragraphs, questions and answers; text fields for correcting the question and answer) (Daniel Hládek)
- Deployment 

To be done:

- Prepare for production

### Annotation Manual

Output: Recommendations for annotators

Done:

- Web Page for annotators  (Daniel Hládek)
- Motivation video (Daniel Hládek)
- Video with instructions (Daniel Hládek)

In progress:

- Should the instructions be a part of the annotation web application?

### Question Answering Model

Training the model with annotated data

Input: An annotated QA database

Output: An evaluated model for QA

To be done:

- Selecting existing modelling approach
- Evaluation set selection
- Model evaluation
- Supporting the annotation with the model (pre-selecting answers)

In progress:

- Preliminary model (Ján Staš and Matej Čarňanský)
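
For the model evaluation task, SQuAD-style systems are usually scored with exact match (EM) and token-level F1. The sketch below is a simplified version of those two metrics, not the official SQuAD evaluation script: the official script also strips the English articles "a", "an" and "the", which is left out here because Slovak has no articles. The function names are illustrative.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Two empty answers match (an unanswerable question answered with "").
        return float(pred_tokens == true_tokens)
    common = Counter(pred_tokens) & Counter(true_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)
```

When a question has several reference answers, the usual convention is to take the maximum score over the references and then average over all questions.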


## Existing implementations

- https://github.com/facebookresearch/DrQA
- https://github.com/brmson/yodaqa
- https://github.com/5hirish/adam_qas
- https://github.com/WDAqua/Qanary - methodology and implementation of QA

## Bibliography

- Reading Wikipedia to Answer Open-Domain Questions, Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes, Facebook Research
- SQuAD: 100,000+ Questions for Machine Comprehension of Text https://arxiv.org/abs/1606.05250
- [WDaqua](https://wdaqua.eu/our-work/) publications

## Existing Datasets

- [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/): The Stanford Question Answering Dataset (Rajpurkar et al., 2016)
- [WebQuestions](https://github.com/brmson/dataset-factoid-webquestions)
- [Freebase](https://en.wikipedia.org/wiki/Freebase)

## Intern tasks

Week 1: Intro

- Get acquainted with the project and the SQuAD database
- Download the database and study the bibliography
- Study the [Prodigy annotation](https://Prodi.gy) tool
- Read [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://arxiv.org/abs/1606.05250)
- Read [Know What You Don't Know: Unanswerable Questions for SQuAD](https://arxiv.org/abs/1806.03822)

Output:

- Short report

Week 2-4: The System

Select and train a working question answering system

Output:

- a deployment script with comments for a selected question answering system 

Week 5-7: The Model

Take a working training recipe (English data can be used), either a commented script or a Jupyter Notebook

Output:

- a trained model
- evaluation of the model (if possible)