| title | published | taxonomy |
|---|---|---|
| Question Answering | true | |
# Question Answering

## Project Description

- Create a clone of SQuAD 2.0 in the Slovak language
- Set up annotation infrastructure with Prodigy
- Perform and evaluate annotations of Wikipedia data
 
Auxiliary tasks:
- Consider using machine translation
- Train and evaluate a Question Answering model
 
## People

- Daniel Hládek (responsible researcher)
- Tomáš Kuchárik (student, help with the web app)
- Ján Staš (BERT model)
- Ondrej Megela, Oleh Bilykh, Matej Čarňanský (auxiliary tasks)
- Other students and annotators (annotations)
 
## Tasks

### Raw Data Preparation

Input: Wikipedia
Output: a set of paragraphs

- Obtaining and parsing the Wikipedia dump
- Selecting feasible paragraphs
 
Done:
- Wiki parsing script (Daniel Hládek)
- PageRank script (Daniel Hládek)
- Selection of paragraphs: select all good paragraphs and shuffle (see the sketch at the end of this section)
- Fixed minor errors
 
To be done:
- Select the largest articles (to be compatible with SQuAD)

Notes:
- PageRank causes a bias toward geography; random selection might be best
- 75 best articles
- 167 good articles
- Wiki Facts
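
The paragraph selection logic is only outlined above; here is a minimal sketch of "select all good paragraphs and shuffle", assuming the parsed dump is stored as JSONL with `title` and `text` fields (the length thresholds and the table-markup check are placeholder heuristics, not the project's actual criteria):

```python
import json
import random

def select_paragraphs(infile, outfile, min_chars=500, max_chars=2000, seed=42):
    """Keep paragraphs of reasonable length and shuffle them reproducibly."""
    with open(infile, encoding="utf-8") as f:
        paragraphs = [json.loads(line) for line in f]   # one {"title", "text"} object per line
    good = [
        p for p in paragraphs
        if min_chars <= len(p["text"]) <= max_chars     # skip stubs and very long paragraphs
        and not p["text"].startswith("|")               # skip leftover wiki table markup
    ]
    random.seed(seed)                                   # reproducible shuffle
    random.shuffle(good)
    with open(outfile, "w", encoding="utf-8") as f:
        for p in good:
            f.write(json.dumps(p, ensure_ascii=False) + "\n")
```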
 
### Question Annotation

An annotation recipe for Prodigy

Input: a set of paragraphs
Output: 5 questions for each paragraph

Done:
- A data preparation script (Daniel Hládek)
- Annotation recipe for Prodigy (Daniel Hládek); see the sketch at the end of this section
- Deployment at question.tukekemt.xyz (accessible only from TUKE) (Daniel Hládek)
- Answer annotation together with the question (Daniel Hládek)
- Prepared the final input paragraphs (dataset)
 
In progress:
- More annotations (volunteers and workers)

To be done:
- Prepare a development set
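
The recipe itself is not listed in this README; a minimal sketch of what such a Prodigy question-writing recipe could look like, using Prodigy's documented `blocks` interface (the recipe name, file layout, and single question field are illustrative; the real recipe also collects answer spans):

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "question-writing",
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("JSONL file with paragraphs", "positional", None, str),
)
def question_writing(dataset, source):
    stream = JSONL(source)  # each task: {"title": ..., "text": ...}
    blocks = [
        # show the paragraph the annotator should ask about
        {"view_id": "html", "html_template": "<p>{{text}}</p>"},
        # free-text field for the question
        {"view_id": "text_input", "field_id": "question",
         "field_label": "Question about this paragraph"},
    ]
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }
```

Such a recipe would be started with, e.g., `prodigy question-writing sk_squad paragraphs.jsonl -F recipe.py`.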
 
### Annotation Web Application

Annotation work summary, web application

Input: database of annotations
Output: summary of the work performed by each annotator

Done:
- Application template (Tomáš Kuchárik)
- Dockerfile (Daniel Hládek)
- Web application for annotation analysis in Flask (Tomáš Kuchárik, Daniel Hládek)
- Application deployment (Daniel Hládek)
- Extraction of annotations from the question annotation in SQuAD format (Daniel Hládek); see the sketch at the end of this section
 
To be done:
- Review of validations
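
The SQuAD output format is fixed by the original dataset; a minimal sketch of the extraction step, assuming each stored annotation carries `title`, `text`, `question`, `answer`, and `answer_start` fields (the field names and the id scheme are placeholders for the real database schema):

```python
import json

def to_squad(annotations, outfile, version="v2.0"):
    """Convert accepted annotations to SQuAD 2.0-style JSON."""
    paragraphs = {}
    for a in annotations:
        key = (a["title"], a["text"])
        para = paragraphs.setdefault(key, {"context": a["text"], "qas": []})
        is_impossible = a["answer"] is None          # SQuAD 2.0 allows unanswerable questions
        answers = [] if is_impossible else [
            {"text": a["answer"], "answer_start": a["answer_start"]}
        ]
        para["qas"].append({
            "id": f"{a['title']}-{len(para['qas'])}",  # placeholder id scheme
            "question": a["question"],
            "answers": answers,
            "is_impossible": is_impossible,
        })
    data = [{"title": title, "paragraphs": [para]}
            for (title, _), para in paragraphs.items()]
    with open(outfile, "w", encoding="utf-8") as f:
        json.dump({"version": version, "data": data}, f, ensure_ascii=False, indent=2)
```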
 
### Annotation Validation

Input: annotated questions and paragraphs
Output: good annotated questions

Done:
- Recipe for validations: binary annotation of paragraphs, questions, and answers, with text fields for correcting the question and answer (Daniel Hládek); see the sketch at the end of this section
- Deployment
 
To be done:
- Prepare for production
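
A minimal sketch of such a validation recipe, again using Prodigy's `blocks` interface (Prodigy's standard accept/reject buttons provide the binary judgement; the field names are illustrative):

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "qa-validate",
    dataset=("Dataset to save validations to", "positional", None, str),
    source=("JSONL file with annotated question/answer pairs", "positional", None, str),
)
def qa_validate(dataset, source):
    stream = JSONL(source)  # each task: {"text": ..., "question": ..., "answer": ...}
    blocks = [
        # show the paragraph together with the proposed question and answer
        {"view_id": "html",
         "html_template": "<p>{{text}}</p><p><b>Q:</b> {{question}}</p>"
                          "<p><b>A:</b> {{answer}}</p>"},
        # optional corrections; accept/reject gives the binary validation
        {"view_id": "text_input", "field_id": "question_fix",
         "field_label": "Corrected question (optional)"},
        {"view_id": "text_input", "field_id": "answer_fix",
         "field_label": "Corrected answer (optional)"},
    ]
    return {"dataset": dataset, "stream": stream,
            "view_id": "blocks", "config": {"blocks": blocks}}
```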
 
### Annotation Manual

Output: recommendations for annotators

Done:
- Web page for annotators (Daniel Hládek)
- Motivation video (Daniel Hládek)
- Video with instructions (Daniel Hládek)
 
In progress:
- Should the instructions be part of the annotation web application?
 
### Question Answering Model

Training the model with annotated data

Input: an annotated QA database
Output: an evaluated model for QA

To be done:
- Selecting an existing modelling approach; see the sketch below
- Evaluation set selection
- Model evaluation
- Supporting the annotation with the model (pre-selecting answers)
 
In progress:
- Preliminary model (Ján Staš and Matej Čarňanský)
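
No modelling approach has been selected yet; one plausible route is extractive QA with the Hugging Face `transformers` library, fine-tuning a multilingual BERT-style model on the SQuAD-format export. The sketch below shows how a ready-made English SQuAD 2.0 checkpoint could pre-select answers for annotators; the checkpoint name is only an example of a publicly available model, not the project's choice:

```python
from transformers import pipeline

# Off-the-shelf extractive QA model trained on English SQuAD 2.0;
# for Slovak, a multilingual model fine-tuned on the exported dataset
# would take its place.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

# Pre-select an answer span for the annotator to confirm or correct
result = qa(question="What is the capital of Slovakia?",
            context="Bratislava is the capital of Slovakia and lies on the Danube.")
print(result["answer"], round(result["score"], 3))
```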
 
## Existing implementations

- https://github.com/facebookresearch/DrQA
- https://github.com/brmson/yodaqa
- https://github.com/5hirish/adam_qas
- https://github.com/WDAqua/Qanary - QA methodology and implementation
 
## Bibliography

- Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes (Facebook Research): Reading Wikipedia to Answer Open-Domain Questions
- SQuAD: 100,000+ Questions for Machine Comprehension of Text, https://arxiv.org/abs/1606.05250
- WDAqua publications
 
## Existing Datasets

- The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016)
- WebQuestions
- Freebase
 
## Intern tasks

### Week 1: Intro

- Get acquainted with the project and the SQuAD database
- Download the database and study the bibliography
- Study the Prodigy annotation tool
- Read SQuAD: 100,000+ Questions for Machine Comprehension of Text
- Read Know What You Don't Know: Unanswerable Questions for SQuAD
 
Output:
- Short report
 
### Week 2-4: The System

Select and train a working question answering system

Output:
- A deployment script with comments for the selected question answering system
 
### Week 5-7: The Model

Take a working training recipe (English data can be used) and deliver it as a commented script or a Jupyter Notebook

Output:
- A trained model
- An evaluation of the model, if possible; see the sketch below
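
For the evaluation, SQuAD scores predictions with exact match and token-level F1 over normalized answers; a minimal sketch of both metrics (the article-stripping step is specific to English and would be dropped for Slovak):

```python
import collections
import re
import string

def normalize(text):
    """Lowercase, remove punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # English-only step
    return " ".join(text.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```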