Question Answering
Project repository (private)
Project Description
- Create a clone of SQuAD 2.0 in the Slovak language
- Set up annotation infrastructure with Prodigy
- Perform and evaluate annotations of Wikipedia data
Auxiliary tasks:
- Consider using machine translation
- Train and evaluate Question Answering model
Tasks
Raw Data Preparation
Input: Wikipedia
Output: a set of paragraphs
- Obtain and parse a Wikipedia dump
- Select feasible paragraphs
Done:
- Wiki parsing script
- PageRank script
To be done:
- random selection of paragraphs: select all good paragraphs and shuffle
Notes:
- PageRank biases the selection toward geography articles; random selection might be the best option
- 75 best articles
- 167 good articles
- Wiki Facts
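The random selection step could be sketched as follows. This is a minimal sketch, not the project's script: the function name `select_paragraphs`, the character-length thresholds used to decide what counts as a "good" paragraph, and the fixed seed are all assumptions made for illustration.

```python
import random


def select_paragraphs(paragraphs, min_chars=200, max_chars=1500,
                      sample_size=None, seed=42):
    """Keep paragraphs within a length range, then shuffle them reproducibly.

    The length thresholds are hypothetical; the real criteria for a
    "good" paragraph are not specified in the project notes.
    """
    good = [p for p in paragraphs if min_chars <= len(p) <= max_chars]
    rng = random.Random(seed)  # fixed seed so the selection can be repeated
    rng.shuffle(good)
    return good if sample_size is None else good[:sample_size]
```

Shuffling the whole filtered set and taking a prefix keeps the selection unbiased with respect to article order, which avoids the topical bias noted for PageRank.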
Question Annotation
An annotation recipe for Prodigy
Input: A set of paragraphs
Output: A question for each paragraph
Done:
- a data preparation script
- annotation recipe
- deployment at question.tukekemt.xyz (accessible only from TUKE)
- answer annotation together with question
To be done:
- prepare final input paragraphs (dataset)
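Preparing the input paragraphs might look like the sketch below. It writes one JSON object per line (JSONL) with the paragraph under the `text` key, which is the input layout Prodigy commonly reads. The helper name `paragraphs_to_jsonl` and the `meta` field names (`title`, `paragraph_id`) are assumptions for illustration, not the project's actual script.

```python
import json


def paragraphs_to_jsonl(paragraphs, path):
    """Write (title, text) pairs as JSONL annotation tasks.

    Each task has a "text" field plus a "meta" dictionary; the meta
    field names are hypothetical.
    """
    with open(path, "w", encoding="utf-8") as f:
        for i, (title, text) in enumerate(paragraphs):
            task = {"text": text, "meta": {"title": title, "paragraph_id": i}}
            # ensure_ascii=False keeps Slovak diacritics readable in the file
            f.write(json.dumps(task, ensure_ascii=False) + "\n")
```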
Answer Annotation
Input: A set of paragraphs and questions
Output: An answer for each paragraph and question
Done:
- a data preparation script
- annotation recipe
- deployment at answer.tukekemt.xyz (accessible only from TUKE)
To be done:
- extract annotations from question annotation
- input paragraphs with questions (dataset)
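Since the goal is a SQuAD 2.0 clone, the output of answer annotation would eventually need to be stored in the SQuAD 2.0 JSON layout. The sketch below builds one article entry for a single paragraph, question, and answer. The helper name `to_squad_example` is hypothetical, and locating the answer with `str.find` is a simplification: in practice the character offset recorded by the annotation tool should be used, because the answer text may occur more than once in the paragraph.

```python
def to_squad_example(qa_id, title, context, question, answer_text=None):
    """Build one SQuAD 2.0 style article entry.

    answer_text=None marks the question as unanswerable
    (is_impossible=True), as in SQuAD 2.0.
    """
    if answer_text is None:
        answers, is_impossible = [], True
    else:
        start = context.find(answer_text)  # first occurrence only (simplification)
        if start < 0:
            raise ValueError("answer must be a span of the context")
        answers = [{"text": answer_text, "answer_start": start}]
        is_impossible = False
    return {
        "title": title,
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": qa_id,
                "question": question,
                "answers": answers,
                "is_impossible": is_impossible,
            }],
        }],
    }
```

A full dataset file would wrap a list of such entries as `{"version": "v2.0", "data": [...]}`.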
Annotation Summary
Annotation work summary
Input: Database of annotations
Output: Summary of work performed by each annotator
Done:
- application template
- Dockerfile
In progress:
- web application for annotation analysis (Tomáš Kuchárik, Flask)
- application deployment
To be done:
- analyze the SQL schema and find out who annotated what
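A per-annotator summary could be computed along the lines of the sketch below. It assumes a schema with `dataset`, `example`, and `link` tables, where `example.content` holds the annotated task as JSON and the task carries `_annotator_id` and `answer` keys. These table and key names are assumptions that must be checked against the actual Prodigy database before use; the function name is hypothetical.

```python
import json
import sqlite3
from collections import Counter


def annotations_per_annotator(conn, dataset_name):
    """Count annotations per (annotator, answer) pair in one dataset.

    The table and JSON key names are assumed, not confirmed
    against the deployed database.
    """
    query = """
        SELECT example.content
        FROM example
        JOIN link ON link.example_id = example.id
        JOIN dataset ON dataset.id = link.dataset_id
        WHERE dataset.name = ?
    """
    counts = Counter()
    for (content,) in conn.execute(query, (dataset_name,)):
        if isinstance(content, bytes):  # content may be stored as a BLOB
            content = content.decode("utf-8")
        task = json.loads(content)
        annotator = task.get("_annotator_id", "unknown")
        counts[(annotator, task.get("answer", "unknown"))] += 1
    return counts
```

The function takes an open `sqlite3` connection, so the same code can be pointed at each Prodigy instance in turn.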
Annotation Manual
Output: Recommendations for annotators
To be done:
- Web Page for annotators
Question Answering Model
Training the model with annotated data
Input: An annotated QA database
Output: An evaluated model for QA
To be done:
- Selecting an existing modelling approach
- Evaluation set selection
- Model evaluation
- Supporting the annotation with the model (pre-selecting answers)
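Model evaluation on a SQuAD-style dataset is usually reported as exact match (EM) and token-level F1. The sketch below is a simplified version of those two metrics, not the official SQuAD evaluation script: it lowercases, strips ASCII punctuation, and collapses whitespace, and it skips the removal of English articles that the official script performs, on the assumption that this step is not needed for Slovak.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    # For unanswerable questions both sides are empty and count as a match.
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

When a question has several reference answers, the usual convention is to take the maximum score over the references and then average over all questions.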
Supporting activities
Output: More annotations
Organizing voluntary student challenges to support the annotation process
TBD
Existing implementations
- https://github.com/facebookresearch/DrQA
- https://github.com/brmson/yodaqa
- https://github.com/5hirish/adam_qas
- https://github.com/WDAqua/Qanary - QA methodology and implementation
Bibliography
- Reading Wikipedia to Answer Open-Domain Questions, Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes (Facebook Research)
- SQuAD: 100,000+ Questions for Machine Comprehension of Text https://arxiv.org/abs/1606.05250
- WDAqua publications
Existing Datasets
- SQuAD: The Stanford Question Answering Dataset (Rajpurkar et al., 2016)
- WebQuestions
- Freebase
Intern tasks
Week 1: Intro
- Get acquainted with the project and the SQuAD dataset
- Download the dataset and study the bibliography
- Study the Prodigy annotation tool
Weeks 2 and 3: Web Application
- Analyze the SQL schema of Prodigy annotations
- Find out who annotated what
- Make a web application that displays results.
- Extend the application to analyze more Prodigy instances (for both question and answer annotations)
- Improve the process of annotation.
Output: Web application (in Node.js or Python) and Dockerfile
Weeks 4-7: The model
Select and train a working question answering system
Output:
- a deployment script with comments for a selected question answering system
- a working training recipe (can use English data), a script with comments or Jupyter Notebook
- a trained model
- evaluation of the model (if possible)