Daniel Hládek 2020-06-11 14:27:02 +02:00
parent aa2e279ab0
commit 01077d3731

@@ -1,10 +1,17 @@
# Question Answering
[Project repository](https://git.kemt.fei.tuke.sk/dano/annotation)
## Project Description
Task definition:
- Create a clone of [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) in the Slovak language
- Set up annotation infrastructure with [Prodigy](https://prodi.gy/)
- Perform and evaluate annotations of [Wikipedia data](https://dumps.wikimedia.org/backup-index.html)
Auxiliary tasks:
- Consider using machine translation
- Train and evaluate a Question Answering model
@@ -19,6 +26,15 @@ Output: a set of paragraphs
1. Obtaining and parsing a Wikipedia dump
1. Selecting feasible paragraphs
Done:
- Wiki parsing script
- PageRank script
To be done:
- random selection of paragraphs: select all good paragraphs and shuffle
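The random selection step could be sketched as follows. This is only a minimal illustration: the source does not define what makes a paragraph "good", so the length bounds `min_chars` and `max_chars`, the function name, and the fixed seed are all assumptions.

```python
import random


def select_paragraphs(paragraphs, min_chars=200, max_chars=1500,
                      sample_size=None, seed=42):
    """Keep paragraphs that pass a simple filter, then shuffle them.

    The length bounds are a hypothetical stand-in for the real
    "good paragraph" criterion, which the project notes do not specify.
    """
    good = [p for p in paragraphs if min_chars <= len(p) <= max_chars]
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    rng.shuffle(good)
    return good if sample_size is None else good[:sample_size]
```

Shuffling the whole filtered set and then taking a prefix gives a uniform sample without the geographic bias noted for PageRank.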
Notes:
- PageRank causes a bias toward geography; random selection might be the best option
@@ -32,12 +48,32 @@ Input: A set of paragraphs
Output: A question for each paragraph
Done:
- a data preparation script
- annotation running script
To be done:
- final input paragraphs
- deployment
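A data preparation step for question annotation might look like the sketch below. Prodigy reads tasks as newline-delimited JSON with a `text` field and an optional `meta` object; the `source` and `paragraph_id` meta keys and the function name are hypothetical and not taken from the project scripts.

```python
import json


def paragraphs_to_jsonl(paragraphs, source="skwiki"):
    """Serialize paragraphs as one JSON task per line.

    The meta keys "source" and "paragraph_id" are illustrative assumptions.
    """
    lines = []
    for index, text in enumerate(paragraphs):
        task = {"text": text, "meta": {"source": source, "paragraph_id": index}}
        # ensure_ascii=False keeps Slovak diacritics readable in the file
        lines.append(json.dumps(task, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)
```

The returned string can be written to a `.jsonl` file and passed to the annotation recipe as its input source.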
### Answer Annotation
Input: A set of paragraphs and questions
Output: An answer for each paragraph and question
Done:
- a data preparation script
- annotation running script
To be done:
- input paragraphs with questions
- deployment
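Since the goal is a SQuAD 2.0 clone, the annotated answers eventually need to be stored in the SQuAD 2.0 JSON layout. The sketch below shows one possible conversion. The input record keys (`context`, `question`, `answer_start`, `answer_end`) are assumptions about an intermediate format, not the project's actual export.

```python
def to_squad_v2(records, title="sk-quad"):
    """Group flat annotation records into the SQuAD 2.0 JSON structure.

    Each record is a dict with hypothetical keys "context", "question",
    "answer_start" and "answer_end"; a missing or None start marks the
    question as unanswerable.
    """
    by_context = {}
    for number, rec in enumerate(records):
        context = rec["context"]
        start, end = rec.get("answer_start"), rec.get("answer_end")
        impossible = start is None
        answers = [] if impossible else [
            {"text": context[start:end], "answer_start": start}
        ]
        qa = {
            "id": str(number),
            "question": rec["question"],
            "answers": answers,
            "is_impossible": impossible,
        }
        by_context.setdefault(context, []).append(qa)
    paragraphs = [{"context": c, "qas": q} for c, q in by_context.items()]
    return {"version": "v2.0", "data": [{"title": title, "paragraphs": paragraphs}]}
```

Questions that share a paragraph are grouped under a single `context` entry, as in the original dataset.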
### Annotation Summary
Annotation work summary
@@ -46,29 +82,41 @@ Input: Database of annotations
Output: Summary of work performed by each annotator
To be done:
- web application for annotation analysis
- analyze the SQL schema and find out who annotated what
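A first step in analyzing the schema could be to list the tables and columns of the annotation database. The sketch below assumes the annotations are stored in a SQLite file, which the notes do not state explicitly; it relies only on standard SQLite introspection and makes no assumption about the actual table names.

```python
import sqlite3


def describe_schema(connection):
    """Return {table_name: [column_name, ...]} for every table in the database."""
    cursor = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    tables = [row[0] for row in cursor.fetchall()]
    schema = {}
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
        columns = connection.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [column[1] for column in columns]
    return schema
```

For a real database, the function could be called as `describe_schema(sqlite3.connect("prodigy.db"))`, where the file name is a placeholder.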
### Annotation Manual
Output: Recommendations for annotators
TBD
### Question Answering Model
Training the model with annotated data
Input: An annotated QA database
Output: An evaluated model for QA
To be done:
- Selecting an existing modelling approach
- Evaluation set selection
- Model evaluation
- Supporting the annotation with the model (pre-selecting answers)
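For the model evaluation step, SQuAD-style systems are usually scored with exact match and token-level F1. The following is a simplified sketch of those two metrics, not the official evaluation script: it lowercases, strips ASCII punctuation, and collapses whitespace, and it omits the English article removal that the official script performs, since that step does not carry over to Slovak.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    if not pred_tokens or not truth_tokens:
        # Both empty counts as a match (unanswerable question answered as such)
        return float(pred_tokens == truth_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

With several reference answers per question, the usual convention is to take the maximum score over the references.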
### Supporting activities
Output: More annotations
Organizing voluntary student challenges to support the annotation process
TBD
## Existing implementations
- https://github.com/facebookresearch/DrQA
@@ -87,3 +135,34 @@ Facebook Research
- SQuAD: The Stanford Question Answering Dataset (Rajpurkar et al., 2016)
- WebQuestions
- https://en.wikipedia.org/wiki/Freebase
## Intern tasks
Week 1: Intro
- Get acquainted with the project and the SQuAD database
- Download the database and study the bibliography
Week 2 and 3: Web Application
- Analyze the SQL schema of Prodigy annotations
- Find out who annotated what.
- Make a web application that displays results.
- Extend the application to analyze more Prodigy instances (for both question and answer annotations)
- Improve the process of annotation.
Output: Web application (in Node.js or Python) and Dockerfile
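The core of such a web application could be a small aggregation over annotation records, sketched below in Python. Prodigy records carry an `answer` field with the annotator's decision; the `annotator` key used to identify who made the annotation is a hypothetical name and would need to be replaced with whatever the schema analysis reveals.

```python
from collections import Counter, defaultdict


def summarize(annotations, annotator_key="annotator", decision_key="answer"):
    """Count decisions per annotator.

    annotator_key is an assumed field name; records without it are
    grouped under "unknown".
    """
    summary = defaultdict(Counter)
    for record in annotations:
        who = record.get(annotator_key, "unknown")
        summary[who][record.get(decision_key, "unknown")] += 1
    return {who: dict(counts) for who, counts in summary.items()}
```

The resulting dictionary can be rendered directly as a table, with one row per annotator and one column per decision.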
Week 4-7: The model
Select and train a working question answering system
Output:
- a deployment script with comments for a selected question answering system
- a working training recipe (can use English data), a script with comments or a Jupyter Notebook
- a trained model
- evaluation of the model (if possible)