From 01077d37310f3befcf8b4d921f863777559b687c Mon Sep 17 00:00:00 2001
From: Daniel Hladek
Date: Thu, 11 Jun 2020 14:27:02 +0200
Subject: [PATCH] zz

---
 pages/topics/question/README.md | 89 +++++++++++++++++++++++++++++++--
 1 file changed, 84 insertions(+), 5 deletions(-)

diff --git a/pages/topics/question/README.md b/pages/topics/question/README.md
index 9e21dde5..93a1122a 100644
--- a/pages/topics/question/README.md
+++ b/pages/topics/question/README.md
@@ -1,10 +1,17 @@
 # Question Answering
 
+[Project repository](https://git.kemt.fei.tuke.sk/dano/annotation)
+
+## Project Description
+
 Task definition:
 
-- Create a clone of SQuaD 2.0 in Slovak language
-- Setup annotation infrastructure
-- Perform and evaluate annotations
+- Create a clone of [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) in the Slovak language
+- Set up annotation infrastructure with [Prodigy](https://prodi.gy/)
+- Perform and evaluate annotations of [Wikipedia data](https://dumps.wikimedia.org/backup-index.html).
+
+Auxiliary tasks:
+
 - Consider using machine translation
 - Train and evaluate Question Answering model
 
@@ -19,6 +26,15 @@ Output: a set of paragraphs
 1. Obtaining and parsing of wikipedia dump
 1. Selecting feasible paragraphs
 
+Done:
+
+- Wiki parsing script
+- PageRank script
+
+To be done:
+
+- random selection of paragraphs: select all good paragraphs and shuffle
+
 Notes:
 
 - PageRank Causes bias to geography, random selection might be the best
@@ -32,12 +48,32 @@ Input: A set of paragraphs
 
 Output: A question for each paragraph
 
+Done:
+
+- a data preparation script
+- annotation running script
+
+To be done:
+
+- final input paragraphs
+- deployment
+
 ### Answer Annotation
 
 Input: A set of paragraphs and questions
 
 Output: An answer for each paragraph and question
 
+Done:
+
+- a data preparation script
+- annotation running script
+
+To be done:
+
+- input paragraphs with questions
+- deployment
+
 ### Annotation Summary
 
 Annotation work summary
@@ -46,29 +82,41 @@ Input: Database of annotations
 
 Output: Summary of work performed by each annotator
 
+To be done:
+
+- web application for annotation analysis
+- analyze SQL schema and find out who annotated what
+
 ### Annotation Manual
 
 Output: Recommendations for annotators
 
+TBD
+
 ### Question Answering Model
 
+Training the model with annotated data
+
 Input: An annotated QA database
 
-Otput: An evaluated model for QA
+Output: An evaluated model for QA
 
-Traing the model with annotated data:
+To be done:
 
 - Selecting existing modelling approach
 - Evaluation set selection
 - Model evaluation
 - Supporting the annotation with the model (pre-selecting answers)
 
+
 ### Supporting activities
 
 Output: More annotations
 
 Organizing voluntary student challenges to support the annotation process
 
+TBD
+
 ## Existing implementations
 
 - https://github.com/facebookresearch/DrQA
@@ -87,3 +135,34 @@ Facebook Research
 - Squad TheStanfordQuestionAnsweringDataset(SQuAD) (Rajpurkar et al., 2016)
 - WebQuestions
 - https://en.wikipedia.org/wiki/Freebase
+
+## Intern tasks
+
+Week 1: Intro
+
+- Get acquainted with the project and the SQuAD database
+- Download the database and study the bibliography
+
+Week 2 and 3: Web Application
+
+- Analyze SQL schema of Prodigy annotations
+- Find out who annotated what.
+- Make a web application that displays results.
+- Extend the application to analyze more Prodigy instances (for both question and answer annotations)
+- Improve the process of annotation.
+
+Output: Web application (in Node.js or Python) and Dockerfile
+
+Week 4-7: The model
+
+Select and train a working question answering system
+
+Output:
+
+- a deployment script with comments for a selected question answering system
+- a working training recipe (can use English data), a script with comments or Jupyter Notebook
+- a trained model
+- evaluation of the model (if possible)
+
+
+