| title | published | taxonomy |
|---|---|---|
| Question Answering | true | |

Question Answering
Project repository (private)
Project Description
- Create a clone of SQuAD 2.0 in the Slovak language
- Set up annotation infrastructure with Prodigy
- Perform and evaluate annotations of Wikipedia data
 
Auxiliary tasks:
- Consider using machine translation
- Train and evaluate a Question Answering model
 
Tasks
Raw Data Preparation
Input: Wikipedia
Output: a set of paragraphs
- Obtaining and parsing the Wikipedia dump
- Selecting feasible paragraphs
 
Done:
- Wiki parsing script (Daniel Hládek)
- PageRank script (Daniel Hládek)
 
To be done:
- Random selection of paragraphs: select all good paragraphs and shuffle them
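The selection-and-shuffle step could be sketched like this; the minimum-length filter and the fixed seed are illustrative assumptions, not project decisions:

```python
import random

def select_paragraphs(paragraphs, min_chars=500, seed=42):
    """Keep paragraphs long enough to support a question, then shuffle.

    The 500-character threshold and the fixed seed are assumed values,
    chosen only to make the sketch concrete and reproducible.
    """
    good = [p for p in paragraphs if len(p.strip()) >= min_chars]
    rng = random.Random(seed)  # seeded RNG keeps the selection reproducible
    rng.shuffle(good)
    return good
```

A seeded shuffle avoids the geographic bias of PageRank-based ranking while keeping the sample reproducible across runs.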
 
Notes:
- PageRank causes a bias toward geography; random selection might be best
- 75 best articles
- 167 good articles
- Wiki Facts
 
Question Annotation
An annotation recipe for Prodigy
Input: A set of paragraphs
Output: A question for each paragraph
Done:
- a data preparation script (Daniel Hládek)
- annotation recipe (Daniel Hládek)
- deployment at question.tukekemt.xyz (only from TUKE) (Daniel Hládek)
- answer annotation together with question (Daniel Hládek)
 
To be done:
- prepare final input paragraphs (dataset)
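Prodigy reads its input stream as JSONL, one task object per line with the paragraph under `"text"`; a minimal data preparation sketch (the `"meta"` layout here is an assumption, not the project's actual format) could be:

```python
import json

def paragraphs_to_jsonl(paragraphs, path):
    """Write paragraphs as Prodigy-style JSONL tasks.

    Emits one {"text": ...} object per line; the "meta" block with a
    paragraph id is an assumed convention for tracing tasks back to
    the source dump.
    """
    with open(path, "w", encoding="utf-8") as f:
        for i, text in enumerate(paragraphs):
            task = {"text": text, "meta": {"paragraph_id": i}}
            # ensure_ascii=False keeps Slovak diacritics readable in the file
            f.write(json.dumps(task, ensure_ascii=False) + "\n")
```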
 
Answer Annotation
Input: A set of paragraphs and questions
Output: An answer for each paragraph and question
Done:
- a data preparation script (Daniel Hládek)
- annotation recipe (Daniel Hládek)
- deployment at answer.tukekemt.xyz (only from TUKE) (Daniel Hládek)
 
To be done:
- extract annotations from question annotation
- input paragraphs with questions (dataset)
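Extracting question annotations into answer-annotation input might look like the following sketch. The `"answer": "accept"` decision field is Prodigy's standard convention; the `"text"` and `"question"` field names are assumptions about how the question recipe stores its output:

```python
def build_answer_tasks(question_annotations):
    """Keep only accepted question annotations and turn them into
    answer-annotation tasks (paragraph plus question).

    Field names "text" and "question" are assumed; adjust them to the
    actual schema of the question-annotation export.
    """
    tasks = []
    for rec in question_annotations:
        if rec.get("answer") != "accept":
            continue  # skip rejected and ignored paragraphs
        tasks.append({"text": rec["text"], "question": rec["question"]})
    return tasks
```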
 
Annotation Summary
Annotation work summary
Input: Database of annotations
Output: Summary of work performed by each annotator
Done:
- application template (Tomáš Kuchárik)
- Dockerfile (Daniel Hládek)
 
In progress:
- web application for annotation analysis (Tomáš Kuchárik, Flask)
- application deployment (Daniel Hládek)
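Once the annotations are exported from Prodigy (e.g. with `prodigy db-out`), the per-annotator summary reduces to a small aggregation. This sketch assumes each exported record carries Prodigy's `_session_id` (naming the annotator session) and the `answer` decision:

```python
from collections import Counter

def summarize_annotators(records):
    """Count accept/reject/ignore decisions per annotator session.

    Assumes Prodigy's JSONL export, where "_session_id" identifies the
    annotator session and "answer" holds the decision; returns a dict
    mapping session name to a Counter of decisions.
    """
    summary = {}
    for rec in records:
        who = rec.get("_session_id", "unknown")
        summary.setdefault(who, Counter())[rec.get("answer", "none")] += 1
    return summary
```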
 
Annotation Validation
Input: annotated questions and paragraphs
Output: good annotated questions
In Progress:
- Design validation recipe (Tomáš Kuchárik)
 
To do:
- Implement and deploy validation recipe (Tomáš Kuchárik)
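A validation recipe needs some notion of what a "good" annotation is. A minimal baseline check could look like the sketch below; these three rules are illustrative assumptions, not the project's actual validation criteria:

```python
def validate_annotation(paragraph, question, answer):
    """Return a list of problems with one annotated (paragraph,
    question, answer) triple; an empty list means it passes.

    The rules (non-empty question, answer found verbatim in the
    paragraph, answer shorter than the paragraph) are an assumed
    baseline only.
    """
    problems = []
    if not question.strip():
        problems.append("empty question")
    if answer and answer not in paragraph:
        problems.append("answer not found in paragraph")
    if answer and len(answer) >= len(paragraph):
        problems.append("answer as long as the paragraph")
    return problems
```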
 
Annotation Manual
Output: Recommendations for annotators
To be done:
- Web Page for annotators (Tomáš Kuchárik)
 
In progress:
- Introductory text and references to annotation (Tomáš Kuchárik)
 
Question Answering Model
Training the model with annotated data
Input: An annotated QA database
Output: An evaluated model for QA
To be done:
- Selecting an existing modelling approach
- Evaluation set selection
- Model evaluation
- Supporting the annotation with the model (pre-selecting answers)
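For model evaluation, the SQuAD metrics (exact match and token-level F1) can be sketched as follows. This is a simplified version of the official evaluation logic; the real script also drops English articles, which would not carry over to Slovak:

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    common = Counter(pred) & Counter(ref)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because `\w` matches Unicode letters in Python 3, the normalization keeps Slovak diacritics intact.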
 
Supporting activities
Output: More annotations
Organizing voluntary student challenges to support the annotation process
TBD
Existing implementations
- https://github.com/facebookresearch/DrQA
- https://github.com/brmson/yodaqa
- https://github.com/5hirish/adam_qas
- https://github.com/WDAqua/Qanary - QA methodology and implementation
 
Bibliography
- Reading Wikipedia to Answer Open-Domain Questions, Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes, Facebook Research
- SQuAD: 100,000+ Questions for Machine Comprehension of Text, https://arxiv.org/abs/1606.05250
- WDAqua publications
 
Existing Datasets
- The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016)
- WebQuestions
- Freebase
 
Intern tasks
Week 1: Intro
- Get acquainted with the project and the SQuAD database
- Download the database and study the bibliography
- Study the Prodigy annotation tool
- Read SQuAD: 100,000+ Questions for Machine Comprehension of Text
- Read Know What You Don't Know: Unanswerable Questions for SQuAD
 
Week 2 and 3: Web Application
- Analyze the SQL schema of Prodigy annotations
- Find out who annotated what
- Make a web application that displays the results
- Extend the application to analyze more Prodigy instances (for both question and answer annotations)
- Improve the annotation process
 
Output: Web application (in Node.js or Python) and Dockerfile
Week 4-7: The Model
Select and train a working question answering system
Output:
- a deployment script with comments for a selected question answering system
- a working training recipe (can use English data), as a commented script or Jupyter Notebook
- a trained model
- evaluation of the model (if possible)