| title | published | taxonomy |
|---|---|---|
| Question Answering | true | |
# Question Answering

## Project Description

- Create a clone of SQuAD 2.0 in the Slovak language
- Set up annotation infrastructure with Prodigy
- Perform and evaluate annotations of Wikipedia data
 
Auxiliary tasks:
- Consider using machine translation
- Train and evaluate a Question Answering model
 
## People

- Daniel Hládek (responsible researcher)
- Tomáš Kuchárik (student, help with the web app)
- Ján Staš (BERT model)
- Ondrej Megela, Oleh Bilykh, Matej Čarňanský (auxiliary tasks)
- Other students and annotators (annotations)
 
## Tasks

### Raw Data Preparation

Input: Wikipedia
Output: a set of paragraphs

- Obtaining and parsing the Wikipedia dump
- Selecting feasible paragraphs
 
Done:
- Wiki parsing script (Daniel Hládek)
- PageRank script (Daniel Hládek)
- Selection of paragraphs: select all good paragraphs and shuffle (see the sketch at the end of this section)
- Fixed minor errors
 
To be done:
- Select the largest articles (to be compatible with SQuAD)

Notes:
- PageRank causes a bias toward geography; random selection might be best
- 75 best articles
- 167 good articles
- Wiki Facts
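
The paragraph selection logic is only outlined above; here is a minimal sketch of "select all good paragraphs and shuffle", assuming the parsed dump is stored as JSONL with `title` and `text` fields (the length thresholds and the table-markup check are placeholder heuristics, not the project's actual criteria):

```python
import json
import random

def select_paragraphs(infile, outfile, min_chars=500, max_chars=2000, seed=42):
    """Keep paragraphs of reasonable length and shuffle them reproducibly."""
    with open(infile, encoding="utf-8") as f:
        paragraphs = [json.loads(line) for line in f]   # one {"title", "text"} object per line
    good = [
        p for p in paragraphs
        if min_chars <= len(p["text"]) <= max_chars     # skip stubs and very long paragraphs
        and not p["text"].startswith("|")               # skip leftover wiki table markup
    ]
    random.seed(seed)                                   # reproducible shuffle
    random.shuffle(good)
    with open(outfile, "w", encoding="utf-8") as f:
        for p in good:
            f.write(json.dumps(p, ensure_ascii=False) + "\n")
```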
 
### Question Annotation

An annotation recipe for Prodigy

Input: a set of paragraphs
Output: 5 questions for each paragraph

Done:
- A data preparation script (Daniel Hládek)
- Annotation recipe for Prodigy (Daniel Hládek); see the sketch at the end of this section
- Deployment at question.tukekemt.xyz (accessible only from TUKE) (Daniel Hládek)
- Answer annotation together with the question (Daniel Hládek)
- Prepared the final input paragraphs (dataset)
 
In progress:
- More annotations (volunteers and workers)

To be done:
- Prepare a development set
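
The recipe itself is not listed in this README; a minimal sketch of what such a Prodigy question-writing recipe could look like, using Prodigy's documented `blocks` interface (the recipe name, file layout, and single question field are illustrative; the real recipe also collects answer spans):

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "question-writing",
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("JSONL file with paragraphs", "positional", None, str),
)
def question_writing(dataset, source):
    stream = JSONL(source)  # each task: {"title": ..., "text": ...}
    blocks = [
        # show the paragraph the annotator should ask about
        {"view_id": "html", "html_template": "<p>{{text}}</p>"},
        # free-text field for the question
        {"view_id": "text_input", "field_id": "question",
         "field_label": "Question about this paragraph"},
    ]
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }
```

Such a recipe would be started with, e.g., `prodigy question-writing sk_squad paragraphs.jsonl -F recipe.py`.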
 
### Annotation Web Application

Annotation work summary, web application

Input: database of annotations
Output: summary of the work performed by each annotator

Done:
- Application template (Tomáš Kuchárik)
- Dockerfile (Daniel Hládek)
- Web application for annotation analysis in Flask (Tomáš Kuchárik, Daniel Hládek)
- Application deployment (Daniel Hládek)
- Extraction of annotations from the question annotation in SQuAD format (Daniel Hládek); see the sketch at the end of this section
 
To be done:
- Review of validations
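
The SQuAD output format is fixed by the original dataset; a minimal sketch of the extraction step, assuming each stored annotation carries `title`, `text`, `question`, `answer`, and `answer_start` fields (the field names and the id scheme are placeholders for the real database schema):

```python
import json

def to_squad(annotations, outfile, version="v2.0"):
    """Convert accepted annotations to SQuAD 2.0-style JSON."""
    paragraphs = {}
    for a in annotations:
        key = (a["title"], a["text"])
        para = paragraphs.setdefault(key, {"context": a["text"], "qas": []})
        is_impossible = a["answer"] is None          # SQuAD 2.0 allows unanswerable questions
        answers = [] if is_impossible else [
            {"text": a["answer"], "answer_start": a["answer_start"]}
        ]
        para["qas"].append({
            "id": f"{a['title']}-{len(para['qas'])}",  # placeholder id scheme
            "question": a["question"],
            "answers": answers,
            "is_impossible": is_impossible,
        })
    data = [{"title": title, "paragraphs": [para]}
            for (title, _), para in paragraphs.items()]
    with open(outfile, "w", encoding="utf-8") as f:
        json.dump({"version": version, "data": data}, f, ensure_ascii=False, indent=2)
```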
 
### Annotation Validation

Input: annotated questions and paragraphs
Output: good annotated questions

Done:
- Recipe for validations: binary annotation of paragraphs, questions, and answers, with text fields for correcting the question and answer (Daniel Hládek); see the sketch at the end of this section
- Deployment
 
To be done:
- Prepare for production
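
A minimal sketch of such a validation recipe, again using Prodigy's `blocks` interface (Prodigy's standard accept/reject buttons provide the binary judgement; the field names are illustrative):

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "qa-validate",
    dataset=("Dataset to save validations to", "positional", None, str),
    source=("JSONL file with annotated question/answer pairs", "positional", None, str),
)
def qa_validate(dataset, source):
    stream = JSONL(source)  # each task: {"text": ..., "question": ..., "answer": ...}
    blocks = [
        # show the paragraph together with the proposed question and answer
        {"view_id": "html",
         "html_template": "<p>{{text}}</p><p><b>Q:</b> {{question}}</p>"
                          "<p><b>A:</b> {{answer}}</p>"},
        # optional corrections; accept/reject gives the binary validation
        {"view_id": "text_input", "field_id": "question_fix",
         "field_label": "Corrected question (optional)"},
        {"view_id": "text_input", "field_id": "answer_fix",
         "field_label": "Corrected answer (optional)"},
    ]
    return {"dataset": dataset, "stream": stream,
            "view_id": "blocks", "config": {"blocks": blocks}}
```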
 
### Annotation Manual

Output: recommendations for annotators

Done:
- Web page for annotators (Daniel Hládek)
- Motivation video (Daniel Hládek)
- Video with instructions (Daniel Hládek)
 
In progress:
- Should the instructions be part of the annotation web application?
 
### Question Answering Model

Training the model with annotated data

Input: an annotated QA database
Output: an evaluated model for QA

To be done:
- Selecting an existing modelling approach; see the sketch below
- Evaluation set selection
- Model evaluation
- Supporting the annotation with the model (pre-selecting answers)
 
In progress:
- Preliminary model (Ján Staš and Matej Čarňanský)
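
No modelling approach has been selected yet; one plausible route is extractive QA with the Hugging Face `transformers` library, fine-tuning a multilingual BERT-style model on the SQuAD-format export. The sketch below shows how a ready-made English SQuAD 2.0 checkpoint could pre-select answers for annotators; the checkpoint name is only an example of a publicly available model, not the project's choice:

```python
from transformers import pipeline

# Off-the-shelf extractive QA model trained on English SQuAD 2.0;
# for Slovak, a multilingual model fine-tuned on the exported dataset
# would take its place.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

# Pre-select an answer span for the annotator to confirm or correct
result = qa(question="What is the capital of Slovakia?",
            context="Bratislava is the capital of Slovakia and lies on the Danube.")
print(result["answer"], round(result["score"], 3))
```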
 
## Existing implementations

- https://github.com/facebookresearch/DrQA
- https://github.com/brmson/yodaqa
- https://github.com/5hirish/adam_qas
- https://github.com/WDAqua/Qanary - QA methodology and implementation
 
## Bibliography

- Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes (Facebook Research): Reading Wikipedia to Answer Open-Domain Questions
- SQuAD: 100,000+ Questions for Machine Comprehension of Text, https://arxiv.org/abs/1606.05250
- WDAqua publications
 
## Existing Datasets

- The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016)
- WebQuestions
- Freebase
 
## Intern tasks

### Week 1: Intro

- Get acquainted with the project and the SQuAD database
- Download the database and study the bibliography
- Study the Prodigy annotation tool
- Read SQuAD: 100,000+ Questions for Machine Comprehension of Text
- Read Know What You Don't Know: Unanswerable Questions for SQuAD
 
Output:
- Short report
 
### Week 2-4: The System

Select and train a working question answering system

Output:
- A deployment script with comments for the selected question answering system
 
### Week 5-7: The Model

Take a working training recipe (English data can be used) and deliver it as a commented script or a Jupyter Notebook

Output:
- A trained model
- An evaluation of the model, if possible; see the sketch below
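
For the evaluation, SQuAD scores predictions with exact match and token-level F1 over normalized answers; a minimal sketch of both metrics (the article-stripping step is specific to English and would be dropped for Slovak):

```python
import collections
import re
import string

def normalize(text):
    """Lowercase, remove punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # English-only step
    return " ".join(text.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```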