KEMT/zpwiki

Fork 9

Daniel Hladek aa43c5c8a5 zz

2021-07-12 14:43:10 +02:00

5.3 KiB

Raw Blame History

title

published

taxonomy

Question Answering

true

Question Answering

Project Description

Create a clone of SQuaD 2.0 in the Slovak language
Setup annotation infrastructure with Prodigy
Perform and evaluate annotations of Wikipedia data.

Auxiliary tasks:

Consider using machine translation
Train and evaluate Question Answering model

People

Daniel Hládek (responsible researcher).
Tomáš Kuchárik (student, help with web app).
Ján Staš (BERT model).
Ondrej Megela, Oleh Bilykh, Matej Čarňanský (auxiliary tasks).
other students and annotators (annotations).

Tasks

Raw Data Preparation

Input: Wikipedia

Output: a set of paragraphs

Obtaining and parsing of wikipedia dump
Selecting feasible paragraphs

Done:

Wiki parsing script (Daniel Hládek)
PageRank script (Daniel Hládek)
selection of paragraphs: select all good paragraphs and shuffle
fix minor errors

To be done:

Select the largest articles (to be compatible with squad).

Notes:

PageRank Causes bias to geography, random selection might be the best
75 best articles
167 good articles
Wiki Facts

Question Annotation

An annotation recipe for Prodigy

Input: A set of paragraphs

Output: 5 questions for each paragraph

Done:

a data preparation script (Daniel Hládek)
annotation recipe for Prodigy (Daniel Hládek)
deployment at question.tukekemt.xyz (only from tuke) (Daniel Hládek)
answer annotation together with question (Daniel Hládek)
prepare final input paragraphs (dataset)

In progress:

More annotations (volunteers and workers).

To be done:

Prepare development set

Annotation Web Application

Annotation work summary, web applicatiobn

Input: Database of annotations

Output: Summary of work performed by each annotator

Done:

application template (Tomáš Kuchárik)
Dockerfile (Daniel Hládek)
web application for annotation analysis in Flask (Tomáš Kuchárik, Daniel Hládek)
application deployment (Daniel Hládek)
extract annotations from question annotation in squad format (Daniel Hladek)

To be done:

review of validations

Annotation Validation

Input: annnotated questions and paragraph

Output: good annotated questions

Done:

Recipe for validations (binary annotation for paragraphs, question and answers, text fields for correction of question and answer). (Daniel Hládek)
Deployment

To be done:

Prepare for production

Annotation Manual

Output: Recommendations for annotators

Done:

Web Page for annotators (Daniel Hládek)
Modivation video (Daniel Hládek)
Video with instructions (Daniel Hládek)

In progress:

Should be instructions a part of the annotation webn application?