
title: Question Answering
published: true
taxonomy:
    category: project
    tag: annotation, question-answer, nlp
    author: Daniel Hladek

Question Answering

Project Description

Auxiliary tasks:

  • Consider using machine translation
  • Train and evaluate Question Answering model

People

  • Daniel Hládek (responsible researcher).
  • Tomáš Kuchárik (student, help with web app).
  • Ján Staš (BERT model).
  • Ondrej Megela, Oleh Bilykh, Matej Čarňanský (auxiliary tasks).
  • other students and annotators (annotations).

Tasks

Raw Data Preparation

Input: Wikipedia

Output: a set of paragraphs

  1. Obtaining and parsing the Wikipedia dump
  2. Selecting feasible paragraphs

Done:

  • Wiki parsing script (Daniel Hládek)
  • PageRank script (Daniel Hládek)
  • selection of paragraphs: select all good paragraphs and shuffle
  • fix minor errors
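
The selection step listed above (keep all good paragraphs, then shuffle) could be sketched roughly as follows. This is only an illustration: the `is_good` helper and its length thresholds are assumptions, since the project's actual selection criteria are not stated here.

```python
import random


def is_good(paragraph, min_chars=200, max_chars=2000):
    """Hypothetical quality filter: keep paragraphs of moderate length.

    The thresholds are illustrative assumptions, not the project's values.
    """
    text = paragraph.strip()
    return min_chars <= len(text) <= max_chars


def select_paragraphs(paragraphs, seed=0):
    """Keep every paragraph that passes the filter, then shuffle the result."""
    good = [p.strip() for p in paragraphs if is_good(p)]
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    rng.shuffle(good)
    return good
```

A fixed seed is used so that the same input dump always yields the same annotation order, which makes it easier to re-run the preparation without reshuffling work already assigned.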

To be done:

  • Select the largest articles (to be compatible with SQuAD).

Notes:

Finished Tasks

Annotation Manual

Output: Recommendations for annotators

Done:

  • Web Page for annotators (Daniel Hládek)
  • Motivation video (Daniel Hládek)
  • Video with instructions (Daniel Hládek) bn application?

Question Annotation

An annotation recipe for Prodigy

Input: A set of paragraphs

Output: 5 questions for each paragraph

Done:

  • a data preparation script (Daniel Hládek)
  • annotation recipe for Prodigy (Daniel Hládek)
  • deployment at question.tukekemt.xyz (accessible only from TUKE) (Daniel Hládek)
  • answer annotation together with question (Daniel Hládek)
  • prepare final input paragraphs (dataset)

Annotation Web Application

Annotation work summary, web application

Input: Database of annotations

Output: Summary of work performed by each annotator

Done:

  • application template (Tomáš Kuchárik)
  • Dockerfile (Daniel Hládek)
  • web application for annotation analysis in Flask (Tomáš Kuchárik, Daniel Hládek)
  • application deployment (Daniel Hládek)
  • extract annotations from question annotation in SQuAD format (Daniel Hládek)
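
The summary of work per annotator could be computed along these lines. This is a minimal sketch, not the deployed Flask application: the record keys `annotator` and `answer` and the value `"accept"` are assumptions about how annotation records might look after loading them from the database.

```python
from collections import Counter


def summarize_by_annotator(annotations):
    """Count accepted annotations per annotator.

    Each record is assumed to be a dict with the hypothetical keys
    "annotator" (a name or session id) and "answer" ("accept", "reject"
    or "ignore"). Records with any other answer are not counted.
    """
    counts = Counter()
    for record in annotations:
        if record.get("answer") == "accept":
            counts[record["annotator"]] += 1
    return dict(counts)
```

For example, `summarize_by_annotator(records)` returns a dictionary mapping each annotator to the number of accepted items, which a web view can then render as a table.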

Annotation Validation

Input: annotated questions and paragraphs

Output: good annotated questions

Done:

  • Recipe for validations (binary annotation for paragraphs, questions and answers; text fields for correcting the question and answer). (Daniel Hládek)
  • Deployment

Tasks in progress

Unanswerable question annotation

Input: validated questions and answers

Output: Unanswerable questions and answers

Done:

  • Annotation manual
  • Annotation interface
  • Database schema modifications
  • Modification of the database application
  • Export of validations

In progress:

  • Annotation process optimization

Final Data Export

Input: Validations and unanswerable questions

Output: Final database in SQuAD format

Done:

  • Preliminary export script

To be done:

  • Final export script
  • Database web visualization
  • Prepare development set
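
As an illustration of the target layout, the sketch below groups flat question records into the nested SQuAD 2.0 JSON structure, where unanswerable questions carry `is_impossible: true` and an empty answer list. The input record keys (`id`, `title`, `context`, `question`, `answer`, `unanswerable`) are assumptions made for this example and do not describe the project's database schema. The answer offset is located with a simple substring search, which picks the first occurrence; an export based on stored annotation offsets would be more precise.

```python
import json


def to_squad(records, version="v2.0"):
    """Group flat QA records into the nested SQuAD 2.0 layout."""
    articles = {}  # title -> {context -> list of question entries}
    for rec in records:
        answers = []
        if not rec["unanswerable"]:
            # Assumption: the answer text occurs verbatim in the context.
            start = rec["context"].find(rec["answer"])
            if start < 0:
                raise ValueError(f"answer not found in context: {rec['id']}")
            answers.append({"text": rec["answer"], "answer_start": start})
        qa = {
            "id": rec["id"],
            "question": rec["question"],
            "answers": answers,
            "is_impossible": rec["unanswerable"],
        }
        articles.setdefault(rec["title"], {}).setdefault(rec["context"], []).append(qa)
    data = [
        {
            "title": title,
            "paragraphs": [{"context": c, "qas": qas} for c, qas in paragraphs.items()],
        }
        for title, paragraphs in articles.items()
    ]
    return {"version": version, "data": data}


def dump_squad(records):
    """Serialize to JSON, keeping non-ASCII characters readable."""
    return json.dumps(to_squad(records), ensure_ascii=False, indent=2)
```

Passing `ensure_ascii=False` keeps diacritics readable in the exported file instead of escaping them.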

Resources

Bibliography

  • Reading Wikipedia to Answer Open-Domain Questions, Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes (Facebook Research)
  • SQuAD: 100,000+ Questions for Machine Comprehension of Text https://arxiv.org/abs/1606.05250
  • WDaqua publications

Existing Datasets

Intern tasks

Week 1: Intro

Output:

  • Short report

Week 2-4: The System

Select and train a working question answering system

Output:

  • a deployment script with comments for a selected question answering system

Week 5-7: The Model

Take a working training recipe (English data can be used), either a commented script or a Jupyter Notebook

Output:

  • a trained model
  • evaluation of the model (if possible)

Question Answering Model

Training the model with annotated data

Input: An annotated QA database

Output: An evaluated model for QA

To be done:

  • Selecting an existing modelling approach
  • Evaluation set selection
  • Model evaluation
  • Supporting the annotation with the model (pre-selecting answers)
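
For the model evaluation step, the two metrics commonly reported on SQuAD-style data are exact match and token-level F1. The following is a simplified sketch of both, not the official SQuAD evaluation script: the official script also removes the English articles "a", "an" and "the" during normalization, which is omitted here on the assumption that the annotated data is not English.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip ASCII punctuation and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Token-level F1 between a predicted and a gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Two empty answers count as a match (an unanswerable question
        # correctly predicted as unanswerable); otherwise the score is 0.
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

When a question has several gold answers, the usual convention is to take the maximum score over all of them and then average over the evaluation set.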

In progress:

  • Preliminary model (Ján Staš and Matej Čarňanský)