
Question Answering

Project repository (private)

Project Description

Auxiliary tasks:

  • Consider using machine translation
  • Train and evaluate Question Answering model

Tasks

Raw Data Preparation

Input: Wikipedia

Output: a set of paragraphs

  1. Obtain and parse a Wikipedia dump
  2. Select feasible paragraphs

Done:

  • Wiki parsing script
  • PageRank script
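
The PageRank script presumably scores pages by the Wikipedia link structure. As an illustration of the underlying computation (not the actual script), here is a minimal pure-Python power-iteration sketch; the toy graph and parameters are only for demonstration:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: page -> list of outgoing links."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                # each page passes its damped rank evenly to its link targets
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
        # redistribute the rank of dangling pages (no outgoing links) uniformly
        dangling = sum(rank[p] for p in pages if not links.get(p))
        for p in pages:
            new_rank[p] += damping * dangling / n
        rank = new_rank
    return rank

# toy link graph for demonstration
toy = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy)
```

The ranks form a probability distribution over pages, so they sum to one; a real run would build `links` from the parsed dump.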

To be done:

  • Random selection of paragraphs: select all good paragraphs and shuffle them
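
A minimal sketch of the planned selection step, assuming "good" means a paragraph within plausible length bounds and free of heading residue (the bounds and the criteria are illustrative assumptions, not the final definition); a fixed seed makes the shuffle reproducible:

```python
import random

def select_paragraphs(paragraphs, min_chars=200, max_chars=2000, seed=42):
    """Keep paragraphs within illustrative length bounds, drop obvious
    wiki-heading residue, and shuffle the result reproducibly."""
    good = [p for p in paragraphs
            if min_chars <= len(p) <= max_chars and not p.startswith("=")]
    rng = random.Random(seed)  # seeded RNG -> same order on every run
    rng.shuffle(good)
    return good
```

With a fixed seed, two runs over the same input produce the same order, which keeps the annotation batches stable.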

Question Annotation

An annotation recipe for Prodigy

Input: A set of paragraphs

Output: A question for each paragraph

Done:

  • a data preparation script
  • annotation recipe
  • deployment at question.tukekemt.xyz (accessible only from the TUKE network)
  • answer annotation together with question

To be done:

  • prepare final input paragraphs (dataset)
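
The data preparation script could emit the paragraphs in the JSONL format that Prodigy reads: one task per line, with the paragraph under the standard `"text"` key. A minimal sketch (the `meta` field and its contents are illustrative assumptions):

```python
import json

def paragraphs_to_tasks(paragraphs):
    """Convert paragraphs into Prodigy-style JSONL task lines; each task
    carries the paragraph under the "text" key plus an illustrative id."""
    return [json.dumps({"text": p, "meta": {"paragraph_id": i}},
                       ensure_ascii=False)
            for i, p in enumerate(paragraphs)]

# Writing the lines to a file yields the JSONL input for the recipe, e.g.:
#   with open("paragraphs.jsonl", "w", encoding="utf-8") as f:
#       f.write("\n".join(paragraphs_to_tasks(paragraphs)))
```

`ensure_ascii=False` keeps Slovak diacritics readable in the output file.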

Answer Annotation

Input: A set of paragraphs and questions

Output: An answer for each paragraph and question

Done:

  • a data preparation script
  • annotation recipe
  • deployment at answer.tukekemt.xyz (accessible only from the TUKE network)

To be done:

  • extract annotations from the question annotation step
  • prepare input paragraphs with questions (dataset)
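
Pairing the extracted question annotations with their paragraphs could look roughly like this sketch. The field names (`"text"`, `"question"`) are assumptions about how the question recipe stores its output; `"accept"` is Prodigy's marker for examples the annotator kept:

```python
def build_answer_tasks(question_annotations):
    """Turn accepted question annotations into input tasks for the
    answer annotation step. Field names are illustrative assumptions."""
    tasks = []
    for ann in question_annotations:
        if ann.get("answer") != "accept":  # skip rejected/ignored examples
            continue
        tasks.append({
            "text": ann["text"],                    # the paragraph to show
            "meta": {"question": ann["question"]},  # question for the annotator
        })
    return tasks
```

The resulting tasks can then be serialized to JSONL as in the question annotation step.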

Annotation Summary

Annotation work summary

Input: Database of annotations

Output: Summary of work performed by each annotator

In progress:

  • web application for annotation analysis (Tomáš Kuchárik, Flask)

To be done:

  • analyze the SQL schema and find out who annotated what
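
A sketch of the "who annotated what" aggregation, assuming a simplified single-table schema for illustration only; the real Prodigy database schema has to be inspected first, and the table and column names below are hypothetical:

```python
import sqlite3

# In-memory stand-in for the annotation database; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example (id INTEGER, annotator TEXT, dataset TEXT)")
conn.executemany("INSERT INTO example VALUES (?, ?, ?)", [
    (1, "alice", "questions"),
    (2, "alice", "questions"),
    (3, "bob", "answers"),
])

# Count annotations per annotator and dataset.
summary = conn.execute(
    "SELECT annotator, dataset, COUNT(*) FROM example "
    "GROUP BY annotator, dataset ORDER BY annotator"
).fetchall()
```

The same `GROUP BY` query, run against the real schema, is what the Flask application would render as the per-annotator summary.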

Annotation Manual

Output: Recommendations for annotators

To be done:

  • Web Page for annotators

Question Answering Model

Training the model with annotated data

Input: An annotated QA database

Output: An evaluated model for QA

To be done:

  • Selecting an existing modelling approach
  • Evaluation set selection
  • Model evaluation
  • Supporting the annotation with the model (pre-selecting answers)
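
For the model evaluation step, the standard SQuAD metrics — exact match and token-level F1 over normalized answers — could be computed roughly as follows. This is a sketch of the commonly used formulation, not the official SQuAD evaluation script:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse
    whitespace (the usual SQuAD answer normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    """Token-overlap F1 between normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For Slovak data the article-stripping step would need adjusting, since the `a|an|the` list is English-specific.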

Supporting activities

Output: More annotations

Organizing voluntary student challenges to support the annotation process

TBD

Existing implementations

Bibliography

  • Reading Wikipedia to Answer Open-Domain Questions, Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes, Facebook Research
  • SQuAD: 100,000+ Questions for Machine Comprehension of Text https://arxiv.org/abs/1606.05250
  • WDaqua publications

Existing Datasets

Intern tasks

Week 1: Intro

  • Get acquainted with the project and the SQuAD database
  • Download the database and study the bibliography
  • Study the Prodigy annotation tool

Week 2 and 3: Web Application

  • Analyze the SQL schema of Prodigy annotations
  • Find out who annotated what
  • Make a web application that displays the results
  • Extend the application to analyze more Prodigy instances (for both question and answer annotations)
  • Improve the annotation process

Output: Web application (in Node.js or Python) and Dockerfile

Week 4-7: The Model

Select and train a working question answering system

Output:

  • a deployment script with comments for a selected question answering system
  • a working training recipe (can use English data), a script with comments or Jupyter Notebook
  • a trained model
  • evaluation of the model (if possible)