du707zr/dmytro_ushatenko

Fork 0

forked from KEMT/zpwiki

Daniel Hladek d88f27564f zz

2020-06-11 14:43:08 +02:00

3.9 KiB

Raw Blame History

Question Answering

Project repository (private)

Project Description

Create a clone of SQuaD 2.0 in the Slovak language
Setup annotation infrastructure with Prodigy
Perform and evaluate annotations of Wikipedia data.

Auxiliary tasks:

Consider using machine translation
Train and evaluate Question Answering model

Tasks

Raw Data Preparation

Input: Wikipedia

Output: a set of paragraphs

Obtaining and parsing of wikipedia dump
Selecting feasible paragraphs

Done:

Wiki parsing script
PageRank script

To be done:

random selection of paragraphs: select all good paragraphs and shuffle

Notes:

PageRank Causes bias to geography, random selection might be the best
75 best articles
167 good articles
Wiki Facts

Question Annotation

Input: A set of paragraphs

Output: A question for each paragraph

Done:

a data preparation script
annotation running script

To be done:

final input paragraphs
deployment

Answer Annotation

Input: A set of paragraphs and questions

Output: An answer for each paragraph and question

Done:

a data preparation script
annotation running script

To be done:

input paragraphs with questions
deployment

Annotation Summary

Annotation work summary

Input: Database of annotations

Output: Summary of work performed by each annotator

To be done:

web application for annotation analysis
analyze sql schema and find out who annotated what

Annotation Manual

Output: Recommendations for annotators

TBD

Question Answering Model

Training the model with annotated data

Input: An annotated QA database

Output: An evaluated model for QA

To be done:

Selecting existing modelling approach
Evaluation set selection
Model evaluation
Supporting the annotation with the model (pre-selecting answers)

Supporting activities

Output: More annotations

Organizing voluntary student challenges to support the annotation process

TBD

Existing implementations

https://github.com/facebookresearch/DrQA
https://github.com/brmson/yodaqa
https://github.com/5hirish/adam_qas
https://github.com/WDAqua/Qanary - metodológia a implementácia QA

Bibligraphy

Reading Wikipedia to Answer Open-Domain Questions, Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes Facebook Research
SQuAD: 100,000+ Questions for Machine Comprehension of Text https://arxiv.org/abs/1606.05250
WDaqua publications

Existing Datasets

Squad The Stanford Question Answering Dataset(SQuAD) (Rajpurkar et al., 2016)
WebQuestions
Freebase

Intern tasks

Week 1: Intro

Get acquainted with the project and Squad Database
Download the database and study the bibliography
Study Prodigy annnotation tool

Week 2 and 3: Web Application

Analyze sql schema of Prodigy annotations
Find out who annotated what.
Make a web application that displays results.
Extend the application to analyze more Prodigy instances (for both question and answer annotations)
Improve the process of annotation.

Output: Web application (in Node.js or Python) and Dockerfile

Week 4-7 The model

Select and train a working question answering system

Output:

a deployment script with comments for a selected question answering system
a working training recipe (can use English data), a script with comments or Jupyter Notebook
a trained model
evaluation of the model (if possible)

3.9 KiB Raw Blame History

Question Answering

Project Description

Tasks

Raw Data Preparation

Question Annotation

Answer Annotation

Annotation Summary

Annotation Manual

Question Answering Model

Supporting activities

Existing implementations

Bibligraphy

Existing Datasets

Intern tasks

3.9 KiB

Raw Blame History