---
title: Question Answering
published: true
taxonomy:
    category: [project]
    tag: [annotation,question-answer,nlp]
    author: Daniel Hladek
---
# Question Answering

- [Project repository](https://git.kemt.fei.tuke.sk/dano/annotation) (private)
- [Annotation Manual for question annotation](navod)
- [Annotation Manual for validations](validacie)
- [Summary database application](https://app.question.tukekemt.xyz)

## Project Description

- Create a clone of [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) in the Slovak language
- Set up annotation infrastructure with [Prodigy](https://prodi.gy/)
- Perform and evaluate annotations of [Wikipedia data](https://dumps.wikimedia.org/backup-index.html).

Auxiliary tasks:

- Consider using machine translation
- Train and evaluate a Question Answering model

## People

- Daniel Hládek (responsible researcher).
- Tomáš Kuchárik (student, help with web app).
- Ján Staš (BERT model).
- [Ondrej Megela](/students/2018/ondrej_megela), [Oleh Bilykh](/students/2018/oleh_bilykh), Matej Čarňanský (auxiliary tasks).
- Other students and annotators (annotations).

## Tasks

### Raw Data Preparation

Input: Wikipedia

Output: a set of paragraphs

1. Obtaining and parsing the Wikipedia dump
1. Selecting feasible paragraphs

Done:

- Wiki parsing script (Daniel Hládek)
- PageRank script (Daniel Hládek)
- selection of paragraphs: select all good paragraphs and shuffle
- fix minor errors

To be done:

- Select the largest articles (to be compatible with SQuAD).

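
The paragraph selection step listed above (keep all good paragraphs, then shuffle) could look roughly like the following. This is only a sketch, not the project's actual script: the function name `select_paragraphs`, the `min_chars` length threshold, and the fixed `seed` are assumptions made for illustration, and the real notion of a "good" paragraph may use other criteria.

```python
import random


def select_paragraphs(paragraphs, min_chars=500, seed=0):
    """Keep paragraphs that pass a simple filter, then shuffle them.

    Assumption: a paragraph is "good" if it has at least `min_chars`
    characters. The real selection criteria are not described here.
    """
    good = [p.strip() for p in paragraphs if len(p.strip()) >= min_chars]
    # A seeded generator makes the shuffled order reproducible between runs.
    rng = random.Random(seed)
    rng.shuffle(good)
    return good
```

Shuffling with a fixed seed keeps the annotation order stable if the input has to be regenerated, while still avoiding the ordering bias of the original dump.
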
Notes:

- PageRank causes a bias toward geography; random selection might be the best option
- [75 best articles](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zoznam_najlep%C5%A1%C3%ADch_%C4%8Dl%C3%A1nkov)
- [167 good articles](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zoznam_dobr%C3%BDch_%C4%8Dl%C3%A1nkov)
- [Wiki Facts](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zauj%C3%ADmavosti)

### Question Annotation

An annotation recipe for Prodigy

Input: A set of paragraphs

Output: 5 questions for each paragraph

Done:

- a data preparation script (Daniel Hládek)
- annotation recipe for Prodigy (Daniel Hládek)
- deployment at [question.tukekemt.xyz](http://question.tukekemt.xyz) (accessible only from TUKE) (Daniel Hládek)
- answer annotation together with question (Daniel Hládek)
- prepare final input paragraphs (dataset)

In progress:

- More annotations (volunteers and workers).

To be done:

- Prepare development set

### Annotation Web Application

Annotation work summary, web application

Input: Database of annotations

Output: Summary of work performed by each annotator

Done:

- application template (Tomáš Kuchárik)
- Dockerfile (Daniel Hládek)
- web application for annotation analysis in Flask (Tomáš Kuchárik, Daniel Hládek)
- application deployment (Daniel Hládek)
- extract annotations from question annotation in SQuAD format (Daniel Hládek)

To be done:

- review of validations

### Annotation Validation

Input: annotated questions and paragraphs

Output: good annotated questions

Done:

- Recipe for validations (binary annotation for paragraphs, questions and answers, text fields for correction of question and answer). (Daniel Hládek)
- Deployment

To be done:

- Prepare for production

### Annotation Manual

Output: Recommendations for annotators

Done:

- Web Page for annotators (Daniel Hládek)
- Motivation video (Daniel Hládek)
- Video with instructions (Daniel Hládek)

In progress:

- Should the instructions be a part of the annotation web application?

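
The extraction of annotations into SQuAD format, mentioned under Annotation Web Application, can be illustrated with a short sketch. The nested output layout (`version`, `data`, `paragraphs`, `qas`, `answers`, `is_impossible`) follows the public SQuAD 2.0 JSON files. The input record fields (`title`, `paragraph`, `question`, `answer_text`) and the function name `to_squad` are hypothetical; the real annotation database may store things differently.

```python
import json
from collections import defaultdict


def to_squad(records, version="v2.0"):
    """Group flat annotation records into the nested SQuAD 2.0 layout.

    Assumption: each record is a dict with the hypothetical keys
    "title", "paragraph", "question" and an optional "answer_text".
    """
    # title -> paragraph text -> list of question entries
    articles = defaultdict(lambda: defaultdict(list))
    for i, rec in enumerate(records):
        context = rec["paragraph"]
        text = rec.get("answer_text") or ""
        # Simplification: take the first occurrence of the answer in the paragraph.
        start = context.find(text) if text else -1
        answers = [{"text": text, "answer_start": start}] if start >= 0 else []
        articles[rec["title"]][context].append({
            "id": str(i),
            "question": rec["question"],
            "answers": answers,
            # Questions without a locatable answer are marked unanswerable.
            "is_impossible": not answers,
        })
    data = [
        {
            "title": title,
            "paragraphs": [{"context": c, "qas": qas} for c, qas in contexts.items()],
        }
        for title, contexts in articles.items()
    ]
    return {"version": version, "data": data}


if __name__ == "__main__":
    example = [{
        "title": "Košice",
        "paragraph": "Košice je mesto na východe Slovenska.",
        "question": "Kde ležia Košice?",
        "answer_text": "na východe Slovenska",
    }]
    # ensure_ascii=False keeps Slovak diacritics readable in the output file.
    print(json.dumps(to_squad(example), ensure_ascii=False, indent=2))
```

If the annotation tool already stores character offsets of the answer span, those should be used instead of `str.find`, which picks the wrong span when the answer text occurs more than once.
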
### Question Answering Model

Training the model with annotated data

Input: An annotated QA database

Output: An evaluated model for QA

To be done:

- Selecting an existing modelling approach
- Evaluation set selection
- Model evaluation
- Supporting the annotation with the model (pre-selecting answers)

In progress:

- Preliminary model (Ján Staš and Matej Čarňanský)

## Existing implementations

- https://github.com/facebookresearch/DrQA
- https://github.com/brmson/yodaqa
- https://github.com/5hirish/adam_qas
- https://github.com/WDAqua/Qanary - methodology and implementation of QA

## Bibliography

- Reading Wikipedia to Answer Open-Domain Questions, Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes, Facebook Research
- SQuAD: 100,000+ Questions for Machine Comprehension of Text https://arxiv.org/abs/1606.05250
- [WDaqua](https://wdaqua.eu/our-work/) publications

## Existing Datasets

- [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016)
- [WebQuestions](https://github.com/brmson/dataset-factoid-webquestions)
- [Freebase](https://en.wikipedia.org/wiki/Freebase)

## Intern tasks

Week 1: Intro

- Get acquainted with the project and the SQuAD database
- Download the database and study the bibliography
- Study the [Prodigy annotation](https://Prodi.gy) tool
- Read [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://arxiv.org/abs/1606.05250)
- Read [Know What You Don't Know: Unanswerable Questions for SQuAD](https://arxiv.org/abs/1806.03822)

Output:

- Short report

Week 2-4: The System

Select and train a working question answering system

Output:

- a deployment script with comments for a selected question answering system

Week 5-7: The Model

Take a working training recipe (can use English data), a script with comments or Jupyter Notebook

Output:

- a trained model
- evaluation of the model (if possible)
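
For the model evaluation output, SQuAD-style results are usually reported as exact match (EM) and token-level F1. The sketch below shows one way to compute both for a single prediction. It is a simplified assumption rather than the official evaluation script: the official script also removes English articles, which is skipped here because Slovak has none, and the function names are chosen only for illustration.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # An empty answer (unanswerable question) only matches another empty answer.
        return float(pred_tokens == true_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)
```

When a question has several reference answers, the usual convention is to take the maximum score over the references and then average over all questions.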