# Question Answering

[Project repository](https://git.kemt.fei.tuke.sk/dano/annotation) (private)

## Project Description

- Create a clone of [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) in the Slovak language
- Set up annotation infrastructure with [Prodigy](https://prodi.gy/)
- Perform and evaluate annotations of [Wikipedia data](https://dumps.wikimedia.org/backup-index.html).

Auxiliary tasks:

- Consider using machine translation
- Train and evaluate a Question Answering model

## Tasks

### Raw Data Preparation

Input: Wikipedia

Output: a set of paragraphs

1. Obtaining and parsing the Wikipedia dump
1. Selecting feasible paragraphs

Done:

- Wiki parsing script
- PageRank script

To be done:

- random selection of paragraphs: select all good paragraphs and shuffle

Notes:

- PageRank causes a bias towards geography; random selection might be the best
- [75 best articles](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zoznam_najlep%C5%A1%C3%ADch_%C4%8Dl%C3%A1nkov)
- [167 good articles](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zoznam_dobr%C3%BDch_%C4%8Dl%C3%A1nkov)
- [Wiki Facts](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zauj%C3%ADmavosti)

### Question Annotation

An annotation recipe for Prodigy

Input: A set of paragraphs

Output: A question for each paragraph

Done:

- a data preparation script
- annotation recipe
- deployment at [question.tukekemt.xyz](http://question.tukekemt.xyz) (accessible only from TUKE)
- answer annotation together with question

To be done:

- prepare final input paragraphs (dataset)

### Answer Annotation

Input: A set of paragraphs and questions

Output: An answer for each paragraph and question

Done:

- a data preparation script
- annotation recipe
- deployment at [answer.tukekemt.xyz](http://answer.tukekemt.xyz) (accessible only from TUKE)

To be done:

- extract annotations from question annotation
- input paragraphs with questions (dataset)

### Annotation Summary

Annotation work summary

Input: Database of annotations

Output: Summary of work performed by each annotator

Done:

- application template
- Dockerfile

In progress:

- web application for annotation analysis (Tomáš Kuchárik, Flask)
- application deployment

To be done:

- analyze the SQL schema and find out who annotated what

### Annotation Manual

Output: Recommendations for annotators

To be done:

- Web page for annotators

### Question Answering Model

Training the model with annotated data

Input: An annotated QA database

Output: An evaluated model for QA

To be done:

- Selecting an existing modelling approach
- Evaluation set selection
- Model evaluation
- Supporting the annotation with the model (pre-selecting answers)

### Supporting activities

Output: More annotations

Organizing voluntary student challenges to support the annotation process

TBD

## Existing implementations

- https://github.com/facebookresearch/DrQA
- https://github.com/brmson/yodaqa
- https://github.com/5hirish/adam_qas
- https://github.com/WDAqua/Qanary - QA methodology and implementation

## Bibliography

- Reading Wikipedia to Answer Open-Domain Questions, Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes, Facebook Research
- SQuAD: 100,000+ Questions for Machine Comprehension of Text https://arxiv.org/abs/1606.05250
- [WDAqua](https://wdaqua.eu/our-work/) publications

## Existing Datasets

- [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016)
- [WebQuestions](https://github.com/brmson/dataset-factoid-webquestions)
- [Freebase](https://en.wikipedia.org/wiki/Freebase)

## Intern tasks

Week 1: Intro

- Get acquainted with the project and the SQuAD database
- Download the database and study the bibliography
- Study the [Prodigy annotation](https://Prodi.gy) tool

Week 2 and 3: Web Application

- Analyze the SQL schema of Prodigy annotations
- Find out who annotated what.
- Make a web application that displays results.
- Extend the application to analyze more Prodigy instances (for both question and answer annotations)
- Improve the process of annotation.

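As a starting point for the "who annotated what" step, the following is a minimal sketch that counts answers per annotator from annotation records exported as JSON lines. It assumes that each record carries an `answer` field and that the annotator can be read from an `_annotator_id` or `_session_id` field; these field names and the function name `summarize_annotations` are assumptions for illustration and should be checked against the actual database contents.

```python
import json
from collections import Counter, defaultdict


def summarize_annotations(jsonl_lines):
    """Count answers per annotator from exported annotation records.

    `jsonl_lines` is any iterable of strings, one JSON object per line.
    """
    summary = defaultdict(Counter)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the export
        record = json.loads(line)
        # Assumption: the annotator is stored under one of these keys.
        annotator = (
            record.get("_annotator_id")
            or record.get("_session_id")
            or "unknown"
        )
        # Assumption: the decision is stored under "answer".
        summary[annotator][record.get("answer", "missing")] += 1
    # Convert to plain dicts so the result is easy to serialize or render.
    return {name: dict(counts) for name, counts in summary.items()}
```

The function accepts an open file as well as a list of strings, so it could be fed a JSONL export of a dataset and its result passed to a web template.
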
Output: Web application (in Node.js or Python) and Dockerfile

Week 4-7: The model

Select and train a working question answering system

Output:

- a deployment script with comments for a selected question answering system
- a working training recipe (can use English data), a script with comments or Jupyter Notebook
- a trained model
- evaluation of the model (if possible)
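
For the model evaluation, SQuAD-style benchmarks usually report exact match and token-level F1 between the predicted and the gold answer. The sketch below is a simplified illustration of these two metrics, not the official SQuAD evaluation script: it only lowercases, strips ASCII punctuation, and splits on whitespace, and it omits the English article removal of the original script because that step does not apply to Slovak. The function names are chosen here for illustration.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def exact_match(prediction, gold):
    """Return 1.0 if the normalized answers are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Token-level F1 between a predicted and a gold answer."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    # Unanswerable questions (SQuAD 2.0 style): an empty answer only
    # matches another empty answer.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

When a question has several gold answers, the usual practice is to take the maximum score over them and then average over all questions.
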