# Question Answering

[Project repository](https://git.kemt.fei.tuke.sk/dano/annotation)

## Project Description

Task definition:

- Create a clone of [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) in the Slovak language
- Set up annotation infrastructure with [Prodigy](https://prodi.gy/)
- Perform and evaluate annotations of [Wikipedia data](https://dumps.wikimedia.org/backup-index.html).

Auxiliary tasks:

- Consider using machine translation
- Train and evaluate a Question Answering model

## Tasks

### Raw Data Preparation

Input: Wikipedia
Output: a set of paragraphs

1. Obtaining and parsing the Wikipedia dump
1. Selecting feasible paragraphs

Done:

- Wiki parsing script
- PageRank script

To be done:

- random selection of paragraphs: select all good paragraphs and shuffle

Notes:

- PageRank causes a bias toward geography; random selection might be the best
- [75 best articles](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zoznam_najlep%C5%A1%C3%ADch_%C4%8Dl%C3%A1nkov)
- [167 good articles](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zoznam_dobr%C3%BDch_%C4%8Dl%C3%A1nkov)
- [Wiki Facts](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zauj%C3%ADmavosti)

### Question Annotation

Input: A set of paragraphs
Output: A question for each paragraph

Done:

- a data preparation script
- annotation running script

To be done:

- final input paragraphs
- deployment

### Answer Annotation

Input: A set of paragraphs and questions
Output: An answer for each paragraph and question

Done:

- a data preparation script
- annotation running script

To be done:

- input paragraphs with questions
- deployment

### Annotation Summary

Annotation work summary

Input: Database of annotations
Output: Summary of work performed by each annotator

To be done:

- web application for annotation analysis
- analyze the SQL schema and find out who annotated what

### Annotation Manual

Output: Recommendations for annotators

TBD

### Question Answering Model

Training the model with annotated data

Input: An annotated QA database
Output: An evaluated model for QA

To be done:

- Selecting an existing modelling approach
- Evaluation set selection
- Model evaluation
- Supporting the annotation with the model (pre-selecting answers)

### Supporting activities

Output: More annotations

Organizing voluntary student challenges to support the annotation process

TBD

## Existing implementations

- https://github.com/facebookresearch/DrQA
- https://github.com/brmson/yodaqa
- https://github.com/5hirish/adam_qas
- https://github.com/WDAqua/Qanary - methodology and implementation of QA

## Bibliography

- Reading Wikipedia to Answer Open-Domain Questions, Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes, Facebook Research
- SQuAD: 100,000+ Questions for Machine Comprehension of Text https://arxiv.org/abs/1606.05250

## Existing Datasets

- SQuAD: The Stanford Question Answering Dataset (Rajpurkar et al., 2016)
- WebQuestions
- https://en.wikipedia.org/wiki/Freebase

## Intern tasks

Week 1: Intro

- Get acquainted with the project and the SQuAD database
- Download the database and study the bibliography

Week 2 and 3: Web Application

- Analyze the SQL schema of Prodigy annotations
- Find out who annotated what.
- Make a web application that displays results.
- Extend the application to analyze more Prodigy instances (for both question and answer annotations)
- Improve the process of annotation.

Output: Web application (in Node.js or Python) and Dockerfile

Week 4-7: The model

Select and train a working question answering system

Output:

- a deployment script with comments for a selected question answering system
- a working training recipe (can use English data), a script with comments or Jupyter Notebook
- a trained model
- evaluation of the model (if possible)
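
For the model evaluation item above, one possible starting point is the pair of metrics commonly reported for SQuAD: exact match and token-level F1. The following is a minimal sketch, not the project's evaluation script. The function names, the input layout (a dict of predictions and a dict of gold answer lists keyed by a question id) and the normalization are assumptions made for illustration. The official SQuAD script also strips English articles, which is left out here because the target language is Slovak.

```python
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop ASCII punctuation, and split on whitespace.

    Assumption: this simple normalization is good enough for Slovak text.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    # SQuAD 2.0 style: an unanswerable question has an empty gold answer,
    # so an empty prediction is correct only if the gold answer is empty too.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions: dict[str, str],
             references: dict[str, list[str]]) -> dict[str, float]:
    """Average the best score over all gold answers of each question.

    `references` maps a question id to its gold answers; an empty list
    marks an unanswerable question (hypothetical layout).
    """
    em_total = 0.0
    f1_total = 0.0
    for qid, golds in references.items():
        golds = golds or [""]
        pred = predictions.get(qid, "")  # a missing prediction counts as empty
        em_total += max(exact_match(pred, g) for g in golds)
        f1_total += max(f1_score(pred, g) for g in golds)
    n = max(len(references), 1)
    return {"exact": em_total / n, "f1": f1_total / n}


if __name__ == "__main__":
    preds = {"q1": "v Bratislave", "q2": ""}
    refs = {"q1": ["Bratislave", "v meste Bratislava"], "q2": []}
    print(evaluate(preds, refs))
```

Taking the maximum over several gold answers mirrors the SQuAD convention of collecting more than one reference answer per question, so the same function could later be reused to compare annotators against each other.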