---
title: Question Answering
published: true
taxonomy:
    category: [project]
    tag: [annotation,question-answer,nlp]
    author: Daniel Hladek
---

# Question Answering

- [Project repository](https://git.kemt.fei.tuke.sk/dano/annotation) (private)
- [Annotation Manual for question annotation](navod)
- [Annotation Manual for validations](validacie)
- [Annotation Manual for unanswerable questions](nezodpovedatelne)
- [Summary database application](https://app.question.tukekemt.xyz)

## Project Description

- Create a clone of [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) in the Slovak language
- Set up the annotation infrastructure with [Prodigy](https://prodi.gy/)
- Perform and evaluate annotations of [Wikipedia data](https://dumps.wikimedia.org/backup-index.html)

Auxiliary tasks:

- Consider using machine translation
- Train and evaluate a Question Answering model

## People

- Daniel Hládek (responsible researcher)
- Tomáš Kuchárik (student, help with the web app)
- Ján Staš (BERT model)
- [Ondrej Megela](/students/2018/ondrej_megela), [Oleh Bilykh](/students/2018/oleh_bilykh), Matej Čarňanský (auxiliary tasks)
- other students and annotators (annotations)

## Finished Tasks

### Raw Data Preparation

Input: Wikipedia

Output: a set of paragraphs

1. Obtaining and parsing of the Wikipedia dump
2. Selecting feasible paragraphs

Done:

- Wiki parsing script (Daniel Hládek)
- PageRank script (Daniel Hládek)
- selection of paragraphs: select all good paragraphs and shuffle (see the sketch below)
- fixes of minor errors

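A minimal sketch of the selection step, assuming the parsed dump is stored as one JSON document per line with `title` and `text` fields (the field names and length thresholds are assumptions, not the actual script):

```python
import json
import random

def select_paragraphs(dump_path, min_chars=500, max_chars=2500, seed=42):
    """Pick feasible paragraphs from a parsed Wikipedia dump and shuffle them."""
    selected = []
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            for paragraph in doc["text"].split("\n"):
                paragraph = paragraph.strip()
                # Keep only paragraphs long enough to ask questions about,
                # but short enough to annotate comfortably.
                if min_chars <= len(paragraph) <= max_chars:
                    selected.append({"title": doc["title"], "text": paragraph})
    # Shuffle so annotators do not get long runs of paragraphs
    # from the same article.
    random.Random(seed).shuffle(selected)
    return selected
```
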
To be done:

- Select the largest articles (to be compatible with SQuAD).

Notes:

- PageRank causes a bias toward geography; random selection might be the best option
- [75 best articles](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zoznam_najlep%C5%A1%C3%ADch_%C4%8Dl%C3%A1nkov)
- [167 good articles](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zoznam_dobr%C3%BDch_%C4%8Dl%C3%A1nkov)
- [Wiki Facts](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zauj%C3%ADmavosti)

### Annotation Manual

Output: Recommendations for annotators

Done:

- Web page for annotators (Daniel Hládek)
- Motivation video (Daniel Hládek)
- Video with instructions (Daniel Hládek)

### Question Annotation

An annotation recipe for Prodigy.

Input: A set of paragraphs

Output: 5 questions for each paragraph

Done:

- a data preparation script (Daniel Hládek)
- an annotation recipe for Prodigy (Daniel Hládek); see the sketch below
- deployment at [question.tukekemt.xyz](http://question.tukekemt.xyz) (accessible only from the TUKE network) (Daniel Hládek)
- answer annotation together with the question (Daniel Hládek)
- preparation of the final input paragraphs (dataset)

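The actual recipe lives in the private project repository; the following is only a minimal sketch of what a question-collection recipe can look like in Prodigy (the recipe name, stream fields, and block layout are assumptions):

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("question-annotation")
def question_annotation(dataset: str, source: str):
    """Show a paragraph and collect a question and an answer for it."""
    stream = JSONL(source)  # one {"text": ...} paragraph per line
    blocks = [
        {"view_id": "text"},  # the paragraph itself
        {"view_id": "text_input", "field_id": "question",
         "field_label": "Question"},
        {"view_id": "text_input", "field_id": "answer",
         "field_label": "Answer (a span copied from the text)"},
    ]
    return {
        "dataset": dataset,  # where Prodigy stores the annotations
        "stream": stream,
        "view_id": "blocks",
        "config": {"blocks": blocks},
    }
```

Such a recipe would be started with `prodigy question-annotation my_dataset paragraphs.jsonl -F recipe.py`.
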
### Annotation Web Application

A web application that summarizes the annotation work.

Input: Database of annotations

Output: Summary of the work performed by each annotator

Done:

- application template (Tomáš Kuchárik)
- Dockerfile (Daniel Hládek)
- web application for annotation analysis in Flask (Tomáš Kuchárik, Daniel Hládek); a sketch of the idea follows below
- application deployment (Daniel Hládek)
- extraction of annotations from the question annotation in SQuAD format (Daniel Hladek)

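A minimal sketch of the summary idea in Flask, assuming the annotations end up in an SQLite table with an `annotator` column (the database path, table, and column names are assumptions):

```python
import sqlite3

from flask import Flask, jsonify

app = Flask(__name__)
DB_PATH = "annotations.db"  # hypothetical annotation database

@app.route("/summary")
def summary():
    """Return the number of annotations per annotator."""
    con = sqlite3.connect(DB_PATH)
    try:
        rows = con.execute(
            "SELECT annotator, COUNT(*) FROM annotation GROUP BY annotator"
        ).fetchall()
    finally:
        con.close()
    return jsonify({name: count for name, count in rows})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```
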
### Annotation Validation

Input: annotated questions and paragraphs

Output: good annotated questions

Done:

- Recipe for validations (binary annotation for paragraphs, questions, and answers; text fields for correction of the question and answer); a sketch follows below (Daniel Hládek)
- Deployment

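A sketch of such a validation interface, again as a hypothetical Prodigy `blocks` layout (binary flags as multiple-choice options plus free-text correction fields; all names are assumptions):

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe("question-validation")
def question_validation(dataset: str, source: str):
    """Review an annotated question: flag problems and correct the text."""
    def add_options(stream):
        # Attach the binary flags to every task as choice options.
        for task in stream:
            task["options"] = [
                {"id": "bad_paragraph", "text": "Bad paragraph"},
                {"id": "bad_question", "text": "Bad question"},
                {"id": "bad_answer", "text": "Bad answer"},
            ]
            yield task

    blocks = [
        {"view_id": "text"},
        {"view_id": "choice"},
        {"view_id": "text_input", "field_id": "question_fix",
         "field_label": "Corrected question (optional)"},
        {"view_id": "text_input", "field_id": "answer_fix",
         "field_label": "Corrected answer (optional)"},
    ]
    return {
        "dataset": dataset,
        "stream": add_options(JSONL(source)),
        "view_id": "blocks",
        "config": {"blocks": blocks, "choice_style": "multiple"},
    }
```
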
## Tasks in Progress

### Unanswerable Question Annotation

Input: validated questions and answers

Output: Unanswerable questions and answers

Done:

- Annotation manual
- Annotation interface
- Database schema modifications
- Modification of the database application
- Export of validations

In progress:

- Annotation process optimization

### Final Data Export

Input: Validations and unanswerable questions

Output: Final database in SQuAD format (see the format sketch below)

Done:

- Preliminary export script

To be done:

- Final export script
- Database web visualization
- Prepare a development set

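For reference, a sketch of the SQuAD v2 structure that the export script has to produce, with made-up Slovak content; `is_impossible` marks unanswerable questions:

```python
import json

# Hypothetical example of one exported article in SQuAD v2 format.
squad = {
    "version": "v2.0",
    "data": [{
        "title": "Košice",
        "paragraphs": [{
            "context": "Košice sú druhé najväčšie mesto na Slovensku.",
            "qas": [
                {
                    "id": "sk-0001",
                    "question": "Ktoré mesto je druhé najväčšie na Slovensku?",
                    "answers": [{"text": "Košice", "answer_start": 0}],
                    "is_impossible": False,
                },
                {
                    "id": "sk-0002",
                    "question": "Kedy boli Košice založené?",
                    "answers": [],  # not answerable from this context
                    "is_impossible": True,
                },
            ],
        }],
    }],
}

with open("skquad.json", "w", encoding="utf-8") as f:
    json.dump(squad, f, ensure_ascii=False, indent=2)
```
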
## Resources

### Bibliography

- Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes (Facebook Research): Reading Wikipedia to Answer Open-Domain Questions
- Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang: [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://arxiv.org/abs/1606.05250)
- [WDaqua](https://wdaqua.eu/our-work/) publications

### Existing Datasets

- [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/): The Stanford Question Answering Dataset (Rajpurkar et al., 2016)
- [WebQuestions](https://github.com/brmson/dataset-factoid-webquestions)
- [Freebase](https://en.wikipedia.org/wiki/Freebase)

## Intern Tasks

Week 1: Intro

- Get acquainted with the project and the SQuAD database
- Download the database and study the bibliography
- Study the [Prodigy annotation](https://prodi.gy) tool
- Read [SQuAD: 100,000+ Questions for Machine Comprehension of Text](https://arxiv.org/abs/1606.05250)
- Read [Know What You Don't Know: Unanswerable Questions for SQuAD](https://arxiv.org/abs/1806.03822)

Output:

- Short report

Week 2-4: The System

Select and train a working question answering system.

Output:

- a deployment script with comments for the selected question answering system (see the sketch below for a possible starting point)

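One possible starting point is an extractive QA pipeline from the Hugging Face `transformers` library with a multilingual model; the checkpoint below is an assumption, not a project decision:

```python
from transformers import pipeline

# A multilingual SQuAD 2.0 model can answer questions in Slovak
# to some degree even before a Slovak dataset exists.
qa = pipeline("question-answering", model="deepset/xlm-roberta-base-squad2")

result = qa(
    question="Which city is the second largest in Slovakia?",
    context="Košice is the second largest city in Slovakia.",
)
print(result["answer"], result["score"])
```
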
Week 5-7: The Model

Take a working training recipe (English data can be used) and turn it into a commented script or a Jupyter notebook.

Output:

- a trained model
- an evaluation of the model (if possible)

### Question Answering Model

Training the model with the annotated data.

Input: An annotated QA database

Output: An evaluated model for QA

To be done:

- Selection of an existing modelling approach
- Evaluation set selection
- Model evaluation (see the metric sketch below)
- Supporting the annotation with the model (pre-selecting answers)

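Model evaluation could reuse the official SQuAD v2 metric, for example through the `evaluate` package (a sketch; the IDs and texts are made up):

```python
import evaluate

squad_v2 = evaluate.load("squad_v2")

# Predictions and references follow the squad_v2 metric format.
predictions = [
    {"id": "sk-0001", "prediction_text": "Košice", "no_answer_probability": 0.0},
]
references = [
    {"id": "sk-0001", "answers": {"text": ["Košice"], "answer_start": [0]}},
]

print(squad_v2.compute(predictions=predictions, references=references))
```
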
In progress:

- Preliminary model (Ján Staš and Matej Čarňanský)