forked from KEMT/zpwiki

# Question Answering

[Project repository](https://git.kemt.fei.tuke.sk/dano/annotation) (private)

## Project Description

- Create a clone of [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/) in the Slovak language
- Set up annotation infrastructure with [Prodigy](https://prodi.gy/)
- Perform and evaluate annotations of [Wikipedia data](https://dumps.wikimedia.org/backup-index.html)

Auxiliary tasks:

- Consider using machine translation
- Train and evaluate a question answering model

## Tasks

### Raw Data Preparation

Input: Wikipedia

Output: a set of paragraphs

1. Obtain and parse a Wikipedia dump
1. Select feasible paragraphs

Done:

- Wiki parsing script
- PageRank script

To be done:

- Random selection of paragraphs: select all good paragraphs and shuffle them

Notes:

- PageRank biases the selection toward geography articles; random selection might work best
- [75 best articles](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zoznam_najlep%C5%A1%C3%ADch_%C4%8Dl%C3%A1nkov)
- [167 good articles](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zoznam_dobr%C3%BDch_%C4%8Dl%C3%A1nkov)
- [Wiki Facts](https://sk.wikipedia.org/wiki/Wikip%C3%A9dia:Zauj%C3%ADmavosti)

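The random selection step listed under "To be done" could look roughly like the sketch below. It is only an illustration: the function name `select_paragraphs`, the length thresholds, and the idea that "good" means "within a length window" are assumptions, not the project's actual criteria.

```python
import random


def select_paragraphs(paragraphs, min_chars=200, max_chars=1500, seed=42):
    """Keep paragraphs inside a length window and return them shuffled.

    The thresholds are hypothetical; the project may define "good"
    paragraphs differently (for example by article quality).
    """
    good = [p.strip() for p in paragraphs
            if min_chars <= len(p.strip()) <= max_chars]
    # A seeded generator makes the shuffle reproducible between runs.
    rng = random.Random(seed)
    rng.shuffle(good)
    return good


if __name__ == "__main__":
    sample = ["too short", "x" * 300, "y" * 500, "z" * 5000]
    for paragraph in select_paragraphs(sample):
        print(len(paragraph))
```

Fixing the seed keeps the paragraph order stable, which may help when the annotation input has to be regenerated.
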
### Question Annotation

Input: A set of paragraphs

Output: A question for each paragraph

Done:

- a data preparation script
- annotation running script

To be done:

- final input paragraphs
- deployment

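As an illustration of what a data preparation step for this task might produce, the sketch below writes paragraphs as JSONL, one task per line with a `text` field, which is a format Prodigy can load. The function name `paragraphs_to_jsonl` and the `meta` fields are assumptions made for this example; the project's own script may differ.

```python
import json


def paragraphs_to_jsonl(paragraphs, source="skwiki"):
    """Serialize paragraphs as JSONL annotation tasks, one JSON object per line.

    The "meta" content (source, paragraph_id) is a hypothetical choice.
    """
    lines = []
    for index, text in enumerate(paragraphs):
        task = {"text": text,
                "meta": {"source": source, "paragraph_id": index}}
        # ensure_ascii=False keeps Slovak diacritics readable in the file.
        lines.append(json.dumps(task, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    print(paragraphs_to_jsonl(["Košice sú mesto na východe Slovenska."]), end="")
```
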
### Answer Annotation

Input: A set of paragraphs and questions

Output: An answer for each paragraph and question

Done:

- a data preparation script
- annotation running script

To be done:

- input paragraphs with questions
- deployment

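Since the project aims at a SQuAD 2.0 clone, one plausible target for the annotated answers is the SQuAD 2.0 JSON layout. The sketch below builds a single entry from a paragraph, a question, and the character offsets of the answer span. The function name `to_squad_entry` and the assumption that annotations arrive as character offsets are illustrative only.

```python
def to_squad_entry(title, context, question, question_id, start=None, end=None):
    """Build one SQuAD 2.0 style article entry with a single question.

    start/end are character offsets of the answer in the context;
    leaving them as None marks the question as unanswerable.
    """
    if start is None or end is None:
        answers, impossible = [], True
    else:
        answers = [{"text": context[start:end], "answer_start": start}]
        impossible = False
    qa = {"id": question_id, "question": question,
          "answers": answers, "is_impossible": impossible}
    return {"title": title, "paragraphs": [{"context": context, "qas": [qa]}]}


if __name__ == "__main__":
    context = "Košice sú druhé najväčšie mesto Slovenska."
    answer = "druhé najväčšie mesto"
    start = context.index(answer)
    print(to_squad_entry("Košice", context, "Aké mesto sú Košice?", "q1",
                         start, start + len(answer)))
```
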
### Annotation Summary

Summary of the annotation work

Input: Database of annotations

Output: Summary of work performed by each annotator

To be done:

- web application for annotation analysis
- analyze the SQL schema and find out who annotated what

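A per-annotator summary could be computed along the lines of the sketch below. It is written under stated assumptions that must be checked against the real database: that annotations live in an SQLite table named `example` with a JSON `content` column, and that each task carries an `_annotator_id` key and an `answer` key. The function name `count_by_annotator` is hypothetical.

```python
import json
import sqlite3
from collections import Counter


def count_by_annotator(connection, key="_annotator_id"):
    """Count stored annotations per (annotator, answer) pair.

    Assumes a table "example" with a JSON "content" column; the real
    schema has to be inspected first, as the task list says.
    """
    counts = Counter()
    for (content,) in connection.execute("SELECT content FROM example"):
        task = json.loads(content)
        counts[(task.get(key, "unknown"), task.get("answer", "unknown"))] += 1
    return counts


if __name__ == "__main__":
    # Small in-memory database standing in for the real annotation store.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE example (id INTEGER PRIMARY KEY, content BLOB)")
    demo = [{"_annotator_id": "alice", "answer": "accept"},
            {"_annotator_id": "bob", "answer": "reject"}]
    db.executemany("INSERT INTO example (content) VALUES (?)",
                   [(json.dumps(row),) for row in demo])
    for (annotator, answer), n in sorted(count_by_annotator(db).items()):
        print(annotator, answer, n)
```

The same query logic could back the planned web application, with the counts rendered as a table per annotator.
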
### Annotation Manual

Output: Recommendations for annotators

TBD

### Question Answering Model

Train a model on the annotated data

Input: An annotated QA database

Output: An evaluated model for QA

To be done:

- Select an existing modelling approach
- Select an evaluation set
- Evaluate the model
- Support the annotation with the model (pre-selecting answers)

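For the model evaluation step, SQuAD-style systems are usually scored with exact match and token-level F1. The sketch below shows one way to compute both for a single prediction. It is an assumption that the project will use these metrics. The official SQuAD script also strips English articles during normalization, and that step is left out here on purpose because it does not apply to Slovak.

```python
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    if not pred_tokens or not truth_tokens:
        # Both empty counts as a match (unanswerable question).
        return float(pred_tokens == truth_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("Košice.", "košice"))
    print(f1_score("mesto Košice", "Košice"))
```

Averaging both scores over the evaluation set gives numbers that can be compared with published SQuAD results, with the caveat that the normalization differs slightly.
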
### Supporting activities

Output: More annotations

Organizing voluntary student challenges to support the annotation process

TBD

## Existing implementations

- https://github.com/facebookresearch/DrQA
- https://github.com/brmson/yodaqa
- https://github.com/5hirish/adam_qas
- https://github.com/WDAqua/Qanary - methodology and implementation of QA

## Bibliography

- Reading Wikipedia to Answer Open-Domain Questions, Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes, Facebook Research
- SQuAD: 100,000+ Questions for Machine Comprehension of Text, https://arxiv.org/abs/1606.05250
- [WDaqua](https://wdaqua.eu/our-work/) publications

## Existing Datasets

- [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/): The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016)
- [WebQuestions](https://github.com/brmson/dataset-factoid-webquestions)
- [Freebase](https://en.wikipedia.org/wiki/Freebase)

## Intern tasks

Week 1: Intro

- Get acquainted with the project and the SQuAD dataset
- Download the dataset and study the bibliography
- Study the [Prodigy annotation](https://Prodi.gy) tool

Week 2 and 3: Web Application

- Analyze the SQL schema of Prodigy annotations
- Find out who annotated what
- Make a web application that displays the results
- Extend the application to analyze more Prodigy instances (for both question and answer annotations)
- Improve the annotation process

Output: Web application (in Node.js or Python) and Dockerfile

Week 4-7: The model

Select and train a working question answering system

Output:

- a deployment script with comments for the selected question answering system
- a working training recipe (can use English data), as a commented script or a Jupyter Notebook
- a trained model
- evaluation of the model (if possible)