---
title: Cesar Abascal Gutierrez
published: true
taxonomy:
    category: [iaeste]
    tag: [ner,nlp]
    author: Daniel Hladek
---

## Named entity annotations

Intern, probably summer 2019

Cesar Abascal Gutierrez <cesarbielva1994@gmail.com>

## Goals

  - Be able to recognize unknown named entities
  - Create a manually annotated training set from speech transcripts
  - Propose an annotation schema

## Plan

  - Convert speech transcripts into a training set
  - Train and evaluate a classifier
  - Set up manual annotation
  - Select unannotated data

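For the "train and evaluate" step, a minimal sketch of entity-level scoring could look like the following. It assumes gold and predicted entities are available as `(start, end, label)` tuples; the function name `entity_prf` and this span representation are illustrative assumptions, not part of the project.

```python
def entity_prf(gold, predicted):
    """Entity-level precision, recall and F1 with exact span matching.

    Both arguments are iterables of (start, end, label) tuples.
    This representation is an assumption made for illustration.
    """
    gold, predicted = set(gold), set(predicted)
    # A prediction counts only if boundaries and label both match exactly.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# One of two predictions is correct, and one of two gold entities is found.
gold = [(1, 3, "ENT"), (6, 7, "ENT")]
predicted = [(1, 3, "ENT"), (4, 5, "ENT")]
print(entity_prf(gold, predicted))  # (0.5, 0.5, 0.5)
```

Exact matching is the strictest choice. A partially overlapping span counts as both a false positive and a false negative.
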
### Data preparation

Input: Transcriber transcripts with inconsistent annotations, marked up as follows:

  - Lowercase first letter: regular word
  - Capitalized first letter: named entity
  - `^^`: foreign word
  - `@`: noise
  - `_`: multi-word expression
  - `/`: pronunciation

Output: A file that can be read by `spacy convert`

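The conversion described above can be sketched as follows. The exact placement of the markers is not specified here, so the code makes several assumptions: `@` and `^^` are prefixes, `_` joins the words of a multi-word expression, `/` separates the written form from its pronunciation, and foreign words are not treated as entities. The label `ENT` is a placeholder until an annotation schema is proposed, and the function names are hypothetical. The output is one sentence per line of `word|TAG` tokens, which `spacy convert` should accept as IOB input (worth checking against the installed spaCy version).

```python
def token_to_iob(token):
    """Convert one transcript token into a list of 'word|TAG' strings."""
    # Assumption: '@' prefixes a noise marker that carries no text.
    if token.startswith("@"):
        return []
    # Assumption: 'written/pronunciation'; keep only the written form.
    written = token.split("/", 1)[0]
    # Assumption: '^^' prefixes a foreign word, tagged as a regular word.
    foreign = written.startswith("^^")
    written = written.lstrip("^")
    # Assumption: '_' joins the words of a multi-word expression.
    words = [w for w in written.split("_") if w]
    if not words:
        return []
    if foreign or not words[0][0].isupper():
        return [f"{w}|O" for w in words]
    # Capitalized first letter: named entity. 'ENT' is a placeholder label.
    return [f"{w}|{'B' if i == 0 else 'I'}-ENT" for i, w in enumerate(words)]


def transcript_to_iob(line):
    """Convert one transcript line into a line of IOB-tagged tokens."""
    tagged = []
    for token in line.split():
        tagged.extend(token_to_iob(token))
    return " ".join(tagged)


print(transcript_to_iob("prezident Andrej_Kiska videl ^^New_York @noise Kosice/kosice"))
# prezident|O Andrej|B-ENT Kiska|I-ENT videl|O New|O York|O Kosice|B-ENT
```

Capitalization alone cannot separate a sentence-initial regular word from a named entity, so the result is a starting point for manual correction, not a finished training set.
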
## People

- Cesar Abascal Gutierrez <cesarbielva1994@gmail.com>
- Kyryl Kobzar
- Ediz Morochovič

## Tools

  - Machine learning: https://spacy.io/usage/training
  - Manual annotation: https://prodi.gy/