57 lines
		
	
	
		
			1.1 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			57 lines
		
	
	
		
			1.1 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
---
 | 
						|
title: Cesar Abascal Gutierrez
 | 
						|
published: true
 | 
						|
taxonomy:
 | 
						|
    category: [iaeste]
 | 
						|
    tag: [ner,nlp]
 | 
						|
    author: Daniel Hladek
 | 
						|
---
 | 
						|
 | 
						|
## Named entity annotations
 | 
						|
 | 
						|
Intern, probably summer 2019
 | 
						|
 | 
						|
Cesar Abascal Gutierrez <cesarbielva1994@gmail.com>
 | 
						|
 | 
						|
## Goals
 | 
						|
 | 
						|
  - Be able to recognize unknown named entities
 | 
						|
  - Create a manually annotated training set from speech transcripts
 | 
						|
  - Propose an annotation schema
 | 
						|
 | 
						|
 | 
						|
## Plan
 | 
						|
 | 
						|
  - Convert speech transcripts into a training set
 | 
						|
  - Train and evaluate classifier
 | 
						|
  - Establish manual annotation 
 | 
						|
  - Select unannotated data 
 | 
						|
 | 
						|
### Data preparation
 | 
						|
 | 
						|
Input: Transcriber transcripts with inconsistent annotations
 | 
						|
 | 
						|
``` 
 | 
						|
 * First small letter: regular word
 | 
						|
 * Capital: named entity
 | 
						|
 * ''^^'': faoreign word
 | 
						|
 * ''@'': noise
 | 
						|
 * ''_'': multi word expression
 | 
						|
 * ''/'': pronuncation
 | 
						|
```
 | 
						|
 | 
						|
Output: A file that can be read by `spacy convert`
 | 
						|
 | 
						|
## People
 | 
						|
 | 
						|
- Cesar Abascal Gutierrez <cesarbielva1994@gmail.com>
 | 
						|
- Kyryl Kobzar
 | 
						|
- Ediz Morochovič
 | 
						|
 | 
						|
## Tools
 | 
						|
 | 
						|
``` 
 | 
						|
 * Machine learning : https://spacy.io/usage/training
 | 
						|
 * Manual Annotation : https://prodi.gy/
 | 
						|
```
 |