---
title: Cesar Abascal Gutierrez
published: true
taxonomy:
category: [iaeste]
tag: [ner,nlp]
author: Daniel Hladek
---
## Named entity annotations
Intern, probably summer 2019
Cesar Abascal Gutierrez <cesarbielva1994@gmail.com>
## Goals
- Be able to recognize unknown named entities
- Create a manually annotated training set from speech transcripts
- Propose an annotation schema
## Plan
- Convert speech transcripts into a training set
- Train and evaluate classifier
- Establish manual annotation
- Select unannotated data
### Data preparation
Input: Transcriber transcripts with inconsistent annotations
```
* Lowercase first letter: regular word
* Capitalized first letter: named entity
* ''^^'': foreign word
* ''@'': noise
* ''_'': multi word expression
* ''/'': pronunciation
```
Output: A file that can be read by `spacy convert`
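
The conversion could be sketched roughly as follows. This is a minimal sketch under several assumptions that the notes above do not confirm: `^^` and `@` are taken to be token prefixes, `_` to join the parts of a multi-word expression, and `/` to separate the written form (first) from its pronunciation. The label `ENT` and the function names `tag_tokens` and `to_ner_format` are hypothetical, since the annotation schema is still to be proposed. The output is one token per line with a tab-separated IOB tag and a blank line between utterances.

```python
# Hypothetical generic label; the real annotation schema is not defined yet.
LABEL = "ENT"


def tag_tokens(line):
    """Return (token, IOB tag) pairs for one transcript line.

    Assumed marker placement (not confirmed by the source):
    '@' and '^^' are prefixes, '_' joins parts, '/' precedes pronunciation.
    """
    pairs = []
    for raw in line.split():
        if raw.startswith("@"):
            continue  # noise token: dropped
        word = raw[2:] if raw.startswith("^^") else raw  # strip foreign-word marker
        word = word.split("/", 1)[0]  # keep the written form, drop pronunciation
        parts = [p for p in word.split("_") if p]  # split multi-word expression
        if not parts:
            continue
        is_entity = parts[0][0].isupper()  # capitalized first letter: named entity
        for i, part in enumerate(parts):
            if is_entity:
                tag = ("B-" if i == 0 else "I-") + LABEL
            else:
                tag = "O"
            pairs.append((part, tag))
    return pairs


def to_ner_format(lines):
    """Join tagged utterances: one token per line, blank line between utterances."""
    blocks = []
    for line in lines:
        pairs = tag_tokens(line)
        if pairs:
            blocks.append("\n".join(f"{tok}\t{tag}" for tok, tag in pairs))
    return "\n\n".join(blocks) + "\n"
```

A file written this way should be close to what the `ner` converter of `spacy convert` expects, but this needs to be checked against the installed spaCy version. Note that two adjacent capitalized tokens that are not joined by `_` are tagged as two separate entities.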
## People
- Cesar Abascal Gutierrez <cesarbielva1994@gmail.com>
- Kyryl Kobzar
- Ediz Morochovič
## Tools
```
* Machine learning : https://spacy.io/usage/training
* Manual Annotation : https://prodi.gy/
```