forked from KEMT/zpwiki
zz
parent
6c2a8ff77b
commit
965d5e7dcd
45
pages/interns/cesar_gutierrez/README.md
Normal file
@ -0,0 +1,45 @@
## Named entity annotations

Cesar Abascal Gutierrez <cesarbielva1994@gmail.com>

## Goals

- Be able to recognize unknown named entities
- Create a manually annotated training set from speech transcripts
- Propose an annotation schema

## Plan

- Convert speech transcripts into a training set
- Train and evaluate a classifier
- Establish manual annotation
- Select unannotated data

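The "train and evaluate" step needs a score to compare models. A minimal sketch of span-level precision, recall and F1 follows, assuming entities are represented as `(start_token, end_token, label)` tuples. The labels `PER`, `ORG` and `LOC` are placeholders only, since the annotation schema has not been proposed yet.

```python
from typing import Set, Tuple

# Hypothetical representation: (start_token, end_token, label)
Span = Tuple[int, int, str]


def prf(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over entity spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: the second span has the right boundaries but the wrong label.
gold = {(0, 1, "PER"), (4, 6, "ORG"), (9, 10, "LOC")}
pred = {(0, 1, "PER"), (4, 6, "LOC"), (9, 10, "LOC")}
p, r, f = prf(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Exact matching counts a span with correct boundaries but a wrong label as an error, which is the stricter of the common conventions.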
### Data preparation

Input: Transcriber transcripts with inconsistent annotations

```
* Lowercase first letter: regular word
* Capital first letter: named entity
* ''^^'': foreign word
* ''@'': noise
* ''_'': multi-word expression
* ''/'': pronunciation
```

Output: A file that can be read by `spacy convert`

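The conversion could be sketched as below. Everything about marker placement is an assumption, because the transcripts are described as inconsistent: `^^` and `@` are taken to be token prefixes, `_` to join the parts of a multi-word expression, and `/` to separate the written form from its pronunciation. The generic label `ENT` is a placeholder until a schema exists, and the one-token-per-line IOB layout should be checked against the formats `spacy convert` actually accepts.

```python
from typing import List, Tuple


def tag_token(raw: str) -> List[Tuple[str, str]]:
    """Return (word, IOB tag) pairs for one whitespace-separated transcript token."""
    if raw.startswith("@"):            # assumed: noise marker, token is dropped
        return []
    if raw.startswith("^^"):           # assumed: foreign-word marker, stripped
        raw = raw[2:]
    word = raw.split("/", 1)[0]        # assumed: text after "/" is the pronunciation
    parts = [p for p in word.split("_") if p]  # "_" joins a multi-word expression
    if not parts:
        return []
    if not parts[0][0].isupper():      # lowercase first letter: regular word
        return [(p, "O") for p in parts]
    # Capital first letter: named entity, possibly spanning several words.
    return [(p, "B-ENT" if i == 0 else "I-ENT") for i, p in enumerate(parts)]


def to_iob_lines(sentence: str) -> List[str]:
    """Convert one transcript sentence into tab-separated token/tag lines."""
    lines = []
    for raw in sentence.split():
        for word, tag in tag_token(raw):
            lines.append(f"{word}\t{tag}")
    return lines


# Made-up sentence for illustration only.
example = "yesterday Jane_Doe visited ^^meeting @hm Kosice/koshitse"
print("\n".join(to_iob_lines(example)))
```

The sketch relies on regular words being lowercase even at the start of a sentence, as the convention above states. Transcripts that break this rule would need manual review.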
## People

- Cesar Abascal Gutierrez <cesarbielva1994@gmail.com>
- Kyryl Kobzar
- Ediz Morochovič

## Tools

```
* Machine learning: https://spacy.io/usage/training
* Manual annotation: https://prodi.gy/
```

@ -6,70 +6,60 @@ title: Pomenované entity
# Named entities

## Goals

- Be able to recognize unknown named entities
- Create a manually annotated training set from speech transcripts
- Propose an annotation schema

## Tasks

### Data preparation

- Parsing the Wiki XML dump
- Filter for excluding articles
- Manual selection of articles

Input: Wiki XML dump

Output: Corpus of documents for annotation

Done:

- Parsing the Wiki XML dump https://git.kemt.fei.tuke.sk/dano/annotation/src/branch/master/wikicorpus

To do:

- Script for extracting paragraphs.
- Filter for excluding articles and paragraphs.
- Manual selection of articles.

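The parsing, paragraph extraction and filtering steps listed above can be sketched as follows. This is not the code in the linked `wikicorpus` repository. The function names, the 40-character threshold and the list of markup prefixes are illustrative assumptions, and wiki markup inside paragraphs is left untouched.

```python
import io
import xml.etree.ElementTree as ET
from typing import Iterator, List, Tuple


def local(tag: str) -> str:
    """Strip the XML namespace so the code works across dump schema versions."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source) -> Iterator[Tuple[str, str]]:
    """Stream (title, wikitext) pairs from a MediaWiki XML dump."""
    title, text = "", ""
    for _, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = "", ""
            elem.clear()  # free memory, dumps are large


def paragraphs(text: str, min_chars: int = 40) -> List[str]:
    """Split wikitext on blank lines and drop short or markup-only blocks."""
    markup_prefixes = ("=", "{", "|", "[[", "*", "#")  # assumed filter rules
    result = []
    for block in text.split("\n\n"):
        block = " ".join(block.split())
        if len(block) >= min_chars and not block.startswith(markup_prefixes):
            result.append(block)
    return result


# Tiny made-up dump for illustration.
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>Test</title><revision><text>== Heading ==

This is a sufficiently long paragraph of plain article text for annotation.

short</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(sample)))
kept = paragraphs(pages[0][1])
print(pages[0][0], kept)
```

Streaming with `iterparse` keeps memory use flat, which matters because a full Wikipedia dump does not fit comfortably in memory as a single tree.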
### Annotation schema preparation

- Prodigy deployment
- Data conversion into Prodigy

Output: a deployed application ready for annotation

Done:

- Prodigy deployment http://skner.tukekemt.xyz
- Data conversion into Prodigy https://git.kemt.fei.tuke.sk/dano/annotation/src/branch/master/ner

To do:

- Annotation manual
- Set of labels for annotation
- Supporting model? If it helps, also prepare a schema or dataset with the supporting model.

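For the conversion step, a minimal sketch is shown below. It assumes the input is a list of plain-text paragraphs and writes newline-delimited JSON with a `text` key and an optional `meta` object, the shape Prodigy is generally fed. The function name and the `source` value are hypothetical and do not come from the linked `ner` repository.

```python
import json
from typing import Iterable


def to_prodigy_jsonl(paragraphs: Iterable[str], source: str) -> str:
    """Serialize paragraphs as JSONL, one annotation task per line."""
    lines = []
    for text in paragraphs:
        task = {"text": text, "meta": {"source": source}}
        # ensure_ascii=False keeps Slovak diacritics readable in the file.
        lines.append(json.dumps(task, ensure_ascii=False))
    return "\n".join(lines) + "\n"


# Made-up paragraphs for illustration.
jsonl = to_prodigy_jsonl(["First paragraph.", "Druhý odsek."], source="wikicorpus")
print(jsonl, end="")
```

Keeping the origin of each paragraph in `meta` makes it possible to trace an annotated example back to its article later.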
### Preparatory annotation batch

Done:

- Deployment of the application for analyzing annotated data http://aksner.tukekemt.xyz

https://git.kemt.fei.tuke.sk/dano/annotation/src/branch/master/database_app

In progress:

- Application for analyzing annotated data: who annotated what, how, and how much

To do:

- Data annotation
- Preparing a script for cleaning the annotated data

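One possible shape for the cleaning script is sketched here. It assumes each exported record is a dictionary with `text` and `answer` keys and that only accepted, non-empty, non-duplicate texts should be kept. These rules are assumptions, since the actual cleaning criteria are not yet defined.

```python
from typing import Dict, List


def clean(records: List[Dict]) -> List[Dict]:
    """Keep accepted records with non-empty text, dropping repeated texts."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec.get("answer") != "accept":   # assumed: rejected/ignored are dropped
            continue
        text = rec.get("text", "").strip()
        if not text or text in seen:        # assumed: first occurrence wins
            continue
        seen.add(text)
        cleaned.append(rec)
    return cleaned


# Made-up records for illustration.
raw_records = [
    {"text": "Košice is a city.", "answer": "accept"},
    {"text": "Košice is a city.", "answer": "accept"},
    {"text": "Unclear fragment", "answer": "reject"},
    {"text": "   ", "answer": "accept"},
]
cleaned_records = clean(raw_records)
print(len(cleaned_records))
```

Dropping repeated texts discards disagreement between annotators. If agreement is to be measured, duplicates should be kept until that analysis is done.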
### Production annotation batch

To do:

- Motivating students
- Data annotation
- Analysis of annotated data
- Building a corpus of annotated data

### Analysis of completed annotations

Application for analyzing annotations

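The core of such an analysis (who annotated what, how, and how much) could look like the sketch below. It is not the code of the deployed application. It assumes records carry an `_annotator_id` field, an `answer`, and a list of `spans` with a `label`; the annotator field name in particular may differ between Prodigy versions and should be checked against the actual export.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def annotator_stats(records: Iterable[Dict]) -> Dict[str, Counter]:
    """Count tasks, answers and entity labels per annotator."""
    stats: Dict[str, Counter] = defaultdict(Counter)
    for rec in records:
        annotator = rec.get("_annotator_id", "unknown")  # assumed field name
        stats[annotator]["tasks"] += 1
        stats[annotator][f"answer:{rec.get('answer', 'none')}"] += 1
        for span in rec.get("spans", []):
            stats[annotator][f"label:{span['label']}"] += 1
    return stats


# Made-up records for illustration, labels are placeholders.
records = [
    {"_annotator_id": "a1", "answer": "accept", "spans": [{"label": "PER"}, {"label": "LOC"}]},
    {"_annotator_id": "a1", "answer": "reject", "spans": []},
    {"_annotator_id": "a2", "answer": "accept", "spans": [{"label": "PER"}]},
]
stats = annotator_stats(records)
for name, counts in sorted(stats.items()):
    print(name, dict(counts))
```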
## Plan

- Convert speech transcripts into a training set
- Train and evaluate a classifier
- Establish manual annotation
- Select unannotated data

### Data preparation

Input: Transcriber transcripts with inconsistent annotations

```
* Lowercase first letter: regular word
* Capital first letter: named entity
* ''^^'': foreign word
* ''@'': noise
* ''_'': multi-word expression
* ''/'': pronunciation
```

Output: A file that can be read by `spacy convert`

## People

- Cesar Abascal Gutierrez <cesarbielva1994@gmail.com>
- Kyryl Kobzar
- Ediz Morochovič

## Tools

```
* Machine learning: https://spacy.io/usage/training
* Manual annotation: https://prodi.gy/
```