zz
This commit is contained in:
		
							parent
							
								
									6c2a8ff77b
								
							
						
					
					
						commit
						965d5e7dcd
					
				
							
								
								
									
										45
									
								
								pages/interns/cesar_gutierrez/README.md
									
									
									
									
									
										Normal file
									
								
							
							
						
						
									
									
								
							| @ -0,0 +1,45 @@ | ||||
| ## Named entity annotations | ||||
| 
 | ||||
| Cesar Abascal Gutierrez <cesarbielva1994@gmail.com> | ||||
| 
 | ||||
| ## Goals | ||||
| 
 | ||||
|   - Be able to recognize unknown named entities | ||||
|   - Create a manually annotated training set from speech transcripts | ||||
|   - Propose an annotation schema | ||||
| 
 | ||||
| 
 | ||||
| ## Plan | ||||
| 
 | ||||
|   - Convert speech transcripts into a training set | ||||
|   - Train and evaluate classifier | ||||
|   - Establish manual annotation  | ||||
|   - Select unannotated data  | ||||
| 
 | ||||
| ### Data preparation | ||||
| 
 | ||||
| Input: Transcriber transcripts with inconsistent annotations | ||||
| 
 | ||||
| ```  | ||||
|  * First small letter: regular word | ||||
|  * Capital: named entity | ||||
|  * ''^^'': foreign word | ||||
|  * ''@'': noise | ||||
|  * ''_'': multi-word expression | ||||
|  * ''/'': pronunciation | ||||
| ``` | ||||
| 
 | ||||
| Output: A file that can be read by `spacy convert` | ||||
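The conversion described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual script. It assumes that `@` and `^^` are token prefixes, that `/` separates the written form from its pronunciation, and that the target is the one-token-per-line IOB layout (token, tab, tag, blank line between sentences) that `spacy convert` can read. The label `ENT` and both function names are hypothetical, since the transcripts only mark that a word is an entity, not its type.

```python
def token_to_rows(token, label="ENT"):
    """Turn one transcript token into a list of (word, IOB tag) rows.

    Assumed marker handling (not confirmed by the source):
    '@' prefix -> noise, dropped; '^^' prefix -> foreign word, marker stripped;
    'word/pron' -> keep the written form; '_' joins a multi-word expression.
    """
    if token.startswith("@"):
        return []
    if token.startswith("^^"):
        token = token[2:]
    token = token.split("/", 1)[0]
    words = [w for w in token.split("_") if w]
    # A leading capital marks a named entity in the transcripts.
    is_entity = bool(words) and words[0][0].isupper()
    rows = []
    for i, word in enumerate(words):
        if is_entity:
            tag = ("B-" if i == 0 else "I-") + label
        else:
            tag = "O"
        rows.append((word, tag))
    return rows


def transcript_to_iob(sentences):
    """Render sentences as 'token<TAB>tag' lines with a blank line after each."""
    lines = []
    for sentence in sentences:
        for token in sentence.split():
            for word, tag in token_to_rows(token):
                lines.append(f"{word}\t{tag}")
        lines.append("")
    return "\n".join(lines)
```

Writing the result to a file with a `.iob` or `.conll` extension and passing it to `spacy convert` would be the next step; the exact converter option depends on the installed spaCy version.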
| 
 | ||||
| ## People | ||||
| 
 | ||||
| - Cesar Abascal Gutierrez <cesarbielva1994@gmail.com> | ||||
| - Kyryl Kobzar | ||||
| - Ediz Morochovič | ||||
| 
 | ||||
| ## Tools | ||||
| 
 | ||||
| ```  | ||||
|  * Machine learning : https://spacy.io/usage/training | ||||
|  * Manual Annotation : https://prodi.gy/ | ||||
| ``` | ||||
| @ -6,70 +6,60 @@ title: Pomenované entity | ||||
| # Named entities | ||||
| 
 | ||||
| 
 | ||||
| ## Goals | ||||
| 
 | ||||
|   - Be able to recognize unknown named entities | ||||
|   - Create a manually annotated training set from speech transcripts | ||||
|   - Propose an annotation schema | ||||
| 
 | ||||
| 
 | ||||
| ## Tasks | ||||
| 
 | ||||
| ### Data preparation | ||||
| 
 | ||||
| - Parsing the Wiki XML dump | ||||
| - Filter for excluding articles | ||||
| - Manual selection of articles | ||||
| Input: Wiki XML dump | ||||
| Output: Corpus of documents for annotation | ||||
| 
 | ||||
| done: | ||||
| 
 | ||||
| - Parsing the Wiki XML dump https://git.kemt.fei.tuke.sk/dano/annotation/src/branch/master/wikicorpus | ||||
| 
 | ||||
| to do: | ||||
| 
 | ||||
| - Script for extracting paragraphs. | ||||
| - Filter for excluding articles and paragraphs. | ||||
| - Manual selection of articles. | ||||
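A paragraph extractor with a simple filter might look like the sketch below. It is only an illustration and not the code in the linked `wikicorpus` directory. It assumes the standard MediaWiki export layout (`page` > `title`, `revision` > `text`); the function names, the `min_chars` threshold, and the list of markup prefixes are hypothetical choices.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace that MediaWiki dumps put on every tag."""
    return tag.rsplit("}", 1)[-1]


def iter_paragraphs(source, min_chars=40):
    """Yield (article title, paragraph) pairs from a MediaWiki XML dump.

    Paragraphs shorter than min_chars, or starting with wiki markup
    (templates, tables, headings, lists), are skipped. Both rules are
    assumed filters for illustration.
    """
    title = None
    for _, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            for para in (elem.text or "").split("\n\n"):
                para = para.strip()
                if len(para) < min_chars:
                    continue
                if para.startswith(("{", "|", "=", "[[", "*", "#")):
                    continue
                yield title, para
        elif name == "page":
            elem.clear()  # free memory once a page has been processed


# Tiny inline sample standing in for a real dump file.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>Kosice</title><revision><text>== Heading ==

Kosice is the largest city in eastern Slovakia, on the river Hornad.

short</text></revision></page></mediawiki>"""

paragraphs = list(iter_paragraphs(io.BytesIO(SAMPLE)))
```

For a real dump, `source` would be the path to the XML file; `iterparse` keeps memory use low because each page is cleared after it is read.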
| 
 | ||||
| ### Annotation schema preparation | ||||
| 
 | ||||
| - Prodigy deployment | ||||
| - Data conversion into Prodigy | ||||
| Output: a deployed application ready for annotation | ||||
| 
 | ||||
| done: | ||||
| 
 | ||||
| - Prodigy deployment http://skner.tukekemt.xyz | ||||
| - Data conversion into Prodigy https://git.kemt.fei.tuke.sk/dano/annotation/src/branch/master/ner | ||||
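As an illustration of the conversion step, Prodigy reads JSONL input in which each line is a JSON object with a `text` field. The sketch below is not the script in the linked `ner` directory; it assumes the input is a list of (title, paragraph) pairs, and the function name and the use of `meta.source` for the article title are hypothetical.

```python
import json


def to_prodigy_jsonl(paragraphs):
    """Serialize (title, text) pairs as JSONL, one annotation task per line."""
    out = []
    for title, text in paragraphs:
        task = {"text": text, "meta": {"source": title}}
        # ensure_ascii=False keeps Slovak diacritics readable in the file.
        out.append(json.dumps(task, ensure_ascii=False) + "\n")
    return "".join(out)
```

The returned string can be written to a `.jsonl` file and passed to a Prodigy recipe as its source.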
| 
 | ||||
| to do: | ||||
| 
 | ||||
| - Annotation manual | ||||
| - Tag set for annotation | ||||
| - Supporting model? | ||||
| - Supporting model? If it helps, also prepare a schema or dataset with the supporting model. | ||||
| 
 | ||||
| ### Preparatory annotation batch | ||||
| 
 | ||||
| done: | ||||
| 
 | ||||
| - deployment of the application for analysing annotated data http://aksner.tukekemt.xyz | ||||
| 
 | ||||
| https://git.kemt.fei.tuke.sk/dano/annotation/src/branch/master/database_app | ||||
| 
 | ||||
| in progress: | ||||
| 
 | ||||
| - application for analysing annotated data: who annotated what, how, and how much | ||||
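The "who annotated what, how, and how much" question could be answered with simple counts over exported annotations. The sketch below is not the `database_app` code. It assumes a JSONL export where each record carries an `answer` field, an optional `spans` list with `label` entries, and an annotator identifier under `_annotator_id`; the key name is passed as a parameter because it may differ in the actual export, and the function name is hypothetical.

```python
import json
from collections import Counter


def annotation_stats(jsonl_text, annotator_key="_annotator_id"):
    """Count tasks, answers, and span labels per annotator.

    Returns three Counters keyed by annotator, (annotator, answer),
    and (annotator, label). Field names are assumptions about the export.
    """
    per_annotator = Counter()
    per_answer = Counter()
    per_label = Counter()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        annotator = record.get(annotator_key, "unknown")
        per_annotator[annotator] += 1
        per_answer[(annotator, record.get("answer", "unknown"))] += 1
        for span in record.get("spans") or []:
            per_label[(annotator, span.get("label"))] += 1
    return per_annotator, per_answer, per_label
```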
| 
 | ||||
| to do: | ||||
| 
 | ||||
| - Data annotation | ||||
| - Preparing a script for cleaning annotated data | ||||
| 
 | ||||
| ### Production annotation batch | ||||
| 
 | ||||
| to be done: | ||||
| 
 | ||||
| - Motivating students | ||||
| - Data annotation | ||||
| - Analysis of annotated data | ||||
| - Building the corpus of annotated data | ||||
| 
 | ||||
| ### Analysis of completed annotations | ||||
| 
 | ||||
| Application for annotation analysis | ||||
| 
 | ||||
| ## Plan | ||||
| 
 | ||||
|   - Convert speech transcripts into a training set | ||||
|   - Train and evaluate classifier | ||||
|   - Establish manual annotation  | ||||
|   - Select unannotated data  | ||||
| 
 | ||||
| ### Data preparation | ||||
| 
 | ||||
| Input: Transcriber transcripts with inconsistent annotations | ||||
| 
 | ||||
| ```  | ||||
|  * First small letter: regular word | ||||
|  * Capital: named entity | ||||
|  * ''^^'': foreign word | ||||
|  * ''@'': noise | ||||
|  * ''_'': multi-word expression | ||||
|  * ''/'': pronunciation | ||||
| ``` | ||||
| 
 | ||||
| Output: A file that can be read by `spacy convert` | ||||
| 
 | ||||
| ## People | ||||
| 
 | ||||
| - Cesar Abascal Gutierrez <cesarbielva1994@gmail.com> | ||||
| - Kyryl Kobzar | ||||
| - Ediz Morochovič | ||||
| 
 | ||||
| ## Tools | ||||
| 
 | ||||
| ```  | ||||
|  * Machine learning : https://spacy.io/usage/training | ||||
|  * Manual Annotation : https://prodi.gy/ | ||||
| ``` | ||||
|  | ||||