				| @ -13,11 +13,28 @@ Repozitár so [zdrojovými kódmi](https://git.kemt.fei.tuke.sk/dl874wn/dp2021) | ||||
| 
 | ||||
## Diploma Project 2 2020

Virtual meeting on 6 Nov 2020

Status:

- Table with 5 experiments prepared.
- Repository created.

For the next meeting:

- Upload the code to the repository.
- Record the dependencies (package names) in requirements.txt.
- Rework the experiment so that it accepts command-line arguments (sys.argv); a sketch follows this list.
- Add a launch script for each experiment. The script should contain the parameters with which you ran the experiment.
- Finish the report.
- In the theory section, write an overview of punctuation restoration methods and a description of your method.
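
A minimal sketch of such a command-line entry point, assuming a hypothetical script name `run_experiment.py` and made-up parameter names; the real experiment will expose its own options:

```python
# run_experiment.py -- hypothetical example of taking experiment parameters from the command line
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="Run one punctuation-restoration experiment")
    parser.add_argument("--train-file", required=True, help="path to the training data")
    parser.add_argument("--epochs", type=int, default=10, help="number of training epochs")
    parser.add_argument("--lr", type=float, default=1e-3, help="learning rate")
    args = parser.parse_args()  # argparse reads sys.argv[1:] under the hood
    print(f"training on {args.train_file} for {args.epochs} epochs, lr={args.lr}")


if __name__ == "__main__":
    main()
```

The matching launch script can then simply record the chosen parameters, e.g. `python run_experiment.py --train-file data/train.txt --epochs 20 --lr 0.001`.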
| 
 | ||||
| 
 | ||||
Virtual meeting on 25 Sep 2020

Done:

- Script for evaluating the experiments.


Tasks for the next meeting:
|  | ||||
| @ -21,8 +21,21 @@ Zásobník úloh: | ||||
| 
 | ||||
- Use the model to support annotation
- By the end of the winter semester, produce a report in the form of an article.
- Build a way to find out how much annotated data there is and of what kind. How many articles? How many entities of each type?
- Write down rules for validation. What counts as a good annotation result? Do the annotated data need to be checked?
| 
 | ||||
Virtual meeting on 30 Oct 2020:

Status:

- Improved the guide
- Tried exporting the data and training a model from the database. Problem when training in spaCy: different results than when training through Prodigy
- Work on the text part of the thesis.

Tasks for the next meeting:
- Create a repository named dp2021 and add your scripts and notes there.
- Continue writing the thesis. Do a literature survey on "named entity corpora" and take notes.
- Build a way to find out how much annotated data there is and of what kind. How many articles? How many entities of each type? The resulting table will go into the thesis.
- Prepare for the production annotations. Is the schema ready?
| 
 | ||||
Virtual meeting on 16 Oct 2020:
| 
 | ||||
|  | ||||
| @ -1 +1,40 @@ | ||||
| DP2021 | ||||
## Diploma Project 2 2020
Status:
- Updated the annotation schema (it is a test schema with my own data)
- Several annotations done; training in Prodigy gives low accuracy because of the small amount of annotated data. Training in spaCy does not work yet.
- Statistics on the number of accepted and rejected annotations come from Prodigy: prodigy stats wikiart. So far 156 annotations (151 accept, 5 reject). To get an overview of the number of annotations per entity type we need to write a script.
- Literature overview on named entity corpora
    - Building a corpus for NER: automatic creation of an already-annotated corpus from Wikipedia using DBpedia. It is an English corpus, but it can be mentioned when comparing approaches.
        - Building a Massive Corpus for Named Entity Recognition using Free Open Data Sources - Daniel Specht Menezes, Pedro Savarese, Ruy L. Milidiú
    - Comparison of approaches to corpus annotation (in terms of both accuracy and time): manual vs. semi-manual
        - Comparison of Annotating Methods for Named Entity Corpora - Kanako Komiya, Masaya Suzuki
    - What a corpus is, the development cycle, corpus analysis (literature already used: the MATTER cycle)
        - Natural Language Annotation for Machine Learning - James Pustejovsky, Amber Stubbs

Update 9 Nov 2020:
- Fixed the problem that prevented training in spaCy
- Performed a test annotation of about 500 sentences. Training results after 20 iterations: F-score 47% (the same results when training in spaCy and in Prodigy)
- Statistics on the number of entities per type: script count.py
| 
 | ||||
| 
 | ||||
## Diploma Project 1 2020

- Creating and starting the Docker container


```
./build-docker.sh
docker run -it -p 8080:8080 -v ${PWD}:/work prodigy bash
# (in my case:)
winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/annotation/work prodigy bash
```




### Running the annotation schema
- `dataminer.csv` articles downloaded from the wiki
- `cd ner`
- `./01_text_to_sent.sh` runs the *text_to_sent.py* script, which splits the articles into individual sentences (a sketch of such a script follows this list)
- `./02_ner_correct.sh` starts the NER annotation process with suggestions from the model
- `./03_ner_export.sh` exports the annotated data in the JSONL format needed for processing in spaCy
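
A minimal sketch of what *text_to_sent.py* could look like, assuming the article text sits in the last column of `dataminer.csv` and using NLTK (which the Dockerfile installs) for sentence splitting; the column position and the output file name are assumptions:

```python
# text_to_sent.py -- hypothetical sketch: split article texts into one sentence per line
import csv

import nltk

nltk.download("punkt", quiet=True)  # sentence-tokenizer models (default Punkt model)

with open("dataminer.csv", newline="", encoding="utf-8") as src, \
        open("sentences.txt", "w", encoding="utf-8") as dst:
    for row in csv.reader(src):
        text = row[-1]  # assumption: the article text is the last column
        for sentence in nltk.sent_tokenize(text):
            dst.write(sentence.strip() + "\n")
```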
|  | ||||
| @ -1,17 +1,16 @@ | ||||
| # > docker run -it -p 8080:8080 -v ${PWD}:/work prodigy bash | ||||
| # > winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/annotation/work prodigy bash | ||||
| 
 | ||||
| FROM python:3.8 | ||||
| RUN mkdir /prodigy | ||||
| WORKDIR /prodigy | ||||
| COPY ./prodigy-1.9.6-cp36.cp37.cp38-cp36m.cp37m.cp38-linux_x86_64.whl  /prodigy | ||||
| RUN mkdir /work | ||||
| COPY ./ner /work | ||||
| RUN pip install prodigy-1.9.6-cp36.cp37.cp38-cp36m.cp37m.cp38-linux_x86_64.whl | ||||
| RUN pip install https://files.kemt.fei.tuke.sk/models/spacy/sk_sk1-0.0.1.tar.gz | ||||
| RUN pip install nltk | ||||
| EXPOSE 8080 | ||||
| ENV PRODIGY_HOME /work | ||||
| ENV PRODIGY_HOST 0.0.0.0 | ||||
| WORKDIR /work | ||||
| 
 | ||||
| # > docker run -it -p 8080:8080 -v ${PWD}:/work prodigy bash | ||||
| # > winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/annotation-master/annotation/work prodigy bash | ||||
| 
 | ||||
| FROM python:3.8 | ||||
| RUN mkdir /prodigy | ||||
| WORKDIR /prodigy | ||||
| COPY ./prodigy-1.9.6-cp36.cp37.cp38-cp36m.cp37m.cp38-linux_x86_64.whl  /prodigy | ||||
| RUN mkdir /work | ||||
| COPY ./ner /work/ner | ||||
| RUN pip install uvicorn==0.11.5 prodigy-1.9.6-cp36.cp37.cp38-cp36m.cp37m.cp38-linux_x86_64.whl | ||||
| RUN pip install https://files.kemt.fei.tuke.sk/models/spacy/sk_sk1-0.0.1.tar.gz | ||||
| RUN pip install nltk | ||||
| EXPOSE 8080 | ||||
| ENV PRODIGY_HOME /work | ||||
| ENV PRODIGY_HOST 0.0.0.0 | ||||
| WORKDIR /work | ||||
| @ -1,13 +1,11 @@ | ||||
## Diploma Project 1 2020
## Diploma Project 2 2020
| 
 | ||||
- Creating and starting the Docker container
| 
 | ||||
| 
 | ||||
```
./build-docker.sh
docker run -it -p 8080:8080 -v ${PWD}:/work prodigy bash
# (in my case:)
winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/annotation/work prodigy bash
winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/annotation-master/annotation/work prodigy bash
```
| 
 | ||||
| 
 | ||||
| @ -17,5 +15,12 @@ winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/ann | ||||
- `dataminer.csv` articles downloaded from the wiki
- `cd ner`
- `./01_text_to_sent.sh` runs the *text_to_sent.py* script, which splits the articles into individual sentences
- `./02_ner_correct.sh` starts the NER annotation process with suggestions from the model
- `./03_ner_export.sh` exports the annotated data in the JSONL format needed for processing in spaCy
- `./02_ner_manual.sh` starts the manual NER annotation process
- `./03_export.sh` exports the annotated data in the JSON format needed for processing in spaCy, with the option of splitting into training (70%) and test (30%) data (--eval-split 0.3).
| 
 | ||||
### Statistics on the annotated data
- `prodigy stats wikiart` - information about the number of accepted and rejected articles
- `python3 count.py` - information about the number of entities of each type
| 
 | ||||
### Training the model
Based on: https://git.kemt.fei.tuke.sk/dano/spacy-skmodel
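
Once the packaged model is installed (the Dockerfile installs it under the name `sk_sk1`), a quick sanity check of the NER output could look like the following; the sample sentence is made up:

```python
# Quick check that the installed model tags entities; the example text is made up.
import spacy

nlp = spacy.load("sk_sk1")
doc = nlp("Peter Sagan vyhral preteky v Poprade pre tím Bora.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```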
|  | ||||
| @ -0,0 +1,14 @@ | ||||
# load data (annotations.jsonl is produced by `prodigy db-out wikiart > ./annotations.jsonl`)
filename = 'ner/annotations.jsonl'
with open(filename, 'rt', encoding='utf-8') as file:
    text = file.read()

# count the occurrences of each entity label
# note: this is a plain substring count over the raw JSONL, so it is only an
# approximation; the label strings could also match elsewhere in the text
countPER = text.count('PER')
countLOC = text.count('LOC')
countORG = text.count('ORG')
countMISC = text.count('MISC')
print('Počet anotovaných entít typu PER:', countPER, '\n',
      'Počet anotovaných entít typu LOC:', countLOC, '\n',
      'Počet anotovaných entít typu ORG:', countORG, '\n',
      'Počet anotovaných entít typu MISC:', countMISC, '\n')
| @ -1,3 +0,0 @@ | ||||
| 
 | ||||
| prodigy ner.correct wikiart sk_sk1 ./textfile.csv --label OSOBA,MIESTO,ORGANIZACIA,PRODUKT | ||||
| 
 | ||||
| @ -0,0 +1,2 @@ | ||||
| prodigy ner.manual wikiart sk_sk1 ./textfile.csv --label PER,LOC,ORG,MISC | ||||
| 
 | ||||
| @ -0,0 +1 @@ | ||||
| prodigy data-to-spacy ./train.json ./eval.json --lang sk --ner wikiart --eval-split 0.3 | ||||
| @ -1 +0,0 @@ | ||||
| prodigy db-out wikiart > ./annotations.jsonl | ||||
| @ -0,0 +1,19 @@ | ||||
| mkdir -p build | ||||
| mkdir -p build/input | ||||
| # Prepare Treebank | ||||
| mkdir -p build/input/slovak-treebank | ||||
| spacy convert ./sources/slovak-treebank/stb.conll ./build/input/slovak-treebank | ||||
| # UDAG used as evaluation | ||||
| mkdir -p build/input/ud-artificial-gapping | ||||
| spacy convert ./sources/ud-artificial-gapping/sk-ud-crawled-orphan.conllu ./build/input/ud-artificial-gapping | ||||
| # Prepare skner | ||||
| mkdir -p build/input/skner | ||||
| # Convert to IOB | ||||
| cat ./sources/skner/wikiann-sk.bio | python ./sources/bio-to-iob.py > build/input/skner/wikiann-sk.iob | ||||
| # Split to train test | ||||
| cat ./build/input/skner/wikiann-sk.iob | python ./sources/iob-to-traintest.py ./build/input/skner/wikiann-sk | ||||
| # Convert train and test | ||||
| mkdir -p build/input/skner-train | ||||
| spacy convert -n 15 --converter ner ./build/input/skner/wikiann-sk.train ./build/input/skner-train | ||||
| mkdir -p build/input/skner-test | ||||
| spacy convert -n 15 --converter ner ./build/input/skner/wikiann-sk.test ./build/input/skner-test | ||||
| @ -0,0 +1,19 @@ | ||||
| set -e | ||||
| OUTDIR=build/train/output | ||||
| TRAINDIR=build/train | ||||
| mkdir -p $TRAINDIR | ||||
| mkdir -p $OUTDIR | ||||
| mkdir -p dist | ||||
| # Delete old training results | ||||
| rm -rf $OUTDIR/* | ||||
| # Train dependency and POS | ||||
| spacy train sk $OUTDIR ./build/input/slovak-treebank ./build/input/ud-artificial-gapping  --n-iter 20 -p tagger,parser | ||||
| rm -rf $TRAINDIR/posparser | ||||
| mv $OUTDIR/model-best $TRAINDIR/posparser | ||||
| # Train NER | ||||
| # python ./train.py -t ./train.json -o $TRAINDIR/nerposparser -n 10 -m $TRAINDIR/posparser/ | ||||
| spacy train sk $TRAINDIR/nerposparser ./ner/train.json ./ner/eval.json --n-iter 20 -p ner | ||||
| # Package model | ||||
| spacy package $TRAINDIR/nerposparser dist --meta-path ./meta.json --force | ||||
| cd dist/sk_sk1-0.2.0 | ||||
| python ./setup.py sdist --dist-dir ../ | ||||
| @ -31,11 +31,39 @@ Zásobník úloh: | ||||
| 
 | ||||
- Make a public demo: deployment with Docker
- Improve the web UI
- Create a REST API for indexing a document.
- In the index, assign each document a score according to several methods, e.g. PageRank
- Use the scoring during search
- **Use the SCNC validation database to evaluate every method**
- **By the end of the winter semester, write a "mini diploma thesis, about 8 pages with experiments" in the form of an article**
| 
 | ||||
| 
 | ||||
Virtual meeting on 6 Nov 2020:

Status:

- Working through problems with Cassandra and JavaScript. How does the `then` function work?

Tasks for the next meeting:

- Write the indexing function. The input is a document (an object with text and meta-information). The function indexes the document into Elasticsearch; see the sketch below.
- Study how the `then` function works and what a callback is.
- Study how a Promise is used.
- Study how async/await works.
- https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Asynchronous/
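
A minimal sketch of such an indexing function, shown first with `.then`/`.catch` and then with async/await. The client setup and the `skweb2` index name mirror cassandra.js below; the document shape is only an assumption:

```javascript
// Hypothetical sketch: index one document into Elasticsearch.
const elasticsearch = require('elasticsearch');

const client = new elasticsearch.Client({ hosts: ['localhost:9200'] });

function indexDocument(doc) {
  // client.index() returns a Promise when no callback is passed
  return client.index({
    index: 'skweb2',
    type: '_doc',
    body: { title: doc.title, text: doc.text, url: doc.url },
  });
}

// Promise style: .then runs on success, .catch on failure.
indexDocument({ title: 'Example', text: 'Body text', url: 'http://example.org' })
  .then(response => console.log('indexed', response))
  .catch(error => console.error('indexing failed', error));

// async/await style: the same flow written sequentially.
async function run() {
  try {
    const response = await indexDocument({ title: 'Example', text: 'Body text', url: 'http://example.org' });
    console.log('indexed', response);
  } catch (error) {
    console.error('indexing failed', error);
  }
}
run();
```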
| 
 | ||||
| 
 | ||||
| 
 | ||||
Virtual meeting on 23 Oct 2020:

Status:
- Working through problems with Cassandra. How to select data by the primary key.

For the next meeting:

- Continue with the open tasks.
- Write a function for indexing a single document.
| 
 | ||||
Virtual meeting on 16 Oct

Status:
|  | ||||
pages/students/2016/jan_holp/dp2021/zdrojove_subory/cassandra.js (new file, 105 lines)
							| @ -0,0 +1,105 @@ | ||||
| //Jan Holp, DP 2021
 | ||||
| 
 | ||||
| 
 | ||||
| //client1 = cassandra
 | ||||
| //client2 = elasticsearch 
 | ||||
| //-----------------------------------------------------------------
 | ||||
| 
 | ||||
//require the Elasticsearch library
 | ||||
| const elasticsearch = require('elasticsearch'); | ||||
| const client2 = new elasticsearch.Client({ | ||||
|    hosts: [ 'localhost:9200'] | ||||
| }); | ||||
| client2.ping({ | ||||
|      requestTimeout: 30000, | ||||
|  }, function(error) { | ||||
 // at this point, Elasticsearch is down, please check your Elasticsearch service
 | ||||
|      if (error) { | ||||
|          console.error('Elasticsearch cluster is down!'); | ||||
|      } else { | ||||
|          console.log('Everything is ok'); | ||||
|      } | ||||
|  }); | ||||
| 
 | ||||
| //create new index skweb2
 | ||||
| client2.indices.create({ | ||||
|     index: 'skweb2' | ||||
| }, function(error, response, status) { | ||||
|     if (error) { | ||||
|         console.log(error); | ||||
|     } else { | ||||
|         console.log("created a new index", response); | ||||
|     } | ||||
| }); | ||||
| 
 | ||||
| const cassandra = require('cassandra-driver'); | ||||
| const client1 = new cassandra.Client({ contactPoints: ['localhost:9042'], localDataCenter: 'datacenter1', keyspace: 'websucker' }); | ||||
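// client1 connects to the local Cassandra cluster (websucker keyspace); the query below selects titles of crawled documents with a non-empty body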
| const query = 'SELECT title  FROM websucker.content WHERE body_size > 0  ALLOW FILTERING'; | ||||
client1.execute(query)
  .then(result => {
    console.log('Everything is ok');
    console.log(result);
  })
  .catch(error => {
    // the error handler must be attached to the promise chain; a detached
    // function(error) { ... } expression is never called
    console.error('Something is wrong!');
    console.log(error);
  });
| 
 | ||||
| /* | ||||
| async  function indexData() { | ||||
| 
 | ||||
|   var i = 0; | ||||
|   const query = 'SELECT title  FROM websucker.content WHERE body_size > 0  ALLOW FILTERING';  | ||||
|   client1.execute(query) | ||||
|     .then((result) => { | ||||
|     try { | ||||
|         //for ( i=0; i<15;i++){
 | ||||
|         console.log('%s', result.row[0].title) | ||||
|       //}
 | ||||
|   } catch (query) { | ||||
|       if (query  instanceof SyntaxError) { | ||||
|           console.log( "Neplatne query" ); | ||||
|         }  | ||||
|   } | ||||
| 
 | ||||
|      | ||||
| 
 | ||||
|     }); | ||||
|        | ||||
| 
 | ||||
|   } | ||||
| 
 | ||||
| /* | ||||
| 
 | ||||
| //indexing method
 | ||||
| const bulkIndex = function bulkIndex(index, type, data) { | ||||
| 	let bulkBody = []; | ||||
| 	id = 1; | ||||
| const errorCount = 0; | ||||
| 	data.forEach(item => { | ||||
| 		bulkBody.push({ | ||||
| 			index: { | ||||
| 				_index: index, | ||||
| 				_type:  type, | ||||
| 				_id :   id++, | ||||
| 			} | ||||
| 		}); | ||||
| 		bulkBody.push(item); | ||||
| 	}); | ||||
|         console.log(bulkBody); | ||||
| 	client.bulk({body: bulkBody}) | ||||
| 		.then(response => { | ||||
| 
 | ||||
| 			response.items.forEach(item => { | ||||
| 				if (item.index && item.index.error) { | ||||
| 					console.log(++errorCount, item.index.error); | ||||
| 				} | ||||
| 			}); | ||||
| 			console.log( | ||||
| 				`Successfully indexed ${data.length - errorCount} | ||||
| 				out of ${data.length} items` | ||||
| 			); | ||||
| 		}) | ||||
| 		.catch(console.err); | ||||
| }; | ||||
| */ | ||||
| @ -23,13 +23,26 @@ Zásobník úloh : | ||||
|         - tesla | ||||
|         - xavier | ||||
    - Training on two cards on one machine
        - idoc
        - idoc DONE
        - titan
    - possibly training on 4 cards on one machine
        - quadra
    - *Training on two cards on two machines using NCCL (idoc, tesla)*
    - possibly training on 2 cards on two machines (quadra plus idoc).
| 
 | ||||
Virtual meeting on 27 Oct 2020

Status:

- Training on the CPU, on 1 GPU and on 2 GPUs on idoc
- Preparation of materials for training on two machines with PyTorch.
- Access to tesla and xavier set up.

Tasks for the next meeting:
- Study the technical literature and write up notes.
- Continue with the open tasks from the backlog
- Store the completed scripts in the Git repository
- Create a dp2021 repository
| 
 | ||||
Meeting on 2 Oct 2020
| 
 | ||||
|  | ||||
| @ -1 +1,4 @@ | ||||
## All scripts, files, and configurations

https://github.com/pytorch/examples/tree/master/imagenet
- should work for DDP; the ImageNet file is not available from the official site (see the DDP sketch below)
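
As a starting point before adapting the ImageNet example, a minimal sketch of PyTorch DistributedDataParallel setup (one process per GPU, NCCL backend); it assumes the launcher provides the usual environment variables (newer launchers set LOCAL_RANK, older ones pass --local_rank as an argument instead):

```python
# Minimal DDP sketch: wrap a model so that each process trains on its own GPU over NCCL.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_and_wrap(model: torch.nn.Module) -> DDP:
    local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher
    dist.init_process_group(backend="nccl")     # reads MASTER_ADDR/PORT, RANK, WORLD_SIZE from the environment
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    return DDP(model, device_ids=[local_rank])
```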
							| @ -0,0 +1,76 @@ | ||||
| import argparse | ||||
| import datetime | ||||
| import os | ||||
| import socket | ||||
| import sys | ||||
| 
 | ||||
| import numpy as np | ||||
| from torch.utils.tensorboard import SummaryWriter | ||||
| 
 | ||||
| import torch | ||||
| import torch.nn as nn | ||||
| import torch.optim | ||||
| 
 | ||||
| from torch.optim import SGD, Adam | ||||
| from torch.utils.data import DataLoader | ||||
| 
 | ||||
| from util.util import enumerateWithEstimate | ||||
| from p2ch13.dsets import Luna2dSegmentationDataset, TrainingLuna2dSegmentationDataset, getCt | ||||
| from util.logconf import logging | ||||
| from util.util import xyz2irc | ||||
| from p2ch13.model_seg import UNetWrapper, SegmentationAugmentation | ||||
| from p2ch13.train_seg import LunaTrainingApp | ||||
| 
 | ||||
| log = logging.getLogger(__name__) | ||||
| # log.setLevel(logging.WARN) | ||||
| # log.setLevel(logging.INFO) | ||||
| log.setLevel(logging.DEBUG) | ||||
| 
 | ||||
| class BenchmarkLuna2dSegmentationDataset(TrainingLuna2dSegmentationDataset): | ||||
|     def __len__(self): | ||||
|         # return 500 | ||||
        return 5000
        # return 1000  (unreachable alternative size; commented out)
| 
 | ||||
| class LunaBenchmarkApp(LunaTrainingApp): | ||||
|     def initTrainDl(self): | ||||
|         train_ds = BenchmarkLuna2dSegmentationDataset( | ||||
|             val_stride=10, | ||||
|             isValSet_bool=False, | ||||
|             contextSlices_count=3, | ||||
|             # augmentation_dict=self.augmentation_dict, | ||||
|         ) | ||||
| 
 | ||||
|         batch_size = self.cli_args.batch_size | ||||
|         if self.use_cuda: | ||||
|             batch_size *= torch.cuda.device_count() | ||||
| 
 | ||||
|         train_dl = DataLoader( | ||||
|             train_ds, | ||||
|             batch_size=batch_size, | ||||
|             num_workers=self.cli_args.num_workers, | ||||
|             pin_memory=self.use_cuda, | ||||
|         ) | ||||
| 
 | ||||
|         return train_dl | ||||
| 
 | ||||
|     def main(self): | ||||
|         log.info("Starting {}, {}".format(type(self).__name__, self.cli_args)) | ||||
| 
 | ||||
|         train_dl = self.initTrainDl() | ||||
| 
 | ||||
|         for epoch_ndx in range(1, 2): | ||||
|             log.info("Epoch {} of {}, {}/{} batches of size {}*{}".format( | ||||
|                 epoch_ndx, | ||||
|                 self.cli_args.epochs, | ||||
|                 len(train_dl), | ||||
|                 len([]), | ||||
|                 self.cli_args.batch_size, | ||||
|                 (torch.cuda.device_count() if self.use_cuda else 1), | ||||
|             )) | ||||
| 
 | ||||
|             self.doTraining(epoch_ndx, train_dl) | ||||
| 
 | ||||
| 
 | ||||
| if __name__ == '__main__': | ||||
|     LunaBenchmarkApp().main() | ||||
| @ -0,0 +1,401 @@ | ||||
| import copy | ||||
| import csv | ||||
| import functools | ||||
| import glob | ||||
| import math | ||||
| import os | ||||
| import random | ||||
| 
 | ||||
| from collections import namedtuple | ||||
| 
 | ||||
| import SimpleITK as sitk | ||||
| import numpy as np | ||||
| import scipy.ndimage.morphology as morph | ||||
| 
 | ||||
| import torch | ||||
| import torch.cuda | ||||
| import torch.nn.functional as F | ||||
| from torch.utils.data import Dataset | ||||
| 
 | ||||
| from util.disk import getCache | ||||
| from util.util import XyzTuple, xyz2irc | ||||
| from util.logconf import logging | ||||
| 
 | ||||
| log = logging.getLogger(__name__) | ||||
| # log.setLevel(logging.WARN) | ||||
| # log.setLevel(logging.INFO) | ||||
| log.setLevel(logging.DEBUG) | ||||
| 
 | ||||
| raw_cache = getCache('part2ch13_raw') | ||||
| 
 | ||||
| MaskTuple = namedtuple('MaskTuple', 'raw_dense_mask, dense_mask, body_mask, air_mask, raw_candidate_mask, candidate_mask, lung_mask, neg_mask, pos_mask') | ||||
| 
 | ||||
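# one record per candidate: nodule/annotation/malignancy flags, diameter in mm, scan id, and centre in patient (xyz) coordinates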
| CandidateInfoTuple = namedtuple('CandidateInfoTuple', 'isNodule_bool, hasAnnotation_bool, isMal_bool, diameter_mm, series_uid, center_xyz') | ||||
| 
 | ||||
| @functools.lru_cache(1) | ||||
| def getCandidateInfoList(requireOnDisk_bool=True): | ||||
|     # We construct a set with all series_uids that are present on disk. | ||||
|     # This will let us use the data, even if we haven't downloaded all of | ||||
|     # the subsets yet. | ||||
|     mhd_list = glob.glob('data-unversioned/subset*/*.mhd') | ||||
|     presentOnDisk_set = {os.path.split(p)[-1][:-4] for p in mhd_list} | ||||
| 
 | ||||
|     candidateInfo_list = [] | ||||
|     with open('data/annotations_with_malignancy.csv', "r") as f: | ||||
|         for row in list(csv.reader(f))[1:]: | ||||
|             series_uid = row[0] | ||||
|             annotationCenter_xyz = tuple([float(x) for x in row[1:4]]) | ||||
|             annotationDiameter_mm = float(row[4]) | ||||
|             isMal_bool = {'False': False, 'True': True}[row[5]] | ||||
| 
 | ||||
|             candidateInfo_list.append( | ||||
|                 CandidateInfoTuple( | ||||
|                     True, | ||||
|                     True, | ||||
|                     isMal_bool, | ||||
|                     annotationDiameter_mm, | ||||
|                     series_uid, | ||||
|                     annotationCenter_xyz, | ||||
|                 ) | ||||
|             ) | ||||
| 
 | ||||
|     with open('data/candidates.csv', "r") as f: | ||||
|         for row in list(csv.reader(f))[1:]: | ||||
|             series_uid = row[0] | ||||
| 
 | ||||
|             if series_uid not in presentOnDisk_set and requireOnDisk_bool: | ||||
|                 continue | ||||
| 
 | ||||
|             isNodule_bool = bool(int(row[4])) | ||||
|             candidateCenter_xyz = tuple([float(x) for x in row[1:4]]) | ||||
| 
 | ||||
|             if not isNodule_bool: | ||||
|                 candidateInfo_list.append( | ||||
|                     CandidateInfoTuple( | ||||
|                         False, | ||||
|                         False, | ||||
|                         False, | ||||
|                         0.0, | ||||
|                         series_uid, | ||||
|                         candidateCenter_xyz, | ||||
|                     ) | ||||
|                 ) | ||||
| 
 | ||||
|     candidateInfo_list.sort(reverse=True) | ||||
|     return candidateInfo_list | ||||
| 
 | ||||
| @functools.lru_cache(1) | ||||
| def getCandidateInfoDict(requireOnDisk_bool=True): | ||||
|     candidateInfo_list = getCandidateInfoList(requireOnDisk_bool) | ||||
|     candidateInfo_dict = {} | ||||
| 
 | ||||
|     for candidateInfo_tup in candidateInfo_list: | ||||
|         candidateInfo_dict.setdefault(candidateInfo_tup.series_uid, | ||||
|                                       []).append(candidateInfo_tup) | ||||
| 
 | ||||
|     return candidateInfo_dict | ||||
| 
 | ||||
| class Ct: | ||||
|     def __init__(self, series_uid): | ||||
|         mhd_path = glob.glob( | ||||
|             'data-unversioned/subset*/{}.mhd'.format(series_uid) | ||||
|         )[0] | ||||
| 
 | ||||
|         ct_mhd = sitk.ReadImage(mhd_path) | ||||
|         self.hu_a = np.array(sitk.GetArrayFromImage(ct_mhd), dtype=np.float32) | ||||
| 
 | ||||
|         # CTs are natively expressed in https://en.wikipedia.org/wiki/Hounsfield_scale | ||||
|         # HU are scaled oddly, with 0 g/cc (air, approximately) being -1000 and 1 g/cc (water) being 0. | ||||
| 
 | ||||
|         self.series_uid = series_uid | ||||
| 
 | ||||
|         self.origin_xyz = XyzTuple(*ct_mhd.GetOrigin()) | ||||
|         self.vxSize_xyz = XyzTuple(*ct_mhd.GetSpacing()) | ||||
|         self.direction_a = np.array(ct_mhd.GetDirection()).reshape(3, 3) | ||||
| 
 | ||||
|         candidateInfo_list = getCandidateInfoDict()[self.series_uid] | ||||
| 
 | ||||
|         self.positiveInfo_list = [ | ||||
|             candidate_tup | ||||
|             for candidate_tup in candidateInfo_list | ||||
|             if candidate_tup.isNodule_bool | ||||
|         ] | ||||
|         self.positive_mask = self.buildAnnotationMask(self.positiveInfo_list) | ||||
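        # indices of CT slices that contain at least one positive (nodule) voxel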
|         self.positive_indexes = (self.positive_mask.sum(axis=(1,2)) | ||||
|                                  .nonzero()[0].tolist()) | ||||
| 
 | ||||
|     def buildAnnotationMask(self, positiveInfo_list, threshold_hu = -700): | ||||
        boundingBox_a = np.zeros_like(self.hu_a, dtype=bool)  # np.bool is removed in newer NumPy; the builtin bool is equivalent
| 
 | ||||
|         for candidateInfo_tup in positiveInfo_list: | ||||
|             center_irc = xyz2irc( | ||||
|                 candidateInfo_tup.center_xyz, | ||||
|                 self.origin_xyz, | ||||
|                 self.vxSize_xyz, | ||||
|                 self.direction_a, | ||||
|             ) | ||||
|             ci = int(center_irc.index) | ||||
|             cr = int(center_irc.row) | ||||
|             cc = int(center_irc.col) | ||||
| 
 | ||||
|             index_radius = 2 | ||||
|             try: | ||||
|                 while self.hu_a[ci + index_radius, cr, cc] > threshold_hu and \ | ||||
|                         self.hu_a[ci - index_radius, cr, cc] > threshold_hu: | ||||
|                     index_radius += 1 | ||||
|             except IndexError: | ||||
|                 index_radius -= 1 | ||||
| 
 | ||||
|             row_radius = 2 | ||||
|             try: | ||||
|                 while self.hu_a[ci, cr + row_radius, cc] > threshold_hu and \ | ||||
|                         self.hu_a[ci, cr - row_radius, cc] > threshold_hu: | ||||
|                     row_radius += 1 | ||||
|             except IndexError: | ||||
|                 row_radius -= 1 | ||||
| 
 | ||||
|             col_radius = 2 | ||||
|             try: | ||||
|                 while self.hu_a[ci, cr, cc + col_radius] > threshold_hu and \ | ||||
|                         self.hu_a[ci, cr, cc - col_radius] > threshold_hu: | ||||
|                     col_radius += 1 | ||||
|             except IndexError: | ||||
|                 col_radius -= 1 | ||||
| 
 | ||||
|             # assert index_radius > 0, repr([candidateInfo_tup.center_xyz, center_irc, self.hu_a[ci, cr, cc]]) | ||||
|             # assert row_radius > 0 | ||||
|             # assert col_radius > 0 | ||||
| 
 | ||||
|             boundingBox_a[ | ||||
|                  ci - index_radius: ci + index_radius + 1, | ||||
|                  cr - row_radius: cr + row_radius + 1, | ||||
|                  cc - col_radius: cc + col_radius + 1] = True | ||||
| 
 | ||||
|         mask_a = boundingBox_a & (self.hu_a > threshold_hu) | ||||
| 
 | ||||
|         return mask_a | ||||
| 
 | ||||
|     def getRawCandidate(self, center_xyz, width_irc): | ||||
|         center_irc = xyz2irc(center_xyz, self.origin_xyz, self.vxSize_xyz, | ||||
|                              self.direction_a) | ||||
| 
 | ||||
|         slice_list = [] | ||||
|         for axis, center_val in enumerate(center_irc): | ||||
|             start_ndx = int(round(center_val - width_irc[axis]/2)) | ||||
|             end_ndx = int(start_ndx + width_irc[axis]) | ||||
| 
 | ||||
|             assert center_val >= 0 and center_val < self.hu_a.shape[axis], repr([self.series_uid, center_xyz, self.origin_xyz, self.vxSize_xyz, center_irc, axis]) | ||||
| 
 | ||||
|             if start_ndx < 0: | ||||
|                 # log.warning("Crop outside of CT array: {} {}, center:{} shape:{} width:{}".format( | ||||
|                 #     self.series_uid, center_xyz, center_irc, self.hu_a.shape, width_irc)) | ||||
|                 start_ndx = 0 | ||||
|                 end_ndx = int(width_irc[axis]) | ||||
| 
 | ||||
|             if end_ndx > self.hu_a.shape[axis]: | ||||
|                 # log.warning("Crop outside of CT array: {} {}, center:{} shape:{} width:{}".format( | ||||
|                 #     self.series_uid, center_xyz, center_irc, self.hu_a.shape, width_irc)) | ||||
|                 end_ndx = self.hu_a.shape[axis] | ||||
|                 start_ndx = int(self.hu_a.shape[axis] - width_irc[axis]) | ||||
| 
 | ||||
|             slice_list.append(slice(start_ndx, end_ndx)) | ||||
| 
 | ||||
|         ct_chunk = self.hu_a[tuple(slice_list)] | ||||
|         pos_chunk = self.positive_mask[tuple(slice_list)] | ||||
| 
 | ||||
|         return ct_chunk, pos_chunk, center_irc | ||||
| 
 | ||||
| @functools.lru_cache(1, typed=True) | ||||
| def getCt(series_uid): | ||||
|     return Ct(series_uid) | ||||
| 
 | ||||
| @raw_cache.memoize(typed=True) | ||||
| def getCtRawCandidate(series_uid, center_xyz, width_irc): | ||||
|     ct = getCt(series_uid) | ||||
|     ct_chunk, pos_chunk, center_irc = ct.getRawCandidate(center_xyz, | ||||
|                                                          width_irc) | ||||
|     ct_chunk.clip(-1000, 1000, ct_chunk) | ||||
|     return ct_chunk, pos_chunk, center_irc | ||||
| 
 | ||||
| @raw_cache.memoize(typed=True) | ||||
| def getCtSampleSize(series_uid): | ||||
|     ct = Ct(series_uid) | ||||
|     return int(ct.hu_a.shape[0]), ct.positive_indexes | ||||
| 
 | ||||
| 
 | ||||
| class Luna2dSegmentationDataset(Dataset): | ||||
|     def __init__(self, | ||||
|                  val_stride=0, | ||||
|                  isValSet_bool=None, | ||||
|                  series_uid=None, | ||||
|                  contextSlices_count=3, | ||||
|                  fullCt_bool=False, | ||||
|             ): | ||||
|         self.contextSlices_count = contextSlices_count | ||||
|         self.fullCt_bool = fullCt_bool | ||||
| 
 | ||||
|         if series_uid: | ||||
|             self.series_list = [series_uid] | ||||
|         else: | ||||
|             self.series_list = sorted(getCandidateInfoDict().keys()) | ||||
| 
 | ||||
|         if isValSet_bool: | ||||
|             assert val_stride > 0, val_stride | ||||
|             self.series_list = self.series_list[::val_stride] | ||||
|             assert self.series_list | ||||
|         elif val_stride > 0: | ||||
|             del self.series_list[::val_stride] | ||||
|             assert self.series_list | ||||
| 
 | ||||
|         self.sample_list = [] | ||||
|         for series_uid in self.series_list: | ||||
|             index_count, positive_indexes = getCtSampleSize(series_uid) | ||||
| 
 | ||||
|             if self.fullCt_bool: | ||||
|                 self.sample_list += [(series_uid, slice_ndx) | ||||
|                                      for slice_ndx in range(index_count)] | ||||
|             else: | ||||
|                 self.sample_list += [(series_uid, slice_ndx) | ||||
|                                      for slice_ndx in positive_indexes] | ||||
| 
 | ||||
|         self.candidateInfo_list = getCandidateInfoList() | ||||
| 
 | ||||
|         series_set = set(self.series_list) | ||||
|         self.candidateInfo_list = [cit for cit in self.candidateInfo_list | ||||
|                                    if cit.series_uid in series_set] | ||||
| 
 | ||||
|         self.pos_list = [nt for nt in self.candidateInfo_list | ||||
|                             if nt.isNodule_bool] | ||||
| 
 | ||||
|         log.info("{!r}: {} {} series, {} slices, {} nodules".format( | ||||
|             self, | ||||
|             len(self.series_list), | ||||
|             {None: 'general', True: 'validation', False: 'training'}[isValSet_bool], | ||||
|             len(self.sample_list), | ||||
|             len(self.pos_list), | ||||
|         )) | ||||
| 
 | ||||
|     def __len__(self): | ||||
|         return len(self.sample_list) | ||||
| 
 | ||||
|     def __getitem__(self, ndx): | ||||
|         series_uid, slice_ndx = self.sample_list[ndx % len(self.sample_list)] | ||||
|         return self.getitem_fullSlice(series_uid, slice_ndx) | ||||
| 
 | ||||
|     def getitem_fullSlice(self, series_uid, slice_ndx): | ||||
|         ct = getCt(series_uid) | ||||
|         ct_t = torch.zeros((self.contextSlices_count * 2 + 1, 512, 512)) | ||||
| 
 | ||||
|         start_ndx = slice_ndx - self.contextSlices_count | ||||
|         end_ndx = slice_ndx + self.contextSlices_count + 1 | ||||
|         for i, context_ndx in enumerate(range(start_ndx, end_ndx)): | ||||
|             context_ndx = max(context_ndx, 0) | ||||
|             context_ndx = min(context_ndx, ct.hu_a.shape[0] - 1) | ||||
|             ct_t[i] = torch.from_numpy(ct.hu_a[context_ndx].astype(np.float32)) | ||||
| 
 | ||||
|         # CTs are natively expressed in https://en.wikipedia.org/wiki/Hounsfield_scale | ||||
|         # HU are scaled oddly, with 0 g/cc (air, approximately) being -1000 and 1 g/cc (water) being 0. | ||||
|         # The lower bound gets rid of negative density stuff used to indicate out-of-FOV | ||||
|         # The upper bound nukes any weird hotspots and clamps bone down | ||||
|         ct_t.clamp_(-1000, 1000) | ||||
| 
 | ||||
|         pos_t = torch.from_numpy(ct.positive_mask[slice_ndx]).unsqueeze(0) | ||||
| 
 | ||||
|         return ct_t, pos_t, ct.series_uid, slice_ndx | ||||
| 
 | ||||
| 
 | ||||
| class TrainingLuna2dSegmentationDataset(Luna2dSegmentationDataset): | ||||
|     def __init__(self, *args, **kwargs): | ||||
|         super().__init__(*args, **kwargs) | ||||
| 
 | ||||
|         self.ratio_int = 2 | ||||
| 
 | ||||
|     def __len__(self): | ||||
|         return 300000 | ||||
| 
 | ||||
|     def shuffleSamples(self): | ||||
|         random.shuffle(self.candidateInfo_list) | ||||
|         random.shuffle(self.pos_list) | ||||
| 
 | ||||
|     def __getitem__(self, ndx): | ||||
|         candidateInfo_tup = self.pos_list[ndx % len(self.pos_list)] | ||||
|         return self.getitem_trainingCrop(candidateInfo_tup) | ||||
| 
 | ||||
|     def getitem_trainingCrop(self, candidateInfo_tup): | ||||
|         ct_a, pos_a, center_irc = getCtRawCandidate( | ||||
|             candidateInfo_tup.series_uid, | ||||
|             candidateInfo_tup.center_xyz, | ||||
|             (7, 96, 96), | ||||
|         ) | ||||
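        # keep only the centre slice of the 7-slice mask; the model predicts a single-slice mask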
|         pos_a = pos_a[3:4] | ||||
| 
 | ||||
|         row_offset = random.randrange(0,32) | ||||
|         col_offset = random.randrange(0,32) | ||||
|         ct_t = torch.from_numpy(ct_a[:, row_offset:row_offset+64, | ||||
|                                      col_offset:col_offset+64]).to(torch.float32) | ||||
|         pos_t = torch.from_numpy(pos_a[:, row_offset:row_offset+64, | ||||
|                                        col_offset:col_offset+64]).to(torch.long) | ||||
| 
 | ||||
|         slice_ndx = center_irc.index | ||||
| 
 | ||||
|         return ct_t, pos_t, candidateInfo_tup.series_uid, slice_ndx | ||||
| 
 | ||||
| class PrepcacheLunaDataset(Dataset): | ||||
|     def __init__(self, *args, **kwargs): | ||||
|         super().__init__(*args, **kwargs) | ||||
| 
 | ||||
|         self.candidateInfo_list = getCandidateInfoList() | ||||
|         self.pos_list = [nt for nt in self.candidateInfo_list if nt.isNodule_bool] | ||||
| 
 | ||||
|         self.seen_set = set() | ||||
|         self.candidateInfo_list.sort(key=lambda x: x.series_uid) | ||||
| 
 | ||||
|     def __len__(self): | ||||
|         return len(self.candidateInfo_list) | ||||
| 
 | ||||
|     def __getitem__(self, ndx): | ||||
|         # candidate_t, pos_t, series_uid, center_t = super().__getitem__(ndx) | ||||
| 
 | ||||
|         candidateInfo_tup = self.candidateInfo_list[ndx] | ||||
|         getCtRawCandidate(candidateInfo_tup.series_uid, candidateInfo_tup.center_xyz, (7, 96, 96)) | ||||
| 
 | ||||
|         series_uid = candidateInfo_tup.series_uid | ||||
|         if series_uid not in self.seen_set: | ||||
|             self.seen_set.add(series_uid) | ||||
| 
 | ||||
|             getCtSampleSize(series_uid) | ||||
|             # ct = getCt(series_uid) | ||||
|             # for mask_ndx in ct.positive_indexes: | ||||
|             #     build2dLungMask(series_uid, mask_ndx) | ||||
| 
 | ||||
|         return 0, 1 #candidate_t, pos_t, series_uid, center_t | ||||
| 
 | ||||
| 
 | ||||
| class TvTrainingLuna2dSegmentationDataset(torch.utils.data.Dataset): | ||||
|     def __init__(self, isValSet_bool=False, val_stride=10, contextSlices_count=3): | ||||
|         assert contextSlices_count == 3 | ||||
|         data = torch.load('./imgs_and_masks.pt') | ||||
|         suids = list(set(data['suids'])) | ||||
|         trn_mask_suids = torch.arange(len(suids)) % val_stride < (val_stride - 1) | ||||
|         trn_suids = {s for i, s in zip(trn_mask_suids, suids) if i} | ||||
|         trn_mask = torch.tensor([(s in trn_suids) for s in data["suids"]]) | ||||
|         if not isValSet_bool: | ||||
|             self.imgs = data["imgs"][trn_mask] | ||||
|             self.masks = data["masks"][trn_mask] | ||||
|             self.suids = [s for s, i in zip(data["suids"], trn_mask) if i] | ||||
|         else: | ||||
|             self.imgs = data["imgs"][~trn_mask] | ||||
|             self.masks = data["masks"][~trn_mask] | ||||
|             self.suids = [s for s, i in zip(data["suids"], trn_mask) if not i] | ||||
|         # discard spurious hotspots and clamp bone | ||||
|         self.imgs.clamp_(-1000, 1000) | ||||
|         self.imgs /= 1000 | ||||
| 
 | ||||
| 
 | ||||
|     def __len__(self): | ||||
|         return len(self.imgs) | ||||
| 
 | ||||
|     def __getitem__(self, i): | ||||
|         oh, ow = torch.randint(0, 32, (2,)) | ||||
|         sl = self.masks.size(1)//2 | ||||
|         return self.imgs[i, :, oh: oh + 64, ow: ow + 64], 1, self.masks[i, sl: sl+1, oh: oh + 64, ow: ow + 64].to(torch.float32), self.suids[i], 9999 | ||||
| @ -0,0 +1,224 @@ | ||||
| import math | ||||
| import random | ||||
| from collections import namedtuple | ||||
| 
 | ||||
| import torch | ||||
| from torch import nn as nn | ||||
| import torch.nn.functional as F | ||||
| 
 | ||||
| from util.logconf import logging | ||||
| from util.unet import UNet | ||||
| 
 | ||||
| log = logging.getLogger(__name__) | ||||
| # log.setLevel(logging.WARN) | ||||
| # log.setLevel(logging.INFO) | ||||
| log.setLevel(logging.DEBUG) | ||||
| 
 | ||||
| class UNetWrapper(nn.Module): | ||||
|     def __init__(self, **kwargs): | ||||
|         super().__init__() | ||||
| 
 | ||||
|         self.input_batchnorm = nn.BatchNorm2d(kwargs['in_channels']) | ||||
|         self.unet = UNet(**kwargs) | ||||
|         self.final = nn.Sigmoid() | ||||
| 
 | ||||
|         self._init_weights() | ||||
| 
 | ||||
|     def _init_weights(self): | ||||
|         init_set = { | ||||
|             nn.Conv2d, | ||||
|             nn.Conv3d, | ||||
|             nn.ConvTranspose2d, | ||||
|             nn.ConvTranspose3d, | ||||
|             nn.Linear, | ||||
|         } | ||||
|         for m in self.modules(): | ||||
|             if type(m) in init_set: | ||||
|                 nn.init.kaiming_normal_( | ||||
|                     m.weight.data, mode='fan_out', nonlinearity='relu', a=0 | ||||
|                 ) | ||||
|                 if m.bias is not None: | ||||
|                     fan_in, fan_out = \ | ||||
|                         nn.init._calculate_fan_in_and_fan_out(m.weight.data) | ||||
|                     bound = 1 / math.sqrt(fan_out) | ||||
|                     nn.init.normal_(m.bias, -bound, bound) | ||||
| 
 | ||||
|         # nn.init.constant_(self.unet.last.bias, -4) | ||||
|         # nn.init.constant_(self.unet.last.bias, 4) | ||||
| 
 | ||||
| 
 | ||||
|     def forward(self, input_batch): | ||||
|         bn_output = self.input_batchnorm(input_batch) | ||||
|         un_output = self.unet(bn_output) | ||||
|         fn_output = self.final(un_output) | ||||
|         return fn_output | ||||
| 
 | ||||
| class SegmentationAugmentation(nn.Module): | ||||
|     def __init__( | ||||
|             self, flip=None, offset=None, scale=None, rotate=None, noise=None | ||||
|     ): | ||||
|         super().__init__() | ||||
| 
 | ||||
|         self.flip = flip | ||||
|         self.offset = offset | ||||
|         self.scale = scale | ||||
|         self.rotate = rotate | ||||
|         self.noise = noise | ||||
| 
 | ||||
|     def forward(self, input_g, label_g): | ||||
|         transform_t = self._build2dTransformMatrix() | ||||
|         transform_t = transform_t.expand(input_g.shape[0], -1, -1) | ||||
|         transform_t = transform_t.to(input_g.device, torch.float32) | ||||
|         affine_t = F.affine_grid(transform_t[:,:2], | ||||
|                 input_g.size(), align_corners=False) | ||||
| 
 | ||||
|         augmented_input_g = F.grid_sample(input_g, | ||||
|                 affine_t, padding_mode='border', | ||||
|                 align_corners=False) | ||||
|         augmented_label_g = F.grid_sample(label_g.to(torch.float32), | ||||
|                 affine_t, padding_mode='border', | ||||
|                 align_corners=False) | ||||
| 
 | ||||
|         if self.noise: | ||||
|             noise_t = torch.randn_like(augmented_input_g) | ||||
|             noise_t *= self.noise | ||||
| 
 | ||||
|             augmented_input_g += noise_t | ||||
| 
 | ||||
|         return augmented_input_g, augmented_label_g > 0.5 | ||||
| 
 | ||||
|     def _build2dTransformMatrix(self): | ||||
|         transform_t = torch.eye(3) | ||||
| 
 | ||||
|         for i in range(2): | ||||
|             if self.flip: | ||||
|                 if random.random() > 0.5: | ||||
|                     transform_t[i,i] *= -1 | ||||
| 
 | ||||
|             if self.offset: | ||||
|                 offset_float = self.offset | ||||
|                 random_float = (random.random() * 2 - 1) | ||||
|                 transform_t[2,i] = offset_float * random_float | ||||
| 
 | ||||
|             if self.scale: | ||||
|                 scale_float = self.scale | ||||
|                 random_float = (random.random() * 2 - 1) | ||||
|                 transform_t[i,i] *= 1.0 + scale_float * random_float | ||||
| 
 | ||||
|         if self.rotate: | ||||
|             angle_rad = random.random() * math.pi * 2 | ||||
|             s = math.sin(angle_rad) | ||||
|             c = math.cos(angle_rad) | ||||
| 
 | ||||
|             rotation_t = torch.tensor([ | ||||
|                 [c, -s, 0], | ||||
|                 [s, c, 0], | ||||
|                 [0, 0, 1]]) | ||||
| 
 | ||||
|             transform_t @= rotation_t | ||||
| 
 | ||||
|         return transform_t | ||||
| 
 | ||||
| 
 | ||||
| # MaskTuple = namedtuple('MaskTuple', 'raw_dense_mask, dense_mask, body_mask, air_mask, raw_candidate_mask, candidate_mask, lung_mask, neg_mask, pos_mask') | ||||
| # | ||||
| # class SegmentationMask(nn.Module): | ||||
| #     def __init__(self): | ||||
| #         super().__init__() | ||||
| # | ||||
| #         self.conv_list = nn.ModuleList([ | ||||
| #             self._make_circle_conv(radius) for radius in range(1, 8) | ||||
| #         ]) | ||||
| # | ||||
| #     def _make_circle_conv(self, radius): | ||||
| #         diameter = 1 + radius * 2 | ||||
| # | ||||
| #         a = torch.linspace(-1, 1, steps=diameter)**2 | ||||
| #         b = (a[None] + a[:, None])**0.5 | ||||
| # | ||||
| #         circle_weights = (b <= 1.0).to(torch.float32) | ||||
| # | ||||
| #         conv = nn.Conv2d(1, 1, kernel_size=diameter, padding=radius, bias=False) | ||||
| #         conv.weight.data.fill_(1) | ||||
| #         conv.weight.data *= circle_weights / circle_weights.sum() | ||||
| # | ||||
| #         return conv | ||||
| # | ||||
| # | ||||
| #     def erode(self, input_mask, radius, threshold=1): | ||||
| #         conv = self.conv_list[radius - 1] | ||||
| #         input_float = input_mask.to(torch.float32) | ||||
| #         result = conv(input_float) | ||||
| # | ||||
| #         # log.debug(['erode in ', radius, threshold, input_float.min().item(), input_float.mean().item(), input_float.max().item()]) | ||||
| #         # log.debug(['erode out', radius, threshold, result.min().item(), result.mean().item(), result.max().item()]) | ||||
| # | ||||
| #         return result >= threshold | ||||
| # | ||||
| #     def deposit(self, input_mask, radius, threshold=0): | ||||
| #         conv = self.conv_list[radius - 1] | ||||
| #         input_float = input_mask.to(torch.float32) | ||||
| #         result = conv(input_float) | ||||
| # | ||||
| #         # log.debug(['deposit in ', radius, threshold, input_float.min().item(), input_float.mean().item(), input_float.max().item()]) | ||||
| #         # log.debug(['deposit out', radius, threshold, result.min().item(), result.mean().item(), result.max().item()]) | ||||
| # | ||||
| #         return result > threshold | ||||
| # | ||||
| #     def fill_cavity(self, input_mask): | ||||
| #         cumsum = input_mask.cumsum(-1) | ||||
| #         filled_mask = (cumsum > 0) | ||||
| #         filled_mask &= (cumsum < cumsum[..., -1:]) | ||||
| #         cumsum = input_mask.cumsum(-2) | ||||
| #         filled_mask &= (cumsum > 0) | ||||
| #         filled_mask &= (cumsum < cumsum[..., -1:, :]) | ||||
| # | ||||
| #         return filled_mask | ||||
| # | ||||
| # | ||||
| #     def forward(self, input_g, raw_pos_g): | ||||
| #         gcc_g = input_g + 1 | ||||
| # | ||||
| #         with torch.no_grad(): | ||||
| #             # log.info(['gcc_g', gcc_g.min(), gcc_g.mean(), gcc_g.max()]) | ||||
| # | ||||
| #             raw_dense_mask = gcc_g > 0.7 | ||||
| #             dense_mask = self.deposit(raw_dense_mask, 2) | ||||
| #             dense_mask = self.erode(dense_mask, 6) | ||||
| #             dense_mask = self.deposit(dense_mask, 4) | ||||
| # | ||||
| #             body_mask = self.fill_cavity(dense_mask) | ||||
| #             air_mask = self.deposit(body_mask & ~dense_mask, 5) | ||||
| #             air_mask = self.erode(air_mask, 6) | ||||
| # | ||||
| #             lung_mask = self.deposit(air_mask, 5) | ||||
| # | ||||
| #             raw_candidate_mask = gcc_g > 0.4 | ||||
| #             raw_candidate_mask &= air_mask | ||||
| #             candidate_mask = self.erode(raw_candidate_mask, 1) | ||||
| #             candidate_mask = self.deposit(candidate_mask, 1) | ||||
| # | ||||
| #             pos_mask = self.deposit((raw_pos_g > 0.5) & lung_mask, 2) | ||||
| # | ||||
| #             neg_mask = self.deposit(candidate_mask, 1) | ||||
| #             neg_mask &= ~pos_mask | ||||
| #             neg_mask &= lung_mask | ||||
| # | ||||
| #             # label_g = (neg_mask | pos_mask).to(torch.float32) | ||||
| #             label_g = (pos_mask).to(torch.float32) | ||||
| #             neg_g = neg_mask.to(torch.float32) | ||||
| #             pos_g = pos_mask.to(torch.float32) | ||||
| # | ||||
| #         mask_dict = { | ||||
| #             'raw_dense_mask': raw_dense_mask, | ||||
| #             'dense_mask': dense_mask, | ||||
| #             'body_mask': body_mask, | ||||
| #             'air_mask': air_mask, | ||||
| #             'raw_candidate_mask': raw_candidate_mask, | ||||
| #             'candidate_mask': candidate_mask, | ||||
| #             'lung_mask': lung_mask, | ||||
| #             'neg_mask': neg_mask, | ||||
| #             'pos_mask': pos_mask, | ||||
| #         } | ||||
| # | ||||
| #         return label_g, neg_g, pos_g, lung_mask, mask_dict | ||||
| @ -0,0 +1,69 @@ | ||||
| import timing | ||||
| import argparse | ||||
| import sys | ||||
| 
 | ||||
| import numpy as np | ||||
| 
 | ||||
| import torch.nn as nn | ||||
| from torch.autograd import Variable | ||||
| from torch.optim import SGD | ||||
| from torch.utils.data import DataLoader | ||||
| 
 | ||||
| from util.util import enumerateWithEstimate | ||||
| from .dsets import PrepcacheLunaDataset, getCtSampleSize | ||||
| from util.logconf import logging | ||||
| # from .model import LunaModel | ||||
| 
 | ||||
| log = logging.getLogger(__name__) | ||||
| # log.setLevel(logging.WARN) | ||||
| log.setLevel(logging.INFO) | ||||
| # log.setLevel(logging.DEBUG) | ||||
| 
 | ||||
| 
 | ||||
| class LunaPrepCacheApp: | ||||
    def __init__(self, sys_argv=None):
|         if sys_argv is None: | ||||
|             sys_argv = sys.argv[1:] | ||||
| 
 | ||||
|         parser = argparse.ArgumentParser() | ||||
|         parser.add_argument('--batch-size', | ||||
|             help='Batch size to use for training', | ||||
|             default=1024, | ||||
|             type=int, | ||||
|         ) | ||||
|         parser.add_argument('--num-workers', | ||||
|             help='Number of worker processes for background data loading', | ||||
|             default=8, | ||||
|             type=int, | ||||
|         ) | ||||
|         # parser.add_argument('--scaled', | ||||
|         #     help="Scale the CT chunks to square voxels.", | ||||
|         #     default=False, | ||||
|         #     action='store_true', | ||||
|         # ) | ||||
| 
 | ||||
|         self.cli_args = parser.parse_args(sys_argv) | ||||
| 
 | ||||
|     def main(self): | ||||
|         log.info("Starting {}, {}".format(type(self).__name__, self.cli_args)) | ||||
| 
 | ||||
|         self.prep_dl = DataLoader( | ||||
|             PrepcacheLunaDataset( | ||||
|                 # sortby_str='series_uid', | ||||
|             ), | ||||
|             batch_size=self.cli_args.batch_size, | ||||
|             num_workers=self.cli_args.num_workers, | ||||
|         ) | ||||
| 
 | ||||
|         batch_iter = enumerateWithEstimate( | ||||
|             self.prep_dl, | ||||
|             "Stuffing cache", | ||||
|             start_ndx=self.prep_dl.num_workers, | ||||
|         ) | ||||
|         for batch_ndx, batch_tup in batch_iter: | ||||
|             pass | ||||
| 
 | ||||
| 
 | ||||
| if __name__ == '__main__': | ||||
|     LunaPrepCacheApp().main() | ||||
							| @ -0,0 +1,331 @@ | ||||
| import math | ||||
| import random | ||||
| import warnings | ||||
| 
 | ||||
| import numpy as np | ||||
| import scipy.ndimage | ||||
| 
 | ||||
| import torch | ||||
| from torch.autograd import Function | ||||
| from torch.autograd.function import once_differentiable | ||||
| import torch.backends.cudnn as cudnn | ||||
| 
 | ||||
| from util.logconf import logging | ||||
| log = logging.getLogger(__name__) | ||||
| # log.setLevel(logging.WARN) | ||||
| # log.setLevel(logging.INFO) | ||||
| log.setLevel(logging.DEBUG) | ||||
| 
 | ||||
| def cropToShape(image, new_shape, center_list=None, fill=0.0): | ||||
|     # log.debug([image.shape, new_shape, center_list]) | ||||
|     # assert len(image.shape) == 3, repr(image.shape) | ||||
| 
 | ||||
|     if center_list is None: | ||||
|         center_list = [int(image.shape[i] / 2) for i in range(3)] | ||||
| 
 | ||||
|     crop_list = [] | ||||
|     for i in range(0, 3): | ||||
|         crop_int = center_list[i] | ||||
|         if image.shape[i] > new_shape[i] and crop_int is not None: | ||||
| 
 | ||||
|             # We can't just do crop_int +/- shape/2 since shape might be odd | ||||
|             # and ints round down. | ||||
|             start_int = crop_int - int(new_shape[i]/2) | ||||
|             end_int = start_int + new_shape[i] | ||||
|             crop_list.append(slice(max(0, start_int), end_int)) | ||||
|         else: | ||||
|             crop_list.append(slice(0, image.shape[i])) | ||||
| 
 | ||||
|     # log.debug([image.shape, crop_list]) | ||||
    image = image[tuple(crop_list)]  # NumPy needs a tuple (not a list) of slices here
| 
 | ||||
|     crop_list = [] | ||||
|     for i in range(0, 3): | ||||
|         if image.shape[i] < new_shape[i]: | ||||
|             crop_int = int((new_shape[i] - image.shape[i]) / 2) | ||||
|             crop_list.append(slice(crop_int, crop_int + image.shape[i])) | ||||
|         else: | ||||
|             crop_list.append(slice(0, image.shape[i])) | ||||
| 
 | ||||
|     # log.debug([image.shape, crop_list]) | ||||
|     new_image = np.zeros(new_shape, dtype=image.dtype) | ||||
|     new_image[:] = fill | ||||
    new_image[tuple(crop_list)] = image
| 
 | ||||
|     return new_image | ||||
| 
 | ||||
| 
 | ||||
| def zoomToShape(image, new_shape, square=True): | ||||
|     # assert image.shape[-1] in {1, 3, 4}, repr(image.shape) | ||||
| 
 | ||||
|     if square and image.shape[0] != image.shape[1]: | ||||
|         crop_int = min(image.shape[0], image.shape[1]) | ||||
|         new_shape = [crop_int, crop_int, image.shape[2]] | ||||
|         image = cropToShape(image, new_shape) | ||||
| 
 | ||||
|     zoom_shape = [new_shape[i] / image.shape[i] for i in range(3)] | ||||
| 
 | ||||
|     with warnings.catch_warnings(): | ||||
|         warnings.simplefilter("ignore") | ||||
|         image = scipy.ndimage.interpolation.zoom( | ||||
|             image, zoom_shape, | ||||
|             output=None, order=0, mode='nearest', cval=0.0, prefilter=True) | ||||
| 
 | ||||
|     return image | ||||
| 
 | ||||
| def randomOffset(image_list, offset_rows=0.125, offset_cols=0.125): | ||||
| 
 | ||||
|     center_list = [int(image_list[0].shape[i] / 2) for i in range(3)] | ||||
|     center_list[0] += int(offset_rows * (random.random() - 0.5) * 2) | ||||
|     center_list[1] += int(offset_cols * (random.random() - 0.5) * 2) | ||||
|     center_list[2] = None | ||||
| 
 | ||||
|     new_list = [] | ||||
|     for image in image_list: | ||||
|         new_image = cropToShape(image, image.shape, center_list) | ||||
|         new_list.append(new_image) | ||||
| 
 | ||||
|     return new_list | ||||
| 
 | ||||
| 
 | ||||
| def randomZoom(image_list, scale=None, scale_min=0.8, scale_max=1.3): | ||||
|     if scale is None: | ||||
|         scale = scale_min + (scale_max - scale_min) * random.random() | ||||
| 
 | ||||
|     new_list = [] | ||||
|     for image in image_list: | ||||
|         # assert image.shape[-1] in {1, 3, 4}, repr(image.shape) | ||||
| 
 | ||||
|         with warnings.catch_warnings(): | ||||
|             warnings.simplefilter("ignore") | ||||
|             # log.info([image.shape]) | ||||
|             zimage = scipy.ndimage.interpolation.zoom( | ||||
|                 image, [scale, scale, 1.0], | ||||
|                 output=None, order=0, mode='nearest', cval=0.0, prefilter=True) | ||||
|         image = cropToShape(zimage, image.shape) | ||||
| 
 | ||||
|         new_list.append(image) | ||||
| 
 | ||||
|     return new_list | ||||
| 
 | ||||
| 
 | ||||
| _randomFlip_transform_list = [ | ||||
|     # lambda a: np.rot90(a, axes=(0, 1)), | ||||
|     # lambda a: np.flip(a, 0), | ||||
|     lambda a: np.flip(a, 1), | ||||
| ] | ||||
| 
 | ||||
| def randomFlip(image_list, transform_bits=None): | ||||
|     if transform_bits is None: | ||||
|         transform_bits = random.randrange(0, 2 ** len(_randomFlip_transform_list)) | ||||
| 
 | ||||
|     new_list = [] | ||||
|     for image in image_list: | ||||
|         # assert image.shape[-1] in {1, 3, 4}, repr(image.shape) | ||||
| 
 | ||||
|         for n in range(len(_randomFlip_transform_list)): | ||||
|             if transform_bits & 2**n: | ||||
|                 # prhist(image, 'before') | ||||
|                 image = _randomFlip_transform_list[n](image) | ||||
|                 # prhist(image, 'after ') | ||||
| 
 | ||||
|         new_list.append(image) | ||||
| 
 | ||||
|     return new_list | ||||
| 
 | ||||
| 
 | ||||
| def randomSpin(image_list, angle=None, range_tup=None, axes=(0, 1)): | ||||
|     if range_tup is None: | ||||
|         range_tup = (0, 360) | ||||
| 
 | ||||
|     if angle is None: | ||||
|         angle = range_tup[0] + (range_tup[1] - range_tup[0]) * random.random() | ||||
| 
 | ||||
|     new_list = [] | ||||
|     for image in image_list: | ||||
|         # assert image.shape[-1] in {1, 3, 4}, repr(image.shape) | ||||
| 
 | ||||
|         image = scipy.ndimage.interpolation.rotate( | ||||
|                 image, angle, axes=axes, reshape=False, | ||||
|                 output=None, order=0, mode='nearest', cval=0.0, prefilter=True) | ||||
| 
 | ||||
|         new_list.append(image) | ||||
| 
 | ||||
|     return new_list | ||||
| 
 | ||||
| 
 | ||||
| def randomNoise(image_list, noise_min=-0.1, noise_max=0.1): | ||||
|     noise = np.zeros_like(image_list[0]) | ||||
|     noise += (noise_max - noise_min) * np.random.random_sample(image_list[0].shape) + noise_min | ||||
|     noise *= 5 | ||||
|     noise = scipy.ndimage.filters.gaussian_filter(noise, 3) | ||||
|     # noise += (noise_max - noise_min) * np.random.random_sample(image_hsv.shape) + noise_min | ||||
| 
 | ||||
|     new_list = [] | ||||
|     for image_hsv in image_list: | ||||
|         image_hsv = image_hsv + noise | ||||
| 
 | ||||
|         new_list.append(image_hsv) | ||||
| 
 | ||||
|     return new_list | ||||
| 
 | ||||
| 
 | ||||
| def randomHsvShift(image_list, h=None, s=None, v=None, | ||||
|                    h_min=-0.1, h_max=0.1, | ||||
|                    s_min=0.5, s_max=2.0, | ||||
|                    v_min=0.5, v_max=2.0): | ||||
|     if h is None: | ||||
|         h = h_min + (h_max - h_min) * random.random() | ||||
|     if s is None: | ||||
|         s = s_min + (s_max - s_min) * random.random() | ||||
|     if v is None: | ||||
|         v = v_min + (v_max - v_min) * random.random() | ||||
| 
 | ||||
|     new_list = [] | ||||
|     for image_hsv in image_list: | ||||
|         # assert image_hsv.shape[-1] == 3, repr(image_hsv.shape) | ||||
| 
 | ||||
|         image_hsv[:,:,0::3] += h | ||||
|         image_hsv[:,:,1::3] = image_hsv[:,:,1::3] ** s | ||||
|         image_hsv[:,:,2::3] = image_hsv[:,:,2::3] ** v | ||||
| 
 | ||||
|         new_list.append(image_hsv) | ||||
| 
 | ||||
|     return clampHsv(new_list) | ||||
| 
 | ||||
| 
 | ||||
| def clampHsv(image_list): | ||||
|     new_list = [] | ||||
|     for image_hsv in image_list: | ||||
|         image_hsv = image_hsv.clone() | ||||
| 
 | ||||
|         # Hue wraps around | ||||
|         image_hsv[:,:,0][image_hsv[:,:,0] > 1] -= 1 | ||||
|         image_hsv[:,:,0][image_hsv[:,:,0] < 0] += 1 | ||||
| 
 | ||||
|         # Everything else clamps between 0 and 1 | ||||
|         image_hsv[image_hsv > 1] = 1 | ||||
|         image_hsv[image_hsv < 0] = 0 | ||||
| 
 | ||||
|         new_list.append(image_hsv) | ||||
| 
 | ||||
|     return new_list | ||||
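
The helpers above all take a list of images and return a new list of the same length, so per-sample transforms can be chained. A minimal usage sketch follows (an editorial illustration, not part of the committed file); it assumes the images are float NumPy arrays of shape (rows, cols, channels) and that the module's own imports (numpy, scipy, random, warnings) are available.

import numpy as np

# Hypothetical single-image batch; shape and dtype are assumptions.
sample_list = [np.random.rand(64, 64, 3).astype(np.float32)]

sample_list = randomFlip(sample_list)    # mirror left/right with 50% probability
sample_list = randomSpin(sample_list)    # rotate by a random angle in [0, 360)
sample_list = randomNoise(sample_list)   # add smoothed random noise
print(sample_list[0].shape)              # still (64, 64, 3)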
| 
 | ||||
| 
 | ||||
| # def torch_augment(input): | ||||
| #     theta = random.random() * math.pi * 2 | ||||
| #     s = math.sin(theta) | ||||
| #     c = math.cos(theta) | ||||
| #     c1 = 1 - c | ||||
| #     axis_vector = torch.rand(3, device='cpu', dtype=torch.float64) | ||||
| #     axis_vector -= 0.5 | ||||
| #     axis_vector /= axis_vector.abs().sum() | ||||
| #     l, m, n = axis_vector | ||||
| # | ||||
| #     matrix = torch.tensor([ | ||||
| #         [l*l*c1 +   c, m*l*c1 - n*s, n*l*c1 + m*s, 0], | ||||
| #         [l*m*c1 + n*s, m*m*c1 +   c, n*m*c1 - l*s, 0], | ||||
| #         [l*n*c1 - m*s, m*n*c1 + l*s, n*n*c1 +   c, 0], | ||||
| #         [0, 0, 0, 1], | ||||
| #     ], device=input.device, dtype=torch.float32) | ||||
| # | ||||
| #     return th_affine3d(input, matrix) | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| # following from https://github.com/ncullen93/torchsample/blob/master/torchsample/utils.py | ||||
| # MIT licensed | ||||
| 
 | ||||
| # def th_affine3d(input, matrix): | ||||
| #     """ | ||||
| #     3D Affine image transform on torch.Tensor | ||||
| #     """ | ||||
| #     A = matrix[:3,:3] | ||||
| #     b = matrix[:3,3] | ||||
| # | ||||
| #     # make a meshgrid of normal coordinates | ||||
| #     coords = th_iterproduct(input.size(-3), input.size(-2), input.size(-1), dtype=torch.float32) | ||||
| # | ||||
| #     # shift the coordinates so center is the origin | ||||
| #     coords[:,0] = coords[:,0] - (input.size(-3) / 2. - 0.5) | ||||
| #     coords[:,1] = coords[:,1] - (input.size(-2) / 2. - 0.5) | ||||
| #     coords[:,2] = coords[:,2] - (input.size(-1) / 2. - 0.5) | ||||
| # | ||||
| #     # apply the coordinate transformation | ||||
| #     new_coords = coords.mm(A.t().contiguous()) + b.expand_as(coords) | ||||
| # | ||||
| #     # shift the coordinates back so origin is origin | ||||
| #     new_coords[:,0] = new_coords[:,0] + (input.size(-3) / 2. - 0.5) | ||||
| #     new_coords[:,1] = new_coords[:,1] + (input.size(-2) / 2. - 0.5) | ||||
| #     new_coords[:,2] = new_coords[:,2] + (input.size(-1) / 2. - 0.5) | ||||
| # | ||||
| #     # map new coordinates using bilinear interpolation | ||||
| #     input_transformed = th_trilinear_interp3d(input, new_coords) | ||||
| # | ||||
| #     return input_transformed | ||||
| # | ||||
| # | ||||
| # def th_trilinear_interp3d(input, coords): | ||||
| #     """ | ||||
| #     trilinear interpolation of 3D torch.Tensor image | ||||
| #     """ | ||||
| #     # take clamp then floor/ceil of x coords | ||||
| #     x = torch.clamp(coords[:,0], 0, input.size(-3)-2) | ||||
| #     x0 = x.floor() | ||||
| #     x1 = x0 + 1 | ||||
| #     # take clamp then floor/ceil of y coords | ||||
| #     y = torch.clamp(coords[:,1], 0, input.size(-2)-2) | ||||
| #     y0 = y.floor() | ||||
| #     y1 = y0 + 1 | ||||
| #     # take clamp then floor/ceil of z coords | ||||
| #     z = torch.clamp(coords[:,2], 0, input.size(-1)-2) | ||||
| #     z0 = z.floor() | ||||
| #     z1 = z0 + 1 | ||||
| # | ||||
| #     stride = torch.tensor(input.stride()[-3:], dtype=torch.int64, device=input.device) | ||||
| #     x0_ix = x0.mul(stride[0]).long() | ||||
| #     x1_ix = x1.mul(stride[0]).long() | ||||
| #     y0_ix = y0.mul(stride[1]).long() | ||||
| #     y1_ix = y1.mul(stride[1]).long() | ||||
| #     z0_ix = z0.mul(stride[2]).long() | ||||
| #     z1_ix = z1.mul(stride[2]).long() | ||||
| # | ||||
| #     # input_flat = th_flatten(input) | ||||
| #     input_flat = x.contiguous().view(x[0], x[1], -1) | ||||
| # | ||||
| #     vals_000 = input_flat[:, :, x0_ix+y0_ix+z0_ix] | ||||
| #     vals_001 = input_flat[:, :, x0_ix+y0_ix+z1_ix] | ||||
| #     vals_010 = input_flat[:, :, x0_ix+y1_ix+z0_ix] | ||||
| #     vals_011 = input_flat[:, :, x0_ix+y1_ix+z1_ix] | ||||
| #     vals_100 = input_flat[:, :, x1_ix+y0_ix+z0_ix] | ||||
| #     vals_101 = input_flat[:, :, x1_ix+y0_ix+z1_ix] | ||||
| #     vals_110 = input_flat[:, :, x1_ix+y1_ix+z0_ix] | ||||
| #     vals_111 = input_flat[:, :, x1_ix+y1_ix+z1_ix] | ||||
| # | ||||
| #     xd = x - x0 | ||||
| #     yd = y - y0 | ||||
| #     zd = z - z0 | ||||
| #     xm1 = 1 - xd | ||||
| #     ym1 = 1 - yd | ||||
| #     zm1 = 1 - zd | ||||
| # | ||||
| #     x_mapped = ( | ||||
| #             vals_000.mul(xm1).mul(ym1).mul(zm1) + | ||||
| #             vals_001.mul(xm1).mul(ym1).mul(zd) + | ||||
| #             vals_010.mul(xm1).mul(yd).mul(zm1) + | ||||
| #             vals_011.mul(xm1).mul(yd).mul(zd) + | ||||
| #             vals_100.mul(xd).mul(ym1).mul(zm1) + | ||||
| #             vals_101.mul(xd).mul(ym1).mul(zd) + | ||||
| #             vals_110.mul(xd).mul(yd).mul(zm1) + | ||||
| #             vals_111.mul(xd).mul(yd).mul(zd) | ||||
| #     ) | ||||
| # | ||||
| #     return x_mapped.view_as(input) | ||||
| # | ||||
| # def th_iterproduct(*args, dtype=None): | ||||
| #     return torch.from_numpy(np.indices(args).reshape((len(args),-1)).T) | ||||
| # | ||||
| # def th_flatten(x): | ||||
| #     """Flatten tensor""" | ||||
| #     return x.contiguous().view(x[0], x[1], -1) | ||||
| @ -0,0 +1,136 @@ | ||||
| import gzip | ||||
| 
 | ||||
| from diskcache import FanoutCache, Disk | ||||
| from diskcache.core import BytesType, MODE_BINARY, BytesIO | ||||
| 
 | ||||
| from util.logconf import logging | ||||
| log = logging.getLogger(__name__) | ||||
| # log.setLevel(logging.WARN) | ||||
| log.setLevel(logging.INFO) | ||||
| # log.setLevel(logging.DEBUG) | ||||
| 
 | ||||
| 
 | ||||
| class GzipDisk(Disk): | ||||
|     def store(self, value, read, key=None): | ||||
|         """ | ||||
|         Override from base class diskcache.Disk. | ||||
| 
 | ||||
|         Chunking is needed to support Python versions < 2.7.13: | ||||
|         - Issue #27130: In the "zlib" module, fix handling of large buffers | ||||
|           (typically 2 or 4 GiB).  Previously, inputs were limited to 2 GiB, and | ||||
|           compression and decompression operations did not properly handle results of | ||||
|           2 or 4 GiB. | ||||
| 
 | ||||
|         :param value: value to convert | ||||
|         :param bool read: True when value is file-like object | ||||
|         :return: (size, mode, filename, value) tuple for Cache table | ||||
|         """ | ||||
|         # pylint: disable=unidiomatic-typecheck | ||||
|         if type(value) is BytesType: | ||||
|             if read: | ||||
|                 value = value.read() | ||||
|                 read = False | ||||
| 
 | ||||
|             str_io = BytesIO() | ||||
|             gz_file = gzip.GzipFile(mode='wb', compresslevel=1, fileobj=str_io) | ||||
| 
 | ||||
|             for offset in range(0, len(value), 2**30): | ||||
|                 gz_file.write(value[offset:offset+2**30]) | ||||
|             gz_file.close() | ||||
| 
 | ||||
|             value = str_io.getvalue() | ||||
| 
 | ||||
|         return super(GzipDisk, self).store(value, read) | ||||
| 
 | ||||
| 
 | ||||
|     def fetch(self, mode, filename, value, read): | ||||
|         """ | ||||
|         Override from base class diskcache.Disk. | ||||
| 
 | ||||
|         Chunking is needed to support Python versions < 2.7.13: | ||||
|         - Issue #27130: In the "zlib" module, fix handling of large buffers | ||||
|           (typically 2 or 4 GiB).  Previously, inputs were limited to 2 GiB, and | ||||
|           compression and decompression operations did not properly handle results of | ||||
|           2 or 4 GiB. | ||||
| 
 | ||||
|         :param int mode: value mode raw, binary, text, or pickle | ||||
|         :param str filename: filename of corresponding value | ||||
|         :param value: database value | ||||
|         :param bool read: when True, return an open file handle | ||||
|         :return: corresponding Python value | ||||
|         """ | ||||
|         value = super(GzipDisk, self).fetch(mode, filename, value, read) | ||||
| 
 | ||||
|         if mode == MODE_BINARY: | ||||
|             str_io = BytesIO(value) | ||||
|             gz_file = gzip.GzipFile(mode='rb', fileobj=str_io) | ||||
|             read_csio = BytesIO() | ||||
| 
 | ||||
|             while True: | ||||
|                 uncompressed_data = gz_file.read(2**30) | ||||
|                 if uncompressed_data: | ||||
|                     read_csio.write(uncompressed_data) | ||||
|                 else: | ||||
|                     break | ||||
| 
 | ||||
|             value = read_csio.getvalue() | ||||
| 
 | ||||
|         return value | ||||
| 
 | ||||
| def getCache(scope_str): | ||||
|     return FanoutCache('data-unversioned/cache/' + scope_str, | ||||
|                        disk=GzipDisk, | ||||
|                        shards=64, | ||||
|                        timeout=1, | ||||
|                        size_limit=3e11, | ||||
|                        # disk_min_file_size=2**20, | ||||
|                        ) | ||||
| 
 | ||||
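
getCache wires the gzip-compressing GzipDisk into a diskcache FanoutCache rooted at data-unversioned/cache/<scope>. A hedged usage sketch (editorial addition, assuming this file is saved as util/disk.py and the cache directory is writable): the cache is typically used through its memoize decorator, so an expensive loading step runs once and is afterwards served, compressed, from disk.

from util.disk import getCache   # assumed module path

raw_cache = getCache('demo_scope')

@raw_cache.memoize(typed=True)
def load_expensive(sample_id):
    # Stand-in for slow I/O or preprocessing; the returned bytes are gzip-cached.
    return ('payload-for-%s' % sample_id).encode('utf8')

first = load_expensive(42)    # computed, compressed and written to disk
second = load_expensive(42)   # served from the on-disk cache
assert first == second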
| # def disk_cache(base_path, memsize=2): | ||||
| #     def disk_cache_decorator(f): | ||||
| #         @functools.wraps(f) | ||||
| #         def wrapper(*args, **kwargs): | ||||
| #             args_str = repr(args) + repr(sorted(kwargs.items())) | ||||
| #             file_str = hashlib.md5(args_str.encode('utf8')).hexdigest() | ||||
| # | ||||
| #             cache_path = os.path.join(base_path, f.__name__, file_str + '.pkl.gz') | ||||
| # | ||||
| #             if not os.path.exists(os.path.dirname(cache_path)): | ||||
| #                 os.makedirs(os.path.dirname(cache_path), exist_ok=True) | ||||
| # | ||||
| #             if os.path.exists(cache_path): | ||||
| #                 return pickle_loadgz(cache_path) | ||||
| #             else: | ||||
| #                 ret = f(*args, **kwargs) | ||||
| #                 pickle_dumpgz(cache_path, ret) | ||||
| #                 return ret | ||||
| # | ||||
| #         return wrapper | ||||
| # | ||||
| #     return disk_cache_decorator | ||||
| # | ||||
| # | ||||
| # def pickle_dumpgz(file_path, obj): | ||||
| #     log.debug("Writing {}".format(file_path)) | ||||
| #     with open(file_path, 'wb') as file_obj: | ||||
| #         with gzip.GzipFile(mode='wb', compresslevel=1, fileobj=file_obj) as gz_file: | ||||
| #             pickle.dump(obj, gz_file, pickle.HIGHEST_PROTOCOL) | ||||
| # | ||||
| # | ||||
| # def pickle_loadgz(file_path): | ||||
| #     log.debug("Reading {}".format(file_path)) | ||||
| #     with open(file_path, 'rb') as file_obj: | ||||
| #         with gzip.GzipFile(mode='rb', fileobj=file_obj) as gz_file: | ||||
| #             return pickle.load(gz_file) | ||||
| # | ||||
| # | ||||
| # def dtpath(dt=None): | ||||
| #     if dt is None: | ||||
| #         dt = datetime.datetime.now() | ||||
| # | ||||
| #     return str(dt).rsplit('.', 1)[0].replace(' ', '--').replace(':', '.') | ||||
| # | ||||
| # | ||||
| # def safepath(s): | ||||
| #     s = s.replace(' ', '_') | ||||
| #     return re.sub('[^A-Za-z0-9_.-]', '', s) | ||||
| @ -0,0 +1,19 @@ | ||||
| import logging | ||||
| import logging.handlers | ||||
| 
 | ||||
| root_logger = logging.getLogger() | ||||
| root_logger.setLevel(logging.INFO) | ||||
| 
 | ||||
| # Some libraries attempt to add their own root logger handlers. This is | ||||
| # annoying and so we get rid of them. | ||||
| for handler in list(root_logger.handlers): | ||||
|     root_logger.removeHandler(handler) | ||||
| 
 | ||||
| logfmt_str = "%(asctime)s %(levelname)-8s pid:%(process)d %(name)s:%(lineno)03d:%(funcName)s %(message)s" | ||||
| formatter = logging.Formatter(logfmt_str) | ||||
| 
 | ||||
| streamHandler = logging.StreamHandler() | ||||
| streamHandler.setFormatter(formatter) | ||||
| streamHandler.setLevel(logging.DEBUG) | ||||
| 
 | ||||
| root_logger.addHandler(streamHandler) | ||||
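
The snippet above configures the root logger once; application modules then only need a named logger, as the disk-cache module above already does. A minimal sketch (editorial, assuming the file is saved as util/logconf.py):

from util.logconf import logging

log = logging.getLogger(__name__)
log.setLevel(logging.DEBUG)
log.info("timestamps, pid, module and function names come from the shared formatter")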
| @ -0,0 +1,143 @@ | ||||
| # From https://github.com/jvanvugt/pytorch-unet | ||||
| # https://raw.githubusercontent.com/jvanvugt/pytorch-unet/master/unet.py | ||||
| 
 | ||||
| # MIT License | ||||
| # | ||||
| # Copyright (c) 2018 Joris | ||||
| # | ||||
| # Permission is hereby granted, free of charge, to any person obtaining a copy | ||||
| # of this software and associated documentation files (the "Software"), to deal | ||||
| # in the Software without restriction, including without limitation the rights | ||||
| # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||||
| # copies of the Software, and to permit persons to whom the Software is | ||||
| # furnished to do so, subject to the following conditions: | ||||
| # | ||||
| # The above copyright notice and this permission notice shall be included in all | ||||
| # copies or substantial portions of the Software. | ||||
| # | ||||
| # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||||
| # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||||
| # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||||
| # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||||
| # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||||
| # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||||
| # SOFTWARE. | ||||
| 
 | ||||
| # Adapted from https://discuss.pytorch.org/t/unet-implementation/426 | ||||
| 
 | ||||
| import torch | ||||
| from torch import nn | ||||
| import torch.nn.functional as F | ||||
| 
 | ||||
| 
 | ||||
| class UNet(nn.Module): | ||||
|     def __init__(self, in_channels=1, n_classes=2, depth=5, wf=6, padding=False, | ||||
|                  batch_norm=False, up_mode='upconv'): | ||||
|         """ | ||||
|         Implementation of | ||||
|         U-Net: Convolutional Networks for Biomedical Image Segmentation | ||||
|         (Ronneberger et al., 2015) | ||||
|         https://arxiv.org/abs/1505.04597 | ||||
| 
 | ||||
|         Using the default arguments will yield the exact version used | ||||
|         in the original paper | ||||
| 
 | ||||
|         Args: | ||||
|             in_channels (int): number of input channels | ||||
|             n_classes (int): number of output channels | ||||
|             depth (int): depth of the network | ||||
|             wf (int): number of filters in the first layer is 2**wf | ||||
|             padding (bool): if True, apply padding such that the input shape | ||||
|                             is the same as the output. | ||||
|                             This may introduce artifacts | ||||
|             batch_norm (bool): Use BatchNorm after layers with an | ||||
|                                activation function | ||||
|             up_mode (str): one of 'upconv' or 'upsample'. | ||||
|                            'upconv' will use transposed convolutions for | ||||
|                            learned upsampling. | ||||
|                            'upsample' will use bilinear upsampling. | ||||
|         """ | ||||
|         super(UNet, self).__init__() | ||||
|         assert up_mode in ('upconv', 'upsample') | ||||
|         self.padding = padding | ||||
|         self.depth = depth | ||||
|         prev_channels = in_channels | ||||
|         self.down_path = nn.ModuleList() | ||||
|         for i in range(depth): | ||||
|             self.down_path.append(UNetConvBlock(prev_channels, 2**(wf+i), | ||||
|                                                 padding, batch_norm)) | ||||
|             prev_channels = 2**(wf+i) | ||||
| 
 | ||||
|         self.up_path = nn.ModuleList() | ||||
|         for i in reversed(range(depth - 1)): | ||||
|             self.up_path.append(UNetUpBlock(prev_channels, 2**(wf+i), up_mode, | ||||
|                                             padding, batch_norm)) | ||||
|             prev_channels = 2**(wf+i) | ||||
| 
 | ||||
|         self.last = nn.Conv2d(prev_channels, n_classes, kernel_size=1) | ||||
| 
 | ||||
|     def forward(self, x): | ||||
|         blocks = [] | ||||
|         for i, down in enumerate(self.down_path): | ||||
|             x = down(x) | ||||
|             if i != len(self.down_path)-1: | ||||
|                 blocks.append(x) | ||||
|                 x = F.avg_pool2d(x, 2) | ||||
| 
 | ||||
|         for i, up in enumerate(self.up_path): | ||||
|             x = up(x, blocks[-i-1]) | ||||
| 
 | ||||
|         return self.last(x) | ||||
| 
 | ||||
| 
 | ||||
| class UNetConvBlock(nn.Module): | ||||
|     def __init__(self, in_size, out_size, padding, batch_norm): | ||||
|         super(UNetConvBlock, self).__init__() | ||||
|         block = [] | ||||
| 
 | ||||
|         block.append(nn.Conv2d(in_size, out_size, kernel_size=3, | ||||
|                                padding=int(padding))) | ||||
|         block.append(nn.ReLU()) | ||||
|         # block.append(nn.LeakyReLU()) | ||||
|         if batch_norm: | ||||
|             block.append(nn.BatchNorm2d(out_size)) | ||||
| 
 | ||||
|         block.append(nn.Conv2d(out_size, out_size, kernel_size=3, | ||||
|                                padding=int(padding))) | ||||
|         block.append(nn.ReLU()) | ||||
|         # block.append(nn.LeakyReLU()) | ||||
|         if batch_norm: | ||||
|             block.append(nn.BatchNorm2d(out_size)) | ||||
| 
 | ||||
|         self.block = nn.Sequential(*block) | ||||
| 
 | ||||
|     def forward(self, x): | ||||
|         out = self.block(x) | ||||
|         return out | ||||
| 
 | ||||
| 
 | ||||
| class UNetUpBlock(nn.Module): | ||||
|     def __init__(self, in_size, out_size, up_mode, padding, batch_norm): | ||||
|         super(UNetUpBlock, self).__init__() | ||||
|         if up_mode == 'upconv': | ||||
|             self.up = nn.ConvTranspose2d(in_size, out_size, kernel_size=2, | ||||
|                                          stride=2) | ||||
|         elif up_mode == 'upsample': | ||||
|             self.up = nn.Sequential(nn.Upsample(mode='bilinear', scale_factor=2), | ||||
|                                     nn.Conv2d(in_size, out_size, kernel_size=1)) | ||||
| 
 | ||||
|         self.conv_block = UNetConvBlock(in_size, out_size, padding, batch_norm) | ||||
| 
 | ||||
|     def center_crop(self, layer, target_size): | ||||
|         _, _, layer_height, layer_width = layer.size() | ||||
|         diff_y = (layer_height - target_size[0]) // 2 | ||||
|         diff_x = (layer_width - target_size[1]) // 2 | ||||
|         return layer[:, :, diff_y:(diff_y + target_size[0]), diff_x:(diff_x + target_size[1])] | ||||
| 
 | ||||
|     def forward(self, x, bridge): | ||||
|         up = self.up(x) | ||||
|         crop1 = self.center_crop(bridge, up.shape[2:]) | ||||
|         out = torch.cat([up, crop1], 1) | ||||
|         out = self.conv_block(out) | ||||
| 
 | ||||
|         return out | ||||
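
A quick smoke test of the network (editorial sketch, not part of the vendored file): with padding=True the convolutions preserve spatial size, so the output matches the input resolution; the input side length only needs to be divisible by 2**(depth-1) so the pooling and upsampling paths line up.

import torch

net = UNet(in_channels=1, n_classes=2, depth=3, wf=4, padding=True, up_mode='upconv')
x = torch.randn(1, 1, 64, 64)   # batch of one single-channel 64x64 image
with torch.no_grad():
    y = net(x)
print(y.shape)                  # torch.Size([1, 2, 64, 64])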
							
								
								
									
pages/students/2016/lukas_pokryvka/dp2021/mnist/mnist-dist.py (new file, 105 lines)
							| @ -0,0 +1,105 @@ | ||||
| import os | ||||
| from datetime import datetime | ||||
| import argparse | ||||
| import torch.multiprocessing as mp | ||||
| import torchvision | ||||
| import torchvision.transforms as transforms | ||||
| import torch | ||||
| import torch.nn as nn | ||||
| import torch.distributed as dist | ||||
| from apex.parallel import DistributedDataParallel as DDP | ||||
| from apex import amp | ||||
| 
 | ||||
| 
 | ||||
| def main(): | ||||
|     parser = argparse.ArgumentParser() | ||||
|     parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N', | ||||
|                         help='number of nodes (default: 1)') | ||||
|     parser.add_argument('-g', '--gpus', default=1, type=int, | ||||
|                         help='number of gpus per node') | ||||
|     parser.add_argument('-nr', '--nr', default=0, type=int, | ||||
|                         help='ranking within the nodes') | ||||
|     parser.add_argument('--epochs', default=2, type=int, metavar='N', | ||||
|                         help='number of total epochs to run') | ||||
|     args = parser.parse_args() | ||||
|     args.world_size = args.gpus * args.nodes | ||||
|     os.environ['MASTER_ADDR'] = '147.232.47.114' | ||||
|     os.environ['MASTER_PORT'] = '8888' | ||||
|     mp.spawn(train, nprocs=args.gpus, args=(args,)) | ||||
| 
 | ||||
| 
 | ||||
| class ConvNet(nn.Module): | ||||
|     def __init__(self, num_classes=10): | ||||
|         super(ConvNet, self).__init__() | ||||
|         self.layer1 = nn.Sequential( | ||||
|             nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2), | ||||
|             nn.BatchNorm2d(16), | ||||
|             nn.ReLU(), | ||||
|             nn.MaxPool2d(kernel_size=2, stride=2)) | ||||
|         self.layer2 = nn.Sequential( | ||||
|             nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2), | ||||
|             nn.BatchNorm2d(32), | ||||
|             nn.ReLU(), | ||||
|             nn.MaxPool2d(kernel_size=2, stride=2)) | ||||
|         self.fc = nn.Linear(7*7*32, num_classes) | ||||
| 
 | ||||
|     def forward(self, x): | ||||
|         out = self.layer1(x) | ||||
|         out = self.layer2(out) | ||||
|         out = out.reshape(out.size(0), -1) | ||||
|         out = self.fc(out) | ||||
|         return out | ||||
| 
 | ||||
| 
 | ||||
| def train(gpu, args): | ||||
|     rank = args.nr * args.gpus + gpu | ||||
|     dist.init_process_group(backend='nccl', init_method='env://', world_size=args.world_size, rank=rank) | ||||
|     torch.manual_seed(0) | ||||
|     model = ConvNet() | ||||
|     torch.cuda.set_device(gpu) | ||||
|     model.cuda(gpu) | ||||
|     batch_size = 10 | ||||
|     # define loss function (criterion) and optimizer | ||||
|     criterion = nn.CrossEntropyLoss().cuda(gpu) | ||||
|     optimizer = torch.optim.SGD(model.parameters(), 1e-4) | ||||
|     # Wrap the model | ||||
|     model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu]) | ||||
|     # Data loading code | ||||
|     train_dataset = torchvision.datasets.MNIST(root='./data', | ||||
|                                                train=True, | ||||
|                                                transform=transforms.ToTensor(), | ||||
|                                                download=True) | ||||
|     train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset, | ||||
|                                                                     num_replicas=args.world_size, | ||||
|                                                                     rank=rank) | ||||
|     train_loader = torch.utils.data.DataLoader(dataset=train_dataset, | ||||
|                                                batch_size=batch_size, | ||||
|                                                shuffle=False, | ||||
|                                                num_workers=0, | ||||
|                                                pin_memory=True, | ||||
|                                                sampler=train_sampler) | ||||
| 
 | ||||
|     start = datetime.now() | ||||
|     total_step = len(train_loader) | ||||
|     for epoch in range(args.epochs): | ||||
|         for i, (images, labels) in enumerate(train_loader): | ||||
|             images = images.cuda(non_blocking=True) | ||||
|             labels = labels.cuda(non_blocking=True) | ||||
|             # Forward pass | ||||
|             outputs = model(images) | ||||
|             loss = criterion(outputs, labels) | ||||
| 
 | ||||
|             # Backward and optimize | ||||
|             optimizer.zero_grad() | ||||
|             loss.backward() | ||||
|             optimizer.step() | ||||
|             if (i + 1) % 100 == 0 and gpu == 0: | ||||
|                 print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step, | ||||
|                                                                          loss.item())) | ||||
|     if gpu == 0: | ||||
|         print("Training complete in: " + str(datetime.now() - start)) | ||||
| 
 | ||||
| 
 | ||||
| if __name__ == '__main__': | ||||
|     torch.multiprocessing.set_start_method('spawn') | ||||
|     main() | ||||
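
The script is launched once per node; mp.spawn then starts one worker per GPU, and each worker derives its global rank from the node index and the local GPU index before joining the NCCL process group at the hard-coded MASTER_ADDR/MASTER_PORT. A hedged illustration of the launch and of the rank arithmetic (editorial sketch; the file name and node count are assumptions):

# Run once per node, e.g. for 2 nodes with 4 GPUs each:
#   node 0: python mnist-dist.py -n 2 -g 4 -nr 0 --epochs 2
#   node 1: python mnist-dist.py -n 2 -g 4 -nr 1 --epochs 2
# Each spawned worker computes its global rank exactly as train() does:
def global_rank(node_rank, gpus_per_node, local_gpu):
    return node_rank * gpus_per_node + local_gpu

assert [global_rank(1, 4, g) for g in range(4)] == [4, 5, 6, 7]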
							
								
								
									
pages/students/2016/lukas_pokryvka/dp2021/mnist/mnist.py (new file, 92 lines)
							| @ -0,0 +1,92 @@ | ||||
| import os | ||||
| from datetime import datetime | ||||
| import argparse | ||||
| import torch.multiprocessing as mp | ||||
| import torchvision | ||||
| import torchvision.transforms as transforms | ||||
| import torch | ||||
| import torch.nn as nn | ||||
| import torch.distributed as dist | ||||
| from apex.parallel import DistributedDataParallel as DDP | ||||
| from apex import amp | ||||
| 
 | ||||
| 
 | ||||
| def main(): | ||||
|     parser = argparse.ArgumentParser() | ||||
|     parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N', | ||||
|                         help='number of nodes (default: 1)') | ||||
|     parser.add_argument('-g', '--gpus', default=1, type=int, | ||||
|                         help='number of gpus per node') | ||||
|     parser.add_argument('-nr', '--nr', default=0, type=int, | ||||
|                         help='ranking within the nodes') | ||||
|     parser.add_argument('--epochs', default=2, type=int, metavar='N', | ||||
|                         help='number of total epochs to run') | ||||
|     args = parser.parse_args() | ||||
|     train(0, args) | ||||
| 
 | ||||
| 
 | ||||
| class ConvNet(nn.Module): | ||||
|     def __init__(self, num_classes=10): | ||||
|         super(ConvNet, self).__init__() | ||||
|         self.layer1 = nn.Sequential( | ||||
|             nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2), | ||||
|             nn.BatchNorm2d(16), | ||||
|             nn.ReLU(), | ||||
|             nn.MaxPool2d(kernel_size=2, stride=2)) | ||||
|         self.layer2 = nn.Sequential( | ||||
|             nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2), | ||||
|             nn.BatchNorm2d(32), | ||||
|             nn.ReLU(), | ||||
|             nn.MaxPool2d(kernel_size=2, stride=2)) | ||||
|         self.fc = nn.Linear(7*7*32, num_classes) | ||||
| 
 | ||||
|     def forward(self, x): | ||||
|         out = self.layer1(x) | ||||
|         out = self.layer2(out) | ||||
|         out = out.reshape(out.size(0), -1) | ||||
|         out = self.fc(out) | ||||
|         return out | ||||
| 
 | ||||
| 
 | ||||
| def train(gpu, args): | ||||
|     model = ConvNet() | ||||
|     torch.cuda.set_device(gpu) | ||||
|     model.cuda(gpu) | ||||
|     batch_size = 50 | ||||
|     # define loss function (criterion) and optimizer | ||||
|     criterion = nn.CrossEntropyLoss().cuda(gpu) | ||||
|     optimizer = torch.optim.SGD(model.parameters(), 1e-4) | ||||
|     # Data loading code | ||||
|     train_dataset = torchvision.datasets.MNIST(root='./data', | ||||
|                                                train=True, | ||||
|                                                transform=transforms.ToTensor(), | ||||
|                                                download=True) | ||||
|     train_loader = torch.utils.data.DataLoader(dataset=train_dataset, | ||||
|                                                batch_size=batch_size, | ||||
|                                                shuffle=True, | ||||
|                                                num_workers=0, | ||||
|                                                pin_memory=True) | ||||
| 
 | ||||
|     start = datetime.now() | ||||
|     total_step = len(train_loader) | ||||
|     for epoch in range(args.epochs): | ||||
|         for i, (images, labels) in enumerate(train_loader): | ||||
|             images = images.cuda(non_blocking=True) | ||||
|             labels = labels.cuda(non_blocking=True) | ||||
|             # Forward pass | ||||
|             outputs = model(images) | ||||
|             loss = criterion(outputs, labels) | ||||
| 
 | ||||
|             # Backward and optimize | ||||
|             optimizer.zero_grad() | ||||
|             loss.backward() | ||||
|             optimizer.step() | ||||
|             if (i + 1) % 100 == 0 and gpu == 0: | ||||
|                 print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step, | ||||
|                                                                          loss.item())) | ||||
|     if gpu == 0: | ||||
|         print("Training complete in: " + str(datetime.now() - start)) | ||||
| 
 | ||||
| 
 | ||||
| if __name__ == '__main__': | ||||
|     main() | ||||
							
								
								
									
pages/students/2016/lukas_pokryvka/dp2021/yelp/script.py (new file, 748 lines)
							| @ -0,0 +1,748 @@ | ||||
| from argparse import Namespace | ||||
| from collections import Counter | ||||
| import json | ||||
| import os | ||||
| import re | ||||
| import string | ||||
| 
 | ||||
| import numpy as np | ||||
| import pandas as pd | ||||
| import torch | ||||
| import torch.nn as nn | ||||
| import torch.nn.functional as F | ||||
| import torch.optim as optim | ||||
| from torch.utils.data import Dataset, DataLoader | ||||
| from tqdm.notebook import tqdm | ||||
| 
 | ||||
| 
 | ||||
| class Vocabulary(object): | ||||
|     """Class to process text and extract vocabulary for mapping""" | ||||
| 
 | ||||
|     def __init__(self, token_to_idx=None, add_unk=True, unk_token="<UNK>"): | ||||
|         """ | ||||
|         Args: | ||||
|             token_to_idx (dict): a pre-existing map of tokens to indices | ||||
|             add_unk (bool): a flag that indicates whether to add the UNK token | ||||
|             unk_token (str): the UNK token to add into the Vocabulary | ||||
|         """ | ||||
| 
 | ||||
|         if token_to_idx is None: | ||||
|             token_to_idx = {} | ||||
|         self._token_to_idx = token_to_idx | ||||
| 
 | ||||
|         self._idx_to_token = {idx: token  | ||||
|                               for token, idx in self._token_to_idx.items()} | ||||
|          | ||||
|         self._add_unk = add_unk | ||||
|         self._unk_token = unk_token | ||||
|          | ||||
|         self.unk_index = -1 | ||||
|         if add_unk: | ||||
|             self.unk_index = self.add_token(unk_token)  | ||||
|          | ||||
|          | ||||
|     def to_serializable(self): | ||||
|         """ returns a dictionary that can be serialized """ | ||||
|         return {'token_to_idx': self._token_to_idx,  | ||||
|                 'add_unk': self._add_unk,  | ||||
|                 'unk_token': self._unk_token} | ||||
| 
 | ||||
|     @classmethod | ||||
|     def from_serializable(cls, contents): | ||||
|         """ instantiates the Vocabulary from a serialized dictionary """ | ||||
|         return cls(**contents) | ||||
| 
 | ||||
|     def add_token(self, token): | ||||
|         """Update mapping dicts based on the token. | ||||
| 
 | ||||
|         Args: | ||||
|             token (str): the item to add into the Vocabulary | ||||
|         Returns: | ||||
|             index (int): the integer corresponding to the token | ||||
|         """ | ||||
|         if token in self._token_to_idx: | ||||
|             index = self._token_to_idx[token] | ||||
|         else: | ||||
|             index = len(self._token_to_idx) | ||||
|             self._token_to_idx[token] = index | ||||
|             self._idx_to_token[index] = token | ||||
|         return index | ||||
|      | ||||
|     def add_many(self, tokens): | ||||
|         """Add a list of tokens into the Vocabulary | ||||
|          | ||||
|         Args: | ||||
|             tokens (list): a list of string tokens | ||||
|         Returns: | ||||
|             indices (list): a list of indices corresponding to the tokens | ||||
|         """ | ||||
|         return [self.add_token(token) for token in tokens] | ||||
| 
 | ||||
|     def lookup_token(self, token): | ||||
|         """Retrieve the index associated with the token  | ||||
|           or the UNK index if token isn't present. | ||||
|          | ||||
|         Args: | ||||
|             token (str): the token to look up  | ||||
|         Returns: | ||||
|             index (int): the index corresponding to the token | ||||
|         Notes: | ||||
|             `unk_index` needs to be >=0 (having been added into the Vocabulary)  | ||||
|               for the UNK functionality  | ||||
|         """ | ||||
|         if self.unk_index >= 0: | ||||
|             return self._token_to_idx.get(token, self.unk_index) | ||||
|         else: | ||||
|             return self._token_to_idx[token] | ||||
| 
 | ||||
|     def lookup_index(self, index): | ||||
|         """Return the token associated with the index | ||||
|          | ||||
|         Args:  | ||||
|             index (int): the index to look up | ||||
|         Returns: | ||||
|             token (str): the token corresponding to the index | ||||
|         Raises: | ||||
|             KeyError: if the index is not in the Vocabulary | ||||
|         """ | ||||
|         if index not in self._idx_to_token: | ||||
|             raise KeyError("the index (%d) is not in the Vocabulary" % index) | ||||
|         return self._idx_to_token[index] | ||||
| 
 | ||||
|     def __str__(self): | ||||
|         return "<Vocabulary(size=%d)>" % len(self) | ||||
| 
 | ||||
|     def __len__(self): | ||||
|         return len(self._token_to_idx) | ||||
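
A small round-trip illustration of the class (editorial sketch, not part of the script): tokens get stable indices, unknown tokens fall back to the <UNK> index, and the mapping survives to_serializable/from_serializable.

vocab = Vocabulary(add_unk=True)
vocab.add_many(["good", "food", "slow", "service"])

assert vocab.lookup_token("good") == vocab.lookup_token("good")   # stable index
assert vocab.lookup_token("never-seen") == vocab.unk_index        # UNK fallback

restored = Vocabulary.from_serializable(vocab.to_serializable())
assert restored.lookup_token("food") == vocab.lookup_token("food")
assert len(restored) == len(vocab)                                # 4 tokens + <UNK>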
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| class ReviewVectorizer(object): | ||||
|     """ The Vectorizer which coordinates the Vocabularies and puts them to use""" | ||||
|     def __init__(self, review_vocab, rating_vocab): | ||||
|         """ | ||||
|         Args: | ||||
|             review_vocab (Vocabulary): maps words to integers | ||||
|             rating_vocab (Vocabulary): maps class labels to integers | ||||
|         """ | ||||
|         self.review_vocab = review_vocab | ||||
|         self.rating_vocab = rating_vocab | ||||
| 
 | ||||
|     def vectorize(self, review): | ||||
|         """Create a collapsed one-hit vector for the review | ||||
|          | ||||
|         Args: | ||||
|             review (str): the review  | ||||
|         Returns: | ||||
|             one_hot (np.ndarray): the collapsed one-hot encoding  | ||||
|         """ | ||||
|         one_hot = np.zeros(len(self.review_vocab), dtype=np.float32) | ||||
|          | ||||
|         for token in review.split(" "): | ||||
|             if token not in string.punctuation: | ||||
|                 one_hot[self.review_vocab.lookup_token(token)] = 1 | ||||
| 
 | ||||
|         return one_hot | ||||
| 
 | ||||
|     @classmethod | ||||
|     def from_dataframe(cls, review_df, cutoff=25): | ||||
|         """Instantiate the vectorizer from the dataset dataframe | ||||
|          | ||||
|         Args: | ||||
|             review_df (pandas.DataFrame): the review dataset | ||||
|             cutoff (int): the parameter for frequency-based filtering | ||||
|         Returns: | ||||
|             an instance of the ReviewVectorizer | ||||
|         """ | ||||
|         review_vocab = Vocabulary(add_unk=True) | ||||
|         rating_vocab = Vocabulary(add_unk=False) | ||||
|          | ||||
|         # Add ratings | ||||
|         for rating in sorted(set(review_df.rating)): | ||||
|             rating_vocab.add_token(rating) | ||||
| 
 | ||||
|         # Add top words if count > provided count | ||||
|         word_counts = Counter() | ||||
|         for review in review_df.review: | ||||
|             for word in review.split(" "): | ||||
|                 if word not in string.punctuation: | ||||
|                     word_counts[word] += 1 | ||||
|                 | ||||
|         for word, count in word_counts.items(): | ||||
|             if count > cutoff: | ||||
|                 review_vocab.add_token(word) | ||||
| 
 | ||||
|         return cls(review_vocab, rating_vocab) | ||||
| 
 | ||||
|     @classmethod | ||||
|     def from_serializable(cls, contents): | ||||
|         """Instantiate a ReviewVectorizer from a serializable dictionary | ||||
|          | ||||
|         Args: | ||||
|             contents (dict): the serializable dictionary | ||||
|         Returns: | ||||
|             an instance of the ReviewVectorizer class | ||||
|         """ | ||||
|         review_vocab = Vocabulary.from_serializable(contents['review_vocab']) | ||||
|         rating_vocab =  Vocabulary.from_serializable(contents['rating_vocab']) | ||||
| 
 | ||||
|         return cls(review_vocab=review_vocab, rating_vocab=rating_vocab) | ||||
| 
 | ||||
|     def to_serializable(self): | ||||
|         """Create the serializable dictionary for caching | ||||
|          | ||||
|         Returns: | ||||
|             contents (dict): the serializable dictionary | ||||
|         """ | ||||
|         return {'review_vocab': self.review_vocab.to_serializable(), | ||||
|                 'rating_vocab': self.rating_vocab.to_serializable()} | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| class ReviewDataset(Dataset): | ||||
|     def __init__(self, review_df, vectorizer): | ||||
|         """ | ||||
|         Args: | ||||
|             review_df (pandas.DataFrame): the dataset | ||||
|             vectorizer (ReviewVectorizer): vectorizer instantiated from dataset | ||||
|         """ | ||||
|         self.review_df = review_df | ||||
|         self._vectorizer = vectorizer | ||||
| 
 | ||||
|         self.train_df = self.review_df[self.review_df.split=='train'] | ||||
|         self.train_size = len(self.train_df) | ||||
| 
 | ||||
|         self.val_df = self.review_df[self.review_df.split=='val'] | ||||
|         self.validation_size = len(self.val_df) | ||||
| 
 | ||||
|         self.test_df = self.review_df[self.review_df.split=='test'] | ||||
|         self.test_size = len(self.test_df) | ||||
| 
 | ||||
|         self._lookup_dict = {'train': (self.train_df, self.train_size), | ||||
|                              'val': (self.val_df, self.validation_size), | ||||
|                              'test': (self.test_df, self.test_size)} | ||||
| 
 | ||||
|         self.set_split('train') | ||||
| 
 | ||||
|     @classmethod | ||||
|     def load_dataset_and_make_vectorizer(cls, review_csv): | ||||
|         """Load dataset and make a new vectorizer from scratch | ||||
|          | ||||
|         Args: | ||||
|             review_csv (str): location of the dataset | ||||
|         Returns: | ||||
|             an instance of ReviewDataset | ||||
|         """ | ||||
|         review_df = pd.read_csv(review_csv) | ||||
|         train_review_df = review_df[review_df.split=='train'] | ||||
|         return cls(review_df, ReviewVectorizer.from_dataframe(train_review_df)) | ||||
|      | ||||
|     @classmethod | ||||
|     def load_dataset_and_load_vectorizer(cls, review_csv, vectorizer_filepath): | ||||
|         """Load dataset and the corresponding vectorizer.  | ||||
|         Used in the case where the vectorizer has been cached for re-use | ||||
|          | ||||
|         Args: | ||||
|             review_csv (str): location of the dataset | ||||
|             vectorizer_filepath (str): location of the saved vectorizer | ||||
|         Returns: | ||||
|             an instance of ReviewDataset | ||||
|         """ | ||||
|         review_df = pd.read_csv(review_csv) | ||||
|         vectorizer = cls.load_vectorizer_only(vectorizer_filepath) | ||||
|         return cls(review_df, vectorizer) | ||||
| 
 | ||||
|     @staticmethod | ||||
|     def load_vectorizer_only(vectorizer_filepath): | ||||
|         """a static method for loading the vectorizer from file | ||||
|          | ||||
|         Args: | ||||
|             vectorizer_filepath (str): the location of the serialized vectorizer | ||||
|         Returns: | ||||
|             an instance of ReviewVectorizer | ||||
|         """ | ||||
|         with open(vectorizer_filepath) as fp: | ||||
|             return ReviewVectorizer.from_serializable(json.load(fp)) | ||||
| 
 | ||||
|     def save_vectorizer(self, vectorizer_filepath): | ||||
|         """saves the vectorizer to disk using json | ||||
|          | ||||
|         Args: | ||||
|             vectorizer_filepath (str): the location to save the vectorizer | ||||
|         """ | ||||
|         with open(vectorizer_filepath, "w") as fp: | ||||
|             json.dump(self._vectorizer.to_serializable(), fp) | ||||
| 
 | ||||
|     def get_vectorizer(self): | ||||
|         """ returns the vectorizer """ | ||||
|         return self._vectorizer | ||||
| 
 | ||||
|     def set_split(self, split="train"): | ||||
|         """ selects the splits in the dataset using a column in the dataframe  | ||||
|          | ||||
|         Args: | ||||
|             split (str): one of "train", "val", or "test" | ||||
|         """ | ||||
|         self._target_split = split | ||||
|         self._target_df, self._target_size = self._lookup_dict[split] | ||||
| 
 | ||||
|     def __len__(self): | ||||
|         return self._target_size | ||||
| 
 | ||||
|     def __getitem__(self, index): | ||||
|         """the primary entry point method for PyTorch datasets | ||||
|          | ||||
|         Args: | ||||
|             index (int): the index to the data point  | ||||
|         Returns: | ||||
|             a dictionary holding the data point's features (x_data) and label (y_target) | ||||
|         """ | ||||
|         row = self._target_df.iloc[index] | ||||
| 
 | ||||
|         review_vector = \ | ||||
|             self._vectorizer.vectorize(row.review) | ||||
| 
 | ||||
|         rating_index = \ | ||||
|             self._vectorizer.rating_vocab.lookup_token(row.rating) | ||||
| 
 | ||||
|         return {'x_data': review_vector, | ||||
|                 'y_target': rating_index} | ||||
| 
 | ||||
|     def get_num_batches(self, batch_size): | ||||
|         """Given a batch size, return the number of batches in the dataset | ||||
|          | ||||
|         Args: | ||||
|             batch_size (int) | ||||
|         Returns: | ||||
|             number of batches in the dataset | ||||
|         """ | ||||
|         return len(self) // batch_size   | ||||
|      | ||||
| def generate_batches(dataset, batch_size, shuffle=True, | ||||
|                      drop_last=True, device="cpu"): | ||||
|     """ | ||||
|     A generator function which wraps the PyTorch DataLoader. It will  | ||||
|       ensure each tensor is on the right device. | ||||
|     """ | ||||
|     dataloader = DataLoader(dataset=dataset, batch_size=batch_size, | ||||
|                             shuffle=shuffle, drop_last=drop_last) | ||||
| 
 | ||||
|     for data_dict in dataloader: | ||||
|         out_data_dict = {} | ||||
|         for name, tensor in data_dict.items(): | ||||
|             out_data_dict[name] = data_dict[name].to(device) | ||||
|         yield out_data_dict | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| class ReviewClassifier(nn.Module): | ||||
|     """ a simple perceptron based classifier """ | ||||
|     def __init__(self, num_features): | ||||
|         """ | ||||
|         Args: | ||||
|             num_features (int): the size of the input feature vector | ||||
|         """ | ||||
|         super(ReviewClassifier, self).__init__() | ||||
|         self.fc1 = nn.Linear(in_features=num_features,  | ||||
|                              out_features=1) | ||||
| 
 | ||||
|     def forward(self, x_in, apply_sigmoid=False): | ||||
|         """The forward pass of the classifier | ||||
|          | ||||
|         Args: | ||||
|             x_in (torch.Tensor): an input data tensor.  | ||||
|                 x_in.shape should be (batch, num_features) | ||||
|             apply_sigmoid (bool): a flag for the sigmoid activation; | ||||
|                 should be False when used with the cross-entropy losses | ||||
|         Returns: | ||||
|             the resulting tensor. tensor.shape should be (batch,) | ||||
|         """ | ||||
|         y_out = self.fc1(x_in).squeeze() | ||||
|         if apply_sigmoid: | ||||
|             y_out = torch.sigmoid(y_out) | ||||
|         return y_out | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| def make_train_state(args): | ||||
|     return {'stop_early': False, | ||||
|             'early_stopping_step': 0, | ||||
|             'early_stopping_best_val': 1e8, | ||||
|             'learning_rate': args.learning_rate, | ||||
|             'epoch_index': 0, | ||||
|             'train_loss': [], | ||||
|             'train_acc': [], | ||||
|             'val_loss': [], | ||||
|             'val_acc': [], | ||||
|             'test_loss': -1, | ||||
|             'test_acc': -1, | ||||
|             'model_filename': args.model_state_file} | ||||
| 
 | ||||
| def update_train_state(args, model, train_state): | ||||
|     """Handle the training state updates. | ||||
| 
 | ||||
|     Components: | ||||
|      - Early Stopping: Prevent overfitting. | ||||
|      - Model Checkpoint: Model is saved if the model is better | ||||
| 
 | ||||
|     :param args: main arguments | ||||
|     :param model: model to train | ||||
|     :param train_state: a dictionary representing the training state values | ||||
|     :returns: | ||||
|         a new train_state | ||||
|     """ | ||||
| 
 | ||||
|     # Save one model at least | ||||
|     if train_state['epoch_index'] == 0: | ||||
|         torch.save(model.state_dict(), train_state['model_filename']) | ||||
|         train_state['stop_early'] = False | ||||
| 
 | ||||
|     # Save model if performance improved | ||||
|     elif train_state['epoch_index'] >= 1: | ||||
|         loss_tm1, loss_t = train_state['val_loss'][-2:] | ||||
| 
 | ||||
|         # If loss worsened | ||||
|         if loss_t >= train_state['early_stopping_best_val']: | ||||
|             # Update step | ||||
|             train_state['early_stopping_step'] += 1 | ||||
|         # Loss decreased | ||||
|         else: | ||||
|             # Save the best model | ||||
|             if loss_t < train_state['early_stopping_best_val']: | ||||
|                 torch.save(model.state_dict(), train_state['model_filename']) | ||||
| 
 | ||||
|             # Reset early stopping step | ||||
|             train_state['early_stopping_step'] = 0 | ||||
| 
 | ||||
|         # Stop early ? | ||||
|         train_state['stop_early'] = \ | ||||
|             train_state['early_stopping_step'] >= args.early_stopping_criteria | ||||
| 
 | ||||
|     return train_state | ||||
| 
 | ||||
| def compute_accuracy(y_pred, y_target): | ||||
|     y_target = y_target.cpu() | ||||
|     y_pred_indices = (torch.sigmoid(y_pred)>0.5).cpu().long()#.max(dim=1)[1] | ||||
|     n_correct = torch.eq(y_pred_indices, y_target).sum().item() | ||||
|     return n_correct / len(y_pred_indices) * 100 | ||||
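
compute_accuracy thresholds the sigmoid of the raw logits at 0.5 and reports percent correct, which is why the classifier can keep apply_sigmoid=False during training with BCEWithLogitsLoss. A tiny worked check (editorial sketch):

import torch

logits  = torch.tensor([2.0, -1.5, 0.3, -0.2])   # sigmoids: 0.88, 0.18, 0.57, 0.45
targets = torch.tensor([1,    0,    0,    0])
# predictions after thresholding: 1, 0, 1, 0  ->  3 of 4 correct
assert compute_accuracy(logits, targets) == 75.0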
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| def set_seed_everywhere(seed, cuda): | ||||
|     np.random.seed(seed) | ||||
|     torch.manual_seed(seed) | ||||
|     if cuda: | ||||
|         torch.cuda.manual_seed_all(seed) | ||||
| 
 | ||||
| def handle_dirs(dirpath): | ||||
|     if not os.path.exists(dirpath): | ||||
|         os.makedirs(dirpath) | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| args = Namespace( | ||||
|     # Data and Path information | ||||
|     frequency_cutoff=25, | ||||
|     model_state_file='model.pth', | ||||
|     review_csv='data/yelp/reviews_with_splits_lite.csv', | ||||
|     # review_csv='data/yelp/reviews_with_splits_full.csv', | ||||
|     save_dir='model_storage/ch3/yelp/', | ||||
|     vectorizer_file='vectorizer.json', | ||||
|     # No Model hyper parameters | ||||
|     # Training hyper parameters | ||||
|     batch_size=128, | ||||
|     early_stopping_criteria=5, | ||||
|     learning_rate=0.001, | ||||
|     num_epochs=100, | ||||
|     seed=1337, | ||||
|     # Runtime options | ||||
|     catch_keyboard_interrupt=True, | ||||
|     cuda=True, | ||||
|     expand_filepaths_to_save_dir=True, | ||||
|     reload_from_files=False, | ||||
| ) | ||||
| 
 | ||||
| if args.expand_filepaths_to_save_dir: | ||||
|     args.vectorizer_file = os.path.join(args.save_dir, | ||||
|                                         args.vectorizer_file) | ||||
| 
 | ||||
|     args.model_state_file = os.path.join(args.save_dir, | ||||
|                                          args.model_state_file) | ||||
|      | ||||
|     print("Expanded filepaths: ") | ||||
|     print("\t{}".format(args.vectorizer_file)) | ||||
|     print("\t{}".format(args.model_state_file)) | ||||
|      | ||||
| # Check CUDA | ||||
| if not torch.cuda.is_available(): | ||||
|     args.cuda = False | ||||
| if torch.cuda.device_count() > 1: | ||||
|   print("Pouzivam", torch.cuda.device_count(), "graficke karty!") | ||||
| 
 | ||||
| args.device = torch.device("cuda" if args.cuda else "cpu") | ||||
| 
 | ||||
| # Set seed for reproducibility | ||||
| set_seed_everywhere(args.seed, args.cuda) | ||||
| 
 | ||||
| # handle dirs | ||||
| handle_dirs(args.save_dir) | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| if args.reload_from_files: | ||||
|     # training from a checkpoint | ||||
|     print("Loading dataset and vectorizer") | ||||
|     dataset = ReviewDataset.load_dataset_and_load_vectorizer(args.review_csv, | ||||
|                                                             args.vectorizer_file) | ||||
| else: | ||||
|     print("Loading dataset and creating vectorizer") | ||||
|     # create dataset and vectorizer | ||||
|     dataset = ReviewDataset.load_dataset_and_make_vectorizer(args.review_csv) | ||||
|     dataset.save_vectorizer(args.vectorizer_file)     | ||||
| vectorizer = dataset.get_vectorizer() | ||||
| 
 | ||||
| classifier = ReviewClassifier(num_features=len(vectorizer.review_vocab)) | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| classifier = nn.DataParallel(classifier) | ||||
| classifier = classifier.to(args.device) | ||||
| 
 | ||||
| loss_func = nn.BCEWithLogitsLoss() | ||||
| optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate) | ||||
| scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer, | ||||
|                                                  mode='min', factor=0.5, | ||||
|                                                  patience=1) | ||||
| 
 | ||||
| train_state = make_train_state(args) | ||||
| 
 | ||||
| epoch_bar = tqdm(desc='training routine',  | ||||
|                           total=args.num_epochs, | ||||
|                           position=0) | ||||
| 
 | ||||
| dataset.set_split('train') | ||||
| train_bar = tqdm(desc='split=train', | ||||
|                           total=dataset.get_num_batches(args.batch_size),  | ||||
|                           position=1,  | ||||
|                           leave=True) | ||||
| dataset.set_split('val') | ||||
| val_bar = tqdm(desc='split=val', | ||||
|                         total=dataset.get_num_batches(args.batch_size),  | ||||
|                         position=1,  | ||||
|                         leave=True) | ||||
| 
 | ||||
| try: | ||||
|     for epoch_index in range(args.num_epochs): | ||||
|         train_state['epoch_index'] = epoch_index | ||||
| 
 | ||||
|         # Iterate over training dataset | ||||
| 
 | ||||
|         # setup: batch generator, set loss and acc to 0, set train mode on | ||||
|         dataset.set_split('train') | ||||
|         batch_generator = generate_batches(dataset,  | ||||
|                                            batch_size=args.batch_size,  | ||||
|                                            device=args.device) | ||||
|         running_loss = 0.0 | ||||
|         running_acc = 0.0 | ||||
|         classifier.train() | ||||
| 
 | ||||
|         for batch_index, batch_dict in enumerate(batch_generator): | ||||
|             # the training routine is these 5 steps: | ||||
| 
 | ||||
|             # -------------------------------------- | ||||
|             # step 1. zero the gradients | ||||
|             optimizer.zero_grad() | ||||
| 
 | ||||
|             # step 2. compute the output | ||||
|             y_pred = classifier(x_in=batch_dict['x_data'].float()) | ||||
| 
 | ||||
|             # step 3. compute the loss | ||||
|             loss = loss_func(y_pred, batch_dict['y_target'].float()) | ||||
|             loss_t = loss.item() | ||||
|             running_loss += (loss_t - running_loss) / (batch_index + 1) | ||||
| 
 | ||||
|             # step 4. use loss to produce gradients | ||||
|             loss.backward() | ||||
| 
 | ||||
|             # step 5. use optimizer to take gradient step | ||||
|             optimizer.step() | ||||
|             # ----------------------------------------- | ||||
|             # compute the accuracy | ||||
|             acc_t = compute_accuracy(y_pred, batch_dict['y_target']) | ||||
|             running_acc += (acc_t - running_acc) / (batch_index + 1) | ||||
| 
 | ||||
|             # update bar | ||||
|             train_bar.set_postfix(loss=running_loss,  | ||||
|                                   acc=running_acc,  | ||||
|                                   epoch=epoch_index) | ||||
|             train_bar.update() | ||||
| 
 | ||||
|         train_state['train_loss'].append(running_loss) | ||||
|         train_state['train_acc'].append(running_acc) | ||||
| 
 | ||||
|         # Iterate over val dataset | ||||
| 
 | ||||
|         # setup: batch generator, set loss and acc to 0; set eval mode on | ||||
|         dataset.set_split('val') | ||||
|         batch_generator = generate_batches(dataset,  | ||||
|                                            batch_size=args.batch_size,  | ||||
|                                            device=args.device) | ||||
|         running_loss = 0. | ||||
|         running_acc = 0. | ||||
|         classifier.eval() | ||||
| 
 | ||||
|         for batch_index, batch_dict in enumerate(batch_generator): | ||||
| 
 | ||||
|             # compute the output | ||||
|             y_pred = classifier(x_in=batch_dict['x_data'].float()) | ||||
| 
 | ||||
|             # step 3. compute the loss | ||||
|             loss = loss_func(y_pred, batch_dict['y_target'].float()) | ||||
|             loss_t = loss.item() | ||||
|             running_loss += (loss_t - running_loss) / (batch_index + 1) | ||||
| 
 | ||||
|             # compute the accuracy | ||||
|             acc_t = compute_accuracy(y_pred, batch_dict['y_target']) | ||||
|             running_acc += (acc_t - running_acc) / (batch_index + 1) | ||||
|              | ||||
|             val_bar.set_postfix(loss=running_loss,  | ||||
|                                 acc=running_acc,  | ||||
|                                 epoch=epoch_index) | ||||
|             val_bar.update() | ||||
| 
 | ||||
|         train_state['val_loss'].append(running_loss) | ||||
|         train_state['val_acc'].append(running_acc) | ||||
| 
 | ||||
|         train_state = update_train_state(args=args, model=classifier, | ||||
|                                          train_state=train_state) | ||||
| 
 | ||||
|         scheduler.step(train_state['val_loss'][-1]) | ||||
| 
 | ||||
|         train_bar.n = 0 | ||||
|         val_bar.n = 0 | ||||
|         epoch_bar.update() | ||||
| 
 | ||||
|         if train_state['stop_early']: | ||||
|             break | ||||
| 
 | ||||
| except KeyboardInterrupt: | ||||
|     print("Exiting loop") | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| classifier.load_state_dict(torch.load(train_state['model_filename'])) | ||||
| classifier = classifier.to(args.device) | ||||
| 
 | ||||
| dataset.set_split('test') | ||||
| batch_generator = generate_batches(dataset,  | ||||
|                                    batch_size=args.batch_size,  | ||||
|                                    device=args.device) | ||||
| running_loss = 0. | ||||
| running_acc = 0. | ||||
| classifier.eval() | ||||
| 
 | ||||
| for batch_index, batch_dict in enumerate(batch_generator): | ||||
|     # compute the output | ||||
|     y_pred = classifier(x_in=batch_dict['x_data'].float()) | ||||
| 
 | ||||
|     # compute the loss | ||||
|     loss = loss_func(y_pred, batch_dict['y_target'].float()) | ||||
|     loss_t = loss.item() | ||||
|     running_loss += (loss_t - running_loss) / (batch_index + 1) | ||||
| 
 | ||||
|     # compute the accuracy | ||||
|     acc_t = compute_accuracy(y_pred, batch_dict['y_target']) | ||||
|     running_acc += (acc_t - running_acc) / (batch_index + 1) | ||||
| 
 | ||||
| train_state['test_loss'] = running_loss | ||||
| train_state['test_acc'] = running_acc | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| print("Test loss: {:.3f}".format(train_state['test_loss'])) | ||||
| print("Test Accuracy: {:.2f}".format(train_state['test_acc'])) | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| def preprocess_text(text): | ||||
|     text = text.lower() | ||||
|     text = re.sub(r"([.,!?])", r" \1 ", text) | ||||
|     text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text) | ||||
|     return text | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| def predict_rating(review, classifier, vectorizer, decision_threshold=0.5): | ||||
|     """Predict the rating of a review | ||||
|      | ||||
|     Args: | ||||
|         review (str): the text of the review | ||||
|         classifier (ReviewClassifier): the trained model | ||||
|         vectorizer (ReviewVectorizer): the corresponding vectorizer | ||||
|         decision_threshold (float): The numerical boundary which separates the rating classes | ||||
|     """ | ||||
|     review = preprocess_text(review) | ||||
|      | ||||
|     vectorized_review = torch.tensor(vectorizer.vectorize(review)) | ||||
|     result = classifier(vectorized_review.view(1, -1)) | ||||
|      | ||||
|     probability_value = torch.sigmoid(result).item() | ||||
|     index = 1 | ||||
|     if probability_value < decision_threshold: | ||||
|         index = 0 | ||||
| 
 | ||||
|     return vectorizer.rating_vocab.lookup_index(index) | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| test_review = "this is a pretty awesome book" | ||||
| 
 | ||||
| classifier = classifier.module  # unwrap the nn.DataParallel container for CPU inference and weight inspection | ||||
| classifier = classifier.cpu() | ||||
| prediction = predict_rating(test_review, classifier, vectorizer, decision_threshold=0.5) | ||||
| print("{} -> {}".format(test_review, prediction)) | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| # Sort weights | ||||
| fc1_weights = classifier.fc1.weight.detach()[0] | ||||
| _, indices = torch.sort(fc1_weights, dim=0, descending=True) | ||||
| indices = indices.numpy().tolist() | ||||
| 
 | ||||
| # Top 20 words | ||||
| print("Influential words in Positive Reviews:") | ||||
| print("--------------------------------------") | ||||
| for i in range(20): | ||||
|     print(vectorizer.review_vocab.lookup_index(indices[i])) | ||||
|      | ||||
| print("====\n\n\n") | ||||
| 
 | ||||
| # Top 20 negative words | ||||
| print("Influential words in Negative Reviews:") | ||||
| print("--------------------------------------") | ||||
| indices.reverse() | ||||
| for i in range(20): | ||||
|     print(vectorizer.review_vocab.lookup_index(indices[i])) | ||||
| @ -12,16 +12,42 @@ taxonomy: | ||||
| 
 | ||||
| Task backlog: | ||||
| 
 | ||||
| - Try to present at a local conference (Data, Znalosti and WIKT) or in the faculty proceedings (a short version of the thesis). | ||||
| - Use the Multext East corpus for training. Create a mapping from Multext tags to SNK tags. | ||||
| 
 | ||||
| 
 | ||||
| Virtual meeting 6.11.2020 | ||||
| 
 | ||||
| Status: | ||||
| 
 | ||||
| - Read two articles in detail and took notes. The notes are on Git. | ||||
| - Completed further experiments. | ||||
| 
 | ||||
| Tasks for the next meeting: | ||||
| 
 | ||||
| - Continue working on the open tasks. | ||||
| 
 | ||||
| 
 | ||||
| Virtual meeting 30.10.2020 | ||||
| 
 | ||||
| Status: | ||||
| 
 | ||||
| - The files are on Git. | ||||
| - Experiments carried out; the results are summarized in a table. | ||||
| - Instructions for running the experiments written. | ||||
| - Technical problems resolved. A Conda environment is available. | ||||
| 
 | ||||
| Tasks for the next meeting: | ||||
| 
 | ||||
| - Study the literature on "pretraining" and "word embeddings": | ||||
|     - [Healthcare NER Models Using Language Model Pretraining](http://ceur-ws.org/Vol-2551/paper-04.pdf) | ||||
|     - [Design and implementation of an open source Greek POS Tagger and Entity Recognizer using spaCy](https://ieeexplore.ieee.org/abstract/document/8909591) | ||||
|     - https://arxiv.org/abs/1909.00505 | ||||
|     - https://arxiv.org/abs/1607.04606 | ||||
|     - LSTM, recurrent neural networks | ||||
|     - Take notes from several articles; for each one record the source and what you learned. | ||||
| - Run several pretraining experiments (different models, different sizes of adaptation data) and compile a table of results. | ||||
| - Describe pretraining and summarize its effect on training in a short article of about 10 pages. | ||||
| - Try to present at a local conference (Data, Znalosti and WIKT) or in the faculty proceedings (a short version of the thesis). | ||||
| - Use the Multext East corpus for training. Create a mapping from Multext tags to SNK tags. | ||||
| 
 | ||||
| 
 | ||||
| Virtual meeting 8.10.2020 | ||||
|  | ||||
| @ -21,6 +21,46 @@ Cieľom práce je príprava nástrojov a budovanie tzv. "Question Answering data | ||||
| 
 | ||||
| ## Diploma Project 2 | ||||
| 
 | ||||
| Task backlog: | ||||
| 
 | ||||
| - Is it possible to find out how much time an annotator spent creating a question? If this can be determined from the DB schema, it would be good to show it in the web application. | ||||
| 
 | ||||
| 
 | ||||
| Virtual meeting 27.10.2020 | ||||
| 
 | ||||
| Status: | ||||
| 
 | ||||
| - Finished the web application according to the instructions from the last meeting; the code is on Git. | ||||
| 
 | ||||
| Tasks for the next meeting: | ||||
| 
 | ||||
| - Create a configuration system: load the configuration from a file (python-configuration?). The name of the configuration file should be changeable via an environment variable (getenv); a sketch follows this list. | ||||
| - Add authentication for annotators when displaying results, so that each annotator sees only their own results. Is it necessary? For now implement it using e-mail only. | ||||
| - Add a password to the web application. | ||||
| - Add a display of good and bad annotations for each annotator. | ||||
| - Study the research literature on "Crowdsourcing language resources". Select several publications (Scholar, Scopus), write down the bibliographic reference and what you learned from each about building language resources. What other corpora have been created with this method? | ||||
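| A minimal sketch of such a configuration system using only the standard library (the variable name ANNOTATION_CONFIG, the JSON format and the config keys are assumptions, not part of the project): | ||||
|  | ||||
| ```python | ||||
| import json | ||||
| import os | ||||
|  | ||||
| # the name of the configuration file is read from an environment variable, | ||||
| # with a fallback when the variable is not set | ||||
| config_path = os.getenv("ANNOTATION_CONFIG", "config.json") | ||||
|  | ||||
| with open(config_path, encoding="utf-8") as f: | ||||
|     config = json.load(f) | ||||
|  | ||||
| database_url = config.get("database_url", "sqlite:///annotations.db") | ||||
| ``` | ||||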
| 
 | ||||
| 
 | ||||
| 
 | ||||
| 
 | ||||
| Virtual meeting 20.10.2020 | ||||
| 
 | ||||
| Status: | ||||
| 
 | ||||
| - Improved the data preparation script, a slight change of the interface (duplicated work due to a communication gap). | ||||
| 
 | ||||
| Tasks for the next meeting: | ||||
| 
 | ||||
| - Finish the web application for determining the amount of annotated data. | ||||
| - Debug the errors related to the new annotation schema. | ||||
| - Display the amount of annotated data. | ||||
| - Display the amount of valid annotated data. | ||||
| - Display the amount of validated data. | ||||
| - Questions must not repeat within one paragraph. Every question must have an answer. Every question must be longer than 10 characters or longer than 2 words. The answer must contain at least one word. The question must contain Slovak words. A possible validation check is sketched after this list. | ||||
| - Send your results to the project repository as soon as possible, into the database_app directory. | ||||
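| A minimal sketch of a check for the rules above (the function name and whitespace tokenization are assumptions; recognizing Slovak words would in practice need a dictionary or a language detector): | ||||
|  | ||||
| ```python | ||||
| import re | ||||
|  | ||||
| def is_valid_question(question, answer, previous_questions): | ||||
|     q = question.strip() | ||||
|     a = answer.strip() | ||||
|     # questions must not repeat within one paragraph | ||||
|     if q.lower() in (p.strip().lower() for p in previous_questions): | ||||
|         return False | ||||
|     # every question must have an answer with at least one word | ||||
|     if len(a.split()) < 1: | ||||
|         return False | ||||
|     # the question must be longer than 10 characters or longer than 2 words | ||||
|     if len(q) <= 10 and len(q.split()) <= 2: | ||||
|         return False | ||||
|     # crude stand-in for "contains Slovak words": at least one alphabetic word | ||||
|     if not re.search(r"[a-záäčďéíĺľňóôŕšťúýž]{2,}", q.lower()): | ||||
|         return False | ||||
|     return True | ||||
| ``` | ||||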
| 
 | ||||
| 
 | ||||
| 
 | ||||
| Meeting 25.9.2020 | ||||
| 
 | ||||
| Done: | ||||
|  | ||||
| @ -6,10 +6,8 @@ taxonomy: | ||||
|     tag: [demo,nlp] | ||||
|     author: Daniel Hladek | ||||
| --- | ||||
| 
 | ||||
| # Martin Jancura | ||||
| 
 | ||||
| 
 | ||||
| *Year of starting studies*: 2017 | ||||
| 
 | ||||
| ## Bachelor Project 2020 | ||||
| @ -31,9 +29,36 @@ Možné backendy: | ||||
| Task backlog: | ||||
| 
 | ||||
| - Prepare the backend. | ||||
| - Prepare the frontend in Javascript - in progress. | ||||
| - Write a human-made translation into the database. | ||||
| 
 | ||||
| 
 | ||||
| Virtual meeting 6.11.2020: | ||||
| 
 | ||||
| Status: | ||||
| 
 | ||||
| Work on the written part of the thesis. | ||||
| 
 | ||||
| Tasks for the next meeting: | ||||
| 
 | ||||
| - Look for a library where we can use our own translation model. Try installing OpenNMT. | ||||
| - Go through the tutorial https://github.com/OpenNMT/OpenNMT-py#quickstart or a similar one. | ||||
| - Propose how to connect the frontend and the backend; a possible backend endpoint is sketched after this list. | ||||
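| One way to connect the two parts is a small HTTP API that the Javascript frontend calls. A minimal sketch (the use of Flask and the /translate route are assumptions; the translation function is only a stub standing in for a call to the trained model): | ||||
|  | ||||
| ```python | ||||
| from flask import Flask, jsonify, request | ||||
|  | ||||
| app = Flask(__name__) | ||||
|  | ||||
| def translate_text(text): | ||||
|     # placeholder: here the backend would call the real translation model | ||||
|     # (for example one trained with OpenNMT); this stub only echoes the input | ||||
|     return text | ||||
|  | ||||
| @app.route("/translate", methods=["POST"]) | ||||
| def translate(): | ||||
|     data = request.get_json(force=True) | ||||
|     return jsonify({"translation": translate_text(data.get("text", ""))}) | ||||
|  | ||||
| if __name__ == "__main__": | ||||
|     app.run(port=5000) | ||||
| ``` | ||||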
| 
 | ||||
| 
 | ||||
| Virtual meeting 23.10.2020: | ||||
| 
 | ||||
| Status: | ||||
| 
 | ||||
| - Built the frontend for communication with the Microsoft Translation API; it uses Axios and vanilla Javascript. | ||||
| 
 | ||||
| Tasks for the next meeting: | ||||
| 
 | ||||
| - Look for a library where we can use our own translation model. Try installing OpenNMT. | ||||
| - Find out what the CORS policy means; a short example follows this list. | ||||
| - Continue writing the thesis and add a section on machine translation. Read the articles at https://opennmt.net/OpenNMT/references/ and take notes. In each note include the bibliographic reference and what you learned from the article. | ||||
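| Background for the CORS task: a browser lets the Javascript frontend call a backend on a different origin only if the backend sends the appropriate CORS headers. A minimal sketch using the flask-cors package (the use of Flask here is an assumption): | ||||
|  | ||||
| ```python | ||||
| from flask import Flask | ||||
| from flask_cors import CORS  # pip install flask-cors | ||||
|  | ||||
| app = Flask(__name__) | ||||
| # adds Access-Control-Allow-Origin headers so that a frontend served | ||||
| # from another origin may call this API | ||||
| CORS(app) | ||||
| ``` | ||||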
| 
 | ||||
| 
 | ||||
| Virtual meeting 16.10: | ||||
| 
 | ||||
| Status: | ||||
|  | ||||
| @ -31,7 +31,42 @@ Návrh na zadanie: | ||||
| 1. Propose possible improvements to the application you created. | ||||
| 
 | ||||
| Task backlog: | ||||
| - Create a repository on Git and name it bp2010. You will put the code and documentation you create into it. | ||||
| 
 | ||||
| - Prepare a Docker image of your application according to https://pythonspeed.com/docker/ | ||||
| 
 | ||||
| 
 | ||||
| Virtual meeting 30.10.: | ||||
| 
 | ||||
| Status: | ||||
| 
 | ||||
| - Modified the existing "spacy-streamlit" application; the source code is on Git following the instructions from the last meeting. | ||||
| - It contains a form but does not contain a REST API. | ||||
| 
 | ||||
| Tasks for the next meeting: | ||||
| 
 | ||||
| - Continue writing. Read research articles on "dependency parsing" and write down notes on what you learned. Record the source. | ||||
| - Continue working on the demonstration web application. | ||||
| 
 | ||||
| 
 | ||||
| Virtual meeting 19.10.: | ||||
| 
 | ||||
| Status: | ||||
| 
 | ||||
| - Prepared and submitted notes for the bachelor thesis; they contain excerpts from the literature. | ||||
| - Created a repository: https://git.kemt.fei.tuke.sk/mw223on/bp2020 | ||||
| - Installed and ran the Slovak Spacy model. | ||||
| - Installed the Spacy REST API https://github.com/explosion/spacy-services | ||||
| - Tried out the displaCy demo with the Slovak model. | ||||
| 
 | ||||
| Tasks for the next meeting: | ||||
| 
 | ||||
| - Prepare a web application that presents dependency parsing and named entity recognition for the Slovak language. It should consist of a frontend and a backend; a minimal sketch follows this list. | ||||
| - Write the required Python packages into the "requirements.txt" file. | ||||
| - Create a script that installs the application with pip. | ||||
| - Create a script that starts both the backend and the frontend. Put the results into the repository. | ||||
| - Create a frontend design (HTML + CSS). | ||||
| - Look at the Spacy source code and find out what exactly the displacy.serve command does. | ||||
| - Put the results into the repository. | ||||
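| A minimal sketch of the backend part built directly on displacy.serve (the model name "sk_model" is only a placeholder for whichever Slovak model is installed locally): | ||||
|  | ||||
| ```python | ||||
| import spacy | ||||
| from spacy import displacy | ||||
|  | ||||
| # load the locally installed Slovak model (the name is a placeholder) | ||||
| nlp = spacy.load("sk_model") | ||||
|  | ||||
| doc = nlp("Univerzita sídli v Košiciach na Slovensku.") | ||||
|  | ||||
| # displacy.serve starts a small web server and renders the analysis as SVG/HTML; | ||||
| # style="dep" shows the dependency tree, style="ent" highlights named entities | ||||
| displacy.serve(doc, style="dep", port=5000) | ||||
| ``` | ||||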
| 
 | ||||
| Virtual meeting 9.10. | ||||
| 
 | ||||
|  | ||||
| @ -20,8 +20,21 @@ Návrh na zadanie: | ||||
| 2. Create a language model using BERT or a similar method. | ||||
| 3. Evaluate the created language model and propose improvements. | ||||
| 
 | ||||
| Task backlog: | ||||
| 
 | ||||
| Virtual meeting 30.10.2020 | ||||
| 
 | ||||
| Status: | ||||
| - Prepared notes on seq2seq. | ||||
| - Installed Pytorch and fairseq. | ||||
| - Problems with the tutorial. A possible solution is to use release version 0.9.0: pip install fairseq==0.9.0 | ||||
| 
 | ||||
| For the next meeting: | ||||
| 
 | ||||
| - Resolve the technical problems. | ||||
| - Go through the tutorial https://fairseq.readthedocs.io/en/latest/getting_started.html#training-a-new-model | ||||
| - Go through the tutorial https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md or a similar one; a short sketch of where it leads is given after this list. | ||||
| - Study articles on BERT and take notes on what you learned, together with the source. | ||||
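| For orientation, the RoBERTa tutorial ends roughly with loading a pretrained checkpoint like this (the paths and file names are placeholders; the calls follow the fairseq RoBERTa README): | ||||
|  | ||||
| ```python | ||||
| from fairseq.models.roberta import RobertaModel | ||||
|  | ||||
| # directory with the downloaded or trained checkpoint (placeholder path) | ||||
| roberta = RobertaModel.from_pretrained("/path/to/roberta.base", checkpoint_file="model.pt") | ||||
| roberta.eval()  # disable dropout for evaluation | ||||
|  | ||||
| tokens = roberta.encode("Hello world!")       # BPE-encode a sentence into a tensor of token ids | ||||
| features = roberta.extract_features(tokens)   # contextual embeddings, shape (1, seq_len, hidden) | ||||
| print(features.shape) | ||||
| ``` | ||||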
| 
 | ||||
| 
 | ||||
| Virtual meeting 16.10.2020 | ||||
| 
 | ||||
|  | ||||
| @ -23,13 +23,50 @@ Pokusný klaster Raspberry Pi pre výuku klaudových technológií | ||||
| The goal of the project is to build a cheap home cluster for teaching cloud technologies. | ||||
| 
 | ||||
| 
 | ||||
| Task backlog: | ||||
| 
 | ||||
| - Enable WSL2 and Docker Desktop if you are using Windows. | ||||
| 
 | ||||
| Virtual meeting 30.10. | ||||
| 
 | ||||
| Status: | ||||
| - Prepared a written overview according to the instructions. | ||||
| - Installed Raspberry Pi OS in VirtualBox. | ||||
| - Prepared a preliminary hardware design. | ||||
| - Installed Docker Toolbox and also Ubuntu with Docker. | ||||
| - Got familiar with Docker. | ||||
| - Supervisor: completed the hardware purchase - 5x RPi4 model B 8GB boards, 11x 128GB SD cards, 4x The Pi Hut Cluster Case for Raspberry Pi, 1x 60W power supply and an 18W Quick Charger Epico, a 220V cable and a socket with a switch. | ||||
| 
 | ||||
| For the next meeting: | ||||
| 
 | ||||
| - Can the official 5-port switch be bought? | ||||
| - Complete the purchase and agree on the handover. Sign the handover protocol. | ||||
| - Use https://kind.sigs.k8s.io to simulate the cluster. | ||||
| - Install https://microk8s.io/ and read the tutorials at https://ubuntu.com/tutorials/ | ||||
| - Go through https://kubernetes.io/docs/tutorials/hello-minikube/ or a similar tutorial. | ||||
| 
 | ||||
| 
 | ||||
| Virtual meeting 16.10. | ||||
| 
 | ||||
| 
 | ||||
| Status: | ||||
| - Read the articles. | ||||
| - Started the Docker tutorial from ZCT. | ||||
| - The supervisor created access to a Jetson Xavier AGX2 with an ARM processor. | ||||
| - Started the purchase of the Raspberry Pi and accessories. | ||||
| 
 | ||||
| Tasks for the next meeting: | ||||
| - Prepare an overview of at least 4 existing Raspberry Pi cluster builds (to be submitted). What hardware and software did they use? | ||||
|     - power supply, cooling, network interconnection | ||||
| - Get familiar with https://www.raspberrypi.org/downloads/raspberry-pi-os/ | ||||
| - Install https://roboticsbackend.com/install-raspbian-desktop-on-a-virtual-machine-virtualbox/ | ||||
| - Write a detailed hardware proposal for building the Raspberry Pi cluster. | ||||
| 
 | ||||
| Meeting 29.9. | ||||
| 
 | ||||
| 
 | ||||
| We agreed on the assignment of the thesis. | ||||
| 
 | ||||
| 
 | ||||
| Suggestions for improvement (for the supervisor): | ||||
| 
 | ||||
| - Find out the conditions for funding (estimate 350 EUR). | ||||
|  | ||||
| @ -39,23 +39,35 @@ Učenie prebieha tak, že v texte ukážete ktoré slová patria názvom osôb, | ||||
| 
 | ||||
| 
 | ||||
| Your task will be to mark proper nouns in the text. | ||||
| In Slovak a proper noun usually starts with a capital letter, but it may also contain further words written in lower case. | ||||
| If a proper noun contains another name within it, e.g. Nové Mesto nad Váhom, annotate it as a single unit. | ||||
| 
 | ||||
| - PER: names of persons | ||||
| - LOC: geographical names | ||||
| - ORG: names of organizations | ||||
| - MISC: other names, e.g. product names. | ||||
| 
 | ||||
| In the text you will also come across words that name a geographical area but are not proper nouns (e.g. "britská kolónia" - British colony, "londýnsky šerif" - London sheriff...). We do not consider such words named entities, so please do not mark them. | ||||
| 
 | ||||
| If the text contains no annotations at all, the article is still valid, so choose the Accept option. | ||||
| 
 | ||||
| If the text consists of only one word or a few words that carry no meaning on their own, the article is invalid, so choose the Reject option. | ||||
| 
 | ||||
| ## Annotation Batches | ||||
| 
 | ||||
| Write your e-mail into the form so that it is possible to recognize who performed the annotation. | ||||
| 
 | ||||
| During annotation you can use keyboard shortcuts to make the work easier: | ||||
| - 1, 2, 3, 4 - switch between entity types | ||||
| - "a" key - Accept | ||||
| - "x" key - Reject | ||||
| - "space" key - Ignore | ||||
| - "backspace" or "del" key - Undo | ||||
| 
 | ||||
| After annotating, do not forget to save your work (the icon in the upper left corner, or "Ctrl + s"). | ||||
| 
 | ||||
| ### Trial annotation batch | ||||
| 
 | ||||
| This batch is aimed at collecting feedback from annotators to improve the interface and the annotation process. | ||||
| 
 | ||||
| {% include "forms/form.html.twig" with { form: forms('ner1') } %} | ||||
| 
 | ||||
| 
 | ||||
|  | ||||
| @ -36,11 +36,12 @@ Učenie prebieha tak, že vytvoríte príklad s otázkou a odpoveďou. Účasť | ||||
| 
 | ||||
| ## Instructions for Annotators | ||||
| 
 | ||||
| First a short article will be displayed to you. Your task will be to read part of the article, come up with a question about it and mark the answer in the text. The question must be unambiguous and the answer to it must be present in the text of the article. You have about 50 seconds to mark one question. | ||||
| 
 | ||||
| 1. Read the article. If the article is not suitable, click the red cross "Reject" (Tab and then 'x'). | ||||
| 2. Write a question. If you cannot come up with a question, click "Ignore" (Tab and then 'i'). | ||||
| 3. Mark the answer with the mouse and click the green check mark "Accept" (the 'a' key), then continue with another question for the same article or for a new article. | ||||
| 4. The same article will be displayed to you 5 times; come up with 5 different questions for it. | ||||
| 
 | ||||
| If the displayed text is unsuitable, reject it. Unsuitable text: | ||||
| 
 | ||||
| @ -61,6 +62,12 @@ Ak je zobrazený text nevhodný, tak ho zamietnite. Nevhodný text: | ||||
| 4. <span style="color:pink">Na čo slúži lyzozóm? (What is the lysosome used for?)</span> | ||||
| 5. <span style="color:orange">Čo je to autofágia? (What is autophagy?)</span> | ||||
| 
 | ||||
| Examples of incorrect questions: | ||||
| 1. Čo je to Golgiho aparát? (What is the Golgi apparatus?) - the answer is not present in the article. | ||||
| 2. Čo sa deje v mŕtvych bunkách? (What happens in dead cells?) - the question is not unambiguous and the exact answer is not present in the article. | ||||
| 3. Čo je normálny fyziologický proces? (What is a normal physiological process?) - the answer is not present in the article. | ||||
| 
 | ||||
| 
 | ||||
| Write your e-mail into the form so that it is possible to recognize who performed the annotation. | ||||
| 
 | ||||
| ## Annotation Batches | ||||
|  | ||||