diff --git a/pages/students/2016/darius_lindvai/README.md b/pages/students/2016/darius_lindvai/README.md index b63d984a..48472cab 100644 --- a/pages/students/2016/darius_lindvai/README.md +++ b/pages/students/2016/darius_lindvai/README.md @@ -13,11 +13,28 @@ Repozitár so [zdrojovými kódmi](https://git.kemt.fei.tuke.sk/dl874wn/dp2021) ## Diplomový projekt 2 2020 +Virtuálne stretnutie 6.11.2020 + +Stav: + +- Vypracovaná tabuľka s 5 experimentami. +- vytvorený repozitár. + +Na ďalšie stretnutie: + +- nahrať kódy na repozitár. +- závislosťi (názvy balíčkov) poznačte do súboru requirements.txt. +- Prepracujte experiment tak aby akceptoval argumenty z príkazového riadka. (sys.argv) +- K experimentom zapísať skript na spustenie. V skripte by mali byť parametre s ktorými ste spustili experiment. +- dopracujte report. +- do teorie urobte prehľad metód punctuation restoration a opis Vašej metódy. + + Virtuálne stretnutie 25.9.2020 Urobené: -- skript pre vyhodnotenie experimentov +- skript pre vyhodnotenie experimentov. Úlohy do ďalšieho stretnutia: diff --git a/pages/students/2016/jakub_maruniak/README.md b/pages/students/2016/jakub_maruniak/README.md index 5908689f..7692d2c3 100644 --- a/pages/students/2016/jakub_maruniak/README.md +++ b/pages/students/2016/jakub_maruniak/README.md @@ -21,8 +21,21 @@ Zásobník úloh: - Použiť model na podporu anotácie - Do konca ZS vytvoriť report vo forme článku. -- Vytvorte systém pre zistenie množstva a druhu anotovaných dát. Koľko článkov? Koľko entít jednotlivvých typov? +- Spísať pravidlá pre validáciu. Aký výsledok anotácie je dobrý? Je potrebné anotované dáta skontrolovať? +Virtuálne stretnutie 30.10.2020: + +Stav: + +- Vylepšený návod +- Vyskúšaný export dát a trénovanie modelu z databázy. Problém pri trénovaní Spacy - iné výsledky ako cez Progigy trénovanie +- Práca na textovej čsati. + +Úlohy do ďalšieho stretnutia: +- Vytvorte si repozitár s názvom dp2021 a tam pridajte skripty a poznámky. +- Pokračujte v písaní práce. Vykonajte prieskum literatúry "named entity corpora" aj poznámky. +- Vytvorte systém pre zistenie množstva a druhu anotovaných dát. Koľko článkov? Koľko entít jednotlivvých typov? Výsledná tabuľka pôjde do práce. +- Pripraviť sa na produkčné anotácie. Je schéma pripravená? Virtuálne stretnutie 16.10.2020: diff --git a/pages/students/2016/jakub_maruniak/dp2021/README.md b/pages/students/2016/jakub_maruniak/dp2021/README.md index f3d29488..e1f1cc7b 100644 --- a/pages/students/2016/jakub_maruniak/dp2021/README.md +++ b/pages/students/2016/jakub_maruniak/dp2021/README.md @@ -1 +1,40 @@ -DP2021 \ No newline at end of file +## Diplomový projekt 2 2020 +Stav: +- aktualizácia anotačnej schémy (jedná sa o testovaciu schému s vlastnými dátami) +- vykonaných niekoľko anotácii, trénovanie v Prodigy - nízka presnosť = malé množstvo anotovaných dát. Trénovanie v spacy zatiaľ nefunguje. +- Štatistiky o množstve prijatých a odmietnutých anotácii získame z Prodigy: prodigy stats wikiart. Zatiaľ 156 anotácii (151 accept, 5 reject). Na získanie prehľadu o množstve anotácii jednotlivých entít potrebujeme vytvoriť skript. +- Prehľad literatúry Named Entity Corpus + - Budovanie korpusu pre NER – automatické vytvorenie už anotovaného korpusu z Wiki pomocou DBpedia – jedná sa o anglický korpus, ale možno spomenúť v porovnaní postupov + - Building a Massive Corpus for Named Entity Recognition using Free Open Data Sources - Daniel Specht Menezes, Pedro Savarese, Ruy L. 
Milidiú + - Porovnanie postupov pre anotáciu korpusu (z hľadiska presnosti aj času) - Manual, SemiManual + - Comparison of Annotating Methods for Named Entity Corpora - Kanako Komiya, Masaya Suzuki + - Čo je korpus, vývojový cyklus, analýza korpusu (Už využitá literatúra – cyklus MATTER) + - Natural Language Annotation for Machine Learning – James Pustejovsky, Amber Stubbs + +Aktualizácia 09.11.2020: +- Vyriešený problém, kedy nefungovalo trénovanie v spacy +- Vykonaná testovacia anotácia cca 500 viet. Výsledky trénovania pri 20 iteráciách: F-Score 47% (rovnaké výsledky pri trénovaní v Spacy aj Prodigy) +- Štatistika o počte jednotlivých entít: skript count.py + + +## Diplomový projekt 1 2020 + +- vytvorenie a spustenie docker kontajneru + + +``` +./build-docker.sh +docker run -it -p 8080:8080 -v ${PWD}:/work prodigy bash +# (v mojom prípade:) +winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/annotation/work prodigy bash +``` + + + + +### Spustenie anotačnej schémy +- `dataminer.csv` články stiahnuté z wiki +- `cd ner` +- `./01_text_to_sent.sh` spustenie skriptu *text_to_sent.py*, ktorý rozdelí články na jednotlivé vety +- `./02_ner_correct.sh` spustenie anotačného procesu pre NER s návrhmi od modelu +- `./03_ner_export.sh` exportovanie anotovaných dát vo formáte jsonl potrebnom pre spracovanie vo spacy diff --git a/pages/students/2016/jakub_maruniak/dp2021/annotation/Dockerfile b/pages/students/2016/jakub_maruniak/dp2021/annotation/Dockerfile index e9e36bc6..0531f482 100644 --- a/pages/students/2016/jakub_maruniak/dp2021/annotation/Dockerfile +++ b/pages/students/2016/jakub_maruniak/dp2021/annotation/Dockerfile @@ -1,17 +1,16 @@ -# > docker run -it -p 8080:8080 -v ${PWD}:/work prodigy bash -# > winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/annotation/work prodigy bash - -FROM python:3.8 -RUN mkdir /prodigy -WORKDIR /prodigy -COPY ./prodigy-1.9.6-cp36.cp37.cp38-cp36m.cp37m.cp38-linux_x86_64.whl /prodigy -RUN mkdir /work -COPY ./ner /work -RUN pip install prodigy-1.9.6-cp36.cp37.cp38-cp36m.cp37m.cp38-linux_x86_64.whl -RUN pip install https://files.kemt.fei.tuke.sk/models/spacy/sk_sk1-0.0.1.tar.gz -RUN pip install nltk -EXPOSE 8080 -ENV PRODIGY_HOME /work -ENV PRODIGY_HOST 0.0.0.0 -WORKDIR /work - +# > docker run -it -p 8080:8080 -v ${PWD}:/work prodigy bash +# > winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/annotation-master/annotation/work prodigy bash + +FROM python:3.8 +RUN mkdir /prodigy +WORKDIR /prodigy +COPY ./prodigy-1.9.6-cp36.cp37.cp38-cp36m.cp37m.cp38-linux_x86_64.whl /prodigy +RUN mkdir /work +COPY ./ner /work/ner +RUN pip install uvicorn==0.11.5 prodigy-1.9.6-cp36.cp37.cp38-cp36m.cp37m.cp38-linux_x86_64.whl +RUN pip install https://files.kemt.fei.tuke.sk/models/spacy/sk_sk1-0.0.1.tar.gz +RUN pip install nltk +EXPOSE 8080 +ENV PRODIGY_HOME /work +ENV PRODIGY_HOST 0.0.0.0 +WORKDIR /work \ No newline at end of file diff --git a/pages/students/2016/jakub_maruniak/dp2021/annotation/README.md b/pages/students/2016/jakub_maruniak/dp2021/annotation/README.md index 63c2525d..5a8e6560 100644 --- a/pages/students/2016/jakub_maruniak/dp2021/annotation/README.md +++ b/pages/students/2016/jakub_maruniak/dp2021/annotation/README.md @@ -1,13 +1,11 @@ -## Diplomový projekt 1 2020 +## Diplomový projekt 2 2020 - vytvorenie a spustenie docker kontajneru ``` ./build-docker.sh -docker run -it -p 8080:8080 -v ${PWD}:/work prodigy bash -# (v mojom prípade:) -winpty docker run --name prodigy -it -p 8080:8080 -v 
C://Users/jakub/Desktop/annotation/work prodigy bash +winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/annotation-master/annotation/work prodigy bash ``` @@ -17,5 +15,12 @@ winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/ann - `dataminer.csv` články stiahnuté z wiki - `cd ner` - `./01_text_to_sent.sh` spustenie skriptu *text_to_sent.py*, ktorý rozdelí články na jednotlivé vety -- `./02_ner_correct.sh` spustenie anotačného procesu pre NER s návrhmi od modelu -- `./03_ner_export.sh` exportovanie anotovaných dát vo formáte jsonl potrebnom pre spracovanie vo spacy +- `./02_ner_manual.sh` spustenie manuálneho anotačného procesu pre NER +- `./03_export.sh` exportovanie anotovaných dát vo formáte json potrebnom pre spracovanie vo spacy. Možnosť rozdelenia na trénovacie (70%) a testovacie dáta (30%) (--eval-split 0.3). + +### Štatistika o anotovaných dátach +- `prodigy stats wikiart` - informácie o počte prijatých a odmietnutých článkov +- `python3 count.py` - informácie o počte jednotlivých entít + +### Trénovanie modelu +Založené na: https://git.kemt.fei.tuke.sk/dano/spacy-skmodel diff --git a/pages/students/2016/jakub_maruniak/dp2021/annotation/count.py b/pages/students/2016/jakub_maruniak/dp2021/annotation/count.py new file mode 100644 index 00000000..c4f7fe0d --- /dev/null +++ b/pages/students/2016/jakub_maruniak/dp2021/annotation/count.py @@ -0,0 +1,14 @@ +# load data +filename = 'ner/annotations.jsonl' +file = open(filename, 'rt', encoding='utf-8') +text = file.read() + +# count entity PER +countPER = text.count('PER') +countLOC = text.count('LOC') +countORG = text.count('ORG') +countMISC = text.count('MISC') +print('Počet anotovaných entít typu PER:', countPER,'\n', + 'Počet anotovaných entít typu LOC:', countLOC,'\n', + 'Počet anotovaných entít typu ORG:', countORG,'\n', + 'Počet anotovaných entít typu MISC:', countMISC,'\n') \ No newline at end of file diff --git a/pages/students/2016/jakub_maruniak/dp2021/annotation/ner/02_ner_correct.sh b/pages/students/2016/jakub_maruniak/dp2021/annotation/ner/02_ner_correct.sh deleted file mode 100644 index 7758b202..00000000 --- a/pages/students/2016/jakub_maruniak/dp2021/annotation/ner/02_ner_correct.sh +++ /dev/null @@ -1,3 +0,0 @@ - -prodigy ner.correct wikiart sk_sk1 ./textfile.csv --label OSOBA,MIESTO,ORGANIZACIA,PRODUKT - diff --git a/pages/students/2016/jakub_maruniak/dp2021/annotation/ner/02_ner_manual.sh b/pages/students/2016/jakub_maruniak/dp2021/annotation/ner/02_ner_manual.sh new file mode 100644 index 00000000..ab4f8407 --- /dev/null +++ b/pages/students/2016/jakub_maruniak/dp2021/annotation/ner/02_ner_manual.sh @@ -0,0 +1,2 @@ +prodigy ner.manual wikiart sk_sk1 ./textfile.csv --label PER,LOC,ORG,MISC + diff --git a/pages/students/2016/jakub_maruniak/dp2021/annotation/ner/03_export.sh b/pages/students/2016/jakub_maruniak/dp2021/annotation/ner/03_export.sh new file mode 100644 index 00000000..3adc2731 --- /dev/null +++ b/pages/students/2016/jakub_maruniak/dp2021/annotation/ner/03_export.sh @@ -0,0 +1 @@ +prodigy data-to-spacy ./train.json ./eval.json --lang sk --ner wikiart --eval-split 0.3 \ No newline at end of file diff --git a/pages/students/2016/jakub_maruniak/dp2021/annotation/ner/03_ner_export.sh b/pages/students/2016/jakub_maruniak/dp2021/annotation/ner/03_ner_export.sh deleted file mode 100644 index 3d60d2c3..00000000 --- a/pages/students/2016/jakub_maruniak/dp2021/annotation/ner/03_ner_export.sh +++ /dev/null @@ -1 +0,0 @@ -prodigy db-out wikiart > ./annotations.jsonl \ 
No newline at end of file diff --git a/pages/students/2016/jakub_maruniak/dp2021/annotation/train/prepare.sh b/pages/students/2016/jakub_maruniak/dp2021/annotation/train/prepare.sh new file mode 100644 index 00000000..ade40371 --- /dev/null +++ b/pages/students/2016/jakub_maruniak/dp2021/annotation/train/prepare.sh @@ -0,0 +1,19 @@ +mkdir -p build +mkdir -p build/input +# Prepare Treebank +mkdir -p build/input/slovak-treebank +spacy convert ./sources/slovak-treebank/stb.conll ./build/input/slovak-treebank +# UDAG used as evaluation +mkdir -p build/input/ud-artificial-gapping +spacy convert ./sources/ud-artificial-gapping/sk-ud-crawled-orphan.conllu ./build/input/ud-artificial-gapping +# Prepare skner +mkdir -p build/input/skner +# Convert to IOB +cat ./sources/skner/wikiann-sk.bio | python ./sources/bio-to-iob.py > build/input/skner/wikiann-sk.iob +# Split to train test +cat ./build/input/skner/wikiann-sk.iob | python ./sources/iob-to-traintest.py ./build/input/skner/wikiann-sk +# Convert train and test +mkdir -p build/input/skner-train +spacy convert -n 15 --converter ner ./build/input/skner/wikiann-sk.train ./build/input/skner-train +mkdir -p build/input/skner-test +spacy convert -n 15 --converter ner ./build/input/skner/wikiann-sk.test ./build/input/skner-test diff --git a/pages/students/2016/jakub_maruniak/dp2021/annotation/train/train.sh b/pages/students/2016/jakub_maruniak/dp2021/annotation/train/train.sh new file mode 100644 index 00000000..a0d1c7cf --- /dev/null +++ b/pages/students/2016/jakub_maruniak/dp2021/annotation/train/train.sh @@ -0,0 +1,19 @@ +set -e +OUTDIR=build/train/output +TRAINDIR=build/train +mkdir -p $TRAINDIR +mkdir -p $OUTDIR +mkdir -p dist +# Delete old training results +rm -rf $OUTDIR/* +# Train dependency and POS +spacy train sk $OUTDIR ./build/input/slovak-treebank ./build/input/ud-artificial-gapping --n-iter 20 -p tagger,parser +rm -rf $TRAINDIR/posparser +mv $OUTDIR/model-best $TRAINDIR/posparser +# Train NER +# python ./train.py -t ./train.json -o $TRAINDIR/nerposparser -n 10 -m $TRAINDIR/posparser/ +spacy train sk $TRAINDIR/nerposparser ./ner/train.json ./ner/eval.json --n-iter 20 -p ner +# Package model +spacy package $TRAINDIR/nerposparser dist --meta-path ./meta.json --force +cd dist/sk_sk1-0.2.0 +python ./setup.py sdist --dist-dir ../ diff --git a/pages/students/2016/jan_holp/README.md b/pages/students/2016/jan_holp/README.md index ac1ca10e..c053e6fd 100644 --- a/pages/students/2016/jan_holp/README.md +++ b/pages/students/2016/jan_holp/README.md @@ -31,11 +31,39 @@ Zásobník úloh: - Urobiť verejné demo - nasadenie pomocou systému Docker - zlepšenie Web UI +- vytvoriť REST api pre indexovanie dokumentu. - V indexe prideliť ohodnotenie každému dokumentu podľa viacerých metód, napr. PageRank - Využiť vyhodnotenie pri vyhľadávaní - **Použiť overovaciu databázu SCNC na vyhodnotenie každej metódy** - **Do konca zimného semestra vytvoriť "Mini Diplomovú prácu cca 8 strán s experimentami" vo forme článku** + +Virtuálne stretnutie 6.11:2020: + +Stav: + +- Riešenie problémov s cassandrou a javascriptom. Ako funguje funkcia then? + +Úlohy na ďalšie stretnutie: + +- vypracujte funkciu na indexovanie. Vstup je dokument (objekt s textom a metainformáciami). Fukcia zaindexuje dokument do ES. +- Naštudujte si ako funguje funkcia then a čo je to callback. +- Naštudujte si ako sa používa Promise. +- Naštudujte si ako funguje async - await. 
+- https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Asynchronous/ + + + +Virtuálne stretnutie 23.10:2020: + +Stav: +- Riešenie problémov s cassandrou. Ako vybrať dáta podľa primárneho kľúča. + +Do ďďalšiehio stretnutia: + +- pokračovať v otvorených úlohách. +- urobte funkciu pre indexovanie jedného dokumentu. + Virtuálne stretnutie 16.10. Stav: diff --git a/pages/students/2016/jan_holp/dp2021/zdrojove_subory/cassandra.js b/pages/students/2016/jan_holp/dp2021/zdrojove_subory/cassandra.js new file mode 100644 index 00000000..43e0157e --- /dev/null +++ b/pages/students/2016/jan_holp/dp2021/zdrojove_subory/cassandra.js @@ -0,0 +1,105 @@ +//Jan Holp, DP 2021 + + +//client1 = cassandra +//client2 = elasticsearch +//----------------------------------------------------------------- + +//require the Elasticsearch librray +const elasticsearch = require('elasticsearch'); +const client2 = new elasticsearch.Client({ + hosts: [ 'localhost:9200'] +}); +client2.ping({ + requestTimeout: 30000, + }, function(error) { + // at this point, eastic search is down, please check your Elasticsearch service + if (error) { + console.error('Elasticsearch cluster is down!'); + } else { + console.log('Everything is ok'); + } + }); + +//create new index skweb2 +client2.indices.create({ + index: 'skweb2' +}, function(error, response, status) { + if (error) { + console.log(error); + } else { + console.log("created a new index", response); + } +}); + +const cassandra = require('cassandra-driver'); +const client1 = new cassandra.Client({ contactPoints: ['localhost:9042'], localDataCenter: 'datacenter1', keyspace: 'websucker' }); +const query = 'SELECT title FROM websucker.content WHERE body_size > 0 ALLOW FILTERING'; +client1.execute(query) + .then(result => console.log(result)),function(error) { + if(error){ + console.error('Something is wrong!'); + console.log(error); + } else{ + console.log('Everything is ok'); + } + }; + +/* +async function indexData() { + + var i = 0; + const query = 'SELECT title FROM websucker.content WHERE body_size > 0 ALLOW FILTERING'; + client1.execute(query) + .then((result) => { + try { + //for ( i=0; i<15;i++){ + console.log('%s', result.row[0].title) + //} + } catch (query) { + if (query instanceof SyntaxError) { + console.log( "Neplatne query" ); + } + } + + + + }); + + + } + +/* + +//indexing method +const bulkIndex = function bulkIndex(index, type, data) { + let bulkBody = []; + id = 1; +const errorCount = 0; + data.forEach(item => { + bulkBody.push({ + index: { + _index: index, + _type: type, + _id : id++, + } + }); + bulkBody.push(item); + }); + console.log(bulkBody); + client.bulk({body: bulkBody}) + .then(response => { + + response.items.forEach(item => { + if (item.index && item.index.error) { + console.log(++errorCount, item.index.error); + } + }); + console.log( + `Successfully indexed ${data.length - errorCount} + out of ${data.length} items` + ); + }) + .catch(console.err); +}; +*/ \ No newline at end of file diff --git a/pages/students/2016/lukas_pokryvka/README.md b/pages/students/2016/lukas_pokryvka/README.md index 2a988b57..3366ac9d 100644 --- a/pages/students/2016/lukas_pokryvka/README.md +++ b/pages/students/2016/lukas_pokryvka/README.md @@ -23,13 +23,26 @@ Zásobník úloh : - tesla - xavier - Trénovanie na dvoch kartách na jednom stroji - - idoc + - idoc DONE - titan - možno trénovanie na 4 kartách na jednom - quadra - *Trénovanie na dvoch kartách na dvoch strojoch pomocou NCCL (idoc, tesla)* - možno trénovanie na 2 kartách na dvoch strojoch (quadra plus idoc). 
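For the NCCL two-machine item in the task list above, here is a minimal hedged sketch of how `torch.distributed` training is typically initialised with the NCCL backend. The environment variables (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`), the placeholder `nn.Linear` model and the host roles (rank 0 on idoc, the remaining ranks on tesla) are illustrative assumptions, not the project's actual configuration:

```python
# Minimal DDP sketch over NCCL on two machines (assumed hosts: idoc, tesla).
# MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE and LOCAL_RANK are assumed to be
# set in the environment of every process before this script starts.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    rank = int(os.environ["RANK"])              # global rank across both machines
    world_size = int(os.environ["WORLD_SIZE"])  # total number of processes (GPUs)
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 2).cuda(local_rank)   # placeholder model
    model = DDP(model, device_ids=[local_rank])

    # ... build a DataLoader with a DistributedSampler and run the training loop here ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Each machine starts one process per GPU; every process must see the same `MASTER_ADDR`/`MASTER_PORT` (pointing at the rank-0 host) before `init_process_group` is called.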
+Virtuálne stretnutie 27.10.2020 + +Stav: + +- Trénovanie na procesore, na 1 GPU, na 2 GPU na idoc +- Príprava podkladov na trénovanie na dvoch strojoch pomocou Pytorch. +- Vytvorený prístup na teslu a xavier. + +Úlohy na ďďalšie stretnutie: +- Štdúdium odbornej literatúry a vypracovanie poznámok. +- Pokračovať v otvorených úlohách zo zásobníka +- Vypracované skripty uložiť na GIT repozitár +- vytvorte repozitár dp2021 Stretnutie 2.10.2020 diff --git a/pages/students/2016/lukas_pokryvka/dp2021/README.md b/pages/students/2016/lukas_pokryvka/dp2021/README.md index edbe299d..8978bff6 100644 --- a/pages/students/2016/lukas_pokryvka/dp2021/README.md +++ b/pages/students/2016/lukas_pokryvka/dp2021/README.md @@ -1 +1,4 @@ -## Všetky skripty, súbory a konfigurácie \ No newline at end of file +## Všetky skripty, súbory a konfigurácie + +https://github.com/pytorch/examples/tree/master/imagenet +- malo by fungovat pre DDP, nedostupny imagenet subor z oficialnej stranky \ No newline at end of file diff --git a/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/data-unversioned/data.txt b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/data-unversioned/data.txt new file mode 100644 index 00000000..e69de29b diff --git a/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/data/data.txt b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/data/data.txt new file mode 100644 index 00000000..e69de29b diff --git a/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/model/__init__.py b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/model/__init__.py new file mode 100644 index 00000000..06d74050 Binary files /dev/null and b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/model/__init__.py differ diff --git a/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/model/benchmark_seg.py b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/model/benchmark_seg.py new file mode 100644 index 00000000..05384204 --- /dev/null +++ b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/model/benchmark_seg.py @@ -0,0 +1,76 @@ +import argparse +import datetime +import os +import socket +import sys + +import numpy as np +from torch.utils.tensorboard import SummaryWriter + +import torch +import torch.nn as nn +import torch.optim + +from torch.optim import SGD, Adam +from torch.utils.data import DataLoader + +from util.util import enumerateWithEstimate +from p2ch13.dsets import Luna2dSegmentationDataset, TrainingLuna2dSegmentationDataset, getCt +from util.logconf import logging +from util.util import xyz2irc +from p2ch13.model_seg import UNetWrapper, SegmentationAugmentation +from p2ch13.train_seg import LunaTrainingApp + +log = logging.getLogger(__name__) +# log.setLevel(logging.WARN) +# log.setLevel(logging.INFO) +log.setLevel(logging.DEBUG) + +class BenchmarkLuna2dSegmentationDataset(TrainingLuna2dSegmentationDataset): + def __len__(self): + # return 500 + return 5000 + return 1000 + +class LunaBenchmarkApp(LunaTrainingApp): + def initTrainDl(self): + train_ds = BenchmarkLuna2dSegmentationDataset( + val_stride=10, + isValSet_bool=False, + contextSlices_count=3, + # augmentation_dict=self.augmentation_dict, + ) + + batch_size = self.cli_args.batch_size + if self.use_cuda: + batch_size *= torch.cuda.device_count() + + train_dl = DataLoader( + train_ds, + batch_size=batch_size, + num_workers=self.cli_args.num_workers, + pin_memory=self.use_cuda, + ) + + return train_dl + + def main(self): + log.info("Starting {}, {}".format(type(self).__name__, self.cli_args)) + + train_dl = self.initTrainDl() + + for 
epoch_ndx in range(1, 2): + log.info("Epoch {} of {}, {}/{} batches of size {}*{}".format( + epoch_ndx, + self.cli_args.epochs, + len(train_dl), + len([]), + self.cli_args.batch_size, + (torch.cuda.device_count() if self.use_cuda else 1), + )) + + self.doTraining(epoch_ndx, train_dl) + + +if __name__ == '__main__': + LunaBenchmarkApp().main() diff --git a/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/model/dsets.py b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/model/dsets.py new file mode 100644 index 00000000..f16b1386 --- /dev/null +++ b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/model/dsets.py @@ -0,0 +1,401 @@ +import copy +import csv +import functools +import glob +import math +import os +import random + +from collections import namedtuple + +import SimpleITK as sitk +import numpy as np +import scipy.ndimage.morphology as morph + +import torch +import torch.cuda +import torch.nn.functional as F +from torch.utils.data import Dataset + +from util.disk import getCache +from util.util import XyzTuple, xyz2irc +from util.logconf import logging + +log = logging.getLogger(__name__) +# log.setLevel(logging.WARN) +# log.setLevel(logging.INFO) +log.setLevel(logging.DEBUG) + +raw_cache = getCache('part2ch13_raw') + +MaskTuple = namedtuple('MaskTuple', 'raw_dense_mask, dense_mask, body_mask, air_mask, raw_candidate_mask, candidate_mask, lung_mask, neg_mask, pos_mask') + +CandidateInfoTuple = namedtuple('CandidateInfoTuple', 'isNodule_bool, hasAnnotation_bool, isMal_bool, diameter_mm, series_uid, center_xyz') + +@functools.lru_cache(1) +def getCandidateInfoList(requireOnDisk_bool=True): + # We construct a set with all series_uids that are present on disk. + # This will let us use the data, even if we haven't downloaded all of + # the subsets yet. 
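+    # Each '*.mhd' filename with its 4-character '.mhd' suffix stripped is a series_uid;
+    # candidates whose series_uid is missing from this set are skipped below when requireOnDisk_bool is True.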
+ mhd_list = glob.glob('data-unversioned/subset*/*.mhd') + presentOnDisk_set = {os.path.split(p)[-1][:-4] for p in mhd_list} + + candidateInfo_list = [] + with open('data/annotations_with_malignancy.csv', "r") as f: + for row in list(csv.reader(f))[1:]: + series_uid = row[0] + annotationCenter_xyz = tuple([float(x) for x in row[1:4]]) + annotationDiameter_mm = float(row[4]) + isMal_bool = {'False': False, 'True': True}[row[5]] + + candidateInfo_list.append( + CandidateInfoTuple( + True, + True, + isMal_bool, + annotationDiameter_mm, + series_uid, + annotationCenter_xyz, + ) + ) + + with open('data/candidates.csv', "r") as f: + for row in list(csv.reader(f))[1:]: + series_uid = row[0] + + if series_uid not in presentOnDisk_set and requireOnDisk_bool: + continue + + isNodule_bool = bool(int(row[4])) + candidateCenter_xyz = tuple([float(x) for x in row[1:4]]) + + if not isNodule_bool: + candidateInfo_list.append( + CandidateInfoTuple( + False, + False, + False, + 0.0, + series_uid, + candidateCenter_xyz, + ) + ) + + candidateInfo_list.sort(reverse=True) + return candidateInfo_list + +@functools.lru_cache(1) +def getCandidateInfoDict(requireOnDisk_bool=True): + candidateInfo_list = getCandidateInfoList(requireOnDisk_bool) + candidateInfo_dict = {} + + for candidateInfo_tup in candidateInfo_list: + candidateInfo_dict.setdefault(candidateInfo_tup.series_uid, + []).append(candidateInfo_tup) + + return candidateInfo_dict + +class Ct: + def __init__(self, series_uid): + mhd_path = glob.glob( + 'data-unversioned/subset*/{}.mhd'.format(series_uid) + )[0] + + ct_mhd = sitk.ReadImage(mhd_path) + self.hu_a = np.array(sitk.GetArrayFromImage(ct_mhd), dtype=np.float32) + + # CTs are natively expressed in https://en.wikipedia.org/wiki/Hounsfield_scale + # HU are scaled oddly, with 0 g/cc (air, approximately) being -1000 and 1 g/cc (water) being 0. 
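+        # Chunks extracted from hu_a are clipped to [-1000, 1000] HU later, in getCtRawCandidate.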
+ + self.series_uid = series_uid + + self.origin_xyz = XyzTuple(*ct_mhd.GetOrigin()) + self.vxSize_xyz = XyzTuple(*ct_mhd.GetSpacing()) + self.direction_a = np.array(ct_mhd.GetDirection()).reshape(3, 3) + + candidateInfo_list = getCandidateInfoDict()[self.series_uid] + + self.positiveInfo_list = [ + candidate_tup + for candidate_tup in candidateInfo_list + if candidate_tup.isNodule_bool + ] + self.positive_mask = self.buildAnnotationMask(self.positiveInfo_list) + self.positive_indexes = (self.positive_mask.sum(axis=(1,2)) + .nonzero()[0].tolist()) + + def buildAnnotationMask(self, positiveInfo_list, threshold_hu = -700): + boundingBox_a = np.zeros_like(self.hu_a, dtype=np.bool) + + for candidateInfo_tup in positiveInfo_list: + center_irc = xyz2irc( + candidateInfo_tup.center_xyz, + self.origin_xyz, + self.vxSize_xyz, + self.direction_a, + ) + ci = int(center_irc.index) + cr = int(center_irc.row) + cc = int(center_irc.col) + + index_radius = 2 + try: + while self.hu_a[ci + index_radius, cr, cc] > threshold_hu and \ + self.hu_a[ci - index_radius, cr, cc] > threshold_hu: + index_radius += 1 + except IndexError: + index_radius -= 1 + + row_radius = 2 + try: + while self.hu_a[ci, cr + row_radius, cc] > threshold_hu and \ + self.hu_a[ci, cr - row_radius, cc] > threshold_hu: + row_radius += 1 + except IndexError: + row_radius -= 1 + + col_radius = 2 + try: + while self.hu_a[ci, cr, cc + col_radius] > threshold_hu and \ + self.hu_a[ci, cr, cc - col_radius] > threshold_hu: + col_radius += 1 + except IndexError: + col_radius -= 1 + + # assert index_radius > 0, repr([candidateInfo_tup.center_xyz, center_irc, self.hu_a[ci, cr, cc]]) + # assert row_radius > 0 + # assert col_radius > 0 + + boundingBox_a[ + ci - index_radius: ci + index_radius + 1, + cr - row_radius: cr + row_radius + 1, + cc - col_radius: cc + col_radius + 1] = True + + mask_a = boundingBox_a & (self.hu_a > threshold_hu) + + return mask_a + + def getRawCandidate(self, center_xyz, width_irc): + center_irc = xyz2irc(center_xyz, self.origin_xyz, self.vxSize_xyz, + self.direction_a) + + slice_list = [] + for axis, center_val in enumerate(center_irc): + start_ndx = int(round(center_val - width_irc[axis]/2)) + end_ndx = int(start_ndx + width_irc[axis]) + + assert center_val >= 0 and center_val < self.hu_a.shape[axis], repr([self.series_uid, center_xyz, self.origin_xyz, self.vxSize_xyz, center_irc, axis]) + + if start_ndx < 0: + # log.warning("Crop outside of CT array: {} {}, center:{} shape:{} width:{}".format( + # self.series_uid, center_xyz, center_irc, self.hu_a.shape, width_irc)) + start_ndx = 0 + end_ndx = int(width_irc[axis]) + + if end_ndx > self.hu_a.shape[axis]: + # log.warning("Crop outside of CT array: {} {}, center:{} shape:{} width:{}".format( + # self.series_uid, center_xyz, center_irc, self.hu_a.shape, width_irc)) + end_ndx = self.hu_a.shape[axis] + start_ndx = int(self.hu_a.shape[axis] - width_irc[axis]) + + slice_list.append(slice(start_ndx, end_ndx)) + + ct_chunk = self.hu_a[tuple(slice_list)] + pos_chunk = self.positive_mask[tuple(slice_list)] + + return ct_chunk, pos_chunk, center_irc + +@functools.lru_cache(1, typed=True) +def getCt(series_uid): + return Ct(series_uid) + +@raw_cache.memoize(typed=True) +def getCtRawCandidate(series_uid, center_xyz, width_irc): + ct = getCt(series_uid) + ct_chunk, pos_chunk, center_irc = ct.getRawCandidate(center_xyz, + width_irc) + ct_chunk.clip(-1000, 1000, ct_chunk) + return ct_chunk, pos_chunk, center_irc + +@raw_cache.memoize(typed=True) +def getCtSampleSize(series_uid): + ct = 
Ct(series_uid) + return int(ct.hu_a.shape[0]), ct.positive_indexes + + +class Luna2dSegmentationDataset(Dataset): + def __init__(self, + val_stride=0, + isValSet_bool=None, + series_uid=None, + contextSlices_count=3, + fullCt_bool=False, + ): + self.contextSlices_count = contextSlices_count + self.fullCt_bool = fullCt_bool + + if series_uid: + self.series_list = [series_uid] + else: + self.series_list = sorted(getCandidateInfoDict().keys()) + + if isValSet_bool: + assert val_stride > 0, val_stride + self.series_list = self.series_list[::val_stride] + assert self.series_list + elif val_stride > 0: + del self.series_list[::val_stride] + assert self.series_list + + self.sample_list = [] + for series_uid in self.series_list: + index_count, positive_indexes = getCtSampleSize(series_uid) + + if self.fullCt_bool: + self.sample_list += [(series_uid, slice_ndx) + for slice_ndx in range(index_count)] + else: + self.sample_list += [(series_uid, slice_ndx) + for slice_ndx in positive_indexes] + + self.candidateInfo_list = getCandidateInfoList() + + series_set = set(self.series_list) + self.candidateInfo_list = [cit for cit in self.candidateInfo_list + if cit.series_uid in series_set] + + self.pos_list = [nt for nt in self.candidateInfo_list + if nt.isNodule_bool] + + log.info("{!r}: {} {} series, {} slices, {} nodules".format( + self, + len(self.series_list), + {None: 'general', True: 'validation', False: 'training'}[isValSet_bool], + len(self.sample_list), + len(self.pos_list), + )) + + def __len__(self): + return len(self.sample_list) + + def __getitem__(self, ndx): + series_uid, slice_ndx = self.sample_list[ndx % len(self.sample_list)] + return self.getitem_fullSlice(series_uid, slice_ndx) + + def getitem_fullSlice(self, series_uid, slice_ndx): + ct = getCt(series_uid) + ct_t = torch.zeros((self.contextSlices_count * 2 + 1, 512, 512)) + + start_ndx = slice_ndx - self.contextSlices_count + end_ndx = slice_ndx + self.contextSlices_count + 1 + for i, context_ndx in enumerate(range(start_ndx, end_ndx)): + context_ndx = max(context_ndx, 0) + context_ndx = min(context_ndx, ct.hu_a.shape[0] - 1) + ct_t[i] = torch.from_numpy(ct.hu_a[context_ndx].astype(np.float32)) + + # CTs are natively expressed in https://en.wikipedia.org/wiki/Hounsfield_scale + # HU are scaled oddly, with 0 g/cc (air, approximately) being -1000 and 1 g/cc (water) being 0. 
+ # The lower bound gets rid of negative density stuff used to indicate out-of-FOV + # The upper bound nukes any weird hotspots and clamps bone down + ct_t.clamp_(-1000, 1000) + + pos_t = torch.from_numpy(ct.positive_mask[slice_ndx]).unsqueeze(0) + + return ct_t, pos_t, ct.series_uid, slice_ndx + + +class TrainingLuna2dSegmentationDataset(Luna2dSegmentationDataset): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + self.ratio_int = 2 + + def __len__(self): + return 300000 + + def shuffleSamples(self): + random.shuffle(self.candidateInfo_list) + random.shuffle(self.pos_list) + + def __getitem__(self, ndx): + candidateInfo_tup = self.pos_list[ndx % len(self.pos_list)] + return self.getitem_trainingCrop(candidateInfo_tup) + + def getitem_trainingCrop(self, candidateInfo_tup): + ct_a, pos_a, center_irc = getCtRawCandidate( + candidateInfo_tup.series_uid, + candidateInfo_tup.center_xyz, + (7, 96, 96), + ) + pos_a = pos_a[3:4] + + row_offset = random.randrange(0,32) + col_offset = random.randrange(0,32) + ct_t = torch.from_numpy(ct_a[:, row_offset:row_offset+64, + col_offset:col_offset+64]).to(torch.float32) + pos_t = torch.from_numpy(pos_a[:, row_offset:row_offset+64, + col_offset:col_offset+64]).to(torch.long) + + slice_ndx = center_irc.index + + return ct_t, pos_t, candidateInfo_tup.series_uid, slice_ndx + +class PrepcacheLunaDataset(Dataset): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + self.candidateInfo_list = getCandidateInfoList() + self.pos_list = [nt for nt in self.candidateInfo_list if nt.isNodule_bool] + + self.seen_set = set() + self.candidateInfo_list.sort(key=lambda x: x.series_uid) + + def __len__(self): + return len(self.candidateInfo_list) + + def __getitem__(self, ndx): + # candidate_t, pos_t, series_uid, center_t = super().__getitem__(ndx) + + candidateInfo_tup = self.candidateInfo_list[ndx] + getCtRawCandidate(candidateInfo_tup.series_uid, candidateInfo_tup.center_xyz, (7, 96, 96)) + + series_uid = candidateInfo_tup.series_uid + if series_uid not in self.seen_set: + self.seen_set.add(series_uid) + + getCtSampleSize(series_uid) + # ct = getCt(series_uid) + # for mask_ndx in ct.positive_indexes: + # build2dLungMask(series_uid, mask_ndx) + + return 0, 1 #candidate_t, pos_t, series_uid, center_t + + +class TvTrainingLuna2dSegmentationDataset(torch.utils.data.Dataset): + def __init__(self, isValSet_bool=False, val_stride=10, contextSlices_count=3): + assert contextSlices_count == 3 + data = torch.load('./imgs_and_masks.pt') + suids = list(set(data['suids'])) + trn_mask_suids = torch.arange(len(suids)) % val_stride < (val_stride - 1) + trn_suids = {s for i, s in zip(trn_mask_suids, suids) if i} + trn_mask = torch.tensor([(s in trn_suids) for s in data["suids"]]) + if not isValSet_bool: + self.imgs = data["imgs"][trn_mask] + self.masks = data["masks"][trn_mask] + self.suids = [s for s, i in zip(data["suids"], trn_mask) if i] + else: + self.imgs = data["imgs"][~trn_mask] + self.masks = data["masks"][~trn_mask] + self.suids = [s for s, i in zip(data["suids"], trn_mask) if not i] + # discard spurious hotspots and clamp bone + self.imgs.clamp_(-1000, 1000) + self.imgs /= 1000 + + + def __len__(self): + return len(self.imgs) + + def __getitem__(self, i): + oh, ow = torch.randint(0, 32, (2,)) + sl = self.masks.size(1)//2 + return self.imgs[i, :, oh: oh + 64, ow: ow + 64], 1, self.masks[i, sl: sl+1, oh: oh + 64, ow: ow + 64].to(torch.float32), self.suids[i], 9999 diff --git 
a/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/model/model.py b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/model/model.py new file mode 100644 index 00000000..20cecbb9 --- /dev/null +++ b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/model/model.py @@ -0,0 +1,224 @@ +import math +import random +from collections import namedtuple + +import torch +from torch import nn as nn +import torch.nn.functional as F + +from util.logconf import logging +from util.unet import UNet + +log = logging.getLogger(__name__) +# log.setLevel(logging.WARN) +# log.setLevel(logging.INFO) +log.setLevel(logging.DEBUG) + +class UNetWrapper(nn.Module): + def __init__(self, **kwargs): + super().__init__() + + self.input_batchnorm = nn.BatchNorm2d(kwargs['in_channels']) + self.unet = UNet(**kwargs) + self.final = nn.Sigmoid() + + self._init_weights() + + def _init_weights(self): + init_set = { + nn.Conv2d, + nn.Conv3d, + nn.ConvTranspose2d, + nn.ConvTranspose3d, + nn.Linear, + } + for m in self.modules(): + if type(m) in init_set: + nn.init.kaiming_normal_( + m.weight.data, mode='fan_out', nonlinearity='relu', a=0 + ) + if m.bias is not None: + fan_in, fan_out = \ + nn.init._calculate_fan_in_and_fan_out(m.weight.data) + bound = 1 / math.sqrt(fan_out) + nn.init.normal_(m.bias, -bound, bound) + + # nn.init.constant_(self.unet.last.bias, -4) + # nn.init.constant_(self.unet.last.bias, 4) + + + def forward(self, input_batch): + bn_output = self.input_batchnorm(input_batch) + un_output = self.unet(bn_output) + fn_output = self.final(un_output) + return fn_output + +class SegmentationAugmentation(nn.Module): + def __init__( + self, flip=None, offset=None, scale=None, rotate=None, noise=None + ): + super().__init__() + + self.flip = flip + self.offset = offset + self.scale = scale + self.rotate = rotate + self.noise = noise + + def forward(self, input_g, label_g): + transform_t = self._build2dTransformMatrix() + transform_t = transform_t.expand(input_g.shape[0], -1, -1) + transform_t = transform_t.to(input_g.device, torch.float32) + affine_t = F.affine_grid(transform_t[:,:2], + input_g.size(), align_corners=False) + + augmented_input_g = F.grid_sample(input_g, + affine_t, padding_mode='border', + align_corners=False) + augmented_label_g = F.grid_sample(label_g.to(torch.float32), + affine_t, padding_mode='border', + align_corners=False) + + if self.noise: + noise_t = torch.randn_like(augmented_input_g) + noise_t *= self.noise + + augmented_input_g += noise_t + + return augmented_input_g, augmented_label_g > 0.5 + + def _build2dTransformMatrix(self): + transform_t = torch.eye(3) + + for i in range(2): + if self.flip: + if random.random() > 0.5: + transform_t[i,i] *= -1 + + if self.offset: + offset_float = self.offset + random_float = (random.random() * 2 - 1) + transform_t[2,i] = offset_float * random_float + + if self.scale: + scale_float = self.scale + random_float = (random.random() * 2 - 1) + transform_t[i,i] *= 1.0 + scale_float * random_float + + if self.rotate: + angle_rad = random.random() * math.pi * 2 + s = math.sin(angle_rad) + c = math.cos(angle_rad) + + rotation_t = torch.tensor([ + [c, -s, 0], + [s, c, 0], + [0, 0, 1]]) + + transform_t @= rotation_t + + return transform_t + + +# MaskTuple = namedtuple('MaskTuple', 'raw_dense_mask, dense_mask, body_mask, air_mask, raw_candidate_mask, candidate_mask, lung_mask, neg_mask, pos_mask') +# +# class SegmentationMask(nn.Module): +# def __init__(self): +# super().__init__() +# +# self.conv_list = nn.ModuleList([ +# self._make_circle_conv(radius) for 
radius in range(1, 8) +# ]) +# +# def _make_circle_conv(self, radius): +# diameter = 1 + radius * 2 +# +# a = torch.linspace(-1, 1, steps=diameter)**2 +# b = (a[None] + a[:, None])**0.5 +# +# circle_weights = (b <= 1.0).to(torch.float32) +# +# conv = nn.Conv2d(1, 1, kernel_size=diameter, padding=radius, bias=False) +# conv.weight.data.fill_(1) +# conv.weight.data *= circle_weights / circle_weights.sum() +# +# return conv +# +# +# def erode(self, input_mask, radius, threshold=1): +# conv = self.conv_list[radius - 1] +# input_float = input_mask.to(torch.float32) +# result = conv(input_float) +# +# # log.debug(['erode in ', radius, threshold, input_float.min().item(), input_float.mean().item(), input_float.max().item()]) +# # log.debug(['erode out', radius, threshold, result.min().item(), result.mean().item(), result.max().item()]) +# +# return result >= threshold +# +# def deposit(self, input_mask, radius, threshold=0): +# conv = self.conv_list[radius - 1] +# input_float = input_mask.to(torch.float32) +# result = conv(input_float) +# +# # log.debug(['deposit in ', radius, threshold, input_float.min().item(), input_float.mean().item(), input_float.max().item()]) +# # log.debug(['deposit out', radius, threshold, result.min().item(), result.mean().item(), result.max().item()]) +# +# return result > threshold +# +# def fill_cavity(self, input_mask): +# cumsum = input_mask.cumsum(-1) +# filled_mask = (cumsum > 0) +# filled_mask &= (cumsum < cumsum[..., -1:]) +# cumsum = input_mask.cumsum(-2) +# filled_mask &= (cumsum > 0) +# filled_mask &= (cumsum < cumsum[..., -1:, :]) +# +# return filled_mask +# +# +# def forward(self, input_g, raw_pos_g): +# gcc_g = input_g + 1 +# +# with torch.no_grad(): +# # log.info(['gcc_g', gcc_g.min(), gcc_g.mean(), gcc_g.max()]) +# +# raw_dense_mask = gcc_g > 0.7 +# dense_mask = self.deposit(raw_dense_mask, 2) +# dense_mask = self.erode(dense_mask, 6) +# dense_mask = self.deposit(dense_mask, 4) +# +# body_mask = self.fill_cavity(dense_mask) +# air_mask = self.deposit(body_mask & ~dense_mask, 5) +# air_mask = self.erode(air_mask, 6) +# +# lung_mask = self.deposit(air_mask, 5) +# +# raw_candidate_mask = gcc_g > 0.4 +# raw_candidate_mask &= air_mask +# candidate_mask = self.erode(raw_candidate_mask, 1) +# candidate_mask = self.deposit(candidate_mask, 1) +# +# pos_mask = self.deposit((raw_pos_g > 0.5) & lung_mask, 2) +# +# neg_mask = self.deposit(candidate_mask, 1) +# neg_mask &= ~pos_mask +# neg_mask &= lung_mask +# +# # label_g = (neg_mask | pos_mask).to(torch.float32) +# label_g = (pos_mask).to(torch.float32) +# neg_g = neg_mask.to(torch.float32) +# pos_g = pos_mask.to(torch.float32) +# +# mask_dict = { +# 'raw_dense_mask': raw_dense_mask, +# 'dense_mask': dense_mask, +# 'body_mask': body_mask, +# 'air_mask': air_mask, +# 'raw_candidate_mask': raw_candidate_mask, +# 'candidate_mask': candidate_mask, +# 'lung_mask': lung_mask, +# 'neg_mask': neg_mask, +# 'pos_mask': pos_mask, +# } +# +# return label_g, neg_g, pos_g, lung_mask, mask_dict diff --git a/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/model/prepcache.py b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/model/prepcache.py new file mode 100644 index 00000000..9e867cde --- /dev/null +++ b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/model/prepcache.py @@ -0,0 +1,69 @@ +import timing +import argparse +import sys + +import numpy as np + +import torch.nn as nn +from torch.autograd import Variable +from torch.optim import SGD +from torch.utils.data import DataLoader + +from util.util import 
enumerateWithEstimate +from .dsets import PrepcacheLunaDataset, getCtSampleSize +from util.logconf import logging +# from .model import LunaModel + +log = logging.getLogger(__name__) +# log.setLevel(logging.WARN) +log.setLevel(logging.INFO) +# log.setLevel(logging.DEBUG) + + +class LunaPrepCacheApp: + @classmethod + def __init__(self, sys_argv=None): + if sys_argv is None: + sys_argv = sys.argv[1:] + + parser = argparse.ArgumentParser() + parser.add_argument('--batch-size', + help='Batch size to use for training', + default=1024, + type=int, + ) + parser.add_argument('--num-workers', + help='Number of worker processes for background data loading', + default=8, + type=int, + ) + # parser.add_argument('--scaled', + # help="Scale the CT chunks to square voxels.", + # default=False, + # action='store_true', + # ) + + self.cli_args = parser.parse_args(sys_argv) + + def main(self): + log.info("Starting {}, {}".format(type(self).__name__, self.cli_args)) + + self.prep_dl = DataLoader( + PrepcacheLunaDataset( + # sortby_str='series_uid', + ), + batch_size=self.cli_args.batch_size, + num_workers=self.cli_args.num_workers, + ) + + batch_iter = enumerateWithEstimate( + self.prep_dl, + "Stuffing cache", + start_ndx=self.prep_dl.num_workers, + ) + for batch_ndx, batch_tup in batch_iter: + pass + + +if __name__ == '__main__': + LunaPrepCacheApp().main() diff --git a/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/util/__init__.py b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/util/__init__.py new file mode 100644 index 00000000..06d74050 Binary files /dev/null and b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/util/__init__.py differ diff --git a/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/util/augmentation.py b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/util/augmentation.py new file mode 100644 index 00000000..c5345e84 --- /dev/null +++ b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/util/augmentation.py @@ -0,0 +1,331 @@ +import math +import random +import warnings + +import numpy as np +import scipy.ndimage + +import torch +from torch.autograd import Function +from torch.autograd.function import once_differentiable +import torch.backends.cudnn as cudnn + +from util.logconf import logging +log = logging.getLogger(__name__) +# log.setLevel(logging.WARN) +# log.setLevel(logging.INFO) +log.setLevel(logging.DEBUG) + +def cropToShape(image, new_shape, center_list=None, fill=0.0): + # log.debug([image.shape, new_shape, center_list]) + # assert len(image.shape) == 3, repr(image.shape) + + if center_list is None: + center_list = [int(image.shape[i] / 2) for i in range(3)] + + crop_list = [] + for i in range(0, 3): + crop_int = center_list[i] + if image.shape[i] > new_shape[i] and crop_int is not None: + + # We can't just do crop_int +/- shape/2 since shape might be odd + # and ints round down. 
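+            # Anchor the crop at center - floor(new_shape[i]/2) and take new_shape[i]
+            # elements from there (clamped to 0 at the low edge).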
+ start_int = crop_int - int(new_shape[i]/2) + end_int = start_int + new_shape[i] + crop_list.append(slice(max(0, start_int), end_int)) + else: + crop_list.append(slice(0, image.shape[i])) + + # log.debug([image.shape, crop_list]) + image = image[crop_list] + + crop_list = [] + for i in range(0, 3): + if image.shape[i] < new_shape[i]: + crop_int = int((new_shape[i] - image.shape[i]) / 2) + crop_list.append(slice(crop_int, crop_int + image.shape[i])) + else: + crop_list.append(slice(0, image.shape[i])) + + # log.debug([image.shape, crop_list]) + new_image = np.zeros(new_shape, dtype=image.dtype) + new_image[:] = fill + new_image[crop_list] = image + + return new_image + + +def zoomToShape(image, new_shape, square=True): + # assert image.shape[-1] in {1, 3, 4}, repr(image.shape) + + if square and image.shape[0] != image.shape[1]: + crop_int = min(image.shape[0], image.shape[1]) + new_shape = [crop_int, crop_int, image.shape[2]] + image = cropToShape(image, new_shape) + + zoom_shape = [new_shape[i] / image.shape[i] for i in range(3)] + + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + image = scipy.ndimage.interpolation.zoom( + image, zoom_shape, + output=None, order=0, mode='nearest', cval=0.0, prefilter=True) + + return image + +def randomOffset(image_list, offset_rows=0.125, offset_cols=0.125): + + center_list = [int(image_list[0].shape[i] / 2) for i in range(3)] + center_list[0] += int(offset_rows * (random.random() - 0.5) * 2) + center_list[1] += int(offset_cols * (random.random() - 0.5) * 2) + center_list[2] = None + + new_list = [] + for image in image_list: + new_image = cropToShape(image, image.shape, center_list) + new_list.append(new_image) + + return new_list + + +def randomZoom(image_list, scale=None, scale_min=0.8, scale_max=1.3): + if scale is None: + scale = scale_min + (scale_max - scale_min) * random.random() + + new_list = [] + for image in image_list: + # assert image.shape[-1] in {1, 3, 4}, repr(image.shape) + + with warnings.catch_warnings(): + warnings.simplefilter("ignore") + # log.info([image.shape]) + zimage = scipy.ndimage.interpolation.zoom( + image, [scale, scale, 1.0], + output=None, order=0, mode='nearest', cval=0.0, prefilter=True) + image = cropToShape(zimage, image.shape) + + new_list.append(image) + + return new_list + + +_randomFlip_transform_list = [ + # lambda a: np.rot90(a, axes=(0, 1)), + # lambda a: np.flip(a, 0), + lambda a: np.flip(a, 1), +] + +def randomFlip(image_list, transform_bits=None): + if transform_bits is None: + transform_bits = random.randrange(0, 2 ** len(_randomFlip_transform_list)) + + new_list = [] + for image in image_list: + # assert image.shape[-1] in {1, 3, 4}, repr(image.shape) + + for n in range(len(_randomFlip_transform_list)): + if transform_bits & 2**n: + # prhist(image, 'before') + image = _randomFlip_transform_list[n](image) + # prhist(image, 'after ') + + new_list.append(image) + + return new_list + + +def randomSpin(image_list, angle=None, range_tup=None, axes=(0, 1)): + if range_tup is None: + range_tup = (0, 360) + + if angle is None: + angle = range_tup[0] + (range_tup[1] - range_tup[0]) * random.random() + + new_list = [] + for image in image_list: + # assert image.shape[-1] in {1, 3, 4}, repr(image.shape) + + image = scipy.ndimage.interpolation.rotate( + image, angle, axes=axes, reshape=False, + output=None, order=0, mode='nearest', cval=0.0, prefilter=True) + + new_list.append(image) + + return new_list + + +def randomNoise(image_list, noise_min=-0.1, noise_max=0.1): + noise = 
np.zeros_like(image_list[0]) + noise += (noise_max - noise_min) * np.random.random_sample(image_list[0].shape) + noise_min + noise *= 5 + noise = scipy.ndimage.filters.gaussian_filter(noise, 3) + # noise += (noise_max - noise_min) * np.random.random_sample(image_hsv.shape) + noise_min + + new_list = [] + for image_hsv in image_list: + image_hsv = image_hsv + noise + + new_list.append(image_hsv) + + return new_list + + +def randomHsvShift(image_list, h=None, s=None, v=None, + h_min=-0.1, h_max=0.1, + s_min=0.5, s_max=2.0, + v_min=0.5, v_max=2.0): + if h is None: + h = h_min + (h_max - h_min) * random.random() + if s is None: + s = s_min + (s_max - s_min) * random.random() + if v is None: + v = v_min + (v_max - v_min) * random.random() + + new_list = [] + for image_hsv in image_list: + # assert image_hsv.shape[-1] == 3, repr(image_hsv.shape) + + image_hsv[:,:,0::3] += h + image_hsv[:,:,1::3] = image_hsv[:,:,1::3] ** s + image_hsv[:,:,2::3] = image_hsv[:,:,2::3] ** v + + new_list.append(image_hsv) + + return clampHsv(new_list) + + +def clampHsv(image_list): + new_list = [] + for image_hsv in image_list: + image_hsv = image_hsv.clone() + + # Hue wraps around + image_hsv[:,:,0][image_hsv[:,:,0] > 1] -= 1 + image_hsv[:,:,0][image_hsv[:,:,0] < 0] += 1 + + # Everything else clamps between 0 and 1 + image_hsv[image_hsv > 1] = 1 + image_hsv[image_hsv < 0] = 0 + + new_list.append(image_hsv) + + return new_list + + +# def torch_augment(input): +# theta = random.random() * math.pi * 2 +# s = math.sin(theta) +# c = math.cos(theta) +# c1 = 1 - c +# axis_vector = torch.rand(3, device='cpu', dtype=torch.float64) +# axis_vector -= 0.5 +# axis_vector /= axis_vector.abs().sum() +# l, m, n = axis_vector +# +# matrix = torch.tensor([ +# [l*l*c1 + c, m*l*c1 - n*s, n*l*c1 + m*s, 0], +# [l*m*c1 + n*s, m*m*c1 + c, n*m*c1 - l*s, 0], +# [l*n*c1 - m*s, m*n*c1 + l*s, n*n*c1 + c, 0], +# [0, 0, 0, 1], +# ], device=input.device, dtype=torch.float32) +# +# return th_affine3d(input, matrix) + + + + +# following from https://github.com/ncullen93/torchsample/blob/master/torchsample/utils.py +# MIT licensed + +# def th_affine3d(input, matrix): +# """ +# 3D Affine image transform on torch.Tensor +# """ +# A = matrix[:3,:3] +# b = matrix[:3,3] +# +# # make a meshgrid of normal coordinates +# coords = th_iterproduct(input.size(-3), input.size(-2), input.size(-1), dtype=torch.float32) +# +# # shift the coordinates so center is the origin +# coords[:,0] = coords[:,0] - (input.size(-3) / 2. - 0.5) +# coords[:,1] = coords[:,1] - (input.size(-2) / 2. - 0.5) +# coords[:,2] = coords[:,2] - (input.size(-1) / 2. - 0.5) +# +# # apply the coordinate transformation +# new_coords = coords.mm(A.t().contiguous()) + b.expand_as(coords) +# +# # shift the coordinates back so origin is origin +# new_coords[:,0] = new_coords[:,0] + (input.size(-3) / 2. - 0.5) +# new_coords[:,1] = new_coords[:,1] + (input.size(-2) / 2. - 0.5) +# new_coords[:,2] = new_coords[:,2] + (input.size(-1) / 2. 
- 0.5) +# +# # map new coordinates using bilinear interpolation +# input_transformed = th_trilinear_interp3d(input, new_coords) +# +# return input_transformed +# +# +# def th_trilinear_interp3d(input, coords): +# """ +# trilinear interpolation of 3D torch.Tensor image +# """ +# # take clamp then floor/ceil of x coords +# x = torch.clamp(coords[:,0], 0, input.size(-3)-2) +# x0 = x.floor() +# x1 = x0 + 1 +# # take clamp then floor/ceil of y coords +# y = torch.clamp(coords[:,1], 0, input.size(-2)-2) +# y0 = y.floor() +# y1 = y0 + 1 +# # take clamp then floor/ceil of z coords +# z = torch.clamp(coords[:,2], 0, input.size(-1)-2) +# z0 = z.floor() +# z1 = z0 + 1 +# +# stride = torch.tensor(input.stride()[-3:], dtype=torch.int64, device=input.device) +# x0_ix = x0.mul(stride[0]).long() +# x1_ix = x1.mul(stride[0]).long() +# y0_ix = y0.mul(stride[1]).long() +# y1_ix = y1.mul(stride[1]).long() +# z0_ix = z0.mul(stride[2]).long() +# z1_ix = z1.mul(stride[2]).long() +# +# # input_flat = th_flatten(input) +# input_flat = x.contiguous().view(x[0], x[1], -1) +# +# vals_000 = input_flat[:, :, x0_ix+y0_ix+z0_ix] +# vals_001 = input_flat[:, :, x0_ix+y0_ix+z1_ix] +# vals_010 = input_flat[:, :, x0_ix+y1_ix+z0_ix] +# vals_011 = input_flat[:, :, x0_ix+y1_ix+z1_ix] +# vals_100 = input_flat[:, :, x1_ix+y0_ix+z0_ix] +# vals_101 = input_flat[:, :, x1_ix+y0_ix+z1_ix] +# vals_110 = input_flat[:, :, x1_ix+y1_ix+z0_ix] +# vals_111 = input_flat[:, :, x1_ix+y1_ix+z1_ix] +# +# xd = x - x0 +# yd = y - y0 +# zd = z - z0 +# xm1 = 1 - xd +# ym1 = 1 - yd +# zm1 = 1 - zd +# +# x_mapped = ( +# vals_000.mul(xm1).mul(ym1).mul(zm1) + +# vals_001.mul(xm1).mul(ym1).mul(zd) + +# vals_010.mul(xm1).mul(yd).mul(zm1) + +# vals_011.mul(xm1).mul(yd).mul(zd) + +# vals_100.mul(xd).mul(ym1).mul(zm1) + +# vals_101.mul(xd).mul(ym1).mul(zd) + +# vals_110.mul(xd).mul(yd).mul(zm1) + +# vals_111.mul(xd).mul(yd).mul(zd) +# ) +# +# return x_mapped.view_as(input) +# +# def th_iterproduct(*args, dtype=None): +# return torch.from_numpy(np.indices(args).reshape((len(args),-1)).T) +# +# def th_flatten(x): +# """Flatten tensor""" +# return x.contiguous().view(x[0], x[1], -1) diff --git a/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/util/disk.py b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/util/disk.py new file mode 100644 index 00000000..091d2bb6 --- /dev/null +++ b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/util/disk.py @@ -0,0 +1,136 @@ +import gzip + +from diskcache import FanoutCache, Disk +from diskcache.core import BytesType, MODE_BINARY, BytesIO + +from util.logconf import logging +log = logging.getLogger(__name__) +# log.setLevel(logging.WARN) +log.setLevel(logging.INFO) +# log.setLevel(logging.DEBUG) + + +class GzipDisk(Disk): + def store(self, value, read, key=None): + """ + Override from base class diskcache.Disk. + + Chunking is due to needing to work on pythons < 2.7.13: + - Issue #27130: In the "zlib" module, fix handling of large buffers + (typically 2 or 4 GiB). Previously, inputs were limited to 2 GiB, and + compression and decompression operations did not properly handle results of + 2 or 4 GiB. 
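+        The value is therefore written to the gzip stream in 2**30-byte chunks.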
+ + :param value: value to convert + :param bool read: True when value is file-like object + :return: (size, mode, filename, value) tuple for Cache table + """ + # pylint: disable=unidiomatic-typecheck + if type(value) is BytesType: + if read: + value = value.read() + read = False + + str_io = BytesIO() + gz_file = gzip.GzipFile(mode='wb', compresslevel=1, fileobj=str_io) + + for offset in range(0, len(value), 2**30): + gz_file.write(value[offset:offset+2**30]) + gz_file.close() + + value = str_io.getvalue() + + return super(GzipDisk, self).store(value, read) + + + def fetch(self, mode, filename, value, read): + """ + Override from base class diskcache.Disk. + + Chunking is due to needing to work on pythons < 2.7.13: + - Issue #27130: In the "zlib" module, fix handling of large buffers + (typically 2 or 4 GiB). Previously, inputs were limited to 2 GiB, and + compression and decompression operations did not properly handle results of + 2 or 4 GiB. + + :param int mode: value mode raw, binary, text, or pickle + :param str filename: filename of corresponding value + :param value: database value + :param bool read: when True, return an open file handle + :return: corresponding Python value + """ + value = super(GzipDisk, self).fetch(mode, filename, value, read) + + if mode == MODE_BINARY: + str_io = BytesIO(value) + gz_file = gzip.GzipFile(mode='rb', fileobj=str_io) + read_csio = BytesIO() + + while True: + uncompressed_data = gz_file.read(2**30) + if uncompressed_data: + read_csio.write(uncompressed_data) + else: + break + + value = read_csio.getvalue() + + return value + +def getCache(scope_str): + return FanoutCache('data-unversioned/cache/' + scope_str, + disk=GzipDisk, + shards=64, + timeout=1, + size_limit=3e11, + # disk_min_file_size=2**20, + ) + +# def disk_cache(base_path, memsize=2): +# def disk_cache_decorator(f): +# @functools.wraps(f) +# def wrapper(*args, **kwargs): +# args_str = repr(args) + repr(sorted(kwargs.items())) +# file_str = hashlib.md5(args_str.encode('utf8')).hexdigest() +# +# cache_path = os.path.join(base_path, f.__name__, file_str + '.pkl.gz') +# +# if not os.path.exists(os.path.dirname(cache_path)): +# os.makedirs(os.path.dirname(cache_path), exist_ok=True) +# +# if os.path.exists(cache_path): +# return pickle_loadgz(cache_path) +# else: +# ret = f(*args, **kwargs) +# pickle_dumpgz(cache_path, ret) +# return ret +# +# return wrapper +# +# return disk_cache_decorator +# +# +# def pickle_dumpgz(file_path, obj): +# log.debug("Writing {}".format(file_path)) +# with open(file_path, 'wb') as file_obj: +# with gzip.GzipFile(mode='wb', compresslevel=1, fileobj=file_obj) as gz_file: +# pickle.dump(obj, gz_file, pickle.HIGHEST_PROTOCOL) +# +# +# def pickle_loadgz(file_path): +# log.debug("Reading {}".format(file_path)) +# with open(file_path, 'rb') as file_obj: +# with gzip.GzipFile(mode='rb', fileobj=file_obj) as gz_file: +# return pickle.load(gz_file) +# +# +# def dtpath(dt=None): +# if dt is None: +# dt = datetime.datetime.now() +# +# return str(dt).rsplit('.', 1)[0].replace(' ', '--').replace(':', '.') +# +# +# def safepath(s): +# s = s.replace(' ', '_') +# return re.sub('[^A-Za-z0-9_.-]', '', s) diff --git a/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/util/logconf.py b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/util/logconf.py new file mode 100644 index 00000000..65f7b9da --- /dev/null +++ b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/util/logconf.py @@ -0,0 +1,19 @@ +import logging +import logging.handlers + +root_logger = 
logging.getLogger() +root_logger.setLevel(logging.INFO) + +# Some libraries attempt to add their own root logger handlers. This is +# annoying and so we get rid of them. +for handler in list(root_logger.handlers): + root_logger.removeHandler(handler) + +logfmt_str = "%(asctime)s %(levelname)-8s pid:%(process)d %(name)s:%(lineno)03d:%(funcName)s %(message)s" +formatter = logging.Formatter(logfmt_str) + +streamHandler = logging.StreamHandler() +streamHandler.setFormatter(formatter) +streamHandler.setLevel(logging.DEBUG) + +root_logger.addHandler(streamHandler) diff --git a/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/util/unet.py b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/util/unet.py new file mode 100644 index 00000000..9e16a525 --- /dev/null +++ b/pages/students/2016/lukas_pokryvka/dp2021/lungCancer/util/unet.py @@ -0,0 +1,143 @@ +# From https://github.com/jvanvugt/pytorch-unet +# https://raw.githubusercontent.com/jvanvugt/pytorch-unet/master/unet.py + +# MIT License +# +# Copyright (c) 2018 Joris +# +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: +# +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. + +# Adapted from https://discuss.pytorch.org/t/unet-implementation/426 + +import torch +from torch import nn +import torch.nn.functional as F + + +class UNet(nn.Module): + def __init__(self, in_channels=1, n_classes=2, depth=5, wf=6, padding=False, + batch_norm=False, up_mode='upconv'): + """ + Implementation of + U-Net: Convolutional Networks for Biomedical Image Segmentation + (Ronneberger et al., 2015) + https://arxiv.org/abs/1505.04597 + + Using the default arguments will yield the exact version used + in the original paper + + Args: + in_channels (int): number of input channels + n_classes (int): number of output channels + depth (int): depth of the network + wf (int): number of filters in the first layer is 2**wf + padding (bool): if True, apply padding such that the input shape + is the same as the output. + This may introduce artifacts + batch_norm (bool): Use BatchNorm after layers with an + activation function + up_mode (str): one of 'upconv' or 'upsample'. + 'upconv' will use transposed convolutions for + learned upsampling. + 'upsample' will use bilinear upsampling. 
+ """ + super(UNet, self).__init__() + assert up_mode in ('upconv', 'upsample') + self.padding = padding + self.depth = depth + prev_channels = in_channels + self.down_path = nn.ModuleList() + for i in range(depth): + self.down_path.append(UNetConvBlock(prev_channels, 2**(wf+i), + padding, batch_norm)) + prev_channels = 2**(wf+i) + + self.up_path = nn.ModuleList() + for i in reversed(range(depth - 1)): + self.up_path.append(UNetUpBlock(prev_channels, 2**(wf+i), up_mode, + padding, batch_norm)) + prev_channels = 2**(wf+i) + + self.last = nn.Conv2d(prev_channels, n_classes, kernel_size=1) + + def forward(self, x): + blocks = [] + for i, down in enumerate(self.down_path): + x = down(x) + if i != len(self.down_path)-1: + blocks.append(x) + x = F.avg_pool2d(x, 2) + + for i, up in enumerate(self.up_path): + x = up(x, blocks[-i-1]) + + return self.last(x) + + +class UNetConvBlock(nn.Module): + def __init__(self, in_size, out_size, padding, batch_norm): + super(UNetConvBlock, self).__init__() + block = [] + + block.append(nn.Conv2d(in_size, out_size, kernel_size=3, + padding=int(padding))) + block.append(nn.ReLU()) + # block.append(nn.LeakyReLU()) + if batch_norm: + block.append(nn.BatchNorm2d(out_size)) + + block.append(nn.Conv2d(out_size, out_size, kernel_size=3, + padding=int(padding))) + block.append(nn.ReLU()) + # block.append(nn.LeakyReLU()) + if batch_norm: + block.append(nn.BatchNorm2d(out_size)) + + self.block = nn.Sequential(*block) + + def forward(self, x): + out = self.block(x) + return out + + +class UNetUpBlock(nn.Module): + def __init__(self, in_size, out_size, up_mode, padding, batch_norm): + super(UNetUpBlock, self).__init__() + if up_mode == 'upconv': + self.up = nn.ConvTranspose2d(in_size, out_size, kernel_size=2, + stride=2) + elif up_mode == 'upsample': + self.up = nn.Sequential(nn.Upsample(mode='bilinear', scale_factor=2), + nn.Conv2d(in_size, out_size, kernel_size=1)) + + self.conv_block = UNetConvBlock(in_size, out_size, padding, batch_norm) + + def center_crop(self, layer, target_size): + _, _, layer_height, layer_width = layer.size() + diff_y = (layer_height - target_size[0]) // 2 + diff_x = (layer_width - target_size[1]) // 2 + return layer[:, :, diff_y:(diff_y + target_size[0]), diff_x:(diff_x + target_size[1])] + + def forward(self, x, bridge): + up = self.up(x) + crop1 = self.center_crop(bridge, up.shape[2:]) + out = torch.cat([up, crop1], 1) + out = self.conv_block(out) + + return out diff --git a/pages/students/2016/lukas_pokryvka/dp2021/mnist/mnist-dist.py b/pages/students/2016/lukas_pokryvka/dp2021/mnist/mnist-dist.py new file mode 100644 index 00000000..789bf6c8 --- /dev/null +++ b/pages/students/2016/lukas_pokryvka/dp2021/mnist/mnist-dist.py @@ -0,0 +1,105 @@ +import os +from datetime import datetime +import argparse +import torch.multiprocessing as mp +import torchvision +import torchvision.transforms as transforms +import torch +import torch.nn as nn +import torch.distributed as dist +from apex.parallel import DistributedDataParallel as DDP +from apex import amp + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N', + help='number of data loading workers (default: 4)') + parser.add_argument('-g', '--gpus', default=1, type=int, + help='number of gpus per node') + parser.add_argument('-nr', '--nr', default=0, type=int, + help='ranking within the nodes') + parser.add_argument('--epochs', default=2, type=int, metavar='N', + help='number of total epochs to run') + args = parser.parse_args() + 
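+    # world_size = total number of training processes (one per GPU on every node);
+    # MASTER_ADDR/MASTER_PORT below identify the rank-0 machine so all processes can rendezvous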
args.world_size = args.gpus * args.nodes + os.environ['MASTER_ADDR'] = '147.232.47.114' + os.environ['MASTER_PORT'] = '8888' + mp.spawn(train, nprocs=args.gpus, args=(args,)) + + +class ConvNet(nn.Module): + def __init__(self, num_classes=10): + super(ConvNet, self).__init__() + self.layer1 = nn.Sequential( + nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2), + nn.BatchNorm2d(16), + nn.ReLU(), + nn.MaxPool2d(kernel_size=2, stride=2)) + self.layer2 = nn.Sequential( + nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2), + nn.BatchNorm2d(32), + nn.ReLU(), + nn.MaxPool2d(kernel_size=2, stride=2)) + self.fc = nn.Linear(7*7*32, num_classes) + + def forward(self, x): + out = self.layer1(x) + out = self.layer2(out) + out = out.reshape(out.size(0), -1) + out = self.fc(out) + return out + + +def train(gpu, args): + rank = args.nr * args.gpus + gpu + dist.init_process_group(backend='nccl', init_method='env://', world_size=args.world_size, rank=rank) + torch.manual_seed(0) + model = ConvNet() + torch.cuda.set_device(gpu) + model.cuda(gpu) + batch_size = 10 + # define loss function (criterion) and optimizer + criterion = nn.CrossEntropyLoss().cuda(gpu) + optimizer = torch.optim.SGD(model.parameters(), 1e-4) + # Wrap the model + model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu]) + # Data loading code + train_dataset = torchvision.datasets.MNIST(root='./data', + train=True, + transform=transforms.ToTensor(), + download=True) + train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset, + num_replicas=args.world_size, + rank=rank) + train_loader = torch.utils.data.DataLoader(dataset=train_dataset, + batch_size=batch_size, + shuffle=False, + num_workers=0, + pin_memory=True, + sampler=train_sampler) + + start = datetime.now() + total_step = len(train_loader) + for epoch in range(args.epochs): + for i, (images, labels) in enumerate(train_loader): + images = images.cuda(non_blocking=True) + labels = labels.cuda(non_blocking=True) + # Forward pass + outputs = model(images) + loss = criterion(outputs, labels) + + # Backward and optimize + optimizer.zero_grad() + loss.backward() + optimizer.step() + if (i + 1) % 100 == 0 and gpu == 0: + print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step, + loss.item())) + if gpu == 0: + print("Training complete in: " + str(datetime.now() - start)) + + +if __name__ == '__main__': + torch.multiprocessing.set_start_method('spawn') + main() \ No newline at end of file diff --git a/pages/students/2016/lukas_pokryvka/dp2021/mnist/mnist.py b/pages/students/2016/lukas_pokryvka/dp2021/mnist/mnist.py new file mode 100644 index 00000000..9b72fa9c --- /dev/null +++ b/pages/students/2016/lukas_pokryvka/dp2021/mnist/mnist.py @@ -0,0 +1,92 @@ +import os +from datetime import datetime +import argparse +import torch.multiprocessing as mp +import torchvision +import torchvision.transforms as transforms +import torch +import torch.nn as nn +import torch.distributed as dist +from apex.parallel import DistributedDataParallel as DDP +from apex import amp + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N', + help='number of data loading workers (default: 4)') + parser.add_argument('-g', '--gpus', default=1, type=int, + help='number of gpus per node') + parser.add_argument('-nr', '--nr', default=0, type =int, + help='ranking within the nodes') + parser.add_argument('--epochs', default=2, type=int, metavar='N', + help='number of total 
epochs to run')
+    args = parser.parse_args()
+    train(0, args)
+
+
+class ConvNet(nn.Module):
+    def __init__(self, num_classes=10):
+        super(ConvNet, self).__init__()
+        self.layer1 = nn.Sequential(
+            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
+            nn.BatchNorm2d(16),
+            nn.ReLU(),
+            nn.MaxPool2d(kernel_size=2, stride=2))
+        self.layer2 = nn.Sequential(
+            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
+            nn.BatchNorm2d(32),
+            nn.ReLU(),
+            nn.MaxPool2d(kernel_size=2, stride=2))
+        self.fc = nn.Linear(7*7*32, num_classes)
+
+    def forward(self, x):
+        out = self.layer1(x)
+        out = self.layer2(out)
+        out = out.reshape(out.size(0), -1)
+        out = self.fc(out)
+        return out
+
+
+def train(gpu, args):
+    model = ConvNet()
+    torch.cuda.set_device(gpu)
+    model.cuda(gpu)
+    batch_size = 50
+    # define loss function (criterion) and optimizer
+    criterion = nn.CrossEntropyLoss().cuda(gpu)
+    optimizer = torch.optim.SGD(model.parameters(), 1e-4)
+    # Data loading code
+    train_dataset = torchvision.datasets.MNIST(root='./data',
+                                               train=True,
+                                               transform=transforms.ToTensor(),
+                                               download=True)
+    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
+                                               batch_size=batch_size,
+                                               shuffle=True,
+                                               num_workers=0,
+                                               pin_memory=True)
+
+    start = datetime.now()
+    total_step = len(train_loader)
+    for epoch in range(args.epochs):
+        for i, (images, labels) in enumerate(train_loader):
+            images = images.cuda(non_blocking=True)
+            labels = labels.cuda(non_blocking=True)
+            # Forward pass
+            outputs = model(images)
+            loss = criterion(outputs, labels)
+
+            # Backward and optimize
+            optimizer.zero_grad()
+            loss.backward()
+            optimizer.step()
+            if (i + 1) % 100 == 0 and gpu == 0:
+                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,
+                                                                         loss.item()))
+    if gpu == 0:
+        print("Training complete in: " + str(datetime.now() - start))
+
+
+if __name__ == '__main__':
+    main()
diff --git a/pages/students/2016/lukas_pokryvka/dp2021/yelp/data/random.csv b/pages/students/2016/lukas_pokryvka/dp2021/yelp/data/random.csv
new file mode 100644
index 00000000..e69de29b
diff --git a/pages/students/2016/lukas_pokryvka/dp2021/yelp/script.py b/pages/students/2016/lukas_pokryvka/dp2021/yelp/script.py
new file mode 100644
index 00000000..b0629a18
--- /dev/null
+++ b/pages/students/2016/lukas_pokryvka/dp2021/yelp/script.py
@@ -0,0 +1,748 @@
+from argparse import Namespace
+from collections import Counter
+import json
+import os
+import re
+import string
+
+import numpy as np
+import pandas as pd
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+import torch.optim as optim
+from torch.utils.data import Dataset, DataLoader
+from tqdm.notebook import tqdm
+
+
+class Vocabulary(object):
+    """Class to process text and extract vocabulary for mapping"""
+
+    def __init__(self, token_to_idx=None, add_unk=True, unk_token="<UNK>"):
+        """
+        Args:
+            token_to_idx (dict): a pre-existing map of tokens to indices
+            add_unk (bool): a flag that indicates whether to add the UNK token
+            unk_token (str): the UNK token to add into the Vocabulary
+        """
+
+        if token_to_idx is None:
+            token_to_idx = {}
+        self._token_to_idx = token_to_idx
+
+        self._idx_to_token = {idx: token
+                              for token, idx in self._token_to_idx.items()}
+
+        self._add_unk = add_unk
+        self._unk_token = unk_token
+
+        self.unk_index = -1
+        if add_unk:
+            self.unk_index = self.add_token(unk_token)
+
+
+    def to_serializable(self):
+        """ returns a dictionary that can be serialized """
+        return {'token_to_idx': self._token_to_idx,
+                'add_unk': self._add_unk,
+                'unk_token': self._unk_token}
+
+    @classmethod
+    def from_serializable(cls, contents):
+        """ instantiates the Vocabulary from a serialized dictionary """
+        return cls(**contents)
+
+    def add_token(self, token):
+        """Update mapping dicts based on the token.
+
+        Args:
+            token (str): the item to add into the Vocabulary
+        Returns:
+            index (int): the integer corresponding to the token
+        """
+        if token in self._token_to_idx:
+            index = self._token_to_idx[token]
+        else:
+            index = len(self._token_to_idx)
+            self._token_to_idx[token] = index
+            self._idx_to_token[index] = token
+        return index
+
+    def add_many(self, tokens):
+        """Add a list of tokens into the Vocabulary
+
+        Args:
+            tokens (list): a list of string tokens
+        Returns:
+            indices (list): a list of indices corresponding to the tokens
+        """
+        return [self.add_token(token) for token in tokens]
+
+    def lookup_token(self, token):
+        """Retrieve the index associated with the token
+          or the UNK index if token isn't present.
+
+        Args:
+            token (str): the token to look up
+        Returns:
+            index (int): the index corresponding to the token
+        Notes:
+            `unk_index` needs to be >=0 (having been added into the Vocabulary)
+              for the UNK functionality
+        """
+        if self.unk_index >= 0:
+            return self._token_to_idx.get(token, self.unk_index)
+        else:
+            return self._token_to_idx[token]
+
+    def lookup_index(self, index):
+        """Return the token associated with the index
+
+        Args:
+            index (int): the index to look up
+        Returns:
+            token (str): the token corresponding to the index
+        Raises:
+            KeyError: if the index is not in the Vocabulary
+        """
+        if index not in self._idx_to_token:
+            raise KeyError("the index (%d) is not in the Vocabulary" % index)
+        return self._idx_to_token[index]
+
+    def __str__(self):
+        return "<Vocabulary(size=%d)>" % len(self)
+
+    def __len__(self):
+        return len(self._token_to_idx)
+
+
+
+
+class ReviewVectorizer(object):
+    """ The Vectorizer which coordinates the Vocabularies and puts them to use"""
+    def __init__(self, review_vocab, rating_vocab):
+        """
+        Args:
+            review_vocab (Vocabulary): maps words to integers
+            rating_vocab (Vocabulary): maps class labels to integers
+        """
+        self.review_vocab = review_vocab
+        self.rating_vocab = rating_vocab
+
+    def vectorize(self, review):
+        """Create a collapsed one-hot vector for the review
+
+        Args:
+            review (str): the review
+        Returns:
+            one_hot (np.ndarray): the collapsed one-hot encoding
+        """
+        one_hot = np.zeros(len(self.review_vocab), dtype=np.float32)
+
+        for token in review.split(" "):
+            if token not in string.punctuation:
+                one_hot[self.review_vocab.lookup_token(token)] = 1
+
+        return one_hot
+
+    @classmethod
+    def from_dataframe(cls, review_df, cutoff=25):
+        """Instantiate the vectorizer from the dataset dataframe
+
+        Args:
+            review_df (pandas.DataFrame): the review dataset
+            cutoff (int): the parameter for frequency-based filtering
+        Returns:
+            an instance of the ReviewVectorizer
+        """
+        review_vocab = Vocabulary(add_unk=True)
+        rating_vocab = Vocabulary(add_unk=False)
+
+        # Add ratings
+        for rating in sorted(set(review_df.rating)):
+            rating_vocab.add_token(rating)
+
+        # Add top words if count > provided count
+        word_counts = Counter()
+        for review in review_df.review:
+            for word in review.split(" "):
+                if word not in string.punctuation:
+                    word_counts[word] += 1
+
+        for word, count in word_counts.items():
+            if count > cutoff:
+                review_vocab.add_token(word)
+
+        return cls(review_vocab, rating_vocab)
+
+    @classmethod
+    def from_serializable(cls, contents):
+        """Instantiate a ReviewVectorizer from a
serializable dictionary + + Args: + contents (dict): the serializable dictionary + Returns: + an instance of the ReviewVectorizer class + """ + review_vocab = Vocabulary.from_serializable(contents['review_vocab']) + rating_vocab = Vocabulary.from_serializable(contents['rating_vocab']) + + return cls(review_vocab=review_vocab, rating_vocab=rating_vocab) + + def to_serializable(self): + """Create the serializable dictionary for caching + + Returns: + contents (dict): the serializable dictionary + """ + return {'review_vocab': self.review_vocab.to_serializable(), + 'rating_vocab': self.rating_vocab.to_serializable()} + + + +class ReviewDataset(Dataset): + def __init__(self, review_df, vectorizer): + """ + Args: + review_df (pandas.DataFrame): the dataset + vectorizer (ReviewVectorizer): vectorizer instantiated from dataset + """ + self.review_df = review_df + self._vectorizer = vectorizer + + self.train_df = self.review_df[self.review_df.split=='train'] + self.train_size = len(self.train_df) + + self.val_df = self.review_df[self.review_df.split=='val'] + self.validation_size = len(self.val_df) + + self.test_df = self.review_df[self.review_df.split=='test'] + self.test_size = len(self.test_df) + + self._lookup_dict = {'train': (self.train_df, self.train_size), + 'val': (self.val_df, self.validation_size), + 'test': (self.test_df, self.test_size)} + + self.set_split('train') + + @classmethod + def load_dataset_and_make_vectorizer(cls, review_csv): + """Load dataset and make a new vectorizer from scratch + + Args: + review_csv (str): location of the dataset + Returns: + an instance of ReviewDataset + """ + review_df = pd.read_csv(review_csv) + train_review_df = review_df[review_df.split=='train'] + return cls(review_df, ReviewVectorizer.from_dataframe(train_review_df)) + + @classmethod + def load_dataset_and_load_vectorizer(cls, review_csv, vectorizer_filepath): + """Load dataset and the corresponding vectorizer. 
+ Used in the case in the vectorizer has been cached for re-use + + Args: + review_csv (str): location of the dataset + vectorizer_filepath (str): location of the saved vectorizer + Returns: + an instance of ReviewDataset + """ + review_df = pd.read_csv(review_csv) + vectorizer = cls.load_vectorizer_only(vectorizer_filepath) + return cls(review_df, vectorizer) + + @staticmethod + def load_vectorizer_only(vectorizer_filepath): + """a static method for loading the vectorizer from file + + Args: + vectorizer_filepath (str): the location of the serialized vectorizer + Returns: + an instance of ReviewVectorizer + """ + with open(vectorizer_filepath) as fp: + return ReviewVectorizer.from_serializable(json.load(fp)) + + def save_vectorizer(self, vectorizer_filepath): + """saves the vectorizer to disk using json + + Args: + vectorizer_filepath (str): the location to save the vectorizer + """ + with open(vectorizer_filepath, "w") as fp: + json.dump(self._vectorizer.to_serializable(), fp) + + def get_vectorizer(self): + """ returns the vectorizer """ + return self._vectorizer + + def set_split(self, split="train"): + """ selects the splits in the dataset using a column in the dataframe + + Args: + split (str): one of "train", "val", or "test" + """ + self._target_split = split + self._target_df, self._target_size = self._lookup_dict[split] + + def __len__(self): + return self._target_size + + def __getitem__(self, index): + """the primary entry point method for PyTorch datasets + + Args: + index (int): the index to the data point + Returns: + a dictionary holding the data point's features (x_data) and label (y_target) + """ + row = self._target_df.iloc[index] + + review_vector = \ + self._vectorizer.vectorize(row.review) + + rating_index = \ + self._vectorizer.rating_vocab.lookup_token(row.rating) + + return {'x_data': review_vector, + 'y_target': rating_index} + + def get_num_batches(self, batch_size): + """Given a batch size, return the number of batches in the dataset + + Args: + batch_size (int) + Returns: + number of batches in the dataset + """ + return len(self) // batch_size + +def generate_batches(dataset, batch_size, shuffle=True, + drop_last=True, device="cpu"): + """ + A generator function which wraps the PyTorch DataLoader. It will + ensure each tensor is on the write device location. + """ + dataloader = DataLoader(dataset=dataset, batch_size=batch_size, + shuffle=shuffle, drop_last=drop_last) + + for data_dict in dataloader: + out_data_dict = {} + for name, tensor in data_dict.items(): + out_data_dict[name] = data_dict[name].to(device) + yield out_data_dict + + + +class ReviewClassifier(nn.Module): + """ a simple perceptron based classifier """ + def __init__(self, num_features): + """ + Args: + num_features (int): the size of the input feature vector + """ + super(ReviewClassifier, self).__init__() + self.fc1 = nn.Linear(in_features=num_features, + out_features=1) + + def forward(self, x_in, apply_sigmoid=False): + """The forward pass of the classifier + + Args: + x_in (torch.Tensor): an input data tensor. + x_in.shape should be (batch, num_features) + apply_sigmoid (bool): a flag for the sigmoid activation + should be false if used with the Cross Entropy losses + Returns: + the resulting tensor. 
tensor.shape should be (batch,) + """ + y_out = self.fc1(x_in).squeeze() + if apply_sigmoid: + y_out = torch.sigmoid(y_out) + return y_out + + + + +def make_train_state(args): + return {'stop_early': False, + 'early_stopping_step': 0, + 'early_stopping_best_val': 1e8, + 'learning_rate': args.learning_rate, + 'epoch_index': 0, + 'train_loss': [], + 'train_acc': [], + 'val_loss': [], + 'val_acc': [], + 'test_loss': -1, + 'test_acc': -1, + 'model_filename': args.model_state_file} + +def update_train_state(args, model, train_state): + """Handle the training state updates. + + Components: + - Early Stopping: Prevent overfitting. + - Model Checkpoint: Model is saved if the model is better + + :param args: main arguments + :param model: model to train + :param train_state: a dictionary representing the training state values + :returns: + a new train_state + """ + + # Save one model at least + if train_state['epoch_index'] == 0: + torch.save(model.state_dict(), train_state['model_filename']) + train_state['stop_early'] = False + + # Save model if performance improved + elif train_state['epoch_index'] >= 1: + loss_tm1, loss_t = train_state['val_loss'][-2:] + + # If loss worsened + if loss_t >= train_state['early_stopping_best_val']: + # Update step + train_state['early_stopping_step'] += 1 + # Loss decreased + else: + # Save the best model + if loss_t < train_state['early_stopping_best_val']: + torch.save(model.state_dict(), train_state['model_filename']) + + # Reset early stopping step + train_state['early_stopping_step'] = 0 + + # Stop early ? + train_state['stop_early'] = \ + train_state['early_stopping_step'] >= args.early_stopping_criteria + + return train_state + +def compute_accuracy(y_pred, y_target): + y_target = y_target.cpu() + y_pred_indices = (torch.sigmoid(y_pred)>0.5).cpu().long()#.max(dim=1)[1] + n_correct = torch.eq(y_pred_indices, y_target).sum().item() + return n_correct / len(y_pred_indices) * 100 + + + + +def set_seed_everywhere(seed, cuda): + np.random.seed(seed) + torch.manual_seed(seed) + if cuda: + torch.cuda.manual_seed_all(seed) + +def handle_dirs(dirpath): + if not os.path.exists(dirpath): + os.makedirs(dirpath) + + + + +args = Namespace( + # Data and Path information + frequency_cutoff=25, + model_state_file='model.pth', + review_csv='data/yelp/reviews_with_splits_lite.csv', + # review_csv='data/yelp/reviews_with_splits_full.csv', + save_dir='model_storage/ch3/yelp/', + vectorizer_file='vectorizer.json', + # No Model hyper parameters + # Training hyper parameters + batch_size=128, + early_stopping_criteria=5, + learning_rate=0.001, + num_epochs=100, + seed=1337, + # Runtime options + catch_keyboard_interrupt=True, + cuda=True, + expand_filepaths_to_save_dir=True, + reload_from_files=False, +) + +if args.expand_filepaths_to_save_dir: + args.vectorizer_file = os.path.join(args.save_dir, + args.vectorizer_file) + + args.model_state_file = os.path.join(args.save_dir, + args.model_state_file) + + print("Expanded filepaths: ") + print("\t{}".format(args.vectorizer_file)) + print("\t{}".format(args.model_state_file)) + +# Check CUDA +if not torch.cuda.is_available(): + args.cuda = False +if torch.cuda.device_count() > 1: + print("Pouzivam", torch.cuda.device_count(), "graficke karty!") + +args.device = torch.device("cuda" if args.cuda else "cpu") + +# Set seed for reproducibility +set_seed_everywhere(args.seed, args.cuda) + +# handle dirs +handle_dirs(args.save_dir) + + + + +if args.reload_from_files: + # training from a checkpoint + print("Loading dataset and vectorizer") + 
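+    # reuse the vectorizer that was saved to disk on a previous run,
+    # so the token-to-index mapping matches the saved model checkpoint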
dataset = ReviewDataset.load_dataset_and_load_vectorizer(args.review_csv, + args.vectorizer_file) +else: + print("Loading dataset and creating vectorizer") + # create dataset and vectorizer + dataset = ReviewDataset.load_dataset_and_make_vectorizer(args.review_csv) + dataset.save_vectorizer(args.vectorizer_file) +vectorizer = dataset.get_vectorizer() + +classifier = ReviewClassifier(num_features=len(vectorizer.review_vocab)) + + + +classifier = nn.DataParallel(classifier) +classifier = classifier.to(args.device) + +loss_func = nn.BCEWithLogitsLoss() +optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate) +scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer, + mode='min', factor=0.5, + patience=1) + +train_state = make_train_state(args) + +epoch_bar = tqdm(desc='training routine', + total=args.num_epochs, + position=0) + +dataset.set_split('train') +train_bar = tqdm(desc='split=train', + total=dataset.get_num_batches(args.batch_size), + position=1, + leave=True) +dataset.set_split('val') +val_bar = tqdm(desc='split=val', + total=dataset.get_num_batches(args.batch_size), + position=1, + leave=True) + +try: + for epoch_index in range(args.num_epochs): + train_state['epoch_index'] = epoch_index + + # Iterate over training dataset + + # setup: batch generator, set loss and acc to 0, set train mode on + dataset.set_split('train') + batch_generator = generate_batches(dataset, + batch_size=args.batch_size, + device=args.device) + running_loss = 0.0 + running_acc = 0.0 + classifier.train() + + for batch_index, batch_dict in enumerate(batch_generator): + # the training routine is these 5 steps: + + # -------------------------------------- + # step 1. zero the gradients + optimizer.zero_grad() + + # step 2. compute the output + y_pred = classifier(x_in=batch_dict['x_data'].float()) + + # step 3. compute the loss + loss = loss_func(y_pred, batch_dict['y_target'].float()) + loss_t = loss.item() + running_loss += (loss_t - running_loss) / (batch_index + 1) + + # step 4. use loss to produce gradients + loss.backward() + + # step 5. use optimizer to take gradient step + optimizer.step() + # ----------------------------------------- + # compute the accuracy + acc_t = compute_accuracy(y_pred, batch_dict['y_target']) + running_acc += (acc_t - running_acc) / (batch_index + 1) + + # update bar + train_bar.set_postfix(loss=running_loss, + acc=running_acc, + epoch=epoch_index) + train_bar.update() + + train_state['train_loss'].append(running_loss) + train_state['train_acc'].append(running_acc) + + # Iterate over val dataset + + # setup: batch generator, set loss and acc to 0; set eval mode on + dataset.set_split('val') + batch_generator = generate_batches(dataset, + batch_size=args.batch_size, + device=args.device) + running_loss = 0. + running_acc = 0. + classifier.eval() + + for batch_index, batch_dict in enumerate(batch_generator): + + # compute the output + y_pred = classifier(x_in=batch_dict['x_data'].float()) + + # step 3. 
compute the loss + loss = loss_func(y_pred, batch_dict['y_target'].float()) + loss_t = loss.item() + running_loss += (loss_t - running_loss) / (batch_index + 1) + + # compute the accuracy + acc_t = compute_accuracy(y_pred, batch_dict['y_target']) + running_acc += (acc_t - running_acc) / (batch_index + 1) + + val_bar.set_postfix(loss=running_loss, + acc=running_acc, + epoch=epoch_index) + val_bar.update() + + train_state['val_loss'].append(running_loss) + train_state['val_acc'].append(running_acc) + + train_state = update_train_state(args=args, model=classifier, + train_state=train_state) + + scheduler.step(train_state['val_loss'][-1]) + + train_bar.n = 0 + val_bar.n = 0 + epoch_bar.update() + + if train_state['stop_early']: + break + + train_bar.n = 0 + val_bar.n = 0 + epoch_bar.update() +except KeyboardInterrupt: + print("Exiting loop") + + + + + + + +classifier.load_state_dict(torch.load(train_state['model_filename'])) +classifier = classifier.to(args.device) + +dataset.set_split('test') +batch_generator = generate_batches(dataset, + batch_size=args.batch_size, + device=args.device) +running_loss = 0. +running_acc = 0. +classifier.eval() + +for batch_index, batch_dict in enumerate(batch_generator): + # compute the output + y_pred = classifier(x_in=batch_dict['x_data'].float()) + + # compute the loss + loss = loss_func(y_pred, batch_dict['y_target'].float()) + loss_t = loss.item() + running_loss += (loss_t - running_loss) / (batch_index + 1) + + # compute the accuracy + acc_t = compute_accuracy(y_pred, batch_dict['y_target']) + running_acc += (acc_t - running_acc) / (batch_index + 1) + +train_state['test_loss'] = running_loss +train_state['test_acc'] = running_acc + + + + + + +print("Test loss: {:.3f}".format(train_state['test_loss'])) +print("Test Accuracy: {:.2f}".format(train_state['test_acc'])) + + + + + + +def preprocess_text(text): + text = text.lower() + text = re.sub(r"([.,!?])", r" \1 ", text) + text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text) + return text + + + + + +def predict_rating(review, classifier, vectorizer, decision_threshold=0.5): + """Predict the rating of a review + + Args: + review (str): the text of the review + classifier (ReviewClassifier): the trained model + vectorizer (ReviewVectorizer): the corresponding vectorizer + decision_threshold (float): The numerical boundary which separates the rating classes + """ + review = preprocess_text(review) + + vectorized_review = torch.tensor(vectorizer.vectorize(review)) + result = classifier(vectorized_review.view(1, -1)) + + probability_value = F.sigmoid(result).item() + index = 1 + if probability_value < decision_threshold: + index = 0 + + return vectorizer.rating_vocab.lookup_index(index) + + + + + +test_review = "this is a pretty awesome book" + +classifier = classifier.cpu() +prediction = predict_rating(test_review, classifier, vectorizer, decision_threshold=0.5) +print("{} -> {}".format(test_review, prediction)) + + + + + +# Sort weights +fc1_weights = classifier.fc1.weight.detach()[0] +_, indices = torch.sort(fc1_weights, dim=0, descending=True) +indices = indices.numpy().tolist() + +# Top 20 words +print("Influential words in Positive Reviews:") +print("--------------------------------------") +for i in range(20): + print(vectorizer.review_vocab.lookup_index(indices[i])) + +print("====\n\n\n") + +# Top 20 negative words +print("Influential words in Negative Reviews:") +print("--------------------------------------") +indices.reverse() +for i in range(20): + print(vectorizer.review_vocab.lookup_index(indices[i])) \ No 
newline at end of file
diff --git a/pages/students/2016/maros_harahus/README.md b/pages/students/2016/maros_harahus/README.md
index cdb36aa6..8b0c0f13 100644
--- a/pages/students/2016/maros_harahus/README.md
+++ b/pages/students/2016/maros_harahus/README.md
@@ -12,16 +12,42 @@ taxonomy:
 Task backlog:
 
+- Try to present at a local conference (Data, Znalosti and WIKT) or in the faculty proceedings (a short version of the diploma thesis).
+- Use the Multext East corpus in training. Create a mapping from Multext tags to SNK tags.
+
+
+Virtual meeting 6.11.2020
+
+Status:
+
+- Read 2 articles in detail and took notes. The notes are on Git.
+- Finished further experiments.
+
+Tasks for the next meeting:
+
+- Continue with the open tasks.
+
+
+Virtual meeting 30.10.2020
+
+Status:
+
+- The files are on Git.
+- Experiments carried out; the results are in a table.
+- Instructions for running the experiments are written.
+- Technical problems resolved. A Conda environment is available.
+
+Tasks for the next meeting:
+
 - Study the literature on "pretrain" and "word embedding"
-  - [Healthcare NERModelsUsing Language Model Pretraining](http://ceur-ws.org/Vol-2551/paper-04.pdf)
+  - [Healthcare NER Models Using Language Model Pretraining](http://ceur-ws.org/Vol-2551/paper-04.pdf)
   - [Design and implementation of an open source Greek POS Tagger and Entity Recognizer using spaCy](https://ieeexplore.ieee.org/abstract/document/8909591)
   - https://arxiv.org/abs/1909.00505
   - https://arxiv.org/abs/1607.04606
   - LSTM, recurrent neural network,
+  - Take notes from several articles; for each, record the source and what you learned.
 - Run several pretraining experiments - different models, different sizes of the adaptation data - and compile a table.
 - Describe pretraining and summarize its effect on training in a short article of about 10 pages.
-- Try to present at a local conference (Data, Znalosti and WIKT) or in the faculty proceedings (a short version of the diploma thesis).
-- Use the Multext East corpus in training. Create a mapping from Multext tags to SNK tags.
 
 
 Virtual meeting 8.10.2020
diff --git a/pages/students/2016/tomas_kucharik/README.md b/pages/students/2016/tomas_kucharik/README.md
index c25011b1..5ebb6bdc 100644
--- a/pages/students/2016/tomas_kucharik/README.md
+++ b/pages/students/2016/tomas_kucharik/README.md
@@ -21,6 +21,46 @@ The goal of the work is to prepare tools and build a so-called "Question Answering data
 ## Diploma project 2
 
+Task backlog:
+
+- Can we find out how much time an annotator spent on creating a question? If this can be determined from the DB schema, it would be good to show it in the web application.
+
+
+Virtual meeting 27.10.2020
+
+Status:
+
+- The web application was finished according to the instructions from the previous meeting; the code is on Git.
+
+Tasks for the next meeting:
+
+- Build a configuration system: load the configuration from a file (python-configuration?). The name of the configuration file should be changeable through an environment variable (getenv); see the sketch after this list.
+- Add authentication for annotators when displaying results, so that an annotator sees only their own results. Is it necessary? For now, implement it using the e-mail address only.
+- Add a password for the web application.
+- Add a display of good and bad annotations for each annotator.
+- Study the literature on "Crowdsourcing language resources". Select several publications (Scholar, Scopus), write down the bibliographic reference and what you learned from each about building language resources. What other corpora were created with this method?
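+
+A minimal sketch of the configuration idea from the first task above (an illustrative assumption, not the final design): the file name comes from a hypothetical QA_CONFIG environment variable and the file is plain JSON; the python-configuration package mentioned above could be used in the same way.
+
+```python
+import json
+import os
+
+# the config file name can be overridden through an environment variable
+config_path = os.getenv("QA_CONFIG", "config.json")
+
+# config.json is assumed to hold keys such as the database URI and the app password
+with open(config_path) as f:
+    config = json.load(f)
+
+print(config.get("database_uri"))
+```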
+
+
+
+Virtual meeting 20.10.2020
+
+Status:
+
+- Improved the data preparation script; a slight change of the interface (duplicated work caused by a gap in communication).
+
+Tasks for the next meeting:
+
+- Finish the web application for determining the amount of annotated data.
+- Debug the errors related to the new annotation schema.
+- Show the amount of annotated data.
+- Show the amount of valid annotated data.
+- Show the amount of validated data.
+- Questions must not repeat within one paragraph. Every question must have an answer. Every question must be longer than 10 characters or longer than 2 words. The answer must contain at least one word. The question must contain Slovak words.
+- Push the results to the project repository, directory database_app, as soon as possible.
+
+
 
 Meeting 25.9.2020
 
 Done:
diff --git a/pages/students/2017/martin_jancura/README.md b/pages/students/2017/martin_jancura/README.md
index fd6e21a5..fd9c7bb6 100644
--- a/pages/students/2017/martin_jancura/README.md
+++ b/pages/students/2017/martin_jancura/README.md
@@ -6,10 +6,8 @@ taxonomy:
     tag: [demo,nlp]
     author: Daniel Hladek
 ---
-
 # Martin Jancura
-
 *Year of starting the studies*: 2017
 
 ## Bachelor project 2020
@@ -31,9 +29,36 @@ Possible backends:
 Task backlog:
 
 - Prepare the backend.
-- Prepare the frontend in JavaScript.
+- Prepare the frontend in JavaScript - in progress.
 - Write the human-made translation into the database.
+
+
+Virtual meeting 6.11.2020:
+
+Status:
+
+Working on the written part of the thesis.
+
+Tasks for the next meeting:
+
+- Look for a library that lets us use our own translation model. Try installing OpenNMT.
+- Go through the tutorial https://github.com/OpenNMT/OpenNMT-py#quickstart or a similar one.
+- Propose how to connect the frontend and the backend.
+
+
+Virtual meeting 23.10.2020:
+
+Status:
+
+- Built a frontend that communicates with the Microsoft Translation API; it uses Axios and vanilla JavaScript.
+
+Tasks for the next meeting:
+
+- Look for a library that lets us use our own translation model. Try installing OpenNMT.
+- Find out what the CORS policy means.
+- Continue writing the thesis and add a section on machine translation. Read the articles at https://opennmt.net/OpenNMT/references/ and take notes. Each note should contain the bibliographic reference and what you learned from the article.
+
+
 Virtual meeting 16.10:
 
 Status:
diff --git a/pages/students/2018/martin_wencel/README.md b/pages/students/2018/martin_wencel/README.md
index e05ebd3e..6c1a2903 100644
--- a/pages/students/2018/martin_wencel/README.md
+++ b/pages/students/2018/martin_wencel/README.md
@@ -31,7 +31,42 @@ Proposed assignment:
 1. Propose possible improvements of the application you created.
 
 Task backlog:
-- Create a repository on Git named bp2010. Put the code and documentation you create into it.
+
+- Prepare a Docker image of your application following https://pythonspeed.com/docker/
+
+
+Virtual meeting 30.10.:
+
+Status:
+
+- Modified the existing "spacy-streamlit" application; the source code is on Git as instructed at the previous meeting.
+- It contains a form, but no REST API yet.
+
+Tasks for the next meeting:
+
+- Continue writing. Read research articles on "dependency parsing" and take notes on what you learned. Record the source.
+- Continue working on the demo web application.
+
+
+Virtual meeting 19.10.:
+
+Status:
+
+- Notes for the bachelor thesis written and submitted; they contain excerpts from the literature.
+- Repository created. https://git.kemt.fei.tuke.sk/mw223on/bp2020
+- The Slovak Spacy model installed and running.
+- Installed the Spacy REST API https://github.com/explosion/spacy-services
+- Tried out the displaCy demo with the Slovak model
+
+Tasks for the next meeting:
+
+- Prepare a web application that presents dependency parsing and named entity recognition for the Slovak language. It should consist of a frontend and a backend.
+- Write the required Python packages into a "requirements.txt" file.
+- Create a script for installing the application with pip.
+- Create a script for starting both the backend and the frontend. Put the results into the repository.
+- Create a frontend design (HTML + CSS).
+- Look at the spaCy source code and find out what exactly the displacy.serve command does.
+- Put the results into the repository.
 
 
 Virtual meeting 9.10.
diff --git a/pages/students/2018/ondrej_megela/README.md b/pages/students/2018/ondrej_megela/README.md
index 1e82c03f..b8486476 100644
--- a/pages/students/2018/ondrej_megela/README.md
+++ b/pages/students/2018/ondrej_megela/README.md
@@ -20,8 +20,21 @@ Proposed assignment:
 2. Create a language model with BERT or a similar method.
 3. Evaluate the created language model and propose improvements.
 
-Task backlog:
+
+Virtual meeting 30.10.2020
+
+Status:
+- Notes on seq2seq written up
+- PyTorch and fairseq installed
+- Problems with the tutorial. A possible solution is to use the 0.9.0 release: pip install fairseq==0.9.0
+
+For the next meeting:
+
+- Resolve the technical problems.
+- Go through the tutorial https://fairseq.readthedocs.io/en/latest/getting_started.html#training-a-new-model
 - Go through the tutorial https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md or a similar one.
+- Study articles on BERT and take notes on what you learned, together with the source.
+
 
 Virtual meeting 16.10.2020
diff --git a/pages/students/2018/samuel_sirotnik/README.md b/pages/students/2018/samuel_sirotnik/README.md
index 68bf7da8..0306a75a 100644
--- a/pages/students/2018/samuel_sirotnik/README.md
+++ b/pages/students/2018/samuel_sirotnik/README.md
@@ -23,13 +23,50 @@ An experimental Raspberry Pi cluster for teaching cloud technologies
 
 The goal of the project is to build a cheap home cluster for teaching cloud technologies.
 
+Task backlog:
+
+- Enable WSL2 and Docker Desktop if you use Windows.
+
+Virtual meeting 30.10.
+
+Status:
+- Written overview prepared according to the instructions
+- Raspberry Pi OS installed in VirtualBox
+- Preliminary HW design prepared
+- Docker Toolbox installed, as well as Ubuntu with Docker
+- Got familiar with Docker
+- Supervisor: HW purchase done - 5x RPi4 model B 8GB boards, 11x 128GB SD cards, 4x The Pi Hut Cluster Case for Raspberry Pi, a 60W power supply and 1x 18W Quick Charger Epico, a 220V cable and a socket with a switch.
+
+For the next meeting:
+
+- Can an official 5-port switch be bought?
+- Complete the purchase and agree on the way of handing it over. Sign the handover protocol.
+- Use https://kind.sigs.k8s.io to simulate a cluster.
+- Install https://microk8s.io/ and read the tutorials at https://ubuntu.com/tutorials/
+- Go through https://kubernetes.io/docs/tutorials/hello-minikube/ or a similar tutorial.
+
+
+Virtual meeting 16.10.
+
+
+Status:
+- Articles read
+- The Docker tutorial from ZCT started
+- The supervisor created access to a Jetson Xavier AGX2 with an ARM processor.
+- The purchase of the Raspberry Pi and accessories started.
+
+Tasks for the next meeting:
+- Prepare an overview of at least 4 existing Raspberry Pi cluster solutions (to be submitted). What hardware and software did they use?
+  - power supply, cooling, network interconnection
+- Get familiar with https://www.raspberrypi.org/downloads/raspberry-pi-os/
+- Install https://roboticsbackend.com/install-raspbian-desktop-on-a-virtual-machine-virtualbox/
+- Write a detailed hardware design for building the Raspberry Pi cluster.
 
 Meeting 29.9.
 
 We agreed on the thesis assignment.
 
 - Suggestions for improvement (for the supervisor):
 - Find out the conditions of financing (an estimate of 350 EUR).
diff --git a/pages/topics/named-entity/navod/README.md b/pages/topics/named-entity/navod/README.md
index f092aa61..a8aef370 100644
--- a/pages/topics/named-entity/navod/README.md
+++ b/pages/topics/named-entity/navod/README.md
@@ -39,23 +39,35 @@ Learning works by marking in the text which words belong to names of persons,
 
 Your task is to mark proper nouns in the text.
 
-A proper noun in Slovak usually begins with a capital letter, but it can also contain other words written in lower case.
+A proper noun in Slovak usually begins with a capital letter, but it can also contain other words written in lower case.
+If a proper noun contains another name inside it, e.g. Nové Mesto nad Váhom, annotate it as one unit.
 
 - PER: names of persons
 - LOC: geographical names
 - ORG: names of organizations
 - MISC: other names, e.g. product names.
 
-If a proper noun contains another name inside it, e.g. Nové Mesto nad Váhom, annotate it as one unit.
+In the text you will also come across words that name a geographical area but are not proper nouns (e.g. "britská kolónia", "londýnsky šerif"...). We do not consider such words named entities, so please do not mark them.
+
+If the text contains no annotations at all, the article is still valid, so choose Accept.
+
+If the text consists of only one word, or of a few words that carry no meaning on their own, the article is invalid, so choose Reject.
 
 ## Annotation batches
 
 Write your e-mail into the form so that it is possible to identify who performed the annotation.
 
+During annotation you can use keyboard shortcuts to make the work easier:
+- 1, 2, 3, 4 - switch between entity types
+- key "a" - Accept
+- key "x" - Reject
+- key "space" - Ignore
+- key "backspace" or "del" - Undo
+
+After annotating, do not forget to save your work (the icon in the top left corner, or "Ctrl + s").
+
 ### Trial annotation batch
 
 The batch is aimed at collecting feedback from the annotators to improve the interface and the annotation process.
 
 {% include "forms/form.html.twig" with { form: forms('ner1') } %}
-
-
diff --git a/pages/topics/question/navod/README.md b/pages/topics/question/navod/README.md
index 128512ec..0497a093 100644
--- a/pages/topics/question/navod/README.md
+++ b/pages/topics/question/navod/README.md
@@ -36,11 +36,12 @@ Learning works by creating an example with a question and an answer. Participation
 
 ## Instructions for annotators
 
-First you will be shown a short article. Your task is to read a part of the article, think of a question about it and mark the answer in the text. The answer to the question must be present in the text of the article. You have about 50 seconds to mark one question.
+First you will be shown a short article. Your task is to read a part of the article, think of a question about it and mark the answer in the text. The question must be unambiguous and the answer to the question must be present in the text of the article. You have about 50 seconds to mark one question.
 
-1. Read the article. If the article is not suitable, tap the red cross "reject" (Tab and then 'x').
+1. Read the article. If the article is not suitable, tap the red cross "Reject" (Tab and then 'x').
 2. Write a question. If you cannot come up with a question, tap "Ignore" (Tab and then 'i').
-3. Mark the answer with the mouse, tap the green check mark "Accept" (key 'a') and continue with another question for the same article or for a new article. The same text is shown at most 5 times.
+3. Mark the answer with the mouse, tap the green check mark "Accept" (key 'a') and continue with another question for the same article or for a new article.
+4. The same article will be shown to you 5 times; think of 5 different questions for it.
 
 If the displayed text is unsuitable, reject it. Unsuitable text:
@@ -61,6 +62,12 @@ If the displayed text is unsuitable, reject it. Unsuitable text:
 4. What is the lysosome used for?
 5. What is autophagy?
 
+An example of incorrect questions:
+1. What is the Golgi apparatus? - the answer is not present in the article.
+2. What happens in dead cells? - the question is not unambiguous; an exact answer is not present in the article.
+3. What is a normal physiological process? - the answer is not present in the article.
+
+
 Write your e-mail into the form so that it is possible to identify who performed the annotation.
 
 ## Annotation batches