This commit is contained in:
Daniel Hládek 2020-11-13 09:04:42 +01:00
commit f5455a89b3
40 changed files with 2860 additions and 48 deletions

View File

@ -13,11 +13,28 @@ Repozitár so [zdrojovými kódmi](https://git.kemt.fei.tuke.sk/dl874wn/dp2021)
## Diploma Project 2 2020
Virtual meeting 6.11.2020
Status:
- Table with 5 experiments completed.
- Repository created.
For the next meeting:
- Upload the code to the repository.
- Record the dependencies (package names) in a requirements.txt file.
- Rework the experiment so that it accepts command-line arguments via sys.argv (a minimal sketch follows this list).
- Add a run script for each experiment. The script should contain the parameters with which the experiment was run.
- Finish the report.
- In the theory chapter, add an overview of punctuation restoration methods and a description of your method.
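A minimal sketch of an experiment entry point that reads its parameters from sys.argv; the parameter names used here (input_file, epochs, learning_rate) are hypothetical placeholders for whatever the experiment actually needs:
```
# run_experiment.py - minimal sketch; parameter names are placeholders
import sys

def main(argv):
    if len(argv) != 4:
        print('usage: python run_experiment.py <input_file> <epochs> <learning_rate>')
        sys.exit(1)
    input_file = argv[1]            # path to the training data
    epochs = int(argv[2])           # number of training epochs
    learning_rate = float(argv[3])
    print('running experiment on', input_file, 'epochs =', epochs, 'lr =', learning_rate)
    # ... the actual experiment goes here ...

if __name__ == '__main__':
    main(sys.argv)
```
The accompanying run script can then be a one-line shell file, e.g. `python run_experiment.py data/train.txt 10 0.001`, so the exact parameters of each experiment are recorded in the repository.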
Virtual meeting 25.9.2020
Done:
- script for evaluating the experiments.
Tasks for the next meeting:

View File

@ -21,8 +21,21 @@ Zásobník úloh:
- Use the model to support annotation
- By the end of the winter semester, create a report in the form of an article.
- Write down rules for validation. What annotation result counts as good? Do the annotated data need to be checked?
Virtual meeting 30.10.2020:
Status:
- Improved tutorial
- Tried exporting the data and training a model from the database. Problem when training in spaCy - different results than when training via Prodigy
- Work on the text part of the thesis.
Tasks for the next meeting:
- Create a repository named dp2021 and add your scripts and notes there.
- Continue writing the thesis. Do a literature survey on "named entity corpora" and take notes.
- Create a tool to determine the amount and type of annotated data. How many articles? How many entities of each type? The resulting table will go into the thesis.
- Prepare for production annotation. Is the schema ready?
Virtual meeting 16.10.2020:

View File

@ -1 +1,40 @@
## Diploma Project 2 2020
Status:
- Annotation scheme updated (this is a test scheme with my own data)
- Several annotations performed; training in Prodigy - low accuracy = small amount of annotated data. Training in spaCy does not work yet.
- Statistics on the number of accepted and rejected annotations can be obtained from Prodigy: prodigy stats wikiart. So far 156 annotations (151 accept, 5 reject). To get an overview of the number of annotations per entity type we need to write a script.
- Literature survey on Named Entity Corpus
- Building a corpus for NER: automatic creation of an already annotated corpus from Wikipedia using DBpedia; it is an English corpus, but it can be mentioned when comparing approaches
- Building a Massive Corpus for Named Entity Recognition using Free Open Data Sources - Daniel Specht Menezes, Pedro Savarese, Ruy L. Milidiú
- Comparison of approaches to corpus annotation (in terms of both accuracy and time) - manual, semi-manual
- Comparison of Annotating Methods for Named Entity Corpora - Kanako Komiya, Masaya Suzuki
- What a corpus is, its development cycle, corpus analysis (literature already used: the MATTER cycle)
- Natural Language Annotation for Machine Learning - James Pustejovsky, Amber Stubbs
Update 09.11.2020:
- Fixed the problem where training in spaCy did not work
- Performed a test annotation of about 500 sentences. Training results after 20 iterations: F-score 47% (the same results when training in spaCy and in Prodigy)
- Statistics on the number of individual entities: script count.py
## Diploma Project 1 2020
- creating and running the Docker container
```
./build-docker.sh
docker run -it -p 8080:8080 -v ${PWD}:/work prodigy bash
# (in my case:)
winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/annotation/work prodigy bash
```
### Running the annotation scheme
- `dataminer.csv` articles downloaded from the wiki
- `cd ner`
- `./01_text_to_sent.sh` runs the *text_to_sent.py* script, which splits the articles into individual sentences (a rough sketch of this step follows the list)
- `./02_ner_correct.sh` starts the NER annotation process with suggestions from the model
- `./03_ner_export.sh` exports the annotated data in the jsonl format required for processing in spaCy
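The `text_to_sent.py` script itself is not part of this commit; the following is only a rough sketch of the sentence-splitting step, assuming it uses NLTK (which the Docker image installs), that the article text sits in the last column of `dataminer.csv`, and that the output goes to the `textfile.csv` consumed by the Prodigy scripts:
```
# text_to_sent.py - rough sketch only; the real script lives in the repository
import csv
import nltk

nltk.download('punkt')  # sentence tokenizer models

with open('dataminer.csv', newline='', encoding='utf-8') as infile, \
     open('textfile.csv', 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['text'])  # Prodigy's CSV loader reads the "text" column
    for row in csv.reader(infile):
        article_text = row[-1]  # assumption: article text is in the last column
        for sentence in nltk.sent_tokenize(article_text):
            writer.writerow([sentence.strip()])
```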

View File

@ -1,17 +1,16 @@
# > docker run -it -p 8080:8080 -v ${PWD}:/work prodigy bash
# > winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/annotation-master/annotation/work prodigy bash
FROM python:3.8
RUN mkdir /prodigy
WORKDIR /prodigy
COPY ./prodigy-1.9.6-cp36.cp37.cp38-cp36m.cp37m.cp38-linux_x86_64.whl /prodigy
RUN mkdir /work
COPY ./ner /work/ner
RUN pip install uvicorn==0.11.5 prodigy-1.9.6-cp36.cp37.cp38-cp36m.cp37m.cp38-linux_x86_64.whl
RUN pip install https://files.kemt.fei.tuke.sk/models/spacy/sk_sk1-0.0.1.tar.gz
RUN pip install nltk
EXPOSE 8080
ENV PRODIGY_HOME /work
ENV PRODIGY_HOST 0.0.0.0
WORKDIR /work

View File

@ -1,13 +1,11 @@
## Diploma Project 2 2020
- creating and running the Docker container
```
./build-docker.sh
winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/annotation-master/annotation/work prodigy bash
```
@ -17,5 +15,12 @@ winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/ann
- `dataminer.csv` articles downloaded from the wiki
- `cd ner`
- `./01_text_to_sent.sh` runs the *text_to_sent.py* script, which splits the articles into individual sentences
- `./02_ner_manual.sh` starts the manual NER annotation process
- `./03_export.sh` exports the annotated data in the json format required for processing in spaCy. Optionally splits the data into training (70%) and test (30%) sets (--eval-split 0.3).
### Statistics about the annotated data
- `prodigy stats wikiart` - information about the number of accepted and rejected articles
- `python3 count.py` - information about the number of individual entities
### Model training
Based on: https://git.kemt.fei.tuke.sk/dano/spacy-skmodel
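The training pipeline packages the model as `sk_sk1` (see `train.sh`; the Dockerfile installs an earlier build of the same package). A minimal sketch, assuming the package is installed with pip in the current environment, of loading it and printing the recognized entities for a sentence:
```
# minimal sketch: load the packaged Slovak model and print recognized entities
import spacy

nlp = spacy.load('sk_sk1')
doc = nlp('Prezidentka Zuzana Čaputová navštívila Košice.')
for ent in doc.ents:
    print(ent.text, ent.label_)
```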

View File

@ -0,0 +1,14 @@
# count the annotated entities per label in the Prodigy export
import json
from collections import Counter

filename = 'ner/annotations.jsonl'
label_counts = Counter()
with open(filename, 'rt', encoding='utf-8') as file:
    # each line is one JSON document with a "spans" list of annotated entities
    for line in file:
        doc = json.loads(line)
        for span in doc.get('spans', []):
            label_counts[span['label']] += 1

for label in ('PER', 'LOC', 'ORG', 'MISC'):
    print('Number of annotated entities of type', label + ':', label_counts[label])

View File

@ -1,3 +0,0 @@
prodigy ner.correct wikiart sk_sk1 ./textfile.csv --label OSOBA,MIESTO,ORGANIZACIA,PRODUKT

View File

@ -0,0 +1,2 @@
prodigy ner.manual wikiart sk_sk1 ./textfile.csv --label PER,LOC,ORG,MISC

View File

@ -0,0 +1 @@
prodigy data-to-spacy ./train.json ./eval.json --lang sk --ner wikiart --eval-split 0.3

View File

@ -1 +0,0 @@
prodigy db-out wikiart > ./annotations.jsonl

View File

@ -0,0 +1,19 @@
mkdir -p build
mkdir -p build/input
# Prepare Treebank
mkdir -p build/input/slovak-treebank
spacy convert ./sources/slovak-treebank/stb.conll ./build/input/slovak-treebank
# UDAG used as evaluation
mkdir -p build/input/ud-artificial-gapping
spacy convert ./sources/ud-artificial-gapping/sk-ud-crawled-orphan.conllu ./build/input/ud-artificial-gapping
# Prepare skner
mkdir -p build/input/skner
# Convert to IOB
cat ./sources/skner/wikiann-sk.bio | python ./sources/bio-to-iob.py > build/input/skner/wikiann-sk.iob
# Split to train test
cat ./build/input/skner/wikiann-sk.iob | python ./sources/iob-to-traintest.py ./build/input/skner/wikiann-sk
# Convert train and test
mkdir -p build/input/skner-train
spacy convert -n 15 --converter ner ./build/input/skner/wikiann-sk.train ./build/input/skner-train
mkdir -p build/input/skner-test
spacy convert -n 15 --converter ner ./build/input/skner/wikiann-sk.test ./build/input/skner-test

View File

@ -0,0 +1,19 @@
set -e
OUTDIR=build/train/output
TRAINDIR=build/train
mkdir -p $TRAINDIR
mkdir -p $OUTDIR
mkdir -p dist
# Delete old training results
rm -rf $OUTDIR/*
# Train dependency and POS
spacy train sk $OUTDIR ./build/input/slovak-treebank ./build/input/ud-artificial-gapping --n-iter 20 -p tagger,parser
rm -rf $TRAINDIR/posparser
mv $OUTDIR/model-best $TRAINDIR/posparser
# Train NER
# python ./train.py -t ./train.json -o $TRAINDIR/nerposparser -n 10 -m $TRAINDIR/posparser/
spacy train sk $TRAINDIR/nerposparser ./ner/train.json ./ner/eval.json --n-iter 20 -p ner
# Package model
spacy package $TRAINDIR/nerposparser dist --meta-path ./meta.json --force
cd dist/sk_sk1-0.2.0
python ./setup.py sdist --dist-dir ../

View File

@ -31,11 +31,39 @@ Zásobník úloh:
- Make a public demo - deployment using Docker
- Improve the Web UI
- Create a REST API for indexing a document.
- In the index, assign each document a score according to several methods, e.g. PageRank
- Use the scoring during search
- **Use the SCNC validation database to evaluate each method**
- **By the end of the winter semester, create a "mini diploma thesis of about 8 pages with experiments" in the form of an article**
Virtual meeting 6.11.2020:
Status:
- Working through problems with Cassandra and JavaScript. How does the then function work?
Tasks for the next meeting:
- Implement an indexing function. The input is a document (an object with text and metadata). The function indexes the document into Elasticsearch.
- Study how the then function works and what a callback is.
- Study how Promise is used.
- Study how async - await works.
- https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Asynchronous/
Virtual meeting 23.10.2020:
Status:
- Working through problems with Cassandra. How to select data by primary key.
For the next meeting:
- Continue with the open tasks.
- Write a function for indexing a single document.
Virtual meeting 16.10.
Status:

View File

@ -0,0 +1,105 @@
//Jan Holp, DP 2021
//client1 = cassandra
//client2 = elasticsearch
//-----------------------------------------------------------------
//require the Elasticsearch library
const elasticsearch = require('elasticsearch');
const client2 = new elasticsearch.Client({
hosts: [ 'localhost:9200']
});
client2.ping({
requestTimeout: 30000,
}, function(error) {
// at this point, Elasticsearch is down, please check your Elasticsearch service
if (error) {
console.error('Elasticsearch cluster is down!');
} else {
console.log('Everything is ok');
}
});
//create new index skweb2
client2.indices.create({
index: 'skweb2'
}, function(error, response, status) {
if (error) {
console.log(error);
} else {
console.log("created a new index", response);
}
});
const cassandra = require('cassandra-driver');
const client1 = new cassandra.Client({ contactPoints: ['localhost:9042'], localDataCenter: 'datacenter1', keyspace: 'websucker' });
const query = 'SELECT title FROM websucker.content WHERE body_size > 0 ALLOW FILTERING';
client1.execute(query)
.then(result => {
// query succeeded - print the returned rows
console.log(result);
console.log('Everything is ok');
})
.catch(error => {
// the error handler must be attached to the promise chain with .catch();
// the original comma expression left it detached and it was never called
console.error('Something is wrong!');
console.log(error);
});
/*
async function indexData() {
var i = 0;
const query = 'SELECT title FROM websucker.content WHERE body_size > 0 ALLOW FILTERING';
client1.execute(query)
.then((result) => {
try {
//for ( i=0; i<15;i++){
console.log('%s', result.row[0].title)
//}
} catch (query) {
if (query instanceof SyntaxError) {
console.log( "Neplatne query" );
}
}
});
}
/*
//indexing method
const bulkIndex = function bulkIndex(index, type, data) {
let bulkBody = [];
id = 1;
const errorCount = 0;
data.forEach(item => {
bulkBody.push({
index: {
_index: index,
_type: type,
_id : id++,
}
});
bulkBody.push(item);
});
console.log(bulkBody);
client.bulk({body: bulkBody})
.then(response => {
response.items.forEach(item => {
if (item.index && item.index.error) {
console.log(++errorCount, item.index.error);
}
});
console.log(
`Successfully indexed ${data.length - errorCount}
out of ${data.length} items`
);
})
.catch(console.err);
};
*/

View File

@ -23,13 +23,26 @@ Zásobník úloh :
- tesla
- xavier
- Training on two cards on a single machine
- idoc DONE
- titan
- possibly training on 4 cards on a single machine
- quadra
- *Training on two cards on two machines using NCCL (idoc, tesla)*
- possibly training on 2 cards on two machines (quadra plus idoc).
Virtual meeting 27.10.2020
Status:
- Training on the CPU, on 1 GPU and on 2 GPUs on idoc
- Preparation of materials for training on two machines with PyTorch.
- Access to tesla and xavier set up.
Tasks for the next meeting:
- Study the research literature and take notes.
- Continue with the open tasks from the backlog
- Store the finished scripts in a Git repository
- create a dp2021 repository
Meeting 2.10.2020

View File

@ -1 +1,4 @@
## All scripts, files and configurations
https://github.com/pytorch/examples/tree/master/imagenet
- should work for DDP; the ImageNet archive is not available from the official site (a minimal setup sketch follows below)
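The full training script (`mnist-distributed.py`) is included later in this commit; condensed, the multi-node setup it relies on comes down to the following sketch (NCCL backend, `env://` rendezvous via MASTER_ADDR/MASTER_PORT, one GPU per process):
```
# minimal sketch of the process-group setup used by mnist-distributed.py
import os
import torch
import torch.distributed as dist

def setup(rank, world_size, master_addr, master_port='8888'):
    # every participating process must see the same master address and port
    os.environ['MASTER_ADDR'] = master_addr
    os.environ['MASTER_PORT'] = master_port
    dist.init_process_group(backend='nccl', init_method='env://',
                            world_size=world_size, rank=rank)
    # bind this process to one GPU on the local machine
    torch.cuda.set_device(rank % torch.cuda.device_count())
    # after setup(), wrap the model so gradients are synchronized:
    # model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_gpu])

def cleanup():
    dist.destroy_process_group()
```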

View File

@ -0,0 +1,76 @@
import argparse
import datetime
import os
import socket
import sys
import numpy as np
from torch.utils.tensorboard import SummaryWriter
import torch
import torch.nn as nn
import torch.optim
from torch.optim import SGD, Adam
from torch.utils.data import DataLoader
from util.util import enumerateWithEstimate
from p2ch13.dsets import Luna2dSegmentationDataset, TrainingLuna2dSegmentationDataset, getCt
from util.logconf import logging
from util.util import xyz2irc
from p2ch13.model_seg import UNetWrapper, SegmentationAugmentation
from p2ch13.train_seg import LunaTrainingApp
log = logging.getLogger(__name__)
# log.setLevel(logging.WARN)
# log.setLevel(logging.INFO)
log.setLevel(logging.DEBUG)
class BenchmarkLuna2dSegmentationDataset(TrainingLuna2dSegmentationDataset):
def __len__(self):
# return 500
return 5000
return 1000
class LunaBenchmarkApp(LunaTrainingApp):
def initTrainDl(self):
train_ds = BenchmarkLuna2dSegmentationDataset(
val_stride=10,
isValSet_bool=False,
contextSlices_count=3,
# augmentation_dict=self.augmentation_dict,
)
batch_size = self.cli_args.batch_size
if self.use_cuda:
batch_size *= torch.cuda.device_count()
train_dl = DataLoader(
train_ds,
batch_size=batch_size,
num_workers=self.cli_args.num_workers,
pin_memory=self.use_cuda,
)
return train_dl
def main(self):
log.info("Starting {}, {}".format(type(self).__name__, self.cli_args))
train_dl = self.initTrainDl()
for epoch_ndx in range(1, 2):
log.info("Epoch {} of {}, {}/{} batches of size {}*{}".format(
epoch_ndx,
self.cli_args.epochs,
len(train_dl),
len([]),
self.cli_args.batch_size,
(torch.cuda.device_count() if self.use_cuda else 1),
))
self.doTraining(epoch_ndx, train_dl)
if __name__ == '__main__':
LunaBenchmarkApp().main()

View File

@ -0,0 +1,401 @@
import copy
import csv
import functools
import glob
import math
import os
import random
from collections import namedtuple
import SimpleITK as sitk
import numpy as np
import scipy.ndimage.morphology as morph
import torch
import torch.cuda
import torch.nn.functional as F
from torch.utils.data import Dataset
from util.disk import getCache
from util.util import XyzTuple, xyz2irc
from util.logconf import logging
log = logging.getLogger(__name__)
# log.setLevel(logging.WARN)
# log.setLevel(logging.INFO)
log.setLevel(logging.DEBUG)
raw_cache = getCache('part2ch13_raw')
MaskTuple = namedtuple('MaskTuple', 'raw_dense_mask, dense_mask, body_mask, air_mask, raw_candidate_mask, candidate_mask, lung_mask, neg_mask, pos_mask')
CandidateInfoTuple = namedtuple('CandidateInfoTuple', 'isNodule_bool, hasAnnotation_bool, isMal_bool, diameter_mm, series_uid, center_xyz')
@functools.lru_cache(1)
def getCandidateInfoList(requireOnDisk_bool=True):
# We construct a set with all series_uids that are present on disk.
# This will let us use the data, even if we haven't downloaded all of
# the subsets yet.
mhd_list = glob.glob('data-unversioned/subset*/*.mhd')
presentOnDisk_set = {os.path.split(p)[-1][:-4] for p in mhd_list}
candidateInfo_list = []
with open('data/annotations_with_malignancy.csv', "r") as f:
for row in list(csv.reader(f))[1:]:
series_uid = row[0]
annotationCenter_xyz = tuple([float(x) for x in row[1:4]])
annotationDiameter_mm = float(row[4])
isMal_bool = {'False': False, 'True': True}[row[5]]
candidateInfo_list.append(
CandidateInfoTuple(
True,
True,
isMal_bool,
annotationDiameter_mm,
series_uid,
annotationCenter_xyz,
)
)
with open('data/candidates.csv', "r") as f:
for row in list(csv.reader(f))[1:]:
series_uid = row[0]
if series_uid not in presentOnDisk_set and requireOnDisk_bool:
continue
isNodule_bool = bool(int(row[4]))
candidateCenter_xyz = tuple([float(x) for x in row[1:4]])
if not isNodule_bool:
candidateInfo_list.append(
CandidateInfoTuple(
False,
False,
False,
0.0,
series_uid,
candidateCenter_xyz,
)
)
candidateInfo_list.sort(reverse=True)
return candidateInfo_list
@functools.lru_cache(1)
def getCandidateInfoDict(requireOnDisk_bool=True):
candidateInfo_list = getCandidateInfoList(requireOnDisk_bool)
candidateInfo_dict = {}
for candidateInfo_tup in candidateInfo_list:
candidateInfo_dict.setdefault(candidateInfo_tup.series_uid,
[]).append(candidateInfo_tup)
return candidateInfo_dict
class Ct:
def __init__(self, series_uid):
mhd_path = glob.glob(
'data-unversioned/subset*/{}.mhd'.format(series_uid)
)[0]
ct_mhd = sitk.ReadImage(mhd_path)
self.hu_a = np.array(sitk.GetArrayFromImage(ct_mhd), dtype=np.float32)
# CTs are natively expressed in https://en.wikipedia.org/wiki/Hounsfield_scale
# HU are scaled oddly, with 0 g/cc (air, approximately) being -1000 and 1 g/cc (water) being 0.
self.series_uid = series_uid
self.origin_xyz = XyzTuple(*ct_mhd.GetOrigin())
self.vxSize_xyz = XyzTuple(*ct_mhd.GetSpacing())
self.direction_a = np.array(ct_mhd.GetDirection()).reshape(3, 3)
candidateInfo_list = getCandidateInfoDict()[self.series_uid]
self.positiveInfo_list = [
candidate_tup
for candidate_tup in candidateInfo_list
if candidate_tup.isNodule_bool
]
self.positive_mask = self.buildAnnotationMask(self.positiveInfo_list)
self.positive_indexes = (self.positive_mask.sum(axis=(1,2))
.nonzero()[0].tolist())
def buildAnnotationMask(self, positiveInfo_list, threshold_hu = -700):
boundingBox_a = np.zeros_like(self.hu_a, dtype=np.bool)
for candidateInfo_tup in positiveInfo_list:
center_irc = xyz2irc(
candidateInfo_tup.center_xyz,
self.origin_xyz,
self.vxSize_xyz,
self.direction_a,
)
ci = int(center_irc.index)
cr = int(center_irc.row)
cc = int(center_irc.col)
index_radius = 2
try:
while self.hu_a[ci + index_radius, cr, cc] > threshold_hu and \
self.hu_a[ci - index_radius, cr, cc] > threshold_hu:
index_radius += 1
except IndexError:
index_radius -= 1
row_radius = 2
try:
while self.hu_a[ci, cr + row_radius, cc] > threshold_hu and \
self.hu_a[ci, cr - row_radius, cc] > threshold_hu:
row_radius += 1
except IndexError:
row_radius -= 1
col_radius = 2
try:
while self.hu_a[ci, cr, cc + col_radius] > threshold_hu and \
self.hu_a[ci, cr, cc - col_radius] > threshold_hu:
col_radius += 1
except IndexError:
col_radius -= 1
# assert index_radius > 0, repr([candidateInfo_tup.center_xyz, center_irc, self.hu_a[ci, cr, cc]])
# assert row_radius > 0
# assert col_radius > 0
boundingBox_a[
ci - index_radius: ci + index_radius + 1,
cr - row_radius: cr + row_radius + 1,
cc - col_radius: cc + col_radius + 1] = True
mask_a = boundingBox_a & (self.hu_a > threshold_hu)
return mask_a
def getRawCandidate(self, center_xyz, width_irc):
center_irc = xyz2irc(center_xyz, self.origin_xyz, self.vxSize_xyz,
self.direction_a)
slice_list = []
for axis, center_val in enumerate(center_irc):
start_ndx = int(round(center_val - width_irc[axis]/2))
end_ndx = int(start_ndx + width_irc[axis])
assert center_val >= 0 and center_val < self.hu_a.shape[axis], repr([self.series_uid, center_xyz, self.origin_xyz, self.vxSize_xyz, center_irc, axis])
if start_ndx < 0:
# log.warning("Crop outside of CT array: {} {}, center:{} shape:{} width:{}".format(
# self.series_uid, center_xyz, center_irc, self.hu_a.shape, width_irc))
start_ndx = 0
end_ndx = int(width_irc[axis])
if end_ndx > self.hu_a.shape[axis]:
# log.warning("Crop outside of CT array: {} {}, center:{} shape:{} width:{}".format(
# self.series_uid, center_xyz, center_irc, self.hu_a.shape, width_irc))
end_ndx = self.hu_a.shape[axis]
start_ndx = int(self.hu_a.shape[axis] - width_irc[axis])
slice_list.append(slice(start_ndx, end_ndx))
ct_chunk = self.hu_a[tuple(slice_list)]
pos_chunk = self.positive_mask[tuple(slice_list)]
return ct_chunk, pos_chunk, center_irc
@functools.lru_cache(1, typed=True)
def getCt(series_uid):
return Ct(series_uid)
@raw_cache.memoize(typed=True)
def getCtRawCandidate(series_uid, center_xyz, width_irc):
ct = getCt(series_uid)
ct_chunk, pos_chunk, center_irc = ct.getRawCandidate(center_xyz,
width_irc)
ct_chunk.clip(-1000, 1000, ct_chunk)
return ct_chunk, pos_chunk, center_irc
@raw_cache.memoize(typed=True)
def getCtSampleSize(series_uid):
ct = Ct(series_uid)
return int(ct.hu_a.shape[0]), ct.positive_indexes
class Luna2dSegmentationDataset(Dataset):
def __init__(self,
val_stride=0,
isValSet_bool=None,
series_uid=None,
contextSlices_count=3,
fullCt_bool=False,
):
self.contextSlices_count = contextSlices_count
self.fullCt_bool = fullCt_bool
if series_uid:
self.series_list = [series_uid]
else:
self.series_list = sorted(getCandidateInfoDict().keys())
if isValSet_bool:
assert val_stride > 0, val_stride
self.series_list = self.series_list[::val_stride]
assert self.series_list
elif val_stride > 0:
del self.series_list[::val_stride]
assert self.series_list
self.sample_list = []
for series_uid in self.series_list:
index_count, positive_indexes = getCtSampleSize(series_uid)
if self.fullCt_bool:
self.sample_list += [(series_uid, slice_ndx)
for slice_ndx in range(index_count)]
else:
self.sample_list += [(series_uid, slice_ndx)
for slice_ndx in positive_indexes]
self.candidateInfo_list = getCandidateInfoList()
series_set = set(self.series_list)
self.candidateInfo_list = [cit for cit in self.candidateInfo_list
if cit.series_uid in series_set]
self.pos_list = [nt for nt in self.candidateInfo_list
if nt.isNodule_bool]
log.info("{!r}: {} {} series, {} slices, {} nodules".format(
self,
len(self.series_list),
{None: 'general', True: 'validation', False: 'training'}[isValSet_bool],
len(self.sample_list),
len(self.pos_list),
))
def __len__(self):
return len(self.sample_list)
def __getitem__(self, ndx):
series_uid, slice_ndx = self.sample_list[ndx % len(self.sample_list)]
return self.getitem_fullSlice(series_uid, slice_ndx)
def getitem_fullSlice(self, series_uid, slice_ndx):
ct = getCt(series_uid)
ct_t = torch.zeros((self.contextSlices_count * 2 + 1, 512, 512))
start_ndx = slice_ndx - self.contextSlices_count
end_ndx = slice_ndx + self.contextSlices_count + 1
for i, context_ndx in enumerate(range(start_ndx, end_ndx)):
context_ndx = max(context_ndx, 0)
context_ndx = min(context_ndx, ct.hu_a.shape[0] - 1)
ct_t[i] = torch.from_numpy(ct.hu_a[context_ndx].astype(np.float32))
# CTs are natively expressed in https://en.wikipedia.org/wiki/Hounsfield_scale
# HU are scaled oddly, with 0 g/cc (air, approximately) being -1000 and 1 g/cc (water) being 0.
# The lower bound gets rid of negative density stuff used to indicate out-of-FOV
# The upper bound nukes any weird hotspots and clamps bone down
ct_t.clamp_(-1000, 1000)
pos_t = torch.from_numpy(ct.positive_mask[slice_ndx]).unsqueeze(0)
return ct_t, pos_t, ct.series_uid, slice_ndx
class TrainingLuna2dSegmentationDataset(Luna2dSegmentationDataset):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.ratio_int = 2
def __len__(self):
return 300000
def shuffleSamples(self):
random.shuffle(self.candidateInfo_list)
random.shuffle(self.pos_list)
def __getitem__(self, ndx):
candidateInfo_tup = self.pos_list[ndx % len(self.pos_list)]
return self.getitem_trainingCrop(candidateInfo_tup)
def getitem_trainingCrop(self, candidateInfo_tup):
ct_a, pos_a, center_irc = getCtRawCandidate(
candidateInfo_tup.series_uid,
candidateInfo_tup.center_xyz,
(7, 96, 96),
)
pos_a = pos_a[3:4]
row_offset = random.randrange(0,32)
col_offset = random.randrange(0,32)
ct_t = torch.from_numpy(ct_a[:, row_offset:row_offset+64,
col_offset:col_offset+64]).to(torch.float32)
pos_t = torch.from_numpy(pos_a[:, row_offset:row_offset+64,
col_offset:col_offset+64]).to(torch.long)
slice_ndx = center_irc.index
return ct_t, pos_t, candidateInfo_tup.series_uid, slice_ndx
class PrepcacheLunaDataset(Dataset):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.candidateInfo_list = getCandidateInfoList()
self.pos_list = [nt for nt in self.candidateInfo_list if nt.isNodule_bool]
self.seen_set = set()
self.candidateInfo_list.sort(key=lambda x: x.series_uid)
def __len__(self):
return len(self.candidateInfo_list)
def __getitem__(self, ndx):
# candidate_t, pos_t, series_uid, center_t = super().__getitem__(ndx)
candidateInfo_tup = self.candidateInfo_list[ndx]
getCtRawCandidate(candidateInfo_tup.series_uid, candidateInfo_tup.center_xyz, (7, 96, 96))
series_uid = candidateInfo_tup.series_uid
if series_uid not in self.seen_set:
self.seen_set.add(series_uid)
getCtSampleSize(series_uid)
# ct = getCt(series_uid)
# for mask_ndx in ct.positive_indexes:
# build2dLungMask(series_uid, mask_ndx)
return 0, 1 #candidate_t, pos_t, series_uid, center_t
class TvTrainingLuna2dSegmentationDataset(torch.utils.data.Dataset):
def __init__(self, isValSet_bool=False, val_stride=10, contextSlices_count=3):
assert contextSlices_count == 3
data = torch.load('./imgs_and_masks.pt')
suids = list(set(data['suids']))
trn_mask_suids = torch.arange(len(suids)) % val_stride < (val_stride - 1)
trn_suids = {s for i, s in zip(trn_mask_suids, suids) if i}
trn_mask = torch.tensor([(s in trn_suids) for s in data["suids"]])
if not isValSet_bool:
self.imgs = data["imgs"][trn_mask]
self.masks = data["masks"][trn_mask]
self.suids = [s for s, i in zip(data["suids"], trn_mask) if i]
else:
self.imgs = data["imgs"][~trn_mask]
self.masks = data["masks"][~trn_mask]
self.suids = [s for s, i in zip(data["suids"], trn_mask) if not i]
# discard spurious hotspots and clamp bone
self.imgs.clamp_(-1000, 1000)
self.imgs /= 1000
def __len__(self):
return len(self.imgs)
def __getitem__(self, i):
oh, ow = torch.randint(0, 32, (2,))
sl = self.masks.size(1)//2
return self.imgs[i, :, oh: oh + 64, ow: ow + 64], 1, self.masks[i, sl: sl+1, oh: oh + 64, ow: ow + 64].to(torch.float32), self.suids[i], 9999

View File

@ -0,0 +1,224 @@
import math
import random
from collections import namedtuple
import torch
from torch import nn as nn
import torch.nn.functional as F
from util.logconf import logging
from util.unet import UNet
log = logging.getLogger(__name__)
# log.setLevel(logging.WARN)
# log.setLevel(logging.INFO)
log.setLevel(logging.DEBUG)
class UNetWrapper(nn.Module):
def __init__(self, **kwargs):
super().__init__()
self.input_batchnorm = nn.BatchNorm2d(kwargs['in_channels'])
self.unet = UNet(**kwargs)
self.final = nn.Sigmoid()
self._init_weights()
def _init_weights(self):
init_set = {
nn.Conv2d,
nn.Conv3d,
nn.ConvTranspose2d,
nn.ConvTranspose3d,
nn.Linear,
}
for m in self.modules():
if type(m) in init_set:
nn.init.kaiming_normal_(
m.weight.data, mode='fan_out', nonlinearity='relu', a=0
)
if m.bias is not None:
fan_in, fan_out = \
nn.init._calculate_fan_in_and_fan_out(m.weight.data)
bound = 1 / math.sqrt(fan_out)
nn.init.normal_(m.bias, -bound, bound)
# nn.init.constant_(self.unet.last.bias, -4)
# nn.init.constant_(self.unet.last.bias, 4)
def forward(self, input_batch):
bn_output = self.input_batchnorm(input_batch)
un_output = self.unet(bn_output)
fn_output = self.final(un_output)
return fn_output
class SegmentationAugmentation(nn.Module):
def __init__(
self, flip=None, offset=None, scale=None, rotate=None, noise=None
):
super().__init__()
self.flip = flip
self.offset = offset
self.scale = scale
self.rotate = rotate
self.noise = noise
def forward(self, input_g, label_g):
transform_t = self._build2dTransformMatrix()
transform_t = transform_t.expand(input_g.shape[0], -1, -1)
transform_t = transform_t.to(input_g.device, torch.float32)
affine_t = F.affine_grid(transform_t[:,:2],
input_g.size(), align_corners=False)
augmented_input_g = F.grid_sample(input_g,
affine_t, padding_mode='border',
align_corners=False)
augmented_label_g = F.grid_sample(label_g.to(torch.float32),
affine_t, padding_mode='border',
align_corners=False)
if self.noise:
noise_t = torch.randn_like(augmented_input_g)
noise_t *= self.noise
augmented_input_g += noise_t
return augmented_input_g, augmented_label_g > 0.5
def _build2dTransformMatrix(self):
transform_t = torch.eye(3)
for i in range(2):
if self.flip:
if random.random() > 0.5:
transform_t[i,i] *= -1
if self.offset:
offset_float = self.offset
random_float = (random.random() * 2 - 1)
transform_t[2,i] = offset_float * random_float
if self.scale:
scale_float = self.scale
random_float = (random.random() * 2 - 1)
transform_t[i,i] *= 1.0 + scale_float * random_float
if self.rotate:
angle_rad = random.random() * math.pi * 2
s = math.sin(angle_rad)
c = math.cos(angle_rad)
rotation_t = torch.tensor([
[c, -s, 0],
[s, c, 0],
[0, 0, 1]])
transform_t @= rotation_t
return transform_t
# MaskTuple = namedtuple('MaskTuple', 'raw_dense_mask, dense_mask, body_mask, air_mask, raw_candidate_mask, candidate_mask, lung_mask, neg_mask, pos_mask')
#
# class SegmentationMask(nn.Module):
# def __init__(self):
# super().__init__()
#
# self.conv_list = nn.ModuleList([
# self._make_circle_conv(radius) for radius in range(1, 8)
# ])
#
# def _make_circle_conv(self, radius):
# diameter = 1 + radius * 2
#
# a = torch.linspace(-1, 1, steps=diameter)**2
# b = (a[None] + a[:, None])**0.5
#
# circle_weights = (b <= 1.0).to(torch.float32)
#
# conv = nn.Conv2d(1, 1, kernel_size=diameter, padding=radius, bias=False)
# conv.weight.data.fill_(1)
# conv.weight.data *= circle_weights / circle_weights.sum()
#
# return conv
#
#
# def erode(self, input_mask, radius, threshold=1):
# conv = self.conv_list[radius - 1]
# input_float = input_mask.to(torch.float32)
# result = conv(input_float)
#
# # log.debug(['erode in ', radius, threshold, input_float.min().item(), input_float.mean().item(), input_float.max().item()])
# # log.debug(['erode out', radius, threshold, result.min().item(), result.mean().item(), result.max().item()])
#
# return result >= threshold
#
# def deposit(self, input_mask, radius, threshold=0):
# conv = self.conv_list[radius - 1]
# input_float = input_mask.to(torch.float32)
# result = conv(input_float)
#
# # log.debug(['deposit in ', radius, threshold, input_float.min().item(), input_float.mean().item(), input_float.max().item()])
# # log.debug(['deposit out', radius, threshold, result.min().item(), result.mean().item(), result.max().item()])
#
# return result > threshold
#
# def fill_cavity(self, input_mask):
# cumsum = input_mask.cumsum(-1)
# filled_mask = (cumsum > 0)
# filled_mask &= (cumsum < cumsum[..., -1:])
# cumsum = input_mask.cumsum(-2)
# filled_mask &= (cumsum > 0)
# filled_mask &= (cumsum < cumsum[..., -1:, :])
#
# return filled_mask
#
#
# def forward(self, input_g, raw_pos_g):
# gcc_g = input_g + 1
#
# with torch.no_grad():
# # log.info(['gcc_g', gcc_g.min(), gcc_g.mean(), gcc_g.max()])
#
# raw_dense_mask = gcc_g > 0.7
# dense_mask = self.deposit(raw_dense_mask, 2)
# dense_mask = self.erode(dense_mask, 6)
# dense_mask = self.deposit(dense_mask, 4)
#
# body_mask = self.fill_cavity(dense_mask)
# air_mask = self.deposit(body_mask & ~dense_mask, 5)
# air_mask = self.erode(air_mask, 6)
#
# lung_mask = self.deposit(air_mask, 5)
#
# raw_candidate_mask = gcc_g > 0.4
# raw_candidate_mask &= air_mask
# candidate_mask = self.erode(raw_candidate_mask, 1)
# candidate_mask = self.deposit(candidate_mask, 1)
#
# pos_mask = self.deposit((raw_pos_g > 0.5) & lung_mask, 2)
#
# neg_mask = self.deposit(candidate_mask, 1)
# neg_mask &= ~pos_mask
# neg_mask &= lung_mask
#
# # label_g = (neg_mask | pos_mask).to(torch.float32)
# label_g = (pos_mask).to(torch.float32)
# neg_g = neg_mask.to(torch.float32)
# pos_g = pos_mask.to(torch.float32)
#
# mask_dict = {
# 'raw_dense_mask': raw_dense_mask,
# 'dense_mask': dense_mask,
# 'body_mask': body_mask,
# 'air_mask': air_mask,
# 'raw_candidate_mask': raw_candidate_mask,
# 'candidate_mask': candidate_mask,
# 'lung_mask': lung_mask,
# 'neg_mask': neg_mask,
# 'pos_mask': pos_mask,
# }
#
# return label_g, neg_g, pos_g, lung_mask, mask_dict

View File

@ -0,0 +1,69 @@
import timing
import argparse
import sys
import numpy as np
import torch.nn as nn
from torch.autograd import Variable
from torch.optim import SGD
from torch.utils.data import DataLoader
from util.util import enumerateWithEstimate
from .dsets import PrepcacheLunaDataset, getCtSampleSize
from util.logconf import logging
# from .model import LunaModel
log = logging.getLogger(__name__)
# log.setLevel(logging.WARN)
log.setLevel(logging.INFO)
# log.setLevel(logging.DEBUG)
class LunaPrepCacheApp:
@classmethod
def __init__(self, sys_argv=None):
if sys_argv is None:
sys_argv = sys.argv[1:]
parser = argparse.ArgumentParser()
parser.add_argument('--batch-size',
help='Batch size to use for training',
default=1024,
type=int,
)
parser.add_argument('--num-workers',
help='Number of worker processes for background data loading',
default=8,
type=int,
)
# parser.add_argument('--scaled',
# help="Scale the CT chunks to square voxels.",
# default=False,
# action='store_true',
# )
self.cli_args = parser.parse_args(sys_argv)
def main(self):
log.info("Starting {}, {}".format(type(self).__name__, self.cli_args))
self.prep_dl = DataLoader(
PrepcacheLunaDataset(
# sortby_str='series_uid',
),
batch_size=self.cli_args.batch_size,
num_workers=self.cli_args.num_workers,
)
batch_iter = enumerateWithEstimate(
self.prep_dl,
"Stuffing cache",
start_ndx=self.prep_dl.num_workers,
)
for batch_ndx, batch_tup in batch_iter:
pass
if __name__ == '__main__':
LunaPrepCacheApp().main()

View File

@ -0,0 +1,331 @@
import math
import random
import warnings
import numpy as np
import scipy.ndimage
import torch
from torch.autograd import Function
from torch.autograd.function import once_differentiable
import torch.backends.cudnn as cudnn
from util.logconf import logging
log = logging.getLogger(__name__)
# log.setLevel(logging.WARN)
# log.setLevel(logging.INFO)
log.setLevel(logging.DEBUG)
def cropToShape(image, new_shape, center_list=None, fill=0.0):
# log.debug([image.shape, new_shape, center_list])
# assert len(image.shape) == 3, repr(image.shape)
if center_list is None:
center_list = [int(image.shape[i] / 2) for i in range(3)]
crop_list = []
for i in range(0, 3):
crop_int = center_list[i]
if image.shape[i] > new_shape[i] and crop_int is not None:
# We can't just do crop_int +/- shape/2 since shape might be odd
# and ints round down.
start_int = crop_int - int(new_shape[i]/2)
end_int = start_int + new_shape[i]
crop_list.append(slice(max(0, start_int), end_int))
else:
crop_list.append(slice(0, image.shape[i]))
# log.debug([image.shape, crop_list])
image = image[crop_list]
crop_list = []
for i in range(0, 3):
if image.shape[i] < new_shape[i]:
crop_int = int((new_shape[i] - image.shape[i]) / 2)
crop_list.append(slice(crop_int, crop_int + image.shape[i]))
else:
crop_list.append(slice(0, image.shape[i]))
# log.debug([image.shape, crop_list])
new_image = np.zeros(new_shape, dtype=image.dtype)
new_image[:] = fill
new_image[crop_list] = image
return new_image
def zoomToShape(image, new_shape, square=True):
# assert image.shape[-1] in {1, 3, 4}, repr(image.shape)
if square and image.shape[0] != image.shape[1]:
crop_int = min(image.shape[0], image.shape[1])
new_shape = [crop_int, crop_int, image.shape[2]]
image = cropToShape(image, new_shape)
zoom_shape = [new_shape[i] / image.shape[i] for i in range(3)]
with warnings.catch_warnings():
warnings.simplefilter("ignore")
image = scipy.ndimage.interpolation.zoom(
image, zoom_shape,
output=None, order=0, mode='nearest', cval=0.0, prefilter=True)
return image
def randomOffset(image_list, offset_rows=0.125, offset_cols=0.125):
center_list = [int(image_list[0].shape[i] / 2) for i in range(3)]
center_list[0] += int(offset_rows * (random.random() - 0.5) * 2)
center_list[1] += int(offset_cols * (random.random() - 0.5) * 2)
center_list[2] = None
new_list = []
for image in image_list:
new_image = cropToShape(image, image.shape, center_list)
new_list.append(new_image)
return new_list
def randomZoom(image_list, scale=None, scale_min=0.8, scale_max=1.3):
if scale is None:
scale = scale_min + (scale_max - scale_min) * random.random()
new_list = []
for image in image_list:
# assert image.shape[-1] in {1, 3, 4}, repr(image.shape)
with warnings.catch_warnings():
warnings.simplefilter("ignore")
# log.info([image.shape])
zimage = scipy.ndimage.interpolation.zoom(
image, [scale, scale, 1.0],
output=None, order=0, mode='nearest', cval=0.0, prefilter=True)
image = cropToShape(zimage, image.shape)
new_list.append(image)
return new_list
_randomFlip_transform_list = [
# lambda a: np.rot90(a, axes=(0, 1)),
# lambda a: np.flip(a, 0),
lambda a: np.flip(a, 1),
]
def randomFlip(image_list, transform_bits=None):
if transform_bits is None:
transform_bits = random.randrange(0, 2 ** len(_randomFlip_transform_list))
new_list = []
for image in image_list:
# assert image.shape[-1] in {1, 3, 4}, repr(image.shape)
for n in range(len(_randomFlip_transform_list)):
if transform_bits & 2**n:
# prhist(image, 'before')
image = _randomFlip_transform_list[n](image)
# prhist(image, 'after ')
new_list.append(image)
return new_list
def randomSpin(image_list, angle=None, range_tup=None, axes=(0, 1)):
if range_tup is None:
range_tup = (0, 360)
if angle is None:
angle = range_tup[0] + (range_tup[1] - range_tup[0]) * random.random()
new_list = []
for image in image_list:
# assert image.shape[-1] in {1, 3, 4}, repr(image.shape)
image = scipy.ndimage.interpolation.rotate(
image, angle, axes=axes, reshape=False,
output=None, order=0, mode='nearest', cval=0.0, prefilter=True)
new_list.append(image)
return new_list
def randomNoise(image_list, noise_min=-0.1, noise_max=0.1):
noise = np.zeros_like(image_list[0])
noise += (noise_max - noise_min) * np.random.random_sample(image_list[0].shape) + noise_min
noise *= 5
noise = scipy.ndimage.filters.gaussian_filter(noise, 3)
# noise += (noise_max - noise_min) * np.random.random_sample(image_hsv.shape) + noise_min
new_list = []
for image_hsv in image_list:
image_hsv = image_hsv + noise
new_list.append(image_hsv)
return new_list
def randomHsvShift(image_list, h=None, s=None, v=None,
h_min=-0.1, h_max=0.1,
s_min=0.5, s_max=2.0,
v_min=0.5, v_max=2.0):
if h is None:
h = h_min + (h_max - h_min) * random.random()
if s is None:
s = s_min + (s_max - s_min) * random.random()
if v is None:
v = v_min + (v_max - v_min) * random.random()
new_list = []
for image_hsv in image_list:
# assert image_hsv.shape[-1] == 3, repr(image_hsv.shape)
image_hsv[:,:,0::3] += h
image_hsv[:,:,1::3] = image_hsv[:,:,1::3] ** s
image_hsv[:,:,2::3] = image_hsv[:,:,2::3] ** v
new_list.append(image_hsv)
return clampHsv(new_list)
def clampHsv(image_list):
new_list = []
for image_hsv in image_list:
image_hsv = image_hsv.clone()
# Hue wraps around
image_hsv[:,:,0][image_hsv[:,:,0] > 1] -= 1
image_hsv[:,:,0][image_hsv[:,:,0] < 0] += 1
# Everything else clamps between 0 and 1
image_hsv[image_hsv > 1] = 1
image_hsv[image_hsv < 0] = 0
new_list.append(image_hsv)
return new_list
# def torch_augment(input):
# theta = random.random() * math.pi * 2
# s = math.sin(theta)
# c = math.cos(theta)
# c1 = 1 - c
# axis_vector = torch.rand(3, device='cpu', dtype=torch.float64)
# axis_vector -= 0.5
# axis_vector /= axis_vector.abs().sum()
# l, m, n = axis_vector
#
# matrix = torch.tensor([
# [l*l*c1 + c, m*l*c1 - n*s, n*l*c1 + m*s, 0],
# [l*m*c1 + n*s, m*m*c1 + c, n*m*c1 - l*s, 0],
# [l*n*c1 - m*s, m*n*c1 + l*s, n*n*c1 + c, 0],
# [0, 0, 0, 1],
# ], device=input.device, dtype=torch.float32)
#
# return th_affine3d(input, matrix)
# following from https://github.com/ncullen93/torchsample/blob/master/torchsample/utils.py
# MIT licensed
# def th_affine3d(input, matrix):
# """
# 3D Affine image transform on torch.Tensor
# """
# A = matrix[:3,:3]
# b = matrix[:3,3]
#
# # make a meshgrid of normal coordinates
# coords = th_iterproduct(input.size(-3), input.size(-2), input.size(-1), dtype=torch.float32)
#
# # shift the coordinates so center is the origin
# coords[:,0] = coords[:,0] - (input.size(-3) / 2. - 0.5)
# coords[:,1] = coords[:,1] - (input.size(-2) / 2. - 0.5)
# coords[:,2] = coords[:,2] - (input.size(-1) / 2. - 0.5)
#
# # apply the coordinate transformation
# new_coords = coords.mm(A.t().contiguous()) + b.expand_as(coords)
#
# # shift the coordinates back so origin is origin
# new_coords[:,0] = new_coords[:,0] + (input.size(-3) / 2. - 0.5)
# new_coords[:,1] = new_coords[:,1] + (input.size(-2) / 2. - 0.5)
# new_coords[:,2] = new_coords[:,2] + (input.size(-1) / 2. - 0.5)
#
# # map new coordinates using bilinear interpolation
# input_transformed = th_trilinear_interp3d(input, new_coords)
#
# return input_transformed
#
#
# def th_trilinear_interp3d(input, coords):
# """
# trilinear interpolation of 3D torch.Tensor image
# """
# # take clamp then floor/ceil of x coords
# x = torch.clamp(coords[:,0], 0, input.size(-3)-2)
# x0 = x.floor()
# x1 = x0 + 1
# # take clamp then floor/ceil of y coords
# y = torch.clamp(coords[:,1], 0, input.size(-2)-2)
# y0 = y.floor()
# y1 = y0 + 1
# # take clamp then floor/ceil of z coords
# z = torch.clamp(coords[:,2], 0, input.size(-1)-2)
# z0 = z.floor()
# z1 = z0 + 1
#
# stride = torch.tensor(input.stride()[-3:], dtype=torch.int64, device=input.device)
# x0_ix = x0.mul(stride[0]).long()
# x1_ix = x1.mul(stride[0]).long()
# y0_ix = y0.mul(stride[1]).long()
# y1_ix = y1.mul(stride[1]).long()
# z0_ix = z0.mul(stride[2]).long()
# z1_ix = z1.mul(stride[2]).long()
#
# # input_flat = th_flatten(input)
# input_flat = x.contiguous().view(x[0], x[1], -1)
#
# vals_000 = input_flat[:, :, x0_ix+y0_ix+z0_ix]
# vals_001 = input_flat[:, :, x0_ix+y0_ix+z1_ix]
# vals_010 = input_flat[:, :, x0_ix+y1_ix+z0_ix]
# vals_011 = input_flat[:, :, x0_ix+y1_ix+z1_ix]
# vals_100 = input_flat[:, :, x1_ix+y0_ix+z0_ix]
# vals_101 = input_flat[:, :, x1_ix+y0_ix+z1_ix]
# vals_110 = input_flat[:, :, x1_ix+y1_ix+z0_ix]
# vals_111 = input_flat[:, :, x1_ix+y1_ix+z1_ix]
#
# xd = x - x0
# yd = y - y0
# zd = z - z0
# xm1 = 1 - xd
# ym1 = 1 - yd
# zm1 = 1 - zd
#
# x_mapped = (
# vals_000.mul(xm1).mul(ym1).mul(zm1) +
# vals_001.mul(xm1).mul(ym1).mul(zd) +
# vals_010.mul(xm1).mul(yd).mul(zm1) +
# vals_011.mul(xm1).mul(yd).mul(zd) +
# vals_100.mul(xd).mul(ym1).mul(zm1) +
# vals_101.mul(xd).mul(ym1).mul(zd) +
# vals_110.mul(xd).mul(yd).mul(zm1) +
# vals_111.mul(xd).mul(yd).mul(zd)
# )
#
# return x_mapped.view_as(input)
#
# def th_iterproduct(*args, dtype=None):
# return torch.from_numpy(np.indices(args).reshape((len(args),-1)).T)
#
# def th_flatten(x):
# """Flatten tensor"""
# return x.contiguous().view(x[0], x[1], -1)

View File

@ -0,0 +1,136 @@
import gzip
from diskcache import FanoutCache, Disk
from diskcache.core import BytesType, MODE_BINARY, BytesIO
from util.logconf import logging
log = logging.getLogger(__name__)
# log.setLevel(logging.WARN)
log.setLevel(logging.INFO)
# log.setLevel(logging.DEBUG)
class GzipDisk(Disk):
def store(self, value, read, key=None):
"""
Override from base class diskcache.Disk.
Chunking is due to needing to work on pythons < 2.7.13:
- Issue #27130: In the "zlib" module, fix handling of large buffers
(typically 2 or 4 GiB). Previously, inputs were limited to 2 GiB, and
compression and decompression operations did not properly handle results of
2 or 4 GiB.
:param value: value to convert
:param bool read: True when value is file-like object
:return: (size, mode, filename, value) tuple for Cache table
"""
# pylint: disable=unidiomatic-typecheck
if type(value) is BytesType:
if read:
value = value.read()
read = False
str_io = BytesIO()
gz_file = gzip.GzipFile(mode='wb', compresslevel=1, fileobj=str_io)
for offset in range(0, len(value), 2**30):
gz_file.write(value[offset:offset+2**30])
gz_file.close()
value = str_io.getvalue()
return super(GzipDisk, self).store(value, read)
def fetch(self, mode, filename, value, read):
"""
Override from base class diskcache.Disk.
Chunking is due to needing to work on pythons < 2.7.13:
- Issue #27130: In the "zlib" module, fix handling of large buffers
(typically 2 or 4 GiB). Previously, inputs were limited to 2 GiB, and
compression and decompression operations did not properly handle results of
2 or 4 GiB.
:param int mode: value mode raw, binary, text, or pickle
:param str filename: filename of corresponding value
:param value: database value
:param bool read: when True, return an open file handle
:return: corresponding Python value
"""
value = super(GzipDisk, self).fetch(mode, filename, value, read)
if mode == MODE_BINARY:
str_io = BytesIO(value)
gz_file = gzip.GzipFile(mode='rb', fileobj=str_io)
read_csio = BytesIO()
while True:
uncompressed_data = gz_file.read(2**30)
if uncompressed_data:
read_csio.write(uncompressed_data)
else:
break
value = read_csio.getvalue()
return value
def getCache(scope_str):
return FanoutCache('data-unversioned/cache/' + scope_str,
disk=GzipDisk,
shards=64,
timeout=1,
size_limit=3e11,
# disk_min_file_size=2**20,
)
# def disk_cache(base_path, memsize=2):
# def disk_cache_decorator(f):
# @functools.wraps(f)
# def wrapper(*args, **kwargs):
# args_str = repr(args) + repr(sorted(kwargs.items()))
# file_str = hashlib.md5(args_str.encode('utf8')).hexdigest()
#
# cache_path = os.path.join(base_path, f.__name__, file_str + '.pkl.gz')
#
# if not os.path.exists(os.path.dirname(cache_path)):
# os.makedirs(os.path.dirname(cache_path), exist_ok=True)
#
# if os.path.exists(cache_path):
# return pickle_loadgz(cache_path)
# else:
# ret = f(*args, **kwargs)
# pickle_dumpgz(cache_path, ret)
# return ret
#
# return wrapper
#
# return disk_cache_decorator
#
#
# def pickle_dumpgz(file_path, obj):
# log.debug("Writing {}".format(file_path))
# with open(file_path, 'wb') as file_obj:
# with gzip.GzipFile(mode='wb', compresslevel=1, fileobj=file_obj) as gz_file:
# pickle.dump(obj, gz_file, pickle.HIGHEST_PROTOCOL)
#
#
# def pickle_loadgz(file_path):
# log.debug("Reading {}".format(file_path))
# with open(file_path, 'rb') as file_obj:
# with gzip.GzipFile(mode='rb', fileobj=file_obj) as gz_file:
# return pickle.load(gz_file)
#
#
# def dtpath(dt=None):
# if dt is None:
# dt = datetime.datetime.now()
#
# return str(dt).rsplit('.', 1)[0].replace(' ', '--').replace(':', '.')
#
#
# def safepath(s):
# s = s.replace(' ', '_')
# return re.sub('[^A-Za-z0-9_.-]', '', s)

View File

@ -0,0 +1,19 @@
import logging
import logging.handlers
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
# Some libraries attempt to add their own root logger handlers. This is
# annoying and so we get rid of them.
for handler in list(root_logger.handlers):
root_logger.removeHandler(handler)
logfmt_str = "%(asctime)s %(levelname)-8s pid:%(process)d %(name)s:%(lineno)03d:%(funcName)s %(message)s"
formatter = logging.Formatter(logfmt_str)
streamHandler = logging.StreamHandler()
streamHandler.setFormatter(formatter)
streamHandler.setLevel(logging.DEBUG)
root_logger.addHandler(streamHandler)

View File

@ -0,0 +1,143 @@
# From https://github.com/jvanvugt/pytorch-unet
# https://raw.githubusercontent.com/jvanvugt/pytorch-unet/master/unet.py
# MIT License
#
# Copyright (c) 2018 Joris
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
# Adapted from https://discuss.pytorch.org/t/unet-implementation/426
import torch
from torch import nn
import torch.nn.functional as F
class UNet(nn.Module):
def __init__(self, in_channels=1, n_classes=2, depth=5, wf=6, padding=False,
batch_norm=False, up_mode='upconv'):
"""
Implementation of
U-Net: Convolutional Networks for Biomedical Image Segmentation
(Ronneberger et al., 2015)
https://arxiv.org/abs/1505.04597
Using the default arguments will yield the exact version used
in the original paper
Args:
in_channels (int): number of input channels
n_classes (int): number of output channels
depth (int): depth of the network
wf (int): number of filters in the first layer is 2**wf
padding (bool): if True, apply padding such that the input shape
is the same as the output.
This may introduce artifacts
batch_norm (bool): Use BatchNorm after layers with an
activation function
up_mode (str): one of 'upconv' or 'upsample'.
'upconv' will use transposed convolutions for
learned upsampling.
'upsample' will use bilinear upsampling.
"""
super(UNet, self).__init__()
assert up_mode in ('upconv', 'upsample')
self.padding = padding
self.depth = depth
prev_channels = in_channels
self.down_path = nn.ModuleList()
for i in range(depth):
self.down_path.append(UNetConvBlock(prev_channels, 2**(wf+i),
padding, batch_norm))
prev_channels = 2**(wf+i)
self.up_path = nn.ModuleList()
for i in reversed(range(depth - 1)):
self.up_path.append(UNetUpBlock(prev_channels, 2**(wf+i), up_mode,
padding, batch_norm))
prev_channels = 2**(wf+i)
self.last = nn.Conv2d(prev_channels, n_classes, kernel_size=1)
def forward(self, x):
blocks = []
for i, down in enumerate(self.down_path):
x = down(x)
if i != len(self.down_path)-1:
blocks.append(x)
x = F.avg_pool2d(x, 2)
for i, up in enumerate(self.up_path):
x = up(x, blocks[-i-1])
return self.last(x)
class UNetConvBlock(nn.Module):
def __init__(self, in_size, out_size, padding, batch_norm):
super(UNetConvBlock, self).__init__()
block = []
block.append(nn.Conv2d(in_size, out_size, kernel_size=3,
padding=int(padding)))
block.append(nn.ReLU())
# block.append(nn.LeakyReLU())
if batch_norm:
block.append(nn.BatchNorm2d(out_size))
block.append(nn.Conv2d(out_size, out_size, kernel_size=3,
padding=int(padding)))
block.append(nn.ReLU())
# block.append(nn.LeakyReLU())
if batch_norm:
block.append(nn.BatchNorm2d(out_size))
self.block = nn.Sequential(*block)
def forward(self, x):
out = self.block(x)
return out
class UNetUpBlock(nn.Module):
def __init__(self, in_size, out_size, up_mode, padding, batch_norm):
super(UNetUpBlock, self).__init__()
if up_mode == 'upconv':
self.up = nn.ConvTranspose2d(in_size, out_size, kernel_size=2,
stride=2)
elif up_mode == 'upsample':
self.up = nn.Sequential(nn.Upsample(mode='bilinear', scale_factor=2),
nn.Conv2d(in_size, out_size, kernel_size=1))
self.conv_block = UNetConvBlock(in_size, out_size, padding, batch_norm)
def center_crop(self, layer, target_size):
_, _, layer_height, layer_width = layer.size()
diff_y = (layer_height - target_size[0]) // 2
diff_x = (layer_width - target_size[1]) // 2
return layer[:, :, diff_y:(diff_y + target_size[0]), diff_x:(diff_x + target_size[1])]
def forward(self, x, bridge):
up = self.up(x)
crop1 = self.center_crop(bridge, up.shape[2:])
out = torch.cat([up, crop1], 1)
out = self.conv_block(out)
return out

View File

@ -0,0 +1,105 @@
import os
from datetime import datetime
import argparse
import torch.multiprocessing as mp
import torchvision
import torchvision.transforms as transforms
import torch
import torch.nn as nn
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP
from apex import amp
def main():
parser = argparse.ArgumentParser()
parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N',
help='number of data loading workers (default: 4)')
parser.add_argument('-g', '--gpus', default=1, type=int,
help='number of gpus per node')
parser.add_argument('-nr', '--nr', default=0, type=int,
help='ranking within the nodes')
parser.add_argument('--epochs', default=2, type=int, metavar='N',
help='number of total epochs to run')
args = parser.parse_args()
args.world_size = args.gpus * args.nodes
os.environ['MASTER_ADDR'] = '147.232.47.114'
os.environ['MASTER_PORT'] = '8888'
mp.spawn(train, nprocs=args.gpus, args=(args,))
class ConvNet(nn.Module):
def __init__(self, num_classes=10):
super(ConvNet, self).__init__()
self.layer1 = nn.Sequential(
nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
nn.BatchNorm2d(16),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2))
self.layer2 = nn.Sequential(
nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2))
self.fc = nn.Linear(7*7*32, num_classes)
def forward(self, x):
out = self.layer1(x)
out = self.layer2(out)
out = out.reshape(out.size(0), -1)
out = self.fc(out)
return out
def train(gpu, args):
rank = args.nr * args.gpus + gpu
dist.init_process_group(backend='nccl', init_method='env://', world_size=args.world_size, rank=rank)
torch.manual_seed(0)
model = ConvNet()
torch.cuda.set_device(gpu)
model.cuda(gpu)
batch_size = 10
# define loss function (criterion) and optimizer
criterion = nn.CrossEntropyLoss().cuda(gpu)
optimizer = torch.optim.SGD(model.parameters(), 1e-4)
# Wrap the model
model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
# Data loading code
train_dataset = torchvision.datasets.MNIST(root='./data',
train=True,
transform=transforms.ToTensor(),
download=True)
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset,
num_replicas=args.world_size,
rank=rank)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
batch_size=batch_size,
shuffle=False,
num_workers=0,
pin_memory=True,
sampler=train_sampler)
start = datetime.now()
total_step = len(train_loader)
for epoch in range(args.epochs):
for i, (images, labels) in enumerate(train_loader):
images = images.cuda(non_blocking=True)
labels = labels.cuda(non_blocking=True)
# Forward pass
outputs = model(images)
loss = criterion(outputs, labels)
# Backward and optimize
optimizer.zero_grad()
loss.backward()
optimizer.step()
if (i + 1) % 100 == 0 and gpu == 0:
print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,
loss.item()))
if gpu == 0:
print("Training complete in: " + str(datetime.now() - start))
if __name__ == '__main__':
torch.multiprocessing.set_start_method('spawn')
main()

View File

@ -0,0 +1,92 @@
import os
from datetime import datetime
import argparse
import torch.multiprocessing as mp
import torchvision
import torchvision.transforms as transforms
import torch
import torch.nn as nn
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP
from apex import amp
def main():
parser = argparse.ArgumentParser()
parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N',
help='number of nodes (default: 1)')
parser.add_argument('-g', '--gpus', default=1, type=int,
help='number of gpus per node')
parser.add_argument('-nr', '--nr', default=0, type=int,
help='ranking within the nodes')
parser.add_argument('--epochs', default=2, type=int, metavar='N',
help='number of total epochs to run')
args = parser.parse_args()
train(0, args)
class ConvNet(nn.Module):
def __init__(self, num_classes=10):
super(ConvNet, self).__init__()
self.layer1 = nn.Sequential(
nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
nn.BatchNorm2d(16),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2))
self.layer2 = nn.Sequential(
nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2))
self.fc = nn.Linear(7*7*32, num_classes)
def forward(self, x):
out = self.layer1(x)
out = self.layer2(out)
out = out.reshape(out.size(0), -1)
out = self.fc(out)
return out
def train(gpu, args):
model = ConvNet()
torch.cuda.set_device(gpu)
model.cuda(gpu)
batch_size = 50
# define loss function (criterion) and optimizer
criterion = nn.CrossEntropyLoss().cuda(gpu)
optimizer = torch.optim.SGD(model.parameters(), 1e-4)
# Data loading code
train_dataset = torchvision.datasets.MNIST(root='./data',
train=True,
transform=transforms.ToTensor(),
download=True)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
batch_size=batch_size,
shuffle=True,
num_workers=0,
pin_memory=True)
start = datetime.now()
total_step = len(train_loader)
for epoch in range(args.epochs):
for i, (images, labels) in enumerate(train_loader):
images = images.cuda(non_blocking=True)
labels = labels.cuda(non_blocking=True)
# Forward pass
outputs = model(images)
loss = criterion(outputs, labels)
# Backward and optimize
optimizer.zero_grad()
loss.backward()
optimizer.step()
if (i + 1) % 100 == 0 and gpu == 0:
print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,
loss.item()))
if gpu == 0:
print("Training complete in: " + str(datetime.now() - start))
if __name__ == '__main__':
main()

View File

@ -0,0 +1,748 @@
from argparse import Namespace
from collections import Counter
import json
import os
import re
import string
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tqdm.notebook import tqdm
class Vocabulary(object):
"""Class to process text and extract vocabulary for mapping"""
def __init__(self, token_to_idx=None, add_unk=True, unk_token="<UNK>"):
"""
Args:
token_to_idx (dict): a pre-existing map of tokens to indices
add_unk (bool): a flag that indicates whether to add the UNK token
unk_token (str): the UNK token to add into the Vocabulary
"""
if token_to_idx is None:
token_to_idx = {}
self._token_to_idx = token_to_idx
self._idx_to_token = {idx: token
for token, idx in self._token_to_idx.items()}
self._add_unk = add_unk
self._unk_token = unk_token
self.unk_index = -1
if add_unk:
self.unk_index = self.add_token(unk_token)
def to_serializable(self):
""" returns a dictionary that can be serialized """
return {'token_to_idx': self._token_to_idx,
'add_unk': self._add_unk,
'unk_token': self._unk_token}
@classmethod
def from_serializable(cls, contents):
""" instantiates the Vocabulary from a serialized dictionary """
return cls(**contents)
def add_token(self, token):
"""Update mapping dicts based on the token.
Args:
token (str): the item to add into the Vocabulary
Returns:
index (int): the integer corresponding to the token
"""
if token in self._token_to_idx:
index = self._token_to_idx[token]
else:
index = len(self._token_to_idx)
self._token_to_idx[token] = index
self._idx_to_token[index] = token
return index
def add_many(self, tokens):
"""Add a list of tokens into the Vocabulary
Args:
tokens (list): a list of string tokens
Returns:
indices (list): a list of indices corresponding to the tokens
"""
return [self.add_token(token) for token in tokens]
def lookup_token(self, token):
"""Retrieve the index associated with the token
or the UNK index if token isn't present.
Args:
token (str): the token to look up
Returns:
index (int): the index corresponding to the token
Notes:
`unk_index` needs to be >=0 (having been added into the Vocabulary)
for the UNK functionality
"""
if self.unk_index >= 0:
return self._token_to_idx.get(token, self.unk_index)
else:
return self._token_to_idx[token]
def lookup_index(self, index):
"""Return the token associated with the index
Args:
index (int): the index to look up
Returns:
token (str): the token corresponding to the index
Raises:
KeyError: if the index is not in the Vocabulary
"""
if index not in self._idx_to_token:
raise KeyError("the index (%d) is not in the Vocabulary" % index)
return self._idx_to_token[index]
def __str__(self):
return "<Vocabulary(size=%d)>" % len(self)
def __len__(self):
return len(self._token_to_idx)
class ReviewVectorizer(object):
""" The Vectorizer which coordinates the Vocabularies and puts them to use"""
def __init__(self, review_vocab, rating_vocab):
"""
Args:
review_vocab (Vocabulary): maps words to integers
rating_vocab (Vocabulary): maps class labels to integers
"""
self.review_vocab = review_vocab
self.rating_vocab = rating_vocab
def vectorize(self, review):
"""Create a collapsed one-hit vector for the review
Args:
review (str): the review
Returns:
one_hot (np.ndarray): the collapsed one-hot encoding
"""
one_hot = np.zeros(len(self.review_vocab), dtype=np.float32)
for token in review.split(" "):
if token not in string.punctuation:
one_hot[self.review_vocab.lookup_token(token)] = 1
return one_hot
@classmethod
def from_dataframe(cls, review_df, cutoff=25):
"""Instantiate the vectorizer from the dataset dataframe
Args:
review_df (pandas.DataFrame): the review dataset
cutoff (int): the parameter for frequency-based filtering
Returns:
an instance of the ReviewVectorizer
"""
review_vocab = Vocabulary(add_unk=True)
rating_vocab = Vocabulary(add_unk=False)
# Add ratings
for rating in sorted(set(review_df.rating)):
rating_vocab.add_token(rating)
# Add top words if count > provided count
word_counts = Counter()
for review in review_df.review:
for word in review.split(" "):
if word not in string.punctuation:
word_counts[word] += 1
for word, count in word_counts.items():
if count > cutoff:
review_vocab.add_token(word)
return cls(review_vocab, rating_vocab)
@classmethod
def from_serializable(cls, contents):
"""Instantiate a ReviewVectorizer from a serializable dictionary
Args:
contents (dict): the serializable dictionary
Returns:
an instance of the ReviewVectorizer class
"""
review_vocab = Vocabulary.from_serializable(contents['review_vocab'])
rating_vocab = Vocabulary.from_serializable(contents['rating_vocab'])
return cls(review_vocab=review_vocab, rating_vocab=rating_vocab)
def to_serializable(self):
"""Create the serializable dictionary for caching
Returns:
contents (dict): the serializable dictionary
"""
return {'review_vocab': self.review_vocab.to_serializable(),
'rating_vocab': self.rating_vocab.to_serializable()}
class ReviewDataset(Dataset):
def __init__(self, review_df, vectorizer):
"""
Args:
review_df (pandas.DataFrame): the dataset
vectorizer (ReviewVectorizer): vectorizer instantiated from dataset
"""
self.review_df = review_df
self._vectorizer = vectorizer
self.train_df = self.review_df[self.review_df.split=='train']
self.train_size = len(self.train_df)
self.val_df = self.review_df[self.review_df.split=='val']
self.validation_size = len(self.val_df)
self.test_df = self.review_df[self.review_df.split=='test']
self.test_size = len(self.test_df)
self._lookup_dict = {'train': (self.train_df, self.train_size),
'val': (self.val_df, self.validation_size),
'test': (self.test_df, self.test_size)}
self.set_split('train')
@classmethod
def load_dataset_and_make_vectorizer(cls, review_csv):
"""Load dataset and make a new vectorizer from scratch
Args:
review_csv (str): location of the dataset
Returns:
an instance of ReviewDataset
"""
review_df = pd.read_csv(review_csv)
train_review_df = review_df[review_df.split=='train']
return cls(review_df, ReviewVectorizer.from_dataframe(train_review_df))
@classmethod
def load_dataset_and_load_vectorizer(cls, review_csv, vectorizer_filepath):
"""Load dataset and the corresponding vectorizer.
Used in the case that the vectorizer has been cached for re-use
Args:
review_csv (str): location of the dataset
vectorizer_filepath (str): location of the saved vectorizer
Returns:
an instance of ReviewDataset
"""
review_df = pd.read_csv(review_csv)
vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
return cls(review_df, vectorizer)
@staticmethod
def load_vectorizer_only(vectorizer_filepath):
"""a static method for loading the vectorizer from file
Args:
vectorizer_filepath (str): the location of the serialized vectorizer
Returns:
an instance of ReviewVectorizer
"""
with open(vectorizer_filepath) as fp:
return ReviewVectorizer.from_serializable(json.load(fp))
def save_vectorizer(self, vectorizer_filepath):
"""saves the vectorizer to disk using json
Args:
vectorizer_filepath (str): the location to save the vectorizer
"""
with open(vectorizer_filepath, "w") as fp:
json.dump(self._vectorizer.to_serializable(), fp)
def get_vectorizer(self):
""" returns the vectorizer """
return self._vectorizer
def set_split(self, split="train"):
""" selects the splits in the dataset using a column in the dataframe
Args:
split (str): one of "train", "val", or "test"
"""
self._target_split = split
self._target_df, self._target_size = self._lookup_dict[split]
def __len__(self):
return self._target_size
def __getitem__(self, index):
"""the primary entry point method for PyTorch datasets
Args:
index (int): the index to the data point
Returns:
a dictionary holding the data point's features (x_data) and label (y_target)
"""
row = self._target_df.iloc[index]
review_vector = \
self._vectorizer.vectorize(row.review)
rating_index = \
self._vectorizer.rating_vocab.lookup_token(row.rating)
return {'x_data': review_vector,
'y_target': rating_index}
def get_num_batches(self, batch_size):
"""Given a batch size, return the number of batches in the dataset
Args:
batch_size (int)
Returns:
number of batches in the dataset
"""
return len(self) // batch_size
def generate_batches(dataset, batch_size, shuffle=True,
drop_last=True, device="cpu"):
"""
A generator function which wraps the PyTorch DataLoader. It will
ensure each tensor is on the right device.
"""
dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
shuffle=shuffle, drop_last=drop_last)
for data_dict in dataloader:
out_data_dict = {}
for name, tensor in data_dict.items():
out_data_dict[name] = data_dict[name].to(device)
yield out_data_dict
class ReviewClassifier(nn.Module):
""" a simple perceptron based classifier """
def __init__(self, num_features):
"""
Args:
num_features (int): the size of the input feature vector
"""
super(ReviewClassifier, self).__init__()
self.fc1 = nn.Linear(in_features=num_features,
out_features=1)
def forward(self, x_in, apply_sigmoid=False):
"""The forward pass of the classifier
Args:
x_in (torch.Tensor): an input data tensor.
x_in.shape should be (batch, num_features)
apply_sigmoid (bool): a flag for the sigmoid activation
should be false if used with the Cross Entropy losses
Returns:
the resulting tensor. tensor.shape should be (batch,)
"""
y_out = self.fc1(x_in).squeeze()
if apply_sigmoid:
y_out = torch.sigmoid(y_out)
return y_out
def make_train_state(args):
return {'stop_early': False,
'early_stopping_step': 0,
'early_stopping_best_val': 1e8,
'learning_rate': args.learning_rate,
'epoch_index': 0,
'train_loss': [],
'train_acc': [],
'val_loss': [],
'val_acc': [],
'test_loss': -1,
'test_acc': -1,
'model_filename': args.model_state_file}
def update_train_state(args, model, train_state):
"""Handle the training state updates.
Components:
- Early Stopping: Prevent overfitting.
- Model Checkpoint: Model is saved if the model is better
:param args: main arguments
:param model: model to train
:param train_state: a dictionary representing the training state values
:returns:
a new train_state
"""
# Save one model at least
if train_state['epoch_index'] == 0:
torch.save(model.state_dict(), train_state['model_filename'])
train_state['stop_early'] = False
# Save model if performance improved
elif train_state['epoch_index'] >= 1:
loss_tm1, loss_t = train_state['val_loss'][-2:]
# If loss worsened
if loss_t >= train_state['early_stopping_best_val']:
# Update step
train_state['early_stopping_step'] += 1
# Loss decreased
else:
# Save the best model
if loss_t < train_state['early_stopping_best_val']:
torch.save(model.state_dict(), train_state['model_filename'])
# Reset early stopping step
train_state['early_stopping_step'] = 0
# Stop early ?
train_state['stop_early'] = \
train_state['early_stopping_step'] >= args.early_stopping_criteria
return train_state
def compute_accuracy(y_pred, y_target):
y_target = y_target.cpu()
y_pred_indices = (torch.sigmoid(y_pred)>0.5).cpu().long()#.max(dim=1)[1]
n_correct = torch.eq(y_pred_indices, y_target).sum().item()
return n_correct / len(y_pred_indices) * 100
def set_seed_everywhere(seed, cuda):
np.random.seed(seed)
torch.manual_seed(seed)
if cuda:
torch.cuda.manual_seed_all(seed)
def handle_dirs(dirpath):
if not os.path.exists(dirpath):
os.makedirs(dirpath)
args = Namespace(
# Data and Path information
frequency_cutoff=25,
model_state_file='model.pth',
review_csv='data/yelp/reviews_with_splits_lite.csv',
# review_csv='data/yelp/reviews_with_splits_full.csv',
save_dir='model_storage/ch3/yelp/',
vectorizer_file='vectorizer.json',
# No Model hyper parameters
# Training hyper parameters
batch_size=128,
early_stopping_criteria=5,
learning_rate=0.001,
num_epochs=100,
seed=1337,
# Runtime options
catch_keyboard_interrupt=True,
cuda=True,
expand_filepaths_to_save_dir=True,
reload_from_files=False,
)
if args.expand_filepaths_to_save_dir:
args.vectorizer_file = os.path.join(args.save_dir,
args.vectorizer_file)
args.model_state_file = os.path.join(args.save_dir,
args.model_state_file)
print("Expanded filepaths: ")
print("\t{}".format(args.vectorizer_file))
print("\t{}".format(args.model_state_file))
# Check CUDA
if not torch.cuda.is_available():
args.cuda = False
if torch.cuda.device_count() > 1:
print("Pouzivam", torch.cuda.device_count(), "graficke karty!")
args.device = torch.device("cuda" if args.cuda else "cpu")
# Set seed for reproducibility
set_seed_everywhere(args.seed, args.cuda)
# handle dirs
handle_dirs(args.save_dir)
if args.reload_from_files:
# training from a checkpoint
print("Loading dataset and vectorizer")
dataset = ReviewDataset.load_dataset_and_load_vectorizer(args.review_csv,
args.vectorizer_file)
else:
print("Loading dataset and creating vectorizer")
# create dataset and vectorizer
dataset = ReviewDataset.load_dataset_and_make_vectorizer(args.review_csv)
dataset.save_vectorizer(args.vectorizer_file)
vectorizer = dataset.get_vectorizer()
classifier = ReviewClassifier(num_features=len(vectorizer.review_vocab))
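# DataParallel splits each input batch across all visible GPUs;
# the wrapped model is afterwards accessible as classifier.module.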
classifier = nn.DataParallel(classifier)
classifier = classifier.to(args.device)
loss_func = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer,
mode='min', factor=0.5,
patience=1)
train_state = make_train_state(args)
epoch_bar = tqdm(desc='training routine',
total=args.num_epochs,
position=0)
dataset.set_split('train')
train_bar = tqdm(desc='split=train',
total=dataset.get_num_batches(args.batch_size),
position=1,
leave=True)
dataset.set_split('val')
val_bar = tqdm(desc='split=val',
total=dataset.get_num_batches(args.batch_size),
position=1,
leave=True)
try:
for epoch_index in range(args.num_epochs):
train_state['epoch_index'] = epoch_index
# Iterate over training dataset
# setup: batch generator, set loss and acc to 0, set train mode on
dataset.set_split('train')
batch_generator = generate_batches(dataset,
batch_size=args.batch_size,
device=args.device)
running_loss = 0.0
running_acc = 0.0
classifier.train()
for batch_index, batch_dict in enumerate(batch_generator):
# the training routine is these 5 steps:
# --------------------------------------
# step 1. zero the gradients
optimizer.zero_grad()
# step 2. compute the output
y_pred = classifier(x_in=batch_dict['x_data'].float())
# step 3. compute the loss
loss = loss_func(y_pred, batch_dict['y_target'].float())
loss_t = loss.item()
running_loss += (loss_t - running_loss) / (batch_index + 1)
# step 4. use loss to produce gradients
loss.backward()
# step 5. use optimizer to take gradient step
optimizer.step()
# -----------------------------------------
# compute the accuracy
acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
running_acc += (acc_t - running_acc) / (batch_index + 1)
# update bar
train_bar.set_postfix(loss=running_loss,
acc=running_acc,
epoch=epoch_index)
train_bar.update()
train_state['train_loss'].append(running_loss)
train_state['train_acc'].append(running_acc)
# Iterate over val dataset
# setup: batch generator, set loss and acc to 0; set eval mode on
dataset.set_split('val')
batch_generator = generate_batches(dataset,
batch_size=args.batch_size,
device=args.device)
running_loss = 0.
running_acc = 0.
classifier.eval()
for batch_index, batch_dict in enumerate(batch_generator):
# compute the output
y_pred = classifier(x_in=batch_dict['x_data'].float())
# step 3. compute the loss
loss = loss_func(y_pred, batch_dict['y_target'].float())
loss_t = loss.item()
running_loss += (loss_t - running_loss) / (batch_index + 1)
# compute the accuracy
acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
running_acc += (acc_t - running_acc) / (batch_index + 1)
val_bar.set_postfix(loss=running_loss,
acc=running_acc,
epoch=epoch_index)
val_bar.update()
train_state['val_loss'].append(running_loss)
train_state['val_acc'].append(running_acc)
train_state = update_train_state(args=args, model=classifier,
train_state=train_state)
scheduler.step(train_state['val_loss'][-1])
train_bar.n = 0
val_bar.n = 0
epoch_bar.update()
if train_state['stop_early']:
break
train_bar.n = 0
val_bar.n = 0
epoch_bar.update()
except KeyboardInterrupt:
print("Exiting loop")
classifier.load_state_dict(torch.load(train_state['model_filename']))
classifier = classifier.to(args.device)
dataset.set_split('test')
batch_generator = generate_batches(dataset,
batch_size=args.batch_size,
device=args.device)
running_loss = 0.
running_acc = 0.
classifier.eval()
for batch_index, batch_dict in enumerate(batch_generator):
# compute the output
y_pred = classifier(x_in=batch_dict['x_data'].float())
# compute the loss
loss = loss_func(y_pred, batch_dict['y_target'].float())
loss_t = loss.item()
running_loss += (loss_t - running_loss) / (batch_index + 1)
# compute the accuracy
acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
running_acc += (acc_t - running_acc) / (batch_index + 1)
train_state['test_loss'] = running_loss
train_state['test_acc'] = running_acc
print("Test loss: {:.3f}".format(train_state['test_loss']))
print("Test Accuracy: {:.2f}".format(train_state['test_acc']))
def preprocess_text(text):
text = text.lower()
text = re.sub(r"([.,!?])", r" \1 ", text)
text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
return text
def predict_rating(review, classifier, vectorizer, decision_threshold=0.5):
"""Predict the rating of a review
Args:
review (str): the text of the review
classifier (ReviewClassifier): the trained model
vectorizer (ReviewVectorizer): the corresponding vectorizer
decision_threshold (float): The numerical boundary which separates the rating classes
"""
review = preprocess_text(review)
vectorized_review = torch.tensor(vectorizer.vectorize(review))
result = classifier(vectorized_review.view(1, -1))
probability_value = torch.sigmoid(result).item()
index = 1
if probability_value < decision_threshold:
index = 0
return vectorizer.rating_vocab.lookup_index(index)
test_review = "this is a pretty awesome book"
classifier = classifier.module.cpu()  # unwrap DataParallel before CPU inference so fc1 is directly accessible
prediction = predict_rating(test_review, classifier, vectorizer, decision_threshold=0.5)
print("{} -> {}".format(test_review, prediction))
# Sort weights
fc1_weights = classifier.fc1.weight.detach()[0]
_, indices = torch.sort(fc1_weights, dim=0, descending=True)
indices = indices.numpy().tolist()
# Top 20 words
print("Influential words in Positive Reviews:")
print("--------------------------------------")
for i in range(20):
print(vectorizer.review_vocab.lookup_index(indices[i]))
print("====\n\n\n")
# Top 20 negative words
print("Influential words in Negative Reviews:")
print("--------------------------------------")
indices.reverse()
for i in range(20):
print(vectorizer.review_vocab.lookup_index(indices[i]))

View File

@ -12,16 +12,42 @@ taxonomy:
Task backlog:
- Try to present at a local conference (Data, Znalosti and WIKT) or in the faculty proceedings (a short version of the diploma thesis).
- Use the MULTEXT-East corpus in training. Create a mapping of MULTEXT tags to SNK tags.
Virtual meeting 6.11.2020
Status:
- Read 2 articles (in detail) and made notes. The notes are on Git.
- Completed further experiments.
Tasks for the next meeting:
- Continue with the open tasks.
Virtual meeting 30.10.2020
Status:
- The files are on Git.
- Experiments carried out; the results are in a table.
- Instructions for running them are available.
- Technical problems resolved. A Conda environment is available.
Tasks for the next meeting:
- Study the literature on "pretrain" and "word embedding":
- [Healthcare NER Models Using Language Model Pretraining](http://ceur-ws.org/Vol-2551/paper-04.pdf)
- [Design and implementation of an open source Greek POS Tagger and Entity Recognizer using spaCy](https://ieeexplore.ieee.org/abstract/document/8909591)
- https://arxiv.org/abs/1909.00505
- https://arxiv.org/abs/1607.04606
- LSTM, recurrent neural networks
- Make notes from several articles; record the source and what you learned.
- Carry out several experiments with pretraining - different models, different sizes of adaptation data - and compile a table.
- Describe pretraining and summarize its influence on training in a short article of about 10 pages.
Virtual meeting 8.10.2020

View File

@ -21,6 +21,46 @@ The goal of the work is to prepare tools and build a so-called "Question Answering data
## Diploma Project 2
Task backlog:
- Can we find out how much time an annotator spent creating a question? If this can be determined from the DB schema, it would be good to display it in the web application.
Virtual meeting 27.10.2020
Status:
- The web application was finished according to the instructions from the last meeting; the code is on Git.
Tasks for the next meeting:
- Build a configuration system - load the configuration from a file (python-configuration?). It should be possible to change the name of the configuration file via an environment variable (getenv); see the sketch after this list.
- Add authentication for annotators when displaying results, so that an annotator sees only their own results. Is it necessary? For now implement it via e-mail only.
- Add a password to the web application.
- Add display of bad and good annotations for each annotator.
- Study the scientific literature on "Crowdsourcing language resources". Select several publications (Scholar, Scopus), write down the bibliographic reference and what you learned from them about creating language resources. What other corpora were created with this method?
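
A minimal sketch (illustration only, not the project code) of loading the configuration from a JSON file whose name comes from an environment variable; the variable name `CONFIG_FILE`, the default `config.json` and the key `database_url` are assumptions:

```python
import json
import os


def load_config():
    # The file name can be overridden through an environment variable;
    # CONFIG_FILE and config.json are hypothetical names.
    path = os.getenv("CONFIG_FILE", "config.json")
    with open(path, "r", encoding="utf-8") as fp:
        return json.load(fp)


if __name__ == "__main__":
    config = load_config()
    print(config.get("database_url"))  # hypothetical configuration key
```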
Virtual meeting 20.10.2020
Status:
- Improved the data preparation script, slight change of the interface (duplicated work due to a gap in communication).
Tasks for the next meeting:
- Finish the web application for determining the amount of annotated data.
- Debug the errors related to the new annotation scheme.
- Display the amount of annotated data.
- Display the amount of valid annotated data.
- Display the amount of validated data.
- Questions must not repeat within one paragraph. Every question must have an answer. Every question must be longer than 10 characters or longer than 2 words. The answer must have at least one word. The question must contain Slovak words. (A validity sketch follows after this list.)
- Send the results to the project repository as soon as possible, directory database_app.
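
A minimal sketch of the validity rules listed above; `contains_slovak_words` is a hypothetical helper that would check tokens against a Slovak word list:

```python
def contains_slovak_words(text):
    # Placeholder: a real implementation could look tokens up in a Slovak lexicon.
    return any(ch.isalpha() for ch in text)


def is_valid_annotation(question, answer, previous_questions):
    """Check one question-answer pair against the rules above."""
    q = question.strip()
    a = answer.strip()
    if q in previous_questions:               # questions must not repeat within a paragraph
        return False
    if len(q) <= 10 and len(q.split()) <= 2:  # longer than 10 characters or more than 2 words
        return False
    if len(a.split()) < 1:                    # the answer must have at least one word
        return False
    return contains_slovak_words(q)           # the question must contain Slovak words
```

Applying this function to all stored annotations and summing the results would also give the counts of valid annotations for the overview.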
Meeting 25.9.2020
Done:

View File

@ -6,10 +6,8 @@ taxonomy:
tag: [demo,nlp]
author: Daniel Hladek
---
# Martin Jancura
*Year of starting studies*: 2017
## Bachelor Project 2020
@ -31,9 +29,36 @@ Possible backends:
Task backlog:
- Prepare the backend.
- Prepare the frontend in JavaScript - in progress.
- Write the human-made translation into the database.
Virtual meeting 6.11.2020:
Status:
Work on the written part.
Tasks for the next meeting:
- Look for a library where we can use our own translation. Try to install OpenNMT.
- Go through the tutorial https://github.com/OpenNMT/OpenNMT-py#quickstart or a similar one.
- Propose how to connect the frontend and the backend (a sketch follows below).
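
A minimal sketch of one possible way to connect the two parts, assuming a Flask backend; the `/translate` endpoint and the `translate_text` placeholder are assumptions, and a real implementation would call the chosen translation library (e.g. OpenNMT) instead:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def translate_text(text):
    # Placeholder for the actual machine translation call (e.g. an OpenNMT model).
    return text[::-1]


@app.route("/translate", methods=["POST"])
def translate():
    data = request.get_json(force=True)
    return jsonify({"translation": translate_text(data.get("text", ""))})


if __name__ == "__main__":
    app.run(port=5000)
```

The JavaScript frontend would then POST `{"text": "..."}` to this endpoint and display the returned translation.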
Virtual meeting 23.10.2020:
Status:
- Created a frontend for communication with the Microsoft Translation API; it uses Axios and vanilla JavaScript.
Tasks for the next meeting:
- Look for a library where we can use our own translation. Try to install OpenNMT.
- Find out what the CORS policy means (see the note after this list).
- Continue writing the thesis; add a section on machine translation. Read the articles at https://opennmt.net/OpenNMT/references/ and make notes. Include the bibliographic reference and what you learned from each article.
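
On the CORS item: CORS (Cross-Origin Resource Sharing) is the browser policy that blocks a page served from one origin from reading responses from another origin unless the server explicitly allows it, which matters as soon as the frontend and backend run on different ports. A minimal sketch of allowing it in the Flask backend sketched above (the permissive `*` value is only for local experiments):

```python
from flask import Flask

app = Flask(__name__)  # or reuse the app from the sketch above


@app.after_request
def add_cors_headers(response):
    # Let the JavaScript frontend, served from a different origin, read the response.
    response.headers["Access-Control-Allow-Origin"] = "*"  # restrict to the real origin in production
    response.headers["Access-Control-Allow-Headers"] = "Content-Type"
    return response
```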
Virtual meeting 16.10:
Status:

View File

@ -31,7 +31,42 @@ Proposed assignment:
1. Propose possible improvements of the application you created.
Task backlog:
- Create a repository on Git named bp2010. You will put the code and documentation you create into it.
- Prepare a Docker image of your application according to https://pythonspeed.com/docker/
Virtual meeting 30.10.:
Status:
- Modified the existing "spacy-streamlit" application; the source code is on Git according to the instructions from the last meeting.
- It contains a form but no REST API.
Tasks for the next meeting:
- Continue writing. Read scientific articles on "dependency parsing" and make notes on what you learned. Record the source.
- Continue working on the demonstration web application.
Virtual meeting 19.10.:
Status:
- Notes for the bachelor thesis prepared and submitted; they contain excerpts from the literature.
- Repository created: https://git.kemt.fei.tuke.sk/mw223on/bp2020
- Installed and ran the Slovak spaCy model.
- Installed the spaCy REST API https://github.com/explosion/spacy-services
- Tried the displaCy demo with the Slovak model.
Tasks for the next meeting:
- Prepare a web application that will present dependency parsing and named entity recognition in Slovak. It should consist of a frontend and a backend.
- Write the required Python packages into a "requirements.txt" file.
- Create a script for installing the application with pip.
- Create a script for starting both the backend and the frontend. Put the results into the repository.
- Create a draft of the frontend (HTML + CSS).
- Look at the spaCy source code and find out what exactly the displacy.serve command does (a small sketch follows below).
- Put the results into the repository.
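
A minimal sketch of what `displacy.serve` does for the demo, assuming the Slovak model is installed under the hypothetical package name `sk_model`; `serve` starts a small built-in web server, while `displacy.render` returns the HTML markup that a custom backend could send to the frontend instead:

```python
import spacy
from spacy import displacy

# Hypothetical name of the installed Slovak spaCy model package.
nlp = spacy.load("sk_model")

doc = nlp("Košice sú druhé najväčšie mesto na Slovensku.")

# style="dep" draws the dependency tree, style="ent" highlights named entities.
html = displacy.render(doc, style="ent", page=True)

# displacy.serve wraps render() in a simple HTTP server (default port 5000).
displacy.serve(doc, style="dep")
```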
Virtual meeting 9.10.

View File

@ -20,8 +20,21 @@ Proposed assignment:
2. Create a language model with BERT or a similar method.
3. Evaluate the created language model and propose improvements.
Task backlog:
Virtual meeting 30.10.2020
Status:
- Prepared notes on seq2seq.
- Installed PyTorch and fairseq.
- Problems with the tutorial. A solution could be to use release version 0.9.0: pip install fairseq==0.9.0
For the next meeting:
- Resolve the technical problems.
- Go through the tutorial https://fairseq.readthedocs.io/en/latest/getting_started.html#training-a-new-model
- Go through the tutorial https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md or a similar one (a small loading sketch follows below).
- Study articles on BERT and make notes on what you learned, together with the source.
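
A minimal sketch, following the fairseq RoBERTa README linked above, of loading a pretrained checkpoint and querying it; the checkpoint directory `checkpoints/roberta.base` is an assumption and the exact hub methods may differ between fairseq releases:

```python
from fairseq.models.roberta import RobertaModel

# Hypothetical path to a downloaded or self-trained checkpoint directory.
roberta = RobertaModel.from_pretrained("checkpoints/roberta.base", checkpoint_file="model.pt")
roberta.eval()

# Encode a sentence into subword ids and extract contextual features.
tokens = roberta.encode("Hello world!")
features = roberta.extract_features(tokens)
print(features.shape)  # (1, sequence_length, hidden_size)

# Predict the most probable fillers for a masked position.
print(roberta.fill_mask("The capital of Slovakia is <mask>.", topk=3))
```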
Virtual meeting 16.10.2020

View File

@ -23,13 +23,50 @@ Experimental Raspberry Pi cluster for teaching cloud technologies
The goal of the project is to create a cheap home cluster for teaching cloud technologies.
Task backlog:
- Enable WSL2 and Docker Desktop if you use Windows.
Virtual meeting 30.10.
Status:
- Written overview prepared according to the instructions.
- Raspberry Pi OS installed in VirtualBox.
- Preliminary HW design prepared.
- Installed Docker Toolbox as well as Ubuntu with Docker.
- Got acquainted with Docker.
- Supervisor: HW purchase carried out - 5x RPi4 model B 8GB boards, 128GB SD cards 11 pcs, The Pi Hut Cluster Case for Raspberry Pi 4 pcs, 60W power supply and 18W Quick Charger Epico 1 pc, 220V cable and a socket with a switch.
For the next meeting:
- Is it possible to buy an official 5-port switch?
- Complete the purchase and agree on the way of handover. Sign the handover protocol.
- Use https://kind.sigs.k8s.io to simulate a cluster.
- Install https://microk8s.io/ and read the tutorials at https://ubuntu.com/tutorials/
- Go through https://kubernetes.io/docs/tutorials/hello-minikube/ or a similar tutorial.
Virtual meeting 16.10.
Status:
- Read the articles.
- Started the Docker tutorial from ZCT.
- The supervisor created access to a Jetson Xavier AGX2 with an ARM processor.
- Started the purchase of the Raspberry Pi and accessories.
Tasks for the next meeting:
- Prepare an overview of (at least 4) existing Raspberry Pi cluster solutions (to be submitted). What hardware and software did they use?
- power supply, cooling, network interconnection
- Get acquainted with https://www.raspberrypi.org/downloads/raspberry-pi-os/
- Install https://roboticsbackend.com/install-raspbian-desktop-on-a-virtual-machine-virtualbox/
- Write a detailed hardware design for building the Raspberry Pi cluster.
Meeting 29.9.
We agreed on the thesis assignment.
Suggestions for improvement (for the supervisor):
- Find out the financing conditions (estimate 350 EUR).

View File

@ -39,23 +39,35 @@ The learning works by you marking in the text which words belong to names of persons,
Your task will be to mark proper nouns in the text.
In Slovak, a proper noun usually begins with a capital letter, but it may also contain further words written in lower case.
If a proper noun contains another name within it, e.g. Nové Mesto nad Váhom, annotate it as a single unit.
- PER: names of persons
- LOC: geographical names
- ORG: names of organizations
- MISC: other names, e.g. product names.
In the text you will also come across words that name a geographical area but are not proper nouns (e.g. britská kolónia "British colony", londýnsky šerif "London sheriff"...). We do not consider such words to be named entities, so please do not mark them.
If there are no annotations to make in the text, the article is still valid, so choose the Accept option.
If the text consists of only one or a few words that carry no meaning on their own, the article is invalid, so choose the Reject option.
## Annotation Batches
Write your e-mail into the form so that it is possible to recognize who performed the annotation.
During annotation you can use keyboard shortcuts to simplify your work:
- 1, 2, 3, 4 - switching between entity types
- key "a" - Accept
- key "x" - Reject
- key "space" - Ignore
- key "backspace" or "del" - Undo
After annotating, do not forget to save your work (the icon in the top left corner, or "Ctrl + s").
### Trial Annotation Batch
The batch is aimed at collecting feedback from annotators in order to improve the interface and the annotation process.
{% include "forms/form.html.twig" with { form: forms('ner1') } %}

View File

@ -36,11 +36,12 @@ The learning works by you creating an example with a question and an answer. Participation
## Instructions for Annotators
First, a short article is displayed. Your task will be to read part of the article, think of a question about it and mark the answer in the text. The question must be unambiguous and the answer to the question must be found in the text of the article. You have about 50 seconds to mark one question.
1. Read the article. If the article is not suitable, click the red cross "Reject" (Tab and then 'x').
2. Write a question. If you cannot think of a question, click "Ignore" (Tab and then 'i').
3. Mark the answer with the mouse and click the green check mark "Accept" (key 'a'), then continue with another question for the same article or with a new article.
4. The same article will be shown to you 5 times; think of 5 different questions for it.
If the displayed text is unsuitable, reject it. Unsuitable text:
@ -61,6 +62,12 @@ If the displayed text is unsuitable, reject it. Unsuitable text:
4. <span style="color:pink">What is the lysosome used for?</span>
5. <span style="color:orange">What is autophagy?</span>
Example of an incorrect question:
1. What is the Golgi apparatus? - the answer is not found in the article.
2. What happens in dead cells? - the question is not unambiguous; the exact answer is not found in the article.
3. What is a normal physiological process? - the answer is not found in the article.
Write your e-mail into the form so that it is possible to recognize who performed the annotation.
## Annotation Batches