merge
This commit is contained in:
commit f5455a89b3
@@ -13,11 +13,28 @@ Repository with [source code](https://git.kemt.fei.tuke.sk/dl874wn/dp2021)

## Diploma Project 2 2020

Virtual meeting 6.11.2020

Status:

- Table with 5 experiments completed.
- Repository created.

For the next meeting:

- Upload the code to the repository.
- Record the dependencies (package names) in requirements.txt.
- Rework the experiment so that it accepts command-line arguments (sys.argv).
- Add a run script for each experiment. The script should contain the parameters the experiment was run with.
- Finish the report.
- Add an overview of punctuation restoration methods and a description of your own method to the theory section.
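The command-line-arguments task above names sys.argv directly; a minimal sketch of the requested rework (the argument names and usage string are illustrative, not from the project):

```python
import sys

def main(argv):
    # Expect: experiment.py <input_file> <epochs>
    if len(argv) != 3:
        print("usage: experiment.py <input_file> <epochs>")
        return 1
    input_file = argv[1]
    epochs = int(argv[2])
    print("running experiment on", input_file, "for", epochs, "epochs")
    return 0

if __name__ == "__main__":
    main(sys.argv)  # pass the real command-line arguments
```

The run script from the next task would then just invoke it, e.g. `python experiment.py data.txt 20`, so the exact parameters of each run are recorded.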
Virtual meeting 25.9.2020

Done:

- Script for evaluating the experiments.

Tasks for the next meeting:
@@ -21,8 +21,21 @@ Task backlog:

- Use the model to support annotation
- By the end of the winter semester, write a report in the form of an article.
- Create a system for determining the amount and type of annotated data. How many articles? How many entities of each type?
- Write down rules for validation. What annotation result counts as good? Do the annotated data need to be checked?

Virtual meeting 30.10.2020:

Status:

- Improved guide
- Tried exporting data and training a model from the database. Problem when training in spaCy - different results than when training via Prodigy
- Work on the text part of the thesis.

Tasks for the next meeting:
- Create a repository named dp2021 and add your scripts and notes there.
- Continue writing the thesis. Survey the literature on "named entity corpora" and take notes.
- Create a system for determining the amount and type of annotated data. How many articles? How many entities of each type? The resulting table will go into the thesis.
- Prepare for production annotation. Is the schema ready?

Virtual meeting 16.10.2020:
@@ -1 +1,40 @@

DP2021

## Diploma Project 2 2020
Status:
- Updated annotation schema (this is a test schema with custom data)
- Several annotations done; training in Prodigy - low accuracy = small amount of annotated data. Training in spaCy does not work yet.
- Statistics on the number of accepted and rejected annotations come from Prodigy: prodigy stats wikiart. So far 156 annotations (151 accept, 5 reject). To get an overview of the number of annotations per entity type, we need to write a script.
- Literature survey on Named Entity Corpus
- Building a corpus for NER – automatic creation of an already-annotated corpus from Wikipedia using DBpedia – it is an English corpus, but worth mentioning in a comparison of approaches
- Building a Massive Corpus for Named Entity Recognition using Free Open Data Sources - Daniel Specht Menezes, Pedro Savarese, Ruy L. Milidiú
- Comparison of approaches to corpus annotation (in terms of both accuracy and time) - manual, semi-manual
- Comparison of Annotating Methods for Named Entity Corpora - Kanako Komiya, Masaya Suzuki
- What a corpus is, the development cycle, corpus analysis (literature already used – the MATTER cycle)
- Natural Language Annotation for Machine Learning – James Pustejovsky, Amber Stubbs

Update 09.11.2020:
- Fixed the problem where training in spaCy did not work
- Test annotation of about 500 sentences done. Training results at 20 iterations: F-score 47% (same results when training in spaCy and in Prodigy)
- Statistics on the counts of individual entities: script count.py


## Diploma Project 1 2020

- building and starting the Docker container

```
./build-docker.sh
docker run -it -p 8080:8080 -v ${PWD}:/work prodigy bash
# (in my case:)
winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/annotation/work prodigy bash
```

### Running the annotation schema
- `dataminer.csv` articles downloaded from the wiki
- `cd ner`
- `./01_text_to_sent.sh` runs the *text_to_sent.py* script, which splits the articles into individual sentences
- `./02_ner_correct.sh` starts the NER annotation process with suggestions from the model
- `./03_ner_export.sh` exports the annotated data in the jsonl format needed for processing in spaCy
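The *text_to_sent.py* step above splits articles into sentences. Its implementation is not shown in this commit; the core of such a script can be sketched as a naive splitter (the Docker image installs NLTK, whose punkt tokenizer would handle abbreviations better than this regex - treat the function name and approach as assumptions):

```python
import re

def text_to_sentences(text):
    # Naive split on sentence-final punctuation followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(text_to_sentences("Prvá veta. Druhá veta!"))
```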
@@ -1,17 +1,16 @@
# > docker run -it -p 8080:8080 -v ${PWD}:/work prodigy bash
# > winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/annotation/work prodigy bash

FROM python:3.8
RUN mkdir /prodigy
WORKDIR /prodigy
COPY ./prodigy-1.9.6-cp36.cp37.cp38-cp36m.cp37m.cp38-linux_x86_64.whl /prodigy
RUN mkdir /work
COPY ./ner /work
RUN pip install prodigy-1.9.6-cp36.cp37.cp38-cp36m.cp37m.cp38-linux_x86_64.whl
RUN pip install https://files.kemt.fei.tuke.sk/models/spacy/sk_sk1-0.0.1.tar.gz
RUN pip install nltk
EXPOSE 8080
ENV PRODIGY_HOME /work
ENV PRODIGY_HOST 0.0.0.0
WORKDIR /work

# > docker run -it -p 8080:8080 -v ${PWD}:/work prodigy bash
# > winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/annotation-master/annotation/work prodigy bash

FROM python:3.8
RUN mkdir /prodigy
WORKDIR /prodigy
COPY ./prodigy-1.9.6-cp36.cp37.cp38-cp36m.cp37m.cp38-linux_x86_64.whl /prodigy
RUN mkdir /work
COPY ./ner /work/ner
RUN pip install uvicorn==0.11.5 prodigy-1.9.6-cp36.cp37.cp38-cp36m.cp37m.cp38-linux_x86_64.whl
RUN pip install https://files.kemt.fei.tuke.sk/models/spacy/sk_sk1-0.0.1.tar.gz
RUN pip install nltk
EXPOSE 8080
ENV PRODIGY_HOME /work
ENV PRODIGY_HOST 0.0.0.0
WORKDIR /work
@@ -1,13 +1,11 @@
## Diploma Project 1 2020
## Diploma Project 2 2020

- building and starting the Docker container

```
./build-docker.sh
docker run -it -p 8080:8080 -v ${PWD}:/work prodigy bash
# (in my case:)
winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/annotation/work prodigy bash
winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/annotation-master/annotation/work prodigy bash
```

@@ -17,5 +15,12 @@ winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/ann
- `dataminer.csv` articles downloaded from the wiki
- `cd ner`
- `./01_text_to_sent.sh` runs the *text_to_sent.py* script, which splits the articles into individual sentences
- `./02_ner_correct.sh` starts the NER annotation process with suggestions from the model
- `./03_ner_export.sh` exports the annotated data in the jsonl format needed for processing in spaCy
- `./02_ner_manual.sh` starts the manual NER annotation process
- `./03_export.sh` exports the annotated data in the json format needed for processing in spaCy. Option to split into training (70%) and test (30%) data (--eval-split 0.3).

### Statistics about the annotated data
- `prodigy stats wikiart` - information about the number of accepted and rejected articles
- `python3 count.py` - information about the counts of individual entities

### Model training
Based on: https://git.kemt.fei.tuke.sk/dano/spacy-skmodel
@@ -0,0 +1,14 @@
# load data
filename = 'ner/annotations.jsonl'
file = open(filename, 'rt', encoding='utf-8')
text = file.read()

# count entity labels (note: these are raw substring counts over the
# whole file, so any other occurrence of e.g. 'PER' is counted too)
countPER = text.count('PER')
countLOC = text.count('LOC')
countORG = text.count('ORG')
countMISC = text.count('MISC')
print('Number of annotated PER entities:', countPER, '\n',
      'Number of annotated LOC entities:', countLOC, '\n',
      'Number of annotated ORG entities:', countORG, '\n',
      'Number of annotated MISC entities:', countMISC, '\n')
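The substring counting above can over-count, since any occurrence of the letters 'PER' anywhere in the file matches. A sketch of a stricter count that parses each JSONL line and tallies only span labels of accepted records (the `answer`/`spans`/`label` field names follow Prodigy's export format; treat them as assumptions):

```python
import json
from collections import Counter

def count_entities(jsonl_lines):
    # Tally span labels from accepted annotation records only.
    counts = Counter()
    for line in jsonl_lines:
        record = json.loads(line)
        if record.get("answer") != "accept":
            continue
        for span in record.get("spans", []):
            counts[span["label"]] += 1
    return counts

# Two inline sample records instead of reading ner/annotations.jsonl
sample = [
    '{"answer": "accept", "spans": [{"label": "PER"}, {"label": "LOC"}]}',
    '{"answer": "reject", "spans": [{"label": "ORG"}]}',
]
print(count_entities(sample))
```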
@@ -1,3 +0,0 @@

prodigy ner.correct wikiart sk_sk1 ./textfile.csv --label OSOBA,MIESTO,ORGANIZACIA,PRODUKT

@@ -0,0 +1,2 @@
prodigy ner.manual wikiart sk_sk1 ./textfile.csv --label PER,LOC,ORG,MISC

@@ -0,0 +1 @@
prodigy data-to-spacy ./train.json ./eval.json --lang sk --ner wikiart --eval-split 0.3
@@ -1 +0,0 @@
prodigy db-out wikiart > ./annotations.jsonl
@@ -0,0 +1,19 @@
mkdir -p build
mkdir -p build/input
# Prepare Treebank
mkdir -p build/input/slovak-treebank
spacy convert ./sources/slovak-treebank/stb.conll ./build/input/slovak-treebank
# UDAG used as evaluation
mkdir -p build/input/ud-artificial-gapping
spacy convert ./sources/ud-artificial-gapping/sk-ud-crawled-orphan.conllu ./build/input/ud-artificial-gapping
# Prepare skner
mkdir -p build/input/skner
# Convert to IOB
cat ./sources/skner/wikiann-sk.bio | python ./sources/bio-to-iob.py > build/input/skner/wikiann-sk.iob
# Split to train and test
cat ./build/input/skner/wikiann-sk.iob | python ./sources/iob-to-traintest.py ./build/input/skner/wikiann-sk
# Convert train and test
mkdir -p build/input/skner-train
spacy convert -n 15 --converter ner ./build/input/skner/wikiann-sk.train ./build/input/skner-train
mkdir -p build/input/skner-test
spacy convert -n 15 --converter ner ./build/input/skner/wikiann-sk.test ./build/input/skner-test
@@ -0,0 +1,19 @@
set -e
OUTDIR=build/train/output
TRAINDIR=build/train
mkdir -p $TRAINDIR
mkdir -p $OUTDIR
mkdir -p dist
# Delete old training results
rm -rf $OUTDIR/*
# Train dependency and POS
spacy train sk $OUTDIR ./build/input/slovak-treebank ./build/input/ud-artificial-gapping --n-iter 20 -p tagger,parser
rm -rf $TRAINDIR/posparser
mv $OUTDIR/model-best $TRAINDIR/posparser
# Train NER
# python ./train.py -t ./train.json -o $TRAINDIR/nerposparser -n 10 -m $TRAINDIR/posparser/
spacy train sk $TRAINDIR/nerposparser ./ner/train.json ./ner/eval.json --n-iter 20 -p ner
# Package model
spacy package $TRAINDIR/nerposparser dist --meta-path ./meta.json --force
cd dist/sk_sk1-0.2.0
python ./setup.py sdist --dist-dir ../
@@ -31,11 +31,39 @@ Task backlog:

- Make a public demo - deployment using Docker
- Improve the web UI
- Create a REST API for indexing a document.
- In the index, assign a score to each document using several methods, e.g. PageRank
- Use that scoring during search
- **Use the SCNC validation database to evaluate each method**
- **By the end of the winter semester, write a "mini diploma thesis, about 8 pages with experiments" in the form of an article**

Virtual meeting 6.11.2020:

Status:

- Working through problems with Cassandra and JavaScript. How does the function then work?

Tasks for the next meeting:

- Write a function for indexing. The input is a document (an object with text and meta-information). The function indexes the document into ES.
- Learn how the function then works and what a callback is.
- Learn how Promise is used.
- Learn how async - await works.
- https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Asynchronous/
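The then/callback/Promise/async-await study items above can be condensed into one small sketch (plain Node.js; no project code is assumed):

```javascript
// A Promise represents a value that becomes available later.
function fetchTitle(id) {
  return new Promise((resolve, reject) => {
    if (id > 0) {
      resolve(`title-${id}`);      // success path
    } else {
      reject(new Error('bad id')); // failure path
    }
  });
}

// then/catch style: the callback passed to then runs once the value is ready.
fetchTitle(1)
  .then(title => console.log('then:', title))
  .catch(err => console.error('then failed:', err.message));

// async/await style: same Promise, but sequential-looking code.
async function main() {
  try {
    const title = await fetchTitle(2);
    console.log('await:', title);
  } catch (err) {
    console.error('await failed:', err.message);
  }
}
main();
```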
Virtual meeting 23.10.2020:

Status:
- Working through problems with Cassandra. How to select data by the primary key.

For the next meeting:

- Continue with the open tasks.
- Write a function for indexing a single document.

Virtual meeting 16.10.

Status:

pages/students/2016/jan_holp/dp2021/zdrojove_subory/cassandra.js (105 lines) Normal file
@@ -0,0 +1,105 @@
//Jan Holp, DP 2021

//client1 = cassandra
//client2 = elasticsearch
//-----------------------------------------------------------------

//require the Elasticsearch library
const elasticsearch = require('elasticsearch');
const client2 = new elasticsearch.Client({
    hosts: [ 'localhost:9200']
});
client2.ping({
    requestTimeout: 30000,
}, function(error) {
    // at this point, Elasticsearch is down, please check your Elasticsearch service
    if (error) {
        console.error('Elasticsearch cluster is down!');
    } else {
        console.log('Everything is ok');
    }
});

//create new index skweb2
client2.indices.create({
    index: 'skweb2'
}, function(error, response, status) {
    if (error) {
        console.log(error);
    } else {
        console.log("created a new index", response);
    }
});

const cassandra = require('cassandra-driver');
const client1 = new cassandra.Client({ contactPoints: ['localhost:9042'], localDataCenter: 'datacenter1', keyspace: 'websucker' });
const query = 'SELECT title FROM websucker.content WHERE body_size > 0 ALLOW FILTERING';
client1.execute(query)
    .then(result => console.log(result))
    .catch(error => {
        // the original `,function(error) {...}` after then() was never
        // invoked; errors belong in a catch handler
        console.error('Something is wrong!');
        console.log(error);
    });

/*
async function indexData() {

    var i = 0;
    const query = 'SELECT title FROM websucker.content WHERE body_size > 0 ALLOW FILTERING';
    client1.execute(query)
    .then((result) => {
        try {
            //for ( i=0; i<15;i++){
            console.log('%s', result.row[0].title)
            //}
        } catch (query) {
            if (query instanceof SyntaxError) {
                console.log( "Invalid query" );
            }
        }
    });
}
*/

/*
//indexing method
const bulkIndex = function bulkIndex(index, type, data) {
    let bulkBody = [];
    id = 1;
    const errorCount = 0;
    data.forEach(item => {
        bulkBody.push({
            index: {
                _index: index,
                _type: type,
                _id : id++,
            }
        });
        bulkBody.push(item);
    });
    console.log(bulkBody);
    client.bulk({body: bulkBody})
    .then(response => {
        response.items.forEach(item => {
            if (item.index && item.index.error) {
                console.log(++errorCount, item.index.error);
            }
        });
        console.log(
            `Successfully indexed ${data.length - errorCount}
            out of ${data.length} items`
        );
    })
    .catch(console.err);
};
*/
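The task list earlier asks for a function that indexes a single document into ES. A sketch against the same legacy `elasticsearch` client API used in the file above (the helper name and document field names are hypothetical; in the project, `client` would be the `client2` instance and `skweb2` the index it creates):

```javascript
// Sketch: index one document (an object with text and meta-information).
function indexDocument(client, doc, id) {
  return client.index({
    index: 'skweb2',
    type: '_doc',
    id: String(id),
    body: { title: doc.title, text: doc.text },
  });
}

// Demo with a stand-in client, so the sketch runs without a live cluster.
const fakeClient = {
  index: (params) => {
    console.log('indexing', params.id, 'into', params.index);
    return Promise.resolve(params);
  },
};
indexDocument(fakeClient, { title: 'Test', text: 'Body' }, 1);
```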
@@ -23,13 +23,26 @@ Task backlog:
- tesla
- xavier
- Training on two cards in one machine
- idoc
- idoc DONE
- titan
- maybe training on 4 cards in one machine
- quadra
- *Training on two cards on two machines using NCCL (idoc, tesla)*
- maybe training on 2 cards on two machines (quadra plus idoc).

Virtual meeting 27.10.2020

Status:

- Training on the CPU, on 1 GPU, and on 2 GPUs on idoc
- Preparing materials for training on two machines using PyTorch.
- Access to tesla and xavier created.

Tasks for the next meeting:
- Study the literature and take notes.
- Continue with the open tasks from the backlog
- Store the finished scripts in a GIT repository
- create a dp2021 repository

Meeting 2.10.2020
@@ -1 +1,4 @@
## All scripts, files, and configurations

https://github.com/pytorch/examples/tree/master/imagenet
- should work for DDP; the imagenet archive is not available from the official site
Binary file not shown.
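The notes above mention training on two machines via NCCL with PyTorch. A minimal sketch of the process-group setup that distributed training needs (this is an assumption about the approach, not the project's script; `gloo` is used here so the snippet runs on CPU, whereas the multi-GPU runs would use `nccl`):

```python
import os

import torch
import torch.distributed as dist

def init_worker(rank, world_size):
    # One process per GPU; MASTER_ADDR points at the first node.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # 'nccl' for multi-GPU/multi-node training; 'gloo' works on CPU.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

init_worker(0, 1)
t = torch.ones(3)
dist.all_reduce(t)  # sums the tensor across all workers
print(t.tolist())
dist.destroy_process_group()
```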
@@ -0,0 +1,76 @@
import argparse
import datetime
import os
import socket
import sys

import numpy as np
from torch.utils.tensorboard import SummaryWriter

import torch
import torch.nn as nn
import torch.optim

from torch.optim import SGD, Adam
from torch.utils.data import DataLoader

from util.util import enumerateWithEstimate
from p2ch13.dsets import Luna2dSegmentationDataset, TrainingLuna2dSegmentationDataset, getCt
from util.logconf import logging
from util.util import xyz2irc
from p2ch13.model_seg import UNetWrapper, SegmentationAugmentation
from p2ch13.train_seg import LunaTrainingApp

log = logging.getLogger(__name__)
# log.setLevel(logging.WARN)
# log.setLevel(logging.INFO)
log.setLevel(logging.DEBUG)

class BenchmarkLuna2dSegmentationDataset(TrainingLuna2dSegmentationDataset):
    def __len__(self):
        # return 500
        return 5000
        # return 1000  (was an unreachable second return; kept as an alternative size)

class LunaBenchmarkApp(LunaTrainingApp):
    def initTrainDl(self):
        train_ds = BenchmarkLuna2dSegmentationDataset(
            val_stride=10,
            isValSet_bool=False,
            contextSlices_count=3,
            # augmentation_dict=self.augmentation_dict,
        )

        batch_size = self.cli_args.batch_size
        if self.use_cuda:
            batch_size *= torch.cuda.device_count()

        train_dl = DataLoader(
            train_ds,
            batch_size=batch_size,
            num_workers=self.cli_args.num_workers,
            pin_memory=self.use_cuda,
        )

        return train_dl

    def main(self):
        log.info("Starting {}, {}".format(type(self).__name__, self.cli_args))

        train_dl = self.initTrainDl()

        for epoch_ndx in range(1, 2):
            log.info("Epoch {} of {}, {}/{} batches of size {}*{}".format(
                epoch_ndx,
                self.cli_args.epochs,
                len(train_dl),
                len([]),
                self.cli_args.batch_size,
                (torch.cuda.device_count() if self.use_cuda else 1),
            ))

            self.doTraining(epoch_ndx, train_dl)


if __name__ == '__main__':
    LunaBenchmarkApp().main()
@ -0,0 +1,401 @@
|
||||
import copy
|
||||
import csv
|
||||
import functools
|
||||
import glob
|
||||
import math
|
||||
import os
|
||||
import random
|
||||
|
||||
from collections import namedtuple
|
||||
|
||||
import SimpleITK as sitk
|
||||
import numpy as np
|
||||
import scipy.ndimage.morphology as morph
|
||||
|
||||
import torch
|
||||
import torch.cuda
|
||||
import torch.nn.functional as F
|
||||
from torch.utils.data import Dataset
|
||||
|
||||
from util.disk import getCache
|
||||
from util.util import XyzTuple, xyz2irc
|
||||
from util.logconf import logging
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
# log.setLevel(logging.WARN)
|
||||
# log.setLevel(logging.INFO)
|
||||
log.setLevel(logging.DEBUG)
|
||||
|
||||
raw_cache = getCache('part2ch13_raw')
|
||||
|
||||
MaskTuple = namedtuple('MaskTuple', 'raw_dense_mask, dense_mask, body_mask, air_mask, raw_candidate_mask, candidate_mask, lung_mask, neg_mask, pos_mask')
|
||||
|
||||
CandidateInfoTuple = namedtuple('CandidateInfoTuple', 'isNodule_bool, hasAnnotation_bool, isMal_bool, diameter_mm, series_uid, center_xyz')
|
||||
|
||||
@functools.lru_cache(1)
|
||||
def getCandidateInfoList(requireOnDisk_bool=True):
|
||||
# We construct a set with all series_uids that are present on disk.
|
||||
# This will let us use the data, even if we haven't downloaded all of
|
||||
# the subsets yet.
|
||||
mhd_list = glob.glob('data-unversioned/subset*/*.mhd')
|
||||
presentOnDisk_set = {os.path.split(p)[-1][:-4] for p in mhd_list}
|
||||
|
||||
candidateInfo_list = []
|
||||
with open('data/annotations_with_malignancy.csv', "r") as f:
|
||||
for row in list(csv.reader(f))[1:]:
|
||||
series_uid = row[0]
|
||||
annotationCenter_xyz = tuple([float(x) for x in row[1:4]])
|
||||
annotationDiameter_mm = float(row[4])
|
||||
isMal_bool = {'False': False, 'True': True}[row[5]]
|
||||
|
||||
candidateInfo_list.append(
|
||||
CandidateInfoTuple(
|
||||
True,
|
||||
True,
|
||||
isMal_bool,
|
||||
annotationDiameter_mm,
|
||||
series_uid,
|
||||
annotationCenter_xyz,
|
||||
)
|
||||
)
|
||||
|
||||
with open('data/candidates.csv', "r") as f:
|
||||
for row in list(csv.reader(f))[1:]:
|
||||
series_uid = row[0]
|
||||
|
||||
if series_uid not in presentOnDisk_set and requireOnDisk_bool:
|
||||
continue
|
||||
|
||||
isNodule_bool = bool(int(row[4]))
|
||||
candidateCenter_xyz = tuple([float(x) for x in row[1:4]])
|
||||
|
||||
if not isNodule_bool:
|
||||
candidateInfo_list.append(
|
||||
CandidateInfoTuple(
|
||||
False,
|
||||
False,
|
||||
False,
|
||||
0.0,
|
||||
series_uid,
|
||||
candidateCenter_xyz,
|
||||
)
|
||||
)
|
||||
|
||||
candidateInfo_list.sort(reverse=True)
|
||||
return candidateInfo_list
|
||||
|
||||
@functools.lru_cache(1)
|
||||
def getCandidateInfoDict(requireOnDisk_bool=True):
|
||||
candidateInfo_list = getCandidateInfoList(requireOnDisk_bool)
|
||||
candidateInfo_dict = {}
|
||||
|
||||
for candidateInfo_tup in candidateInfo_list:
|
||||
candidateInfo_dict.setdefault(candidateInfo_tup.series_uid,
|
||||
[]).append(candidateInfo_tup)
|
||||
|
||||
return candidateInfo_dict
|
||||
|
||||
class Ct:
|
||||
def __init__(self, series_uid):
|
||||
mhd_path = glob.glob(
|
||||
'data-unversioned/subset*/{}.mhd'.format(series_uid)
|
||||
)[0]
|
||||
|
||||
ct_mhd = sitk.ReadImage(mhd_path)
|
||||
self.hu_a = np.array(sitk.GetArrayFromImage(ct_mhd), dtype=np.float32)
|
||||
|
||||
# CTs are natively expressed in https://en.wikipedia.org/wiki/Hounsfield_scale
|
||||
# HU are scaled oddly, with 0 g/cc (air, approximately) being -1000 and 1 g/cc (water) being 0.
|
||||
|
||||
self.series_uid = series_uid
|
||||
|
||||
self.origin_xyz = XyzTuple(*ct_mhd.GetOrigin())
|
||||
self.vxSize_xyz = XyzTuple(*ct_mhd.GetSpacing())
|
||||
self.direction_a = np.array(ct_mhd.GetDirection()).reshape(3, 3)
|
||||
|
||||
candidateInfo_list = getCandidateInfoDict()[self.series_uid]
|
||||
|
||||
self.positiveInfo_list = [
|
||||
candidate_tup
|
||||
for candidate_tup in candidateInfo_list
|
||||
if candidate_tup.isNodule_bool
|
||||
]
|
||||
self.positive_mask = self.buildAnnotationMask(self.positiveInfo_list)
|
||||
self.positive_indexes = (self.positive_mask.sum(axis=(1,2))
|
||||
.nonzero()[0].tolist())
|
||||
|
||||
def buildAnnotationMask(self, positiveInfo_list, threshold_hu = -700):
|
||||
boundingBox_a = np.zeros_like(self.hu_a, dtype=np.bool)
|
||||
|
||||
for candidateInfo_tup in positiveInfo_list:
|
||||
center_irc = xyz2irc(
|
||||
candidateInfo_tup.center_xyz,
|
||||
self.origin_xyz,
|
||||
self.vxSize_xyz,
|
||||
self.direction_a,
|
||||
)
|
||||
ci = int(center_irc.index)
|
||||
cr = int(center_irc.row)
|
||||
cc = int(center_irc.col)
|
||||
|
||||
index_radius = 2
|
||||
try:
|
||||
while self.hu_a[ci + index_radius, cr, cc] > threshold_hu and \
|
||||
self.hu_a[ci - index_radius, cr, cc] > threshold_hu:
|
||||
index_radius += 1
|
||||
except IndexError:
|
||||
index_radius -= 1
|
||||
|
||||
row_radius = 2
|
||||
try:
|
||||
while self.hu_a[ci, cr + row_radius, cc] > threshold_hu and \
|
||||
self.hu_a[ci, cr - row_radius, cc] > threshold_hu:
|
||||
row_radius += 1
|
||||
except IndexError:
|
||||
row_radius -= 1
|
||||
|
||||
col_radius = 2
|
||||
try:
|
||||
while self.hu_a[ci, cr, cc + col_radius] > threshold_hu and \
|
||||
self.hu_a[ci, cr, cc - col_radius] > threshold_hu:
|
||||
col_radius += 1
|
||||
except IndexError:
|
||||
col_radius -= 1
|
||||
|
||||
# assert index_radius > 0, repr([candidateInfo_tup.center_xyz, center_irc, self.hu_a[ci, cr, cc]])
|
||||
# assert row_radius > 0
|
||||
# assert col_radius > 0
|
||||
|
||||
boundingBox_a[
|
||||
ci - index_radius: ci + index_radius + 1,
|
||||
cr - row_radius: cr + row_radius + 1,
|
||||
cc - col_radius: cc + col_radius + 1] = True
|
||||
|
||||
mask_a = boundingBox_a & (self.hu_a > threshold_hu)
|
||||
|
||||
return mask_a
|
||||
|
||||
def getRawCandidate(self, center_xyz, width_irc):
|
||||
center_irc = xyz2irc(center_xyz, self.origin_xyz, self.vxSize_xyz,
|
||||
self.direction_a)
|
||||
|
||||
slice_list = []
|
||||
for axis, center_val in enumerate(center_irc):
|
||||
start_ndx = int(round(center_val - width_irc[axis]/2))
|
||||
end_ndx = int(start_ndx + width_irc[axis])
|
||||
|
||||
assert center_val >= 0 and center_val < self.hu_a.shape[axis], repr([self.series_uid, center_xyz, self.origin_xyz, self.vxSize_xyz, center_irc, axis])
|
||||
|
||||
if start_ndx < 0:
|
||||
# log.warning("Crop outside of CT array: {} {}, center:{} shape:{} width:{}".format(
|
||||
# self.series_uid, center_xyz, center_irc, self.hu_a.shape, width_irc))
|
||||
start_ndx = 0
|
||||
end_ndx = int(width_irc[axis])
|
||||
|
||||
if end_ndx > self.hu_a.shape[axis]:
|
||||
# log.warning("Crop outside of CT array: {} {}, center:{} shape:{} width:{}".format(
|
||||
# self.series_uid, center_xyz, center_irc, self.hu_a.shape, width_irc))
|
||||
end_ndx = self.hu_a.shape[axis]
|
||||
start_ndx = int(self.hu_a.shape[axis] - width_irc[axis])
|
||||
|
||||
slice_list.append(slice(start_ndx, end_ndx))
|
||||
|
||||
ct_chunk = self.hu_a[tuple(slice_list)]
|
||||
pos_chunk = self.positive_mask[tuple(slice_list)]
|
||||
|
||||
return ct_chunk, pos_chunk, center_irc
|
||||
|
||||
@functools.lru_cache(1, typed=True)
|
||||
def getCt(series_uid):
|
||||
return Ct(series_uid)
|
||||
|
||||
@raw_cache.memoize(typed=True)
|
||||
def getCtRawCandidate(series_uid, center_xyz, width_irc):
|
||||
ct = getCt(series_uid)
|
||||
ct_chunk, pos_chunk, center_irc = ct.getRawCandidate(center_xyz,
|
||||
width_irc)
|
||||
ct_chunk.clip(-1000, 1000, ct_chunk)
|
||||
return ct_chunk, pos_chunk, center_irc
|
||||
|
||||
@raw_cache.memoize(typed=True)
|
||||
def getCtSampleSize(series_uid):
|
||||
ct = Ct(series_uid)
|
||||
return int(ct.hu_a.shape[0]), ct.positive_indexes
|
||||
|
||||
|
||||
class Luna2dSegmentationDataset(Dataset):
|
||||
def __init__(self,
|
||||
val_stride=0,
|
||||
isValSet_bool=None,
|
||||
series_uid=None,
|
||||
contextSlices_count=3,
|
||||
fullCt_bool=False,
|
||||
):
|
||||
self.contextSlices_count = contextSlices_count
|
||||
self.fullCt_bool = fullCt_bool
|
||||
|
||||
if series_uid:
|
||||
self.series_list = [series_uid]
|
||||
else:
|
||||
self.series_list = sorted(getCandidateInfoDict().keys())
|
||||
|
||||
if isValSet_bool:
|
||||
assert val_stride > 0, val_stride
|
||||
self.series_list = self.series_list[::val_stride]
|
||||
assert self.series_list
|
||||
elif val_stride > 0:
|
||||
del self.series_list[::val_stride]
|
||||
assert self.series_list
|
||||
|
||||
self.sample_list = []
|
||||
for series_uid in self.series_list:
|
||||
index_count, positive_indexes = getCtSampleSize(series_uid)
|
||||
|
||||
if self.fullCt_bool:
|
||||
self.sample_list += [(series_uid, slice_ndx)
|
||||
for slice_ndx in range(index_count)]
|
||||
else:
|
||||
self.sample_list += [(series_uid, slice_ndx)
|
||||
for slice_ndx in positive_indexes]
|
||||
|
||||
self.candidateInfo_list = getCandidateInfoList()
|
||||
|
||||
series_set = set(self.series_list)
|
||||
self.candidateInfo_list = [cit for cit in self.candidateInfo_list
|
||||
if cit.series_uid in series_set]
|
||||
|
||||
self.pos_list = [nt for nt in self.candidateInfo_list
|
||||
if nt.isNodule_bool]
|
||||
|
||||
log.info("{!r}: {} {} series, {} slices, {} nodules".format(
|
||||
self,
|
||||
len(self.series_list),
|
||||
{None: 'general', True: 'validation', False: 'training'}[isValSet_bool],
|
||||
len(self.sample_list),
|
||||
len(self.pos_list),
|
||||
))
|
||||
|
||||
def __len__(self):
|
||||
return len(self.sample_list)
|
||||
|
||||
def __getitem__(self, ndx):
|
||||
series_uid, slice_ndx = self.sample_list[ndx % len(self.sample_list)]
|
||||
return self.getitem_fullSlice(series_uid, slice_ndx)
|
||||
|
||||
def getitem_fullSlice(self, series_uid, slice_ndx):
|
||||
ct = getCt(series_uid)
|
||||
ct_t = torch.zeros((self.contextSlices_count * 2 + 1, 512, 512))
|
||||
|
||||
start_ndx = slice_ndx - self.contextSlices_count
|
||||
end_ndx = slice_ndx + self.contextSlices_count + 1
|
||||
for i, context_ndx in enumerate(range(start_ndx, end_ndx)):
|
||||
context_ndx = max(context_ndx, 0)
|
||||
context_ndx = min(context_ndx, ct.hu_a.shape[0] - 1)
|
||||
ct_t[i] = torch.from_numpy(ct.hu_a[context_ndx].astype(np.float32))
|
||||
|
||||
# CTs are natively expressed in https://en.wikipedia.org/wiki/Hounsfield_scale
|
||||
# HU are scaled oddly, with 0 g/cc (air, approximately) being -1000 and 1 g/cc (water) being 0.
|
||||
# The lower bound gets rid of negative density stuff used to indicate out-of-FOV
|
||||
# The upper bound nukes any weird hotspots and clamps bone down
|
||||
ct_t.clamp_(-1000, 1000)
|
||||
|
||||
pos_t = torch.from_numpy(ct.positive_mask[slice_ndx]).unsqueeze(0)
|
||||
|
||||
return ct_t, pos_t, ct.series_uid, slice_ndx
|
||||
|
||||
|
||||
class TrainingLuna2dSegmentationDataset(Luna2dSegmentationDataset):
|
||||
def __init__(self, *args, **kwargs):
|
||||
super().__init__(*args, **kwargs)
|
||||
|
||||
self.ratio_int = 2
|
||||
|
||||
def __len__(self):
|
||||
return 300000
|
||||
|
||||
def shuffleSamples(self):
|
||||
random.shuffle(self.candidateInfo_list)
|
||||
random.shuffle(self.pos_list)
|
||||
|
||||
def __getitem__(self, ndx):
|
||||
candidateInfo_tup = self.pos_list[ndx % len(self.pos_list)]
|
||||
        return self.getitem_trainingCrop(candidateInfo_tup)

    def getitem_trainingCrop(self, candidateInfo_tup):
        ct_a, pos_a, center_irc = getCtRawCandidate(
            candidateInfo_tup.series_uid,
            candidateInfo_tup.center_xyz,
            (7, 96, 96),
        )
        pos_a = pos_a[3:4]

        row_offset = random.randrange(0, 32)
        col_offset = random.randrange(0, 32)
        ct_t = torch.from_numpy(ct_a[:, row_offset:row_offset+64,
                                     col_offset:col_offset+64]).to(torch.float32)
        pos_t = torch.from_numpy(pos_a[:, row_offset:row_offset+64,
                                       col_offset:col_offset+64]).to(torch.long)

        slice_ndx = center_irc.index

        return ct_t, pos_t, candidateInfo_tup.series_uid, slice_ndx


class PrepcacheLunaDataset(Dataset):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        self.candidateInfo_list = getCandidateInfoList()
        self.pos_list = [nt for nt in self.candidateInfo_list if nt.isNodule_bool]

        self.seen_set = set()
        self.candidateInfo_list.sort(key=lambda x: x.series_uid)

    def __len__(self):
        return len(self.candidateInfo_list)

    def __getitem__(self, ndx):
        # candidate_t, pos_t, series_uid, center_t = super().__getitem__(ndx)

        candidateInfo_tup = self.candidateInfo_list[ndx]
        getCtRawCandidate(candidateInfo_tup.series_uid, candidateInfo_tup.center_xyz, (7, 96, 96))

        series_uid = candidateInfo_tup.series_uid
        if series_uid not in self.seen_set:
            self.seen_set.add(series_uid)

            getCtSampleSize(series_uid)
            # ct = getCt(series_uid)
            # for mask_ndx in ct.positive_indexes:
            #     build2dLungMask(series_uid, mask_ndx)

        return 0, 1  # candidate_t, pos_t, series_uid, center_t


class TvTrainingLuna2dSegmentationDataset(torch.utils.data.Dataset):
    def __init__(self, isValSet_bool=False, val_stride=10, contextSlices_count=3):
        assert contextSlices_count == 3
        data = torch.load('./imgs_and_masks.pt')
        suids = list(set(data['suids']))
        trn_mask_suids = torch.arange(len(suids)) % val_stride < (val_stride - 1)
        trn_suids = {s for i, s in zip(trn_mask_suids, suids) if i}
        trn_mask = torch.tensor([(s in trn_suids) for s in data["suids"]])
        if not isValSet_bool:
            self.imgs = data["imgs"][trn_mask]
            self.masks = data["masks"][trn_mask]
            self.suids = [s for s, i in zip(data["suids"], trn_mask) if i]
        else:
            self.imgs = data["imgs"][~trn_mask]
            self.masks = data["masks"][~trn_mask]
            self.suids = [s for s, i in zip(data["suids"], trn_mask) if not i]
        # discard spurious hotspots and clamp bone
        self.imgs.clamp_(-1000, 1000)
        self.imgs /= 1000

    def __len__(self):
        return len(self.imgs)

    def __getitem__(self, i):
        oh, ow = torch.randint(0, 32, (2,))
        sl = self.masks.size(1)//2
        return self.imgs[i, :, oh: oh + 64, ow: ow + 64], 1, self.masks[i, sl: sl+1, oh: oh + 64, ow: ow + 64].to(torch.float32), self.suids[i], 9999
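The random 64x64 training crop above relies on a small invariant: offsets are drawn in [0, 32), so a 64-pixel window always fits inside the 96-pixel candidate array (32 + 64 == 96). A minimal standalone sketch of that cropping logic (my own illustration with dummy arrays, not code from the repo):

```python
import numpy as np

def random_crop_64(ct_a, pos_a, rng=None):
    # Offsets in [0, 32) guarantee a 64-wide window stays inside
    # the 96x96 candidate arrays (32 + 64 == 96).
    rng = rng or np.random.default_rng(0)
    row_offset = int(rng.integers(0, 32))
    col_offset = int(rng.integers(0, 32))
    ct_crop = ct_a[:, row_offset:row_offset + 64, col_offset:col_offset + 64]
    pos_crop = pos_a[:, row_offset:row_offset + 64, col_offset:col_offset + 64]
    return ct_crop, pos_crop

ct_a = np.zeros((7, 96, 96), dtype=np.float32)   # dummy CT chunk
pos_a = np.zeros((1, 96, 96), dtype=np.int64)    # dummy mask slice
ct_crop, pos_crop = random_crop_64(ct_a, pos_a)
print(ct_crop.shape, pos_crop.shape)
```

Both crops come out 64x64 regardless of the sampled offsets, which is why `__getitem__` never needs bounds checks.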
@ -0,0 +1,224 @@
import math
import random
from collections import namedtuple

import torch
from torch import nn as nn
import torch.nn.functional as F

from util.logconf import logging
from util.unet import UNet

log = logging.getLogger(__name__)
# log.setLevel(logging.WARN)
# log.setLevel(logging.INFO)
log.setLevel(logging.DEBUG)


class UNetWrapper(nn.Module):
    def __init__(self, **kwargs):
        super().__init__()

        self.input_batchnorm = nn.BatchNorm2d(kwargs['in_channels'])
        self.unet = UNet(**kwargs)
        self.final = nn.Sigmoid()

        self._init_weights()

    def _init_weights(self):
        init_set = {
            nn.Conv2d,
            nn.Conv3d,
            nn.ConvTranspose2d,
            nn.ConvTranspose3d,
            nn.Linear,
        }
        for m in self.modules():
            if type(m) in init_set:
                nn.init.kaiming_normal_(
                    m.weight.data, mode='fan_out', nonlinearity='relu', a=0
                )
                if m.bias is not None:
                    fan_in, fan_out = \
                        nn.init._calculate_fan_in_and_fan_out(m.weight.data)
                    bound = 1 / math.sqrt(fan_out)
                    nn.init.normal_(m.bias, -bound, bound)

        # nn.init.constant_(self.unet.last.bias, -4)
        # nn.init.constant_(self.unet.last.bias, 4)

    def forward(self, input_batch):
        bn_output = self.input_batchnorm(input_batch)
        un_output = self.unet(bn_output)
        fn_output = self.final(un_output)
        return fn_output


class SegmentationAugmentation(nn.Module):
    def __init__(
            self, flip=None, offset=None, scale=None, rotate=None, noise=None
    ):
        super().__init__()

        self.flip = flip
        self.offset = offset
        self.scale = scale
        self.rotate = rotate
        self.noise = noise

    def forward(self, input_g, label_g):
        transform_t = self._build2dTransformMatrix()
        transform_t = transform_t.expand(input_g.shape[0], -1, -1)
        transform_t = transform_t.to(input_g.device, torch.float32)
        affine_t = F.affine_grid(transform_t[:, :2],
                                 input_g.size(), align_corners=False)

        augmented_input_g = F.grid_sample(input_g,
                                          affine_t, padding_mode='border',
                                          align_corners=False)
        augmented_label_g = F.grid_sample(label_g.to(torch.float32),
                                          affine_t, padding_mode='border',
                                          align_corners=False)

        if self.noise:
            noise_t = torch.randn_like(augmented_input_g)
            noise_t *= self.noise

            augmented_input_g += noise_t

        return augmented_input_g, augmented_label_g > 0.5

    def _build2dTransformMatrix(self):
        transform_t = torch.eye(3)

        for i in range(2):
            if self.flip:
                if random.random() > 0.5:
                    transform_t[i, i] *= -1

            if self.offset:
                offset_float = self.offset
                random_float = (random.random() * 2 - 1)
                transform_t[2, i] = offset_float * random_float

            if self.scale:
                scale_float = self.scale
                random_float = (random.random() * 2 - 1)
                transform_t[i, i] *= 1.0 + scale_float * random_float

        if self.rotate:
            angle_rad = random.random() * math.pi * 2
            s = math.sin(angle_rad)
            c = math.cos(angle_rad)

            rotation_t = torch.tensor([
                [c, -s, 0],
                [s, c, 0],
                [0, 0, 1]])

            transform_t @= rotation_t

        return transform_t


# MaskTuple = namedtuple('MaskTuple', 'raw_dense_mask, dense_mask, body_mask, air_mask, raw_candidate_mask, candidate_mask, lung_mask, neg_mask, pos_mask')
#
# class SegmentationMask(nn.Module):
#     def __init__(self):
#         super().__init__()
#
#         self.conv_list = nn.ModuleList([
#             self._make_circle_conv(radius) for radius in range(1, 8)
#         ])
#
#     def _make_circle_conv(self, radius):
#         diameter = 1 + radius * 2
#
#         a = torch.linspace(-1, 1, steps=diameter)**2
#         b = (a[None] + a[:, None])**0.5
#
#         circle_weights = (b <= 1.0).to(torch.float32)
#
#         conv = nn.Conv2d(1, 1, kernel_size=diameter, padding=radius, bias=False)
#         conv.weight.data.fill_(1)
#         conv.weight.data *= circle_weights / circle_weights.sum()
#
#         return conv
#
#     def erode(self, input_mask, radius, threshold=1):
#         conv = self.conv_list[radius - 1]
#         input_float = input_mask.to(torch.float32)
#         result = conv(input_float)
#
#         # log.debug(['erode in ', radius, threshold, input_float.min().item(), input_float.mean().item(), input_float.max().item()])
#         # log.debug(['erode out', radius, threshold, result.min().item(), result.mean().item(), result.max().item()])
#
#         return result >= threshold
#
#     def deposit(self, input_mask, radius, threshold=0):
#         conv = self.conv_list[radius - 1]
#         input_float = input_mask.to(torch.float32)
#         result = conv(input_float)
#
#         # log.debug(['deposit in ', radius, threshold, input_float.min().item(), input_float.mean().item(), input_float.max().item()])
#         # log.debug(['deposit out', radius, threshold, result.min().item(), result.mean().item(), result.max().item()])
#
#         return result > threshold
#
#     def fill_cavity(self, input_mask):
#         cumsum = input_mask.cumsum(-1)
#         filled_mask = (cumsum > 0)
#         filled_mask &= (cumsum < cumsum[..., -1:])
#         cumsum = input_mask.cumsum(-2)
#         filled_mask &= (cumsum > 0)
#         filled_mask &= (cumsum < cumsum[..., -1:, :])
#
#         return filled_mask
#
#     def forward(self, input_g, raw_pos_g):
#         gcc_g = input_g + 1
#
#         with torch.no_grad():
#             # log.info(['gcc_g', gcc_g.min(), gcc_g.mean(), gcc_g.max()])
#
#             raw_dense_mask = gcc_g > 0.7
#             dense_mask = self.deposit(raw_dense_mask, 2)
#             dense_mask = self.erode(dense_mask, 6)
#             dense_mask = self.deposit(dense_mask, 4)
#
#             body_mask = self.fill_cavity(dense_mask)
#             air_mask = self.deposit(body_mask & ~dense_mask, 5)
#             air_mask = self.erode(air_mask, 6)
#
#             lung_mask = self.deposit(air_mask, 5)
#
#             raw_candidate_mask = gcc_g > 0.4
#             raw_candidate_mask &= air_mask
#             candidate_mask = self.erode(raw_candidate_mask, 1)
#             candidate_mask = self.deposit(candidate_mask, 1)
#
#             pos_mask = self.deposit((raw_pos_g > 0.5) & lung_mask, 2)
#
#             neg_mask = self.deposit(candidate_mask, 1)
#             neg_mask &= ~pos_mask
#             neg_mask &= lung_mask
#
#             # label_g = (neg_mask | pos_mask).to(torch.float32)
#             label_g = (pos_mask).to(torch.float32)
#             neg_g = neg_mask.to(torch.float32)
#             pos_g = pos_mask.to(torch.float32)
#
#             mask_dict = {
#                 'raw_dense_mask': raw_dense_mask,
#                 'dense_mask': dense_mask,
#                 'body_mask': body_mask,
#                 'air_mask': air_mask,
#                 'raw_candidate_mask': raw_candidate_mask,
#                 'candidate_mask': candidate_mask,
#                 'lung_mask': lung_mask,
#                 'neg_mask': neg_mask,
#                 'pos_mask': pos_mask,
#             }
#
#             return label_g, neg_g, pos_g, lung_mask, mask_dict
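`_build2dTransformMatrix` composes flips, offsets, and scales into a 3x3 homogeneous matrix and then right-multiplies by a rotation. As a dependency-free sanity-check sketch (my own helper, mirroring the matrix layout above), a pure rotation factor should have determinant 1, so it never distorts areas, only orientations:

```python
import math

def rotation2d(angle_rad):
    # Same homogeneous-coordinate layout as _build2dTransformMatrix.
    s, c = math.sin(angle_rad), math.cos(angle_rad)
    return [[c, -s, 0.0],
            [s,  c, 0.0],
            [0.0, 0.0, 1.0]]

def det3(m):
    # Cofactor expansion along the first row.
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

R = rotation2d(math.pi / 3)
print(round(det3(R), 6))  # a pure rotation has determinant 1
```

Flips multiply the determinant by -1 and scales stretch it, which is why only the rotation term needs the sin/cos structure.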
@ -0,0 +1,69 @@
import timing
import argparse
import sys

import numpy as np

import torch.nn as nn
from torch.autograd import Variable
from torch.optim import SGD
from torch.utils.data import DataLoader

from util.util import enumerateWithEstimate
from .dsets import PrepcacheLunaDataset, getCtSampleSize
from util.logconf import logging
# from .model import LunaModel

log = logging.getLogger(__name__)
# log.setLevel(logging.WARN)
log.setLevel(logging.INFO)
# log.setLevel(logging.DEBUG)


class LunaPrepCacheApp:
    def __init__(self, sys_argv=None):
        if sys_argv is None:
            sys_argv = sys.argv[1:]

        parser = argparse.ArgumentParser()
        parser.add_argument('--batch-size',
            help='Batch size to use for training',
            default=1024,
            type=int,
        )
        parser.add_argument('--num-workers',
            help='Number of worker processes for background data loading',
            default=8,
            type=int,
        )
        # parser.add_argument('--scaled',
        #     help="Scale the CT chunks to square voxels.",
        #     default=False,
        #     action='store_true',
        # )

        self.cli_args = parser.parse_args(sys_argv)

    def main(self):
        log.info("Starting {}, {}".format(type(self).__name__, self.cli_args))

        self.prep_dl = DataLoader(
            PrepcacheLunaDataset(
                # sortby_str='series_uid',
            ),
            batch_size=self.cli_args.batch_size,
            num_workers=self.cli_args.num_workers,
        )

        batch_iter = enumerateWithEstimate(
            self.prep_dl,
            "Stuffing cache",
            start_ndx=self.prep_dl.num_workers,
        )
        for batch_ndx, batch_tup in batch_iter:
            pass


if __name__ == '__main__':
    LunaPrepCacheApp().main()
Binary file not shown.
@ -0,0 +1,331 @@
import math
import random
import warnings

import numpy as np
import scipy.ndimage

import torch
from torch.autograd import Function
from torch.autograd.function import once_differentiable
import torch.backends.cudnn as cudnn

from util.logconf import logging
log = logging.getLogger(__name__)
# log.setLevel(logging.WARN)
# log.setLevel(logging.INFO)
log.setLevel(logging.DEBUG)


def cropToShape(image, new_shape, center_list=None, fill=0.0):
    # log.debug([image.shape, new_shape, center_list])
    # assert len(image.shape) == 3, repr(image.shape)

    if center_list is None:
        center_list = [int(image.shape[i] / 2) for i in range(3)]

    crop_list = []
    for i in range(0, 3):
        crop_int = center_list[i]
        if image.shape[i] > new_shape[i] and crop_int is not None:

            # We can't just do crop_int +/- shape/2 since shape might be odd
            # and ints round down.
            start_int = crop_int - int(new_shape[i]/2)
            end_int = start_int + new_shape[i]
            crop_list.append(slice(max(0, start_int), end_int))
        else:
            crop_list.append(slice(0, image.shape[i]))

    # log.debug([image.shape, crop_list])
    # Index with a tuple: indexing with a plain list of slices is no longer
    # supported by NumPy.
    image = image[tuple(crop_list)]

    crop_list = []
    for i in range(0, 3):
        if image.shape[i] < new_shape[i]:
            crop_int = int((new_shape[i] - image.shape[i]) / 2)
            crop_list.append(slice(crop_int, crop_int + image.shape[i]))
        else:
            crop_list.append(slice(0, image.shape[i]))

    # log.debug([image.shape, crop_list])
    new_image = np.zeros(new_shape, dtype=image.dtype)
    new_image[:] = fill
    new_image[tuple(crop_list)] = image

    return new_image


def zoomToShape(image, new_shape, square=True):
    # assert image.shape[-1] in {1, 3, 4}, repr(image.shape)

    if square and image.shape[0] != image.shape[1]:
        crop_int = min(image.shape[0], image.shape[1])
        new_shape = [crop_int, crop_int, image.shape[2]]
        image = cropToShape(image, new_shape)

    zoom_shape = [new_shape[i] / image.shape[i] for i in range(3)]

    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        # The legacy scipy.ndimage.interpolation namespace is deprecated;
        # the top-level scipy.ndimage functions are equivalent.
        image = scipy.ndimage.zoom(
            image, zoom_shape,
            output=None, order=0, mode='nearest', cval=0.0, prefilter=True)

    return image


def randomOffset(image_list, offset_rows=0.125, offset_cols=0.125):

    center_list = [int(image_list[0].shape[i] / 2) for i in range(3)]
    center_list[0] += int(offset_rows * (random.random() - 0.5) * 2)
    center_list[1] += int(offset_cols * (random.random() - 0.5) * 2)
    center_list[2] = None

    new_list = []
    for image in image_list:
        new_image = cropToShape(image, image.shape, center_list)
        new_list.append(new_image)

    return new_list


def randomZoom(image_list, scale=None, scale_min=0.8, scale_max=1.3):
    if scale is None:
        scale = scale_min + (scale_max - scale_min) * random.random()

    new_list = []
    for image in image_list:
        # assert image.shape[-1] in {1, 3, 4}, repr(image.shape)

        with warnings.catch_warnings():
            warnings.simplefilter("ignore")
            # log.info([image.shape])
            zimage = scipy.ndimage.zoom(
                image, [scale, scale, 1.0],
                output=None, order=0, mode='nearest', cval=0.0, prefilter=True)
            image = cropToShape(zimage, image.shape)

        new_list.append(image)

    return new_list


_randomFlip_transform_list = [
    # lambda a: np.rot90(a, axes=(0, 1)),
    # lambda a: np.flip(a, 0),
    lambda a: np.flip(a, 1),
]

def randomFlip(image_list, transform_bits=None):
    if transform_bits is None:
        transform_bits = random.randrange(0, 2 ** len(_randomFlip_transform_list))

    new_list = []
    for image in image_list:
        # assert image.shape[-1] in {1, 3, 4}, repr(image.shape)

        for n in range(len(_randomFlip_transform_list)):
            if transform_bits & 2**n:
                # prhist(image, 'before')
                image = _randomFlip_transform_list[n](image)
                # prhist(image, 'after ')

        new_list.append(image)

    return new_list


def randomSpin(image_list, angle=None, range_tup=None, axes=(0, 1)):
    if range_tup is None:
        range_tup = (0, 360)

    if angle is None:
        angle = range_tup[0] + (range_tup[1] - range_tup[0]) * random.random()

    new_list = []
    for image in image_list:
        # assert image.shape[-1] in {1, 3, 4}, repr(image.shape)

        image = scipy.ndimage.rotate(
            image, angle, axes=axes, reshape=False,
            output=None, order=0, mode='nearest', cval=0.0, prefilter=True)

        new_list.append(image)

    return new_list


def randomNoise(image_list, noise_min=-0.1, noise_max=0.1):
    noise = np.zeros_like(image_list[0])
    noise += (noise_max - noise_min) * np.random.random_sample(image_list[0].shape) + noise_min
    noise *= 5
    noise = scipy.ndimage.gaussian_filter(noise, 3)
    # noise += (noise_max - noise_min) * np.random.random_sample(image_hsv.shape) + noise_min

    new_list = []
    for image_hsv in image_list:
        image_hsv = image_hsv + noise

        new_list.append(image_hsv)

    return new_list


def randomHsvShift(image_list, h=None, s=None, v=None,
                   h_min=-0.1, h_max=0.1,
                   s_min=0.5, s_max=2.0,
                   v_min=0.5, v_max=2.0):
    if h is None:
        h = h_min + (h_max - h_min) * random.random()
    if s is None:
        s = s_min + (s_max - s_min) * random.random()
    if v is None:
        v = v_min + (v_max - v_min) * random.random()

    new_list = []
    for image_hsv in image_list:
        # assert image_hsv.shape[-1] == 3, repr(image_hsv.shape)

        image_hsv[:,:,0::3] += h
        image_hsv[:,:,1::3] = image_hsv[:,:,1::3] ** s
        image_hsv[:,:,2::3] = image_hsv[:,:,2::3] ** v

        new_list.append(image_hsv)

    return clampHsv(new_list)


def clampHsv(image_list):
    new_list = []
    for image_hsv in image_list:
        image_hsv = image_hsv.clone()

        # Hue wraps around
        image_hsv[:,:,0][image_hsv[:,:,0] > 1] -= 1
        image_hsv[:,:,0][image_hsv[:,:,0] < 0] += 1

        # Everything else clamps between 0 and 1
        image_hsv[image_hsv > 1] = 1
        image_hsv[image_hsv < 0] = 0

        new_list.append(image_hsv)

    return new_list


# def torch_augment(input):
#     theta = random.random() * math.pi * 2
#     s = math.sin(theta)
#     c = math.cos(theta)
#     c1 = 1 - c
#     axis_vector = torch.rand(3, device='cpu', dtype=torch.float64)
#     axis_vector -= 0.5
#     axis_vector /= axis_vector.abs().sum()
#     l, m, n = axis_vector
#
#     matrix = torch.tensor([
#         [l*l*c1 + c, m*l*c1 - n*s, n*l*c1 + m*s, 0],
#         [l*m*c1 + n*s, m*m*c1 + c, n*m*c1 - l*s, 0],
#         [l*n*c1 - m*s, m*n*c1 + l*s, n*n*c1 + c, 0],
#         [0, 0, 0, 1],
#     ], device=input.device, dtype=torch.float32)
#
#     return th_affine3d(input, matrix)


# following from https://github.com/ncullen93/torchsample/blob/master/torchsample/utils.py
# MIT licensed

# def th_affine3d(input, matrix):
#     """
#     3D Affine image transform on torch.Tensor
#     """
#     A = matrix[:3,:3]
#     b = matrix[:3,3]
#
#     # make a meshgrid of normal coordinates
#     coords = th_iterproduct(input.size(-3), input.size(-2), input.size(-1), dtype=torch.float32)
#
#     # shift the coordinates so center is the origin
#     coords[:,0] = coords[:,0] - (input.size(-3) / 2. - 0.5)
#     coords[:,1] = coords[:,1] - (input.size(-2) / 2. - 0.5)
#     coords[:,2] = coords[:,2] - (input.size(-1) / 2. - 0.5)
#
#     # apply the coordinate transformation
#     new_coords = coords.mm(A.t().contiguous()) + b.expand_as(coords)
#
#     # shift the coordinates back so origin is origin
#     new_coords[:,0] = new_coords[:,0] + (input.size(-3) / 2. - 0.5)
#     new_coords[:,1] = new_coords[:,1] + (input.size(-2) / 2. - 0.5)
#     new_coords[:,2] = new_coords[:,2] + (input.size(-1) / 2. - 0.5)
#
#     # map new coordinates using bilinear interpolation
#     input_transformed = th_trilinear_interp3d(input, new_coords)
#
#     return input_transformed
#
#
# def th_trilinear_interp3d(input, coords):
#     """
#     trilinear interpolation of 3D torch.Tensor image
#     """
#     # take clamp then floor/ceil of x coords
#     x = torch.clamp(coords[:,0], 0, input.size(-3)-2)
#     x0 = x.floor()
#     x1 = x0 + 1
#     # take clamp then floor/ceil of y coords
#     y = torch.clamp(coords[:,1], 0, input.size(-2)-2)
#     y0 = y.floor()
#     y1 = y0 + 1
#     # take clamp then floor/ceil of z coords
#     z = torch.clamp(coords[:,2], 0, input.size(-1)-2)
#     z0 = z.floor()
#     z1 = z0 + 1
#
#     stride = torch.tensor(input.stride()[-3:], dtype=torch.int64, device=input.device)
#     x0_ix = x0.mul(stride[0]).long()
#     x1_ix = x1.mul(stride[0]).long()
#     y0_ix = y0.mul(stride[1]).long()
#     y1_ix = y1.mul(stride[1]).long()
#     z0_ix = z0.mul(stride[2]).long()
#     z1_ix = z1.mul(stride[2]).long()
#
#     # input_flat = th_flatten(input)
#     input_flat = x.contiguous().view(x[0], x[1], -1)
#
#     vals_000 = input_flat[:, :, x0_ix+y0_ix+z0_ix]
#     vals_001 = input_flat[:, :, x0_ix+y0_ix+z1_ix]
#     vals_010 = input_flat[:, :, x0_ix+y1_ix+z0_ix]
#     vals_011 = input_flat[:, :, x0_ix+y1_ix+z1_ix]
#     vals_100 = input_flat[:, :, x1_ix+y0_ix+z0_ix]
#     vals_101 = input_flat[:, :, x1_ix+y0_ix+z1_ix]
#     vals_110 = input_flat[:, :, x1_ix+y1_ix+z0_ix]
#     vals_111 = input_flat[:, :, x1_ix+y1_ix+z1_ix]
#
#     xd = x - x0
#     yd = y - y0
#     zd = z - z0
#     xm1 = 1 - xd
#     ym1 = 1 - yd
#     zm1 = 1 - zd
#
#     x_mapped = (
#         vals_000.mul(xm1).mul(ym1).mul(zm1) +
#         vals_001.mul(xm1).mul(ym1).mul(zd) +
#         vals_010.mul(xm1).mul(yd).mul(zm1) +
#         vals_011.mul(xm1).mul(yd).mul(zd) +
#         vals_100.mul(xd).mul(ym1).mul(zm1) +
#         vals_101.mul(xd).mul(ym1).mul(zd) +
#         vals_110.mul(xd).mul(yd).mul(zm1) +
#         vals_111.mul(xd).mul(yd).mul(zd)
#     )
#
#     return x_mapped.view_as(input)
#
# def th_iterproduct(*args, dtype=None):
#     return torch.from_numpy(np.indices(args).reshape((len(args),-1)).T)
#
# def th_flatten(x):
#     """Flatten tensor"""
#     return x.contiguous().view(x[0], x[1], -1)
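The pad-and-crop behavior of `cropToShape` can be sketched per axis (a standalone reimplementation for illustration, not an import from the repo): an axis longer than the target gets a centered window, and a shorter one is centered inside a fill value.

```python
import numpy as np

def center_crop_or_pad_1d(arr, new_len, fill=0.0):
    if arr.shape[0] > new_len:
        # Centered crop: same start arithmetic as cropToShape
        # (center - new_len // 2 handles odd lengths via int rounding).
        start = arr.shape[0] // 2 - new_len // 2
        return arr[start:start + new_len]
    # Centered pad with `fill` when the axis is too short.
    out = np.full(new_len, fill, dtype=arr.dtype)
    start = (new_len - arr.shape[0]) // 2
    out[start:start + arr.shape[0]] = arr
    return out

a = np.arange(6, dtype=float)        # [0, 1, 2, 3, 4, 5]
print(center_crop_or_pad_1d(a, 4))   # centered 4-element window
print(center_crop_or_pad_1d(a, 8))   # same data centered in 8 slots
```

`cropToShape` applies exactly this pair of steps, one slice per spatial axis, which is why it builds two separate `crop_list`s.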
@ -0,0 +1,136 @@
import gzip

from diskcache import FanoutCache, Disk
from diskcache.core import BytesType, MODE_BINARY, BytesIO

from util.logconf import logging
log = logging.getLogger(__name__)
# log.setLevel(logging.WARN)
log.setLevel(logging.INFO)
# log.setLevel(logging.DEBUG)


class GzipDisk(Disk):
    def store(self, value, read, key=None):
        """
        Override from base class diskcache.Disk.

        Chunking is due to needing to work on pythons < 2.7.13:
        - Issue #27130: In the "zlib" module, fix handling of large buffers
          (typically 2 or 4 GiB).  Previously, inputs were limited to 2 GiB, and
          compression and decompression operations did not properly handle results of
          2 or 4 GiB.

        :param value: value to convert
        :param bool read: True when value is file-like object
        :return: (size, mode, filename, value) tuple for Cache table
        """
        # pylint: disable=unidiomatic-typecheck
        if type(value) is BytesType:
            if read:
                value = value.read()
                read = False

            str_io = BytesIO()
            gz_file = gzip.GzipFile(mode='wb', compresslevel=1, fileobj=str_io)

            for offset in range(0, len(value), 2**30):
                gz_file.write(value[offset:offset+2**30])
            gz_file.close()

            value = str_io.getvalue()

        return super(GzipDisk, self).store(value, read)

    def fetch(self, mode, filename, value, read):
        """
        Override from base class diskcache.Disk.

        Chunking is due to needing to work on pythons < 2.7.13:
        - Issue #27130: In the "zlib" module, fix handling of large buffers
          (typically 2 or 4 GiB).  Previously, inputs were limited to 2 GiB, and
          compression and decompression operations did not properly handle results of
          2 or 4 GiB.

        :param int mode: value mode raw, binary, text, or pickle
        :param str filename: filename of corresponding value
        :param value: database value
        :param bool read: when True, return an open file handle
        :return: corresponding Python value
        """
        value = super(GzipDisk, self).fetch(mode, filename, value, read)

        if mode == MODE_BINARY:
            str_io = BytesIO(value)
            gz_file = gzip.GzipFile(mode='rb', fileobj=str_io)
            read_csio = BytesIO()

            while True:
                uncompressed_data = gz_file.read(2**30)
                if uncompressed_data:
                    read_csio.write(uncompressed_data)
                else:
                    break

            value = read_csio.getvalue()

        return value

def getCache(scope_str):
    return FanoutCache('data-unversioned/cache/' + scope_str,
                       disk=GzipDisk,
                       shards=64,
                       timeout=1,
                       size_limit=3e11,
                       # disk_min_file_size=2**20,
                       )

# def disk_cache(base_path, memsize=2):
#     def disk_cache_decorator(f):
#         @functools.wraps(f)
#         def wrapper(*args, **kwargs):
#             args_str = repr(args) + repr(sorted(kwargs.items()))
#             file_str = hashlib.md5(args_str.encode('utf8')).hexdigest()
#
#             cache_path = os.path.join(base_path, f.__name__, file_str + '.pkl.gz')
#
#             if not os.path.exists(os.path.dirname(cache_path)):
#                 os.makedirs(os.path.dirname(cache_path), exist_ok=True)
#
#             if os.path.exists(cache_path):
#                 return pickle_loadgz(cache_path)
#             else:
#                 ret = f(*args, **kwargs)
#                 pickle_dumpgz(cache_path, ret)
#                 return ret
#
#         return wrapper
#
#     return disk_cache_decorator
#
#
# def pickle_dumpgz(file_path, obj):
#     log.debug("Writing {}".format(file_path))
#     with open(file_path, 'wb') as file_obj:
#         with gzip.GzipFile(mode='wb', compresslevel=1, fileobj=file_obj) as gz_file:
#             pickle.dump(obj, gz_file, pickle.HIGHEST_PROTOCOL)
#
#
# def pickle_loadgz(file_path):
#     log.debug("Reading {}".format(file_path))
#     with open(file_path, 'rb') as file_obj:
#         with gzip.GzipFile(mode='rb', fileobj=file_obj) as gz_file:
#             return pickle.load(gz_file)
#
#
# def dtpath(dt=None):
#     if dt is None:
#         dt = datetime.datetime.now()
#
#     return str(dt).rsplit('.', 1)[0].replace(' ', '--').replace(':', '.')
#
#
# def safepath(s):
#     s = s.replace(' ', '_')
#     return re.sub('[^A-Za-z0-9_.-]', '', s)
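`GzipDisk` writes and reads values in 2**30-byte chunks to sidestep the old zlib large-buffer limits described in its docstrings. The round trip can be sketched in isolation (my own standalone version, with the chunk size shrunk so the loops actually iterate on a small payload):

```python
import gzip
from io import BytesIO

CHUNK = 16  # stand-in for the 2**30 chunk size used by GzipDisk

def gzip_store(value: bytes) -> bytes:
    # Compress the value in fixed-size chunks, like GzipDisk.store.
    buf = BytesIO()
    with gzip.GzipFile(mode='wb', compresslevel=1, fileobj=buf) as gz_file:
        for offset in range(0, len(value), CHUNK):
            gz_file.write(value[offset:offset + CHUNK])
    return buf.getvalue()

def gzip_fetch(blob: bytes) -> bytes:
    # Decompress chunk by chunk, like GzipDisk.fetch.
    out = BytesIO()
    with gzip.GzipFile(mode='rb', fileobj=BytesIO(blob)) as gz_file:
        while True:
            chunk = gz_file.read(CHUNK)
            if not chunk:
                break
            out.write(chunk)
    return out.getvalue()

payload = bytes(range(256)) * 5
print(gzip_fetch(gzip_store(payload)) == payload)
```

Because gzip is a stream format, writing in chunks produces the same compressed stream as a single `write`, so the chunking is invisible to readers.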
@ -0,0 +1,19 @@
import logging
import logging.handlers

root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)

# Some libraries attempt to add their own root logger handlers. This is
# annoying and so we get rid of them.
for handler in list(root_logger.handlers):
    root_logger.removeHandler(handler)

logfmt_str = "%(asctime)s %(levelname)-8s pid:%(process)d %(name)s:%(lineno)03d:%(funcName)s %(message)s"
formatter = logging.Formatter(logfmt_str)

streamHandler = logging.StreamHandler()
streamHandler.setFormatter(formatter)
streamHandler.setLevel(logging.DEBUG)

root_logger.addHandler(streamHandler)
@ -0,0 +1,143 @@
|
||||
# From https://github.com/jvanvugt/pytorch-unet
|
||||
# https://raw.githubusercontent.com/jvanvugt/pytorch-unet/master/unet.py
|
||||
|
||||
# MIT License
|
||||
#
|
||||
# Copyright (c) 2018 Joris
|
||||
#
|
||||
# Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
# of this software and associated documentation files (the "Software"), to deal
|
||||
# in the Software without restriction, including without limitation the rights
|
||||
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
# copies of the Software, and to permit persons to whom the Software is
|
||||
# furnished to do so, subject to the following conditions:
|
||||
#
|
||||
# The above copyright notice and this permission notice shall be included in all
|
||||
# copies or substantial portions of the Software.
|
||||
#
|
||||
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
||||
# SOFTWARE.
|
||||
|
||||
# Adapted from https://discuss.pytorch.org/t/unet-implementation/426
|
||||
|
||||
import torch
|
||||
from torch import nn
|
||||
import torch.nn.functional as F
|
||||
|
||||
|
||||
class UNet(nn.Module):
|
||||
def __init__(self, in_channels=1, n_classes=2, depth=5, wf=6, padding=False,
|
||||
batch_norm=False, up_mode='upconv'):
|
||||
"""
|
||||
Implementation of
|
||||
U-Net: Convolutional Networks for Biomedical Image Segmentation
|
||||
(Ronneberger et al., 2015)
|
||||
https://arxiv.org/abs/1505.04597
|
||||
|
||||
Using the default arguments will yield the exact version used
|
||||
in the original paper
|
||||
|
||||
Args:
|
||||
in_channels (int): number of input channels
|
||||
n_classes (int): number of output channels
|
||||
depth (int): depth of the network
|
||||
wf (int): number of filters in the first layer is 2**wf
|
||||
padding (bool): if True, apply padding such that the input shape
|
||||
is the same as the output.
|
||||
This may introduce artifacts
|
||||
batch_norm (bool): Use BatchNorm after layers with an
|
||||
activation function
|
||||
up_mode (str): one of 'upconv' or 'upsample'.
|
||||
'upconv' will use transposed convolutions for
|
||||
                       learned upsampling.
           'upsample' will use bilinear upsampling.
        """
        super(UNet, self).__init__()
        assert up_mode in ('upconv', 'upsample')
        self.padding = padding
        self.depth = depth
        prev_channels = in_channels
        self.down_path = nn.ModuleList()
        for i in range(depth):
            self.down_path.append(UNetConvBlock(prev_channels, 2**(wf+i),
                                                padding, batch_norm))
            prev_channels = 2**(wf+i)

        self.up_path = nn.ModuleList()
        for i in reversed(range(depth - 1)):
            self.up_path.append(UNetUpBlock(prev_channels, 2**(wf+i), up_mode,
                                            padding, batch_norm))
            prev_channels = 2**(wf+i)

        self.last = nn.Conv2d(prev_channels, n_classes, kernel_size=1)

    def forward(self, x):
        blocks = []
        for i, down in enumerate(self.down_path):
            x = down(x)
            if i != len(self.down_path)-1:
                blocks.append(x)
                x = F.avg_pool2d(x, 2)

        for i, up in enumerate(self.up_path):
            x = up(x, blocks[-i-1])

        return self.last(x)


class UNetConvBlock(nn.Module):
    def __init__(self, in_size, out_size, padding, batch_norm):
        super(UNetConvBlock, self).__init__()
        block = []

        block.append(nn.Conv2d(in_size, out_size, kernel_size=3,
                               padding=int(padding)))
        block.append(nn.ReLU())
        # block.append(nn.LeakyReLU())
        if batch_norm:
            block.append(nn.BatchNorm2d(out_size))

        block.append(nn.Conv2d(out_size, out_size, kernel_size=3,
                               padding=int(padding)))
        block.append(nn.ReLU())
        # block.append(nn.LeakyReLU())
        if batch_norm:
            block.append(nn.BatchNorm2d(out_size))

        self.block = nn.Sequential(*block)

    def forward(self, x):
        out = self.block(x)
        return out


class UNetUpBlock(nn.Module):
    def __init__(self, in_size, out_size, up_mode, padding, batch_norm):
        super(UNetUpBlock, self).__init__()
        if up_mode == 'upconv':
            self.up = nn.ConvTranspose2d(in_size, out_size, kernel_size=2,
                                         stride=2)
        elif up_mode == 'upsample':
            self.up = nn.Sequential(nn.Upsample(mode='bilinear', scale_factor=2),
                                    nn.Conv2d(in_size, out_size, kernel_size=1))

        self.conv_block = UNetConvBlock(in_size, out_size, padding, batch_norm)

    def center_crop(self, layer, target_size):
        _, _, layer_height, layer_width = layer.size()
        diff_y = (layer_height - target_size[0]) // 2
        diff_x = (layer_width - target_size[1]) // 2
        return layer[:, :, diff_y:(diff_y + target_size[0]), diff_x:(diff_x + target_size[1])]

    def forward(self, x, bridge):
        up = self.up(x)
        crop1 = self.center_crop(bridge, up.shape[2:])
        out = torch.cat([up, crop1], 1)
        out = self.conv_block(out)

        return out
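The two loops in `UNet.__init__` above fix the layer widths purely from `wf` and `depth`: the encoder doubles the channel count at every level, `2**(wf+i)`, and the decoder walks the same widths back down. A minimal plain-Python sketch of that progression (`wf=6`, `depth=5` are illustrative values, not values taken from this repository):

```python
def unet_channel_widths(wf, depth):
    """Channel width at each encoder level, mirroring the loop in UNet.__init__."""
    return [2 ** (wf + i) for i in range(depth)]

down = unet_channel_widths(wf=6, depth=5)
print(down)            # encoder widths: [64, 128, 256, 512, 1024]
print(down[-2::-1])    # decoder widths (reversed, skipping the bottleneck): [512, 256, 128, 64]
```

With `wf=6` the first block already produces 64 feature maps, which matches the classical U-Net layout.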
105 pages/students/2016/lukas_pokryvka/dp2021/mnist/mnist-dist.py Normal file
@@ -0,0 +1,105 @@
import os
from datetime import datetime
import argparse
import torch.multiprocessing as mp
import torchvision
import torchvision.transforms as transforms
import torch
import torch.nn as nn
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP
from apex import amp


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N',
                        help='number of nodes (default: 1)')
    parser.add_argument('-g', '--gpus', default=1, type=int,
                        help='number of gpus per node')
    parser.add_argument('-nr', '--nr', default=0, type=int,
                        help='ranking within the nodes')
    parser.add_argument('--epochs', default=2, type=int, metavar='N',
                        help='number of total epochs to run')
    args = parser.parse_args()
    args.world_size = args.gpus * args.nodes
    os.environ['MASTER_ADDR'] = '147.232.47.114'
    os.environ['MASTER_PORT'] = '8888'
    mp.spawn(train, nprocs=args.gpus, args=(args,))


class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out


def train(gpu, args):
    rank = args.nr * args.gpus + gpu
    dist.init_process_group(backend='nccl', init_method='env://', world_size=args.world_size, rank=rank)
    torch.manual_seed(0)
    model = ConvNet()
    torch.cuda.set_device(gpu)
    model.cuda(gpu)
    batch_size = 10
    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), 1e-4)
    # Wrap the model
    model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
    # Data loading code
    train_dataset = torchvision.datasets.MNIST(root='./data',
                                               train=True,
                                               transform=transforms.ToTensor(),
                                               download=True)
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset,
                                                                    num_replicas=args.world_size,
                                                                    rank=rank)
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                               batch_size=batch_size,
                                               shuffle=False,
                                               num_workers=0,
                                               pin_memory=True,
                                               sampler=train_sampler)

    start = datetime.now()
    total_step = len(train_loader)
    for epoch in range(args.epochs):
        for i, (images, labels) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            labels = labels.cuda(non_blocking=True)
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if (i + 1) % 100 == 0 and gpu == 0:
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,
                                                                         loss.item()))
    if gpu == 0:
        print("Training complete in: " + str(datetime.now() - start))


if __name__ == '__main__':
    torch.multiprocessing.set_start_method('spawn')
    main()
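`train()` above derives a unique global rank for every spawned process as `args.nr * args.gpus + gpu`, so the ranks tile contiguously across nodes. A small plain-Python sketch of that arithmetic (the 2-node, 4-GPU layout below is an assumed example, not a configuration from this repository):

```python
def global_ranks(nodes, gpus_per_node):
    """Enumerate (node, local_gpu) -> global rank exactly as train() computes it."""
    return {(nr, gpu): nr * gpus_per_node + gpu
            for nr in range(nodes)
            for gpu in range(gpus_per_node)}

ranks = global_ranks(nodes=2, gpus_per_node=4)
print(ranks[(0, 0)], ranks[(1, 3)])  # first and last of the 8 processes: 0 7
```

Every process gets a distinct rank in `[0, world_size)`, which is what `dist.init_process_group` requires.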
92 pages/students/2016/lukas_pokryvka/dp2021/mnist/mnist.py Normal file
@@ -0,0 +1,92 @@
import os
from datetime import datetime
import argparse
import torch.multiprocessing as mp
import torchvision
import torchvision.transforms as transforms
import torch
import torch.nn as nn
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP
from apex import amp


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N',
                        help='number of nodes (default: 1)')
    parser.add_argument('-g', '--gpus', default=1, type=int,
                        help='number of gpus per node')
    parser.add_argument('-nr', '--nr', default=0, type=int,
                        help='ranking within the nodes')
    parser.add_argument('--epochs', default=2, type=int, metavar='N',
                        help='number of total epochs to run')
    args = parser.parse_args()
    train(0, args)


class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out


def train(gpu, args):
    model = ConvNet()
    torch.cuda.set_device(gpu)
    model.cuda(gpu)
    batch_size = 50
    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(gpu)
    optimizer = torch.optim.SGD(model.parameters(), 1e-4)
    # Data loading code
    train_dataset = torchvision.datasets.MNIST(root='./data',
                                               train=True,
                                               transform=transforms.ToTensor(),
                                               download=True)
    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                               batch_size=batch_size,
                                               shuffle=True,
                                               num_workers=0,
                                               pin_memory=True)

    start = datetime.now()
    total_step = len(train_loader)
    for epoch in range(args.epochs):
        for i, (images, labels) in enumerate(train_loader):
            images = images.cuda(non_blocking=True)
            labels = labels.cuda(non_blocking=True)
            # Forward pass
            outputs = model(images)
            loss = criterion(outputs, labels)

            # Backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if (i + 1) % 100 == 0 and gpu == 0:
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,
                                                                         loss.item()))
    if gpu == 0:
        print("Training complete in: " + str(datetime.now() - start))


if __name__ == '__main__':
    main()
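The `7*7*32` input size of `self.fc` in `ConvNet` follows from the two `MaxPool2d(kernel_size=2, stride=2)` layers, each halving the 28x28 MNIST side length, and the 32 output channels of `layer2`. A quick plain-Python check of that arithmetic:

```python
side = 28                    # MNIST images are 28x28
for _ in range(2):           # two conv blocks, each ending in MaxPool2d(2, 2)
    side //= 2
fc_in = side * side * 32     # 32 channels after layer2
print(fc_in)                 # 1568 == 7*7*32
```

If the input resolution or pooling schedule changed, this product would have to change with it, otherwise the `nn.Linear` shape check fails at runtime.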
748 pages/students/2016/lukas_pokryvka/dp2021/yelp/script.py Normal file
@@ -0,0 +1,748 @@
from argparse import Namespace
from collections import Counter
import json
import os
import re
import string

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tqdm.notebook import tqdm


class Vocabulary(object):
    """Class to process text and extract vocabulary for mapping"""

    def __init__(self, token_to_idx=None, add_unk=True, unk_token="<UNK>"):
        """
        Args:
            token_to_idx (dict): a pre-existing map of tokens to indices
            add_unk (bool): a flag that indicates whether to add the UNK token
            unk_token (str): the UNK token to add into the Vocabulary
        """

        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx

        self._idx_to_token = {idx: token
                              for token, idx in self._token_to_idx.items()}

        self._add_unk = add_unk
        self._unk_token = unk_token

        self.unk_index = -1
        if add_unk:
            self.unk_index = self.add_token(unk_token)

    def to_serializable(self):
        """ returns a dictionary that can be serialized """
        return {'token_to_idx': self._token_to_idx,
                'add_unk': self._add_unk,
                'unk_token': self._unk_token}

    @classmethod
    def from_serializable(cls, contents):
        """ instantiates the Vocabulary from a serialized dictionary """
        return cls(**contents)

    def add_token(self, token):
        """Update mapping dicts based on the token.

        Args:
            token (str): the item to add into the Vocabulary
        Returns:
            index (int): the integer corresponding to the token
        """
        if token in self._token_to_idx:
            index = self._token_to_idx[token]
        else:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index

    def add_many(self, tokens):
        """Add a list of tokens into the Vocabulary

        Args:
            tokens (list): a list of string tokens
        Returns:
            indices (list): a list of indices corresponding to the tokens
        """
        return [self.add_token(token) for token in tokens]

    def lookup_token(self, token):
        """Retrieve the index associated with the token
        or the UNK index if token isn't present.

        Args:
            token (str): the token to look up
        Returns:
            index (int): the index corresponding to the token
        Notes:
            `unk_index` needs to be >=0 (having been added into the Vocabulary)
            for the UNK functionality
        """
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]

    def lookup_index(self, index):
        """Return the token associated with the index

        Args:
            index (int): the index to look up
        Returns:
            token (str): the token corresponding to the index
        Raises:
            KeyError: if the index is not in the Vocabulary
        """
        if index not in self._idx_to_token:
            raise KeyError("the index (%d) is not in the Vocabulary" % index)
        return self._idx_to_token[index]

    def __str__(self):
        return "<Vocabulary(size=%d)>" % len(self)

    def __len__(self):
        return len(self._token_to_idx)
class ReviewVectorizer(object):
    """ The Vectorizer which coordinates the Vocabularies and puts them to use"""
    def __init__(self, review_vocab, rating_vocab):
        """
        Args:
            review_vocab (Vocabulary): maps words to integers
            rating_vocab (Vocabulary): maps class labels to integers
        """
        self.review_vocab = review_vocab
        self.rating_vocab = rating_vocab

    def vectorize(self, review):
        """Create a collapsed one-hot vector for the review

        Args:
            review (str): the review
        Returns:
            one_hot (np.ndarray): the collapsed one-hot encoding
        """
        one_hot = np.zeros(len(self.review_vocab), dtype=np.float32)

        for token in review.split(" "):
            if token not in string.punctuation:
                one_hot[self.review_vocab.lookup_token(token)] = 1

        return one_hot

    @classmethod
    def from_dataframe(cls, review_df, cutoff=25):
        """Instantiate the vectorizer from the dataset dataframe

        Args:
            review_df (pandas.DataFrame): the review dataset
            cutoff (int): the parameter for frequency-based filtering
        Returns:
            an instance of the ReviewVectorizer
        """
        review_vocab = Vocabulary(add_unk=True)
        rating_vocab = Vocabulary(add_unk=False)

        # Add ratings
        for rating in sorted(set(review_df.rating)):
            rating_vocab.add_token(rating)

        # Add top words if count > provided count
        word_counts = Counter()
        for review in review_df.review:
            for word in review.split(" "):
                if word not in string.punctuation:
                    word_counts[word] += 1

        for word, count in word_counts.items():
            if count > cutoff:
                review_vocab.add_token(word)

        return cls(review_vocab, rating_vocab)

    @classmethod
    def from_serializable(cls, contents):
        """Instantiate a ReviewVectorizer from a serializable dictionary

        Args:
            contents (dict): the serializable dictionary
        Returns:
            an instance of the ReviewVectorizer class
        """
        review_vocab = Vocabulary.from_serializable(contents['review_vocab'])
        rating_vocab = Vocabulary.from_serializable(contents['rating_vocab'])

        return cls(review_vocab=review_vocab, rating_vocab=rating_vocab)

    def to_serializable(self):
        """Create the serializable dictionary for caching

        Returns:
            contents (dict): the serializable dictionary
        """
        return {'review_vocab': self.review_vocab.to_serializable(),
                'rating_vocab': self.rating_vocab.to_serializable()}
class ReviewDataset(Dataset):
    def __init__(self, review_df, vectorizer):
        """
        Args:
            review_df (pandas.DataFrame): the dataset
            vectorizer (ReviewVectorizer): vectorizer instantiated from dataset
        """
        self.review_df = review_df
        self._vectorizer = vectorizer

        self.train_df = self.review_df[self.review_df.split=='train']
        self.train_size = len(self.train_df)

        self.val_df = self.review_df[self.review_df.split=='val']
        self.validation_size = len(self.val_df)

        self.test_df = self.review_df[self.review_df.split=='test']
        self.test_size = len(self.test_df)

        self._lookup_dict = {'train': (self.train_df, self.train_size),
                             'val': (self.val_df, self.validation_size),
                             'test': (self.test_df, self.test_size)}

        self.set_split('train')

    @classmethod
    def load_dataset_and_make_vectorizer(cls, review_csv):
        """Load dataset and make a new vectorizer from scratch

        Args:
            review_csv (str): location of the dataset
        Returns:
            an instance of ReviewDataset
        """
        review_df = pd.read_csv(review_csv)
        train_review_df = review_df[review_df.split=='train']
        return cls(review_df, ReviewVectorizer.from_dataframe(train_review_df))

    @classmethod
    def load_dataset_and_load_vectorizer(cls, review_csv, vectorizer_filepath):
        """Load dataset and the corresponding vectorizer.
        Used in the case the vectorizer has been cached for re-use

        Args:
            review_csv (str): location of the dataset
            vectorizer_filepath (str): location of the saved vectorizer
        Returns:
            an instance of ReviewDataset
        """
        review_df = pd.read_csv(review_csv)
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(review_df, vectorizer)

    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        """a static method for loading the vectorizer from file

        Args:
            vectorizer_filepath (str): the location of the serialized vectorizer
        Returns:
            an instance of ReviewVectorizer
        """
        with open(vectorizer_filepath) as fp:
            return ReviewVectorizer.from_serializable(json.load(fp))

    def save_vectorizer(self, vectorizer_filepath):
        """saves the vectorizer to disk using json

        Args:
            vectorizer_filepath (str): the location to save the vectorizer
        """
        with open(vectorizer_filepath, "w") as fp:
            json.dump(self._vectorizer.to_serializable(), fp)

    def get_vectorizer(self):
        """ returns the vectorizer """
        return self._vectorizer

    def set_split(self, split="train"):
        """ selects the splits in the dataset using a column in the dataframe

        Args:
            split (str): one of "train", "val", or "test"
        """
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]

    def __len__(self):
        return self._target_size

    def __getitem__(self, index):
        """the primary entry point method for PyTorch datasets

        Args:
            index (int): the index to the data point
        Returns:
            a dictionary holding the data point's features (x_data) and label (y_target)
        """
        row = self._target_df.iloc[index]

        review_vector = \
            self._vectorizer.vectorize(row.review)

        rating_index = \
            self._vectorizer.rating_vocab.lookup_token(row.rating)

        return {'x_data': review_vector,
                'y_target': rating_index}

    def get_num_batches(self, batch_size):
        """Given a batch size, return the number of batches in the dataset

        Args:
            batch_size (int)
        Returns:
            number of batches in the dataset
        """
        return len(self) // batch_size


def generate_batches(dataset, batch_size, shuffle=True,
                     drop_last=True, device="cpu"):
    """
    A generator function which wraps the PyTorch DataLoader. It will
    ensure each tensor is on the right device.
    """
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last)

    for data_dict in dataloader:
        out_data_dict = {}
        for name, tensor in data_dict.items():
            out_data_dict[name] = data_dict[name].to(device)
        yield out_data_dict
class ReviewClassifier(nn.Module):
    """ a simple perceptron based classifier """
    def __init__(self, num_features):
        """
        Args:
            num_features (int): the size of the input feature vector
        """
        super(ReviewClassifier, self).__init__()
        self.fc1 = nn.Linear(in_features=num_features,
                             out_features=1)

    def forward(self, x_in, apply_sigmoid=False):
        """The forward pass of the classifier

        Args:
            x_in (torch.Tensor): an input data tensor.
                x_in.shape should be (batch, num_features)
            apply_sigmoid (bool): a flag for the sigmoid activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch,)
        """
        y_out = self.fc1(x_in).squeeze()
        if apply_sigmoid:
            y_out = torch.sigmoid(y_out)
        return y_out


def make_train_state(args):
    return {'stop_early': False,
            'early_stopping_step': 0,
            'early_stopping_best_val': 1e8,
            'learning_rate': args.learning_rate,
            'epoch_index': 0,
            'train_loss': [],
            'train_acc': [],
            'val_loss': [],
            'val_acc': [],
            'test_loss': -1,
            'test_acc': -1,
            'model_filename': args.model_state_file}


def update_train_state(args, model, train_state):
    """Handle the training state updates.

    Components:
     - Early Stopping: Prevent overfitting.
     - Model Checkpoint: Model is saved if the model is better

    :param args: main arguments
    :param model: model to train
    :param train_state: a dictionary representing the training state values
    :returns:
        a new train_state
    """

    # Save one model at least
    if train_state['epoch_index'] == 0:
        torch.save(model.state_dict(), train_state['model_filename'])
        train_state['stop_early'] = False

    # Save model if performance improved
    elif train_state['epoch_index'] >= 1:
        loss_tm1, loss_t = train_state['val_loss'][-2:]

        # If loss worsened
        if loss_t >= train_state['early_stopping_best_val']:
            # Update step
            train_state['early_stopping_step'] += 1
        # Loss decreased
        else:
            # Save the best model
            if loss_t < train_state['early_stopping_best_val']:
                torch.save(model.state_dict(), train_state['model_filename'])
            # Track the best validation loss seen so far
            # (without this update, early stopping can never trigger)
            train_state['early_stopping_best_val'] = loss_t

            # Reset early stopping step
            train_state['early_stopping_step'] = 0

        # Stop early ?
        train_state['stop_early'] = \
            train_state['early_stopping_step'] >= args.early_stopping_criteria

    return train_state


def compute_accuracy(y_pred, y_target):
    y_target = y_target.cpu()
    y_pred_indices = (torch.sigmoid(y_pred)>0.5).cpu().long()#.max(dim=1)[1]
    n_correct = torch.eq(y_pred_indices, y_target).sum().item()
    return n_correct / len(y_pred_indices) * 100


def set_seed_everywhere(seed, cuda):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if cuda:
        torch.cuda.manual_seed_all(seed)


def handle_dirs(dirpath):
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)
args = Namespace(
    # Data and Path information
    frequency_cutoff=25,
    model_state_file='model.pth',
    review_csv='data/yelp/reviews_with_splits_lite.csv',
    # review_csv='data/yelp/reviews_with_splits_full.csv',
    save_dir='model_storage/ch3/yelp/',
    vectorizer_file='vectorizer.json',
    # No Model hyper parameters
    # Training hyper parameters
    batch_size=128,
    early_stopping_criteria=5,
    learning_rate=0.001,
    num_epochs=100,
    seed=1337,
    # Runtime options
    catch_keyboard_interrupt=True,
    cuda=True,
    expand_filepaths_to_save_dir=True,
    reload_from_files=False,
)

if args.expand_filepaths_to_save_dir:
    args.vectorizer_file = os.path.join(args.save_dir,
                                        args.vectorizer_file)

    args.model_state_file = os.path.join(args.save_dir,
                                         args.model_state_file)

    print("Expanded filepaths: ")
    print("\t{}".format(args.vectorizer_file))
    print("\t{}".format(args.model_state_file))

# Check CUDA
if not torch.cuda.is_available():
    args.cuda = False
if torch.cuda.device_count() > 1:
    print("Using", torch.cuda.device_count(), "GPUs!")

args.device = torch.device("cuda" if args.cuda else "cpu")

# Set seed for reproducibility
set_seed_everywhere(args.seed, args.cuda)

# handle dirs
handle_dirs(args.save_dir)


if args.reload_from_files:
    # training from a checkpoint
    print("Loading dataset and vectorizer")
    dataset = ReviewDataset.load_dataset_and_load_vectorizer(args.review_csv,
                                                             args.vectorizer_file)
else:
    print("Loading dataset and creating vectorizer")
    # create dataset and vectorizer
    dataset = ReviewDataset.load_dataset_and_make_vectorizer(args.review_csv)
    dataset.save_vectorizer(args.vectorizer_file)
vectorizer = dataset.get_vectorizer()

classifier = ReviewClassifier(num_features=len(vectorizer.review_vocab))

classifier = nn.DataParallel(classifier)
classifier = classifier.to(args.device)

loss_func = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer,
                                                 mode='min', factor=0.5,
                                                 patience=1)

train_state = make_train_state(args)

epoch_bar = tqdm(desc='training routine',
                 total=args.num_epochs,
                 position=0)

dataset.set_split('train')
train_bar = tqdm(desc='split=train',
                 total=dataset.get_num_batches(args.batch_size),
                 position=1,
                 leave=True)
dataset.set_split('val')
val_bar = tqdm(desc='split=val',
               total=dataset.get_num_batches(args.batch_size),
               position=1,
               leave=True)
try:
    for epoch_index in range(args.num_epochs):
        train_state['epoch_index'] = epoch_index

        # Iterate over training dataset

        # setup: batch generator, set loss and acc to 0, set train mode on
        dataset.set_split('train')
        batch_generator = generate_batches(dataset,
                                           batch_size=args.batch_size,
                                           device=args.device)
        running_loss = 0.0
        running_acc = 0.0
        classifier.train()

        for batch_index, batch_dict in enumerate(batch_generator):
            # the training routine is these 5 steps:

            # --------------------------------------
            # step 1. zero the gradients
            optimizer.zero_grad()

            # step 2. compute the output
            y_pred = classifier(x_in=batch_dict['x_data'].float())

            # step 3. compute the loss
            loss = loss_func(y_pred, batch_dict['y_target'].float())
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)

            # step 4. use loss to produce gradients
            loss.backward()

            # step 5. use optimizer to take gradient step
            optimizer.step()
            # -----------------------------------------
            # compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)

            # update bar
            train_bar.set_postfix(loss=running_loss,
                                  acc=running_acc,
                                  epoch=epoch_index)
            train_bar.update()

        train_state['train_loss'].append(running_loss)
        train_state['train_acc'].append(running_acc)

        # Iterate over val dataset

        # setup: batch generator, set loss and acc to 0; set eval mode on
        dataset.set_split('val')
        batch_generator = generate_batches(dataset,
                                           batch_size=args.batch_size,
                                           device=args.device)
        running_loss = 0.
        running_acc = 0.
        classifier.eval()

        for batch_index, batch_dict in enumerate(batch_generator):

            # compute the output
            y_pred = classifier(x_in=batch_dict['x_data'].float())

            # compute the loss
            loss = loss_func(y_pred, batch_dict['y_target'].float())
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)

            # compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)

            val_bar.set_postfix(loss=running_loss,
                                acc=running_acc,
                                epoch=epoch_index)
            val_bar.update()

        train_state['val_loss'].append(running_loss)
        train_state['val_acc'].append(running_acc)

        train_state = update_train_state(args=args, model=classifier,
                                         train_state=train_state)

        scheduler.step(train_state['val_loss'][-1])

        train_bar.n = 0
        val_bar.n = 0
        epoch_bar.update()

        if train_state['stop_early']:
            break
except KeyboardInterrupt:
    print("Exiting loop")


classifier.load_state_dict(torch.load(train_state['model_filename']))
classifier = classifier.to(args.device)

dataset.set_split('test')
batch_generator = generate_batches(dataset,
                                   batch_size=args.batch_size,
                                   device=args.device)
running_loss = 0.
running_acc = 0.
classifier.eval()

for batch_index, batch_dict in enumerate(batch_generator):
    # compute the output
    y_pred = classifier(x_in=batch_dict['x_data'].float())

    # compute the loss
    loss = loss_func(y_pred, batch_dict['y_target'].float())
    loss_t = loss.item()
    running_loss += (loss_t - running_loss) / (batch_index + 1)

    # compute the accuracy
    acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
    running_acc += (acc_t - running_acc) / (batch_index + 1)

train_state['test_loss'] = running_loss
train_state['test_acc'] = running_acc


print("Test loss: {:.3f}".format(train_state['test_loss']))
print("Test Accuracy: {:.2f}".format(train_state['test_acc']))


def preprocess_text(text):
    text = text.lower()
    text = re.sub(r"([.,!?])", r" \1 ", text)
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text


def predict_rating(review, classifier, vectorizer, decision_threshold=0.5):
    """Predict the rating of a review

    Args:
        review (str): the text of the review
        classifier (ReviewClassifier): the trained model
        vectorizer (ReviewVectorizer): the corresponding vectorizer
        decision_threshold (float): The numerical boundary which separates the rating classes
    """
    review = preprocess_text(review)

    vectorized_review = torch.tensor(vectorizer.vectorize(review))
    result = classifier(vectorized_review.view(1, -1))

    probability_value = torch.sigmoid(result).item()
    index = 1
    if probability_value < decision_threshold:
        index = 0

    return vectorizer.rating_vocab.lookup_index(index)


test_review = "this is a pretty awesome book"

classifier = classifier.cpu()
prediction = predict_rating(test_review, classifier, vectorizer, decision_threshold=0.5)
print("{} -> {}".format(test_review, prediction))


# Sort weights (unwrap nn.DataParallel before reading fc1)
fc1 = classifier.module.fc1 if isinstance(classifier, nn.DataParallel) else classifier.fc1
fc1_weights = fc1.weight.detach()[0]
_, indices = torch.sort(fc1_weights, dim=0, descending=True)
indices = indices.numpy().tolist()

# Top 20 words
print("Influential words in Positive Reviews:")
print("--------------------------------------")
for i in range(20):
    print(vectorizer.review_vocab.lookup_index(indices[i]))

print("====\n\n\n")

# Top 20 negative words
print("Influential words in Negative Reviews:")
print("--------------------------------------")
indices.reverse()
for i in range(20):
    print(vectorizer.review_vocab.lookup_index(indices[i]))
@ -12,16 +12,42 @@ taxonomy:

Task backlog:

- Try to present at a local conference (Data, Znalosti and WIKT) or in the faculty proceedings (a short version of the diploma thesis).
- Use the Multext East corpus in training. Create a mapping of Multext tags to SNK tags.


Virtual meeting 6.11.2020

Status:

- Read 2 articles in detail and made notes. The notes are on the Git repository.
- Finished additional experiments.

Tasks for the next meeting:

- Continue with the open tasks.


Virtual meeting 30.10.2020

Status:

- The files are on the Git repository.
- Experiments performed; the results are in a table.
- Instructions for running the experiments written.
- Technical problems resolved. A Conda environment is available.

Tasks for the next meeting:

- Study the literature on the topics "pretrain" and "word embedding":
- [Healthcare NER Models Using Language Model Pretraining](http://ceur-ws.org/Vol-2551/paper-04.pdf)
- [Design and implementation of an open source Greek POS Tagger and Entity Recognizer using spaCy](https://ieeexplore.ieee.org/abstract/document/8909591)
- https://arxiv.org/abs/1909.00505
- https://arxiv.org/abs/1607.04606
- LSTM, recurrent neural networks.
- Make notes from several articles; write down the source and what you learned.
- Perform several experiments with pretraining (different models, different sizes of adaptation data) and compile the results into a table.
- Describe pretraining and summarize its effect on training in a short article of about 10 pages.
- Try to present at a local conference (Data, Znalosti and WIKT) or in the faculty proceedings (a short version of the diploma thesis).
- Use the Multext East corpus in training. Create a mapping of Multext tags to SNK tags.


Virtual meeting 8.10.2020

@ -21,6 +21,46 @@ Cieľom práce je príprava nástrojov a budovanie tzv. "Question Answering data

## Diploma Project 2

Task backlog:

- Is it possible to find out how much time an annotator spent creating a question? If it can be determined from the DB schema, it would be good to display it in the web application.


Virtual meeting 27.10.2020

Status:

- The web application was finished according to the instructions from the previous meeting; the code is on the Git repository.

Tasks for the next meeting:

- Create a configuration system: load the configuration from a file (python-configuration?). The name of the configuration file should be changeable through an environment variable (getenv).
- Add authentication for annotators when displaying results, so that an annotator sees only their own results. Is it necessary? For now, implement it using e-mail only.
- Add a password for the web application.
- Add a display of bad and good annotations for each annotator.
- Study the research literature on the topic "Crowdsourcing language resources". Select several publications (Scholar, Scopus), write down the bibliographic reference and what you learned from the publications about building language resources. What other corpora were created with this method?
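
A minimal sketch of such a configuration loader, assuming a JSON configuration file; the file name `config.json` and the variable name `APP_CONFIG` are illustrative assumptions, not part of the assignment:

```python
import json
import os


def load_config(default_path="config.json"):
    """Load configuration from a JSON file.

    The file name can be overridden through the hypothetical APP_CONFIG
    environment variable, as the task suggests with getenv.
    """
    path = os.getenv("APP_CONFIG", default_path)
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

The python-configuration package mentioned above offers a richer interface (layered sources, dotted keys); the sketch only shows the environment-variable indirection.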


Virtual meeting 20.10.2020

Status:

- Improved data preparation script, with a slight change of the interface (duplicate work caused by a gap in communication).

Tasks for the next meeting:

- Finish the web application for reporting the amount of annotated data.
- Fix the bugs related to the new annotation scheme.
- Display the amount of annotated data.
- Display the amount of valid annotated data.
- Display the amount of validated data.
- Questions must not repeat within one paragraph. Every question must have an answer. Every question must be longer than 10 characters or longer than 2 words. The answer must contain at least one word. The question must contain Slovak words.
- Push the results to the project repository, into the database_app directory, as soon as possible.
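
The validation rules above can be sketched as a small check; the function name is an illustrative assumption, and the Slovak-vocabulary test is left out because it needs a word list:

```python
def is_valid_annotation(question, answer, seen_questions):
    """Apply the validation rules to one question-answer pair."""
    # Questions must not repeat within one paragraph.
    if question in seen_questions:
        return False
    # Every question must be longer than 10 characters or longer than 2 words.
    if len(question) <= 10 and len(question.split()) <= 2:
        return False
    # The answer must contain at least one word.
    if not answer.split():
        return False
    return True
```
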


Meeting 25.9.2020

Done:
@ -6,10 +6,8 @@ taxonomy:
tag: [demo,nlp]
author: Daniel Hladek
---

# Martin Jancura

*Year of starting studies*: 2017

## Bachelor Project 2020
@ -31,9 +29,36 @@ Možné backendy:

Task backlog:

- Prepare the backend.
- Prepare the frontend in JavaScript - in progress.
- Store a translation made by a human into the database.


Virtual meeting 6.11.2020:

Status:

Work on the written part.

Tasks for the next meeting:

- Find a library that allows using our own translation model. Try to install OpenNMT.
- Work through the tutorial https://github.com/OpenNMT/OpenNMT-py#quickstart or a similar one.
- Propose how to connect the frontend and the backend.


Virtual meeting 23.10.2020:

Status:

- Created a frontend that communicates with the Microsoft Translation API; it uses Axios and vanilla JavaScript.

Tasks for the next meeting:

- Find a library that allows using our own translation model. Try to install OpenNMT.
- Find out what the CORS policy means.
- Continue writing the thesis and add a section on machine translation. Read the articles at https://opennmt.net/OpenNMT/references/ and make notes. In each note include the bibliographic reference and what you learned from the article.
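
CORS appears exactly in this frontend-backend setup: a browser only lets a page read a response from another origin if the server opts in with a header. A minimal sketch with the Python standard library; the handler name and response text are illustrative assumptions, not the project's backend:

```python
from http.server import BaseHTTPRequestHandler


class TranslateHandler(BaseHTTPRequestHandler):
    """Stub backend response carrying the CORS header a browser frontend needs."""

    def do_GET(self):
        self.send_response(200)
        # Without this header, the browser blocks a cross-origin frontend
        # (for example Axios served from another port) from reading the body.
        # "*" allows any origin; a production setup would name the frontend origin.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        self.wfile.write("translation placeholder".encode("utf-8"))

    def log_message(self, fmt, *args):
        # Keep the sketch quiet; the default handler logs every request.
        pass
```

The server can then be started with `HTTPServer(("localhost", 8080), TranslateHandler).serve_forever()`.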


Virtual meeting 16.10:

Status:

@ -31,7 +31,42 @@ Návrh na zadanie:

1. Propose possible improvements of the application you created.

Task backlog:

- Create a repository on the Git server named bp2020. Put the code and the documentation you create into it.
- Prepare a Docker image of your application following https://pythonspeed.com/docker/
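
A minimal sketch of such an image, following the usual advice from the linked guide (slim base image, dependencies installed before the code is copied so the layer is cached); the file names `requirements.txt` and `app.py` are illustrative assumptions:

```dockerfile
FROM python:3.8-slim

WORKDIR /app

# Copy and install dependencies first so this layer is reused
# between code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD ["python", "app.py"]
```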


Virtual meeting 30.10.:

Status:

- Modified the existing "spacy-streamlit" application; the source code is on the Git repository, following the instructions from the previous meeting.
- It contains a form, but does not contain a REST API.

Tasks for the next meeting:

- Continue writing. Read research articles on the topic "dependency parsing" and make notes on what you learned. Write down the source.
- Continue working on the demonstration web application.


Virtual meeting 19.10.:

Status:

- Notes for the bachelor thesis written and submitted; they contain excerpts from the literature.
- Repository created: https://git.kemt.fei.tuke.sk/mw223on/bp2020
- The Slovak spaCy model installed and running.
- The spaCy REST API installed: https://github.com/explosion/spacy-services
- The displaCy demo tried with the Slovak model.

Tasks for the next meeting:

- Prepare a web application that presents dependency and named entity recognition in the Slovak language. It should consist of a frontend and a backend.
- Write the required Python packages into the "requirements.txt" file.
- Create a script for installing the application with pip.
- Create a script for starting both the backend and the frontend. Put the results into the repository.
- Create a frontend design (HTML + CSS).
- Look at the spaCy source code and find out what exactly the displacy.serve command does.
- Put the results into the repository.

Virtual meeting 9.10.

@ -20,8 +20,21 @@ Návrh na zadanie:

2. Create a language model using BERT or a similar method.
3. Evaluate the created language model and propose improvements.

Task backlog:

Virtual meeting 30.10.2020

Status:

- Notes on seq2seq written.
- PyTorch and fairseq installed.
- Problems with the tutorial. A solution could be to use the 0.9.0 release: pip install fairseq==0.9.0

Tasks for the next meeting:

- Resolve the technical problems.
- Work through the tutorial https://fairseq.readthedocs.io/en/latest/getting_started.html#training-a-new-model
- Work through the tutorial https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md or a similar one.
- Study articles on the topic of BERT; make notes on what you learned together with the source.


Virtual meeting 16.10.2020

@ -23,13 +23,50 @@ Pokusný klaster Raspberry Pi pre výuku klaudových technológií

The goal of the project is to build a cheap home cluster for teaching cloud technologies.


Task backlog:

- Activate the WSL2 technology and Docker Desktop if you use Windows.

Virtual meeting 30.10.

Status:

- Written overview prepared according to the instructions.
- Raspberry Pi OS installed in VirtualBox.
- Preliminary hardware design prepared.
- Docker Toolbox and Ubuntu with Docker installed.
- Got familiar with Docker.
- Supervisor: hardware purchase carried out. Boards: 5x RPi4 model B 8GB; SD cards: 11x 128GB; The Pi Hut Cluster Case for Raspberry Pi: 4x; power: 1x 60W supply and an 18W Quick Charger Epico; a 220V cable and a socket with a switch.

For the next meeting:

- Is it possible to buy the official 5-port switch?
- Complete the purchase and agree on the handover. Sign the handover protocol.
- Use https://kind.sigs.k8s.io to simulate a cluster.
- Install https://microk8s.io/ and read the tutorials at https://ubuntu.com/tutorials/
- Work through https://kubernetes.io/docs/tutorials/hello-minikube/ or similar tutorials.


Virtual meeting 16.10.

Status:

- Articles read.
- The Docker tutorial from ZCT started.
- The supervisor created access to a Jetson Xavier AGX2 with an ARM processor.
- The purchase of the Raspberry Pi and accessories started.

Tasks for the next meeting:

- Prepare an overview of at least 4 existing Raspberry Pi cluster solutions (to be submitted). What hardware and software did they use?
- Power supply, cooling, networking.
- Get familiar with https://www.raspberrypi.org/downloads/raspberry-pi-os/
- Install https://roboticsbackend.com/install-raspbian-desktop-on-a-virtual-machine-virtualbox/
- Write a detailed hardware proposal for building the Raspberry Pi cluster.

Meeting 29.9.

We agreed on the thesis assignment.

Suggestions for improvement (for the supervisor):

- Find out the conditions of financing (estimate 350 EUR).

@ -39,23 +39,35 @@ Učenie prebieha tak, že v texte ukážete ktoré slová patria názvom osôb,

Your task will be to mark proper nouns in the text.
In Slovak, a proper noun usually begins with a capital letter, but it may also contain other words written in lower case.

- PER: names of persons
- LOC: geographic names
- ORG: names of organizations
- MISC: other names, e.g. names of products.

If a proper noun contains another name within it, e.g. Nové Mesto nad Váhom, annotate it as one unit.
In the text you will also come across words that name a geographic area but are not proper nouns (e.g. britská kolónia "British colony", londýnsky šerif "London sheriff"...). We do not consider such words named entities, so please do not mark them.

If there are no annotations in the text, the article is still valid, so choose Accept.

If the text consists of only one word, or of a few words that carry no meaning on their own, the article is invalid, so choose Reject.

## Annotation batches

Write your e-mail into the form so that it is possible to identify who performed the annotation.

During annotation you can use keyboard shortcuts to speed up the work:
- 1,2,3,4 - switching between entity types
- key "a" - Accept
- key "x" - Reject
- key "space" - Ignore
- key "backspace" or "del" - Undo

After annotating, do not forget to save your work (the icon in the top left corner, or "Ctrl + s").

### Trial annotation batch

The batch is aimed at collecting feedback from annotators to improve the interface and the annotation process.

{% include "forms/form.html.twig" with { form: forms('ner1') } %}

@ -36,11 +36,12 @@ Učenie prebieha tak, že vytvoríte príklad s otázkou a odpoveďou. Účasť

## Guide for annotators

First you will see a short article. Your task is to read a part of the article, think of a question about it, and mark the answer in the text. The question must be unambiguous and the answer to the question must be present in the text of the article. You have about 50 seconds to mark one question.

1. Read the article. If the article is not suitable, tap the red cross "Reject" (Tab and then 'x').
2. Write a question. If you cannot think of a question, tap "Ignore" (Tab and then 'i').
3. Mark the answer with the mouse and tap the green check mark "Accept" (key 'a'), then continue with another question for the same article or a new article.
4. The same article will be shown to you 5 times; think of 5 different questions for it.

If the displayed text is not suitable, reject it. Unsuitable text:

@ -61,6 +62,12 @@ Ak je zobrazený text nevhodný, tak ho zamietnite. Nevhodný text:
4. <span style="color:pink">Na čo slúži lyzozóm?</span>
5. <span style="color:orange">Čo je to autofágia?</span>

An example of an incorrect question:
1. Čo je to Golgiho aparát? (What is the Golgi apparatus?) - the answer is not present in the article.
2. Čo sa deje v mŕtvych bunkách? (What happens in dead cells?) - the question is not unambiguous; the exact answer is not present in the article.
3. Čo je normálny fyziologický proces? (What is a normal physiological process?) - the answer is not present in the article.


Write your e-mail into the form so that it is possible to identify who performed the annotation.

## Annotation batches