This commit is contained in:
Daniel Hládek 2020-11-13 09:04:42 +01:00
commit f5455a89b3
40 changed files with 2860 additions and 48 deletions

View File

@ -13,11 +13,28 @@ Repozitár so [zdrojovými kódmi](https://git.kemt.fei.tuke.sk/dl874wn/dp2021)
## Diploma Project 2 2020
Virtual meeting 6.11.2020
Status:
- Table with 5 experiments prepared.
- Repository created.
For the next meeting:
- Upload the code to the repository.
- Record the dependencies (package names) in a requirements.txt file.
- Rework the experiment so that it accepts command-line arguments (sys.argv); see the sketch after this list.
- Add a run script for each experiment. The script should contain the parameters the experiment was run with.
- Finish the report.
- In the theory part, write an overview of punctuation restoration methods and a description of your own method.
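A minimal sketch of what accepting command-line arguments could look like; the script name and parameter names below are illustrative assumptions, not the actual experiment:
```
# Hypothetical sketch: read experiment parameters from the command line (sys.argv via argparse).
# The parameter names (--train-file, --epochs, --learning-rate) are assumptions, not the real ones.
import argparse
import sys


def parse_args(argv):
    parser = argparse.ArgumentParser(description='Punctuation restoration experiment')
    parser.add_argument('--train-file', required=True, help='path to the training data')
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--learning-rate', type=float, default=1e-3)
    return parser.parse_args(argv)


if __name__ == '__main__':
    args = parse_args(sys.argv[1:])
    print('Running experiment with:', vars(args))
    # run_experiment(args)  # hypothetical entry point
```
The run script mentioned in the task above can then simply record the exact invocation, e.g. `python experiment.py --train-file data/train.txt --epochs 10`.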
Virtual meeting 25.9.2020
Done:
- script for evaluating the experiments.
Tasks for the next meeting:

View File

@ -21,8 +21,21 @@ Zásobník úloh:
- Use the model to support annotation
- Create a report in the form of an article by the end of the winter semester.
- Write down rules for validation. What annotation result counts as good? Do the annotated data need to be checked?
Virtual meeting 30.10.2020:
Status:
- Improved tutorial
- Tried exporting data and training a model from the database. Problem when training in spaCy - different results than when training via Prodigy
- Work on the text part.
Tasks for the next meeting:
- Create a repository named dp2021 and add your scripts and notes there.
- Continue writing the thesis. Do a literature survey on "named entity corpora" and take notes.
- Create a way to determine the amount and kind of annotated data. How many articles? How many entities of each type? The resulting table will go into the thesis.
- Prepare for production annotation. Is the scheme ready?
Virtual meeting 16.10.2020:

View File

@ -1 +1,40 @@
## Diploma Project 2 2020
Status:
- Updated annotation scheme (this is a test scheme with our own data)
- Several annotations done; training in Prodigy gives low accuracy because of the small amount of annotated data. Training in spaCy does not work yet.
- Statistics on the number of accepted and rejected annotations can be obtained from Prodigy: prodigy stats wikiart. So far 156 annotations (151 accept, 5 reject). To get an overview of the number of annotations per entity type we need to write a script.
- Literature review: Named Entity Corpus
- Building a corpus for NER: automatic creation of an already annotated corpus from Wikipedia using DBpedia. It is an English corpus, but it can be mentioned when comparing approaches.
- Building a Massive Corpus for Named Entity Recognition using Free Open Data Sources - Daniel Specht Menezes, Pedro Savarese, Ruy L. Milidiú
- Comparison of approaches to corpus annotation (in terms of both accuracy and time) - manual, semi-manual
- Comparison of Annotating Methods for Named Entity Corpora - Kanako Komiya, Masaya Suzuki
- What a corpus is, the development cycle, corpus analysis (literature already used: the MATTER cycle)
- Natural Language Annotation for Machine Learning - James Pustejovsky, Amber Stubbs
Update 09.11.2020:
- Fixed the problem where training in spaCy did not work
- Test annotation of about 500 sentences done. Training results after 20 iterations: F-score 47% (the same results when training in spaCy and in Prodigy)
- Statistics on the number of individual entities: the count.py script
## Diploma Project 1 2020
- creating and running the Docker container
```
./build-docker.sh
docker run -it -p 8080:8080 -v ${PWD}:/work prodigy bash
# (in my case:)
winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/annotation/work prodigy bash
```
### Running the annotation scheme
- `dataminer.csv` articles downloaded from the wiki
- `cd ner`
- `./01_text_to_sent.sh` runs the *text_to_sent.py* script, which splits the articles into individual sentences (a sketch of such a splitter follows this list)
- `./02_ner_correct.sh` starts the annotation process for NER with suggestions from the model
- `./03_ner_export.sh` exports the annotated data in the JSONL format needed for processing in spaCy
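The *text_to_sent.py* script itself is not included in this commit. A minimal sketch of such a sentence splitter, assuming NLTK's punkt tokenizer and assuming the article body sits in the last column of `dataminer.csv`:
```
# Hypothetical sketch of text_to_sent.py: write one sentence per output line.
# Assumptions: NLTK punkt tokenizer; the article body is the last CSV column.
import csv
import sys

import nltk

nltk.download('punkt', quiet=True)


def main(in_path='dataminer.csv', out_path='textfile.csv'):
    with open(in_path, newline='', encoding='utf-8') as src, \
         open(out_path, 'w', encoding='utf-8') as dst:
        for row in csv.reader(src):
            if not row:
                continue
            for sentence in nltk.sent_tokenize(row[-1]):
                dst.write(sentence.strip() + '\n')


if __name__ == '__main__':
    main(*sys.argv[1:])
```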

View File

@ -1,17 +1,16 @@
# > docker run -it -p 8080:8080 -v ${PWD}:/work prodigy bash
# > winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/annotation-master/annotation/work prodigy bash
FROM python:3.8
RUN mkdir /prodigy
WORKDIR /prodigy
COPY ./prodigy-1.9.6-cp36.cp37.cp38-cp36m.cp37m.cp38-linux_x86_64.whl /prodigy
RUN mkdir /work
COPY ./ner /work/ner
RUN pip install uvicorn==0.11.5 prodigy-1.9.6-cp36.cp37.cp38-cp36m.cp37m.cp38-linux_x86_64.whl
RUN pip install https://files.kemt.fei.tuke.sk/models/spacy/sk_sk1-0.0.1.tar.gz
RUN pip install nltk
EXPOSE 8080
ENV PRODIGY_HOME /work
ENV PRODIGY_HOST 0.0.0.0
WORKDIR /work

View File

@ -1,13 +1,11 @@
## Diploma Project 2 2020
- creating and running the Docker container
```
./build-docker.sh
winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/annotation-master/annotation/work prodigy bash
```
@ -17,5 +15,12 @@ winpty docker run --name prodigy -it -p 8080:8080 -v C://Users/jakub/Desktop/ann
- `dataminer.csv` articles downloaded from the wiki
- `cd ner`
- `./01_text_to_sent.sh` runs the *text_to_sent.py* script, which splits the articles into individual sentences
- `./02_ner_manual.sh` starts the manual annotation process for NER
- `./03_export.sh` exports the annotated data in the JSON format needed for processing in spaCy. Option to split into training (70%) and test data (30%) (--eval-split 0.3).
### Statistics about the annotated data
- `prodigy stats wikiart` - information about the number of accepted and rejected articles
- `python3 count.py` - information about the number of individual entities
### Training the model
Based on: https://git.kemt.fei.tuke.sk/dano/spacy-skmodel

View File

@ -0,0 +1,14 @@
# load data
filename = 'ner/annotations.jsonl'
file = open(filename, 'rt', encoding='utf-8')
text = file.read()
# count occurrences of each entity label (naive substring count over the raw JSONL text)
countPER = text.count('PER')
countLOC = text.count('LOC')
countORG = text.count('ORG')
countMISC = text.count('MISC')
print('Počet anotovaných entít typu PER:', countPER,'\n',
'Počet anotovaných entít typu LOC:', countLOC,'\n',
'Počet anotovaných entít typu ORG:', countORG,'\n',
'Počet anotovaných entít typu MISC:', countMISC,'\n')
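count.py counts raw substring occurrences, so the strings PER/LOC/ORG/MISC appearing anywhere else in the JSONL text would be counted as well. A stricter sketch that parses each record and counts only span labels, assuming the Prodigy export format where each line carries an optional `spans` list with `label` fields:
```
# Sketch: count entity labels by parsing the spans of each Prodigy JSONL record.
# Assumption: each line is a JSON object with an optional "spans" list whose items carry a "label".
import json
from collections import Counter

counts = Counter()
with open('ner/annotations.jsonl', 'rt', encoding='utf-8') as file:
    for line in file:
        record = json.loads(line)
        for span in record.get('spans') or []:
            counts[span['label']] += 1

for label in ('PER', 'LOC', 'ORG', 'MISC'):
    print('Number of annotated entities of type', label + ':', counts[label])
```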

View File

@ -1,3 +0,0 @@
prodigy ner.correct wikiart sk_sk1 ./textfile.csv --label OSOBA,MIESTO,ORGANIZACIA,PRODUKT

View File

@ -0,0 +1,2 @@
prodigy ner.manual wikiart sk_sk1 ./textfile.csv --label PER,LOC,ORG,MISC

View File

@ -0,0 +1 @@
prodigy data-to-spacy ./train.json ./eval.json --lang sk --ner wikiart --eval-split 0.3

View File

@ -1 +0,0 @@
prodigy db-out wikiart > ./annotations.jsonl

View File

@ -0,0 +1,19 @@
mkdir -p build
mkdir -p build/input
# Prepare Treebank
mkdir -p build/input/slovak-treebank
spacy convert ./sources/slovak-treebank/stb.conll ./build/input/slovak-treebank
# UDAG used as evaluation
mkdir -p build/input/ud-artificial-gapping
spacy convert ./sources/ud-artificial-gapping/sk-ud-crawled-orphan.conllu ./build/input/ud-artificial-gapping
# Prepare skner
mkdir -p build/input/skner
# Convert to IOB
cat ./sources/skner/wikiann-sk.bio | python ./sources/bio-to-iob.py > build/input/skner/wikiann-sk.iob
# Split to train test
cat ./build/input/skner/wikiann-sk.iob | python ./sources/iob-to-traintest.py ./build/input/skner/wikiann-sk
# Convert train and test
mkdir -p build/input/skner-train
spacy convert -n 15 --converter ner ./build/input/skner/wikiann-sk.train ./build/input/skner-train
mkdir -p build/input/skner-test
spacy convert -n 15 --converter ner ./build/input/skner/wikiann-sk.test ./build/input/skner-test

View File

@ -0,0 +1,19 @@
set -e
OUTDIR=build/train/output
TRAINDIR=build/train
mkdir -p $TRAINDIR
mkdir -p $OUTDIR
mkdir -p dist
# Delete old training results
rm -rf $OUTDIR/*
# Train dependency and POS
spacy train sk $OUTDIR ./build/input/slovak-treebank ./build/input/ud-artificial-gapping --n-iter 20 -p tagger,parser
rm -rf $TRAINDIR/posparser
mv $OUTDIR/model-best $TRAINDIR/posparser
# Train NER
# python ./train.py -t ./train.json -o $TRAINDIR/nerposparser -n 10 -m $TRAINDIR/posparser/
spacy train sk $TRAINDIR/nerposparser ./ner/train.json ./ner/eval.json --n-iter 20 -p ner
# Package model
spacy package $TRAINDIR/nerposparser dist --meta-path ./meta.json --force
cd dist/sk_sk1-0.2.0
python ./setup.py sdist --dist-dir ../

View File

@ -31,11 +31,39 @@ Zásobník úloh:
- Make a public demo - deployment using Docker
- Improve the Web UI
- Create a REST API for indexing a document.
- In the index, assign a score to every document using several methods, e.g. PageRank
- Use the scoring during search
- **Use the SCNC validation database to evaluate each method**
- **By the end of the winter semester, create a "mini diploma thesis" of about 8 pages with experiments, in the form of an article**
Virtual meeting 6.11.2020:
Status:
- Working through problems with Cassandra and JavaScript. How does the then function work?
Tasks for the next meeting:
- Write a function for indexing. The input is a document (an object with text and metadata). The function indexes the document into ES.
- Study how the then function works and what a callback is.
- Study how a Promise is used.
- Study how async - await works.
- https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Asynchronous/
Virtual meeting 23.10.2020:
Status:
- Working through problems with Cassandra. How to select data by primary key.
For the next meeting:
- Continue with the open tasks.
- Write a function for indexing a single document.
Virtual meeting 16.10.
Status:

View File

@ -0,0 +1,105 @@
//Jan Holp, DP 2021
//client1 = cassandra
//client2 = elasticsearch
//-----------------------------------------------------------------
// require the Elasticsearch library
const elasticsearch = require('elasticsearch');
const client2 = new elasticsearch.Client({
hosts: [ 'localhost:9200']
});
client2.ping({
requestTimeout: 30000,
}, function(error) {
// at this point, Elasticsearch is down, please check your Elasticsearch service
if (error) {
console.error('Elasticsearch cluster is down!');
} else {
console.log('Everything is ok');
}
});
//create new index skweb2
client2.indices.create({
index: 'skweb2'
}, function(error, response, status) {
if (error) {
console.log(error);
} else {
console.log("created a new index", response);
}
});
const cassandra = require('cassandra-driver');
const client1 = new cassandra.Client({ contactPoints: ['localhost:9042'], localDataCenter: 'datacenter1', keyspace: 'websucker' });
const query = 'SELECT title FROM websucker.content WHERE body_size > 0 ALLOW FILTERING';
client1.execute(query)
  .then(result => {
    console.log('Everything is ok');
    console.log(result);
  })
  .catch(error => {
    console.error('Something is wrong!');
    console.log(error);
  });
/*
async function indexData() {
var i = 0;
const query = 'SELECT title FROM websucker.content WHERE body_size > 0 ALLOW FILTERING';
client1.execute(query)
.then((result) => {
try {
//for ( i=0; i<15;i++){
console.log('%s', result.row[0].title)
//}
} catch (query) {
if (query instanceof SyntaxError) {
console.log( "Neplatne query" );
}
}
});
}
/*
//indexing method
const bulkIndex = function bulkIndex(index, type, data) {
let bulkBody = [];
id = 1;
const errorCount = 0;
data.forEach(item => {
bulkBody.push({
index: {
_index: index,
_type: type,
_id : id++,
}
});
bulkBody.push(item);
});
console.log(bulkBody);
client.bulk({body: bulkBody})
.then(response => {
response.items.forEach(item => {
if (item.index && item.index.error) {
console.log(++errorCount, item.index.error);
}
});
console.log(
`Successfully indexed ${data.length - errorCount}
out of ${data.length} items`
);
})
.catch(console.err);
};
*/

View File

@ -23,13 +23,26 @@ Zásobník úloh :
- tesla
- xavier
- Training on two GPUs on a single machine
- idoc DONE
- titan
- possibly training on 4 GPUs on a single machine
- quadra
- *Training on two GPUs on two machines using NCCL (idoc, tesla)*
- possibly training on 2 GPUs on two machines (quadra plus idoc).
Virtual meeting 27.10.2020
Status:
- Training on the CPU, on 1 GPU, and on 2 GPUs on idoc
- Preparing materials for training on two machines using PyTorch.
- Access to tesla and xavier set up.
Tasks for the next meeting:
- Study the technical literature and take notes.
- Continue with the open tasks from the backlog
- Store the completed scripts in a Git repository
- create a dp2021 repository
Meeting 2.10.2020

View File

@ -1 +1,4 @@
## All scripts, files and configurations
https://github.com/pytorch/examples/tree/master/imagenet
- should work for DDP; the ImageNet archive is not available from the official site

View File

@ -0,0 +1,76 @@
import argparse
import datetime
import os
import socket
import sys
import numpy as np
from torch.utils.tensorboard import SummaryWriter
import torch
import torch.nn as nn
import torch.optim
from torch.optim import SGD, Adam
from torch.utils.data import DataLoader
from util.util import enumerateWithEstimate
from p2ch13.dsets import Luna2dSegmentationDataset, TrainingLuna2dSegmentationDataset, getCt
from util.logconf import logging
from util.util import xyz2irc
from p2ch13.model_seg import UNetWrapper, SegmentationAugmentation
from p2ch13.train_seg import LunaTrainingApp
log = logging.getLogger(__name__)
# log.setLevel(logging.WARN)
# log.setLevel(logging.INFO)
log.setLevel(logging.DEBUG)
class BenchmarkLuna2dSegmentationDataset(TrainingLuna2dSegmentationDataset):
def __len__(self):
# return 500
return 5000
return 1000
class LunaBenchmarkApp(LunaTrainingApp):
def initTrainDl(self):
train_ds = BenchmarkLuna2dSegmentationDataset(
val_stride=10,
isValSet_bool=False,
contextSlices_count=3,
# augmentation_dict=self.augmentation_dict,
)
batch_size = self.cli_args.batch_size
if self.use_cuda:
batch_size *= torch.cuda.device_count()
train_dl = DataLoader(
train_ds,
batch_size=batch_size,
num_workers=self.cli_args.num_workers,
pin_memory=self.use_cuda,
)
return train_dl
def main(self):
log.info("Starting {}, {}".format(type(self).__name__, self.cli_args))
train_dl = self.initTrainDl()
for epoch_ndx in range(1, 2):
log.info("Epoch {} of {}, {}/{} batches of size {}*{}".format(
epoch_ndx,
self.cli_args.epochs,
len(train_dl),
len([]),
self.cli_args.batch_size,
(torch.cuda.device_count() if self.use_cuda else 1),
))
self.doTraining(epoch_ndx, train_dl)
if __name__ == '__main__':
LunaBenchmarkApp().main()

View File

@ -0,0 +1,401 @@
import copy
import csv
import functools
import glob
import math
import os
import random
from collections import namedtuple
import SimpleITK as sitk
import numpy as np
import scipy.ndimage.morphology as morph
import torch
import torch.cuda
import torch.nn.functional as F
from torch.utils.data import Dataset
from util.disk import getCache
from util.util import XyzTuple, xyz2irc
from util.logconf import logging
log = logging.getLogger(__name__)
# log.setLevel(logging.WARN)
# log.setLevel(logging.INFO)
log.setLevel(logging.DEBUG)
raw_cache = getCache('part2ch13_raw')
MaskTuple = namedtuple('MaskTuple', 'raw_dense_mask, dense_mask, body_mask, air_mask, raw_candidate_mask, candidate_mask, lung_mask, neg_mask, pos_mask')
CandidateInfoTuple = namedtuple('CandidateInfoTuple', 'isNodule_bool, hasAnnotation_bool, isMal_bool, diameter_mm, series_uid, center_xyz')
@functools.lru_cache(1)
def getCandidateInfoList(requireOnDisk_bool=True):
# We construct a set with all series_uids that are present on disk.
# This will let us use the data, even if we haven't downloaded all of
# the subsets yet.
mhd_list = glob.glob('data-unversioned/subset*/*.mhd')
presentOnDisk_set = {os.path.split(p)[-1][:-4] for p in mhd_list}
candidateInfo_list = []
with open('data/annotations_with_malignancy.csv', "r") as f:
for row in list(csv.reader(f))[1:]:
series_uid = row[0]
annotationCenter_xyz = tuple([float(x) for x in row[1:4]])
annotationDiameter_mm = float(row[4])
isMal_bool = {'False': False, 'True': True}[row[5]]
candidateInfo_list.append(
CandidateInfoTuple(
True,
True,
isMal_bool,
annotationDiameter_mm,
series_uid,
annotationCenter_xyz,
)
)
with open('data/candidates.csv', "r") as f:
for row in list(csv.reader(f))[1:]:
series_uid = row[0]
if series_uid not in presentOnDisk_set and requireOnDisk_bool:
continue
isNodule_bool = bool(int(row[4]))
candidateCenter_xyz = tuple([float(x) for x in row[1:4]])
if not isNodule_bool:
candidateInfo_list.append(
CandidateInfoTuple(
False,
False,
False,
0.0,
series_uid,
candidateCenter_xyz,
)
)
candidateInfo_list.sort(reverse=True)
return candidateInfo_list
@functools.lru_cache(1)
def getCandidateInfoDict(requireOnDisk_bool=True):
candidateInfo_list = getCandidateInfoList(requireOnDisk_bool)
candidateInfo_dict = {}
for candidateInfo_tup in candidateInfo_list:
candidateInfo_dict.setdefault(candidateInfo_tup.series_uid,
[]).append(candidateInfo_tup)
return candidateInfo_dict
class Ct:
def __init__(self, series_uid):
mhd_path = glob.glob(
'data-unversioned/subset*/{}.mhd'.format(series_uid)
)[0]
ct_mhd = sitk.ReadImage(mhd_path)
self.hu_a = np.array(sitk.GetArrayFromImage(ct_mhd), dtype=np.float32)
# CTs are natively expressed in https://en.wikipedia.org/wiki/Hounsfield_scale
# HU are scaled oddly, with 0 g/cc (air, approximately) being -1000 and 1 g/cc (water) being 0.
self.series_uid = series_uid
self.origin_xyz = XyzTuple(*ct_mhd.GetOrigin())
self.vxSize_xyz = XyzTuple(*ct_mhd.GetSpacing())
self.direction_a = np.array(ct_mhd.GetDirection()).reshape(3, 3)
candidateInfo_list = getCandidateInfoDict()[self.series_uid]
self.positiveInfo_list = [
candidate_tup
for candidate_tup in candidateInfo_list
if candidate_tup.isNodule_bool
]
self.positive_mask = self.buildAnnotationMask(self.positiveInfo_list)
self.positive_indexes = (self.positive_mask.sum(axis=(1,2))
.nonzero()[0].tolist())
def buildAnnotationMask(self, positiveInfo_list, threshold_hu = -700):
boundingBox_a = np.zeros_like(self.hu_a, dtype=np.bool)
for candidateInfo_tup in positiveInfo_list:
center_irc = xyz2irc(
candidateInfo_tup.center_xyz,
self.origin_xyz,
self.vxSize_xyz,
self.direction_a,
)
ci = int(center_irc.index)
cr = int(center_irc.row)
cc = int(center_irc.col)
index_radius = 2
try:
while self.hu_a[ci + index_radius, cr, cc] > threshold_hu and \
self.hu_a[ci - index_radius, cr, cc] > threshold_hu:
index_radius += 1
except IndexError:
index_radius -= 1
row_radius = 2
try:
while self.hu_a[ci, cr + row_radius, cc] > threshold_hu and \
self.hu_a[ci, cr - row_radius, cc] > threshold_hu:
row_radius += 1
except IndexError:
row_radius -= 1
col_radius = 2
try:
while self.hu_a[ci, cr, cc + col_radius] > threshold_hu and \
self.hu_a[ci, cr, cc - col_radius] > threshold_hu:
col_radius += 1
except IndexError:
col_radius -= 1
# assert index_radius > 0, repr([candidateInfo_tup.center_xyz, center_irc, self.hu_a[ci, cr, cc]])
# assert row_radius > 0
# assert col_radius > 0
boundingBox_a[
ci - index_radius: ci + index_radius + 1,
cr - row_radius: cr + row_radius + 1,
cc - col_radius: cc + col_radius + 1] = True
mask_a = boundingBox_a & (self.hu_a > threshold_hu)
return mask_a
def getRawCandidate(self, center_xyz, width_irc):
center_irc = xyz2irc(center_xyz, self.origin_xyz, self.vxSize_xyz,
self.direction_a)
slice_list = []
for axis, center_val in enumerate(center_irc):
start_ndx = int(round(center_val - width_irc[axis]/2))
end_ndx = int(start_ndx + width_irc[axis])
assert center_val >= 0 and center_val < self.hu_a.shape[axis], repr([self.series_uid, center_xyz, self.origin_xyz, self.vxSize_xyz, center_irc, axis])
if start_ndx < 0:
# log.warning("Crop outside of CT array: {} {}, center:{} shape:{} width:{}".format(
# self.series_uid, center_xyz, center_irc, self.hu_a.shape, width_irc))
start_ndx = 0
end_ndx = int(width_irc[axis])
if end_ndx > self.hu_a.shape[axis]:
# log.warning("Crop outside of CT array: {} {}, center:{} shape:{} width:{}".format(
# self.series_uid, center_xyz, center_irc, self.hu_a.shape, width_irc))
end_ndx = self.hu_a.shape[axis]
start_ndx = int(self.hu_a.shape[axis] - width_irc[axis])
slice_list.append(slice(start_ndx, end_ndx))
ct_chunk = self.hu_a[tuple(slice_list)]
pos_chunk = self.positive_mask[tuple(slice_list)]
return ct_chunk, pos_chunk, center_irc
@functools.lru_cache(1, typed=True)
def getCt(series_uid):
return Ct(series_uid)
@raw_cache.memoize(typed=True)
def getCtRawCandidate(series_uid, center_xyz, width_irc):
ct = getCt(series_uid)
ct_chunk, pos_chunk, center_irc = ct.getRawCandidate(center_xyz,
width_irc)
ct_chunk.clip(-1000, 1000, ct_chunk)
return ct_chunk, pos_chunk, center_irc
@raw_cache.memoize(typed=True)
def getCtSampleSize(series_uid):
ct = Ct(series_uid)
return int(ct.hu_a.shape[0]), ct.positive_indexes
class Luna2dSegmentationDataset(Dataset):
def __init__(self,
val_stride=0,
isValSet_bool=None,
series_uid=None,
contextSlices_count=3,
fullCt_bool=False,
):
self.contextSlices_count = contextSlices_count
self.fullCt_bool = fullCt_bool
if series_uid:
self.series_list = [series_uid]
else:
self.series_list = sorted(getCandidateInfoDict().keys())
if isValSet_bool:
assert val_stride > 0, val_stride
self.series_list = self.series_list[::val_stride]
assert self.series_list
elif val_stride > 0:
del self.series_list[::val_stride]
assert self.series_list
self.sample_list = []
for series_uid in self.series_list:
index_count, positive_indexes = getCtSampleSize(series_uid)
if self.fullCt_bool:
self.sample_list += [(series_uid, slice_ndx)
for slice_ndx in range(index_count)]
else:
self.sample_list += [(series_uid, slice_ndx)
for slice_ndx in positive_indexes]
self.candidateInfo_list = getCandidateInfoList()
series_set = set(self.series_list)
self.candidateInfo_list = [cit for cit in self.candidateInfo_list
if cit.series_uid in series_set]
self.pos_list = [nt for nt in self.candidateInfo_list
if nt.isNodule_bool]
log.info("{!r}: {} {} series, {} slices, {} nodules".format(
self,
len(self.series_list),
{None: 'general', True: 'validation', False: 'training'}[isValSet_bool],
len(self.sample_list),
len(self.pos_list),
))
def __len__(self):
return len(self.sample_list)
def __getitem__(self, ndx):
series_uid, slice_ndx = self.sample_list[ndx % len(self.sample_list)]
return self.getitem_fullSlice(series_uid, slice_ndx)
def getitem_fullSlice(self, series_uid, slice_ndx):
ct = getCt(series_uid)
ct_t = torch.zeros((self.contextSlices_count * 2 + 1, 512, 512))
start_ndx = slice_ndx - self.contextSlices_count
end_ndx = slice_ndx + self.contextSlices_count + 1
for i, context_ndx in enumerate(range(start_ndx, end_ndx)):
context_ndx = max(context_ndx, 0)
context_ndx = min(context_ndx, ct.hu_a.shape[0] - 1)
ct_t[i] = torch.from_numpy(ct.hu_a[context_ndx].astype(np.float32))
# CTs are natively expressed in https://en.wikipedia.org/wiki/Hounsfield_scale
# HU are scaled oddly, with 0 g/cc (air, approximately) being -1000 and 1 g/cc (water) being 0.
# The lower bound gets rid of negative density stuff used to indicate out-of-FOV
# The upper bound nukes any weird hotspots and clamps bone down
ct_t.clamp_(-1000, 1000)
pos_t = torch.from_numpy(ct.positive_mask[slice_ndx]).unsqueeze(0)
return ct_t, pos_t, ct.series_uid, slice_ndx
class TrainingLuna2dSegmentationDataset(Luna2dSegmentationDataset):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.ratio_int = 2
def __len__(self):
return 300000
def shuffleSamples(self):
random.shuffle(self.candidateInfo_list)
random.shuffle(self.pos_list)
def __getitem__(self, ndx):
candidateInfo_tup = self.pos_list[ndx % len(self.pos_list)]
return self.getitem_trainingCrop(candidateInfo_tup)
def getitem_trainingCrop(self, candidateInfo_tup):
ct_a, pos_a, center_irc = getCtRawCandidate(
candidateInfo_tup.series_uid,
candidateInfo_tup.center_xyz,
(7, 96, 96),
)
pos_a = pos_a[3:4]
row_offset = random.randrange(0,32)
col_offset = random.randrange(0,32)
ct_t = torch.from_numpy(ct_a[:, row_offset:row_offset+64,
col_offset:col_offset+64]).to(torch.float32)
pos_t = torch.from_numpy(pos_a[:, row_offset:row_offset+64,
col_offset:col_offset+64]).to(torch.long)
slice_ndx = center_irc.index
return ct_t, pos_t, candidateInfo_tup.series_uid, slice_ndx
class PrepcacheLunaDataset(Dataset):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.candidateInfo_list = getCandidateInfoList()
self.pos_list = [nt for nt in self.candidateInfo_list if nt.isNodule_bool]
self.seen_set = set()
self.candidateInfo_list.sort(key=lambda x: x.series_uid)
def __len__(self):
return len(self.candidateInfo_list)
def __getitem__(self, ndx):
# candidate_t, pos_t, series_uid, center_t = super().__getitem__(ndx)
candidateInfo_tup = self.candidateInfo_list[ndx]
getCtRawCandidate(candidateInfo_tup.series_uid, candidateInfo_tup.center_xyz, (7, 96, 96))
series_uid = candidateInfo_tup.series_uid
if series_uid not in self.seen_set:
self.seen_set.add(series_uid)
getCtSampleSize(series_uid)
# ct = getCt(series_uid)
# for mask_ndx in ct.positive_indexes:
# build2dLungMask(series_uid, mask_ndx)
return 0, 1 #candidate_t, pos_t, series_uid, center_t
class TvTrainingLuna2dSegmentationDataset(torch.utils.data.Dataset):
def __init__(self, isValSet_bool=False, val_stride=10, contextSlices_count=3):
assert contextSlices_count == 3
data = torch.load('./imgs_and_masks.pt')
suids = list(set(data['suids']))
trn_mask_suids = torch.arange(len(suids)) % val_stride < (val_stride - 1)
trn_suids = {s for i, s in zip(trn_mask_suids, suids) if i}
trn_mask = torch.tensor([(s in trn_suids) for s in data["suids"]])
if not isValSet_bool:
self.imgs = data["imgs"][trn_mask]
self.masks = data["masks"][trn_mask]
self.suids = [s for s, i in zip(data["suids"], trn_mask) if i]
else:
self.imgs = data["imgs"][~trn_mask]
self.masks = data["masks"][~trn_mask]
self.suids = [s for s, i in zip(data["suids"], trn_mask) if not i]
# discard spurious hotspots and clamp bone
self.imgs.clamp_(-1000, 1000)
self.imgs /= 1000
def __len__(self):
return len(self.imgs)
def __getitem__(self, i):
oh, ow = torch.randint(0, 32, (2,))
sl = self.masks.size(1)//2
return self.imgs[i, :, oh: oh + 64, ow: ow + 64], 1, self.masks[i, sl: sl+1, oh: oh + 64, ow: ow + 64].to(torch.float32), self.suids[i], 9999

View File

@ -0,0 +1,224 @@
import math
import random
from collections import namedtuple
import torch
from torch import nn as nn
import torch.nn.functional as F
from util.logconf import logging
from util.unet import UNet
log = logging.getLogger(__name__)
# log.setLevel(logging.WARN)
# log.setLevel(logging.INFO)
log.setLevel(logging.DEBUG)
class UNetWrapper(nn.Module):
def __init__(self, **kwargs):
super().__init__()
self.input_batchnorm = nn.BatchNorm2d(kwargs['in_channels'])
self.unet = UNet(**kwargs)
self.final = nn.Sigmoid()
self._init_weights()
def _init_weights(self):
init_set = {
nn.Conv2d,
nn.Conv3d,
nn.ConvTranspose2d,
nn.ConvTranspose3d,
nn.Linear,
}
for m in self.modules():
if type(m) in init_set:
nn.init.kaiming_normal_(
m.weight.data, mode='fan_out', nonlinearity='relu', a=0
)
if m.bias is not None:
fan_in, fan_out = \
nn.init._calculate_fan_in_and_fan_out(m.weight.data)
bound = 1 / math.sqrt(fan_out)
nn.init.normal_(m.bias, -bound, bound)
# nn.init.constant_(self.unet.last.bias, -4)
# nn.init.constant_(self.unet.last.bias, 4)
def forward(self, input_batch):
bn_output = self.input_batchnorm(input_batch)
un_output = self.unet(bn_output)
fn_output = self.final(un_output)
return fn_output
class SegmentationAugmentation(nn.Module):
def __init__(
self, flip=None, offset=None, scale=None, rotate=None, noise=None
):
super().__init__()
self.flip = flip
self.offset = offset
self.scale = scale
self.rotate = rotate
self.noise = noise
def forward(self, input_g, label_g):
transform_t = self._build2dTransformMatrix()
transform_t = transform_t.expand(input_g.shape[0], -1, -1)
transform_t = transform_t.to(input_g.device, torch.float32)
affine_t = F.affine_grid(transform_t[:,:2],
input_g.size(), align_corners=False)
augmented_input_g = F.grid_sample(input_g,
affine_t, padding_mode='border',
align_corners=False)
augmented_label_g = F.grid_sample(label_g.to(torch.float32),
affine_t, padding_mode='border',
align_corners=False)
if self.noise:
noise_t = torch.randn_like(augmented_input_g)
noise_t *= self.noise
augmented_input_g += noise_t
return augmented_input_g, augmented_label_g > 0.5
def _build2dTransformMatrix(self):
transform_t = torch.eye(3)
for i in range(2):
if self.flip:
if random.random() > 0.5:
transform_t[i,i] *= -1
if self.offset:
offset_float = self.offset
random_float = (random.random() * 2 - 1)
transform_t[2,i] = offset_float * random_float
if self.scale:
scale_float = self.scale
random_float = (random.random() * 2 - 1)
transform_t[i,i] *= 1.0 + scale_float * random_float
if self.rotate:
angle_rad = random.random() * math.pi * 2
s = math.sin(angle_rad)
c = math.cos(angle_rad)
rotation_t = torch.tensor([
[c, -s, 0],
[s, c, 0],
[0, 0, 1]])
transform_t @= rotation_t
return transform_t
# MaskTuple = namedtuple('MaskTuple', 'raw_dense_mask, dense_mask, body_mask, air_mask, raw_candidate_mask, candidate_mask, lung_mask, neg_mask, pos_mask')
#
# class SegmentationMask(nn.Module):
# def __init__(self):
# super().__init__()
#
# self.conv_list = nn.ModuleList([
# self._make_circle_conv(radius) for radius in range(1, 8)
# ])
#
# def _make_circle_conv(self, radius):
# diameter = 1 + radius * 2
#
# a = torch.linspace(-1, 1, steps=diameter)**2
# b = (a[None] + a[:, None])**0.5
#
# circle_weights = (b <= 1.0).to(torch.float32)
#
# conv = nn.Conv2d(1, 1, kernel_size=diameter, padding=radius, bias=False)
# conv.weight.data.fill_(1)
# conv.weight.data *= circle_weights / circle_weights.sum()
#
# return conv
#
#
# def erode(self, input_mask, radius, threshold=1):
# conv = self.conv_list[radius - 1]
# input_float = input_mask.to(torch.float32)
# result = conv(input_float)
#
# # log.debug(['erode in ', radius, threshold, input_float.min().item(), input_float.mean().item(), input_float.max().item()])
# # log.debug(['erode out', radius, threshold, result.min().item(), result.mean().item(), result.max().item()])
#
# return result >= threshold
#
# def deposit(self, input_mask, radius, threshold=0):
# conv = self.conv_list[radius - 1]
# input_float = input_mask.to(torch.float32)
# result = conv(input_float)
#
# # log.debug(['deposit in ', radius, threshold, input_float.min().item(), input_float.mean().item(), input_float.max().item()])
# # log.debug(['deposit out', radius, threshold, result.min().item(), result.mean().item(), result.max().item()])
#
# return result > threshold
#
# def fill_cavity(self, input_mask):
# cumsum = input_mask.cumsum(-1)
# filled_mask = (cumsum > 0)
# filled_mask &= (cumsum < cumsum[..., -1:])
# cumsum = input_mask.cumsum(-2)
# filled_mask &= (cumsum > 0)
# filled_mask &= (cumsum < cumsum[..., -1:, :])
#
# return filled_mask
#
#
# def forward(self, input_g, raw_pos_g):
# gcc_g = input_g + 1
#
# with torch.no_grad():
# # log.info(['gcc_g', gcc_g.min(), gcc_g.mean(), gcc_g.max()])
#
# raw_dense_mask = gcc_g > 0.7
# dense_mask = self.deposit(raw_dense_mask, 2)
# dense_mask = self.erode(dense_mask, 6)
# dense_mask = self.deposit(dense_mask, 4)
#
# body_mask = self.fill_cavity(dense_mask)
# air_mask = self.deposit(body_mask & ~dense_mask, 5)
# air_mask = self.erode(air_mask, 6)
#
# lung_mask = self.deposit(air_mask, 5)
#
# raw_candidate_mask = gcc_g > 0.4
# raw_candidate_mask &= air_mask
# candidate_mask = self.erode(raw_candidate_mask, 1)
# candidate_mask = self.deposit(candidate_mask, 1)
#
# pos_mask = self.deposit((raw_pos_g > 0.5) & lung_mask, 2)
#
# neg_mask = self.deposit(candidate_mask, 1)
# neg_mask &= ~pos_mask
# neg_mask &= lung_mask
#
# # label_g = (neg_mask | pos_mask).to(torch.float32)
# label_g = (pos_mask).to(torch.float32)
# neg_g = neg_mask.to(torch.float32)
# pos_g = pos_mask.to(torch.float32)
#
# mask_dict = {
# 'raw_dense_mask': raw_dense_mask,
# 'dense_mask': dense_mask,
# 'body_mask': body_mask,
# 'air_mask': air_mask,
# 'raw_candidate_mask': raw_candidate_mask,
# 'candidate_mask': candidate_mask,
# 'lung_mask': lung_mask,
# 'neg_mask': neg_mask,
# 'pos_mask': pos_mask,
# }
#
# return label_g, neg_g, pos_g, lung_mask, mask_dict

View File

@ -0,0 +1,69 @@
import timing
import argparse
import sys
import numpy as np
import torch.nn as nn
from torch.autograd import Variable
from torch.optim import SGD
from torch.utils.data import DataLoader
from util.util import enumerateWithEstimate
from .dsets import PrepcacheLunaDataset, getCtSampleSize
from util.logconf import logging
# from .model import LunaModel
log = logging.getLogger(__name__)
# log.setLevel(logging.WARN)
log.setLevel(logging.INFO)
# log.setLevel(logging.DEBUG)
class LunaPrepCacheApp:
@classmethod
def __init__(self, sys_argv=None):
if sys_argv is None:
sys_argv = sys.argv[1:]
parser = argparse.ArgumentParser()
parser.add_argument('--batch-size',
help='Batch size to use for training',
default=1024,
type=int,
)
parser.add_argument('--num-workers',
help='Number of worker processes for background data loading',
default=8,
type=int,
)
# parser.add_argument('--scaled',
# help="Scale the CT chunks to square voxels.",
# default=False,
# action='store_true',
# )
self.cli_args = parser.parse_args(sys_argv)
def main(self):
log.info("Starting {}, {}".format(type(self).__name__, self.cli_args))
self.prep_dl = DataLoader(
PrepcacheLunaDataset(
# sortby_str='series_uid',
),
batch_size=self.cli_args.batch_size,
num_workers=self.cli_args.num_workers,
)
batch_iter = enumerateWithEstimate(
self.prep_dl,
"Stuffing cache",
start_ndx=self.prep_dl.num_workers,
)
for batch_ndx, batch_tup in batch_iter:
pass
if __name__ == '__main__':
LunaPrepCacheApp().main()

View File

@ -0,0 +1,331 @@
import math
import random
import warnings
import numpy as np
import scipy.ndimage
import torch
from torch.autograd import Function
from torch.autograd.function import once_differentiable
import torch.backends.cudnn as cudnn
from util.logconf import logging
log = logging.getLogger(__name__)
# log.setLevel(logging.WARN)
# log.setLevel(logging.INFO)
log.setLevel(logging.DEBUG)
def cropToShape(image, new_shape, center_list=None, fill=0.0):
# log.debug([image.shape, new_shape, center_list])
# assert len(image.shape) == 3, repr(image.shape)
if center_list is None:
center_list = [int(image.shape[i] / 2) for i in range(3)]
crop_list = []
for i in range(0, 3):
crop_int = center_list[i]
if image.shape[i] > new_shape[i] and crop_int is not None:
# We can't just do crop_int +/- shape/2 since shape might be odd
# and ints round down.
start_int = crop_int - int(new_shape[i]/2)
end_int = start_int + new_shape[i]
crop_list.append(slice(max(0, start_int), end_int))
else:
crop_list.append(slice(0, image.shape[i]))
# log.debug([image.shape, crop_list])
image = image[crop_list]
crop_list = []
for i in range(0, 3):
if image.shape[i] < new_shape[i]:
crop_int = int((new_shape[i] - image.shape[i]) / 2)
crop_list.append(slice(crop_int, crop_int + image.shape[i]))
else:
crop_list.append(slice(0, image.shape[i]))
# log.debug([image.shape, crop_list])
new_image = np.zeros(new_shape, dtype=image.dtype)
new_image[:] = fill
new_image[crop_list] = image
return new_image
def zoomToShape(image, new_shape, square=True):
# assert image.shape[-1] in {1, 3, 4}, repr(image.shape)
if square and image.shape[0] != image.shape[1]:
crop_int = min(image.shape[0], image.shape[1])
new_shape = [crop_int, crop_int, image.shape[2]]
image = cropToShape(image, new_shape)
zoom_shape = [new_shape[i] / image.shape[i] for i in range(3)]
with warnings.catch_warnings():
warnings.simplefilter("ignore")
image = scipy.ndimage.interpolation.zoom(
image, zoom_shape,
output=None, order=0, mode='nearest', cval=0.0, prefilter=True)
return image
def randomOffset(image_list, offset_rows=0.125, offset_cols=0.125):
center_list = [int(image_list[0].shape[i] / 2) for i in range(3)]
center_list[0] += int(offset_rows * (random.random() - 0.5) * 2)
center_list[1] += int(offset_cols * (random.random() - 0.5) * 2)
center_list[2] = None
new_list = []
for image in image_list:
new_image = cropToShape(image, image.shape, center_list)
new_list.append(new_image)
return new_list
def randomZoom(image_list, scale=None, scale_min=0.8, scale_max=1.3):
if scale is None:
scale = scale_min + (scale_max - scale_min) * random.random()
new_list = []
for image in image_list:
# assert image.shape[-1] in {1, 3, 4}, repr(image.shape)
with warnings.catch_warnings():
warnings.simplefilter("ignore")
# log.info([image.shape])
zimage = scipy.ndimage.interpolation.zoom(
image, [scale, scale, 1.0],
output=None, order=0, mode='nearest', cval=0.0, prefilter=True)
image = cropToShape(zimage, image.shape)
new_list.append(image)
return new_list
_randomFlip_transform_list = [
# lambda a: np.rot90(a, axes=(0, 1)),
# lambda a: np.flip(a, 0),
lambda a: np.flip(a, 1),
]
def randomFlip(image_list, transform_bits=None):
if transform_bits is None:
transform_bits = random.randrange(0, 2 ** len(_randomFlip_transform_list))
new_list = []
for image in image_list:
# assert image.shape[-1] in {1, 3, 4}, repr(image.shape)
for n in range(len(_randomFlip_transform_list)):
if transform_bits & 2**n:
# prhist(image, 'before')
image = _randomFlip_transform_list[n](image)
# prhist(image, 'after ')
new_list.append(image)
return new_list
def randomSpin(image_list, angle=None, range_tup=None, axes=(0, 1)):
if range_tup is None:
range_tup = (0, 360)
if angle is None:
angle = range_tup[0] + (range_tup[1] - range_tup[0]) * random.random()
new_list = []
for image in image_list:
# assert image.shape[-1] in {1, 3, 4}, repr(image.shape)
image = scipy.ndimage.interpolation.rotate(
image, angle, axes=axes, reshape=False,
output=None, order=0, mode='nearest', cval=0.0, prefilter=True)
new_list.append(image)
return new_list
def randomNoise(image_list, noise_min=-0.1, noise_max=0.1):
noise = np.zeros_like(image_list[0])
noise += (noise_max - noise_min) * np.random.random_sample(image_list[0].shape) + noise_min
noise *= 5
noise = scipy.ndimage.filters.gaussian_filter(noise, 3)
# noise += (noise_max - noise_min) * np.random.random_sample(image_hsv.shape) + noise_min
new_list = []
for image_hsv in image_list:
image_hsv = image_hsv + noise
new_list.append(image_hsv)
return new_list
def randomHsvShift(image_list, h=None, s=None, v=None,
h_min=-0.1, h_max=0.1,
s_min=0.5, s_max=2.0,
v_min=0.5, v_max=2.0):
if h is None:
h = h_min + (h_max - h_min) * random.random()
if s is None:
s = s_min + (s_max - s_min) * random.random()
if v is None:
v = v_min + (v_max - v_min) * random.random()
new_list = []
for image_hsv in image_list:
# assert image_hsv.shape[-1] == 3, repr(image_hsv.shape)
image_hsv[:,:,0::3] += h
image_hsv[:,:,1::3] = image_hsv[:,:,1::3] ** s
image_hsv[:,:,2::3] = image_hsv[:,:,2::3] ** v
new_list.append(image_hsv)
return clampHsv(new_list)
def clampHsv(image_list):
new_list = []
for image_hsv in image_list:
image_hsv = image_hsv.clone()
# Hue wraps around
image_hsv[:,:,0][image_hsv[:,:,0] > 1] -= 1
image_hsv[:,:,0][image_hsv[:,:,0] < 0] += 1
# Everything else clamps between 0 and 1
image_hsv[image_hsv > 1] = 1
image_hsv[image_hsv < 0] = 0
new_list.append(image_hsv)
return new_list
# def torch_augment(input):
# theta = random.random() * math.pi * 2
# s = math.sin(theta)
# c = math.cos(theta)
# c1 = 1 - c
# axis_vector = torch.rand(3, device='cpu', dtype=torch.float64)
# axis_vector -= 0.5
# axis_vector /= axis_vector.abs().sum()
# l, m, n = axis_vector
#
# matrix = torch.tensor([
# [l*l*c1 + c, m*l*c1 - n*s, n*l*c1 + m*s, 0],
# [l*m*c1 + n*s, m*m*c1 + c, n*m*c1 - l*s, 0],
# [l*n*c1 - m*s, m*n*c1 + l*s, n*n*c1 + c, 0],
# [0, 0, 0, 1],
# ], device=input.device, dtype=torch.float32)
#
# return th_affine3d(input, matrix)
# following from https://github.com/ncullen93/torchsample/blob/master/torchsample/utils.py
# MIT licensed
# def th_affine3d(input, matrix):
# """
# 3D Affine image transform on torch.Tensor
# """
# A = matrix[:3,:3]
# b = matrix[:3,3]
#
# # make a meshgrid of normal coordinates
# coords = th_iterproduct(input.size(-3), input.size(-2), input.size(-1), dtype=torch.float32)
#
# # shift the coordinates so center is the origin
# coords[:,0] = coords[:,0] - (input.size(-3) / 2. - 0.5)
# coords[:,1] = coords[:,1] - (input.size(-2) / 2. - 0.5)
# coords[:,2] = coords[:,2] - (input.size(-1) / 2. - 0.5)
#
# # apply the coordinate transformation
# new_coords = coords.mm(A.t().contiguous()) + b.expand_as(coords)
#
# # shift the coordinates back so origin is origin
# new_coords[:,0] = new_coords[:,0] + (input.size(-3) / 2. - 0.5)
# new_coords[:,1] = new_coords[:,1] + (input.size(-2) / 2. - 0.5)
# new_coords[:,2] = new_coords[:,2] + (input.size(-1) / 2. - 0.5)
#
# # map new coordinates using bilinear interpolation
# input_transformed = th_trilinear_interp3d(input, new_coords)
#
# return input_transformed
#
#
# def th_trilinear_interp3d(input, coords):
# """
# trilinear interpolation of 3D torch.Tensor image
# """
# # take clamp then floor/ceil of x coords
# x = torch.clamp(coords[:,0], 0, input.size(-3)-2)
# x0 = x.floor()
# x1 = x0 + 1
# # take clamp then floor/ceil of y coords
# y = torch.clamp(coords[:,1], 0, input.size(-2)-2)
# y0 = y.floor()
# y1 = y0 + 1
# # take clamp then floor/ceil of z coords
# z = torch.clamp(coords[:,2], 0, input.size(-1)-2)
# z0 = z.floor()
# z1 = z0 + 1
#
# stride = torch.tensor(input.stride()[-3:], dtype=torch.int64, device=input.device)
# x0_ix = x0.mul(stride[0]).long()
# x1_ix = x1.mul(stride[0]).long()
# y0_ix = y0.mul(stride[1]).long()
# y1_ix = y1.mul(stride[1]).long()
# z0_ix = z0.mul(stride[2]).long()
# z1_ix = z1.mul(stride[2]).long()
#
# # input_flat = th_flatten(input)
# input_flat = x.contiguous().view(x[0], x[1], -1)
#
# vals_000 = input_flat[:, :, x0_ix+y0_ix+z0_ix]
# vals_001 = input_flat[:, :, x0_ix+y0_ix+z1_ix]
# vals_010 = input_flat[:, :, x0_ix+y1_ix+z0_ix]
# vals_011 = input_flat[:, :, x0_ix+y1_ix+z1_ix]
# vals_100 = input_flat[:, :, x1_ix+y0_ix+z0_ix]
# vals_101 = input_flat[:, :, x1_ix+y0_ix+z1_ix]
# vals_110 = input_flat[:, :, x1_ix+y1_ix+z0_ix]
# vals_111 = input_flat[:, :, x1_ix+y1_ix+z1_ix]
#
# xd = x - x0
# yd = y - y0
# zd = z - z0
# xm1 = 1 - xd
# ym1 = 1 - yd
# zm1 = 1 - zd
#
# x_mapped = (
# vals_000.mul(xm1).mul(ym1).mul(zm1) +
# vals_001.mul(xm1).mul(ym1).mul(zd) +
# vals_010.mul(xm1).mul(yd).mul(zm1) +
# vals_011.mul(xm1).mul(yd).mul(zd) +
# vals_100.mul(xd).mul(ym1).mul(zm1) +
# vals_101.mul(xd).mul(ym1).mul(zd) +
# vals_110.mul(xd).mul(yd).mul(zm1) +
# vals_111.mul(xd).mul(yd).mul(zd)
# )
#
# return x_mapped.view_as(input)
#
# def th_iterproduct(*args, dtype=None):
# return torch.from_numpy(np.indices(args).reshape((len(args),-1)).T)
#
# def th_flatten(x):
# """Flatten tensor"""
# return x.contiguous().view(x[0], x[1], -1)

View File

@ -0,0 +1,136 @@
import gzip
from diskcache import FanoutCache, Disk
from diskcache.core import BytesType, MODE_BINARY, BytesIO
from util.logconf import logging
log = logging.getLogger(__name__)
# log.setLevel(logging.WARN)
log.setLevel(logging.INFO)
# log.setLevel(logging.DEBUG)
class GzipDisk(Disk):
def store(self, value, read, key=None):
"""
Override from base class diskcache.Disk.
Chunking is due to needing to work on pythons < 2.7.13:
- Issue #27130: In the "zlib" module, fix handling of large buffers
(typically 2 or 4 GiB). Previously, inputs were limited to 2 GiB, and
compression and decompression operations did not properly handle results of
2 or 4 GiB.
:param value: value to convert
:param bool read: True when value is file-like object
:return: (size, mode, filename, value) tuple for Cache table
"""
# pylint: disable=unidiomatic-typecheck
if type(value) is BytesType:
if read:
value = value.read()
read = False
str_io = BytesIO()
gz_file = gzip.GzipFile(mode='wb', compresslevel=1, fileobj=str_io)
for offset in range(0, len(value), 2**30):
gz_file.write(value[offset:offset+2**30])
gz_file.close()
value = str_io.getvalue()
return super(GzipDisk, self).store(value, read)
def fetch(self, mode, filename, value, read):
"""
Override from base class diskcache.Disk.
Chunking is due to needing to work on pythons < 2.7.13:
- Issue #27130: In the "zlib" module, fix handling of large buffers
(typically 2 or 4 GiB). Previously, inputs were limited to 2 GiB, and
compression and decompression operations did not properly handle results of
2 or 4 GiB.
:param int mode: value mode raw, binary, text, or pickle
:param str filename: filename of corresponding value
:param value: database value
:param bool read: when True, return an open file handle
:return: corresponding Python value
"""
value = super(GzipDisk, self).fetch(mode, filename, value, read)
if mode == MODE_BINARY:
str_io = BytesIO(value)
gz_file = gzip.GzipFile(mode='rb', fileobj=str_io)
read_csio = BytesIO()
while True:
uncompressed_data = gz_file.read(2**30)
if uncompressed_data:
read_csio.write(uncompressed_data)
else:
break
value = read_csio.getvalue()
return value
def getCache(scope_str):
return FanoutCache('data-unversioned/cache/' + scope_str,
disk=GzipDisk,
shards=64,
timeout=1,
size_limit=3e11,
# disk_min_file_size=2**20,
)
# def disk_cache(base_path, memsize=2):
# def disk_cache_decorator(f):
# @functools.wraps(f)
# def wrapper(*args, **kwargs):
# args_str = repr(args) + repr(sorted(kwargs.items()))
# file_str = hashlib.md5(args_str.encode('utf8')).hexdigest()
#
# cache_path = os.path.join(base_path, f.__name__, file_str + '.pkl.gz')
#
# if not os.path.exists(os.path.dirname(cache_path)):
# os.makedirs(os.path.dirname(cache_path), exist_ok=True)
#
# if os.path.exists(cache_path):
# return pickle_loadgz(cache_path)
# else:
# ret = f(*args, **kwargs)
# pickle_dumpgz(cache_path, ret)
# return ret
#
# return wrapper
#
# return disk_cache_decorator
#
#
# def pickle_dumpgz(file_path, obj):
# log.debug("Writing {}".format(file_path))
# with open(file_path, 'wb') as file_obj:
# with gzip.GzipFile(mode='wb', compresslevel=1, fileobj=file_obj) as gz_file:
# pickle.dump(obj, gz_file, pickle.HIGHEST_PROTOCOL)
#
#
# def pickle_loadgz(file_path):
# log.debug("Reading {}".format(file_path))
# with open(file_path, 'rb') as file_obj:
# with gzip.GzipFile(mode='rb', fileobj=file_obj) as gz_file:
# return pickle.load(gz_file)
#
#
# def dtpath(dt=None):
# if dt is None:
# dt = datetime.datetime.now()
#
# return str(dt).rsplit('.', 1)[0].replace(' ', '--').replace(':', '.')
#
#
# def safepath(s):
# s = s.replace(' ', '_')
# return re.sub('[^A-Za-z0-9_.-]', '', s)

View File

@ -0,0 +1,19 @@
import logging
import logging.handlers
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
# Some libraries attempt to add their own root logger handlers. This is
# annoying and so we get rid of them.
for handler in list(root_logger.handlers):
root_logger.removeHandler(handler)
logfmt_str = "%(asctime)s %(levelname)-8s pid:%(process)d %(name)s:%(lineno)03d:%(funcName)s %(message)s"
formatter = logging.Formatter(logfmt_str)
streamHandler = logging.StreamHandler()
streamHandler.setFormatter(formatter)
streamHandler.setLevel(logging.DEBUG)
root_logger.addHandler(streamHandler)

View File

@ -0,0 +1,143 @@
# From https://github.com/jvanvugt/pytorch-unet
# https://raw.githubusercontent.com/jvanvugt/pytorch-unet/master/unet.py
# MIT License
#
# Copyright (c) 2018 Joris
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
# Adapted from https://discuss.pytorch.org/t/unet-implementation/426
import torch
from torch import nn
import torch.nn.functional as F
class UNet(nn.Module):
def __init__(self, in_channels=1, n_classes=2, depth=5, wf=6, padding=False,
batch_norm=False, up_mode='upconv'):
"""
Implementation of
U-Net: Convolutional Networks for Biomedical Image Segmentation
(Ronneberger et al., 2015)
https://arxiv.org/abs/1505.04597
Using the default arguments will yield the exact version used
in the original paper
Args:
in_channels (int): number of input channels
n_classes (int): number of output channels
depth (int): depth of the network
wf (int): number of filters in the first layer is 2**wf
padding (bool): if True, apply padding such that the input shape
is the same as the output.
This may introduce artifacts
batch_norm (bool): Use BatchNorm after layers with an
activation function
up_mode (str): one of 'upconv' or 'upsample'.
'upconv' will use transposed convolutions for
learned upsampling.
'upsample' will use bilinear upsampling.
"""
super(UNet, self).__init__()
assert up_mode in ('upconv', 'upsample')
self.padding = padding
self.depth = depth
prev_channels = in_channels
self.down_path = nn.ModuleList()
for i in range(depth):
self.down_path.append(UNetConvBlock(prev_channels, 2**(wf+i),
padding, batch_norm))
prev_channels = 2**(wf+i)
self.up_path = nn.ModuleList()
for i in reversed(range(depth - 1)):
self.up_path.append(UNetUpBlock(prev_channels, 2**(wf+i), up_mode,
padding, batch_norm))
prev_channels = 2**(wf+i)
self.last = nn.Conv2d(prev_channels, n_classes, kernel_size=1)
def forward(self, x):
blocks = []
for i, down in enumerate(self.down_path):
x = down(x)
if i != len(self.down_path)-1:
blocks.append(x)
x = F.avg_pool2d(x, 2)
for i, up in enumerate(self.up_path):
x = up(x, blocks[-i-1])
return self.last(x)
class UNetConvBlock(nn.Module):
def __init__(self, in_size, out_size, padding, batch_norm):
super(UNetConvBlock, self).__init__()
block = []
block.append(nn.Conv2d(in_size, out_size, kernel_size=3,
padding=int(padding)))
block.append(nn.ReLU())
# block.append(nn.LeakyReLU())
if batch_norm:
block.append(nn.BatchNorm2d(out_size))
block.append(nn.Conv2d(out_size, out_size, kernel_size=3,
padding=int(padding)))
block.append(nn.ReLU())
# block.append(nn.LeakyReLU())
if batch_norm:
block.append(nn.BatchNorm2d(out_size))
self.block = nn.Sequential(*block)
def forward(self, x):
out = self.block(x)
return out
class UNetUpBlock(nn.Module):
def __init__(self, in_size, out_size, up_mode, padding, batch_norm):
super(UNetUpBlock, self).__init__()
if up_mode == 'upconv':
self.up = nn.ConvTranspose2d(in_size, out_size, kernel_size=2,
stride=2)
elif up_mode == 'upsample':
self.up = nn.Sequential(nn.Upsample(mode='bilinear', scale_factor=2),
nn.Conv2d(in_size, out_size, kernel_size=1))
self.conv_block = UNetConvBlock(in_size, out_size, padding, batch_norm)
def center_crop(self, layer, target_size):
_, _, layer_height, layer_width = layer.size()
diff_y = (layer_height - target_size[0]) // 2
diff_x = (layer_width - target_size[1]) // 2
return layer[:, :, diff_y:(diff_y + target_size[0]), diff_x:(diff_x + target_size[1])]
def forward(self, x, bridge):
up = self.up(x)
crop1 = self.center_crop(bridge, up.shape[2:])
out = torch.cat([up, crop1], 1)
out = self.conv_block(out)
return out
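A minimal usage sketch for the UNet defined above; the argument values are illustrative assumptions, not taken from this repository:
```
# Sketch: instantiate the UNet defined above and run a dummy forward pass.
# The argument values are illustrative assumptions.
import torch

net = UNet(in_channels=7, n_classes=1, depth=3, wf=4,
           padding=True, batch_norm=True, up_mode='upconv')
x = torch.randn(2, 7, 64, 64)   # batch of 2, 7 input channels (context slices), 64x64 crops
out = net(x)                    # padding=True keeps the spatial size unchanged
print(out.shape)                # torch.Size([2, 1, 64, 64])
```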

View File

@ -0,0 +1,105 @@
import os
from datetime import datetime
import argparse
import torch.multiprocessing as mp
import torchvision
import torchvision.transforms as transforms
import torch
import torch.nn as nn
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP
from apex import amp
def main():
parser = argparse.ArgumentParser()
parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N',
help='number of data loading workers (default: 4)')
parser.add_argument('-g', '--gpus', default=1, type=int,
help='number of gpus per node')
parser.add_argument('-nr', '--nr', default=0, type=int,
help='ranking within the nodes')
parser.add_argument('--epochs', default=2, type=int, metavar='N',
help='number of total epochs to run')
args = parser.parse_args()
args.world_size = args.gpus * args.nodes
os.environ['MASTER_ADDR'] = '147.232.47.114'
os.environ['MASTER_PORT'] = '8888'
mp.spawn(train, nprocs=args.gpus, args=(args,))
class ConvNet(nn.Module):
def __init__(self, num_classes=10):
super(ConvNet, self).__init__()
self.layer1 = nn.Sequential(
nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
nn.BatchNorm2d(16),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2))
self.layer2 = nn.Sequential(
nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2))
self.fc = nn.Linear(7*7*32, num_classes)
def forward(self, x):
out = self.layer1(x)
out = self.layer2(out)
out = out.reshape(out.size(0), -1)
out = self.fc(out)
return out
def train(gpu, args):
rank = args.nr * args.gpus + gpu
dist.init_process_group(backend='nccl', init_method='env://', world_size=args.world_size, rank=rank)
torch.manual_seed(0)
model = ConvNet()
torch.cuda.set_device(gpu)
model.cuda(gpu)
batch_size = 10
# define loss function (criterion) and optimizer
criterion = nn.CrossEntropyLoss().cuda(gpu)
optimizer = torch.optim.SGD(model.parameters(), 1e-4)
# Wrap the model
model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])
# Data loading code
train_dataset = torchvision.datasets.MNIST(root='./data',
train=True,
transform=transforms.ToTensor(),
download=True)
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset,
num_replicas=args.world_size,
rank=rank)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
batch_size=batch_size,
shuffle=False,
num_workers=0,
pin_memory=True,
sampler=train_sampler)
start = datetime.now()
total_step = len(train_loader)
for epoch in range(args.epochs):
for i, (images, labels) in enumerate(train_loader):
images = images.cuda(non_blocking=True)
labels = labels.cuda(non_blocking=True)
# Forward pass
outputs = model(images)
loss = criterion(outputs, labels)
# Backward and optimize
optimizer.zero_grad()
loss.backward()
optimizer.step()
if (i + 1) % 100 == 0 and gpu == 0:
print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,
loss.item()))
if gpu == 0:
print("Training complete in: " + str(datetime.now() - start))
if __name__ == '__main__':
torch.multiprocessing.set_start_method('spawn')
main()

View File

@ -0,0 +1,92 @@
import os
from datetime import datetime
import argparse
import torch.multiprocessing as mp
import torchvision
import torchvision.transforms as transforms
import torch
import torch.nn as nn
import torch.distributed as dist
from apex.parallel import DistributedDataParallel as DDP
from apex import amp
def main():
parser = argparse.ArgumentParser()
parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N',
help='number of nodes (default: 1)')
parser.add_argument('-g', '--gpus', default=1, type=int,
help='number of gpus per node')
parser.add_argument('-nr', '--nr', default=0, type=int,
help='ranking within the nodes')
parser.add_argument('--epochs', default=2, type=int, metavar='N',
help='number of total epochs to run')
args = parser.parse_args()
train(0, args)
class ConvNet(nn.Module):
def __init__(self, num_classes=10):
super(ConvNet, self).__init__()
self.layer1 = nn.Sequential(
nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
nn.BatchNorm2d(16),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2))
self.layer2 = nn.Sequential(
nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
nn.BatchNorm2d(32),
nn.ReLU(),
nn.MaxPool2d(kernel_size=2, stride=2))
self.fc = nn.Linear(7*7*32, num_classes)
def forward(self, x):
out = self.layer1(x)
out = self.layer2(out)
out = out.reshape(out.size(0), -1)
out = self.fc(out)
return out
def train(gpu, args):
model = ConvNet()
torch.cuda.set_device(gpu)
model.cuda(gpu)
batch_size = 50
# define loss function (criterion) and optimizer
criterion = nn.CrossEntropyLoss().cuda(gpu)
optimizer = torch.optim.SGD(model.parameters(), 1e-4)
# Data loading code
train_dataset = torchvision.datasets.MNIST(root='./data',
train=True,
transform=transforms.ToTensor(),
download=True)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
batch_size=batch_size,
shuffle=True,
num_workers=0,
pin_memory=True)
start = datetime.now()
total_step = len(train_loader)
for epoch in range(args.epochs):
for i, (images, labels) in enumerate(train_loader):
images = images.cuda(non_blocking=True)
labels = labels.cuda(non_blocking=True)
# Forward pass
outputs = model(images)
loss = criterion(outputs, labels)
# Backward and optimize
optimizer.zero_grad()
loss.backward()
optimizer.step()
if (i + 1) % 100 == 0 and gpu == 0:
print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,
loss.item()))
if gpu == 0:
print("Training complete in: " + str(datetime.now() - start))
if __name__ == '__main__':
main()

View File

@ -0,0 +1,748 @@
from argparse import Namespace
from collections import Counter
import json
import os
import re
import string
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tqdm.notebook import tqdm
class Vocabulary(object):
"""Class to process text and extract vocabulary for mapping"""
def __init__(self, token_to_idx=None, add_unk=True, unk_token="<UNK>"):
"""
Args:
token_to_idx (dict): a pre-existing map of tokens to indices
add_unk (bool): a flag that indicates whether to add the UNK token
unk_token (str): the UNK token to add into the Vocabulary
"""
if token_to_idx is None:
token_to_idx = {}
self._token_to_idx = token_to_idx
self._idx_to_token = {idx: token
for token, idx in self._token_to_idx.items()}
self._add_unk = add_unk
self._unk_token = unk_token
self.unk_index = -1
if add_unk:
self.unk_index = self.add_token(unk_token)
def to_serializable(self):
""" returns a dictionary that can be serialized """
return {'token_to_idx': self._token_to_idx,
'add_unk': self._add_unk,
'unk_token': self._unk_token}
@classmethod
def from_serializable(cls, contents):
""" instantiates the Vocabulary from a serialized dictionary """
return cls(**contents)
def add_token(self, token):
"""Update mapping dicts based on the token.
Args:
token (str): the item to add into the Vocabulary
Returns:
index (int): the integer corresponding to the token
"""
if token in self._token_to_idx:
index = self._token_to_idx[token]
else:
index = len(self._token_to_idx)
self._token_to_idx[token] = index
self._idx_to_token[index] = token
return index
def add_many(self, tokens):
"""Add a list of tokens into the Vocabulary
Args:
tokens (list): a list of string tokens
Returns:
indices (list): a list of indices corresponding to the tokens
"""
return [self.add_token(token) for token in tokens]
def lookup_token(self, token):
"""Retrieve the index associated with the token
or the UNK index if token isn't present.
Args:
token (str): the token to look up
Returns:
index (int): the index corresponding to the token
Notes:
`unk_index` needs to be >=0 (having been added into the Vocabulary)
for the UNK functionality
"""
if self.unk_index >= 0:
return self._token_to_idx.get(token, self.unk_index)
else:
return self._token_to_idx[token]
def lookup_index(self, index):
"""Return the token associated with the index
Args:
index (int): the index to look up
Returns:
token (str): the token corresponding to the index
Raises:
KeyError: if the index is not in the Vocabulary
"""
if index not in self._idx_to_token:
raise KeyError("the index (%d) is not in the Vocabulary" % index)
return self._idx_to_token[index]
def __str__(self):
return "<Vocabulary(size=%d)>" % len(self)
def __len__(self):
return len(self._token_to_idx)
class ReviewVectorizer(object):
""" The Vectorizer which coordinates the Vocabularies and puts them to use"""
def __init__(self, review_vocab, rating_vocab):
"""
Args:
review_vocab (Vocabulary): maps words to integers
rating_vocab (Vocabulary): maps class labels to integers
"""
self.review_vocab = review_vocab
self.rating_vocab = rating_vocab
def vectorize(self, review):
"""Create a collapsed one-hit vector for the review
Args:
review (str): the review
Returns:
one_hot (np.ndarray): the collapsed one-hot encoding
"""
one_hot = np.zeros(len(self.review_vocab), dtype=np.float32)
for token in review.split(" "):
if token not in string.punctuation:
one_hot[self.review_vocab.lookup_token(token)] = 1
return one_hot
@classmethod
def from_dataframe(cls, review_df, cutoff=25):
"""Instantiate the vectorizer from the dataset dataframe
Args:
review_df (pandas.DataFrame): the review dataset
cutoff (int): the parameter for frequency-based filtering
Returns:
an instance of the ReviewVectorizer
"""
review_vocab = Vocabulary(add_unk=True)
rating_vocab = Vocabulary(add_unk=False)
# Add ratings
for rating in sorted(set(review_df.rating)):
rating_vocab.add_token(rating)
# Add top words if count > provided count
word_counts = Counter()
for review in review_df.review:
for word in review.split(" "):
if word not in string.punctuation:
word_counts[word] += 1
for word, count in word_counts.items():
if count > cutoff:
review_vocab.add_token(word)
return cls(review_vocab, rating_vocab)
@classmethod
def from_serializable(cls, contents):
"""Instantiate a ReviewVectorizer from a serializable dictionary
Args:
contents (dict): the serializable dictionary
Returns:
an instance of the ReviewVectorizer class
"""
review_vocab = Vocabulary.from_serializable(contents['review_vocab'])
rating_vocab = Vocabulary.from_serializable(contents['rating_vocab'])
return cls(review_vocab=review_vocab, rating_vocab=rating_vocab)
def to_serializable(self):
"""Create the serializable dictionary for caching
Returns:
contents (dict): the serializable dictionary
"""
return {'review_vocab': self.review_vocab.to_serializable(),
'rating_vocab': self.rating_vocab.to_serializable()}
class ReviewDataset(Dataset):
def __init__(self, review_df, vectorizer):
"""
Args:
review_df (pandas.DataFrame): the dataset
vectorizer (ReviewVectorizer): vectorizer instantiated from dataset
"""
self.review_df = review_df
self._vectorizer = vectorizer
self.train_df = self.review_df[self.review_df.split=='train']
self.train_size = len(self.train_df)
self.val_df = self.review_df[self.review_df.split=='val']
self.validation_size = len(self.val_df)
self.test_df = self.review_df[self.review_df.split=='test']
self.test_size = len(self.test_df)
self._lookup_dict = {'train': (self.train_df, self.train_size),
'val': (self.val_df, self.validation_size),
'test': (self.test_df, self.test_size)}
self.set_split('train')
@classmethod
def load_dataset_and_make_vectorizer(cls, review_csv):
"""Load dataset and make a new vectorizer from scratch
Args:
review_csv (str): location of the dataset
Returns:
an instance of ReviewDataset
"""
review_df = pd.read_csv(review_csv)
train_review_df = review_df[review_df.split=='train']
return cls(review_df, ReviewVectorizer.from_dataframe(train_review_df))
@classmethod
def load_dataset_and_load_vectorizer(cls, review_csv, vectorizer_filepath):
"""Load dataset and the corresponding vectorizer.
Used in the case where the vectorizer has been cached for re-use
Args:
review_csv (str): location of the dataset
vectorizer_filepath (str): location of the saved vectorizer
Returns:
an instance of ReviewDataset
"""
review_df = pd.read_csv(review_csv)
vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
return cls(review_df, vectorizer)
@staticmethod
def load_vectorizer_only(vectorizer_filepath):
"""a static method for loading the vectorizer from file
Args:
vectorizer_filepath (str): the location of the serialized vectorizer
Returns:
an instance of ReviewVectorizer
"""
with open(vectorizer_filepath) as fp:
return ReviewVectorizer.from_serializable(json.load(fp))
def save_vectorizer(self, vectorizer_filepath):
"""saves the vectorizer to disk using json
Args:
vectorizer_filepath (str): the location to save the vectorizer
"""
with open(vectorizer_filepath, "w") as fp:
json.dump(self._vectorizer.to_serializable(), fp)
def get_vectorizer(self):
""" returns the vectorizer """
return self._vectorizer
def set_split(self, split="train"):
""" selects the splits in the dataset using a column in the dataframe
Args:
split (str): one of "train", "val", or "test"
"""
self._target_split = split
self._target_df, self._target_size = self._lookup_dict[split]
def __len__(self):
return self._target_size
def __getitem__(self, index):
"""the primary entry point method for PyTorch datasets
Args:
index (int): the index to the data point
Returns:
a dictionary holding the data point's features (x_data) and label (y_target)
"""
row = self._target_df.iloc[index]
review_vector = \
self._vectorizer.vectorize(row.review)
rating_index = \
self._vectorizer.rating_vocab.lookup_token(row.rating)
return {'x_data': review_vector,
'y_target': rating_index}
def get_num_batches(self, batch_size):
"""Given a batch size, return the number of batches in the dataset
Args:
batch_size (int)
Returns:
number of batches in the dataset
"""
return len(self) // batch_size
def generate_batches(dataset, batch_size, shuffle=True,
drop_last=True, device="cpu"):
"""
A generator function which wraps the PyTorch DataLoader. It will
ensure each tensor is on the right device location.
"""
dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
shuffle=shuffle, drop_last=drop_last)
for data_dict in dataloader:
out_data_dict = {}
for name, tensor in data_dict.items():
out_data_dict[name] = data_dict[name].to(device)
yield out_data_dict
class ReviewClassifier(nn.Module):
""" a simple perceptron based classifier """
def __init__(self, num_features):
"""
Args:
num_features (int): the size of the input feature vector
"""
super(ReviewClassifier, self).__init__()
self.fc1 = nn.Linear(in_features=num_features,
out_features=1)
def forward(self, x_in, apply_sigmoid=False):
"""The forward pass of the classifier
Args:
x_in (torch.Tensor): an input data tensor.
x_in.shape should be (batch, num_features)
apply_sigmoid (bool): a flag for the sigmoid activation
should be false if used with the Cross Entropy losses
Returns:
the resulting tensor. tensor.shape should be (batch,)
"""
y_out = self.fc1(x_in).squeeze()
if apply_sigmoid:
y_out = torch.sigmoid(y_out)
return y_out
def make_train_state(args):
return {'stop_early': False,
'early_stopping_step': 0,
'early_stopping_best_val': 1e8,
'learning_rate': args.learning_rate,
'epoch_index': 0,
'train_loss': [],
'train_acc': [],
'val_loss': [],
'val_acc': [],
'test_loss': -1,
'test_acc': -1,
'model_filename': args.model_state_file}
def update_train_state(args, model, train_state):
"""Handle the training state updates.
Components:
- Early Stopping: Prevent overfitting.
- Model Checkpoint: Model is saved if the model is better
:param args: main arguments
:param model: model to train
:param train_state: a dictionary representing the training state values
:returns:
a new train_state
"""
# Save one model at least
if train_state['epoch_index'] == 0:
torch.save(model.state_dict(), train_state['model_filename'])
train_state['stop_early'] = False
# Save model if performance improved
elif train_state['epoch_index'] >= 1:
loss_tm1, loss_t = train_state['val_loss'][-2:]
# If loss worsened
if loss_t >= train_state['early_stopping_best_val']:
# Update step
train_state['early_stopping_step'] += 1
# Loss decreased
else:
# Save the best model
if loss_t < train_state['early_stopping_best_val']:
torch.save(model.state_dict(), train_state['model_filename'])
# Reset early stopping step
train_state['early_stopping_step'] = 0
# Stop early ?
train_state['stop_early'] = \
train_state['early_stopping_step'] >= args.early_stopping_criteria
return train_state
def compute_accuracy(y_pred, y_target):
y_target = y_target.cpu()
y_pred_indices = (torch.sigmoid(y_pred)>0.5).cpu().long()#.max(dim=1)[1]
n_correct = torch.eq(y_pred_indices, y_target).sum().item()
return n_correct / len(y_pred_indices) * 100
def set_seed_everywhere(seed, cuda):
np.random.seed(seed)
torch.manual_seed(seed)
if cuda:
torch.cuda.manual_seed_all(seed)
def handle_dirs(dirpath):
if not os.path.exists(dirpath):
os.makedirs(dirpath)
args = Namespace(
# Data and Path information
frequency_cutoff=25,
model_state_file='model.pth',
review_csv='data/yelp/reviews_with_splits_lite.csv',
# review_csv='data/yelp/reviews_with_splits_full.csv',
save_dir='model_storage/ch3/yelp/',
vectorizer_file='vectorizer.json',
# No Model hyper parameters
# Training hyper parameters
batch_size=128,
early_stopping_criteria=5,
learning_rate=0.001,
num_epochs=100,
seed=1337,
# Runtime options
catch_keyboard_interrupt=True,
cuda=True,
expand_filepaths_to_save_dir=True,
reload_from_files=False,
)
if args.expand_filepaths_to_save_dir:
args.vectorizer_file = os.path.join(args.save_dir,
args.vectorizer_file)
args.model_state_file = os.path.join(args.save_dir,
args.model_state_file)
print("Expanded filepaths: ")
print("\t{}".format(args.vectorizer_file))
print("\t{}".format(args.model_state_file))
# Check CUDA
if not torch.cuda.is_available():
args.cuda = False
if torch.cuda.device_count() > 1:
print("Pouzivam", torch.cuda.device_count(), "graficke karty!")
args.device = torch.device("cuda" if args.cuda else "cpu")
# Set seed for reproducibility
set_seed_everywhere(args.seed, args.cuda)
# handle dirs
handle_dirs(args.save_dir)
if args.reload_from_files:
# training from a checkpoint
print("Loading dataset and vectorizer")
dataset = ReviewDataset.load_dataset_and_load_vectorizer(args.review_csv,
args.vectorizer_file)
else:
print("Loading dataset and creating vectorizer")
# create dataset and vectorizer
dataset = ReviewDataset.load_dataset_and_make_vectorizer(args.review_csv)
dataset.save_vectorizer(args.vectorizer_file)
vectorizer = dataset.get_vectorizer()
classifier = ReviewClassifier(num_features=len(vectorizer.review_vocab))
classifier = nn.DataParallel(classifier)
classifier = classifier.to(args.device)
loss_func = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer,
mode='min', factor=0.5,
patience=1)
train_state = make_train_state(args)
epoch_bar = tqdm(desc='training routine',
total=args.num_epochs,
position=0)
dataset.set_split('train')
train_bar = tqdm(desc='split=train',
total=dataset.get_num_batches(args.batch_size),
position=1,
leave=True)
dataset.set_split('val')
val_bar = tqdm(desc='split=val',
total=dataset.get_num_batches(args.batch_size),
position=1,
leave=True)
try:
for epoch_index in range(args.num_epochs):
train_state['epoch_index'] = epoch_index
# Iterate over training dataset
# setup: batch generator, set loss and acc to 0, set train mode on
dataset.set_split('train')
batch_generator = generate_batches(dataset,
batch_size=args.batch_size,
device=args.device)
running_loss = 0.0
running_acc = 0.0
classifier.train()
for batch_index, batch_dict in enumerate(batch_generator):
# the training routine is these 5 steps:
# --------------------------------------
# step 1. zero the gradients
optimizer.zero_grad()
# step 2. compute the output
y_pred = classifier(x_in=batch_dict['x_data'].float())
# step 3. compute the loss
loss = loss_func(y_pred, batch_dict['y_target'].float())
loss_t = loss.item()
running_loss += (loss_t - running_loss) / (batch_index + 1)
# step 4. use loss to produce gradients
loss.backward()
# step 5. use optimizer to take gradient step
optimizer.step()
# -----------------------------------------
# compute the accuracy
acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
running_acc += (acc_t - running_acc) / (batch_index + 1)
# update bar
train_bar.set_postfix(loss=running_loss,
acc=running_acc,
epoch=epoch_index)
train_bar.update()
train_state['train_loss'].append(running_loss)
train_state['train_acc'].append(running_acc)
# Iterate over val dataset
# setup: batch generator, set loss and acc to 0; set eval mode on
dataset.set_split('val')
batch_generator = generate_batches(dataset,
batch_size=args.batch_size,
device=args.device)
running_loss = 0.
running_acc = 0.
classifier.eval()
for batch_index, batch_dict in enumerate(batch_generator):
# compute the output
y_pred = classifier(x_in=batch_dict['x_data'].float())
# step 3. compute the loss
loss = loss_func(y_pred, batch_dict['y_target'].float())
loss_t = loss.item()
running_loss += (loss_t - running_loss) / (batch_index + 1)
# compute the accuracy
acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
running_acc += (acc_t - running_acc) / (batch_index + 1)
val_bar.set_postfix(loss=running_loss,
acc=running_acc,
epoch=epoch_index)
val_bar.update()
train_state['val_loss'].append(running_loss)
train_state['val_acc'].append(running_acc)
train_state = update_train_state(args=args, model=classifier,
train_state=train_state)
scheduler.step(train_state['val_loss'][-1])
train_bar.n = 0
val_bar.n = 0
epoch_bar.update()
if train_state['stop_early']:
break
train_bar.n = 0
val_bar.n = 0
epoch_bar.update()
except KeyboardInterrupt:
print("Exiting loop")
classifier.load_state_dict(torch.load(train_state['model_filename']))
classifier = classifier.to(args.device)
dataset.set_split('test')
batch_generator = generate_batches(dataset,
batch_size=args.batch_size,
device=args.device)
running_loss = 0.
running_acc = 0.
classifier.eval()
for batch_index, batch_dict in enumerate(batch_generator):
# compute the output
y_pred = classifier(x_in=batch_dict['x_data'].float())
# compute the loss
loss = loss_func(y_pred, batch_dict['y_target'].float())
loss_t = loss.item()
running_loss += (loss_t - running_loss) / (batch_index + 1)
# compute the accuracy
acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
running_acc += (acc_t - running_acc) / (batch_index + 1)
train_state['test_loss'] = running_loss
train_state['test_acc'] = running_acc
print("Test loss: {:.3f}".format(train_state['test_loss']))
print("Test Accuracy: {:.2f}".format(train_state['test_acc']))
def preprocess_text(text):
text = text.lower()
text = re.sub(r"([.,!?])", r" \1 ", text)
text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
return text
def predict_rating(review, classifier, vectorizer, decision_threshold=0.5):
"""Predict the rating of a review
Args:
review (str): the text of the review
classifier (ReviewClassifier): the trained model
vectorizer (ReviewVectorizer): the corresponding vectorizer
decision_threshold (float): The numerical boundary which separates the rating classes
"""
review = preprocess_text(review)
vectorized_review = torch.tensor(vectorizer.vectorize(review))
result = classifier(vectorized_review.view(1, -1))
probability_value = torch.sigmoid(result).item()
index = 1
if probability_value < decision_threshold:
index = 0
return vectorizer.rating_vocab.lookup_index(index)
test_review = "this is a pretty awesome book"
classifier = classifier.cpu()
prediction = predict_rating(test_review, classifier, vectorizer, decision_threshold=0.5)
print("{} -> {}".format(test_review, prediction))
# Sort weights
fc1_weights = classifier.fc1.weight.detach()[0]
_, indices = torch.sort(fc1_weights, dim=0, descending=True)
indices = indices.numpy().tolist()
# Top 20 words
print("Influential words in Positive Reviews:")
print("--------------------------------------")
for i in range(20):
print(vectorizer.review_vocab.lookup_index(indices[i]))
print("====\n\n\n")
# Top 20 negative words
print("Influential words in Negative Reviews:")
print("--------------------------------------")
indices.reverse()
for i in range(20):
print(vectorizer.review_vocab.lookup_index(indices[i]))

View File

@ -12,16 +12,42 @@ taxonomy:
Task backlog:
- Try to present at a local conference (Data, Znalosti and WIKT) or in the faculty proceedings (a short version of the thesis).
- Use the Multext East corpus in training. Create a mapping from Multext tags to SNK tags.
Virtual meeting 6.11.2020
Status:
- Read 2 articles in detail and made notes. The notes are on Git.
- Completed further experiments.
Tasks for the next meeting:
- Continue with the open tasks.
Virtual meeting 30.10.2020
Status:
- The files are on Git.
- Experiments carried out; the results are summarized in a table.
- A guide on how to run the experiments.
- Technical problems resolved. A Conda environment is available.
Tasks for the next meeting:
- Study the literature on "pretrain" and "word embedding":
- [Healthcare NER Models Using Language Model Pretraining](http://ceur-ws.org/Vol-2551/paper-04.pdf)
- [Design and implementation of an open source Greek POS Tagger and Entity Recognizer using spaCy](https://ieeexplore.ieee.org/abstract/document/8909591)
- https://arxiv.org/abs/1909.00505
- https://arxiv.org/abs/1607.04606
- LSTM, recurrent neural networks
- Make notes from several articles; record the source and what you learned.
- Run several pretraining experiments with different models and different sizes of adaptation data, and compile a table.
- Describe pretraining and summarize its effect on training in a short article of about 10 pages.
- Try to present at a local conference (Data, Znalosti and WIKT) or in the faculty proceedings (a short version of the thesis).
- Use the Multext East corpus in training. Create a mapping from Multext tags to SNK tags.
Virtual meeting 8.10.2020

View File

@ -21,6 +21,46 @@ Cieľom práce je príprava nástrojov a budovanie tzv. "Question Answering data
## Diploma Project 2
Task backlog:
- Is it possible to find out how much time an annotator spent creating a question? If this can be determined from the DB schema, it would be good to display it in the web application.
Virtual meeting 27.10.2020
Status:
- The web application was finished according to the instructions from the last meeting; the code is on Git.
Tasks for the next meeting:
- Build a configuration system - load the configuration from a file (python-configuration?). It should be possible to change the name of the configuration file through an environment variable (getenv); see the sketch after this list.
- Add authentication for annotators when displaying results, so that an annotator sees only their own results. Is it necessary? For now, do it only via e-mail.
- Add a password to the web application.
- Add display of good and bad annotations for each annotator.
- Study the scientific literature on "Crowdsourcing language resources". Select several publications (Scholar, Scopus), write down the bibliographic reference and what you learned from the publications about building language resources. What other corpora were created with this method?
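A minimal sketch of how the configuration loading could work, using only the standard library and a JSON file (the python-configuration package mentioned above would be used similarly); the variable name CONFIG_FILE and the key database_url are illustrative assumptions, not the project's actual names:
```python
import json
import os

def load_config():
    # The configuration file name can be overridden through an environment
    # variable, e.g. CONFIG_FILE=/etc/annotator/prod.json (name is an assumption).
    config_path = os.getenv("CONFIG_FILE", "config.json")
    with open(config_path) as f:
        return json.load(f)

config = load_config()
print(config.get("database_url"))  # hypothetical key, for illustration only
```
Reading the file name through os.getenv keeps the deployment configurable without code changes.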
Virtual meeting 20.10.2020
Status:
- Improved the data preparation script, with a slight change of the interface (duplicate work due to a gap in communication).
Tasks for the next meeting:
- Finish the web application for reporting the amount of annotated data.
- Fix the bugs related to the new annotation schema.
- Display the amount of annotated data.
- Display the amount of valid annotated data.
- Display the amount of validated data.
- Questions must not repeat within a single paragraph. Every question must have an answer. Every question must be longer than 10 characters or longer than 2 words. The answer must contain at least one word. The question must contain Slovak words. (A possible implementation of these checks is sketched after this list.)
- Push the results to the project repository, directory database_app, as soon as possible.
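A minimal sketch of the validation rules from the list above, assuming plain strings on input; the function name and the rough Slovak-word heuristic are only assumptions, not the project's actual code:
```python
import re

def is_valid_annotation(question, answer, previous_questions):
    """Check one question-answer pair against the rules listed above."""
    q, a = question.strip(), answer.strip()
    # Questions must not repeat within the same paragraph.
    if q.lower() in (p.strip().lower() for p in previous_questions):
        return False
    # Every question must have an answer with at least one word.
    if len(a.split()) < 1:
        return False
    # The question must be longer than 10 characters or longer than 2 words.
    if len(q) <= 10 and len(q.split()) <= 2:
        return False
    # Rough stand-in for "must contain Slovak words": require at least one
    # alphabetic character; a real check would use a Slovak wordlist or a
    # language detector.
    if not re.search(r"[A-Za-zÁ-ž]", q):
        return False
    return True
```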
Meeting 25.9.2020
Done:

View File

@ -6,10 +6,8 @@ taxonomy:
tag: [demo,nlp]
author: Daniel Hladek
---
# Martin Jancura
*Year of starting studies*: 2017
## Bachelor Project 2020
@ -31,9 +29,36 @@ Možné backendy:
Task backlog:
- Prepare the backend.
- Prepare the frontend in JavaScript - in progress.
- Store human-made translations in the database.
Virtual meeting 6.11.2020:
Status:
Work on the written part.
Tasks for the next meeting:
- Find a library that allows using a custom translation model. Try installing OpenNMT.
- Work through the tutorial https://github.com/OpenNMT/OpenNMT-py#quickstart or a similar one.
- Propose how to connect the frontend and the backend.
Virtual meeting 23.10.2020:
Status:
- Built a frontend that communicates with the Microsoft Translator API; it uses Axios and vanilla JavaScript.
Tasks for the next meeting:
- Find a library that allows using a custom translation model. Try installing OpenNMT.
- Find out what the CORS policy means (a short note with a sketch follows this list).
- Continue writing the thesis and add a section on machine translation. Read the articles at https://opennmt.net/OpenNMT/references/ and take notes. Include the bibliographic reference and what you learned from each article.
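On CORS: a browser blocks a request from a frontend served on one origin to a backend on another origin unless the backend replies with the appropriate Access-Control-Allow-* headers. A minimal sketch, assuming the backend is a small Flask service (the endpoint name and the placeholder response are assumptions):
```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.after_request
def add_cors_headers(response):
    # Allow the JavaScript frontend on another origin to call this API.
    # In production the origin should be restricted to the real frontend URL.
    response.headers["Access-Control-Allow-Origin"] = "*"
    response.headers["Access-Control-Allow-Headers"] = "Content-Type"
    return response

@app.route("/translate")
def translate():
    # Placeholder response; the real backend would call the translation model.
    return jsonify({"translation": "..."})

if __name__ == "__main__":
    app.run(port=5000)
```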
Virtual meeting 16.10:
Status:

View File

@ -31,7 +31,42 @@ Návrh na zadanie:
1. Propose possible improvements of the application you created.
Task backlog:
- Create a repository on Git named bp2010 and put the code and documentation you create into it.
- Prepare a Docker image of your application following https://pythonspeed.com/docker/
Virtual meeting 30.10.:
Status:
- Modified the existing "spacy-streamlit" application; the source code is on Git according to the instructions from the last meeting.
- It contains a form but does not contain a REST API.
Tasks for the next meeting:
- Continue writing. Read scientific articles on "dependency parsing" and write down notes on what you learned. Record the source.
- Continue working on the demonstration web application.
Virtual meeting 19.10.:
Status:
- Prepared and submitted notes for the bachelor thesis; they contain excerpts from the literature.
- Created a repository: https://git.kemt.fei.tuke.sk/mw223on/bp2020
- Installed and ran the Slovak spaCy model.
- Installed the spaCy REST API https://github.com/explosion/spacy-services
- Tried the displaCy demo with the Slovak model.
Tasks for the next meeting:
- Prepare a web application that presents dependency parsing and named entity recognition for Slovak. It should consist of a frontend and a backend.
- List the required Python packages in a "requirements.txt" file.
- Create a script that installs the application with pip.
- Create a script that starts both the backend and the frontend. Put the results into the repository.
- Create a frontend design (HTML + CSS).
- Look at the spaCy source code and find out what exactly the displacy.serve command does (see the sketch after this list).
- Put the results into the repository.
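A minimal sketch of what the displaCy-based part of the demo could look like; the Slovak model name "sk_sk1" and the example sentence are placeholders (assumptions), and displacy.serve renders only one visualization style at a time:
```python
import spacy
from spacy import displacy

# Load the installed Slovak model; the package name is an assumption.
nlp = spacy.load("sk_sk1")

doc = nlp("Košice sú druhé najväčšie mesto na Slovensku.")

# displacy.serve starts a small built-in web server and renders the
# dependency tree at http://localhost:5000; style="ent" would show entities.
displacy.serve(doc, style="dep", port=5000)
```
Reading how displacy.serve builds its HTML is a reasonable starting point for splitting the demo into a separate frontend and a REST backend.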
Virtual meeting 9.10.

View File

@ -20,8 +20,21 @@ Návrh na zadanie:
2. Create a language model with BERT or a similar method.
3. Evaluate the created language model and propose improvements.
Task backlog:
Virtual meeting 30.10.2020
Status:
- Prepared notes on seq2seq.
- Installed PyTorch and fairseq.
- Problems with the tutorial. A solution could be to use the 0.9.0 release: pip install fairseq==0.9.0
For the next meeting:
- Resolve the technical problems.
- Work through the tutorial https://fairseq.readthedocs.io/en/latest/getting_started.html#training-a-new-model
- Work through the tutorial https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.md or a similar one.
- Study articles on BERT and take notes on what you learned, together with the source.
Virtual meeting 16.10.2020

View File

@ -23,13 +23,50 @@ Pokusný klaster Raspberry Pi pre výuku klaudových technológií
The goal of the project is to build an inexpensive home cluster for teaching cloud technologies.
Task backlog:
- Enable WSL2 and Docker Desktop if you use Windows.
Virtual meeting 30.10.
Status:
- Prepared the written overview according to the instructions.
- Installed Raspberry Pi OS in VirtualBox.
- Prepared a preliminary hardware design.
- Installed Docker Toolbox as well as Ubuntu with Docker.
- Got familiar with Docker.
- Supervisor: hardware purchase carried out - 5x Raspberry Pi 4 model B 8GB boards, 11x 128GB SD cards, 4x The Pi Hut Cluster Case for Raspberry Pi, 1x 60W power supply and an 18W Epico Quick Charger, a 220V cable and a socket with a switch.
For the next meeting:
- Can an official 5-port switch be bought?
- Complete the purchase and agree on the handover. Sign the handover protocol.
- Use https://kind.sigs.k8s.io to simulate the cluster.
- Install https://microk8s.io/ and read the tutorials at https://ubuntu.com/tutorials/
- Work through https://kubernetes.io/docs/tutorials/hello-minikube/ or a similar tutorial.
Virtual meeting 16.10.
Status:
- Read the articles.
- Started the Docker tutorial from ZCT.
- The supervisor set up access to a Jetson Xavier AGX2 with an ARM processor.
- Started the purchase of the Raspberry Pi and accessories.
Tasks for the next meeting:
- Prepare an overview of at least 4 existing Raspberry Pi cluster solutions (to be submitted). What hardware and software did they use?
- Power supply, cooling, network interconnection.
- Get familiar with https://www.raspberrypi.org/downloads/raspberry-pi-os/
- Install it following https://roboticsbackend.com/install-raspbian-desktop-on-a-virtual-machine-virtualbox/
- Write a detailed hardware proposal for building the Raspberry Pi cluster.
Meeting 29.9.
We agreed on the thesis assignment.
Suggestions for improvement (for the supervisor):
- Find out the conditions for financing (estimate 350 EUR).

View File

@ -39,23 +39,35 @@ Učenie prebieha tak, že v texte ukážete ktoré slová patria názvom osôb,
Your task will be to mark proper nouns in the text.
In Slovak, a proper noun usually starts with a capital letter, but it may also contain further words written in lowercase.
If a proper noun contains another name inside it, e.g. Nové Mesto nad Váhom, annotate it as a single unit.
- PER: names of persons
- LOC: geographic names
- ORG: names of organizations
- MISC: other names, e.g. product names.
In the text you will also come across words that name a geographic area but are not proper nouns (e.g. britská kolónia, londýnsky šerif...). We do not consider such words to be named entities, so please do not mark them.
If there are no annotations in the text, the article is still valid, so choose Accept.
If the text consists of only one or a few words that carry no meaning on their own, the article is invalid, so choose Reject.
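For illustration, a minimal sketch of what one accepted record could look like, assuming the Prodigy-style span format used elsewhere in the project; the sentence and character offsets are only an example:
```python
# The whole multi-word name "Nové Mesto nad Váhom" is a single LOC span.
example = {
    "text": "Navštívili sme Nové Mesto nad Váhom minulý rok.",
    "spans": [{"start": 15, "end": 35, "label": "LOC"}],
    "answer": "accept",
}
```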
## Annotation batches
Enter your e-mail into the form so that it is possible to identify who performed the annotation.
During annotation you can use keyboard shortcuts to make the work easier:
- 1,2,3,4 - switching between entity types
- key "a" - Accept
- key "x" - Reject
- key "space" - Ignore
- key "backspace" or "del" - Undo
After annotating, do not forget to save your work (the icon in the top left corner, or "Ctrl + s").
### Trial annotation batch
The batch is aimed at collecting feedback from annotators to improve the interface and the annotation process.
{% include "forms/form.html.twig" with { form: forms('ner1') } %}

View File

@ -36,11 +36,12 @@ Učenie prebieha tak, že vytvoríte príklad s otázkou a odpoveďou. Účasť
## Guide for annotators
First, a short article is displayed. Your task is to read part of the article, come up with a question about it, and mark the answer in the text. The question must be unambiguous and the answer must be present in the text of the article. You have about 50 seconds to mark one question.
1. Read the article. If the article is not suitable, click the red cross "Reject" (Tab and then 'x').
2. Write a question. If you cannot come up with a question, click "Ignore" (Tab and then 'i').
3. Mark the answer with the mouse, click the green check mark "Accept" (key 'a'), and continue with another question for the same article or with a new article.
4. The same article will be shown to you 5 times; come up with 5 different questions for it.
If the displayed text is unsuitable, reject it. Unsuitable text:
@ -61,6 +62,12 @@ Ak je zobrazený text nevhodný, tak ho zamietnite. Nevhodný text:
4. <span style="color:pink">Na čo slúži lyzozóm?</span>
5. <span style="color:orange">Čo je to autofágia?</span>
Examples of incorrect questions:
1. Čo je to Golgiho aparát? - the answer is not in the article.
2. Čo sa deje v mŕtvych bunkách? - the question is not unambiguous; the exact answer is not in the article.
3. Čo je normálny fyziologický proces? - the answer is not in the article.
Enter your e-mail into the form so that it is possible to identify who performed the annotation.
## Annotation batches