2020-03-02 13:49:10 +00:00

# Slovak Language Resources

### POS

[Multext East](http://nl.ijs.si/ME/): the annotated novel *1984* by George Orwell in 15 European languages

### NER

- Learning multilingual named entity recognition from Wikipedia (WikiNER?)
- Cross-lingual Name Tagging and Linking for 282 Languages: NER annotation of Wikipedia, including the Slovak edition, based on the English one
- https://drive.google.com/drive/folders/1bkK6ly_awxe9IgAKL16VVvCtjcYcDSw8
- https://elisa-ie.github.io/wikiann/

### Parsing-POS

[Slovak Dependency Treebank](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1822)

https://github.com/UniversalDependencies/UD_Slovak-SNK

[Artificial Treebank with Ellipsis](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2616)

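
Universal Dependencies treebanks such as UD_Slovak-SNK are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, `#` comment lines, and a blank line between sentences. The following is a minimal sketch of a reader in plain Python under that assumption. The helper name `read_conllu` and the sample sentence are illustrative and are not taken from the treebank or its tooling.

```python
import io


def read_conllu(lines):
    """Yield sentences as lists of (form, lemma, upos, head, deprel) tuples."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level metadata such as "# text = ..." is skipped here.
            continue
        cols = line.split("\t")
        # Keep only regular tokens; multiword ranges (1-2) and empty nodes (1.1) are skipped.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        sentence.append((cols[1], cols[2], cols[3], int(cols[6]), cols[7]))
    if sentence:
        yield sentence


# Hypothetical sample sentence, written by hand for illustration.
SAMPLE = (
    "# text = Mama varí.\n"
    "1\tMama\tmama\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tvarí\tvariť\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(io.StringIO(SAMPLE)))
for form, lemma, upos, head, deprel in sentences[0]:
    print(form, lemma, upos, head, deprel)
```

For a real treebank file, pass an open file handle (`open(path, encoding="utf-8")`) instead of the `io.StringIO` sample.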
### WordNet

[Slovak WordNet](https://korpus.sk/WordNet.html)

### Parallel Corpus

European Parliament (Europarl)

[Czech-Slovak Parallel Corpus](https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0006-AADF-0)

[English-Slovak Parallel Corpus](https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0006-AAE0-A)

[Multext East](http://nl.ijs.si/ME/)

### Sentiment

[Twitter sentiment for 15 European languages](https://www.clarin.si/repository/xmlui/handle/11356/1054)

### Web

- [Aranea](http://ucts.uniba.sk/aranea_about/)
- [SkTenTen](https://www.sketchengine.eu/sktenten-slovak-corpus/): automatically POS-annotated, accessible through a web interface
- [CommonCrawl](https://commoncrawl.org/2020/03/february-2020-crawl-archive-now-available/): does it also contain Slovak data?
- [Oscar](https://traces1.inria.fr/oscar/): language classification and deduplication of CommonCrawl data, including Slovak (4.5 GB and 665M words after deduplication)

### Wikipedia

[Wikipedia in Elasticsearch bulk JSON format](https://dumps.wikimedia.org/other/cirrussearch/current/)

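
In the Elasticsearch bulk format, lines alternate between an action object (`{"index": ...}`) and the JSON source of one document. The sketch below shows one way to iterate over such a dump in plain Python. It assumes the dump is gzip-compressed and that each page document carries `title` and `text` fields. The helper names `iter_pages` and `open_dump` and the sample records are illustrative only.

```python
import gzip
import io
import json


def iter_pages(stream):
    """Yield (title, text) pairs from a stream of bulk-format text lines."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Action lines contain only the "index" key; they carry no page content.
        if len(record) == 1 and "index" in record:
            continue
        # Assumption: page documents expose "title" and "text" fields.
        yield record.get("title", ""), record.get("text", "")


def open_dump(path):
    """Open a gzip-compressed dump as a UTF-8 text stream."""
    return gzip.open(path, "rt", encoding="utf-8")


# Hypothetical two-page sample in the same alternating layout.
SAMPLE = io.StringIO(
    '{"index": {"_type": "page", "_id": "1"}}\n'
    '{"title": "Bratislava", "text": "Bratislava je hlavné mesto Slovenska."}\n'
    '{"index": {"_type": "page", "_id": "2"}}\n'
    '{"title": "Košice", "text": "Košice sú mesto na východe Slovenska."}\n'
)

pages = list(iter_pages(SAMPLE))
for title, text in pages:
    print(title, len(text.split()))
```

With a downloaded dump, replace the sample with `with open_dump(path) as stream:` and iterate lazily instead of building a list, since the full file does not fit comfortably in memory.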
### Word Embedding

- [FastText Word Embedding from Common Crawl](https://fasttext.cc/docs/en/crawl-vectors.html)
- [FastText Word Embedding from Wikipedia](https://fasttext.cc/docs/en/pretrained-vectors.html)

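
The fastText vectors are available in a text format (`.vec`): a header line with the vocabulary size and the dimension, followed by one word and its space-separated values per line. The following is a minimal sketch of loading such a file and comparing two words by cosine similarity, in plain Python with no fastText dependency. The helper names `load_vec` and `cosine` and the tiny three-word sample are illustrative only.

```python
import io
import math


def load_vec(stream, limit=None):
    """Read word vectors from a .vec text stream, optionally only the first `limit` words."""
    header = stream.readline().split()
    _count, dim = int(header[0]), int(header[1])
    vectors = {}
    for i, line in enumerate(stream):
        if limit is not None and i >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            # Skip malformed lines rather than fail on them.
            continue
        vectors[word] = [float(v) for v in values]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical 2-dimensional sample; real files use 300 dimensions.
SAMPLE = io.StringIO("3 2\npes 1.0 0.0\nmačka 0.8 0.6\nauto 0.0 1.0\n")

vectors = load_vec(SAMPLE)
print(round(cosine(vectors["pes"], vectors["mačka"]), 3))
```

The downloadable files are large, so the `limit` argument keeps only the first lines, which in these releases are the most frequent words. A compressed file can be opened with `gzip.open(path, "rt", encoding="utf-8")`.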
### Resource Databases

https://www.clarin.eu/portal

https://www.clarin.eu/resource-families/manually-annotated-corpora

http://www.meta-share.org/

https://korpus.sk/res.html

Slovak stemming: https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Slovak_Stemmer_Analysis

### Tools

- [spaCy](https://spacy.io/): tokenizer, stop words, custom model
- [Slovak Lexer](https://github.com/hladek/slovak-lexer): tokenizer
- [Slovak Elasticsearch](https://github.com/essential-data/elasticsearch-sk): stop words, stemmer
- [Slovak Hunspell](https://github.com/essential-data/hunspell-sk): stemmer, spelling