Slovak Language Resources
POS
Multext East: the annotated novel 1984 by George Orwell in 15 European languages
NER
- Learning multilingual named entity recognition from Wikipedia - WikiNER?
- Cross-lingual Name Tagging and Linking for 282 Languages - NER annotation of Wikipedia, including the Slovak edition, based on the English one
Parsing-POS
https://github.com/UniversalDependencies/UD_Slovak-SNK
Artificial Treebank with Ellipsis
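Universal Dependencies treebanks such as UD_Slovak-SNK are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, `#` comment lines, and a blank line between sentences. The following is a minimal sketch of a reader using only the Python standard library. The function names `read_conllu` and `form_upos_pairs` and the sample sentence are made up for illustration and are not taken from the treebank.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# The ten CoNLL-U columns, in file order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

Token = Dict[str, str]


def read_conllu(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        columns = line.split("\t")
        if len(columns) != len(FIELDS):
            raise ValueError(
                f"expected {len(FIELDS)} columns, got {len(columns)}")
        sentence.append(dict(zip(FIELDS, columns)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


def form_upos_pairs(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Return (form, UPOS) pairs, skipping multiword ranges and empty nodes."""
    return [(t["form"], t["upos"]) for t in sentence if t["id"].isdigit()]


# Hand-written illustrative sentence, NOT copied from UD_Slovak-SNK.
SAMPLE = (
    "# text = Pes spí.\n"
    "1\tPes\tpes\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tspí\tspať\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
)

if __name__ == "__main__":
    for sent in read_conllu(SAMPLE.splitlines()):
        print(form_upos_pairs(sent))
```

For a real treebank file, pass an open file handle (opened with `encoding="utf-8"`) instead of `SAMPLE.splitlines()`. A dedicated CoNLL-U library is likely a better choice for anything beyond quick inspection.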
Wordnet
Parallel Corpus
European Parliament
English-Slovak Parallel Corpus
Sentiment
Twitter sentiment for 15 European languages
Web
- Aranea
- SkTenTen: automatically POS-annotated, accessible through a web interface
- CommonCrawl: does it also contain Slovak data?
- OSCAR: classification and deduplication of CommonCrawl data, also available for Slovak (4.5 GB deduplicated, 665M words deduplicated)
Wikipedia
Wikipedia in Elasticsearch Bulk JSON format
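A file in Elasticsearch Bulk format is newline-delimited JSON in which an action line is followed by the document it applies to. The sketch below pairs these lines up using only the standard library. It assumes every action is an `index` action followed by exactly one document line; the function name `read_bulk`, the field names `title` and `text`, and the sample records are assumptions made for illustration and should be checked against the actual dump.

```python
import io
import json
from typing import Dict, Iterable, Iterator, Tuple


def read_bulk(lines: Iterable[str]) -> Iterator[Tuple[Dict, Dict]]:
    """Yield (action, document) pairs from Bulk-formatted lines."""
    pending_action = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        record = json.loads(line)
        if pending_action is None:
            # First line of a pair: action metadata, e.g. {"index": {...}}.
            pending_action = record
        else:
            # Second line of a pair: the document source for that action.
            yield pending_action, record
            pending_action = None
    if pending_action is not None:
        raise ValueError("action line without a following document line")


# Hand-written illustrative records, NOT taken from a real dump.
SAMPLE = (
    '{"index": {"_id": "1"}}\n'
    '{"title": "Bratislava", "text": "Bratislava je hlavné mesto Slovenska."}\n'
    '{"index": {"_id": "2"}}\n'
    '{"title": "Košice", "text": "Košice sú mesto na východe Slovenska."}\n'
)

if __name__ == "__main__":
    for action, doc in read_bulk(io.StringIO(SAMPLE)):
        print(action["index"]["_id"], doc["title"])
```

Because the reader consumes lines lazily, a large dump can be streamed from a file handle (or from `gzip.open(path, "rt", encoding="utf-8")` if it is compressed) without loading it into memory.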
Word Embedding
Resource Databases
https://github.com/slovak-nlp/resources
https://www.clarin.eu/resource-families/manually-annotated-corpora
Slovak Stemming https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Slovak_Stemmer_Analysis
Tools
- spaCy, tokenizer, stopwords, custom model
- Slovak Lexer / tokenizer
- Slovak Elasticsearch - stopwords, stemmer
- Slovak Hunspell - stemmer, spelling