---
title: Slovak Language Resources
published: true
taxonomy:
    category: [info]
    tag: [annotation,ner,pos,question-answer,nlp]
    author: Daniel Hladek
---
# Slovak Language Resources

### POS

[Multext East](http://nl.ijs.si/ME/): the annotated novel 1984 by George Orwell in 15 European languages

### NER

- Learning multilingual named entity recognition from Wikipedia - WikiNER?
- Cross-lingual Name Tagging and Linking for 282 Languages - NER annotation of Wikipedias, including the Slovak one, derived from the English one
    - https://drive.google.com/drive/folders/1bkK6ly_awxe9IgAKL16VVvCtjcYcDSw8
    - https://elisa-ie.github.io/wikiann/

### Parsing-POS

[Slovak Dependency Treebank](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1822)

https://github.com/UniversalDependencies/UD_Slovak-SNK

[Artificial Treebank with Ellipsis](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2616)
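
Universal Dependencies treebanks such as UD_Slovak-SNK are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, `#` comment lines, and a blank line between sentences. The following is a minimal sketch of reading such data with the Python standard library only. The helper name `read_conllu` and the sample sentence with its annotations are made up for illustration and are not taken from the treebank.

```python
import io
from typing import Dict, Iterator, List, TextIO

# The ten tab-separated CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(stream: TextIO) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of token dicts (hypothetical helper)."""
    sentence: List[Dict[str, str]] = []
    for line in stream:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) are skipped here.
            continue
        else:
            columns = line.split("\t")
            if len(columns) != len(FIELDS):
                raise ValueError(f"expected 10 columns, got {len(columns)}")
            token_id = columns[0]
            # Skip multiword token ranges (1-2) and empty nodes (1.1).
            if "-" in token_id or "." in token_id:
                continue
            sentence.append(dict(zip(FIELDS, columns)))
    if sentence:
        # The file may end without a trailing blank line.
        yield sentence


# Made-up sample for illustration; not copied from UD_Slovak-SNK.
SAMPLE = (
    "# text = Pes spí.\n"
    "1\tPes\tpes\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tspí\tspať\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(io.StringIO(SAMPLE)))
pos_tags = [(tok["form"], tok["upos"]) for tok in sentences[0]]
print(pos_tags)
```

For a real treebank file, pass `open(path, encoding="utf-8")` instead of the `io.StringIO` sample.
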

### Wordnet

[Slovak Word Net](https://korpus.sk/WordNet.html)

### Parallel Corpus

Europarl (European Parliament)

[Czech-Slovak Parallel Corpus](https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0006-AADF-0)

[English-Slovak Parallel Corpus](https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0006-AAE0-A)

[Multext East](http://nl.ijs.si/ME/)

### Sentiment

[Twitter sentiment for 15 European languages](https://www.clarin.si/repository/xmlui/handle/11356/1054)

### Web

- [Aranea](http://ucts.uniba.sk/aranea_about/)
- [SkTenTen](https://www.sketchengine.eu/sktenten-slovak-corpus/) automatically POS-annotated, accessible through a web interface
- [CommonCrawl](https://commoncrawl.org/2020/03/february-2020-crawl-archive-now-available/) Does it also contain Slovak data?
- [Oscar](https://traces1.inria.fr/oscar/) classified and deduplicated CommonCrawl data, also available for Slovak (4.5 GB and 665M words after deduplication)

### Wikipedia

[Wikipedia in Elasticsearch Bulk JSON format](https://dumps.wikimedia.org/other/cirrussearch/current/)
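
The Elasticsearch Bulk format stores one JSON object per line, alternating an action line with the document it applies to. A minimal sketch of pairing the two with the Python standard library follows. It assumes every action is an `index` action followed by exactly one document line. The helper name `read_bulk`, the field names `title` and `text`, and the sample records are illustrative assumptions that should be checked against an actual dump.

```python
import io
import json
from typing import Dict, Iterator, TextIO, Tuple


def read_bulk(stream: TextIO) -> Iterator[Tuple[Dict, Dict]]:
    """Yield (action, document) pairs from Bulk-format lines (hypothetical helper)."""
    action = None
    for line in stream:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if action is None:
            # Odd lines: the action metadata, e.g. {"index": {"_id": ...}}.
            action = record
        else:
            # Even lines: the document body belonging to the previous action.
            yield action, record
            action = None


# Made-up sample records for illustration; field names are assumptions.
SAMPLE = "\n".join([
    json.dumps({"index": {"_type": "page", "_id": "1"}}),
    json.dumps({"title": "Bratislava", "text": "Bratislava je hlavné mesto Slovenska."},
               ensure_ascii=False),
    json.dumps({"index": {"_type": "page", "_id": "2"}}),
    json.dumps({"title": "Košice", "text": "Košice sú mesto na východe Slovenska."},
               ensure_ascii=False),
]) + "\n"

pages = [(action["index"]["_id"], doc["title"])
         for action, doc in read_bulk(io.StringIO(SAMPLE))]
print(pages)
```

For a gzip-compressed dump file, the same function can read from `gzip.open(path, "rt", encoding="utf-8")`, which streams the file line by line instead of loading it into memory.
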

### Word Embedding

- [FastText Word Embedding from Common Crawl](https://fasttext.cc/docs/en/crawl-vectors.html)
- [FastText Word Embedding from Wikipedia](https://fasttext.cc/docs/en/pretrained-vectors.html)
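
The fastText vectors are also offered in a plain-text format: a header line with the vocabulary size and the vector dimension, then one line per word with the word followed by its space-separated components. The sketch below loads such data with the Python standard library and compares words by cosine similarity. The helper names `load_vec` and `cosine` and the tiny 3-dimensional vectors are made up for illustration.

```python
import io
import math
from typing import Dict, List, TextIO


def load_vec(stream: TextIO, limit: int = 0) -> Dict[str, List[float]]:
    """Load word vectors from fastText text format (hypothetical helper).

    If limit is positive, stop after that many words.
    """
    header = stream.readline().split()
    dim = int(header[1])  # header is "<vocabulary size> <dimension>"
    vectors: Dict[str, List[float]] = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            # Skip malformed lines rather than failing on them.
            continue
        vectors[word] = [float(v) for v in values]
        if limit and len(vectors) >= limit:
            break
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up 3-dimensional vectors for illustration; real files use 300 dimensions.
SAMPLE = (
    "3 3\n"
    "mačka 1.0 0.0 0.0\n"
    "pes 1.0 1.0 0.0\n"
    "auto 0.0 0.0 1.0\n"
)

vectors = load_vec(io.StringIO(SAMPLE))
sim_cat_dog = cosine(vectors["mačka"], vectors["pes"])
sim_cat_car = cosine(vectors["mačka"], vectors["auto"])
print(round(sim_cat_dog, 3), sim_cat_car)
```

The full vector files are large, so the `limit` parameter is a simple way to load only the first, most frequent words while experimenting.
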

### Resource Databases

https://github.com/slovak-nlp/resources

https://www.clarin.eu/portal

https://www.clarin.eu/resource-families/manually-annotated-corpora

http://www.meta-share.org/

https://korpus.sk/res.html

Slovak Stemming https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Slovak_Stemmer_Analysis

### Tools

- [Spacy](https://spacy.io/), tokenizer, stopwords, custom model
- [Slovak Lexer](https://github.com/hladek/slovak-lexer) / tokenizer
- [Slovak Elasticsearch](https://github.com/essential-data/elasticsearch-sk) - stopwords, stemmer
- [Slovak Hunspell](https://github.com/essential-data/hunspell-sk) - stemmer, spelling