Added files with an in-progress survey of translated SQuAD datasets and freely available text-translation APIs; also attached a script for parsing the SQuAD dataset and counting its characters
commit
9a9f30f5f6
42
Porovnanie sluzieb.md
Normal file
@ -0,0 +1,42 @@
# Comparison of Language Translation Services

## Google Translation API

[Pricing](https://cloud.google.com/translate/pricing)

[Docs](https://cloud.google.com/translate/docs/basic/translating-text)

| Monthly usage | Price |
|---|---|
| 1 to 500,000 characters | Free |
| 500,000 to 1 billion characters | ~$20 per million characters |

## Amazon AWS

[Pricing](https://aws.amazon.com/translate/pricing/)

| Monthly usage | Price |
|---|---|
| 1 to 2,000,000 characters | Free |
| Above 2,000,000 characters | ~$15 per million characters |

## Microsoft Azure Translation API

[Pricing](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/translator)

| Monthly usage | Price |
|---|---|
| Up to 2,000,000 characters | Free |
| Above 2,000,000 characters | ~$10 per million characters |

## IBM Watson Translation

[Pricing](https://www.ibm.com/cloud/watson-language-translator/pricing)

Multiple plans with various free tiers, but the pricing page gives no exact cost for usage beyond the free tier.

## Systran

[Possible candidate, not yet verified](https://translate.systran.net)
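
To get a rough sense of what translating a dataset would cost, the sketch below applies the free allowances and per-million prices from the tables above to a character count, such as the total printed by `squad_char_counter.py`. It is a minimal sketch, not an official calculator: the `PRICING` dictionary, the function names, and the 10,000,000-character example are illustrative assumptions, the prices are the approximate values listed above and may have changed, and the whole job is assumed to run within a single month so the free tier is applied once. IBM Watson and Systran are omitted because no per-character price is listed for them.

```python
# Free monthly allowance and approximate USD price per million characters,
# copied from the tables above (values may be out of date).
PRICING = {
    "Google Translation API": {"free_chars": 500_000, "usd_per_million": 20.0},
    "Amazon AWS": {"free_chars": 2_000_000, "usd_per_million": 15.0},
    "Microsoft Azure": {"free_chars": 2_000_000, "usd_per_million": 10.0},
}


def estimate_cost(total_chars, free_chars, usd_per_million):
    """Return the estimated cost in USD for translating total_chars in one month."""
    billable_chars = max(0, total_chars - free_chars)
    return billable_chars / 1_000_000 * usd_per_million


def compare_services(total_chars):
    """Return a {service name: estimated cost in USD} mapping."""
    return {
        name: estimate_cost(total_chars, **plan)
        for name, plan in PRICING.items()
    }


if __name__ == "__main__":
    example_chars = 10_000_000  # hypothetical count, not a measured value
    for name, cost in compare_services(example_chars).items():
        print(f"{name}: ~${cost:.2f}")
```

Splitting the work across several months would apply the free allowance more than once and lower the estimate accordingly.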
45
Prieskum jazykovych verzii.md
Normal file
@ -0,0 +1,45 @@
# Survey of SQuAD Language Versions and How They Were Created

## Spanish SQuAD

[Hugging Face](https://huggingface.co/datasets/squad_es)

[Paper](https://www.researchgate.net/publication/337904607_Automatic_Spanish_Translation_of_the_SQuAD_Dataset_for_Multilingual_Question_Answering/fulltext/5df1bb65299bf10bc3545e97/Automatic-Spanish-Translation-of-the-SQuAD-Dataset-for-Multilingual-Question-Answering.pdf)

The authors translated SQuAD version 1.1 and developed the TAR (Translate Align Retrieve) method.

### Translate Align Retrieve Method

[GitHub repo](https://github.com/ccasimiro88/TranslateAlignRetrieve)

TAR is a method that translates the contexts, questions, and answers of the SQuAD dataset into another language. It consists of three parts:

1. A trained NMT model that translates from the source language to the target language
2. A word alignment model
3. A procedure for translating the context, questions, and answers into the target language using the two preceding components

To align each context with its translation, they used the [eflomal](https://ufal.mff.cuni.cz/pbml/106/art-ostling-tiedemann.pdf) model, an implementation of which is available on [GitHub](https://github.com/robertostling/efmaral).

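The retrieve step can be illustrated with a small sketch: given a word alignment between the source context and its translation, the answer span in the translation is taken to be the range of target tokens aligned to the source answer tokens. This is a simplified illustration, not the TAR authors' implementation. The function names and the example sentences are hypothetical, and the `i-j` alignment strings are assumed to follow the Pharaoh-style notation commonly emitted by word aligners.

```python
def parse_alignment(line):
    """Parse space-separated 'i-j' links into (source_index, target_index) pairs."""
    pairs = []
    for link in line.split():
        src, tgt = link.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def retrieve_answer_span(alignment, src_start, src_end):
    """Map an inclusive source token span to an inclusive target token span.

    Returns None when no source token in the span is aligned to anything.
    """
    tgt_indices = [t for s, t in alignment if src_start <= s <= src_end]
    if not tgt_indices:
        return None
    return min(tgt_indices), max(tgt_indices)


if __name__ == "__main__":
    # Hand-written example; a real pipeline would get these from the NMT
    # model and the word aligner.
    source = "The cat sat on the mat".split()
    target = "El gato se sentó en la alfombra".split()
    alignment = parse_alignment("0-0 1-1 2-3 3-4 4-5 5-6")

    # Source answer "the mat" covers source tokens 4..5.
    span = retrieve_answer_span(alignment, 4, 5)
    if span is not None:
        start, end = span
        print(" ".join(target[start:end + 1]))
```

Taking the minimum and maximum aligned index keeps the retrieved span contiguous even when the translation reorders words, at the cost of occasionally including unaligned tokens that fall in between.
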
## Swedish SQuAD

[Hugging Face](https://huggingface.co/datasets/susumu2357/squad_v2_sv)

[GitHub repo](https://github.com/susumu2357/SQuAD_v2_sv)

To translate SQuAD, the authors used the Google Translation API. They wrapped the answer inside the context in the special marker "[0]" so that the answer could still be located after translation, when the words of the context may have shifted position.

However, this procedure is not perfect: some context-answer pairs were not translated correctly. As a result, the translated dataset is only 90% of the size of the original SQuAD.

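The marker approach described above can be sketched as follows. This is an illustration under assumptions, not the repository's code: the helper names are hypothetical, the exact way the markers were inserted is not specified here, and the call to the translation service is left out, so the example feeds in a hand-written translated string. Pairs whose markers do not survive translation are rejected, which is one plausible way a translated dataset ends up smaller than the original.

```python
MARKER = "[0]"  # the special marker mentioned above


def mark_answer(context, answer_start, answer_text):
    """Wrap the answer span inside the context with MARKER on both sides."""
    end = answer_start + len(answer_text)
    return (
        context[:answer_start]
        + MARKER + context[answer_start:end] + MARKER
        + context[end:]
    )


def recover_answer(translated_marked_context):
    """Strip the markers and return (context, answer_start, answer_text).

    Returns None if the translation did not preserve exactly two markers
    or if the marked span is empty.
    """
    parts = translated_marked_context.split(MARKER)
    if len(parts) != 3:
        return None
    before, answer, after = parts
    answer_text = answer.strip()
    if not answer_text:
        return None
    context = before + answer + after
    # Skip any whitespace the translator inserted right after the first marker.
    answer_start = len(before) + (len(answer) - len(answer.lstrip()))
    return context, answer_start, answer_text


if __name__ == "__main__":
    marked = mark_answer("The cat sat.", 4, "cat")
    print(marked)
    # Stand-in for the translated output of the marked context.
    print(recover_answer("Katten [0]satt[0] ner."))
```
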
The dataset was used to fine-tune the [Swedish BERT](https://github.com/Kungbib/swedish-bert-models) model, and its results were compared with those of the original Swedish BERT model and Multilingual XLM-RoBERTa. ([results](https://github.com/susumu2357/SQuAD_v2_sv#evaluation-on-squad_v2_sv-dev))

For a more realistic comparison of the models, a smaller QA dataset of 91 question-answer pairs was created in-house; on it, the fine-tuned model performs much better. ([results](https://github.com/susumu2357/SQuAD_v2_sv#evaluation-on-nobel-prize-dataset))

## French SQuAD

[Hugging Face](https://huggingface.co/datasets/qwant/squad_fr)

[Paper](https://hal.archives-ouvertes.fr/hal-03336060/file/RANLP_2021_transformers_usability.pdf)

## Italian SQuAD

[Hugging Face](https://huggingface.co/datasets/squad_it)

The paper is not freely available.
1
squad-v2-dev.json
Normal file
File diff suppressed because one or more lines are too long
45
squad_char_counter.py
Normal file
@ -0,0 +1,45 @@
import json

# Load the SQuAD v2 dev set (expected in the current working directory).
with open("squad-v2-dev.json", "r", encoding="utf-8") as f:
    squad = json.load(f)

num_articles = len(squad['data'])
print(f"total articles: {num_articles}")

# Character counters for each translatable field.
context_chars = 0
question_chars = 0
answer_chars = 0

# Item counters.
total_paragraphs = 0
total_qas = 0
total_answers = 0

# Walk the nested structure: article -> paragraphs -> qas -> answers.
for article in squad['data']:
    total_paragraphs += len(article['paragraphs'])

    for paragraph in article['paragraphs']:
        context_chars += len(paragraph['context'])

        total_qas += len(paragraph['qas'])

        for qas in paragraph['qas']:
            question_chars += len(qas['question'])

            # Only the 'answers' list is counted; other fields are ignored.
            total_answers += len(qas['answers'])

            for answer in qas['answers']:
                answer_chars += len(answer['text'])

print(f"total paragraphs: {total_paragraphs}")
print(f"total qas: {total_qas}")
print(f"total answers: {total_answers}")

print(f"chars in contexts: {context_chars}")
print(f"chars in questions: {question_chars}")
print(f"chars in answers: {answer_chars}")

# Total number of characters that would be sent to a translation service.
total_chars = context_chars + question_chars + answer_chars

print(f"total chars: {total_chars}")