Added files with an in-progress survey of translated SQuAD datasets and freely available text translation APIs; also attached a script that parses the SQuAD dataset and counts its characters

Tomas Kucharik 2021-10-22 13:46:19 +02:00
commit 9a9f30f5f6
4 changed files with 133 additions and 0 deletions

Porovnanie sluzieb.md Normal file

@ -0,0 +1,42 @@
# Comparison of language translation services
## Google Translation API
[Pricing](https://cloud.google.com/translate/pricing)
[docs](https://cloud.google.com/translate/docs/basic/translating-text)
| Usage | Price |
|---|---|
| 1 to 500,000 characters per month | Free |
| 500,000 to 1 billion characters per month | ~$20 per million characters |
## Amazon AWS
[Pricing](https://aws.amazon.com/translate/pricing/)
| Usage | Price |
|---|---|
| 1 to 2,000,000 characters per month | Free |
| above 2,000,000 characters per month | ~$15 per million characters |
## Microsoft Azure translation API
[Pricing](https://azure.microsoft.com/en-us/pricing/details/cognitive-services/translator)
| Usage | Price |
|---|---|
| Up to 2,000,000 characters per month | Free |
| above 2,000,000 characters per month | ~$10 per million characters |
## IBM Watson translation
[Pricing](https://www.ibm.com/cloud/watson-language-translator/pricing)
Multiple plans with various free tiers, but no exact prices are listed for usage beyond the free tier.
## Systran
[maybe??](https://translate.systran.net)
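
To compare the priced services for a concrete workload, the tables above can be turned into a rough estimate. The sketch below is only an illustration and is not taken from any provider's documentation: it assumes the free tier is simply deducted from the monthly volume and that the approximate per-million prices listed above apply linearly to the rest. The names `TIERS` and `estimate_monthly_cost` are hypothetical.

```python
# Approximate (free characters per month, USD per million characters),
# copied from the tables above; check the linked pricing pages for current values.
TIERS = {
    "Google Translation API": (500_000, 20.0),
    "Amazon AWS": (2_000_000, 15.0),
    "Microsoft Azure": (2_000_000, 10.0),
}


def estimate_monthly_cost(chars: int, free_chars: int, usd_per_million: float) -> float:
    """Estimate the cost of translating `chars` characters in one month.

    Assumption: the free tier is deducted first and the rest is billed linearly.
    """
    billable = max(0, chars - free_chars)
    return billable / 1_000_000 * usd_per_million


if __name__ == "__main__":
    total_chars = 5_000_000  # hypothetical workload, replace with a measured count
    for service, (free_chars, price) in TIERS.items():
        cost = estimate_monthly_cost(total_chars, free_chars, price)
        print(f"{service}: ~${cost:.2f}")
```

The character count of the dataset to be translated can be substituted for `total_chars` to see which service is likely cheapest for a one-off translation.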


@ -0,0 +1,45 @@
# Survey of the different language versions of the SQuAD dataset and how they were created
## Spanish SQuAD
[Hugging Face](https://huggingface.co/datasets/squad_es)
[White paper](https://www.researchgate.net/publication/337904607_Automatic_Spanish_Translation_of_the_SQuAD_Dataset_for_Multilingual_Question_Answering/fulltext/5df1bb65299bf10bc3545e97/Automatic-Spanish-Translation-of-the-SQuAD-Dataset-for-Multilingual-Question-Answering.pdf)
The authors translated SQuAD version 1.1 and developed the TAR (Translate Align Retrieve) method.
### Translate Align Retrieve Method
[GitHub repo](https://github.com/ccasimiro88/TranslateAlignRetrieve)
TAR is a method that translates the contexts, questions, and answers of the SQuAD dataset into another language. It consists of three parts:
1. A trained NMT model from the source language to the target language
2. A word alignment model
3. A procedure for translating the context, questions, and answers into the target language using the previous two components

To align the context with its translation they used the [eflomal](https://ufal.mff.cuni.cz/pbml/106/art-ostling-tiedemann.pdf) model, whose implementation is available on [GitHub](https://github.com/robertostling/efmaral).
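
The retrieve step can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the code from the TAR repository: it assumes the aligner output has already been parsed into `(source_index, target_index)` token pairs, and the function name `project_answer_span` and the toy alignment are hypothetical.

```python
from typing import List, Optional, Tuple


def project_answer_span(
    alignment: List[Tuple[int, int]],
    src_start: int,
    src_end: int,
) -> Optional[Tuple[int, int]]:
    """Map a source token span [src_start, src_end] to a target token span.

    `alignment` holds (source_index, target_index) pairs for one
    context and its translation.
    """
    targets = [t for s, t in alignment if src_start <= s <= src_end]
    if not targets:
        return None  # the answer has no aligned target tokens
    return min(targets), max(targets)


src = "The tower is in Paris".split()
tgt = "Veža je v Paríži".split()
# Hand-written toy alignment, not real aligner output.
alignment = [(1, 0), (2, 1), (3, 2), (4, 3)]
span = project_answer_span(alignment, 4, 4)  # source answer: "Paris"
if span is not None:
    print(" ".join(tgt[span[0] : span[1] + 1]))  # prints: Paríži
```

Taking the minimum and maximum aligned target index keeps the projected answer contiguous even when the translation reorders words, at the cost of sometimes including extra tokens.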
## Swedish SQuAD
[Hugging Face](https://huggingface.co/datasets/susumu2357/squad_v2_sv)
[GitHub repo](https://github.com/susumu2357/SQuAD_v2_sv)
To translate SQuAD they used the Google Translation API, inserting the special marker "[0]" around the answer in the context so that the answer could still be located in the translated context, where the word order may have shifted.
However, this approach is not perfect: some context-answer pairs were not translated correctly. As a result, the translated dataset is only about 90% of the size of the original SQuAD.
The dataset was used to fine-tune the [Swedish BERT](https://github.com/Kungbib/swedish-bert-models) model, whose results were compared with those of the original Swedish BERT model and of Multilingual XLM-RoBERTa. ([results](https://github.com/susumu2357/SQuAD_v2_sv#evaluation-on-squad_v2_sv-dev))
For a more realistic comparison of the models, a smaller QA dataset with 91 question-answer pairs was created internally, on which the fine-tuned model performs much better. ([results](https://github.com/susumu2357/SQuAD_v2_sv#evaluation-on-nobel-prize-dataset))
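
The marker approach described above can be sketched as follows. This is a minimal sketch, not the code from the SQuAD_v2_sv repository: the helper names `mark_answer` and `recover_answer` are hypothetical, no translation API is called, and it assumes the translation service returns the "[0]" markers unchanged.

```python
from typing import Optional, Tuple

MARKER = "[0]"


def mark_answer(context: str, answer_start: int, answer_text: str) -> str:
    """Wrap the answer inside the context with markers before translation."""
    answer_end = answer_start + len(answer_text)
    return (
        context[:answer_start]
        + f"{MARKER} {answer_text} {MARKER}"
        + context[answer_end:]
    )


def recover_answer(translated: str) -> Optional[Tuple[str, str, int]]:
    """Return (clean_context, answer_text, answer_start), or None if the
    two markers did not survive translation."""
    parts = translated.split(MARKER)
    if len(parts) != 3:
        return None
    before, answer, after = parts
    answer = answer.strip()
    return before + answer + after, answer, len(before)


context = "The Eiffel Tower was completed in 1889."
marked = mark_answer(context, context.index("1889"), "1889")
# In practice `marked` would be sent to the translation API here;
# the untranslated string stands in for the API response.
print(recover_answer(marked))
```

Returning `None` when the markers are missing or duplicated lets the caller skip that context-answer pair instead of storing a wrong answer position.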
## French SQuAD
[Hugging Face](https://huggingface.co/datasets/qwant/squad_fr)
[White paper](https://hal.archives-ouvertes.fr/hal-03336060/file/RANLP_2021_transformers_usability.pdf)
## Italian SQuAD
[Hugging Face](https://huggingface.co/datasets/squad_it)
The paper is not freely available.

squad-v2-dev.json Normal file

File diff suppressed because one or more lines are too long

squad_char_counter.py Normal file

@ -0,0 +1,45 @@
import json

# Load the SQuAD v2 dev set; the file is expected in the working directory.
with open("squad-v2-dev.json", "r", encoding="utf-8") as f:
    squad = json.load(f)

num_articles = len(squad['data'])
print(f"total articles: {num_articles}")

# Running totals of characters and of items at each nesting level.
context_chars = 0
question_chars = 0
answer_chars = 0
total_paragraphs = 0
total_qas = 0
total_answers = 0

# SQuAD nesting: article -> paragraphs -> qas -> answers.
for article in squad['data']:
    total_paragraphs += len(article['paragraphs'])
    for paragraph in article['paragraphs']:
        context_chars += len(paragraph['context'])
        total_qas += len(paragraph['qas'])
        for qas in paragraph['qas']:
            question_chars += len(qas['question'])
            total_answers += len(qas['answers'])
            for answer in qas['answers']:
                answer_chars += len(answer['text'])

print(f"total paragraphs: {total_paragraphs}")
print(f"total qas: {total_qas}")
print(f"total answers: {total_answers}")
print(f"chars in contexts: {context_chars}")
print(f"chars in questions: {question_chars}")
print(f"chars in answers: {answer_chars}")

# Total number of characters that would have to be sent to a translation service.
total_chars = context_chars + question_chars + answer_chars
print(f"total chars: {total_chars}")