zpwiki

History

Daniel Hladek c065ab523d zz		2025-07-17 10:36:38 +02:00
..
README.md	zz	2025-07-17 10:36:38 +02:00

title

published

taxonomy

Youssef Ressaissi

true

category

tag

author

iaeste

summarization

nlp

Daniel Hladek

IAESTE Intern Summer 2025, 1.7. - 31.8.2025

Goal: Evaluate and improve language models for summarization in Slovak medical or legal domain.

Tasks:

and prepare working environment: HF transformers, datasets, lm-evaluation-harness, HF trl
Read several recent papers about summarization using LLM and write a report.
Get familiar how to perform and evaluate document summarization using language models in Slovak.

Pick summarization datasets and models. Evaluate several models for evaluation using ROUGE and BLEU metrics.
https://github.com/slovak-nlp/resources
Describe the experiments. Summarize results in a table. Describe the results.

Use more data. Prepare a domain-oriented dataset and finetune a model. Maybe generate artificial data to imporve summarization.
Run new expriments and write down the results.

Meeting 17.7.2025:

State:

Studying of the task, metrics (ROUGE,BLEU)
Loaded a model. preprocessed a dataset, evaluated a model
loaded more models, used SlovakSum, generated summarization with four model and comapre them with ROUGE and BLEU (TUKE-KEMT/slovak-t5-base, google/mt5-small, google/mt5-base, facebook/mbart-large-50)
the comparisin is without fine tuning (zero shot), for far, the best is MBART-large
working on legal dataset "dennlinger/eur-lex-sum",
notebooks are on the kemt git

Tasks:

Future tasks:

Try prompting LLM and evaluation of the results. We need to pick LLM with SLovak Support
Finetune an LLM to summarize
Use medical data (after they are ready).
Prepare a detailed report (to be converted into a paper).