---
title: Youssef Ressaissi
published: true
taxonomy:
    category: [iaeste]
    tag: [summarization,nlp]
    author: Daniel Hladek
---

IAESTE Intern Summer 2025, 1.7. - 31.8.2025

Goal: Evaluate and improve language models for summarization in the Slovak medical or legal domain.

Tasks:

1. Get familiar with the basic tools
    - Prepare a working environment: HF transformers, datasets, lm-evaluation-harness, HF trl.
    - Read several recent papers about summarization using LLMs and write a report.
    - Learn how to perform and evaluate document summarization using language models in Slovak.

2. Make a comparison experiment
    - Pick summarization datasets and models. Evaluate several models using ROUGE and BLEU metrics.
    - https://github.com/slovak-nlp/resources
    - Describe the experiments. Summarize the results in a table. Describe the results.

3. Improve the performance of a language model
    - Use more data. Prepare a domain-oriented dataset and fine-tune a model. Maybe generate artificial data to improve summarization.
    - Run new experiments and write down the results.

4. Report and disseminate
    - Prepare a final report with analysis, experiments and conclusions.
    - Publish the fine-tuned models on the HF Hub. Publish a paper from the project.

Meeting 17.7.2025:

State:

- Studied the task and the metrics (ROUGE, BLEU).
- Loaded a model, preprocessed a dataset, evaluated the model.
- Loaded more models, used SlovakSum, generated summaries with four models and compared them with ROUGE and BLEU (TUKE-KEMT/slovak-t5-base, google/mt5-small, google/mt5-base, facebook/mbart-large-50).
- The comparison is without fine-tuning (zero-shot); so far, the best is MBART-large.
- Working on the legal dataset "dennlinger/eur-lex-sum".
- Notebooks are on the KEMT git.

Tasks:

- Prepare the "mango.kemt.fei.tuke.sk" workflow.
- Fine-tune existing models and evaluate them. Use the News and Legal datasets.
- Try mbart-large, flan-t5-large, slovak-t5-base, google/t5-v1_1-large.
- Describe the experimental setup, prepare tables with results.

Future tasks:

- Try prompting an LLM and evaluate the results. We need to pick an LLM with Slovak support.
- Fine-tune an LLM to summarize.
- Use medical data (once they are ready).
- Prepare a detailed report (to be converted into a paper).
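
Note on the metrics: the ROUGE scores used to compare the models are based on n-gram overlap between a generated summary and a reference summary. The sketch below is only an illustration of that idea, not the code from the project notebooks. It assumes plain lowercasing and whitespace tokenization, which is a simplification; the helper names `ngrams` and `rouge_n` are hypothetical, and established packages apply their own tokenization and may report different numbers.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """ROUGE-N precision, recall and F1 from clipped n-gram overlap.

    Assumption: lowercasing and whitespace splitting stand in for a
    real tokenizer.
    """
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((ref & cand).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Made-up example sentences, not taken from any dataset.
    reference = "súd zamietol odvolanie žalobcu"
    candidate = "súd zamietol odvolanie"
    print(rouge_n(reference, candidate, n=1))
    print(rouge_n(reference, candidate, n=2))
```

In this example the candidate covers three of the four reference words, so ROUGE-1 recall is 0.75 and precision is 1.0. For ROUGE-2 it covers two of the three reference bigrams.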