---
title: Oliver Pejic
published: true
taxonomy:
    category: [iaeste]
    tag: [hatespeech,nlp]
    author: Daniel Hladek
---

Oliver Pejic

IAESTE Intern Summer 2024, 12 weeks in August, September and October.

Goal:

- Help with the [Hate Speech Project](/topics/hatespeech).
- Help with the evaluation of sentence transformer models using the [MTEB](https://github.com/embeddings-benchmark/mteb) toolkit.

Final Tasks:

- Prepare an MTEB evaluation task for [Slovak hate speech](https://huggingface.co/datasets/TUKE-KEMT/hate_speech_slovak).
- Prepare an MTEB evaluation task for [Slovak question answering](https://huggingface.co/datasets/TUKE-KEMT/retrieval-skquad).
- [Machine translate](https://huggingface.co/google/madlad400-3b-mt) an SBERT evaluation set into multiple Slavic languages.
- Write a short scientific paper with the results.

Meeting 3.10.:

State:

- Prepared a pull request for Retrieval SK Quad.
- Prepared a pull request for Hate Speech Slovak.

Tasks:

- Make the pull requests compatible with the MTEB contribution guidelines. Discuss them when they are done.
- Submit the pull requests to the MTEB project.
- Machine translate a database (HotpotQA, DBpedia, FEVER). Pick a short database, because the translation might be slow.

Non-priority tasks:

- Prepare the translated database and submit it to the HuggingFace Hub.
- Prepare an MTEB PR for the database.

Meeting 3.9.:

State:

- Studied the MTEB framework and transformers.

Tasks:

- Prepare and try MTEB evaluation tasks for the databases. For evaluation you can try the me5-base model.
- Make a fork of MTEB and do the necessary modifications, including the documentation references for the tasks.
- Prepare 2 GitHub pull requests for the databases; a preliminary BEIR script is given.

Future tasks:

- Prepare a machine translation system to create another Slovak/multilingual evaluation task from an English task.

Preparation (7.8.2024):

- Get familiar with the [SentenceTransformer](https://sbert.net/) framework, study the fundamental papers and write down notes.
- Get familiar with the [MTEB](https://github.com/embeddings-benchmark/mteb) evaluation framework.
- Prepare a working environment on Google Colab, on the school server, or with Anaconda.
- Get familiar with the [existing finetuning scripts](https://git.kemt.fei.tuke.sk/dano/slovakretrieval).
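
The BEIR-style data layout behind the retrieval task in the notes above can be sketched in a few lines of Python. This is only a minimal illustration, not the script used in the project: the `to_beir` helper, the record keys `question`, `context` and `title`, and the toy records are assumptions, and the actual column names of the retrieval-skquad dataset may differ. The three structures (corpus, queries, relevance judgements) follow the layout that BEIR data loaders return.

```python
from collections import defaultdict


def to_beir(records):
    """Convert SQuAD-style records into BEIR-style corpus, queries and qrels.

    Each record is a dict with the (hypothetical) keys "question" and
    "context", plus an optional "title".
    """
    corpus = {}                # doc_id -> {"title": str, "text": str}
    queries = {}               # query_id -> question text
    qrels = defaultdict(dict)  # query_id -> {doc_id: relevance score}
    doc_ids = {}               # context text -> doc_id, deduplicates passages

    for i, rec in enumerate(records):
        context = rec["context"]
        if context not in doc_ids:
            doc_id = f"d{len(doc_ids)}"
            doc_ids[context] = doc_id
            corpus[doc_id] = {"title": rec.get("title", ""), "text": context}
        query_id = f"q{i}"
        queries[query_id] = rec["question"]
        # The passage a question was written for is its only relevant document.
        qrels[query_id][doc_ids[context]] = 1

    return corpus, queries, dict(qrels)


# Toy records for illustration only, not taken from the real dataset.
records = [
    {"question": "Where is Košice?",
     "context": "Košice is a city in eastern Slovakia."},
    {"question": "Which country is Košice in?",
     "context": "Košice is a city in eastern Slovakia."},
    {"question": "What is the capital of Slovakia?",
     "context": "Bratislava is the capital of Slovakia."},
]

corpus, queries, qrels = to_beir(records)
print(len(corpus), "documents,", len(queries), "queries")
```

Deduplicating passages matters for question answering data, because several questions usually share one paragraph; without it the same text would appear in the corpus under several ids and correct retrievals would be counted as misses.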