---
title: Oliver Pejic
published: true
taxonomy:
    category: [iaeste]
    tag: [hatespeech,nlp]
    author: Daniel Hladek
---

Oliver Pejic

IAESTE Intern Summer 2024, 12 weeks in August, September and October.

Goal:

- Help with the [Hate Speech Project](/topics/hatespeech)
- Help with the evaluation of sentence transformer models using the [MTEB](https://github.com/embeddings-benchmark/mteb) toolkit

Final Tasks:

- Prepare an MTEB evaluation task for [Slovak hate speech](https://huggingface.co/datasets/TUKE-KEMT/hate_speech_slovak).
- Prepare an MTEB evaluation task for [Slovak question answering](https://huggingface.co/datasets/TUKE-KEMT/retrieval-skquad).
- [Machine translate](https://huggingface.co/google/madlad400-3b-mt) an SBERT evaluation set for multiple Slavic languages.
- Write a short scientific paper with the results.

Meeting 3.10.:

State:

- Prepared a pull request for Retrieval SK Quad.
- Prepared a pull request for Hate Speech Slovak.

Tasks:

- Make the pull requests comply with the MTEB contribution guidelines. Discuss them once this is done.
- Submit the pull requests to the MTEB project.
- Machine translate a dataset (HotpotQA, DBpedia, FEVER). Pick a small dataset, because translation might be slow.

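The translation task could be approached roughly as follows. This is a minimal sketch, not the script used in the project: it assumes the MADLAD-400 checkpoint linked under Final Tasks, loaded through the Hugging Face `transformers` T5 classes, with the target language selected by a `<2xx>` prefix token as described on the model card. The helper names (`add_target_prefix`, `batched`, `translate`) and the batch size and length limits are illustrative assumptions.

```python
from typing import Iterator, List


def add_target_prefix(texts: List[str], target_lang: str) -> List[str]:
    """Prepend the MADLAD-400 target-language token, e.g. '<2sk>' for Slovak."""
    return [f"<2{target_lang}> {text}" for text in texts]


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate(texts: List[str], target_lang: str = "sk",
              model_name: str = "google/madlad400-3b-mt",
              batch_size: int = 8, max_new_tokens: int = 256) -> List[str]:
    """Translate texts in batches; batch_size and length limits are assumptions."""
    # Heavy imports stay local so the helpers above work without them.
    # device_map="auto" additionally requires the `accelerate` package.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")

    translations: List[str] = []
    for batch in batched(add_target_prefix(texts, target_lang), batch_size):
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=512).to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        translations.extend(
            tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations
```

Calling `translate(["What is the capital of Slovakia?"], target_lang="sk")` downloads the 3B-parameter model on first use, so timing a small sample first gives an estimate of how long a full dataset would take.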
Non-priority tasks:

- Prepare the dataset and submit it to the Hugging Face Hub.
- Prepare an MTEB PR for the dataset.

Meeting 3.9.:

State: Studied the MTEB framework and transformers.

Tasks:

- Prepare and try MTEB evaluation tasks for the datasets. For evaluation, you can try the me5-base model.
- Fork MTEB and make the necessary modifications, including the documentation references for the tasks.
- Prepare two GitHub pull requests for the datasets; a preliminary BEIR script is provided.

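Trying a new task with a model could look roughly like the sketch below. It assumes that "me5-base" refers to the `intfloat/multilingual-e5-base` checkpoint on the Hugging Face Hub and uses the `mteb.get_tasks` / `mteb.MTEB(...).run(...)` interface of MTEB 1.x; newer releases may differ. The function names and the results folder layout are illustrative, and the task names must be the ones registered in your own fork.

```python
from typing import Iterable


def output_folder_for(model_name: str, root: str = "results") -> str:
    """Build a filesystem-safe results folder name for a Hub model id."""
    return f"{root}/{model_name.replace('/', '__')}"


def evaluate(task_names: Iterable[str],
             model_name: str = "intfloat/multilingual-e5-base"):
    """Run the named MTEB tasks with one model and return the results."""
    # Local imports: requires `pip install mteb sentence-transformers`.
    import mteb
    from sentence_transformers import SentenceTransformer

    # Assumption: the model is loaded as a plain SentenceTransformer,
    # so no "query: " / "passage: " prefixes are added for E5 models.
    model = SentenceTransformer(model_name)

    # Task names must match the names registered in the (forked) MTEB.
    tasks = mteb.get_tasks(tasks=list(task_names))
    evaluation = mteb.MTEB(tasks=tasks)
    return evaluation.run(model, output_folder=output_folder_for(model_name))
```

Running `evaluate([...])` with the name of a task from the fork writes one JSON result file per task under the output folder, which is convenient for checking a new task before opening a pull request.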
Future tasks:

- Prepare a machine translation system to create another Slovak/multilingual evaluation task from an English task.

Preparation (7.8.2024):

- Get familiar with the [SentenceTransformer](https://sbert.net/) framework, study the fundamental papers, and write down notes.
- Get familiar with the [MTEB](https://github.com/embeddings-benchmark/mteb) evaluation framework.
- Prepare a working environment on Google Colab, on the school server, or with Anaconda.
- Get familiar with the [existing finetuning scripts](https://git.kemt.fei.tuke.sk/dano/slovakretrieval).