---
title: Oliver Pejic
published: true
taxonomy:
    category: [iaeste]
    tag: [hatespeech,nlp]
    author: Daniel Hladek
---

Oliver Pejic

IAESTE Intern Summer 2024, 12 weeks in August, September and October.

Goal:

- Help with the [Hate Speech Project](/topics/hatespeech).
- Help with the evaluation of sentence transformer models using the [MTEB](https://github.com/embeddings-benchmark/mteb) toolkit.

Final Tasks:

- Prepare an MTEB evaluation task for [Slovak hate speech](https://huggingface.co/datasets/TUKE-KEMT/hate_speech_slovak).
- Prepare an MTEB evaluation task for [Slovak question answering](https://huggingface.co/datasets/TUKE-KEMT/retrieval-skquad).
- [Machine translate](https://huggingface.co/google/madlad400-3b-mt) an SBERT evaluation set into multiple Slavic languages.
- Write a short scientific paper with the results.

Meeting 3.10.:

State:

- Prepared a pull request for Retrieval SK Quad.
- Prepared a pull request for Hate Speech Slovak.

Tasks:

- Make the pull requests compatible with the MTEB contribution guidelines. Discuss them when done.
- Submit the pull requests to the MTEB project.
- Machine translate a dataset (HotpotQA, DBpedia, FEVER). Pick a short one, because translation might be slow.

Non-priority tasks:

- Prepare the translated dataset and submit it to the HuggingFace Hub.
- Prepare an MTEB PR for the dataset.
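
The machine translation task above could be organised roughly as follows. This is a minimal sketch, not the script used in the project: the helper names (`add_target_prefix`, `iter_batches`, `translate_all`, `make_madlad_translator`), the batch size and the generation length are all assumptions made for illustration. It relies on the MADLAD-400 convention of prefixing each input with a target-language token such as `<2sk>`, and it loads the model lazily so that the batching logic can be tried without downloading the 3B checkpoint.

```python
from typing import Callable, Iterator, List


def add_target_prefix(texts: List[str], target_lang: str) -> List[str]:
    """Prepend the MADLAD-400 target-language token, e.g. '<2sk>' for Slovak."""
    return [f"<2{target_lang}> {text}" for text in texts]


def iter_batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_all(
    texts: List[str],
    target_lang: str,
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,  # assumption: tune to the available GPU memory
) -> List[str]:
    """Translate texts batch by batch, preserving the input order."""
    translated: List[str] = []
    for batch in iter_batches(add_target_prefix(texts, target_lang), batch_size):
        translated.extend(translate_batch(batch))
    return translated


def make_madlad_translator(
    model_name: str = "google/madlad400-3b-mt",
    max_new_tokens: int = 256,  # assumption: enough for short passages
) -> Callable[[List[str]], List[str]]:
    """Build a translate_batch function backed by a MADLAD-400 checkpoint."""
    # Imported here so the helpers above work without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    def translate_batch(batch: List[str]) -> List[str]:
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        return tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return translate_batch
```

A call such as `translate_all(queries, "sk", make_madlad_translator())` would then translate a list of English queries into Slovak. Starting with a short dataset keeps the run time manageable, since every text requires a full generation pass.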

Meeting 3.9.:

State: Studied the MTEB framework and transformers.

Tasks:

- Prepare and try MTEB evaluation tasks for the datasets. For evaluation, you can try the me5-base model.
- Fork MTEB and make the necessary modifications, including the documentation references for the task.
- Prepare two GitHub pull requests for the datasets; a preliminary BEIR script is provided.

Future tasks:

- Prepare a machine translation system to create another Slovak/multilingual evaluation task from an English task.
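
For the retrieval task, the data has to end up in the BEIR-style layout of a corpus, a set of queries and relevance judgements (qrels). The sketch below shows one way to derive that layout from question answering records. It is not the preliminary BEIR script mentioned above: the function name `to_beir` and the input field names (`title`, `context`, `question`) are assumptions in the style of SQuAD, and the actual dataset may be organised differently.

```python
from typing import Dict, Iterable, Tuple

Corpus = Dict[str, Dict[str, str]]   # doc_id -> {"title": ..., "text": ...}
Queries = Dict[str, str]             # query_id -> query text
Qrels = Dict[str, Dict[str, int]]    # query_id -> {doc_id: relevance}


def to_beir(records: Iterable[Dict[str, str]]) -> Tuple[Corpus, Queries, Qrels]:
    """Convert SQuAD-like records (hypothetical fields) to a BEIR-style layout."""
    corpus: Corpus = {}
    queries: Queries = {}
    qrels: Qrels = {}
    doc_ids: Dict[str, str] = {}  # context text -> doc_id, to deduplicate passages

    for index, record in enumerate(records):
        context = record["context"]
        doc_id = doc_ids.get(context)
        if doc_id is None:
            # First time this passage is seen: register it in the corpus.
            doc_id = f"d{len(doc_ids)}"
            doc_ids[context] = doc_id
            corpus[doc_id] = {"title": record.get("title", ""), "text": context}

        query_id = f"q{index}"
        queries[query_id] = record["question"]
        # Each question is marked relevant only to the passage it was written for.
        qrels[query_id] = {doc_id: 1}

    return corpus, queries, qrels
```

Deduplicating passages matters because several questions usually share one context; without it the same passage would appear in the corpus under several identifiers and distort the retrieval scores.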

Preparation (7.8.2024):

- Get familiar with the [SentenceTransformer](https://sbert.net/) framework, study the fundamental papers and write down notes.
- Get familiar with the [MTEB](https://github.com/embeddings-benchmark/mteb) evaluation framework.
- Prepare a working environment on Google Colab, on the school server, or with Anaconda.
- Get familiar with the [existing finetuning scripts](https://git.kemt.fei.tuke.sk/dano/slovakretrieval).