Collaboration on final theses https://zp.kemt.fei.tuke.sk

---
title: Sevval Bulburu
published: true
taxonomy:
    category: [iaeste2023]
    tag: [hatespeech, nlp]
    author: Daniel Hladek
---

Sevval Bulburu

IAESTE Intern Summer 2023, two months

Goal: Help with the Hate Speech Project

Meeting 12.10.2023

Github Repo with results

State:

  • Proposed and tried extra layers on top of a BERT model to build a classifier, in a series of experiments. There is a single sigmoid neuron on the output.
  • Manually adjusted the Slovak HS dataset. The Slovak dataset is not balanced, so several methods for balancing it were tried. One is augmentation via Google Translate: additional samples are generated by translating into a random language and back again, which creates paraphrases of the original samples. It helps.
  • Tried SMOTE upsampling; it did not work.
  • Are the date and user name important features?
  • Tried a grid search over the neural network hyperparameters: learning rate, dropout, number of epochs, batch size. Batch size 64 performed poorly.
  • Tried multilingual models (cnerg) on the Slovak dataset; the results are not good.
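
The classifier architecture described above (extra layers on top of BERT, ending in a single sigmoid neuron) could be sketched as follows. The model name, hidden size, and dropout rate are illustrative assumptions, not the exact values used in the experiments:

```python
import torch
import torch.nn as nn
from transformers import AutoModel


class SigmoidHead(nn.Module):
    """Extra layers on top of the encoder, with a single sigmoid output neuron."""

    def __init__(self, in_dim, hidden=256, dropout=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),  # single sigmoid neuron: probability of hate speech
        )

    def forward(self, x):
        return self.net(x)


class HateSpeechClassifier(nn.Module):
    """BERT encoder + sigmoid head; the checkpoint name is an assumption."""

    def __init__(self, model_name="bert-base-multilingual-cased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.head = SigmoidHead(self.bert.config.hidden_size)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.last_hidden_state[:, 0])  # [CLS] token embedding
```

A threshold of 0.5 on the sigmoid output then gives the binary hate/non-hate decision.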

Tasks:

  • Please send me your work report. Upload your scripts and notebooks to git (git.kemt.fei.tuke.sk or GitHub) and send me a link, with a short comment about each script. You can also upload the Slovak dataset; some work has been done on it.

Ideas for a paper:

Possible names:

- "Data set balancing for Multilingual Hate Speech Detection"
- "BERT embeddings for HS Detection in Low Resource Languages" (Turkish and Slovak).

(Possible) Tasks for paper:

  • Create a shared draft document, e.g. on Overleaf or Google Doc
  • Write a good introduction and literature overview (Zuzka).
  • Try a 2- or 3-class softmax output layer for the neural network.
  • Adapt the dataset for 3-class classification.
  • Prepare classifiers for Slovak, English, and Turkish and for multiple BERT models. Try a multilingual BERT model for the baseline embeddings.
  • Measure the effect of balancing the dataset by generating additional examples.
  • Summarize experiments in tables.
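
The back-translation balancing whose effect is to be measured could be sketched like this. The `translate` callable is a placeholder for whatever service is used (e.g. Google Translate or a MarianMT model), and the pivot language list is an illustrative assumption:

```python
import random

# Illustrative pivot languages for the "random language" round trip.
PIVOT_LANGUAGES = ["en", "de", "cs", "hu", "pl"]


def back_translate(text, translate, src="sk"):
    """Translate into a random pivot language and back to get a paraphrase.

    `translate(text, src_lang, dst_lang)` is a placeholder for the actual
    translation backend.
    """
    pivot = random.choice(PIVOT_LANGUAGES)
    return translate(translate(text, src, pivot), pivot, src)


def balance_by_augmentation(minority_samples, translate, target_size):
    """Upsample the minority class with back-translated paraphrases."""
    augmented = list(minority_samples)
    while len(augmented) < target_size:
        augmented.append(back_translate(random.choice(minority_samples), translate))
    return augmented
```

Comparing classifier metrics on the original versus the augmented dataset then quantifies the effect of balancing.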

Meeting 5.9.2023

State:

Tasks:

Meeting 22.8.2023

State:

Notes:

  • ssh bulbur@idoc.fei.tuke.sk
  • nvidia-smi command to check status of GPU.
  • Use WinSCP to copy files. Use Anaconda to create and activate a new Python virtual environment. Use Visual Studio Code Remote to develop on your computer and run on the remote machine (idoc.fei.tuke.sk). Use the same credentials as for the idoc server.
  • Use WSL2 to have a local Linux just to play with.
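
The notes above can be condensed into a typical remote session; the environment name and installed packages are illustrative assumptions:

```shell
# Log in to the GPU server (same credentials as for idoc).
ssh bulbur@idoc.fei.tuke.sk

# Check GPU status and current load.
nvidia-smi

# Create and activate a fresh Python environment with Anaconda.
conda create -n hatespeech python=3.10
conda activate hatespeech
pip install torch transformers
```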

Tasks:

  • Get familiar with the task of hate speech detection. Find out how we can use Transformer neural networks to detect and categorize hate speech in internet comments written by random people.

  • Get familiar with the basic tools: Huggingface Transformers. Learn how to use https://huggingface.co/Andrazp/multilingual-hate-speech-robacofi in a Python script. Learn something about Transformer neural networks.

  • Get familiar with the Prodigy annotation tool (https://prodi.gy).

  • [-] Set up a web-based annotation environment for students (open; in cooperation with Vladimir Ferko).
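
Loading the model linked above in a Python script can be done with the Transformers `pipeline` API; the exact label names returned depend on the model's configuration:

```python
from transformers import pipeline

# Load the multilingual hate speech model from the task above.
classifier = pipeline(
    "text-classification",
    model="Andrazp/multilingual-hate-speech-robacofi",
)

# Returns a list of {"label": ..., "score": ...} dicts, one per input.
print(classifier("I really enjoyed this article, thank you!"))
```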

Ideas for annotation tools:

Future tasks (to be decided):

  • Translate the existing English dataset into Slovak. Use an OPUS English-Slovak Marian NMT model. Train a Slovak monolingual model.
  • Prepare the existing Slovak Twitter dataset, then train and evaluate a model.