---
title: Manohar Gowdru Shridhara
published: true
taxonomy:
    category: [phd2024]
    tag: [lm,nlp,hatespeech]
    author: Daniel Hladek
---

# Manohar Gowdru Shridhara

Beginning of the study: 2021

repository: https://git.kemt.fei.tuke.sk/mg240ia

## Dissertation Thesis

in 2023/24

Hate Speech Detection

Goals:

- Write a dissertation thesis
- Publish 2 A-class journal papers

## Minimal Thesis

(preliminary dissertation and exam in 2022/23)

Goals:

- Provide a state-of-the-art overview.
- Formulate dissertation theses (describe the scientific contribution of the thesis).
- Prepare to reach the scientific contribution.
- Publish 4 conference papers.

## First year of PhD study

Goals:

- Provide a state-of-the-art overview.
- Read and make notes from at least 100 scientific papers or books.
- Publish at least 2 conference papers.
- Prepare for the minimal thesis.

Resources:

- [Hate Speech Project Page](/topics/hatespeech)
- https://hatespeechdata.com/
- [Hate speech detection: Challenges and solutions](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6701757/)
- [HateBase](https://hatebase.org/)
- [Resources and benchmark corpora for hate speech detection: a systematic review](https://link.springer.com/article/10.1007/s10579-020-09502-8)

## Meeting 13.6.

- Implemented the Mayfly and Horse Herd algorithms in Python and Matlab for HS datasets.
- Wrote a draft of a paper.
- Performed experiments on HS with Word2Vec, FastText, OneHot.

Tasks:

- Implement open tasks from the previous meetings !!!!!!!!
- Share scripts with Git and drafts with online Word or Docs !!!
- Try https://huggingface.co/cardiffnlp/twitter-roberta-base-hate, try to repeat the training and evaluation.

## Meeting 24.5.

- Shared a Colab notebook with an ongoing implementation of the mayfly algorithm for preprocessing in sentiment recognition on a Twitter dataset.

Tasks:

- Implement open tasks from the previous meetings !!!
- [ ] Focus on making a baseline experiment for sentiment classification using classical methods, such as Transformers.
- [x] Consider using pre-trained embeddings: FastText, word2vec, sentence-transformers, LaBSE, LASER.

Supplemental tasks:

- [x] Finish the mayfly implementation

## Meeting 20.5.

- Learned about the Firefly / Mayfly optimization algorithms.
- Read ten papers.
- Wrote a 1-page abstract about a possible system based on DBN.

## Meeting 25.4.

- Learned about the deep learning lifecycle / evaluation, BERT, RoBERTa, GPT.
- Tried HF transformers, spaCy, NLTK, word embeddings, sentence transformers.
- Set up a repo with notes: https://git.kemt.fei.tuke.sk/mg240ia

Tasks:

- [ ] Publish experiments into the repository.
- [ ] Prepare a paper for publication in faculty proceedings http://eei.fei.tuke.sk/#!/
- [ ] Send me a draft in advance.

Supplemental tasks:

- [x] For presentation of the results, learn about https://wandb.ai/. It can display results (learning curve, etc.).
- [ ] For preparing a web application with a demo, learn about Streamlit.

## Meeting 12.4.

- Created repositories, empty so far.
- Tried to replicate the results from the papers "Emotion and sentiment analysis of tweets using BERT" and "Fine-Tuning BERT Based Approach for Multi-Class Sentiment Analysis on Twitter Emotion Data".
- The experiments are based on BERT (which kind?), Tweet Emotion Intensity.
- Prepared a Colab notebook with experiments.

Tasks:

- [ ] Finish experiments, upload source code to Git, provide a description of the experiments.
- [ ] Try to improve the results: try different kinds of BERT (RoBERTa, ELECTRA, XLNet). Can "generative models" be used (GPT, BART, T5)? Can "sentence transformers" be used (LaBSE, LASER)?
- [x] Learn about "Sentence Transformers".
- [ ] Summarize the results in a table, publish the table on Git.
- [-] Use Markdown for formatting. There is "Typora".
- [-] Continue to improve the SCYR paper.
- If you have some conference in mind, tell me.

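
For the task of summarizing results in a Markdown table, a small helper can generate the table from the experiment records so that it is not edited by hand. This is only a sketch: `markdown_table` and the `results` rows are hypothetical, the model names are taken from the task list above, and the scores are left as `TBD` because no results are reported yet.

```python
def markdown_table(rows, columns):
    """Render a list of dicts as a Markdown table with the given column order."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = ["| " + " | ".join(str(row[c]) for c in columns) + " |" for row in rows]
    return "\n".join([header, divider, *body])


# Placeholder rows: scores stay "TBD" until the experiments are finished.
results = [
    {"model": "BERT", "accuracy": "TBD", "macro-F1": "TBD"},
    {"model": "RoBERTa", "accuracy": "TBD", "macro-F1": "TBD"},
    {"model": "ELECTRA", "accuracy": "TBD", "macro-F1": "TBD"},
    {"model": "XLNet", "accuracy": "TBD", "macro-F1": "TBD"},
]

print(markdown_table(results, ["model", "accuracy", "macro-F1"]))
```

The printed text can be pasted into a `README.md` in the experiment repository, where the Git web interface renders it as a table.
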
## Meeting 25.3.22

- Learned about Transformers, BERT, LSTM and RNN.
- Tried the HuggingFace transformers library.
- Started with Google Colab: executing sentiment analysis, HF transformers pipeline functions.
- Prepared datasets: twitter-roberta datasets. Experiments are running, no results yet.
- Prepared a short note about NLP and neural networks.
- Still working on the SCYR paper.

Tasks:

- [-] Finish experiments about sentiment and present results.
- [-] Create a repository on git.kemt.fei.tuke.sk and upload your experiments, results and notes. Use your student credentials.
- [-] Continue working on the "SCYR" review paper, consider publishing it elsewhere (the first version got rejected).
- [-] Prepare an outline for another paper on sentiment classification.

## Meeting 10.3.22

- Improvement of the report.
- Installed Transformers and Anaconda.

Tasks:

- Try [this model](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment) with your own text.
- Learn how the Transformer neural network works. Learn how RoBERTa model training works. Learn how BERT model fine-tuning works. Write a short memo about your findings and the papers you read on this topic.
- Pick a dataset:
    - https://huggingface.co/datasets/sentiment140 (English)
    - https://www.clarin.si/repository/xmlui/handle/11356/1054 (multilingual)
    - https://huggingface.co/datasets/tamilmixsentiment (English-Tamil code-switching)
- Grab a baseline BERT-type model and try to fine-tune it for sentiment classification.
    - For fine-tuning and evaluation you can use this script: https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification
    - For fine-tuning you will need to install CUDA and PyTorch. It may or may not work on CPU.
    - If you need a GPU, use the school server idoc.fei.tuke.sk or Google Colab.
- Continue working on the paper.
- Remind me about the SCYR conference payment.

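
Trying the linked sentiment model with your own text could look roughly like the sketch below, which uses the `pipeline` function of the HuggingFace `transformers` library. The helper names `LABEL_NAMES`, `readable` and `classify` are hypothetical. The mapping of `LABEL_0`/`LABEL_1`/`LABEL_2` to negative/neutral/positive follows the model card and should be treated as an assumption to verify there; unknown labels are passed through unchanged.

```python
# Assumed label mapping, taken from the model card of
# cardiffnlp/twitter-roberta-base-sentiment; check the card before relying on it.
LABEL_NAMES = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}


def readable(prediction):
    """Map one pipeline prediction dict to (label name, rounded score)."""
    label = LABEL_NAMES.get(prediction["label"], prediction["label"])
    return label, round(prediction["score"], 3)


def classify(texts, model_name="cardiffnlp/twitter-roberta-base-sentiment"):
    """Download the model on first use and classify a list of strings."""
    # Imported lazily: requires the transformers package and PyTorch.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_name)
    return [readable(p) for p in classifier(texts)]
```

Calling `classify(["your own sentence here"])` downloads the model weights on the first run and returns one `(label, score)` pair per input text. Passing a different `model_name` loads another checkpoint, but the label mapping above then no longer applies.
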
## Meeting 21.2.22

- Wrote a report about HS detection (in progress).

Tasks:

- Repair the report: rewrite copied parts, order the paragraphs logically, define HS detection theoretically and formally, and analyze the datasets in detail (how do they work? what metrics do they use?).
- Install Hugging Face Transformers and go through a tutorial.

## Meeting 31.1.22

- Read some blogs about transformers
- Installed and tried transformers
- Worked on the review paper
- Picked the Twitter dataset on Kaggle
- Still selecting a method

Open tasks:

- Continue to work on the paper and share the paper with us.
- Prepare some ideas for the common discussion about the project.
- [ ] Try to prepare an experiment with the selected dataset.
- [ ] You can use the school CUDA infrastructure (idoc.fei.tuke.sk).
- [ ] Set up a repository for experiments, use the school git server git.kemt.fei.tuke.sk.
- [x] Get ready to submit a paper to the school PhD conference SCYR, deadline is in the middle of February http://scyr.kpi.fei.tuke.sk/.

## Meeting 10.1.22

- Set up a GitHub account https://github.com/ManoGS with a script to prepare the "twitter" dataset and the "english" dataset for HS detection.
- Configured a laptop with Anaconda / PyCharm, PyTorch and CUDA; went through some basic Python tutorials.
- Read some blogs on how to use Kaggle (dataset database).
- Went through tutorials on HuggingFace transformers (understanding sentiment analysis).

Open tasks:

- [x] Continue to work on the review, with datasets and methods (specified below).
- [x] Read and make notes about transformers, neural language models and fine-tuning.
- [ ] Pick a feasible dataset and method to start with.
- [ ] You can use the school CUDA infrastructure (idoc.fei.tuke.sk).
- [ ] Set up a repository for experiments, use the school git server git.kemt.fei.tuke.sk.
- [ ] Get ready to submit a paper to the school PhD conference SCYR, deadline is in the middle of February http://scyr.kpi.fei.tuke.sk/.

## Meeting 16.12.21

- A report was provided (through Teams).
- Installed Anaconda and started a Transformers tutorial
- Started the Dive into Python book

Task:

- Report: Create a detailed list of available datasets for HS.
- Report: Create a detailed description of the state-of-the-art approaches for HS detection.
- Practical: Continue with open tasks below (pick a dataset, perform classification, evaluate the experiment).

## Meeting 10.12.21

No report (just a draft) was provided so far.

1. Read the papers listed below and make notes on what you have learned from them. For each note make a bibliographic citation. Write down the authors, the title of the paper, year, publisher and other important information.
   When you find out something, make a reference with a number to that paper.
   You can use bibliographic manager software: Mendeley, EndNote, JabRef.
2. From the papers find out answers to the questions below.
3. Pick a hate speech dataset.
4. Pick an approach and Python library for HS classification.
5. Create a [GIT](https://git.kemt.fei.tuke.sk) repository and share your experiment files. Do not commit data files, just links how to download the files.
6. Perform and evaluate experiments.

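
One possible way to follow the rule in step 5 (commit links, not data) is to keep a plain list of download URLs in the repository and fetch the files with a short script. The sketch below uses only the Python standard library. The file name `datasets.txt`, the `data` directory and all function names are hypothetical choices, not an existing convention of the school Git server.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def read_links(text):
    """Return the URLs listed one per line, skipping blanks and # comments."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def target_path(url, data_dir="data"):
    """Derive the local file path for a URL from its last path component."""
    name = Path(urlparse(url).path).name or "download"
    return Path(data_dir) / name


def fetch_all(links_file="datasets.txt", data_dir="data"):
    """Download every listed file that is not already present locally."""
    Path(data_dir).mkdir(exist_ok=True)
    for url in read_links(Path(links_file).read_text()):
        path = target_path(url, data_dir)
        if not path.exists():
            urlretrieve(url, path)
```

With `data/` listed in `.gitignore`, only `datasets.txt` and the script are committed, and anyone can recreate the data directory by calling `fetch_all()`.
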
## Meeting 10.11.21

### First tasks

Prepare a report where you will explain:

- what hate speech detection is,
- where and why you can use hate speech detection,
- what the state-of-the-art methods for hate speech detection are,
- how you can evaluate a hate speech detection system,
- what datasets for hate speech detection are available.

The report should properly cite scientific bibliographical sources.
Use bibliography manager software, such as Mendeley.

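
As a starting point for the evaluation question in the report, classification systems are commonly scored with precision, recall and F1 per class, and with macro-F1 averaged over classes. The sketch below computes these from gold and predicted labels in plain Python. The function names, the label names and the toy data are hypothetical and do not come from any dataset.

```python
from collections import Counter


def precision_recall_f1(gold, predicted, positive):
    """Return (precision, recall, F1) for one class treated as positive."""
    counts = Counter()
    for g, p in zip(gold, predicted):
        if p == positive and g == positive:
            counts["tp"] += 1  # correctly predicted positive
        elif p == positive:
            counts["fp"] += 1  # predicted positive, gold is another class
        elif g == positive:
            counts["fn"] += 1  # gold positive, predicted another class
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(gold, predicted):
    """Average the per-class F1 over every class present in the gold labels."""
    classes = sorted(set(gold))
    return sum(precision_recall_f1(gold, predicted, c)[2] for c in classes) / len(classes)


# Hypothetical toy labels, for illustration only.
gold = ["hate", "hate", "none", "none", "none", "hate"]
predicted = ["hate", "none", "none", "hate", "none", "hate"]

p, r, f1 = precision_recall_f1(gold, predicted, positive="hate")
print(f"hate: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro-F1={macro_f1(gold, predicted):.2f}")
```

Macro-F1 weights every class equally, which matters for hate speech data where the hateful class is usually much smaller than the neutral one and plain accuracy can look good for a system that never predicts it.
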
Create a [VPN connection](https://uvt.tuke.sk/wps/portal/uv/sluzby/vzdialeny-pristup-vpn) to the university network to have access to the scientific databases. Use scientific indexes to discover literature:

- [Scopus](https://www.scopus.com/) (available from TUKE VPN)
- [Scholar](https://scholar.google.com)

Your review can start with:

- [Hate speech detection: Challenges and solutions](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6701757/)
- [HateBase](https://hatebase.org/)
- [Resources and benchmark corpora for hate speech detection: a systematic review](https://link.springer.com/article/10.1007/s10579-020-09502-8)

Get to know the Python programming language:

- Read [Dive into Python](https://diveintopython3.net/)
- Install [Anaconda](https://www.anaconda.com/)
- Try the [HuggingFace Transformers library](https://huggingface.co/transformers/quicktour.html)