---
title: Manohar Gowdru Shridhara
published: true
taxonomy:
category: [phd2024]
tag: [lm,nlp,hatespeech]
author: Daniel Hladek
---
# Manohar Gowdru Shridhara
Beginning of the study: 2021
repository: https://git.kemt.fei.tuke.sk/mg240ia
## Dissertation Thesis
in 2023/24
Hate Speech Detection
Goals:
- Publish and defend a minimal thesis
- Write a dissertation thesis
- Publish 2 A-class journal papers
## Second year of PhD study
Goals:
- Publish and defend a minimal thesis. Minimal thesis should contain PhD thesis statements - scientific contributions.
- Provide state-of-the-art overview.
- Formulate dissertation theses (describe scientific contribution of the thesis).
- Prepare to reach the scientific contribution.
- Publish Q2/Q3 paper
- Publish 1 school conference paper.
- Publish 1 regular conference paper.
- Prepare a demo for hate speech detection.
Meeting 6.9.2022
Status:
- Managed to move to Kosice.
- "A systematic review of Hate Speech" is in progress (ca. 50 pages + 100 references).
- "Horse herd" paper is in progress.
Tasks:
- Gather feedback on the "Systematic review", make new revisions according to the feedback, select a journal and publish.
- Pick dataset, prepare several methods of HS and compare results.
- Work on web demo of HS detection.
- Continue working on the "Horse herd" paper.
- Read provided books.
## First year of PhD study
Goals:
- [x] Provide state-of-the-art overview.
- [x] Read and make notes from at least 100 scientific papers or books.
- [ ] Publish at least 2 conference papers.
- [x] Prepare for minimal thesis.
Resources:
- [Hate Speech Project Page](/topics/hatespeech)
- https://hatespeechdata.com/
- [Hate speech detection: Challenges and solutions](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6701757/)
- [HateBase](https://hatebase.org/)
- [Resources and benchmark corpora for hate speech detection: a systematic review](https://link.springer.com/article/10.1007/s10579-020-09502-8)
14.7:
Status:
- Worked on a Horse herd implementation.
- Picked a feasible dataset and method to start with: Kannada dataset, tagging sentiment for movie reviews.
- Worked on a paper.
- Studied several papers.
- Started to work on a Streamlit demo.
Open tasks:
- Focus on making a baseline experiment for sentiment classification using classical methods, such as Transformers. PLEASE DO NOT AVOID !!!!
- Prepare a survey paper for the school journal or a conference. Use and correct the draft from the beginning. PLEASE DO NOT AVOID !!!! The goal is to identify the most current trends in methods for HS detection. Write in your own words what you learned from the literature. Write what your contribution will be. A contribution is something new, and we have to prove that it is new and better.
- Try to prepare an experiment with the selected dataset. https://git.kemt.fei.tuke.sk/mg240ia/Hate_Speech_IMAYFLY_and_HORSEHERD
- For preparing a web application with demo, learn about streamlit. In progress: https://git.kemt.fei.tuke.sk/mg240ia/Hate-Speech-Detector-Streamlit
Read papers:
- https://aclanthology.org/2020.peoples-1.6.pdf
- https://aclanthology.org/2022.ltedi-1.14/
- https://arxiv.org/abs/2108.03867
- https://arxiv.org/pdf/2112.15417v4.pdf
- https://arxiv.org/ftp/arxiv/papers/2202/2202.04725.pdf
- https://github.com/manikandan-ravikiran/DOSA/blob/main/EACL_Final_Paper.pdf
- https://aclanthology.org/2020.icon-main.13.pdf
- http://ceur-ws.org/Vol-3159/T6-4.pdf
- https://www.researchgate.net/publication/353819476_Hope_Speech_detection_in_under-resourced_Kannada_language
- https://www.researchgate.net/publication/346964457_Creation_of_Corpus_and_analysis_in_Code-Mixed_Kannada-English_Twitter_data_for_Emotion_Prediction
- https://www.semanticscholar.org/paper/Detecting-stance-in-kannada-social-media-code-mixed-SrinidhiSkanda-Kumar/f651d67211809f2036ac81c27e55d02bd061ed64
- https://www.academia.edu/81920734/Findings_of_the_Sentiment_Analysis_of_Dravidian_Languages_in_Code_Mixed_Text
- https://competitions.codalab.org/competitions/30642#learn_the_details
- https://paperswithcode.com/paper/creation-of-corpus-and-analysis-in-code-mixed
- https://paperswithcode.com/paper/hope-speech-detection-in-under-resourced#code
## Meeting 13.6.
- Implemented the Mayfly and Horse herd algorithms in Python and MATLAB for HS datasets.
- Wrote a draft of a paper.
- Performed experiments on HS with Word2Vec, FastText, OneHot.
Tasks:
- Implement open tasks from the previous meetings !!!!!!!!
- Share scripts through Git and drafts through online Word or Docs !!!
- Try https://huggingface.co/cardiffnlp/twitter-roberta-base-hate and try to repeat the training and evaluation.
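The "OneHot" representation mentioned in the status above can be illustrated with a minimal sketch. It assumes plain whitespace tokenization and builds a binary bag-of-words vector per sentence; the function names and the toy sentences are made up for illustration and are not taken from the student's experiments.

```python
def build_vocabulary(texts):
    """Map every distinct lowercase token to a column index (sorted for determinism)."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def one_hot_encode(text, vocab):
    """Return a binary vector with 1 at the index of each known token in the text."""
    vector = [0] * len(vocab)
    for tok in text.lower().split():
        idx = vocab.get(tok)
        if idx is not None:  # tokens outside the vocabulary are skipped
            vector[idx] = 1
    return vector


# Made-up toy sentences, only to show the shape of the output.
texts = ["you are great", "you are awful"]
vocab = build_vocabulary(texts)
vectors = [one_hot_encode(t, vocab) for t in texts]
print(vocab)
print(vectors)
```

Such sparse vectors carry no notion of word similarity, which is the usual reason to compare them against dense embeddings such as Word2Vec or FastText.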
## Meeting 24.5.
- Shared a Colab notebook with an ongoing implementation of the Mayfly algorithm for preprocessing in sentiment recognition on a Twitter dataset.
Tasks:
- Implement open tasks from the previous meetings !!!
- [ ] Focus on making a baseline experiment for sentiment classification using classical methods, such as Transformers.
- [x] Consider using pre-trained embeddings: FastText, word2vec, sentence-transformers, LaBSE, LASER.
Supplemental tasks:
- [x] Finish the Mayfly implementation.
## Meeting 20.5.
- Learned about the Firefly / Mayfly optimization algorithms.
- Read ten papers.
- Wrote a 1-page abstract about a possible system based on DBN.
## Meeting 25.4.
- Learned about the deep learning lifecycle and evaluation, BERT, RoBERTa, GPT.
- Tried HF Transformers, spaCy, NLTK, word embeddings, sentence transformers.
- Set up a repo with notes: https://git.kemt.fei.tuke.sk/mg240ia
Tasks:
- [ ] Publish experiments into the repository.
- [ ] Prepare a paper for publication in faculty proceedings http://eei.fei.tuke.sk/#!/
- [ ] Send me draft in advance.
Supplemental tasks:
- [x] For presentation of the results, learn about https://wandb.ai/. It can display results (learning curves, etc.).
- [ ] For preparing a web application with a demo, learn about Streamlit.
## Meeting 12.4.
- Created repositories, empty so far.
- Tried to replicate the results from "Emotion and sentiment analysis of tweets using BERT" paper and "Fine-Tuning BERT Based Approach for Multi-Class Sentiment Analysis on Twitter Emotion Data".
- The experiments are based on BERT (which kind?), Tweet Emotion Intensity.
- Prepared colab notebook with experiments.
Tasks:
- [ ] Finish the experiments, upload the source code to git, provide a description of the experiments.
- [ ] Try to improve the results: try different kinds of BERT (RoBERTa, ELECTRA, XLNet). Can "generative models" be used (GPT, BART, T5)? Can "sentence transformers" be used (LaBSE, LASER)?
- [x] Learn about "Sentence Transformers".
- [ ] Summarize the results in a table, publish the table on git.
- [-] Use Markdown for formatting. There is "Typora".
- [-] Continue to improve the SCYR paper.
- If you have some conference in mind, tell me.
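For the task of summarizing results in a table and publishing it on git, one possible approach is to generate the Markdown table from a list of result records. The sketch below is only an illustration: the helper name `results_to_markdown`, the column names, and the model identifiers are assumptions, and the cells are deliberately left as `TODO` because no measured results are recorded in these notes.

```python
def results_to_markdown(rows, columns):
    """Render a list of dicts as a Markdown table with the given column order."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        "| " + " | ".join(str(row.get(col, "")) for col in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, divider] + body)


# Hypothetical column names and model identifiers; fill in real measurements.
columns = ["model", "accuracy", "macro_f1"]
rows = [
    {"model": "bert-base-uncased", "accuracy": "TODO", "macro_f1": "TODO"},
    {"model": "roberta-base", "accuracy": "TODO", "macro_f1": "TODO"},
]
table = results_to_markdown(rows, columns)
print(table)
```

The printed text can be pasted into a README in the repository, so the table stays readable both in the git web interface and in plain text.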
## Meeting 25.3.22
- Learned about Transformers, BERT, LSTM and RNN.
- Tried the Hugging Face Transformers library.
- Started using Google Colab: executing sentiment analysis with the HF Transformers pipeline functions.
- Prepared datasets: twitter-roberta datasets. Experiments are running, no results yet.
- Prepared a short note about NLP and neural networks.
- Still working on the SCYR paper.
Tasks:
- [-] Finish the experiments on sentiment and present the results.
- [-] Create a repository on git.kemt.fei.tuke.sk and upload your experiments, results and notes. Use your student credentials.
- [-] Continue working on the "SCYR" review paper, consider publishing it elsewhere (the first version got rejected).
- [-] Prepare an outline for another paper on sentiment classification.
## Meeting 10.3.22
- Improvement of the report.
- Installed Transformers and Anaconda
Tasks:
- Try [this model](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment) with your own text.
- Learn how the Transformer neural network works. Learn how RoBERTa model training works. Learn how BERT model fine-tuning works. Write a short memo about your findings and the papers you read on this topic.
- Pick a dataset:
    - https://huggingface.co/datasets/sentiment140 (English)
    - https://www.clarin.si/repository/xmlui/handle/11356/1054 (multilingual)
    - https://huggingface.co/datasets/tamilmixsentiment (English-Tamil code-switching)
- Grab a baseline BERT-type model and try to fine-tune it for sentiment classification.
    - For fine-tuning and evaluation you can use this script: https://github.com/huggingface/transformers/tree/master/examples/pytorch/text-classification
    - For fine-tuning you will need to install CUDA and PyTorch. It may or may not work on CPU.
    - If you need a GPU, use the school server idoc.fei.tuke.sk or Google Colab.
- Continue working on the paper.
- Remind me about the SCYR conference payment.
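The first task above (trying the sentiment model with your own text) and the remark about CPU versus GPU can be sketched as follows. This is a minimal sketch, assuming the `transformers` and `torch` packages are installed; the helper names `pick_device_index` and `classify` are made up for illustration. The libraries are imported lazily, so the file can be loaded even on a machine where they are missing.

```python
def pick_device_index():
    """Return 0 (first GPU) when PyTorch can see CUDA, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        return -1  # no PyTorch installed: fall back to the CPU index
    return 0 if torch.cuda.is_available() else -1


def classify(texts, model_name="cardiffnlp/twitter-roberta-base-sentiment"):
    """Run a Hugging Face text-classification pipeline on a list of texts.

    The model is downloaded on first use, so network access is required.
    """
    from transformers import pipeline  # lazy import: needs transformers + torch

    classifier = pipeline(
        "text-classification",
        model=model_name,
        device=pick_device_index(),  # -1 selects the CPU, 0 the first GPU
    )
    return classifier(list(texts))


print("device index:", pick_device_index())
```

Calling `classify(["your own sentence here"])` should return one dictionary per input with a `label` and a `score` field; how the label names map to sentiment classes is described on the model card.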
## Meeting 21.2.22
- Wrote a report about HS detection (in progress).
Tasks:
- Repair the report: rewrite the copied parts, order the paragraphs logically, define HS detection theoretically and formally, and analyze the datasets in detail (how they work and which metrics they use).
- Install Hugging Face Transformers and go through a tutorial.
## Meeting 31.1.22
- Read some blogs about transformers.
- Installed and tried Transformers.
- Worked on the review paper.
- Picked the Twitter dataset on Kaggle.
- Still selecting a method.
Open tasks:
- Continue to work on the paper and share the paper with us.
- Prepare some ideas for the common discussion about the project.
- [ ] Try to prepare an experiment with the selected dataset.
- [ ] You can use the school CUDA infrastructure (idoc.fei.tuke.sk).
- [ ] Set up a repository for experiments, use the school git server git.kemt.fei.tuke.sk.
- [x] Get ready to submit a paper to the school PhD conference SCYR; the deadline is in the middle of February: http://scyr.kpi.fei.tuke.sk/.
### Meeting 10.1.22
- Set up a GitHub account https://github.com/ManoGS with a script to prepare the "twitter" and "english" datasets for HS detection.
- Configured a laptop with Anaconda, PyCharm, PyTorch and CUDA; went through some basic Python tutorials.
- Read some blogs on how to use Kaggle (a dataset database).
- Went through tutorials on Hugging Face Transformers to understand sentiment analysis.
Open tasks:
- [x] Continue to work on the review - with datasets and methods (specified below).
- [x] Read and make notes about transformers, neural language models and fine-tuning.
- [ ] Pick a feasible dataset and method to start with.
- [ ] You can use the school CUDA infrastructure (idoc.fei.tuke.sk).
- [ ] Set up a repository for experiments, use the school git server git.kemt.fei.tuke.sk.
- [ ] Get ready to submit a paper to the school PhD conference SCYR; the deadline is in the middle of February: http://scyr.kpi.fei.tuke.sk/.
#### Meeting 16.12.21
- A report was provided (through Teams).
- Installed Anaconda and started a Transformers tutorial.
- Started the Dive into Python book.
Task:
- Report: Create a detailed list of available datasets for HS.
- Report: Create a detailed description of the state of the art approaches for HS detection.
- Practical: Continue with the open tasks below (pick a dataset, perform classification, evaluate the experiment).
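The practical loop of picking a dataset, performing classification and evaluating can be tried on a very small scale before moving to neural models. The sketch below is not the method used in the student's experiments: it swaps in a multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only, and trains it on four made-up toy sentences.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train_naive_bayes(texts, labels):
    """Collect per-class document counts, per-class token counts and the vocabulary."""
    class_counts = Counter(labels)
    token_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        for tok in tokenize(text):
            token_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, token_counts, vocab


def predict(text, model):
    """Return the class with the highest log-probability under add-one smoothing."""
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)  # class prior
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy data, only to make the sketch runnable.
train_texts = ["great movie", "loved this film", "terrible movie", "hated this film"]
train_labels = ["pos", "pos", "neg", "neg"]
model = train_naive_bayes(train_texts, train_labels)
prediction = predict("great film", model)
print(prediction)
```

A baseline like this gives a reference number that any fine-tuned Transformer model should beat on the same train and test split.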
#### Meeting 10.12.21
No report (just a draft) has been provided so far.
1. Read the papers below and make notes on what you have learned from them. For each note, make a bibliographic citation: write down the authors, the title of the paper, the year, the publisher and other important information.
When you find out something, make a numbered reference to that paper.
You can use bibliography manager software such as Mendeley, EndNote or JabRef.
2. From the papers find out answers to the questions below.
3. Pick a hatespeech dataset.
4. Pick an approach and Python library for HS classification.
5. Create a [Git](https://git.kemt.fei.tuke.sk) repository and share your experiment files. Do not commit data files, only links and instructions for downloading them.
6. Perform and evaluate experiments.
#### Meeting 10.11.21
#### First tasks
Prepare a report where you will explain:
- what is hate speech detection,
- where and why you can use hate-speech detection,
- what are state-of-the-art methods for hate speech detection,
- how can you evaluate a hate-speech detection system,
- what datasets for hate-speech detection are available.
The report should properly cite scientific bibliographical sources.
Use a bibliography manager software, such as Mendeley.
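For the report question on how to evaluate a hate-speech detection system, precision, recall and F1 for the positive class are the usual starting point. The following is a minimal sketch using only the standard library; the label names `hate` and `none` and the example predictions are made up for illustration and do not come from any real experiment.

```python
def precision_recall_f1(y_true, y_pred, positive="hate"):
    """Compute precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Made-up gold labels and predictions, only to exercise the function.
y_true = ["hate", "hate", "none", "none", "hate"]
y_pred = ["hate", "none", "none", "hate", "hate"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because hate-speech datasets are often imbalanced, accuracy alone can look good for a system that never predicts the hateful class, which is why per-class and macro-averaged F1 are commonly reported as well.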
Create a [VPN connection](https://uvt.tuke.sk/wps/portal/uv/sluzby/vzdialeny-pristup-vpn) to the university network to have access to the scientific databases. Use scientific indexes to discover literature:
- [Scopus](https://www.scopus.com/) (available from TUKE VPN)
- [Scholar](https://scholar.google.com)
Your review can start with:
- [Hate speech detection: Challenges and solutions](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6701757/)
- [HateBase](https://hatebase.org/)
- [Resources and benchmark corpora for hate speech detection: a systematic review](https://link.springer.com/article/10.1007/s10579-020-09502-8)
Get to know the Python programming language:
- Read [Dive into Python](https://diveintopython3.net/)
- Install [Anaconda](https://www.anaconda.com/)
- Try the [Hugging Face Transformers library](https://huggingface.co/transformers/quicktour.html)