BP2024/preprocessing/dataProcessing.ipynb


2024-04-09 13:39:11 +00:00
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"id": "3e42b2f9-aef4-4727-bd1b-126ac70975dd",
"metadata": {},
"source": [
"# Data processing\n",
"This notebook applies the following changes to the scraped dataset:\n",
"- data cleaning (removing null and empty-string values)\n",
"- GDPR protection (anonymizing author names)\n",
"- merging the per-profile data\n",
"- data analysis\n",
"\n",
"The result is a dataset prepared for annotation in Prodigy (prodi.gy)."
]
},
{
"cell_type": "code",
"execution_count": 72,
"id": "556ce1b6-fd72-4a57-a3a1-1cfc2d2432f5",
"metadata": {},
"outputs": [],
"source": [
"# imports\n",
"\n",
"import json\n",
"import os\n",
"import matplotlib.pyplot as plt\n",
"from print_dict import pd"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "b73bf010-f549-4ed7-aab5-460acf12c36c",
"metadata": {},
"source": [
"#### variable declaration"
]
},
{
"cell_type": "code",
"execution_count": 73,
"id": "9ca195f9-b638-4063-af10-406a8f6efad2",
"metadata": {},
"outputs": [],
"source": [
"basePath = 'Facebook/outputs/parts'\n",
"nameDict= {}\n",
"finalData= []"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "00011939-93c3-41d5-ae22-ce9f7ee77705",
"metadata": {},
"source": [
"## Opening parts of the dataset"
]
},
{
"cell_type": "code",
"execution_count": 74,
"id": "bb84d441-699d-4948-96cc-0b20f2dee412",
"metadata": {},
"outputs": [],
"source": [
"def openJson(filename):\n",
"    # Load one scraped profile; the encoding is explicit because the scraped text is Slovak\n",
"    with open(os.path.join(os.getcwd(), basePath, filename), encoding='utf-8') as json_file:\n",
"        return json.load(json_file)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "130372a3-2397-4721-9537-2a503ccfd9ce",
"metadata": {},
"source": [
"### Politicians"
]
},
{
"cell_type": "code",
"execution_count": 75,
"id": "52d3b818-c0c2-4879-9f99-f785c6f5cadc",
"metadata": {},
"outputs": [],
"source": [
"robertFico= openJson(\"robert_fico_data.json\")\n",
"igorMatovic= openJson(\"igor_matovic_data.json\")\n",
"erikKalinak= openJson(\"erik_kaliňák_data.json\")\n",
"zuzanaCaputova= openJson(\"zuzana_čaputová_data.json\")\n",
"marianKotleba= openJson(\"marian_kotleba_data.json\")\n",
"\n",
"politicianNames = ['Robert Fico', 'Igor Matovic', 'Erik Kalinak', 'Zuzana Caputova', 'Marian Kotleba']"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f15e1f1e-0d76-4e90-ba5e-0754a45d449f",
"metadata": {},
"source": [
"### Memes"
]
},
{
"cell_type": "code",
"execution_count": 76,
"id": "1e2e0dbb-a17a-407b-b8a2-1476a9d6cad1",
"metadata": {},
"outputs": [],
"source": [
"zomri= openJson(\"zomri_data.json\")\n",
"emefka= openJson(\"emefka_data.json\")\n",
"okAleIdesPrvy= openJson(\"ok,ale_ideš_prvý_:d_data.json\")\n",
"\n",
"memeNames= ['Zomri', 'Emefka', 'OkAleIdesPrvy']"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "b56544ac-3c0d-438b-8632-296703bf6ef6",
"metadata": {},
"source": [
"### Media"
]
},
{
"cell_type": "code",
"execution_count": 77,
"id": "6a8dfa3c-ab65-4d0e-9868-e5400eda7906",
"metadata": {},
"outputs": [],
"source": [
"eva= openJson(\"eva_-_hriešne_dobrá_data.json\")\n",
"aktuality= openJson(\"aktuality_data.json\")\n",
"dennikN= openJson(\"denník_n_data.json\")\n",
"tvJOJ= openJson(\"televízia_joj_data.json\")\n",
"\n",
"mediaNames= ['Eva', 'Aktuality', 'DennikN', 'tvJOJ']"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "23fc715c-d2f0-4e48-be9f-e0adea43e987",
"metadata": {},
"source": [
"### Famous people"
]
},
{
"cell_type": "code",
"execution_count": 78,
"id": "8ff4a0fc-d87b-4484-80e6-0651d772a5c3",
"metadata": {},
"outputs": [],
"source": [
"peterMarcin= openJson(\"peter_marcin_data.json\")\n",
"sajfa= openJson(\"sajfa_data.json\")\n",
"janKolenik= openJson(\"ján_koleník_data.json\")\n",
"\n",
"famousNames = ['Peter Marcin', 'Sajfa', 'Jan Kolenik']"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c21fc6d6-2f8c-4cd1-b0ad-746cffbe008b",
"metadata": {},
"source": [
"### Sports"
]
},
{
"cell_type": "code",
"execution_count": 79,
"id": "3df20d49-08d3-49b2-a512-c3a55eb1a38b",
"metadata": {},
"outputs": [],
"source": [
"sport24= openJson(\"šport24_data.json\")\n",
"dominikaCibulkova= openJson(\"dominika_cibulkova_data.json\")\n",
"hetrik= openJson(\"hetrik_data.json\")\n",
"RTVSsport= openJson(\"šport_v_rtvs_data.json\")\n",
"\n",
"sportsNames= ['Sport24', 'Dominika Cibulkova', 'Hetrik', 'RTVSsport']"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "d929dd8d-43ab-465e-8b2a-62732aa56217",
"metadata": {},
"source": [
"## Dataset peek"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c209e264-4c40-4f1c-b1de-c1b71969d717",
"metadata": {},
"source": [
"A sample of the dataset might look like this:\n",
"\n",
"    [\n",
"        {\n",
"            'publisher': Post author,\n",
"            'title': Text that the author published as a post,\n",
"            'post_reactions': Number of interactions with a post,\n",
"            'comments': [\n",
"                {\n",
"                    'publisher': Comment author,\n",
"                    'text': Comment's content,\n",
"                    'replies': [\n",
"                        {\n",
"                            'publisher': Reply author,\n",
"                            'text': Reply's content,\n",
"                            'replies': [\n",
"                                {\n",
"                                    'publisher': Second reply author,\n",
"                                    'text': Second reply's content\n",
"                                }\n",
"                            ]\n",
"                        }\n",
"                    ]\n",
"                }\n",
"            ]\n",
"        }\n",
"    ]"
]
},
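{
"attachments": {},
"cell_type": "markdown",
"id": "sketch-iter-records",
"metadata": {},
"source": [
"The nested layout above can be traversed level by level. The following is a minimal sketch and not part of the original pipeline: `iterRecords` is a hypothetical helper, and it assumes that a missing `comments` or `replies` key can be treated as an empty list.\n",
"\n",
"```python\n",
"def iterRecords(posts):\n",
"    # Yield (depth, record): 0 = post, 1 = comment, 2 = reply, 3 = second reply\n",
"    for post in posts:\n",
"        yield 0, post\n",
"        for comment in post.get('comments', []):\n",
"            yield 1, comment\n",
"            for reply in comment.get('replies', []):\n",
"                yield 2, reply\n",
"                for sec_reply in reply.get('replies', []):\n",
"                    yield 3, sec_reply\n",
"\n",
"# Hypothetical one-post sample with one comment and one reply\n",
"sample = [{'publisher': 'A', 'title': 't', 'post_reactions': 1, 'comments': [\n",
"    {'publisher': 'B', 'text': 'c', 'replies': [\n",
"        {'publisher': 'C', 'text': 'r', 'replies': []}]}]}]\n",
"print([depth for depth, _ in iterRecords(sample)])  # [0, 1, 2]\n",
"```\n",
"\n",
"Counting the yielded items gives the total number of records across all four levels."
]
},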
{
"attachments": {},
"cell_type": "markdown",
"id": "ea82f129-37f3-4e38-8729-54c2b7526557",
"metadata": {},
"source": [
"## Counting records"
]
},
{
"cell_type": "code",
"execution_count": 80,
"id": "4eb2d369-df75-47bc-83f2-5759793a5edd",
"metadata": {},
"outputs": [],
"source": [
"def getNumOfRecords(dictObj):\n",
"    count= 0\n",
"    for post in dictObj:\n",
"        count+= 1\n",
"        for comment in post['comments']:\n",
"            count+= 1\n",
"            for reply in comment['replies']:\n",
"                count+= 1\n",
"                for sec_reply in reply['replies']:\n",
"                    count+= 1\n",
"    return count"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "9cd23b0e-8933-4d0d-9242-bab8a89e1fe1",
"metadata": {},
"source": [
"## Junk data cleaning"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "fcaf221b-1128-41a8-a0f4-135fd6b6df8f",
"metadata": {},
"source": [
"This part of the code iterates through the whole dataset and deletes every comment or reply whose text is NULL or an empty string (\"\")."
]
},
{
"cell_type": "code",
"execution_count": 81,
"id": "364f6cdb-64d7-40ea-bcc9-b2b7a2ac8dd5",
"metadata": {},
"outputs": [],
"source": [
"def isBlank(record):\n",
"    # True when the record has a 'text' field that is None or an empty string\n",
"    return 'text' in record and record['text'] in ('', None)\n",
"\n",
"def cleanData(dictObj):\n",
"    # Returns the cleaned data and the number of removed records.\n",
"    # Each loop runs over a copy (list(...)), so removing an item from the\n",
"    # original list does not make the loop skip the element that follows it.\n",
"    i = 0\n",
"    for post in dictObj:\n",
"        for comment in list(post['comments']):\n",
"            if isBlank(comment):\n",
"                post['comments'].remove(comment)\n",
"                i += 1\n",
"                continue\n",
"            for reply in list(comment['replies']):\n",
"                if isBlank(reply):\n",
"                    comment['replies'].remove(reply)\n",
"                    i += 1\n",
"                    continue\n",
"                for sec_reply in list(reply['replies']):\n",
"                    if isBlank(sec_reply):\n",
"                        reply['replies'].remove(sec_reply)\n",
"                        i += 1\n",
"    return dictObj, i"
]
},
{
"cell_type": "code",
"execution_count": 82,
"id": "327576e2-55b4-439a-af6c-0364a4a7f3ec",
"metadata": {},
"outputs": [],
"source": [
"def cleanEntireData(dictObj):\n",
"    recordsDeleted= 0\n",
"    temp, i= cleanData(dictObj)\n",
"    recordsDeleted += i\n",
"    while i:\n",
"        temp, i= cleanData(dictObj)\n",
"        recordsDeleted += i\n",
"    return temp"
]
},
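{
"attachments": {},
"cell_type": "markdown",
"id": "sketch-prune-blank",
"metadata": {},
"source": [
"Removing items from a list while looping over it can skip elements, which is presumably why `cleanEntireData` repeats the pass until nothing more is deleted. One alternative is to build new lists instead of editing them in place. The sketch below is only an illustration under that assumption; `withoutBlankText` is a hypothetical helper, not part of the original notebook.\n",
"\n",
"```python\n",
"def withoutBlankText(records):\n",
"    # Return a new list without records whose 'text' is None or ''; recurse into 'replies'\n",
"    kept = []\n",
"    for record in records:\n",
"        if 'text' in record and record['text'] in ('', None):\n",
"            continue\n",
"        cleaned = dict(record)  # shallow copy so the input record is left untouched\n",
"        if 'replies' in cleaned:\n",
"            cleaned['replies'] = withoutBlankText(cleaned['replies'])\n",
"        kept.append(cleaned)\n",
"    return kept\n",
"\n",
"comments = [{'text': '', 'replies': []},\n",
"            {'text': 'ok', 'replies': [{'text': None, 'replies': []},\n",
"                                       {'text': 'yes', 'replies': []}]}]\n",
"print(withoutBlankText(comments))\n",
"# [{'text': 'ok', 'replies': [{'text': 'yes', 'replies': []}]}]\n",
"```\n",
"\n",
"For a whole post this would be applied as `post['comments'] = withoutBlankText(post['comments'])`, and a single pass is enough."
]
},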
{
"attachments": {},
"cell_type": "markdown",
"id": "5fc31f16-6b15-48e0-bd0f-2761bf793d26",
"metadata": {},
"source": [
"## Data preparation for visualisation"
]
},
{
"cell_type": "code",
"execution_count": 83,
"id": "242e2673-216b-42de-9c9d-137aa1b975f3",
"metadata": {},
"outputs": [],
"source": [
"categories= politicianNames + memeNames + mediaNames + sportsNames + famousNames\n",
"values= [\n",
"    getNumOfRecords(cleanEntireData(robertFico)), # politics topic\n",
"    getNumOfRecords(cleanEntireData(igorMatovic)),\n",
"    getNumOfRecords(cleanEntireData(erikKalinak)),\n",
"    getNumOfRecords(cleanEntireData(zuzanaCaputova)),\n",
"    getNumOfRecords(cleanEntireData(marianKotleba)),\n",
"\n",
"    getNumOfRecords(cleanEntireData(zomri)), # meme topic\n",
"    getNumOfRecords(cleanEntireData(emefka)),\n",
"    getNumOfRecords(cleanEntireData(okAleIdesPrvy)),\n",
"\n",
"    getNumOfRecords(cleanEntireData(eva)), # media topic\n",
"    getNumOfRecords(cleanEntireData(aktuality)),\n",
"    getNumOfRecords(cleanEntireData(dennikN)),\n",
"    getNumOfRecords(cleanEntireData(tvJOJ)),\n",
"\n",
"    getNumOfRecords(cleanEntireData(sport24)), # sports topic\n",
"    getNumOfRecords(cleanEntireData(dominikaCibulkova)),\n",
"    getNumOfRecords(cleanEntireData(hetrik)),\n",
"    getNumOfRecords(cleanEntireData(RTVSsport)),\n",
"\n",
"    getNumOfRecords(cleanEntireData(peterMarcin)), # famous people topic\n",
"    getNumOfRecords(cleanEntireData(sajfa)),\n",
"    getNumOfRecords(cleanEntireData(janKolenik))\n",
"]"
]
},
{
"cell_type": "code",
"execution_count": 84,
"id": "c014ec40-db05-4ce4-802e-fe5c84c3f123",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Sort the values while keeping each category aligned with its value\n",
"zoradene = sorted(zip(values, categories))\n",
"\n",
"# Split the sorted pairs back into two sequences\n",
"sortedValues, sortedCategories = zip(*zoradene)\n",
"\n",
"# Select the style before drawing so that it applies to the whole figure\n",
"plt.style.use('fivethirtyeight')\n",
"plt.barh(sortedCategories, sortedValues, color= '#6b5b95')\n",
"\n",
"# Add axis labels and a title\n",
"plt.xlabel('Number of records')\n",
"plt.ylabel('Facebook profiles')\n",
"plt.title('Number of scraped records per profile')\n",
"\n",
"# Show the plot\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 85,
"id": "a5db69e8-1056-40ee-abe9-419e4382d508",
"metadata": {},
"outputs": [],
"source": [
"politiciansTotalRecords= sum(values[:5])\n",
"memeTotalRedcords= sum(values[5:8])\n",
"mediaTotalRecords= sum(values[8:12])\n",
"sportsTotalRecords= sum(values[12:16])\n",
"famousTotalRecords= sum(values[16:20])\n",
"totalRecords= sum(values)"
]
},
{
"cell_type": "code",
"execution_count": 86,
"id": "abc5189f-a90f-4b9d-9471-f6febcb02e12",
"metadata": {},
"outputs": [],
"source": [
"def showPie(title, records):\n",
"    plt.pie(\n",
"        records,\n",
"        labels=[\n",
"            'Politician',\n",
"            'Meme',\n",
"            'Media',\n",
"            'Sports',\n",
"            'Famous people'\n",
"        ],\n",
"        wedgeprops={\n",
"            'edgecolor': 'black'\n",
"        },\n",
"        colors=[\n",
"            \"#ffb400\",\n",
"            \"#d2980d\",\n",
"            \"#a57c1b\",\n",
"            \"#786028\",\n",
"            \"#724545\"\n",
"        ],\n",
"        autopct='%1.1f%%'\n",
"    )\n",
"    plt.title(title)\n",
"    plt.tight_layout()\n",
"    plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 87,
"id": "dc171573-3bd5-49cb-a22b-9c221a0dca52",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"scrapedTopicRatio= [\n",
"    politiciansTotalRecords,\n",
"    memeTotalRedcords,\n",
"    mediaTotalRecords,\n",
"    sportsTotalRecords,\n",
"    famousTotalRecords\n",
"]\n",
"\n",
"showPie('Share of scraped records per category', scrapedTopicRatio)"
]
},
{
"cell_type": "code",
"execution_count": 88,
"id": "823b6688-cd25-4af4-9e80-2d9834bb9147",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Total number of scraped records is 378677.'"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"f\"Total number of scraped records is {totalRecords}.\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "0d5dbe9c-c26d-4928-8d6e-9c77c43c7224",
"metadata": {},
"source": [
"## Dataset merge"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "68d9d3f9-4716-4711-9568-d0db1b04a58b",
"metadata": {},
"source": [
"### Meme topic"
]
{
"cell_type": "code",
"execution_count": 89,
"id": "26cbe373-0808-4332-8492-bbdb597a7584",
"metadata": {},
"outputs": [],
"source": [
"memesFinal = []\n",
"\n",
"memesFinal= zomri.copy()\n",
"memesFinal.extend(emefka)\n",
"\n",
"finalMemeRecords= getNumOfRecords(memesFinal)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "7d8f11e6-b146-4744-848d-aa594a7e65ee",
"metadata": {},
"source": [
"### Media topic"
]
{
"cell_type": "code",
"execution_count": 90,
"id": "1f9eb3c1-a035-4d89-8745-81e93ee92bdb",
"metadata": {},
"outputs": [],
"source": [
"mediaFinal = []\n",
"\n",
"mediaFinal= aktuality.copy()\n",
"mediaFinal.extend(dennikN)\n",
"mediaFinal.extend(eva[1:])\n",
"mediaFinal.extend(tvJOJ)\n",
"\n",
"finalMediaRecords= getNumOfRecords(mediaFinal)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "5c15265d-0c9a-4c9c-a88b-7a2ca54f1201",
"metadata": {},
"source": [
"### Sports topic"
]
{
"cell_type": "code",
"execution_count": 91,
"id": "369d3221-156f-4a0b-81c2-61b652d3a6fb",
"metadata": {},
"outputs": [],
"source": [
"sportsFinal= []\n",
"\n",
"sportsFinal= dominikaCibulkova.copy()\n",
"sportsFinal.extend(RTVSsport)\n",
"sportsFinal.extend(sport24)\n",
"sportsFinal.extend(hetrik)\n",
"\n",
"finalSportsRecords= getNumOfRecords(sportsFinal)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "e1db48f6-3f7c-4869-b7df-349777ba0e67",
"metadata": {},
"source": [
"### Famous people topic"
]
{
"cell_type": "code",
"execution_count": 92,
"id": "f6e0d3c7-9f64-4c0b-964d-69dd1af9a518",
"metadata": {},
"outputs": [],
"source": [
"famousFinal= []\n",
"\n",
"famousFinal= peterMarcin.copy()\n",
"famousFinal.extend(janKolenik)\n",
"famousFinal.extend(sajfa)\n",
"\n",
"finalFamousRecords= getNumOfRecords(famousFinal)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "338d94c5-de92-4998-a63d-caf8a5956f98",
"metadata": {},
"source": [
"### Politics topic"
]
{
"cell_type": "code",
"execution_count": 93,
"id": "f885b47a-b5a0-4792-8730-ca213da5262a",
"metadata": {},
"outputs": [],
"source": [
"politicsFinal= []\n",
"\n",
"politicsFinal= marianKotleba.copy()\n",
"politicsFinal.extend(zuzanaCaputova)\n",
"politicsFinal.extend(erikKalinak[:25])\n",
"politicsFinal.extend(igorMatovic)\n",
"politicsFinal.extend(robertFico[:40])\n",
"\n",
"finalPoliticsRecords= getNumOfRecords(politicsFinal)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f5cf9d40-1828-407c-9cb3-5110b8d1fc83",
"metadata": {},
"source": [
"### Merge data"
]
},
{
"cell_type": "code",
"execution_count": 94,
"id": "24659f5b-834d-4c71-9197-6a4f6356c032",
"metadata": {},
"outputs": [],
"source": [
"finalData = famousFinal.copy()\n",
"finalData.extend(politicsFinal)\n",
"finalData.extend(mediaFinal)\n",
"finalData.extend(memesFinal)\n",
"finalData.extend(sportsFinal)"
]
},
{
"cell_type": "code",
"execution_count": 95,
"id": "56b3b771-cf66-4617-b9be-06c42fb5aa41",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1736"
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(finalData)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "de6674b2-cd3b-4457-a727-d752337a83a8",
"metadata": {},
"source": [
"## Dataset topic ratio visualisation"
]
{
"cell_type": "code",
"execution_count": 96,
"id": "7ca4acec-ae51-4f6f-9187-344759fd7e9f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"finalDatasetTopicRatio = [\n",
"    finalPoliticsRecords,\n",
"    finalMemeRecords,\n",
"    finalMediaRecords,\n",
"    finalSportsRecords,\n",
"    finalFamousRecords\n",
"]\n",
"showPie(\"Share of topics in the final version of the dataset\", finalDatasetTopicRatio)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "7142c3e9-f14c-4514-b660-429a5f163abb",
"metadata": {},
"source": [
"## GDPR protection"
]
},
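{
"attachments": {},
"cell_type": "markdown",
"id": "4c47e04a-1aef-4201-9f97-71d0851d01e6",
"metadata": {},
"source": [
"The cell below anonymizes the data: every author name (for example \"Jozko Mrkvicka\") is replaced with a placeholder tag such as `<user1>`, both in the `publisher` fields and wherever an already known name appears in the text of a comment or reply."
]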
{
"cell_type": "code",
"execution_count": 97,
"id": "b88bbe75-9984-4389-9e72-fbe1d27401fd",
"metadata": {},
"outputs": [],
"source": [
"def protectGDPR(dictObj):\n",
"    usr_idx = 1\n",
"    for post_idx, post in enumerate(dictObj):\n",
"        if post['publisher'] not in nameDict:\n",
"            nameDict[post['publisher']] = f'<user{usr_idx}>'\n",
"            usr_idx+= 1\n",
"        if post['publisher'] in nameDict:\n",
"            dictObj[post_idx]['publisher'] = nameDict[post['publisher']]\n",
"        for comment_idx, comment in enumerate(post['comments']):\n",
"            if comment['publisher'] not in nameDict:\n",
"                nameDict[comment['publisher']] = f'<user{usr_idx}>'\n",
"                usr_idx+= 1\n",
"            if comment['publisher'] in nameDict:\n",
"                dictObj[post_idx]['comments'][comment_idx]['publisher'] = nameDict[comment['publisher']]\n",
"\n",
"            for item, value in nameDict.items():\n",
"                comment['text'] = comment['text'].replace(item, value)\n",
"\n",
"            for reply_idx, reply in enumerate(comment['replies']):\n",
"                if reply['publisher'] not in nameDict:\n",
"                    nameDict[reply['publisher']] = f'<user{usr_idx}>'\n",
"                    usr_idx+= 1\n",
"                if reply['publisher'] in nameDict:\n",
"                    dictObj[post_idx]['comments'][comment_idx]['replies'][reply_idx]['publisher'] = nameDict[reply['publisher']]\n",
"\n",
"                for item, value in nameDict.items():\n",
"                    reply['text'] = reply['text'].replace(item, value)\n",
"\n",
"                for sec_reply_idx, sec_reply in enumerate(reply['replies']):\n",
"                    if sec_reply['publisher'] not in nameDict:\n",
"                        nameDict[sec_reply['publisher']] = f'<user{usr_idx}>'\n",
"                        usr_idx+= 1\n",
"                    if sec_reply['publisher'] in nameDict:\n",
"                        dictObj[post_idx]['comments'][comment_idx]['replies'][reply_idx]['replies'][sec_reply_idx]['publisher'] = nameDict[sec_reply['publisher']]\n",
"                    for item, value in nameDict.items():\n",
"                        sec_reply['text'] = sec_reply['text'].replace(item, value)\n",
"\n",
"    return dictObj"
]
},
{
"cell_type": "code",
"execution_count": 98,
"id": "5cfaf396-98c0-4755-8181-dc54d775b8b4",
"metadata": {},
"outputs": [],
"source": [
"finalData= protectGDPR(finalData)"
]
},
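{
"attachments": {},
"cell_type": "markdown",
"id": "sketch-anonymize-text",
"metadata": {},
"source": [
"Plain `str.replace` substitutes a name wherever its characters occur, including inside a longer word. If that turns out to matter for this data, one option is a regular expression that only matches whole names. The sketch below is an illustration under that assumption, not the method used above; `anonymizeText` and the sample names are hypothetical.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"def anonymizeText(text, nameToTag):\n",
"    # Try longer names first so that 'Jan Novak' wins over 'Jan'\n",
"    names = sorted(nameToTag, key=len, reverse=True)\n",
"    if not names:\n",
"        return text\n",
"    # Match a name only when it is not surrounded by other word characters\n",
"    pattern = re.compile(r'(?<!\\w)(' + '|'.join(re.escape(n) for n in names) + r')(?!\\w)')\n",
"    return pattern.sub(lambda m: nameToTag[m.group(1)], text)\n",
"\n",
"tags = {'Jan': '<user1>', 'Jan Novak': '<user2>'}\n",
"print(anonymizeText('Jan Novak a Jana pozdravuju, Jan.', tags))\n",
"# <user2> a Jana pozdravuju, <user1>.\n",
"```\n",
"\n",
"Compiling one pattern from all names also avoids looping over the whole name dictionary for every single text."
]
},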
{
"attachments": {},
"cell_type": "markdown",
"id": "409020dd-ec65-4219-a90b-eeae1c717c83",
"metadata": {},
"source": [
"## Delete blank mentions"
]
},
{
"cell_type": "code",
"execution_count": 100,
"id": "ea84ff84-dcc5-4135-9e2c-4222c8a5d7df",
"metadata": {},
"outputs": [],
"source": [
"def isMentionOnly(record):\n",
"    # True when the text is a single token that contains an anonymized user tag\n",
"    text = record.get('text')\n",
"    return isinstance(text, str) and len(text.split()) == 1 and 'user' in text\n",
"\n",
"def deleteMentions(dictObj):\n",
"    # Each loop runs over a copy (list(...)), so removing an item from the\n",
"    # original list does not make the loop skip the element that follows it.\n",
"    for post in dictObj:\n",
"        for comment in list(post['comments']):\n",
"            if isMentionOnly(comment):\n",
"                post['comments'].remove(comment)\n",
"                continue\n",
"            for reply in list(comment['replies']):\n",
"                if isMentionOnly(reply):\n",
"                    comment['replies'].remove(reply)\n",
"                    continue\n",
"                for sec_reply in list(reply['replies']):\n",
"                    if isMentionOnly(sec_reply):\n",
"                        reply['replies'].remove(sec_reply)\n",
"    return dictObj"
]
},
{
"cell_type": "code",
"execution_count": 101,
"id": "32472e54-6011-4ed7-bcbc-0bdba604797d",
"metadata": {},
"outputs": [],
"source": [
"finalDataset= deleteMentions(finalData)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "63e895ab-a12e-4142-add9-704cfbb5df29",
"metadata": {},
"source": [
"## Final dataset version peek"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "acb06643-0b78-4ce8-b2e4-18d19dcde4f8",
"metadata": {},
"source": [
"The final structure of the dataset might look like this:\n",
"\n",
"    [\n",
"        {\n",
"            'post_author': Post author,\n",
"            'title': Text that the author published as a post,\n",
"            'post_reactions': Number of interactions with a post,\n",
"            'sentiment': 1,\n",
"            'hateful': 0,\n",
"            'comments': [\n",
"                {\n",
"                    'comment_author': Comment author,\n",
"                    'text': Comment's content,\n",
"                    'sentiment': 1,\n",
"                    'hateful': 0,\n",
"                    'replies': [\n",
"                        {\n",
"                            'reply_author': Reply author,\n",
"                            'text': Reply's content,\n",
"                            'sentiment': 1,\n",
"                            'hateful': 0,\n",
"                            'replies': [\n",
"                                {\n",
"                                    'second_reply_author': Second reply author,\n",
"                                    'text': Second reply's content,\n",
"                                    'sentiment': 1,\n",
"                                    'hateful': 0\n",
"                                }\n",
"                            ]\n",
"                        }\n",
"                    ]\n",
"                }\n",
"            ]\n",
"        }\n",
"    ]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "23805d2c",
"metadata": {},
"source": [
"## Rename Label Fields"
]
},
{
"cell_type": "code",
"execution_count": 118,
"id": "e613fbd3",
"metadata": {},
"outputs": [],
"source": [
"renamed_dataset = []\n",
"next_id = 1  # running identifier shared by posts, comments and replies\n",
"for idx, post in enumerate(finalDataset):\n",
"    renamed_dataset.append({\n",
"        'id': next_id,\n",
"        'post_author': post['publisher'],\n",
"        'title': post['title'],\n",
"        'post_reactions': post['post_reactions'],\n",
"        'comments': []\n",
"    })\n",
"    next_id += 1\n",
"    for cmnt_idx, comment in enumerate(post['comments']):\n",
"        try:\n",
"            renamed_dataset[idx]['comments'].append({\n",
"                'id': next_id,\n",
"                'comment_author': comment['publisher'],\n",
"                'text': comment['text'],\n",
"                'replies': []\n",
"            })\n",
"            next_id += 1\n",
"        except KeyError:\n",
"            pass\n",
"\n",
"        try:\n",
"            if len(comment['replies']):\n",
"                for repl_idx, reply in enumerate(comment['replies']):\n",
"                    renamed_dataset[idx]['comments'][cmnt_idx]['replies'].append({\n",
"                        'id': next_id,\n",
"                        'reply_author': reply['publisher'],\n",
"                        'text': reply['text'],\n",
"                        'replies': []\n",
"                    })\n",
"                    next_id += 1\n",
"\n",
"                    if len(reply['replies']):\n",
"\n",
"                        for sec_reply in reply['replies']:\n",
"                            renamed_dataset[idx]['comments'][cmnt_idx]['replies'][repl_idx]['replies'].append({\n",
"                                'id': next_id,\n",
"                                'second_reply_author': sec_reply['publisher'],\n",
"                                'text': sec_reply['text']\n",
"                            })\n",
"                            next_id += 1\n",
"\n",
"        except KeyError:\n",
"            pass"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "4b95afa4",
"metadata": {},
"source": [
"### Delete empty arrays"
]
},
{
"cell_type": "code",
"execution_count": 119,
"id": "35e114a6",
"metadata": {},
"outputs": [],
"source": [
"for idx, post in enumerate(renamed_dataset):\n",
"    for cmnt_idx, comment in enumerate(post['comments']):\n",
"        try:\n",
"            if not len(comment['replies']):\n",
"                del renamed_dataset[idx]['comments'][cmnt_idx]['replies']\n",
"        except KeyError as err:\n",
"            pass\n",
"\n",
"        try:\n",
"            if len(comment['replies']):\n",
"                for repl_idx, reply in enumerate(comment['replies']):\n",
"                    if not len(reply['replies']):\n",
"                        del renamed_dataset[idx]['comments'][cmnt_idx]['replies'][repl_idx]['replies']\n",
"        except KeyError:\n",
"            pass"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "40f640e4",
"metadata": {},
"source": [
"### Create JSON file"
]
},
{
"cell_type": "code",
"execution_count": 120,
"id": "c9054d6d",
"metadata": {},
"outputs": [],
"source": [
"with open('Facebook/outputs/GDPR_v5.json', 'w', encoding= \"utf-8\") as file:\n",
"    json.dump(renamed_dataset, file, indent=4, separators=(',',': '))"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d32a26df",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}