{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Data chunking for effectiveness\n", "\n", "In our data, facebook user called Robert Fico has a lot of samples.\n", "For efficiency, this notebook chunks those data in 4 parts." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import json" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### JSONL file loading and creation" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def load_jsonl(file_path):\n", " with open(file_path, 'r', encoding='utf-8') as file:\n", " return [json.loads(line) for line in file]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def create_jsonl(filename, new_dataset):\n", " with open(f'{filename}l', 'w') as jsonl_file:\n", " for item in new_dataset:\n", " jsonl_file.write(json.dumps(item) + '\\n')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "fico = load_jsonl('jsonl_data/robert_fico_data.jsonl')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Split data into 4 parts equal parts" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "135155" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "num_samples = len(fico)\n", "chunk_size = int(num_samples / 4)\n", "\n", "num_samples" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chunk_size * 4 == num_samples # we have lost one sample, because our dataset has odd number of samples" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Actual chunking algorithm" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "chunk_arr = []\n", "for chunks in range(0, 4):\n", " chunk_arr.append(\n", " fico[chunk_size * chunks: chunk_size * (chunks + 1)]\n", " )" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Write chunked data to disk in a for loop" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "for index, data in enumerate(chunk_arr):\n", " create_jsonl(f'jsonl_data/fico_chunk_{index}.json', data)" ] } ], "metadata": { "kernelspec": { "display_name": "sentiment", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }