959a391334
Some checks failed
publish docs / publish-docs (push) Has been cancelled
release-please / release-please (push) Has been cancelled
tests / setup (push) Has been cancelled
tests / ${{ matrix.quality-command }} (black) (push) Has been cancelled
tests / ${{ matrix.quality-command }} (mypy) (push) Has been cancelled
tests / ${{ matrix.quality-command }} (ruff) (push) Has been cancelled
tests / test (push) Has been cancelled
tests / all_checks_passed (push) Has been cancelled
Mark stale issues and pull requests / stale (push) Has been cancelled
138 lines
5.0 KiB
Plaintext
138 lines
5.0 KiB
Plaintext
# Ingesting & Managing Documents
|
|
|
|
The ingestion of documents can be done in different ways:
|
|
|
|
* Using the `/ingest` API
|
|
* Using the Gradio UI
|
|
* Using the Bulk Local Ingestion functionality (check next section)
|
|
|
|
## Bulk Local Ingestion
|
|
|
|
You will need to activate `data.local_ingestion.enabled` in your setting file to use this feature. Additionally,
|
|
it is probably a good idea to set `data.local_ingestion.allow_ingest_from` to specify which folders are allowed to be ingested.
|
|
|
|
<Callout intent = "warning">
|
|
Be careful enabling this feature in a production environment, as it can be a security risk, as it allows users to
|
|
ingest any local file with permissions.
|
|
</Callout>
|
|
|
|
When you are running PrivateGPT in a fully local setup, you can ingest a complete folder for convenience (containing
|
|
pdf, text files, etc.)
|
|
and optionally watch changes on it with the command:
|
|
|
|
```bash
|
|
make ingest /path/to/folder -- --watch
|
|
```
|
|
|
|
To log the processed and failed files to an additional file, use:
|
|
|
|
```bash
|
|
make ingest /path/to/folder -- --watch --log-file /path/to/log/file.log
|
|
```
|
|
|
|
**Note for Windows Users:** Depending on your Windows version and whether you are using PowerShell to execute
|
|
PrivateGPT API calls, you may need to include the parameter name before passing the folder path for consumption:
|
|
|
|
```bash
|
|
make ingest arg=/path/to/folder -- --watch --log-file /path/to/log/file.log
|
|
```
|
|
|
|
After ingestion is complete, you should be able to chat with your documents
|
|
by navigating to http://localhost:8001 and using the option `Query documents`,
|
|
or using the completions / chat API.
|
|
|
|
## Ingestion troubleshooting
|
|
|
|
### Running out of memory
|
|
|
|
To do not run out of memory, you should ingest your documents without the LLM loaded in your (video) memory.
|
|
To do so, you should change your configuration to set `llm.mode: mock`.
|
|
|
|
You can also use the existing `PGPT_PROFILES=mock` that will set the following configuration for you:
|
|
|
|
```yaml
|
|
llm:
|
|
mode: mock
|
|
embedding:
|
|
mode: local
|
|
```
|
|
|
|
This configuration allows you to use hardware acceleration for creating embeddings while avoiding loading the full LLM into (video) memory.
|
|
|
|
Once your documents are ingested, you can set the `llm.mode` value back to `local` (or your previous custom value).
|
|
|
|
### Ingestion speed
|
|
|
|
The ingestion speed depends on the number of documents you are ingesting, and the size of each document.
|
|
To speed up the ingestion, you can change the ingestion mode in configuration.
|
|
|
|
The following ingestion mode exist:
|
|
* `simple`: historic behavior, ingest one document at a time, sequentially
|
|
* `batch`: read, parse, and embed multiple documents using batches (batch read, and then batch parse, and then batch embed)
|
|
* `parallel`: read, parse, and embed multiple documents in parallel. This is the fastest ingestion mode for local setup.
|
|
* `pipeline`: Alternative to parallel.
|
|
To change the ingestion mode, you can use the `embedding.ingest_mode` configuration value. The default value is `simple`.
|
|
|
|
To configure the number of workers used for parallel or batched ingestion, you can use
|
|
the `embedding.count_workers` configuration value. If you set this value too high, you might run out of
|
|
memory, so be mindful when setting this value. The default value is `2`.
|
|
For `batch` mode, you can easily set this value to your number of threads available on your CPU without
|
|
running out of memory. For `parallel` mode, you should be more careful, and set this value to a lower value.
|
|
|
|
The configuration below should be enough for users who want to stress more their hardware:
|
|
```yaml
|
|
embedding:
|
|
ingest_mode: parallel
|
|
count_workers: 4
|
|
```
|
|
|
|
If your hardware is powerful enough, and that you are loading heavy documents, you can increase the number of workers.
|
|
It is recommended to do your own tests to find the optimal value for your hardware.
|
|
|
|
If you have a `bash` shell, you can use this set of command to do your own benchmark:
|
|
|
|
```bash
|
|
# Wipe your local data, to put yourself in a clean state
|
|
# This will delete all your ingested documents
|
|
make wipe
|
|
|
|
time PGPT_PROFILES=mock python ./scripts/ingest_folder.py ~/my-dir/to-ingest/
|
|
```
|
|
|
|
## Supported file formats
|
|
|
|
PrivateGPT by default supports all the file formats that contains clear text (for example, `.txt` files, `.html`, etc.).
|
|
However, these text based file formats as only considered as text files, and are not pre-processed in any other way.
|
|
|
|
It also supports the following file formats:
|
|
* `.hwp`
|
|
* `.pdf`
|
|
* `.docx`
|
|
* `.pptx`
|
|
* `.ppt`
|
|
* `.pptm`
|
|
* `.jpg`
|
|
* `.png`
|
|
* `.jpeg`
|
|
* `.mp3`
|
|
* `.mp4`
|
|
* `.csv`
|
|
* `.epub`
|
|
* `.md`
|
|
* `.mbox`
|
|
* `.ipynb`
|
|
* `.json`
|
|
|
|
<Callout intent = "info">
|
|
While `PrivateGPT` supports these file formats, it **might** require additional
|
|
dependencies to be installed in your python's virtual environment.
|
|
For example, if you try to ingest `.epub` files, `PrivateGPT` might fail to do it, and will instead display an
|
|
explanatory error asking you to download the necessary dependencies to install this file format.
|
|
</Callout>
|
|
|
|
<Callout intent = "info">
|
|
**Other file formats might work**, but they will be considered as plain text
|
|
files (in other words, they will be ingested as `.txt` files).
|
|
</Callout>
|
|
|