## Running the Server

PrivateGPT supports running with different LLMs & setups.

### Local models

Both the LLM and the Embeddings model will run locally.

Make sure you have followed the *Local LLM requirements* section before moving on.

This command will start PrivateGPT using the `settings.yaml` (default profile) together with the `settings-local.yaml`
configuration files. By default, it will enable both the API and the Gradio UI. Run:

```bash
PGPT_PROFILES=local make run
```

or

```bash
PGPT_PROFILES=local poetry run python -m private_gpt
```

When the server is started it will print a log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API
using Swagger UI.
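
To check the API from the command line, you can hit the health endpoint; this is a minimal smoke test, assuming the default port `8001` and the `/health` route listed in the Swagger UI:

```bash
# Quick check that the API is up (assumes default port 8001 and the /health route;
# see the Swagger UI at /docs if your setup differs)
curl http://localhost:8001/health
```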

#### Customizing low level parameters

Currently, not all the parameters of `llama.cpp` and `llama-cpp-python` are available in PrivateGPT's `settings.yaml` file.
If you need to customize parameters such as the number of layers loaded into the GPU, you can change
them directly in `private_gpt/components/llm/llm_component.py`.

##### Available LLM config options

The `llm` section of the settings allows for the following configurations:

- `mode`: how to run your LLM
- `max_new_tokens`: this lets you configure the number of new tokens the LLM will generate and add to the context window (by default Llama.cpp uses `256`)

Example:

```yaml
llm:
  mode: local
  max_new_tokens: 256
```

If you are getting an out of memory error, you might also try a smaller model or stick to the
recommended models, instead of custom tuning the parameters.

### Using OpenAI

If you cannot run a local model (because you don't have a GPU, for example) or for testing purposes, you may
decide to run PrivateGPT using OpenAI as the LLM and Embeddings model.

In order to do so, create a profile `settings-openai.yaml` with the following contents:

```yaml
llm:
  mode: openai

openai:
  api_base: <openai-api-base-url> # Defaults to https://api.openai.com/v1
  api_key: <your_openai_api_key> # You could skip this configuration and use the OPENAI_API_KEY env var instead
  model: <openai_model_to_use> # Optional model to use. Default is "gpt-3.5-turbo"
  # Note: OpenAI models are listed here: https://platform.openai.com/docs/models
```
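
If you prefer to keep the key out of the profile file, export the environment variable mentioned in the comment above before starting the server (the value below is a placeholder):

```bash
# Alternative to hardcoding the key in settings-openai.yaml
export OPENAI_API_KEY="<your_openai_api_key>"  # placeholder, use your real key
```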

And run PrivateGPT loading that profile you just created:

`PGPT_PROFILES=openai make run`

or

`PGPT_PROFILES=openai poetry run python -m private_gpt`

When the server is started it will print a log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API.
You'll notice that the speed and quality of the responses are higher, given you are using OpenAI's servers for the heavy
computations.

### Using OpenAI compatible API

Many tools, including [LocalAI](https://localai.io/) and [vLLM](https://docs.vllm.ai/en/latest/),
support serving local models with an OpenAI compatible API. Even when overriding the `api_base`,
using the `openai` mode doesn't allow you to use custom models. Instead, you should use the `openailike` mode:

```yaml
llm:
  mode: openailike
```

This mode uses the same settings as the `openai` mode.

As an example, you can follow the [vLLM quickstart guide](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server)
to run an OpenAI compatible server. Then, you can run PrivateGPT using the `settings-vllm.yaml` profile:

`PGPT_PROFILES=vllm make run`
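
For reference, a `settings-vllm.yaml` along these lines should work. This is only a sketch: the `api_base` assumes vLLM is serving its OpenAI-compatible API on its default port `8000`, and the model name is purely illustrative; point both at whatever you actually deployed:

```yaml
llm:
  mode: openailike

openai:
  api_base: http://localhost:8000/v1 # Assumes vLLM's default OpenAI-compatible endpoint
  api_key: EMPTY                     # vLLM does not check the key unless you configure one
  model: mistralai/Mistral-7B-Instruct-v0.2 # Illustrative; use the model you actually served
```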

### Using Azure OpenAI

If you cannot run a local model (because you don't have a GPU, for example) or for testing purposes, you may
decide to run PrivateGPT using Azure OpenAI as the LLM and Embeddings model.

In order to do so, create a profile `settings-azopenai.yaml` with the following contents:

```yaml
llm:
  mode: azopenai

embedding:
  mode: azopenai

azopenai:
  api_key: <your_azopenai_api_key> # You could skip this configuration and use the AZ_OPENAI_API_KEY env var instead
  azure_endpoint: <your_azopenai_endpoint> # You could skip this configuration and use the AZ_OPENAI_ENDPOINT env var instead
  api_version: <api_version> # The API version to use. Default is "2023-05-15"
  embedding_deployment_name: <your_embedding_deployment_name> # You could skip this configuration and use the AZ_OPENAI_EMBEDDING_DEPLOYMENT_NAME env var instead
  embedding_model: <openai_embeddings_to_use> # Optional model to use. Default is "text-embedding-ada-002"
  llm_deployment_name: <your_model_deployment_name> # You could skip this configuration and use the AZ_OPENAI_LLM_DEPLOYMENT_NAME env var instead
  llm_model: <openai_model_to_use> # Optional model to use. Default is "gpt-35-turbo"
```
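
As with OpenAI, the credentials can be supplied through the environment variables named in the comments above instead of the profile file (the values below are placeholders):

```bash
# Alternative to hardcoding credentials in settings-azopenai.yaml
export AZ_OPENAI_API_KEY="<your_azopenai_api_key>"
export AZ_OPENAI_ENDPOINT="<your_azopenai_endpoint>"
```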

And run PrivateGPT loading that profile you just created:

`PGPT_PROFILES=azopenai make run`

or

`PGPT_PROFILES=azopenai poetry run python -m private_gpt`

When the server is started it will print a log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API.
You'll notice that the speed and quality of the responses are higher, given you are using Azure OpenAI's servers for the heavy
computations.

### Using AWS Sagemaker

For a fully private & performant setup, you can choose to have both your LLM and Embeddings model deployed using Sagemaker.

Note: how to deploy models on Sagemaker is out of the scope of this documentation.

In order to do so, create a profile `settings-sagemaker.yaml` with the following contents (remember to
update the values of `llm_endpoint_name` and `embedding_endpoint_name` to yours):

```yaml
llm:
  mode: sagemaker

sagemaker:
  llm_endpoint_name: huggingface-pytorch-tgi-inference-2023-09-25-19-53-32-140
  embedding_endpoint_name: huggingface-pytorch-inference-2023-11-03-07-41-36-479
```
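
If you don't have the endpoint names at hand, the AWS CLI can list them; this assumes the CLI is installed and configured for the account and region where the endpoints were deployed:

```bash
# List SageMaker endpoint names in the configured region
aws sagemaker list-endpoints --query "Endpoints[].EndpointName" --output table
```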

And run PrivateGPT loading that profile you just created:

`PGPT_PROFILES=sagemaker make run`

or

`PGPT_PROFILES=sagemaker poetry run python -m private_gpt`

When the server is started it will print a log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API.

### Using Ollama

Another option for a fully private setup is using [Ollama](https://ollama.ai/).

Note: how to deploy Ollama and pull models onto it is out of the scope of this documentation.

In order to do so, create a profile `settings-ollama.yaml` with the following contents:

```yaml
llm:
  mode: ollama

ollama:
  model: <ollama_model_to_use> # Required Model to use.
                               # Note: Ollama Models are listed here: https://ollama.ai/library
                               # Be sure to pull the model to your Ollama server
  api_base: <ollama-api-base-url> # Defaults to http://localhost:11434
```
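
The model referenced in the profile must already be pulled onto the Ollama server before starting PrivateGPT. For example (the model name is illustrative; use whichever one you configured):

```bash
# Pull the model onto the Ollama server and confirm it is available
ollama pull llama2   # example model name only
ollama list
```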

And run PrivateGPT loading that profile you just created:

`PGPT_PROFILES=ollama make run`

or

`PGPT_PROFILES=ollama poetry run python -m private_gpt`

When the server is started it will print a log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API.

### Using IPEX-LLM

For a fully private setup on Intel GPUs (such as a local PC with an iGPU, or discrete GPUs like Arc, Flex, and Max), you can use [IPEX-LLM](https://github.com/intel-analytics/ipex-llm).

To deploy Ollama and pull models using IPEX-LLM, please refer to [this guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html). Then, follow the same steps outlined in the [Using Ollama](#using-ollama) section to create a `settings-ollama.yaml` profile and run the PrivateGPT server.

### Using Gemini

If you cannot run a local model (because you don't have a GPU, for example) or for testing purposes, you may
decide to run PrivateGPT using Gemini as the LLM and Embeddings model. In addition, you will benefit from
multimodal inputs, such as text and images, in a very large context window.

In order to do so, create a profile `settings-gemini.yaml` with the following contents:

```yaml
llm:
  mode: gemini

embedding:
  mode: gemini

gemini:
  api_key: <your_gemini_api_key> # You could skip this configuration and use the GEMINI_API_KEY env var instead
  model: <gemini_model_to_use> # Optional model to use. Default is "models/gemini-pro"
  embedding_model: <gemini_embeddings_to_use> # Optional model to use. Default is "models/embedding-001"
```

And run PrivateGPT loading that profile you just created:

`PGPT_PROFILES=gemini make run`

or

`PGPT_PROFILES=gemini poetry run python -m private_gpt`

When the server is started it will print a log *Application startup complete*.
Navigate to http://localhost:8001 to use the Gradio UI or to http://localhost:8001/docs (API section) to try the API.