DiplomovaPraca/RUN_INSTRUCTION.md
2026-05-14 12:26:11 +02:00

# Run Instructions - LightRAG ADC Knowledge Graph
This project prepares ADC pharmaceutical leaflet data for a knowledge graph and
LightRAG-based question answering about drug interactions, contraindications,
warnings, indications, dosage, and side effects.
## Current Data
The current ADC scrape is stored in:
```text
data_adc_databaza/adc_scrape_2026_05_04/
```
Main files:
- `adc_product_links.json` - 35k+ ADC product detail URLs.
- `adc_products_structured.json` - main structured dataset for the next pipeline stage.
- `adc_products_structured.failed.json` - products that failed during scraping.
- `adc_products_structured_10.json` - small parser test sample.
Use `adc_products_structured.json` as the main source for new graph and
LightRAG ingestion work.
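
Before building on the dataset, it can help to confirm that the file loads and that records carry the `lightrag_text` field used later for ingestion. The following is a minimal sketch, not part of the project: `load_records` and `summarize` are hypothetical helpers, and the top-level shape of the JSON (an array of records, or an object whose values are records) is an assumption.

```python
import json
from pathlib import Path

# Path taken from this document; adjust it if your checkout differs.
DATA_FILE = Path("data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json")


def load_records(path):
    """Return the product records as a list.

    Assumption: the file is either a JSON array of records or a JSON object
    whose values are records. Any other top-level type raises ValueError.
    """
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if isinstance(data, list):
        return data
    if isinstance(data, dict):
        return list(data.values())
    raise ValueError(f"Unexpected top-level JSON type: {type(data).__name__}")


def summarize(records):
    """Count all records and those with a non-empty `lightrag_text`."""
    with_text = sum(
        1
        for r in records
        if isinstance(r, dict) and str(r.get("lightrag_text") or "").strip()
    )
    return {"total": len(records), "with_lightrag_text": with_text}
```

Calling `summarize(load_records(DATA_FILE))` from the project root would print nothing by itself; wrap it in `print(...)` to see the two counts.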
## Requirements
- Python 3.10+
- ADC scraper dependencies:
```powershell
pip install -r scripts/adc_scraper/requirements.txt
python -m playwright install chromium
```
- Local embedding dependencies:
```powershell
pip install sentence-transformers fastapi uvicorn
```
- LightRAG package from `lightrag/`
- OpenWebUI-compatible LLM API access configured in `lightrag/.env`
## Scraping Pipeline
Collect ADC product detail links:
```powershell
python scripts/adc_scraper/scrape_adc_product_links.py `
--out data_adc_databaza/adc_scrape_2026_05_04/adc_product_links.json `
--browser
```
Scrape product detail pages and PIL pages into structured JSON:
```powershell
python scripts/adc_scraper/scrape_adc_product_data.py --browser
```
The default output is:
```text
data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json
```
For a small test run:
```powershell
python scripts/adc_scraper/scrape_adc_product_data.py `
--browser `
--limit 10 `
--out data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured_10.json
```
## Start Servers
Start the local embedding server and LightRAG server:
```powershell
cd "c:\Users\Oleh\Desktop\Diplomova praca"
python start_servers.py
```
Keep this terminal open. Stop with `Ctrl+C`.
Health checks:
```text
http://localhost:8010/health - embedding server
http://localhost:9621/health - LightRAG server
```
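
The two health endpoints can also be polled from a script, for example before starting an ingestion run. This is a rough sketch using only the standard library; `is_healthy` and `check_all` are hypothetical helper names, and treating any HTTP 200 answer as "healthy" is an assumption about what the endpoints return.

```python
import urllib.error
import urllib.request

# Endpoints as listed above.
SERVICES = {
    "embedding server": "http://localhost:8010/health",
    "LightRAG server": "http://localhost:9621/health",
}


def is_healthy(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def check_all(services=SERVICES, opener=urllib.request.urlopen):
    """Map each service name to True (reachable) or False."""
    return {name: is_healthy(url, opener=opener) for name, url in services.items()}
```

The `opener` parameter exists only so the functions can be exercised without running servers; in normal use, `check_all()` with no arguments is enough.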
## Old Ingestion Pipeline
The folder `checkpoint_02_ingest/` contains an older ingestion pipeline that
loads data from:
```text
data_adc_databaza/cleaned_general_info_additional.json
```
It is kept as a reference because it already contains working LightRAG upload
logic and progress tracking:
```powershell
python checkpoint_02_ingest/load_leaflets.py --count 50
python checkpoint_02_ingest/load_leaflets.py --status
```
Do not treat this as the final ingestion path for the new dataset. The next
step is to create a new ingestion script that reads:
```text
data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json
```
and sends each record's `lightrag_text` to LightRAG.
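
One possible shape for that new script is sketched below. It is not the project's implementation: the helper names are hypothetical, the `/documents/text` route and its `{"text": ...}` body are assumptions about the LightRAG server API that should be checked against the installed version, and progress tracking like that in `load_leaflets.py` is left out.

```python
import json
import urllib.request

LIGHTRAG_URL = "http://localhost:9621"
# Assumption: the LightRAG server accepts single documents at this route.
INSERT_ROUTE = "/documents/text"


def iter_texts(records):
    """Yield the non-empty `lightrag_text` of each record (records are dicts)."""
    for record in records:
        text = str(record.get("lightrag_text") or "").strip()
        if text:
            yield text


def build_request(text, base_url=LIGHTRAG_URL):
    """Build a JSON POST request carrying one document."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + INSERT_ROUTE,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ingest(records, send=urllib.request.urlopen, limit=None):
    """Send up to `limit` texts one by one; return the number sent."""
    sent = 0
    for text in iter_texts(records):
        if limit is not None and sent >= limit:
            break
        with send(build_request(text), timeout=120) as resp:
            resp.read()  # drain the response body before the next request
        sent += 1
    return sent
```

Starting with a small `limit` (for example 10) mirrors the `--count` option of the old pipeline and keeps a first run cheap.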
## Query LightRAG
After documents are ingested and LightRAG has finished processing them:
```powershell
python -c "
import urllib.request, json
payload = json.dumps({'query': 'Ake su kontraindikacie Abirateronu?', 'mode': 'hybrid'}).encode()
req = urllib.request.Request('http://localhost:9621/query', data=payload, headers={'Content-Type': 'application/json'})
r = urllib.request.urlopen(req, timeout=120)
print(json.loads(r.read())['response'])
"
```
Available query modes:
- `hybrid` - recommended combined retrieval mode.
- `local` - entity-centered retrieval.
- `global` - broader graph-level retrieval.
- `naive` - vector-only retrieval.
Avoid querying while the document pipeline is still busy. Entity extraction can
take several minutes per batch depending on the LLM API and concurrency limits.
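
To compare how the four modes listed above answer the same question, the one-liner can be wrapped in small functions. This is a sketch under the same assumptions as the command above (the `/query` endpoint, a JSON body with `query` and `mode`, and a `response` key in the reply); `build_query`, `ask`, and `compare_modes` are hypothetical names.

```python
import json
import urllib.request

QUERY_URL = "http://localhost:9621/query"
MODES = ("hybrid", "local", "global", "naive")  # the modes listed above


def build_query(question, mode="hybrid"):
    """Build the POST request for one question; reject unlisted modes."""
    if mode not in MODES:
        raise ValueError(f"Unknown mode {mode!r}; expected one of {MODES}")
    payload = json.dumps({"query": question, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        QUERY_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )


def ask(question, mode="hybrid", send=urllib.request.urlopen):
    """Return the `response` text for one question in one mode."""
    with send(build_query(question, mode), timeout=120) as resp:
        return json.loads(resp.read())["response"]


def compare_modes(question, modes=MODES, send=urllib.request.urlopen):
    """Ask the same question in each mode; map mode -> answer."""
    return {mode: ask(question, mode, send=send) for mode in modes}
```

For example, `compare_modes("Ake su kontraindikacie Abirateronu?")` issues four requests in sequence, so expect it to take roughly four times as long as a single query.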
## Reset LightRAG Storage
Stop the servers first, then clear generated graph/vector data:
```powershell
Remove-Item -Path "c:\Users\Oleh\Desktop\Diplomova praca\lightrag\rag_storage\*" -Recurse -Force
python checkpoint_02_ingest/load_leaflets.py --reset
```
Use this only when you intentionally want to rebuild the graph.
## Recommended Next Steps
1. Update `validate_adc_json.py` for the new `adc_products_structured.json` schema.
2. Build an explicit knowledge graph from `graph_hints` and PIL subsections.
3. Create a new LightRAG ingestion script for the new dataset.
4. Retry failed scrape URLs from `adc_products_structured.failed.json`.
5. Prepare a small RAGAS evaluation set for contraindication and interaction questions.
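
As a starting point for step 1, a validator could report records that lack the fields the later stages rely on. The sketch below is illustrative only: it does not reflect the current contents of `validate_adc_json.py`, the function names are hypothetical, and treating `lightrag_text` and `graph_hints` as required top-level keys of each record is an assumption based on how this document mentions them.

```python
# Field names taken from this document; their exact placement in the schema
# is an assumption.
REQUIRED_FIELDS = ("lightrag_text", "graph_hints")


def validate_record(record):
    """Return a list of problems for one record (an empty list means OK)."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if "lightrag_text" in record:
        text = record["lightrag_text"]
        if not (isinstance(text, str) and text.strip()):
            problems.append("lightrag_text is empty or not a string")
    return problems


def validate_all(records):
    """Map record index -> problems, listing only records with problems."""
    report = {}
    for index, record in enumerate(records):
        problems = validate_record(record)
        if problems:
            report[index] = problems
    return report
```

An empty dictionary from `validate_all` means every record passed these checks; it says nothing about the quality of the scraped text itself.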
## Project Layout
```text
Diplomova praca/
  start_servers.py
  embedding_server.py
  scripts/adc_scraper/
    scrape_adc_product_links.py
    scrape_adc_product_data.py
    validate_adc_json.py
  data_adc_databaza/
    adc_scrape_2026_05_04/
      adc_product_links.json
      adc_products_structured.json
      adc_products_structured.failed.json
      adc_products_structured_10.json
  checkpoint_02_ingest/
    load_leaflets.py
    batch_ingest.py
    progress.json
  lightrag/
    .env
    rag_storage/
```
## Troubleshooting
If the embedding server does not start, reinstall its dependencies:
```powershell
pip install sentence-transformers fastapi uvicorn
```
If LightRAG fails with encoding errors, enable Python's UTF-8 mode and start the server directly:
```powershell
$env:PYTHONUTF8 = "1"
python -m lightrag.api.lightrag_server
```
If LLM extraction times out, reduce concurrency in `lightrag/.env`:
```text
MAX_ASYNC=3
MAX_PARALLEL_INSERT=1
```
If the graph looks empty after ingestion, wait for background processing and
check:
```powershell
python checkpoint_02_ingest/load_leaflets.py --status
```