# Run Instructions - LightRAG ADC Knowledge Graph

This project prepares ADC pharmaceutical leaflet data for a knowledge graph and
LightRAG-based question answering about drug interactions, contraindications,
warnings, indications, dosage, and side effects.

## Current Data

The current ADC scrape is stored in:

```text
data_adc_databaza/adc_scrape_2026_05_04/
```

Main files:

- `adc_product_links.json` - 35k+ ADC product detail URLs.
- `adc_products_structured.json` - main structured dataset for the next pipeline stage.
- `adc_products_structured.failed.json` - products that failed during scraping.
- `adc_products_structured_10.json` - small parser test sample.

Use `adc_products_structured.json` as the main source for new graph and
LightRAG ingestion work.
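
Before building on the dataset, a quick sanity check can help. The sketch below is not part of the project. It assumes the file holds a top-level JSON array of product objects (the schema is not described in this document), and `summarize_records` is a hypothetical helper name. It counts how many records carry a non-empty `lightrag_text`, the field that the ingestion step relies on.

```python
import json
from pathlib import Path

# Path taken from this document; adjust if the scrape folder changes.
DATASET = Path("data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json")


def summarize_records(records):
    """Count records and how many carry a non-empty `lightrag_text` string."""
    total = len(records)
    with_text = sum(
        1
        for rec in records
        if isinstance(rec, dict)
        and isinstance(rec.get("lightrag_text"), str)
        and rec["lightrag_text"].strip()
    )
    return {"total": total, "with_text": with_text, "missing_text": total - with_text}


if __name__ == "__main__" and DATASET.exists():
    # Assumption: the file is a top-level JSON array of product objects.
    with DATASET.open(encoding="utf-8") as fh:
        print(summarize_records(json.load(fh)))
```

Run it from the project root so the relative path resolves. A large `missing_text` count would suggest checking the scraper output before ingestion.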

## Requirements

- Python 3.10+
- ADC scraper dependencies:

```powershell
pip install -r scripts/adc_scraper/requirements.txt
python -m playwright install chromium
```

- Local embedding dependencies:

```powershell
pip install sentence-transformers fastapi uvicorn
```

- LightRAG package from `lightrag/`
- OpenWebUI-compatible LLM API access configured in `lightrag/.env`

## Scraping Pipeline

Collect ADC product detail links:

```powershell
python scripts/adc_scraper/scrape_adc_product_links.py `
  --out data_adc_databaza/adc_scrape_2026_05_04/adc_product_links.json `
  --browser
```

Scrape product detail pages and PIL pages into structured JSON:

```powershell
python scripts/adc_scraper/scrape_adc_product_data.py --browser
```

The default output is:

```text
data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json
```

For a small test run:

```powershell
python scripts/adc_scraper/scrape_adc_product_data.py `
  --browser `
  --limit 10 `
  --out data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured_10.json
```

## Start Servers

Start the local embedding server and LightRAG server:

```powershell
cd "c:\Users\Oleh\Desktop\Diplomova praca"
python start_servers.py
```

Keep this terminal open while the servers run. Press `Ctrl+C` in it to stop them.

Health checks:

```text
http://localhost:8010/health   - embedding server
http://localhost:9621/health   - LightRAG server
```
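
The two health checks can also be scripted. The following is a minimal sketch using only the standard library. The URLs come from the list above, while `is_healthy` and `check_all` are hypothetical helper names, and the sketch assumes that an HTTP 200 response means the server is ready.

```python
import http.client
import urllib.error
import urllib.request

# Endpoints listed in this document.
HEALTH_URLS = {
    "embedding server": "http://localhost:8010/health",
    "LightRAG server": "http://localhost:9621/health",
}


def is_healthy(url, timeout=5.0):
    """Return True if the URL answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, non-2xx status, or malformed URL.
        return False


def check_all(urls=HEALTH_URLS, timeout=5.0):
    """Map each server name to its health status."""
    return {name: is_healthy(url, timeout) for name, url in urls.items()}


if __name__ == "__main__":
    for name, ok in check_all().items():
        print(f"{name}: {'OK' if ok else 'not reachable'}")
```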

## Old Ingestion Pipeline

The folder `checkpoint_02_ingest/` contains an older ingestion pipeline that
loads data from:

```text
data_adc_databaza/cleaned_general_info_additional.json
```

It is kept as a reference because it already contains working LightRAG upload
logic and progress tracking:

```powershell
python checkpoint_02_ingest/load_leaflets.py --count 50
python checkpoint_02_ingest/load_leaflets.py --status
```

Do not treat this as the final ingestion path for the new dataset. The next
step is to create a new ingestion script that reads:

```text
data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json
```

and sends each record's `lightrag_text` to LightRAG.
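
The planned ingestion script could start from something like the sketch below. It is not the project's implementation and rests on several assumptions: the dataset is a JSON array of objects, the LightRAG server accepts `POST /documents/text` with a `{"text": ...}` body (verify this against your installed LightRAG version), and no API key is required. The names `iter_texts`, `build_insert_request`, and `ingest` are hypothetical.

```python
import json
import urllib.request

# Assumption: the LightRAG server accepts POST /documents/text with {"text": ...}.
INSERT_URL = "http://localhost:9621/documents/text"


def iter_texts(records):
    """Yield the non-empty `lightrag_text` of each record, skipping the rest."""
    for rec in records:
        text = rec.get("lightrag_text") if isinstance(rec, dict) else None
        if isinstance(text, str) and text.strip():
            yield text


def build_insert_request(text, url=INSERT_URL):
    """Build (but do not send) the JSON POST request for one document."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ingest(records, limit=None, timeout=120):
    """Send up to `limit` texts to LightRAG one by one; return the number sent."""
    sent = 0
    for text in iter_texts(records):
        if limit is not None and sent >= limit:
            break
        with urllib.request.urlopen(build_insert_request(text), timeout=timeout) as resp:
            resp.read()  # drain the response; its body is not inspected here
        sent += 1
    return sent
```

With the servers running, `ingest(json.load(open(path, encoding="utf-8")), limit=10)` would upload a small first batch. The sketch has no progress tracking or retry logic; the old pipeline in `checkpoint_02_ingest/` is the reference for both.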

## Query LightRAG

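After documents are ingested and LightRAG has finished processing them, send a
query to the server. The example asks, in Slovak, "What are the
contraindications of abiraterone?":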

```powershell
python -c "
import urllib.request, json
payload = json.dumps({'query': 'Ake su kontraindikacie Abirateronu?', 'mode': 'hybrid'}).encode()
req = urllib.request.Request('http://localhost:9621/query', data=payload, headers={'Content-Type': 'application/json'})
r = urllib.request.urlopen(req, timeout=120)
print(json.loads(r.read())['response'])
"
```

Available query modes:

- `hybrid` - recommended combined retrieval mode.
- `local` - entity-centered retrieval.
- `global` - broader graph-level retrieval.
- `naive` - vector-only retrieval.

Avoid querying while the document pipeline is still busy. Entity extraction can
take several minutes per batch depending on the LLM API and concurrency limits.
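
For repeated queries, the one-liner above can be wrapped in a small helper. This is a sketch based on the same endpoint and payload shown in this section; `build_query_request` and `ask` are hypothetical names, and the mode check covers only the four modes listed here (the server may support others).

```python
import json
import urllib.request

QUERY_URL = "http://localhost:9621/query"
# Modes listed in this document.
QUERY_MODES = ("hybrid", "local", "global", "naive")


def build_query_request(question, mode="hybrid", url=QUERY_URL):
    """Build (but do not send) the JSON request for one query."""
    if mode not in QUERY_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {QUERY_MODES}")
    payload = json.dumps({"query": question, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )


def ask(question, mode="hybrid", timeout=120):
    """Send the query and return the `response` field of the JSON reply."""
    with urllib.request.urlopen(build_query_request(question, mode), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask` with the same question in each mode is a simple way to compare how the retrieval strategies differ on the ADC data.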

## Reset LightRAG Storage

Stop the servers first, then clear generated graph/vector data:

```powershell
# -Path (not -LiteralPath) is required so that the * wildcard is expanded.
Remove-Item -Path "c:\Users\Oleh\Desktop\Diplomova praca\lightrag\rag_storage\*" -Recurse -Force
python checkpoint_02_ingest/load_leaflets.py --reset
```

Use this only when you intentionally want to rebuild the graph.

## Recommended Next Steps

1. Update `validate_adc_json.py` for the new `adc_products_structured.json` schema.
2. Build an explicit knowledge graph from `graph_hints` and PIL subsections.
3. Create a new LightRAG ingestion script for the new dataset.
4. Retry failed scrape URLs from `adc_products_structured.failed.json`.
5. Prepare a small RAGAS evaluation set for contraindication and interaction questions.

## Project Layout

```text
Diplomova praca/
  start_servers.py
  embedding_server.py
  scripts/adc_scraper/
    scrape_adc_product_links.py
    scrape_adc_product_data.py
    validate_adc_json.py
  data_adc_databaza/
    adc_scrape_2026_05_04/
      adc_product_links.json
      adc_products_structured.json
      adc_products_structured.failed.json
      adc_products_structured_10.json
  checkpoint_02_ingest/
    load_leaflets.py
    batch_ingest.py
    progress.json
  lightrag/
    .env
    rag_storage/
```

## Troubleshooting

If the embedding server does not start, reinstall its dependencies:

```powershell
pip install sentence-transformers fastapi uvicorn
```

If LightRAG fails with Unicode encoding or decoding errors, enable Python's
UTF-8 mode and start the server directly:

```powershell
$env:PYTHONUTF8 = "1"
python -m lightrag.api.lightrag_server
```

If LLM extraction times out, reduce concurrency in `lightrag/.env`:

```text
MAX_ASYNC=3
MAX_PARALLEL_INSERT=1
```

If the graph looks empty after ingestion, wait for background processing to
finish, then check the ingestion status:

```powershell
python checkpoint_02_ingest/load_leaflets.py --status
```