DiplomovaPraca/RUN_INSTRUCTION.md
2026-05-14 12:44:43 +02:00

# Run Instructions - LightRAG ADC Knowledge Graph
This project prepares ADC pharmaceutical leaflet data for a knowledge graph and
LightRAG-based question answering about drug interactions, contraindications,
warnings, indications, dosage, and side effects.
## Current Data
The current ADC scrape is stored in:
```text
data_adc_databaza/adc_scrape_2026_05_04/
```
Main files:
- `adc_product_links.json` - 35k+ ADC product detail URLs.
- `adc_products_structured.json` - main structured dataset for the next pipeline stage.
- `adc_products_structured.failed.json` - products that failed during scraping.
- `adc_products_structured_10.json` - small parser test sample.
Use `adc_products_structured.json` as the main source for new graph and
LightRAG ingestion work.
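Before running later stages, a quick sanity check of the structured file can be useful. The sketch below assumes the file is a JSON array of objects. `lightrag_text` and `graph_hints` are field names mentioned elsewhere in this document; the helper names `load_records` and `summarize_records` are illustrative only and not part of the project.

```python
import json
from pathlib import Path


def load_records(path):
    """Load the dataset; assumes the top level is a JSON array of objects."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def summarize_records(records):
    """Count records and how many carry non-empty `lightrag_text` / `graph_hints`."""
    summary = {"total": 0, "with_lightrag_text": 0, "with_graph_hints": 0}
    for record in records:
        summary["total"] += 1
        if record.get("lightrag_text"):
            summary["with_lightrag_text"] += 1
        if record.get("graph_hints"):
            summary["with_graph_hints"] += 1
    return summary
```

For example, `summarize_records(load_records("data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json"))` returns the three counts as a dictionary.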
## Requirements
- Python 3.10+
- ADC scraper dependencies:
```powershell
pip install -r scripts/adc_scraper/requirements.txt
python -m playwright install chromium
```
- Local embedding dependencies:
```powershell
pip install sentence-transformers fastapi uvicorn
```
- LightRAG package from `lightrag/`
- OpenWebUI-compatible LLM API access configured in `lightrag/.env`
## Scraping Pipeline
Collect ADC product detail links:
```powershell
python scripts/adc_scraper/scrape_adc_product_links.py `
--out data_adc_databaza/adc_scrape_2026_05_04/adc_product_links.json `
--browser
```
Scrape product detail pages and PIL pages into structured JSON:
```powershell
python scripts/adc_scraper/scrape_adc_product_data.py --browser
```
The default output is:
```text
data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json
```
For a small test run:
```powershell
python scripts/adc_scraper/scrape_adc_product_data.py `
--browser `
--limit 10 `
--out data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured_10.json
```
## Start Servers
Start the local embedding server and LightRAG server:
```powershell
cd "c:\Users\Oleh\Desktop\Diplomova praca"
python start_servers.py
```
Keep this terminal open while you work. Press `Ctrl+C` to stop the servers.
Health checks:
```text
http://localhost:8010/health - embedding server
http://localhost:9621/health - LightRAG server
```
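The two health URLs can also be checked from a script. This is a minimal sketch that assumes a healthy server answers with HTTP 200; the names `HEALTH_URLS`, `is_healthy`, and `check_all` are hypothetical helpers, not part of the project.

```python
import urllib.error
import urllib.request

# URLs taken from the health check list in this document.
HEALTH_URLS = {
    "embedding server": "http://localhost:8010/health",
    "LightRAG server": "http://localhost:9621/health",
}


def is_healthy(url, opener=urllib.request.urlopen, timeout=5):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx reply all count as unhealthy.
        return False


def check_all(urls=HEALTH_URLS, opener=urllib.request.urlopen):
    """Map each server name to its health status."""
    return {name: is_healthy(url, opener=opener) for name, url in urls.items()}
```

Calling `check_all()` after `start_servers.py` has finished starting should return `True` for both entries. The `opener` parameter exists only so the check can be exercised without a running server.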
## Build Explicit Knowledge Graph
This step does not require LightRAG servers. It builds a deterministic graph
directly from the structured ADC JSON.
Test on the small sample:
```powershell
python scripts/kg/build_adc_knowledge_graph.py `
--input data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured_10.json `
--out-dir outputs/knowledge_graph_sample
```
Build the full graph:
```powershell
python build_knowledge_graph.py
```
Equivalent explicit command:
```powershell
python scripts/kg/build_adc_knowledge_graph.py `
--input data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json `
--out-dir outputs/knowledge_graph_full
```
Generated files:
```text
outputs/knowledge_graph_full/adc_knowledge_graph.graphml
outputs/knowledge_graph_full/adc_knowledge_triples.jsonl
outputs/knowledge_graph_full/adc_graph_stats.json
```
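To get a first impression of the generated triples, the JSONL file can be tallied by one of its fields. The sketch assumes each non-blank line is a single JSON object. The key name `relation` is a guess, not confirmed by this document, so check the actual file and pass the right key.

```python
import json
from collections import Counter


def count_by_key(lines, key="relation"):
    """Tally values of `key` across JSONL lines, skipping blank lines.

    `key` defaults to a hypothetical field name; adjust it to the real schema.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        triple = json.loads(line)
        counts[triple.get(key, "<missing>")] += 1
    return counts
```

Opening `outputs/knowledge_graph_full/adc_knowledge_triples.jsonl` with `encoding="utf-8"` and passing the file object to `count_by_key` yields a `Counter`, so `.most_common(10)` lists the most frequent values.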
## Ingest New ADC Data Into LightRAG
First start the servers:
```powershell
python start_servers.py
```
Keep that terminal open. In a second terminal, do a dry run first:
```powershell
python scripts/lightrag_ingest/ingest_adc_structured.py `
--input data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured_10.json `
--dry-run `
--limit 5
```
Upload a small clinical batch:
```powershell
python scripts/lightrag_ingest/ingest_adc_structured.py `
--input data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json `
--limit 50 `
--resume
```
Check progress and LightRAG pipeline status:
```powershell
python scripts/lightrag_ingest/ingest_adc_structured.py --status
```
Continue with a larger batch:
```powershell
python scripts/lightrag_ingest/ingest_adc_structured.py `
--input data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json `
--limit 200 `
--resume
```
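The combination of `--limit` and `--resume` suggests a batch selection along the following lines. This is a sketch of the idea, not the script's actual code. It assumes progress is tracked as a set of record identifiers, and the `id` key is a hypothetical field name.

```python
def select_batch(records, done_ids, limit, id_key="id"):
    """Return up to `limit` records whose identifier is not in `done_ids`.

    `id_key` is an assumed field name; records keep their original order.
    """
    batch = []
    for record in records:
        if len(batch) >= limit:
            break
        if record.get(id_key) in done_ids:
            continue  # already uploaded in an earlier run
        batch.append(record)
    return batch
```

With this logic, repeated runs with the same `limit` move through the dataset without uploading a record twice, as long as the set of finished identifiers is saved between runs.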
By default the script uploads only clinically useful records that contain at
least one of: contraindications, interactions, warnings, or side effects. To
upload every record, add:
```powershell
--all-records
```
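The default filter described above can be pictured as follows. The field names in `CLINICAL_FIELDS` are assumptions derived from the prose; the real keys in `adc_products_structured.json` may differ.

```python
# Assumed field names; verify against the actual record schema.
CLINICAL_FIELDS = ("contraindications", "interactions", "warnings", "side_effects")


def is_clinically_useful(record, fields=CLINICAL_FIELDS):
    """True if at least one clinical field is present and non-empty."""
    return any(record.get(field) for field in fields)


def filter_records(records, all_records=False):
    """Keep clinically useful records; `all_records=True` mirrors `--all-records`."""
    if all_records:
        return list(records)
    return [record for record in records if is_clinically_useful(record)]
```

An empty string or empty list counts as missing here, so a record with only blank clinical sections is skipped by default.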
## Old Ingestion Pipeline
Older local experiments used `checkpoint_02_ingest/` to load data from:
```text
data_adc_databaza/cleaned_general_info_additional.json
```
If that folder is present in your local workspace, treat it only as a reference
for older LightRAG upload logic:
```powershell
python checkpoint_02_ingest/load_leaflets.py --count 50
python checkpoint_02_ingest/load_leaflets.py --status
```
Do not treat this as the final ingestion path for the new dataset. The current
ingestion script for the new dataset reads:
```text
data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json
```
and sends each record's `lightrag_text` to LightRAG:
```powershell
python scripts/lightrag_ingest/ingest_adc_structured.py --limit 50 --resume
```
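Sending a record's `lightrag_text` to the server might look roughly like this. The endpoint `/documents/text` and the `{"text": ...}` body are assumptions about the LightRAG server API and should be verified against the installed version. The port comes from the health check URL in this document, and the helper names are illustrative.

```python
import json
import urllib.request

LIGHTRAG_URL = "http://localhost:9621"  # port from the health check URL


def build_text_request(text, base_url=LIGHTRAG_URL, endpoint="/documents/text"):
    """Build a POST request carrying one document.

    The endpoint path and body shape are assumptions, not confirmed here.
    """
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def upload_records(records, send=urllib.request.urlopen):
    """Send each record's non-empty `lightrag_text`; return how many were sent."""
    sent = 0
    for record in records:
        text = record.get("lightrag_text")
        if not text:
            continue  # nothing to ingest for this record
        with send(build_text_request(text), timeout=120) as response:
            response.read()
        sent += 1
    return sent
```

The `send` parameter defaults to `urllib.request.urlopen` and can be replaced with a stub to try the loop without a running server.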
## Query LightRAG
After documents are ingested and LightRAG has finished processing them:
```powershell
python -c "
import urllib.request, json
payload = json.dumps({'query': 'Ake su kontraindikacie Abirateronu?', 'mode': 'hybrid'}).encode()
req = urllib.request.Request('http://localhost:9621/query', data=payload, headers={'Content-Type': 'application/json'})
r = urllib.request.urlopen(req, timeout=120)
print(json.loads(r.read())['response'])
"
```
Available query modes:
- `hybrid` - recommended combined retrieval mode.
- `local` - entity-centered retrieval.
- `global` - broader graph-level retrieval.
- `naive` - vector-only retrieval.
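For repeated queries, the one-liner above can be wrapped in a small helper that rejects unknown modes before sending anything. This sketch only knows the four modes listed here; the server may accept others. The function names are illustrative.

```python
import json
import urllib.request

QUERY_MODES = ("hybrid", "local", "global", "naive")


def build_query_payload(question, mode="hybrid"):
    """Return the JSON body for /query, rejecting modes not listed in this document."""
    if mode not in QUERY_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {QUERY_MODES}")
    return json.dumps({"query": question, "mode": mode}).encode("utf-8")


def ask(question, mode="hybrid", url="http://localhost:9621/query", timeout=120):
    """POST the question and return the `response` field of the reply."""
    request = urllib.request.Request(
        url,
        data=build_query_payload(question, mode),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read())["response"]
```

With the servers running, `ask("Ake su kontraindikacie Abirateronu?", mode="local")` sends the same request as the command above, but in `local` mode.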
Avoid querying while the document pipeline is still busy. Entity extraction can
take several minutes per batch depending on the LLM API and concurrency limits.
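One way to respect this is to poll the pipeline status and query only once it reports idle. The sketch below is generic: `get_status` is a caller-supplied function that returns a dictionary, and the `busy` key is an assumption about what the status report contains.

```python
import time


def wait_until_idle(get_status, poll_seconds=10, max_wait_seconds=1800, sleep=time.sleep):
    """Poll `get_status()` until it reports not busy; return False on timeout.

    `get_status` must return a dict; a missing `busy` key is treated as idle.
    """
    waited = 0
    while waited <= max_wait_seconds:
        if not get_status().get("busy", False):
            return True
        sleep(poll_seconds)
        waited += poll_seconds
    return False
```

How `get_status` obtains its data is left open; it could wrap whatever the `--status` option of the ingestion script reads.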
## Reset LightRAG Storage For a Clean Rebuild
Stop the servers first, then clear generated graph/vector data:
```powershell
Remove-Item -Path "c:\Users\Oleh\Desktop\Diplomova praca\lightrag\rag_storage\*" -Recurse -Force
Remove-Item -LiteralPath "c:\Users\Oleh\Desktop\Diplomova praca\outputs\lightrag_ingest\adc_structured_progress.json" -Force
```
Use this only when you intentionally want to rebuild the graph.
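On systems without PowerShell, the same reset can be sketched in Python. The helper names are hypothetical; the two paths to pass in are the `rag_storage` directory and the progress file shown above. As with the PowerShell commands, stop the servers first.

```python
import shutil
from pathlib import Path


def clear_directory(directory):
    """Delete everything inside `directory` but keep the directory itself."""
    removed = 0
    for entry in Path(directory).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed


def reset_storage(storage_dir, progress_file):
    """Clear LightRAG storage and drop the ingest progress file if present."""
    removed = clear_directory(storage_dir)
    Path(progress_file).unlink(missing_ok=True)
    return removed
```

The return value is the number of top-level entries removed from the storage directory, which gives a quick confirmation that something was actually cleared.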
## Recommended Next Steps
1. Update `validate_adc_json.py` for the new `adc_products_structured.json` schema.
2. Build an explicit knowledge graph from `graph_hints` and PIL subsections.
3. Create a new LightRAG ingestion script for the new dataset.
4. Retry failed scrape URLs from `adc_products_structured.failed.json`.
5. Prepare a small RAGAS evaluation set for contraindication and interaction questions.
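For step 4, the failed URLs first have to be pulled out of `adc_products_structured.failed.json`. The schema of that file is not described here, so this sketch accepts either plain URL strings or objects with a `url` key; both shapes are assumptions.

```python
def extract_failed_urls(entries, url_key="url"):
    """Collect unique URLs in first-seen order.

    Accepts plain strings or dicts holding the URL under `url_key`
    (an assumed key name); entries without a URL are ignored.
    """
    seen = set()
    urls = []
    for entry in entries:
        url = entry if isinstance(entry, str) else entry.get(url_key)
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

The resulting list can then be written to a links file for a second scraping pass, provided the scraper can be pointed at a custom list of links.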
## Project Layout
```text
Diplomova praca/
  start_servers.py
  embedding_server.py
  scripts/adc_scraper/
    scrape_adc_product_links.py
    scrape_adc_product_data.py
    validate_adc_json.py
  data_adc_databaza/
    adc_scrape_2026_05_04/
      adc_product_links.json
      adc_products_structured.json
      adc_products_structured.failed.json
      adc_products_structured_10.json
  checkpoint_02_ingest/
    load_leaflets.py
    batch_ingest.py
    progress.json
  lightrag/
    .env
    rag_storage/
```
## Troubleshooting
If the embedding server does not start, reinstall its dependencies:
```powershell
pip install sentence-transformers fastapi uvicorn
```
If LightRAG fails with encoding errors, enable Python's UTF-8 mode and start the server directly:
```powershell
$env:PYTHONUTF8 = "1"
python -m lightrag.api.lightrag_server
```
If LLM extraction times out, reduce concurrency in `lightrag/.env`:
```text
MAX_ASYNC=3
MAX_PARALLEL_INSERT=1
```
If the graph looks empty after ingestion, wait for background processing to
finish and check the pipeline status:
```powershell
python scripts/lightrag_ingest/ingest_adc_structured.py --status
```