# Run Instructions - LightRAG ADC Knowledge Graph

This project prepares ADC pharmaceutical leaflet data for a knowledge graph and
LightRAG-based question answering about drug interactions, contraindications,
warnings, indications, dosage, and side effects.

## Current Data

The current ADC scrape is stored in:

```powershell
data_adc_databaza/adc_scrape_2026_05_04/
```

Main files:

- `adc_product_links.json` - 35k+ ADC product detail URLs.
- `adc_products_structured.json` - main structured dataset for the next pipeline stage.
- `adc_products_structured.failed.json` - products that failed during scraping.
- `adc_products_structured_10.json` - small parser test sample.

Use `adc_products_structured.json` as the main source for new graph and
LightRAG ingestion work.

## Requirements

- Python 3.10+
- ADC scraper dependencies:

```powershell
pip install -r scripts/adc_scraper/requirements.txt
python -m playwright install chromium
```

- Local embedding dependencies:

```powershell
pip install sentence-transformers fastapi uvicorn
```

- LightRAG package from `lightrag/`
- OpenWebUI-compatible LLM API access configured in `lightrag/.env`

## Scraping Pipeline

Collect ADC product detail links:

```powershell
python scripts/adc_scraper/scrape_adc_product_links.py `
  --out data_adc_databaza/adc_scrape_2026_05_04/adc_product_links.json `
  --browser
```

Scrape product detail pages and PIL pages into structured JSON:

```powershell
python scripts/adc_scraper/scrape_adc_product_data.py --browser
```

The default output is:

```powershell
data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json
```

For a small test run:

```powershell
python scripts/adc_scraper/scrape_adc_product_data.py `
  --browser `
  --limit 10 `
  --out data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured_10.json
```

## Start Servers

Start the local embedding server and LightRAG server:

```powershell
cd "c:\Users\Oleh\Desktop\Diplomova praca"
python start_servers.py
```

Keep this terminal open. Stop with `Ctrl+C`.

Health checks:

```text
http://localhost:8010/health - embedding server
http://localhost:9621/health - LightRAG server
```

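The two health endpoints can also be checked from Python instead of a browser. The following is a minimal sketch, assuming that a healthy server answers a plain GET with HTTP 200. The names `HEALTH_URLS`, `is_healthy`, and `check_all` are illustrative and not part of the project.

```python
import urllib.request

# URLs taken from the health-check list in this section.
HEALTH_URLS = {
    "embedding server": "http://localhost:8010/health",
    "LightRAG server": "http://localhost:9621/health",
}


def is_healthy(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if a GET to `url` answers with HTTP 200 (assumed meaning of healthy)."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # urllib.error.URLError and HTTPError are subclasses of OSError,
        # so refused connections, timeouts, and HTTP errors all land here.
        return False


def check_all(urls=HEALTH_URLS, opener=urllib.request.urlopen):
    """Map each service name to True (reachable) or False."""
    return {name: is_healthy(url, opener=opener) for name, url in urls.items()}
```

Calling `check_all()` after `python start_servers.py` has finished starting up should return `True` for both services; a `False` entry points to the server that needs attention.
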
## Old Ingestion Pipeline

The folder `checkpoint_02_ingest/` contains an older ingestion pipeline that
loads data from:

```powershell
data_adc_databaza/cleaned_general_info_additional.json
```

It is kept as a reference because it already contains working LightRAG upload
logic and progress tracking:

```powershell
python checkpoint_02_ingest/load_leaflets.py --count 50
python checkpoint_02_ingest/load_leaflets.py --status
```

Do not treat this as the final ingestion path for the new dataset. The next
step is to create a new ingestion script that reads:

```powershell
data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json
```

and sends each record's `lightrag_text` to LightRAG.

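The planned ingestion step could look roughly like the sketch below. It is not the project's script. It assumes that the dataset is a JSON list of objects and that the LightRAG server accepts `POST /documents/text` with a body of the form `{"text": ...}`; verify both against the actual file and the installed LightRAG version. The function names are hypothetical.

```python
import json
import urllib.request
from pathlib import Path

DATASET = Path("data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json")
# Assumption: text insert endpoint of the LightRAG API server.
INSERT_URL = "http://localhost:9621/documents/text"


def iter_texts(records):
    """Yield non-empty `lightrag_text` values, skipping records without one."""
    for record in records:
        text = (record.get("lightrag_text") or "").strip()
        if text:
            yield text


def post_text(text, url=INSERT_URL, timeout=120):
    """Send one document to LightRAG and return the decoded JSON reply."""
    payload = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())


def ingest(records, send=post_text, limit=None):
    """Send up to `limit` texts through `send`; return how many were sent."""
    sent = 0
    for text in iter_texts(records):
        if limit is not None and sent >= limit:
            break
        send(text)
        sent += 1
    return sent
```

A first trial run might be `ingest(json.loads(DATASET.read_text(encoding="utf-8")), limit=50)`. The sketch has no progress tracking or retry logic; those parts can be borrowed from `checkpoint_02_ingest/load_leaflets.py`.
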
## Query LightRAG

After documents are ingested and LightRAG has finished processing them:

```powershell
python -c "
import urllib.request, json
payload = json.dumps({'query': 'Ake su kontraindikacie Abirateronu?', 'mode': 'hybrid'}).encode()
req = urllib.request.Request('http://localhost:9621/query', data=payload, headers={'Content-Type': 'application/json'})
r = urllib.request.urlopen(req, timeout=120)
print(json.loads(r.read())['response'])
"
```

Available query modes:

- `hybrid` - recommended combined retrieval mode.
- `local` - entity-centered retrieval.
- `global` - broader graph-level retrieval.
- `naive` - vector-only retrieval.

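To see how the modes differ on the same question, the one-liner above can be wrapped in a small helper. This is a sketch that reuses the `/query` request shape and the `response` key from that command; `build_payload`, `ask`, and `compare_modes` are illustrative names, and `MODES` only covers the four modes listed in this document.

```python
import json
import urllib.request

QUERY_URL = "http://localhost:9621/query"
MODES = ("hybrid", "local", "global", "naive")  # modes listed in this document


def build_payload(question, mode="hybrid"):
    """Build the JSON body for /query, rejecting modes not listed above."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {MODES}")
    return {"query": question, "mode": mode}


def ask(question, mode="hybrid", url=QUERY_URL, timeout=120):
    """Send one question to LightRAG and return the answer text."""
    payload = json.dumps(build_payload(question, mode)).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


def compare_modes(question, modes=MODES, ask_fn=ask):
    """Ask the same question once per mode; return {mode: answer}."""
    return {mode: ask_fn(question, mode) for mode in modes}
```

For example, `compare_modes("Ake su kontraindikacie Abirateronu?")` sends four requests in sequence, so expect it to take noticeably longer than a single query.
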
Avoid querying while the document pipeline is still busy. Entity extraction can
take several minutes per batch depending on the LLM API and concurrency limits.

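One way to avoid querying too early is to poll the server until the pipeline reports that it is idle. The sketch below assumes that the LightRAG server exposes `GET /documents/pipeline_status` and that the reply contains a boolean `busy` field; confirm this against the installed version before relying on it. The function names and the default intervals are illustrative.

```python
import json
import time
import urllib.request

# Assumption: pipeline status endpoint of the LightRAG API server.
STATUS_URL = "http://localhost:9621/documents/pipeline_status"


def fetch_status(url=STATUS_URL, timeout=10):
    """Return the decoded JSON status document."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read())


def wait_until_idle(fetch=fetch_status, poll_seconds=15.0, max_wait=1800.0,
                    sleep=time.sleep):
    """Poll until the status is no longer busy.

    Returns True once the pipeline is idle, or False if it is still busy
    after roughly `max_wait` seconds of waiting.
    """
    waited = 0.0
    while True:
        if not fetch().get("busy", False):
            return True
        if waited >= max_wait:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
```

Calling `wait_until_idle()` before the first query keeps the query from competing with entity extraction for the same LLM API.
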
## Reset LightRAG Storage

Stop the servers first, then clear generated graph/vector data:

```powershell
Remove-Item -Path "c:\Users\Oleh\Desktop\Diplomova praca\lightrag\rag_storage\*" -Recurse -Force
python checkpoint_02_ingest/load_leaflets.py --reset
```

Use this only when you intentionally want to rebuild the graph.

## Recommended Next Steps

1. Update `validate_adc_json.py` for the new `adc_products_structured.json` schema.
2. Build an explicit knowledge graph from `graph_hints` and PIL subsections.
3. Create a new LightRAG ingestion script for the new dataset.
4. Retry failed scrape URLs from `adc_products_structured.failed.json`.
5. Prepare a small RAGAS evaluation set for contraindication and interaction questions.

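For step 1, a validation pass over the new dataset might start from the sketch below. It is not the existing `validate_adc_json.py`. Only the field names `lightrag_text` and `graph_hints` come from this document; the assumption that the dataset is a list of JSON objects, and the function names, are hypothetical.

```python
def validate_record(record, index):
    """Return a list of problem descriptions for one record (empty if it passes)."""
    if not isinstance(record, dict):
        return [f"record {index}: expected an object, got {type(record).__name__}"]
    problems = []
    text = record.get("lightrag_text")
    if not isinstance(text, str) or not text.strip():
        problems.append(f"record {index}: missing or empty lightrag_text")
    if "graph_hints" not in record:
        problems.append(f"record {index}: missing graph_hints")
    return problems


def validate_dataset(records):
    """Collect the problems of all records, in dataset order."""
    problems = []
    for index, record in enumerate(records):
        problems.extend(validate_record(record, index))
    return problems
```

Running `validate_dataset` on the 10-record sample first is a cheap way to check the assumed field names before validating the full file.
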
## Project Layout

```text
Diplomova praca/
  start_servers.py
  embedding_server.py
  scripts/adc_scraper/
    scrape_adc_product_links.py
    scrape_adc_product_data.py
    validate_adc_json.py
  data_adc_databaza/
    adc_scrape_2026_05_04/
      adc_product_links.json
      adc_products_structured.json
      adc_products_structured.failed.json
      adc_products_structured_10.json
  checkpoint_02_ingest/
    load_leaflets.py
    batch_ingest.py
    progress.json
  lightrag/
    .env
    rag_storage/
```

## Troubleshooting

If the embedding server does not start:

```powershell
pip install sentence-transformers fastapi uvicorn
```

If LightRAG has encoding issues:

```powershell
$env:PYTHONUTF8 = "1"
python -m lightrag.api.lightrag_server
```

If LLM extraction times out, reduce concurrency in `lightrag/.env`:

```text
MAX_ASYNC=3
MAX_PARALLEL_INSERT=1
```

If the graph looks empty after ingestion, wait for background processing and
check:

```powershell
python checkpoint_02_ingest/load_leaflets.py --status
```