Run Instructions - LightRAG ADC Knowledge Graph
This project prepares ADC pharmaceutical leaflet data for a knowledge graph and LightRAG-based question answering about drug interactions, contraindications, warnings, indications, dosage, and side effects.
Current Data
The current ADC scrape is stored in:
data_adc_databaza/adc_scrape_2026_05_04/
Main files:
- adc_product_links.json - 35k+ ADC product detail URLs.
- adc_products_structured.json - main structured dataset for the next pipeline stage.
- adc_products_structured.failed.json - products that failed during scraping.
- adc_products_structured_10.json - small parser test sample.
Use adc_products_structured.json as the main source for new graph and
LightRAG ingestion work.
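Before building on the dataset, it can help to check how many records are usable. The sketch below assumes the file is a JSON array of objects and that each record carries a lightrag_text field (the field named in the ingestion notes of this document); the rest of the schema is not assumed. The helper names summarize_records and load_records are illustrative, not part of the project.

```python
import json
from pathlib import Path

# Path taken from this README; adjust if the scrape folder differs.
DATASET = Path("data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json")


def summarize_records(records):
    """Count records and how many carry a non-empty 'lightrag_text' string.

    Entries that are not dicts are counted as missing text.
    """
    total = len(records)
    with_text = sum(
        1
        for rec in records
        if isinstance(rec, dict)
        and isinstance(rec.get("lightrag_text"), str)
        and rec["lightrag_text"].strip()
    )
    return {"total": total, "with_text": with_text, "missing_text": total - with_text}


def load_records(path=DATASET):
    """Load the dataset, assuming the top level is a JSON array."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of product records")
    return data
```

Calling summarize_records(load_records()) from the project root prints nothing by itself; wrap it in print() to see the counts.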
Requirements
- Python 3.10+
- ADC scraper dependencies:
pip install -r scripts/adc_scraper/requirements.txt
python -m playwright install chromium
- Local embedding dependencies:
pip install sentence-transformers fastapi uvicorn
- LightRAG package from lightrag/
- OpenWebUI-compatible LLM API access configured in lightrag/.env
Scraping Pipeline
Collect ADC product detail links:
python scripts/adc_scraper/scrape_adc_product_links.py `
--out data_adc_databaza/adc_scrape_2026_05_04/adc_product_links.json `
--browser
Scrape product detail pages and PIL pages into structured JSON:
python scripts/adc_scraper/scrape_adc_product_data.py --browser
The default output is:
data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json
For a small test run:
python scripts/adc_scraper/scrape_adc_product_data.py `
--browser `
--limit 10 `
--out data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured_10.json
Start Servers
Start the local embedding server and LightRAG server:
cd "c:\Users\Oleh\Desktop\Diplomova praca"
python start_servers.py
Keep this terminal open. Stop with Ctrl+C.
Health checks:
http://localhost:8010/health - embedding server
http://localhost:9621/health - LightRAG server
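The two health checks can also be scripted. This is a minimal sketch using only the standard library; it treats an HTTP 200 response as healthy, which is an assumption about both servers, and the function names are illustrative.

```python
import urllib.error
import urllib.request

# Health endpoints listed in this README.
ENDPOINTS = {
    "embedding server": "http://localhost:8010/health",
    "LightRAG server": "http://localhost:9621/health",
}


def is_healthy(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with HTTP 200, False on any connection error."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False


def check_all(endpoints=ENDPOINTS, probe=is_healthy):
    """Map each server name to a boolean health result."""
    return {name: probe(url) for name, url in endpoints.items()}
```

Running print(check_all()) while start_servers.py is active should report both servers; a False value means the server is not reachable yet or failed to start.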
Old Ingestion Pipeline
The folder checkpoint_02_ingest/ contains an older ingestion pipeline that
loads data from:
data_adc_databaza/cleaned_general_info_additional.json
It is kept as a reference because it already contains working LightRAG upload logic and progress tracking:
python checkpoint_02_ingest/load_leaflets.py --count 50
python checkpoint_02_ingest/load_leaflets.py --status
Do not treat this as the final ingestion path for the new dataset. The next step is to create a new ingestion script that reads:
data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json
and sends each record's lightrag_text to LightRAG.
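A possible starting point for that script is sketched below. It is not the project's implementation. It assumes the LightRAG server accepts a single document as JSON at POST /documents/text with a text field, that no API key is required, and that records are objects with a lightrag_text string. Progress tracking, which the older pipeline already has, is left out for brevity. The names post_text and ingest_records are hypothetical.

```python
import json
import urllib.request

# Assumption: the LightRAG server accepts single documents at this endpoint.
INSERT_URL = "http://localhost:9621/documents/text"


def post_text(text, url=INSERT_URL, timeout=120):
    """POST one document as JSON and return the decoded JSON response."""
    payload = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())


def ingest_records(records, send=post_text, limit=None):
    """Send each record's 'lightrag_text' through `send`.

    Returns (sent, skipped). Records without usable text are skipped.
    An exception raised by `send` is not caught, so a failed upload stops the run.
    """
    sent = skipped = 0
    for rec in records:
        if limit is not None and sent >= limit:
            break
        text = rec.get("lightrag_text") if isinstance(rec, dict) else None
        if not isinstance(text, str) or not text.strip():
            skipped += 1
            continue
        send(text)
        sent += 1
    return sent, skipped
```

Passing limit=10 with records loaded from adc_products_structured_10.json gives a small end-to-end test before the full dataset is sent.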
Query LightRAG
After documents are ingested and LightRAG has finished processing them:
python -c "
import urllib.request, json
payload = json.dumps({'query': 'Ake su kontraindikacie Abirateronu?', 'mode': 'hybrid'}).encode()
req = urllib.request.Request('http://localhost:9621/query', data=payload, headers={'Content-Type': 'application/json'})
r = urllib.request.urlopen(req, timeout=120)
print(json.loads(r.read())['response'])
"
Available query modes:
- hybrid - recommended combined retrieval mode.
- local - entity-centered retrieval.
- global - broader graph-level retrieval.
- naive - vector-only retrieval.
Avoid querying while the document pipeline is still busy. Entity extraction can take several minutes per batch depending on the LLM API and concurrency limits.
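One way to avoid querying too early is to poll the server until the pipeline is idle. The sketch below assumes the LightRAG server exposes GET /documents/pipeline_status and that its JSON response contains a boolean busy field; verify both against the installed LightRAG version. The function names and default intervals are illustrative.

```python
import json
import time
import urllib.request

# Assumption: the server reports pipeline state here with a boolean "busy" field.
STATUS_URL = "http://localhost:9621/documents/pipeline_status"


def fetch_busy(url=STATUS_URL, timeout=10):
    """Return the 'busy' flag from the status endpoint (False if the field is absent)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return bool(json.loads(resp.read()).get("busy", False))


def wait_until_idle(is_busy=fetch_busy, poll_seconds=15.0, max_wait=3600.0, sleep=time.sleep):
    """Poll `is_busy` until it returns False.

    Returns True once the pipeline is idle, or False if `max_wait` seconds
    of sleeping have elapsed while it was still busy.
    """
    waited = 0.0
    while is_busy():
        if waited >= max_wait:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
    return True
```

Call wait_until_idle() after ingestion and run the query only when it returns True.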
Reset LightRAG Storage
Stop the servers first, then clear generated graph/vector data:
Remove-Item -Path "c:\Users\Oleh\Desktop\Diplomova praca\lightrag\rag_storage\*" -Recurse -Force
python checkpoint_02_ingest/load_leaflets.py --reset
Use this only when you intentionally want to rebuild the graph.
Recommended Next Steps
- Update validate_adc_json.py for the new adc_products_structured.json schema.
- Build an explicit knowledge graph from graph_hints and PIL subsections.
- Create a new LightRAG ingestion script for the new dataset.
- Retry failed scrape URLs from adc_products_structured.failed.json.
- Prepare a small RAGAS evaluation set for contraindication and interaction questions.
Project Layout
Diplomova praca/
  start_servers.py
  embedding_server.py
  scripts/adc_scraper/
    scrape_adc_product_links.py
    scrape_adc_product_data.py
    validate_adc_json.py
  data_adc_databaza/
    adc_scrape_2026_05_04/
      adc_product_links.json
      adc_products_structured.json
      adc_products_structured.failed.json
      adc_products_structured_10.json
  checkpoint_02_ingest/
    load_leaflets.py
    batch_ingest.py
    progress.json
  lightrag/
    .env
    rag_storage/
Troubleshooting
If the embedding server does not start:
pip install sentence-transformers fastapi uvicorn
If LightRAG has encoding issues:
$env:PYTHONUTF8 = "1"
python -m lightrag.api.lightrag_server
If LLM extraction times out, reduce concurrency in lightrag/.env:
MAX_ASYNC=3
MAX_PARALLEL_INSERT=1
If the graph looks empty after ingestion, wait for background processing and check:
python checkpoint_02_ingest/load_leaflets.py --status