Run Instructions - LightRAG ADC Knowledge Graph
This project prepares ADC pharmaceutical leaflet data for a knowledge graph and LightRAG-based question answering about drug interactions, contraindications, warnings, indications, dosage, and side effects.
Current Data
The current ADC scrape is stored in:
data_adc_databaza/adc_scrape_2026_05_04/
Main files:
- adc_product_links.json - 35k+ ADC product detail URLs.
- adc_products_structured.json - main structured dataset for the next pipeline stage.
- adc_products_structured.failed.json - products that failed during scraping.
- adc_products_structured_10.json - small parser test sample.
Use adc_products_structured.json as the main source for new graph and
LightRAG ingestion work.
Requirements
- Python 3.10+
- ADC scraper dependencies:
pip install -r scripts/adc_scraper/requirements.txt
python -m playwright install chromium
- Local embedding dependencies:
pip install sentence-transformers fastapi uvicorn
- LightRAG package from lightrag/
- OpenWebUI-compatible LLM API access configured in lightrag/.env
Scraping Pipeline
Collect ADC product detail links:
python scripts/adc_scraper/scrape_adc_product_links.py `
--out data_adc_databaza/adc_scrape_2026_05_04/adc_product_links.json `
--browser
Scrape product detail pages and PIL pages into structured JSON:
python scripts/adc_scraper/scrape_adc_product_data.py --browser
The default output is:
data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json
For a small test run:
python scripts/adc_scraper/scrape_adc_product_data.py `
--browser `
--limit 10 `
--out data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured_10.json
Start Servers
Start the local embedding server and LightRAG server:
cd "c:\Users\Oleh\Desktop\Diplomova praca"
python start_servers.py
Keep this terminal open. Stop with Ctrl+C.
Health checks:
http://localhost:8010/health - embedding server
http://localhost:9621/health - LightRAG server
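The two health checks can also be scripted, which is handy before starting an ingestion run. The following is a minimal sketch using only the standard library; the URLs come from the list above, while the function names `http_status` and `check_servers` are illustrative and not part of the project.

```python
import urllib.error
import urllib.request

# URLs taken from the health-check list above.
HEALTH_URLS = {
    "embedding server": "http://localhost:8010/health",
    "LightRAG server": "http://localhost:9621/health",
}


def http_status(url, timeout=5):
    """Return the HTTP status code for url, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return exc.code
    except OSError:
        # Connection refused, timeout, DNS failure, and similar.
        return None


def check_servers(urls=HEALTH_URLS, fetch=http_status):
    """Map each server name to True when its health URL answers with 200."""
    return {name: fetch(url) == 200 for name, url in urls.items()}
```

Calling `check_servers()` while `start_servers.py` is running should report both servers as `True`; the `fetch` parameter exists only so the logic can be exercised without a live server.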
Build Explicit Knowledge Graph
This step does not require LightRAG servers. It builds a deterministic graph directly from the structured ADC JSON.
Test on the small sample:
python scripts/kg/build_adc_knowledge_graph.py `
--input data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured_10.json `
--out-dir outputs/knowledge_graph_sample
Build the full graph:
python build_knowledge_graph.py
Equivalent explicit command:
python scripts/kg/build_adc_knowledge_graph.py `
--input data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json `
--out-dir outputs/knowledge_graph_full
Generated files:
outputs/knowledge_graph_full/adc_knowledge_graph.graphml
outputs/knowledge_graph_full/adc_knowledge_triples.jsonl
outputs/knowledge_graph_full/adc_graph_stats.json
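A quick sanity check on the generated GraphML file is to count its nodes and edges and compare them with adc_graph_stats.json. The sketch below relies only on the standard GraphML element names (`node`, `edge`) and makes no assumption about the attributes the build script writes; `graphml_counts` is an illustrative helper, not part of the project.

```python
import xml.etree.ElementTree as ET


def graphml_counts(source):
    """Count <node> and <edge> elements in a GraphML file path or file object."""
    nodes = edges = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Strip an XML namespace prefix such as "{http://...}node" if present.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "node":
            nodes += 1
        elif local_name == "edge":
            edges += 1
        # Drop the element's content after counting to keep memory use modest.
        elem.clear()
    return {"nodes": nodes, "edges": edges}
```

For the full graph this would be called as `graphml_counts("outputs/knowledge_graph_full/adc_knowledge_graph.graphml")`.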
Ingest New ADC Data Into LightRAG
First start the servers:
python start_servers.py
Keep that terminal open. In a second terminal, start with a dry run:
python scripts/lightrag_ingest/ingest_adc_structured.py `
--input data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured_10.json `
--dry-run `
--limit 5
Upload a small clinical batch:
python scripts/lightrag_ingest/ingest_adc_structured.py `
--input data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json `
--limit 50 `
--resume
Check progress and LightRAG pipeline status:
python scripts/lightrag_ingest/ingest_adc_structured.py --status
Continue with a larger batch:
python scripts/lightrag_ingest/ingest_adc_structured.py `
--input data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json `
--limit 200 `
--resume
By default the script uploads only clinically useful records that contain at least one of: contraindications, interactions, warnings, or side effects. To upload every record, add:
--all-records
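The default filter described above can be pictured roughly as follows. This is a sketch of the idea, not the script's actual code: the field names in `CLINICAL_FIELDS` are hypothetical and may not match the keys used in adc_products_structured.json.

```python
# Hypothetical field names; the real keys in the structured JSON may differ.
CLINICAL_FIELDS = ("contraindications", "interactions", "warnings", "side_effects")


def is_clinically_useful(record, fields=CLINICAL_FIELDS):
    """True when at least one clinical field holds non-empty content."""
    for name in fields:
        value = record.get(name)
        if isinstance(value, str):
            value = value.strip()  # whitespace-only text counts as empty
        if value:
            return True
    return False


def select_records(records, all_records=False):
    """Apply the default filter; all_records=True mirrors --all-records."""
    if all_records:
        return list(records)
    return [r for r in records if is_clinically_useful(r)]
```

Filtering this way keeps LLM entity extraction focused on records that can actually answer contraindication and interaction questions.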
Old Ingestion Pipeline
Older local experiments used checkpoint_02_ingest/ to load data from:
data_adc_databaza/cleaned_general_info_additional.json
If that folder is present in your local workspace, treat it only as a reference for older LightRAG upload logic:
python checkpoint_02_ingest/load_leaflets.py --count 50
python checkpoint_02_ingest/load_leaflets.py --status
Do not treat this as the final ingestion path for the new dataset. The current ingestion script for the new dataset reads:
data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json
and sends each record's lightrag_text to LightRAG:
python scripts/lightrag_ingest/ingest_adc_structured.py --limit 50 --resume
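Sending one record's lightrag_text could look roughly like the sketch below. It assumes the LightRAG server accepts a POST to /documents/text with a JSON body containing a "text" field; check this against the API documentation of your installed LightRAG version. The helper name `build_insert_request` is illustrative and is not taken from ingest_adc_structured.py.

```python
import json
import urllib.request

# Assumption: POST /documents/text with {"text": ...} inserts one document.
INSERT_URL = "http://localhost:9621/documents/text"


def build_insert_request(record, url=INSERT_URL):
    """Build (but do not send) an upload request for one structured record."""
    text = (record.get("lightrag_text") or "").strip()
    if not text:
        return None  # nothing to upload for this record
    body = json.dumps({"text": text}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The returned request can be passed to `urllib.request.urlopen(req, timeout=120)` once the servers are running. Encoding the body as UTF-8 keeps Slovak diacritics in the leaflet text intact.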
Query LightRAG
After documents are ingested and LightRAG has finished processing them:
python -c "
import urllib.request, json
payload = json.dumps({'query': 'Ake su kontraindikacie Abirateronu?', 'mode': 'hybrid'}).encode()
req = urllib.request.Request('http://localhost:9621/query', data=payload, headers={'Content-Type': 'application/json'})
r = urllib.request.urlopen(req, timeout=120)
print(json.loads(r.read())['response'])
"
Available query modes:
- hybrid - recommended combined retrieval mode.
- local - entity-centered retrieval.
- global - broader graph-level retrieval.
- naive - vector-only retrieval.
Avoid querying while the document pipeline is still busy. Entity extraction can take several minutes per batch depending on the LLM API and concurrency limits.
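To compare the query modes on the same question, the one-liner above can be wrapped in a small helper. This sketch reuses the /query endpoint, payload keys, and 'response' field from the command above; the function names `build_query` and `ask` are illustrative.

```python
import json
import urllib.request

QUERY_URL = "http://localhost:9621/query"
MODES = ("hybrid", "local", "global", "naive")


def build_query(question, mode="hybrid", url=QUERY_URL):
    """Build the POST request for one question in one retrieval mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {MODES}")
    payload = json.dumps({"query": question, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )


def ask(question, mode="hybrid", timeout=120):
    """Send the question to LightRAG and return the 'response' text."""
    with urllib.request.urlopen(build_query(question, mode), timeout=timeout) as r:
        return json.loads(r.read())["response"]
```

Looping over `MODES` and calling `ask(question, mode)` for each gives a side-by-side view of how retrieval mode changes the answer.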
Reset LightRAG Storage For a Clean Rebuild
Stop the servers first, then clear generated graph/vector data:
Remove-Item -Path "c:\Users\Oleh\Desktop\Diplomova praca\lightrag\rag_storage\*" -Recurse -Force
Remove-Item -LiteralPath "c:\Users\Oleh\Desktop\Diplomova praca\outputs\lightrag_ingest\adc_structured_progress.json" -Force
Use this only when you intentionally want to rebuild the graph.
Recommended Next Steps
- Update validate_adc_json.py for the new adc_products_structured.json schema.
- Build an explicit knowledge graph from graph_hints and PIL subsections.
- Create a new LightRAG ingestion script for the new dataset.
- Retry failed scrape URLs from adc_products_structured.failed.json.
- Prepare a small RAGAS evaluation set for contraindication and interaction questions.
Project Layout
Diplomova praca/
  start_servers.py
  embedding_server.py
  scripts/adc_scraper/
    scrape_adc_product_links.py
    scrape_adc_product_data.py
    validate_adc_json.py
  data_adc_databaza/
    adc_scrape_2026_05_04/
      adc_product_links.json
      adc_products_structured.json
      adc_products_structured.failed.json
      adc_products_structured_10.json
  checkpoint_02_ingest/
    load_leaflets.py
    batch_ingest.py
    progress.json
  lightrag/
    .env
    rag_storage/
Troubleshooting
If the embedding server does not start:
pip install sentence-transformers fastapi uvicorn
If LightRAG has encoding issues:
$env:PYTHONUTF8 = "1"
python -m lightrag.api.lightrag_server
If LLM extraction times out, reduce concurrency in lightrag/.env:
MAX_ASYNC=3
MAX_PARALLEL_INSERT=1
If the graph looks empty after ingestion, wait for background processing and check:
python scripts/lightrag_ingest/ingest_adc_structured.py --status