Oleh Poiasnik e1fdd93261 Add simple knowledge graph build wrapper

2026-05-14 12:44:43 +02:00

7.4 KiB


Run Instructions - LightRAG ADC Knowledge Graph

This project prepares ADC pharmaceutical leaflet data for a knowledge graph and LightRAG-based question answering about drug interactions, contraindications, warnings, indications, dosage, and side effects.

Current Data

The current ADC scrape is stored in:

data_adc_databaza/adc_scrape_2026_05_04/

Main files:

adc_product_links.json - 35k+ ADC product detail URLs.
adc_products_structured.json - main structured dataset for the next pipeline stage.
adc_products_structured.failed.json - products that failed during scraping.
adc_products_structured_10.json - small parser test sample.

Use adc_products_structured.json as the main source for new graph and LightRAG ingestion work.
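
Before building on the dataset, it can help to see which fields are actually populated. The sketch below is not part of the project; it assumes the file is a JSON array of objects and makes no assumption about key names, so it reports whatever top-level keys it finds.

```python
import json
from collections import Counter
from pathlib import Path

DATASET = Path("data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json")


def field_coverage(records):
    """Count how many records have a non-empty value for each top-level key."""
    counts = Counter()
    for record in records:
        for key, value in record.items():
            if value:  # skips None, "", [], {} and 0
                counts[key] += 1
    return counts


if __name__ == "__main__":
    if DATASET.exists():
        # Assumption: the file holds a JSON array of product objects.
        records = json.loads(DATASET.read_text(encoding="utf-8"))
        for key, n in field_coverage(records).most_common():
            print(f"{key}: {n}/{len(records)}")
    else:
        print("Dataset not found:", DATASET)
```

Run it from the project root. Fields with low coverage are worth checking against adc_products_structured.failed.json before ingestion.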

Requirements

Python 3.10+
ADC scraper dependencies:

pip install -r scripts/adc_scraper/requirements.txt
python -m playwright install chromium

Local embedding dependencies:

pip install sentence-transformers fastapi uvicorn

LightRAG package from lightrag/
OpenWebUI-compatible LLM API access configured in lightrag/.env

Scraping Pipeline

Collect ADC product detail links:

python scripts/adc_scraper/scrape_adc_product_links.py `
  --out data_adc_databaza/adc_scrape_2026_05_04/adc_product_links.json `
  --browser

Scrape product detail pages and PIL pages into structured JSON:

python scripts/adc_scraper/scrape_adc_product_data.py --browser

The default output is:

data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json

For a small test run:

python scripts/adc_scraper/scrape_adc_product_data.py `
  --browser `
  --limit 10 `
  --out data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured_10.json

Start Servers

Start the local embedding server and LightRAG server from the project root (adjust the path below to your own checkout):

cd "c:\Users\Oleh\Desktop\Diplomova praca"
python start_servers.py

Keep this terminal open. Stop with Ctrl+C.

Health checks:

http://localhost:8010/health   - embedding server
http://localhost:9621/health   - LightRAG server
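
The two health URLs above can also be checked from a script, for example before starting an ingestion batch. This is a minimal sketch using only the standard library; the helper name is_healthy and its opener parameter are illustrative, and it only checks for an HTTP 200 answer without interpreting the response body.

```python
import urllib.error
import urllib.request

HEALTH_URLS = {
    "embedding server": "http://localhost:8010/health",
    "LightRAG server": "http://localhost:9621/health",
}


def is_healthy(url, timeout=5, opener=urllib.request.urlopen):
    """Return True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts and HTTP error statuses.
        return False


if __name__ == "__main__":
    for name, url in HEALTH_URLS.items():
        state = "ok" if is_healthy(url) else "not reachable"
        print(f"{name}: {state}")
```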

Build Explicit Knowledge Graph

This step does not require LightRAG servers. It builds a deterministic graph directly from the structured ADC JSON.

Test on the small sample:

python scripts/kg/build_adc_knowledge_graph.py `
  --input data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured_10.json `
  --out-dir outputs/knowledge_graph_sample

Build the full graph:

python build_knowledge_graph.py

Equivalent explicit command:

python scripts/kg/build_adc_knowledge_graph.py `
  --input data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json `
  --out-dir outputs/knowledge_graph_full

Generated files:

outputs/knowledge_graph_full/adc_knowledge_graph.graphml
outputs/knowledge_graph_full/adc_knowledge_triples.jsonl
outputs/knowledge_graph_full/adc_graph_stats.json
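
To get a quick look at the triples file, the lines can be tallied by relation type. The layout of adc_knowledge_triples.jsonl is not described here, so this sketch assumes one JSON object per line and a hypothetical "predicate" key; change the key argument to match the real output.

```python
import json
from collections import Counter


def count_by_key(lines, key="predicate"):
    """Count values of `key` across JSONL lines; blank lines are skipped."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        # "predicate" is an assumed key name, not confirmed by the project.
        counts[row.get(key, "<missing>")] += 1
    return counts


if __name__ == "__main__":
    path = "outputs/knowledge_graph_full/adc_knowledge_triples.jsonl"
    try:
        with open(path, encoding="utf-8") as handle:
            for value, n in count_by_key(handle).most_common(10):
                print(n, value)
    except FileNotFoundError:
        print("Build the graph first:", path)
```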

Ingest New ADC Data Into LightRAG

First start the servers:

python start_servers.py

Keep that terminal open. In a second terminal, do a dry run first:

python scripts/lightrag_ingest/ingest_adc_structured.py `
  --input data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured_10.json `
  --dry-run `
  --limit 5

Upload a small clinical batch:

python scripts/lightrag_ingest/ingest_adc_structured.py `
  --input data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json `
  --limit 50 `
  --resume

Check progress and LightRAG pipeline status:

python scripts/lightrag_ingest/ingest_adc_structured.py --status

Continue with a larger batch:

python scripts/lightrag_ingest/ingest_adc_structured.py `
  --input data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json `
  --limit 200 `
  --resume

By default the script uploads only clinically useful records that contain at least one of: contraindications, interactions, warnings, or side effects. To upload every record, add:

--all-records
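
The default selection described above can be pictured as a simple filter. This is only an illustration of the rule, not the code in ingest_adc_structured.py: the key names in CLINICAL_FIELDS are hypothetical and the real record schema may differ.

```python
# Hypothetical key names; check the real schema in adc_products_structured.json.
CLINICAL_FIELDS = ("contraindications", "interactions", "warnings", "side_effects")


def is_clinically_useful(record, fields=CLINICAL_FIELDS):
    """True if at least one clinical field is present and non-empty."""
    return any(record.get(field) for field in fields)


def select_records(records, all_records=False, limit=None):
    """Mimic the described behaviour of the default filter, --all-records and --limit."""
    chosen = [r for r in records if all_records or is_clinically_useful(r)]
    return chosen if limit is None else chosen[:limit]


if __name__ == "__main__":
    sample = [
        {"name": "A", "warnings": "Do not combine with alcohol."},
        {"name": "B"},
    ]
    print([r["name"] for r in select_records(sample)])
```

Note that in this sketch the limit is applied after filtering, so a limit of 50 means 50 uploadable records. Whether the real script counts the limit before or after filtering is not stated here.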

Old Ingestion Pipeline

Older local experiments used checkpoint_02_ingest/ to load data from:

data_adc_databaza/cleaned_general_info_additional.json

If that folder is present in your local workspace, treat it only as a reference for older LightRAG upload logic:

python checkpoint_02_ingest/load_leaflets.py --count 50
python checkpoint_02_ingest/load_leaflets.py --status

Do not treat this as the final ingestion path for the new dataset. The current ingestion script for the new dataset reads:

data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json

and sends each record's lightrag_text to LightRAG:

python scripts/lightrag_ingest/ingest_adc_structured.py --limit 50 --resume

Query LightRAG

After documents are ingested and LightRAG has finished processing them:

python -c "
import urllib.request, json
payload = json.dumps({'query': 'Ake su kontraindikacie Abirateronu?', 'mode': 'hybrid'}).encode()
req = urllib.request.Request('http://localhost:9621/query', data=payload, headers={'Content-Type': 'application/json'})
r = urllib.request.urlopen(req, timeout=120)
print(json.loads(r.read())['response'])
"

Available query modes:

hybrid - recommended combined retrieval mode.
local - entity-centered retrieval.
global - broader graph-level retrieval.
naive - vector-only retrieval.
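
For comparing the four modes on the same question, a small reusable helper is easier to work with than a one-liner. The sketch below uses the same /query endpoint, payload keys, and response field as the command above; the function names build_request and ask are illustrative.

```python
import json
import urllib.request

QUERY_URL = "http://localhost:9621/query"
MODES = ("hybrid", "local", "global", "naive")


def build_request(question, mode="hybrid", url=QUERY_URL):
    """Build the POST request for one question in one retrieval mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}, expected one of {MODES}")
    payload = json.dumps({"query": question, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )


def ask(question, mode="hybrid", timeout=120):
    """Send the question to the running LightRAG server and return the answer text."""
    with urllib.request.urlopen(build_request(question, mode), timeout=timeout) as r:
        return json.loads(r.read())["response"]


if __name__ == "__main__":
    question = "Ake su kontraindikacie Abirateronu?"
    try:
        for mode in MODES:
            print(f"--- {mode} ---")
            print(ask(question, mode))
    except OSError as exc:
        print("LightRAG server not reachable:", exc)
```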

Avoid querying while the document pipeline is still busy. Entity extraction can take several minutes per batch depending on the LLM API and concurrency limits.

Reset LightRAG Storage For a Clean Rebuild

Stop the servers first, then clear generated graph/vector data:

Remove-Item -Path "c:\Users\Oleh\Desktop\Diplomova praca\lightrag\rag_storage\*" -Recurse -Force
Remove-Item -LiteralPath "c:\Users\Oleh\Desktop\Diplomova praca\outputs\lightrag_ingest\adc_structured_progress.json" -Force

Use this only when you intentionally want to rebuild the graph.
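
On systems without PowerShell, the same cleanup can be sketched in Python. This is an assumption-laden alternative, not a project script: the helper clear_directory is hypothetical, it defaults to a dry run that only lists what would be deleted, and it expects to be run from the project root.

```python
import shutil
from pathlib import Path


def clear_directory(directory, dry_run=True):
    """Delete everything inside `directory` but keep the directory itself.

    Returns the list of affected paths. With dry_run=True nothing is deleted.
    """
    directory = Path(directory)
    affected = []
    if not directory.is_dir():
        return affected
    for child in sorted(directory.iterdir()):
        affected.append(child)
        if dry_run:
            continue
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()
    return affected


if __name__ == "__main__":
    # Dry run only: prints what a real reset would remove.
    for path in clear_directory("lightrag/rag_storage", dry_run=True):
        print("would remove:", path)
    progress = Path("outputs/lightrag_ingest/adc_structured_progress.json")
    if progress.exists():
        print("would remove:", progress)
```

Switch to dry_run=False only after the servers are stopped and the listed paths look right.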

Recommended Next Steps

Update validate_adc_json.py for the new adc_products_structured.json schema.
Build an explicit knowledge graph from graph_hints and PIL subsections.
Create a new LightRAG ingestion script for the new dataset.
Retry failed scrape URLs from adc_products_structured.failed.json.
Prepare a small RAGAS evaluation set for contraindication and interaction questions.

Project Layout

Diplomova praca/
  start_servers.py
  embedding_server.py
  build_knowledge_graph.py
  scripts/
    adc_scraper/
      requirements.txt
      scrape_adc_product_links.py
      scrape_adc_product_data.py
      validate_adc_json.py
    kg/
      build_adc_knowledge_graph.py
    lightrag_ingest/
      ingest_adc_structured.py
  data_adc_databaza/
    adc_scrape_2026_05_04/
      adc_product_links.json
      adc_products_structured.json
      adc_products_structured.failed.json
      adc_products_structured_10.json
  outputs/
    knowledge_graph_full/
    lightrag_ingest/
      adc_structured_progress.json
  checkpoint_02_ingest/
    load_leaflets.py
    batch_ingest.py
    progress.json
  lightrag/
    .env
    rag_storage/

Troubleshooting

If the embedding server does not start:

pip install sentence-transformers fastapi uvicorn

If LightRAG has encoding issues:

$env:PYTHONUTF8 = "1"
python -m lightrag.api.lightrag_server

If LLM extraction times out, reduce concurrency in lightrag/.env:

MAX_ASYNC=3
MAX_PARALLEL_INSERT=1

If the graph looks empty after ingestion, wait for background processing to finish and check the pipeline status with the current ingestion script:

python scripts/lightrag_ingest/ingest_adc_structured.py --status
