Oleh Poiasnik 5f25004d05 Initial ADC scraper project setup

2026-05-14 12:26:11 +02:00

Run Instructions - LightRAG ADC Knowledge Graph

This project prepares ADC pharmaceutical leaflet data for a knowledge graph and LightRAG-based question answering about drug interactions, contraindications, warnings, indications, dosage, and side effects.

Current Data

The current ADC scrape is stored in:

data_adc_databaza/adc_scrape_2026_05_04/

Main files:

adc_product_links.json - 35k+ ADC product detail URLs.
adc_products_structured.json - main structured dataset for the next pipeline stage.
adc_products_structured.failed.json - products that failed during scraping.
adc_products_structured_10.json - small parser test sample.

Use adc_products_structured.json as the main source for new graph and LightRAG ingestion work.
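Before building on the dataset, a quick sanity check can help. The sketch below counts the records and how many of them carry a non-empty lightrag_text field. It assumes the file is a JSON array of product objects; that top-level shape is an assumption, not something this README confirms, and the helper names are illustrative only.

```python
import json
from pathlib import Path

# Path taken from this README; run from the project root.
DATASET = Path("data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json")


def load_records(path=DATASET):
    """Load the dataset. Assumption: the file is a JSON array of product objects."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise ValueError(f"expected a JSON array, got {type(data).__name__}")
    return data


def summarize_records(records):
    """Count all records and those with a non-empty lightrag_text string."""
    with_text = sum(
        1
        for record in records
        if isinstance(record, dict)
        and isinstance(record.get("lightrag_text"), str)
        and record["lightrag_text"].strip()
    )
    return {"total": len(records), "with_lightrag_text": with_text}
```

Calling summarize_records(load_records()) from the project root prints nothing by itself; wrap it in print() to see the counts. If the real file turns out to be an object keyed by product ID, load_records will raise and needs adjusting.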

Requirements

Python 3.10+
ADC scraper dependencies:

pip install -r scripts/adc_scraper/requirements.txt
python -m playwright install chromium

Local embedding dependencies:

pip install sentence-transformers fastapi uvicorn

LightRAG package from lightrag/
OpenWebUI-compatible LLM API access configured in lightrag/.env

Scraping Pipeline

Collect ADC product detail links:

python scripts/adc_scraper/scrape_adc_product_links.py `
  --out data_adc_databaza/adc_scrape_2026_05_04/adc_product_links.json `
  --browser

Scrape product detail pages and PIL pages into structured JSON:

python scripts/adc_scraper/scrape_adc_product_data.py --browser

The default output is:

data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json

For a small test run:

python scripts/adc_scraper/scrape_adc_product_data.py `
  --browser `
  --limit 10 `
  --out data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured_10.json

Start Servers

Start the local embedding server and LightRAG server from the project root (adjust the path below to your own checkout):

cd "c:\Users\Oleh\Desktop\Diplomova praca"
python start_servers.py

Keep this terminal open while the servers are running. Press Ctrl+C to stop them.

Health checks:

http://localhost:8010/health   - embedding server
http://localhost:9621/health   - LightRAG server
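The two health URLs can also be checked from a script, for example before starting an ingestion run. This is a minimal sketch using only the standard library; it treats any HTTP 200 answer as healthy and does not inspect the response body, whose format is not described here.

```python
import http.client
import urllib.error
import urllib.request

# URLs taken from this README.
ENDPOINTS = {
    "embedding server": "http://localhost:8010/health",
    "LightRAG server": "http://localhost:9621/health",
}


def is_healthy(url, timeout=5.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a non-2xx status all count as unhealthy.
        return False


def check_all(endpoints=ENDPOINTS, timeout=5.0):
    """Map each endpoint name to its health status."""
    return {name: is_healthy(url, timeout) for name, url in endpoints.items()}
```

Printing check_all() gives one True/False entry per server.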

Old Ingestion Pipeline

The folder checkpoint_02_ingest/ contains an older ingestion pipeline that loads data from:

data_adc_databaza/cleaned_general_info_additional.json

It is kept as a reference because it already contains working LightRAG upload logic and progress tracking:

python checkpoint_02_ingest/load_leaflets.py --count 50
python checkpoint_02_ingest/load_leaflets.py --status

Do not treat this as the final ingestion path for the new dataset. The next step is to create a new ingestion script that reads:

data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json

and sends each record's lightrag_text to LightRAG.
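A possible starting point for that script is sketched below. It is not the project's ingestion code: it assumes the dataset is a JSON array of objects, and it assumes the LightRAG server accepts POST /documents/texts with a body of the form {"texts": [...]}. Verify the endpoint against the API docs of the installed LightRAG version before relying on it. Progress tracking, retries, and any API key header are left out.

```python
import json
import urllib.request

LIGHTRAG_URL = "http://localhost:9621"


def extract_texts(records):
    """Yield the stripped lightrag_text of every record that has one."""
    for record in records:
        text = record.get("lightrag_text") if isinstance(record, dict) else None
        if isinstance(text, str) and text.strip():
            yield text.strip()


def batched(items, size):
    """Group an iterable into lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_request(texts, base_url=LIGHTRAG_URL):
    """Build the upload request. Assumption: POST /documents/texts exists."""
    payload = json.dumps({"texts": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/documents/texts",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ingest(records, batch_size=10, base_url=LIGHTRAG_URL, timeout=120):
    """Send all texts in batches and return how many were submitted."""
    sent = 0
    for batch in batched(extract_texts(records), batch_size):
        with urllib.request.urlopen(build_request(batch, base_url), timeout=timeout) as resp:
            resp.read()  # the response body is ignored in this sketch
        sent += len(batch)
    return sent
```

Small batches keep each request short and make it easier to resume after a failure; the batch size of 10 is an arbitrary default.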

Query LightRAG

After the documents are ingested and LightRAG has finished processing them, send a question to the /query endpoint:

python -c "
import urllib.request, json
payload = json.dumps({'query': 'Ake su kontraindikacie Abirateronu?', 'mode': 'hybrid'}).encode()
req = urllib.request.Request('http://localhost:9621/query', data=payload, headers={'Content-Type': 'application/json'})
r = urllib.request.urlopen(req, timeout=120)
print(json.loads(r.read())['response'])
"

Available query modes:

hybrid - recommended combined retrieval mode.
local - entity-centered retrieval.
global - broader graph-level retrieval.
naive - vector-only retrieval.
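To see how the modes differ on the same question, the one-off command can be wrapped in a small helper. The sketch below reuses the /query endpoint, the payload keys, and the 'response' field from the command shown earlier; the function names are illustrative.

```python
import json
import urllib.request

QUERY_MODES = ("hybrid", "local", "global", "naive")


def build_query_request(question, mode="hybrid", base_url="http://localhost:9621"):
    """Build a POST request for /query, rejecting unknown modes early."""
    if mode not in QUERY_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {QUERY_MODES}")
    payload = json.dumps({"query": question, "mode": mode}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=payload,
        headers={"Content-Type": "application/json"},
    )


def ask(question, mode="hybrid", timeout=120):
    """Send the question to LightRAG and return the 'response' field."""
    with urllib.request.urlopen(build_query_request(question, mode), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


def compare_modes(question, modes=QUERY_MODES, ask_fn=ask):
    """Return one answer per mode for the same question."""
    return {mode: ask_fn(question, mode) for mode in modes}
```

For example, compare_modes('Ake su kontraindikacie Abirateronu?') returns a dictionary with one answer per mode, which is handy when deciding which mode to use for evaluation.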

Avoid querying while the document pipeline is still busy. Entity extraction can take several minutes per batch depending on the LLM API and concurrency limits.
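One way to avoid querying too early is to poll the server until the pipeline is idle. The sketch below assumes that GET /documents/pipeline_status returns JSON with a boolean "busy" field; that endpoint and field name are assumptions about the LightRAG API and should be checked against the installed version. The waiting logic itself takes any status function, so it can be reused with a different source.

```python
import json
import time
import urllib.request


def fetch_pipeline_status(base_url="http://localhost:9621", timeout=10):
    """Assumption: this endpoint exists and returns JSON containing 'busy'."""
    with urllib.request.urlopen(f"{base_url}/documents/pipeline_status", timeout=timeout) as resp:
        return json.loads(resp.read())


def wait_until_idle(fetch_status=fetch_pipeline_status, poll_seconds=10.0,
                    max_polls=60, sleep=time.sleep):
    """Poll until the status reports busy == False.

    Returns True once idle, or False if still busy after max_polls checks.
    """
    for _ in range(max_polls):
        if not fetch_status().get("busy", False):
            return True
        sleep(poll_seconds)
    return False
```

With the defaults this waits up to about ten minutes; raise max_polls for large batches.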

Reset LightRAG Storage

Stop the servers first, then clear generated graph/vector data:

Remove-Item -Path "c:\Users\Oleh\Desktop\Diplomova praca\lightrag\rag_storage\*" -Recurse -Force
python checkpoint_02_ingest/load_leaflets.py --reset

Use this only when you intend to rebuild the graph from scratch.

Recommended Next Steps

Update validate_adc_json.py for the new adc_products_structured.json schema.
Build an explicit knowledge graph from graph_hints and PIL subsections.
Create a new LightRAG ingestion script for the new dataset.
Retry failed scrape URLs from adc_products_structured.failed.json.
Prepare a small RAGAS evaluation set for contraindication and interaction questions.
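For the retry step, the failed URLs first need to be pulled out of adc_products_structured.failed.json. The format of that file is not documented here, so the sketch below assumes each entry is either a URL string or an object with a "url" key; both the shape and the key name are assumptions to adjust after inspecting the real file.

```python
import json


def failed_urls(entries):
    """Return unique URLs from failed-scrape entries, preserving order.

    Assumption: an entry is a URL string or a dict with a 'url' key.
    Entries that match neither shape are skipped.
    """
    seen = set()
    urls = []
    for entry in entries:
        if isinstance(entry, str):
            url = entry
        elif isinstance(entry, dict):
            url = entry.get("url")
        else:
            url = None
        if isinstance(url, str) and url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def load_failed_urls(path):
    """Read the failed-scrape file and extract its URLs."""
    with open(path, encoding="utf-8") as fh:
        return failed_urls(json.load(fh))
```

How the resulting list is fed back into scrape_adc_product_data.py depends on the options that script supports, which this document does not list.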

Project Layout

Diplomova praca/
  start_servers.py
  embedding_server.py
  scripts/adc_scraper/
    scrape_adc_product_links.py
    scrape_adc_product_data.py
    validate_adc_json.py
  data_adc_databaza/
    adc_scrape_2026_05_04/
      adc_product_links.json
      adc_products_structured.json
      adc_products_structured.failed.json
      adc_products_structured_10.json
  checkpoint_02_ingest/
    load_leaflets.py
    batch_ingest.py
    progress.json
  lightrag/
    .env
    rag_storage/

Troubleshooting

If the embedding server does not start, reinstall its dependencies:

pip install sentence-transformers fastapi uvicorn

If LightRAG fails with encoding errors, enable Python's UTF-8 mode before starting it:

$env:PYTHONUTF8 = "1"
python -m lightrag.api.lightrag_server

If LLM extraction times out, reduce concurrency in lightrag/.env:

MAX_ASYNC=3
MAX_PARALLEL_INSERT=1

If the graph looks empty after ingestion, wait for background processing and check:

python checkpoint_02_ingest/load_leaflets.py --status
