# Run Instructions - LightRAG ADC Knowledge Graph

This project prepares ADC pharmaceutical leaflet data for a knowledge graph and LightRAG-based question answering about drug interactions, contraindications, warnings, indications, dosage, and side effects.

## Current Data

The current ADC scrape is stored in:

```powershell
data_adc_databaza/adc_scrape_2026_05_04/
```

Main files:

- `adc_product_links.json` - 35k+ ADC product detail URLs.
- `adc_products_structured.json` - main structured dataset for the next pipeline stage.
- `adc_products_structured.failed.json` - products that failed during scraping.
- `adc_products_structured_10.json` - small parser test sample.

Use `adc_products_structured.json` as the main source for new graph and LightRAG ingestion work.

## Requirements

- Python 3.10+
- ADC scraper dependencies:

  ```powershell
  pip install -r scripts/adc_scraper/requirements.txt
  python -m playwright install chromium
  ```

- Local embedding dependencies:

  ```powershell
  pip install sentence-transformers fastapi uvicorn
  ```

- LightRAG package from `lightrag/`
- OpenWebUI-compatible LLM API access configured in `lightrag/.env`

## Scraping Pipeline

Collect ADC product detail links:

```powershell
python scripts/adc_scraper/scrape_adc_product_links.py `
  --out data_adc_databaza/adc_scrape_2026_05_04/adc_product_links.json `
  --browser
```

Scrape product detail pages and PIL pages into structured JSON:

```powershell
python scripts/adc_scraper/scrape_adc_product_data.py --browser
```

The default output is:

```powershell
data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json
```

For a small test run:

```powershell
python scripts/adc_scraper/scrape_adc_product_data.py `
  --browser `
  --limit 10 `
  --out data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured_10.json
```

## Start Servers

Start the local embedding server and LightRAG server:

```powershell
cd "c:\Users\Oleh\Desktop\Diplomova praca"
python start_servers.py
```

Keep this terminal open.
Stop with `Ctrl+C`.

Health checks:

```text
http://localhost:8010/health - embedding server
http://localhost:9621/health - LightRAG server
```

## Old Ingestion Pipeline

The folder `checkpoint_02_ingest/` contains an older ingestion pipeline that loads data from:

```powershell
data_adc_databaza/cleaned_general_info_additional.json
```

It is kept as a reference because it already contains working LightRAG upload logic and progress tracking:

```powershell
python checkpoint_02_ingest/load_leaflets.py --count 50
python checkpoint_02_ingest/load_leaflets.py --status
```

Do not treat this as the final ingestion path for the new dataset. The next step is to create a new ingestion script that reads:

```powershell
data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json
```

and sends each record's `lightrag_text` to LightRAG.

## Query LightRAG

After documents are ingested and LightRAG has finished processing them:

```powershell
python -c "
import urllib.request, json
payload = json.dumps({'query': 'Ake su kontraindikacie Abirateronu?', 'mode': 'hybrid'}).encode()
req = urllib.request.Request('http://localhost:9621/query', data=payload, headers={'Content-Type': 'application/json'})
r = urllib.request.urlopen(req, timeout=120)
print(json.loads(r.read())['response'])
"
```

Available query modes:

- `hybrid` - recommended combined retrieval mode.
- `local` - entity-centered retrieval.
- `global` - broader graph-level retrieval.
- `naive` - vector-only retrieval.

Avoid querying while the document pipeline is still busy. Entity extraction can take several minutes per batch depending on the LLM API and concurrency limits.

## Reset LightRAG Storage

Stop the servers first, then clear generated graph/vector data:

```powershell
Remove-Item -Path "c:\Users\Oleh\Desktop\Diplomova praca\lightrag\rag_storage\*" -Force
python checkpoint_02_ingest/load_leaflets.py --reset
```

Use this only when you intentionally want to rebuild the graph.

## Recommended Next Steps

1. Update `validate_adc_json.py` for the new `adc_products_structured.json` schema.
2. Build an explicit knowledge graph from `graph_hints` and PIL subsections.
3. Create a new LightRAG ingestion script for the new dataset.
4. Retry failed scrape URLs from `adc_products_structured.failed.json`.
5. Prepare a small RAGAS evaluation set for contraindication and interaction questions.

## Project Layout

```text
Diplomova praca/
  start_servers.py
  embedding_server.py
  scripts/adc_scraper/
    scrape_adc_product_links.py
    scrape_adc_product_data.py
    validate_adc_json.py
  data_adc_databaza/
    adc_scrape_2026_05_04/
      adc_product_links.json
      adc_products_structured.json
      adc_products_structured.failed.json
      adc_products_structured_10.json
  checkpoint_02_ingest/
    load_leaflets.py
    batch_ingest.py
    progress.json
  lightrag/
    .env
    rag_storage/
```

## Troubleshooting

If the embedding server does not start:

```powershell
pip install sentence-transformers fastapi uvicorn
```

If LightRAG has encoding issues:

```powershell
$env:PYTHONUTF8 = "1"
python -m lightrag.api.lightrag_server
```

If LLM extraction times out, reduce concurrency in `lightrag/.env`:

```text
MAX_ASYNC=3
MAX_PARALLEL_INSERT=1
```

If the graph looks empty after ingestion, wait for background processing and check:

```powershell
python checkpoint_02_ingest/load_leaflets.py --status
```