Initial ADC scraper project setup

2026-05-14 12:26:11 +02:00 · 2026-05-14 12:26:11 +02:00 · 5f25004d05
commit 5f25004d05
13 changed files with 1761 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@ -0,0 +1,32 @@
 # Large local datasets
 data_adc_databaza/
 # External LightRAG checkout and generated RAG storage
 lightrag/
 # Scraped/debug HTML snapshots
 pil.html
 detail-product.html
 # Generated graph artifacts
 *.graphml
 # Logs
 *.log
 # Python cache and local environments
 __pycache__/
 *.py[cod]
 .venv/
 venv/
 env/
 # Tool/local workspace metadata
 .claude/
 .tmp/
 # OS/editor files
 .DS_Store
 Thumbs.db
 .idea/
 .vscode/
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@ -0,0 +1,127 @@
 # LightRAG ADC System Architecture
 This document describes the full architecture of the LightRAG-based Adverse Drug Condition (ADC) system for processing and querying Slovak pharmaceutical leaflets.
 The system consists of three main components running locally:
 - **Embedding Server** (port 8010) — wraps a sentence-transformers model for vector generation
 - **LightRAG Server** (port 9621) — core RAG engine managing the knowledge graph and vector DB
 - **OpenWebUI LLM** (remote) — hosts the Qwen3.5-122B model used for entity extraction and answer generation
 Both local servers are launched via `start_servers.py`. Source data is 6929 Slovak pharmaceutical leaflets stored in `cleaned_general_info_additional.json`.
 ---
 ## Flow 1: Ingestion (Loading Leaflets)
 ```mermaid
 flowchart TD
    A([👤 User runs load_leaflets.py]) --> B
    B[("📄 cleaned_general_info_additional.json\n6929 Slovak leaflets")]
    B --> C{Filter:\nclinical leaflets only\ninteractions +\ncontraindications}
    C -->|Filtered leaflets| D["🔁 For each leaflet\n(loop)"]
    D --> E["POST http://localhost:9621\n/documents/text\n\nBody: { text, metadata }"]
    subgraph LightRAG_Server ["⚙️ LightRAG Server — port 9621"]
        E --> F["Text chunker\n600 tokens per chunk"]
        F --> G["🔁 For each chunk\n(loop)"]
        G --> H["POST https://ui.tukekemt.xyz\n/api/v1/chat/completions\n\nModel: model2 (Qwen3.5-122B)\nTask: extract entities & relations"]
        H --> I["Extracted:\n• Entities (drugs, conditions, etc.)\n• Relations between entities"]
        I --> J["🔁 For each entity / chunk\n(loop)"]
        J --> K["POST http://localhost:8010\n/embeddings\n\nBody: { input: text }"]
    end
    subgraph Embedding_Server ["🧠 Embedding Server — port 8010"]
        K --> L["paraphrase-multilingual\n-MiniLM-L12-v2\n(sentence-transformers)"]
        L --> M["Float vector\n(384 dimensions)"]
    end
    subgraph OpenWebUI ["☁️ OpenWebUI — ui.tukekemt.xyz"]
        H
    end
    M --> N
    subgraph RAG_Storage ["💾 rag_storage/"]
        N["graph_chunk_entity_relation.graphml\n— knowledge graph (NetworkX)"]
        O["vdb_entities.json\n— entity vectors (NanoVectorDB)"]
        P["vdb_relationships.json\n— relation vectors (NanoVectorDB)"]
        Q["kv_store_*.json\n— chunk text cache & metadata"]
    end
    I --> N
    I --> P
    M --> O
    F --> Q
 ```
 ---
 ## Flow 2: Query (Answering Questions)
 ```mermaid
 flowchart TD
    A([👤 User sends query]) --> B
    B["POST http://localhost:9621/query\n\nBody:\n{ query: string,\n  mode: hybrid | local | global | naive }"]
    subgraph LightRAG_Server ["⚙️ LightRAG Server — port 9621"]
        B --> C["Parse query\n& select retrieval mode"]
        C --> D["POST http://localhost:8010\n/embeddings\n\nEmbed the query text"]
        subgraph Retrieval ["🔍 Retrieval (parallel)"]
            E["Vector search\nNanoVectorDB\n(vdb_entities.json,\nvdb_relationships.json)"]
            F["Graph traversal\nNetworkX\n(graph_chunk_entity_relation.graphml)"]
        end
        D --> Retrieval
        Retrieval --> G["Merge & rank\nrelevant entities,\nrelations & text chunks"]
        G --> H["Build context prompt\nfrom top-K results\n+ retrieved chunk texts\n(kv_store_*.json)"]
        H --> I["POST https://ui.tukekemt.xyz\n/api/v1/chat/completions\n\nModel: model2 (Qwen3.5-122B)\nTask: generate answer\nfrom context"]
    end
    subgraph Embedding_Server ["🧠 Embedding Server — port 8010"]
        D2["paraphrase-multilingual\n-MiniLM-L12-v2"]
        D --> D2
        D2 --> E
    end
    subgraph OpenWebUI ["☁️ OpenWebUI — ui.tukekemt.xyz"]
        I
    end
    subgraph RAG_Storage ["💾 rag_storage/"]
        VDB["vdb_entities.json\nvdb_relationships.json"]
        GRAPH["graph_chunk_entity_relation.graphml"]
        KV["kv_store_*.json"]
    end
    E --- VDB
    F --- GRAPH
    H --- KV
    I --> J["Generated answer\n+ source references"]
    J --> K([👤 User receives response])
 ```
 ---
 ## Component Summary
 | Component | Type | Address | Key Endpoints |
 |---|---|---|---|
 | `embedding_server.py` | FastAPI (local) | `http://localhost:8010` | `GET /health`, `POST /embeddings`, `POST /v1/embeddings` |
 | LightRAG Server | FastAPI (local) | `http://localhost:9621` | `GET /health`, `POST /documents/text`, `POST /documents/scan`, `GET /documents/pipeline_status`, `POST /query` |
 | OpenWebUI (model2) | Remote LLM API | `https://ui.tukekemt.xyz` | `POST /api/v1/chat/completions` |
 | `rag_storage/` | File system | Local disk | `.graphml`, `.json` files |
 | `cleaned_general_info_additional.json` | Source data | Local disk | 6929 Slovak pharmaceutical leaflets |
 | `start_servers.py` | Launcher script | — | Starts embedding server + LightRAG server |
--- a/RUN_INSTRUCTION.md
+++ b/RUN_INSTRUCTION.md
@ -0,0 +1,214 @@
 # Run Instructions - LightRAG ADC Knowledge Graph
 This project prepares ADC pharmaceutical leaflet data for a knowledge graph and
 LightRAG-based question answering about drug interactions, contraindications,
 warnings, indications, dosage, and side effects.
 ## Current Data
 The current ADC scrape is stored in:
 ```powershell
 data_adc_databaza/adc_scrape_2026_05_04/
 ```
 Main files:
 - `adc_product_links.json` - 35k+ ADC product detail URLs.
 - `adc_products_structured.json` - main structured dataset for the next pipeline stage.
 - `adc_products_structured.failed.json` - products that failed during scraping.
 - `adc_products_structured_10.json` - small parser test sample.
 Use `adc_products_structured.json` as the main source for new graph and
 LightRAG ingestion work.
 ## Requirements
 - Python 3.10+
 - ADC scraper dependencies:
 ```powershell
 pip install -r scripts/adc_scraper/requirements.txt
 python -m playwright install chromium
 ```
 - Local embedding dependencies:
 ```powershell
 pip install sentence-transformers fastapi uvicorn
 ```
 - LightRAG package from `lightrag/`
 - OpenWebUI-compatible LLM API access configured in `lightrag/.env`
 ## Scraping Pipeline
 Collect ADC product detail links:
 ```powershell
 python scripts/adc_scraper/scrape_adc_product_links.py `
  --out data_adc_databaza/adc_scrape_2026_05_04/adc_product_links.json `
  --browser
 ```
 Scrape product detail pages and PIL pages into structured JSON:
 ```powershell
 python scripts/adc_scraper/scrape_adc_product_data.py --browser
 ```
 The default output is:
 ```powershell
 data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json
 ```
 For a small test run:
 ```powershell
 python scripts/adc_scraper/scrape_adc_product_data.py `
  --browser `
  --limit 10 `
  --out data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured_10.json
 ```
 ## Start Servers
 Start the local embedding server and LightRAG server:
 ```powershell
 cd "c:\Users\Oleh\Desktop\Diplomova praca"
 python start_servers.py
 ```
 Keep this terminal open. Stop with `Ctrl+C`.
 Health checks:
 ```text
 http://localhost:8010/health   - embedding server
 http://localhost:9621/health   - LightRAG server
 ```
 ## Old Ingestion Pipeline
 The folder `checkpoint_02_ingest/` contains an older ingestion pipeline that
 loads data from:
 ```powershell
 data_adc_databaza/cleaned_general_info_additional.json
 ```
 It is kept as a reference because it already contains working LightRAG upload
 logic and progress tracking:
 ```powershell
 python checkpoint_02_ingest/load_leaflets.py --count 50
 python checkpoint_02_ingest/load_leaflets.py --status
 ```
 Do not treat this as the final ingestion path for the new dataset. The next
 step is to create a new ingestion script that reads:
 ```powershell
 data_adc_databaza/adc_scrape_2026_05_04/adc_products_structured.json
 ```
 and sends each record's `lightrag_text` to LightRAG.
 ## Query LightRAG
 After documents are ingested and LightRAG has finished processing them:
 ```powershell
 python -c "
 import urllib.request, json
 payload = json.dumps({'query': 'Ake su kontraindikacie Abirateronu?', 'mode': 'hybrid'}).encode()
 req = urllib.request.Request('http://localhost:9621/query', data=payload, headers={'Content-Type': 'application/json'})
 r = urllib.request.urlopen(req, timeout=120)
 print(json.loads(r.read())['response'])
 "
 ```
 Available query modes:
 - `hybrid` - recommended combined retrieval mode.
 - `local` - entity-centered retrieval.
 - `global` - broader graph-level retrieval.
 - `naive` - vector-only retrieval.
 Avoid querying while the document pipeline is still busy. Entity extraction can
 take several minutes per batch depending on the LLM API and concurrency limits.
 ## Reset LightRAG Storage
 Stop the servers first, then clear generated graph/vector data:
 ```powershell
 Remove-Item -LiteralPath "c:\Users\Oleh\Desktop\Diplomova praca\lightrag\rag_storage\*" -Force
 python checkpoint_02_ingest/load_leaflets.py --reset
 ```
 Use this only when you intentionally want to rebuild the graph.
 ## Recommended Next Steps
 1. Update `validate_adc_json.py` for the new `adc_products_structured.json` schema.
 2. Build an explicit knowledge graph from `graph_hints` and PIL subsections.
 3. Create a new LightRAG ingestion script for the new dataset.
 4. Retry failed scrape URLs from `adc_products_structured.failed.json`.
 5. Prepare a small RAGAS evaluation set for contraindication and interaction questions.
 ## Project Layout
 ```text
 Diplomova praca/
  start_servers.py
  embedding_server.py
  scripts/adc_scraper/
    scrape_adc_product_links.py
    scrape_adc_product_data.py
    validate_adc_json.py
  data_adc_databaza/
    adc_scrape_2026_05_04/
      adc_product_links.json
      adc_products_structured.json
      adc_products_structured.failed.json
      adc_products_structured_10.json
  checkpoint_02_ingest/
    load_leaflets.py
    batch_ingest.py
    progress.json
  lightrag/
    .env
    rag_storage/
 ```
 ## Troubleshooting
 If the embedding server does not start:
 ```powershell
 pip install sentence-transformers fastapi uvicorn
 ```
 If LightRAG has encoding issues:
 ```powershell
 $env:PYTHONUTF8 = "1"
 python -m lightrag.api.lightrag_server
 ```
 If LLM extraction times out, reduce concurrency in `lightrag/.env`:
 ```text
 MAX_ASYNC=3
 MAX_PARALLEL_INSERT=1
 ```
 If the graph looks empty after ingestion, wait for background processing and
 check:
 ```powershell
 python checkpoint_02_ingest/load_leaflets.py --status
 ```
--- a/embedding_server.py
+++ b/embedding_server.py
@ -0,0 +1,69 @@
 """
 Локальный OpenAI-compatible embedding сервер на базе sentence-transformers.
 Модель: paraphrase-multilingual-MiniLM-L12-v2 (поддерживает словацкий язык!)
 Запуск:
    python embedding_server.py
 Тест:
    curl http://localhost:8010/v1/embeddings -H "Content-Type: application/json" \
         -d '{"model": "local-embed", "input": "test"}'
 """
 import time
 import json
 from fastapi import FastAPI, Request
 from fastapi.responses import JSONResponse
 import uvicorn
 from sentence_transformers import SentenceTransformer
 MODEL_NAME = "paraphrase-multilingual-MiniLM-L12-v2"
 PORT = 8010
 print(f"Загрузка модели {MODEL_NAME}...")
 model = SentenceTransformer(MODEL_NAME)
 EMBED_DIM = model.get_sentence_embedding_dimension()
 print(f"Модель загружена. Размерность: {EMBED_DIM}")
 app = FastAPI(title="Local Embedding Server")
@app.get("/health")
 def health():
    return {"status": "ok", "model": MODEL_NAME, "dim": EMBED_DIM}
 async def _handle_embeddings(request: Request):
    body = await request.json()
    inp = body.get("input", "")
    if isinstance(inp, str):
        texts = [inp]
    else:
        texts = inp
    vecs = model.encode(texts, normalize_embeddings=True).tolist()
    data = [
        {"object": "embedding", "index": i, "embedding": vec}
        for i, vec in enumerate(vecs)
    ]
    return JSONResponse({
        "object": "list",
        "data": data,
        "model": MODEL_NAME,
        "usage": {"prompt_tokens": sum(len(t.split()) for t in texts), "total_tokens": sum(len(t.split()) for t in texts)}
    })
@app.post("/v1/embeddings")
 async def embeddings_v1(request: Request):
    return await _handle_embeddings(request)
@app.post("/embeddings")
 async def embeddings_root(request: Request):
    return await _handle_embeddings(request)
 if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=PORT, log_level="warning")
--- a/scripts/adc_scraper/init.py
+++ b/scripts/adc_scraper/init.py
@ -0,0 +1 @@
 """ADC scraper scripts for the diploma project."""
--- a/scripts/adc_scraper/parse_adc_json.py
+++ b/scripts/adc_scraper/parse_adc_json.py
@ -0,0 +1,167 @@
 """Parse raw ADC HTML/JSONL into structured JSON for LightRAG ingestion."""
 from __future__ import annotations
 import argparse
 import json
 import re
 from pathlib import Path
 from typing import Any
 from bs4 import BeautifulSoup
 SECTION_PATTERNS = {
    "contraindications": [
        r"nepoužívajte",
        r"kedy .* nepoužívať",
        r"kontraindik",
    ],
    "interactions": [
        r"iné lieky",
        r"vzájomné pôsobenie",
        r"interakci",
    ],
    "side_effects": [
        r"možné vedľajšie účinky",
        r"nežiaduce účinky",
        r"vedľajšie účinky",
    ],
    "dosage": [
        r"ako používať",
        r"dávkovanie",
        r"spôsob podávania",
    ],
 }
 def html_to_text(html: str | None) -> str:
    if not html:
        return ""
    soup = BeautifulSoup(html, "lxml")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    text = soup.get_text(" ", strip=True)
    return normalize_text(text)
 def normalize_text(text: str) -> str:
    text = text.replace("\xa0", " ")
    text = re.sub(r"\s+", " ", text)
    return text.strip()
 def infer_name(source_url: str, text: str) -> str:
    match = re.search(r"Písomná informácia pre používateľa\s+(.{3,160}?)(?:\s+Pozorne|\s+V tejto|\s+1\.)", text)
    if match:
        return normalize_text(match.group(1))
    slug = source_url.rstrip("/").split("/")[-1].replace(".html", "")
    slug = re.sub(r"-\d+$", "", slug)
    return slug.replace("-", " ").title()
 def extract_sections(text: str) -> dict[str, str]:
    sections: dict[str, str] = {}
    lower = text.lower()
    starts: list[tuple[int, str]] = []
    for section_name, patterns in SECTION_PATTERNS.items():
        found_positions = []
        for pattern in patterns:
            match = re.search(pattern, lower)
            if match:
                found_positions.append(match.start())
        if found_positions:
            starts.append((min(found_positions), section_name))
    starts.sort()
    for idx, (start, section_name) in enumerate(starts):
        end = starts[idx + 1][0] if idx + 1 < len(starts) else min(len(text), start + 8000)
        sections[section_name] = text[start:end].strip()
    return sections
 def iter_raw_records(path: Path) -> list[dict[str, Any]]:
    if path.suffix.lower() == ".jsonl":
        records = []
        with path.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    records.append(json.loads(line))
        return records
    data = json.loads(path.read_text(encoding="utf-8"))
    if isinstance(data, list):
        return data
    if "records" in data:
        return data["records"]
    return [data]
 def parse_record(raw: dict[str, Any]) -> dict[str, Any]:
    source_url = raw.get("source_url") or raw.get("link") or raw.get("pil_url") or ""
    pil_text = raw.get("pribalovy_letak")
    if pil_text is None:
        pil_text = html_to_text(raw.get("pil_html"))
    else:
        pil_text = normalize_text(str(pil_text))
    spc_text = raw.get("spc")
    if spc_text is None:
        spc_text = html_to_text(raw.get("spc_html"))
    else:
        spc_text = normalize_text(str(spc_text))
    combined_text = f"{pil_text} {spc_text}".strip()
    name = raw.get("name") or infer_name(source_url, combined_text)
    return {
        "source_url": source_url,
        "name": name,
        "pil_url": raw.get("pil_url"),
        "spc_url": raw.get("spc_url"),
        "pil_text": pil_text,
        "spc_text": spc_text,
        "sections": extract_sections(combined_text),
        "metadata": {
            "source": "adc.sk",
            "scraped_at": (raw.get("metadata") or {}).get("scraped_at"),
            "parser": "scripts/adc_scraper/parse_adc_json.py",
        },
    }
 def main() -> None:
    parser = argparse.ArgumentParser(description="Parse ADC raw data into structured JSON.")
    parser.add_argument("--input", type=Path, required=True)
    parser.add_argument("--out", type=Path, required=True)
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument(
        "--keep-empty",
        action="store_true",
        help="Keep records where both PIL and SPC text are empty.",
    )
    args = parser.parse_args()
    raw_records = iter_raw_records(args.input)
    parsed = []
    for record in raw_records:
        item = parse_record(record)
        if not args.keep_empty and not item["pil_text"] and not item["spc_text"]:
            continue
        parsed.append(item)
        if args.limit and len(parsed) >= args.limit:
            break
    args.out.parent.mkdir(parents=True, exist_ok=True)
    args.out.write_text(json.dumps(parsed, ensure_ascii=False, indent=2), encoding="utf-8")
    print(f"Saved {len(parsed)} structured records to {args.out}")
 if __name__ == "__main__":
    main()
--- a/scripts/adc_scraper/requirements.txt
+++ b/scripts/adc_scraper/requirements.txt
@ -0,0 +1,5 @@
 requests>=2.31.0
 beautifulsoup4>=4.12.0
 lxml>=5.0.0
 tqdm>=4.66.0
 playwright>=1.45.0
--- a/scripts/adc_scraper/scrape_adc_index.py
+++ b/scripts/adc_scraper/scrape_adc_index.py
@ -0,0 +1,119 @@
 """Collect ADC product/PIL/SPC links from index or search pages.
 The script is intentionally conservative: it only stores discovered ADC product
 URLs and does not try to parse clinical content. The next pipeline step downloads
 the actual leaflet pages.
 """
 from __future__ import annotations
 import argparse
 import json
 import time
 from collections import deque
 from pathlib import Path
 from urllib.parse import urljoin, urlparse
 import requests
 from bs4 import BeautifulSoup
 from tqdm import tqdm
 DEFAULT_HEADERS = {
    "User-Agent": "DiplomaResearchBot/0.1 (+educational use; ADC leaflet KG)",
 }
 def is_adc_url(url: str) -> bool:
    host = urlparse(url).netloc.lower()
    return host.endswith("adc.sk")
 def is_product_like_url(url: str) -> bool:
    path = urlparse(url).path.lower()
    return "/databazy/produkty/" in path and (
        "/pil/" in path or "/spc/" in path or "/detail/" in path
    )
 def extract_links(html: str, base_url: str) -> tuple[set[str], set[str]]:
    soup = BeautifulSoup(html, "lxml")
    product_links: set[str] = set()
    crawl_links: set[str] = set()
    for tag in soup.find_all("a", href=True):
        url = urljoin(base_url, tag["href"]).split("#", 1)[0]
        if not is_adc_url(url):
            continue
        if is_product_like_url(url):
            product_links.add(url)
        path = urlparse(url).path.lower()
        if "/databazy/produkty/" in path:
            crawl_links.add(url)
    return product_links, crawl_links
 def fetch(session: requests.Session, url: str, timeout: int) -> str:
    response = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()
    response.encoding = response.apparent_encoding or "utf-8"
    return response.text
 def main() -> None:
    parser = argparse.ArgumentParser(description="Collect ADC product/PIL/SPC links.")
    parser.add_argument(
        "--start-url",
        action="append",
        required=True,
        help="ADC index/search URL. Can be supplied multiple times.",
    )
    parser.add_argument("--out", type=Path, required=True, help="Output JSON file.")
    parser.add_argument("--max-pages", type=int, default=20)
    parser.add_argument("--delay", type=float, default=0.5)
    parser.add_argument("--timeout", type=int, default=30)
    args = parser.parse_args()
    queue: deque[str] = deque(args.start_url)
    visited: set[str] = set()
    product_links: set[str] = set()
    session = requests.Session()
    with tqdm(total=args.max_pages, desc="ADC pages") as progress:
        while queue and len(visited) < args.max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            visited.add(url)
            try:
                html = fetch(session, url, args.timeout)
            except Exception as exc:
                tqdm.write(f"Skip {url}: {exc}")
                progress.update(1)
                continue
            found_products, found_crawl = extract_links(html, url)
            product_links.update(found_products)
            for link in sorted(found_crawl):
                if link not in visited and len(visited) + len(queue) < args.max_pages * 4:
                    queue.append(link)
            progress.update(1)
            time.sleep(args.delay)
    args.out.parent.mkdir(parents=True, exist_ok=True)
    payload = {
        "source": "adc.sk",
        "visited_pages": sorted(visited),
        "links": sorted(product_links),
    }
    args.out.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
    print(f"Saved {len(product_links)} links to {args.out}")
 if __name__ == "__main__":
    main()
--- a/scripts/adc_scraper/scrape_adc_leaflets.py
+++ b/scripts/adc_scraper/scrape_adc_leaflets.py
@ -0,0 +1,124 @@
 """Download ADC PIL/SPC pages into a raw JSONL file."""
 from __future__ import annotations
 import argparse
 import json
 import time
 from datetime import datetime, timezone
 from pathlib import Path
 from urllib.parse import urlparse
 import requests
 from bs4 import BeautifulSoup
 from tqdm import tqdm
 HEADERS = {
    "User-Agent": "DiplomaResearchBot/0.1 (+educational use; ADC leaflet KG)",
 }
 def load_links(path: Path) -> list[str]:
    data = json.loads(path.read_text(encoding="utf-8"))
    if isinstance(data, list):
        return [str(x) for x in data]
    return [str(x) for x in data.get("links", [])]
 def paired_leaflet_urls(url: str) -> dict[str, str]:
    """Return best-effort PIL/SPC URLs for an ADC product URL."""
    urls: dict[str, str] = {}
    path = urlparse(url).path.lower()
    if "/pil/" in path:
        urls["pil_url"] = url
        urls["spc_url"] = url.replace("/pil/", "/spc/")
    elif "/spc/" in path:
        urls["spc_url"] = url
        urls["pil_url"] = url.replace("/spc/", "/pil/")
    else:
        urls["detail_url"] = url
    return urls
 def discover_leaflet_urls_from_detail(html: str, base_url: str) -> dict[str, str]:
    from urllib.parse import urljoin
    soup = BeautifulSoup(html, "lxml")
    result: dict[str, str] = {}
    for tag in soup.find_all("a", href=True):
        candidate = urljoin(base_url, tag["href"])
        path = urlparse(candidate).path.lower()
        if "/databazy/produkty/pil/" in path:
            result["pil_url"] = candidate
        elif "/databazy/produkty/spc/" in path:
            result["spc_url"] = candidate
    return result
 def fetch(session: requests.Session, url: str, timeout: int) -> tuple[int, str]:
    response = session.get(url, headers=HEADERS, timeout=timeout)
    response.encoding = response.apparent_encoding or "utf-8"
    return response.status_code, response.text
 def main() -> None:
    parser = argparse.ArgumentParser(description="Download ADC PIL/SPC HTML pages.")
    parser.add_argument("--links", type=Path, required=True)
    parser.add_argument("--out", type=Path, required=True)
    parser.add_argument("--limit", type=int, default=None)
    parser.add_argument("--delay", type=float, default=0.5)
    parser.add_argument("--timeout", type=int, default=30)
    args = parser.parse_args()
    links = load_links(args.links)
    if args.limit:
        links = links[: args.limit]
    args.out.parent.mkdir(parents=True, exist_ok=True)
    session = requests.Session()
    with args.out.open("w", encoding="utf-8") as out:
        for source_url in tqdm(links, desc="ADC leaflets"):
            urls = paired_leaflet_urls(source_url)
            if "detail_url" in urls:
                status, html = fetch(session, urls["detail_url"], args.timeout)
                if status == 200:
                    urls.update(discover_leaflet_urls_from_detail(html, urls["detail_url"]))
                time.sleep(args.delay)
            record = {
                "source_url": source_url,
                "pil_url": urls.get("pil_url"),
                "spc_url": urls.get("spc_url"),
                "pil_status": None,
                "spc_status": None,
                "pil_html": None,
                "spc_html": None,
                "metadata": {
                    "source": "adc.sk",
                    "scraped_at": datetime.now(timezone.utc).isoformat(),
                },
            }
            for kind in ("pil", "spc"):
                url = urls.get(f"{kind}_url")
                if not url:
                    continue
                try:
                    status, html = fetch(session, url, args.timeout)
                    record[f"{kind}_status"] = status
                    if status == 200:
                        record[f"{kind}_html"] = html
                except Exception as exc:
                    record[f"{kind}_status"] = f"error: {exc}"
                time.sleep(args.delay)
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
    print(f"Saved raw leaflets to {args.out}")
 if __name__ == "__main__":
    main()
--- a/scripts/adc_scraper/scrape_adc_product_data.py
+++ b/scripts/adc_scraper/scrape_adc_product_data.py
@ -0,0 +1,580 @@
 """Scrape structured ADC product data from detail and PIL pages.
 Example:
    python scripts/adc_scraper/scrape_adc_product_data.py --browser --limit 10
 """
 from __future__ import annotations
 import argparse
 import json
 import re
 import time
 from dataclasses import dataclass
 from datetime import datetime, timezone
 from pathlib import Path
 from typing import Callable, Iterable
 from urllib.parse import urljoin, urlparse
 import requests
 from bs4 import BeautifulSoup, Tag
 from tqdm import tqdm
 BASE_URL = "https://www.adc.sk"
 DEFAULT_DATA_DIR = Path("data_adc_databaza/adc_scrape_2026_05_04")
 DEFAULT_LINKS = DEFAULT_DATA_DIR / "adc_product_links.json"
 DEFAULT_OUT = DEFAULT_DATA_DIR / "adc_products_structured.json"
 HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "sk-SK,sk;q=0.9,cs;q=0.8,en;q=0.7",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache",
    "Referer": "https://www.adc.sk/databazy/produkty",
 }
 DETAIL_SECTION_ALIASES = {
    "Popis a určenie": "description_and_indications",
    "Použitie": "use_and_dosage",
    "Nežiaduce účinky": "side_effects",
    "Účinné látky": "active_substances",
    "Indikačná skupina": "indication_group",
    "ADC Klasifikácia produktu": "adc_classification",
    "Všeobecné informácie vzťahujúce sa k produktu": "general_product_info",
 }
 PIL_SECTION_PATTERNS = {
    "what_is_it": r"^1\.\s+Čo je .+",
    "before_use": r"^2\.\s+Čo potrebujete vedieť .+",
    "how_to_use": r"^3\.\s+Ako .+",
    "side_effects": r"^4\.\s+Možné .+účinky",
    "storage": r"^5\.\s+Ako uchovávať .+",
    "package_info": r"^6\.\s+Obsah balenia .+",
 }
 PIL_SUBSECTION_ALIASES = {
    "contraindications": [
        r"^Neužívajte .+",
        r"^Nepoužívajte .+",
        r"^Nesmiete .+",
    ],
    "warnings": [
        r"^Upozornenia a opatrenia",
        r"^Buďte zvlášť opatrný .+",
    ],
    "interactions": [
        r"^Iné lieky a .+",
        r"^Užívanie .+ s inými liekmi",
    ],
    "pregnancy_breastfeeding": [
        r"^Tehotenstvo.*dojčenie.*",
        r"^Tehotenstvo.*",
    ],
    "driving": [
        r"^Vedenie vozidiel .+",
    ],
 }
@dataclass(frozen=True)
 class ProductUrls:
    detail_url: str
    pil_url: str
    spc_url: str
 def clean_text(value: str) -> str:
    value = value.replace("\xa0", " ")
    value = re.sub(r"[ \t\r\f\v]+", " ", value)
    value = re.sub(r"\n{3,}", "\n\n", value)
    value = re.sub(r"(?im)^reklama$", "", value)
    return value.strip()
 def normalize_key(value: str) -> str:
    value = clean_text(value).lower()
    replacements = {
        "á": "a",
        "ä": "a",
        "č": "c",
        "ď": "d",
        "é": "e",
        "í": "i",
        "ľ": "l",
        "ĺ": "l",
        "ň": "n",
        "ó": "o",
        "ô": "o",
        "ŕ": "r",
        "š": "s",
        "ť": "t",
        "ú": "u",
        "ý": "y",
        "ž": "z",
    }
    for source, target in replacements.items():
        value = value.replace(source, target)
    value = re.sub(r"[^a-z0-9]+", "_", value)
    return value.strip("_")
 def product_urls(detail_url: str) -> ProductUrls:
    return ProductUrls(
        detail_url=detail_url,
        pil_url=detail_url.replace("/detail/", "/pil/"),
        spc_url=detail_url.replace("/detail/", "/spc/"),
    )
 def product_id_from_url(url: str) -> str | None:
    match = re.search(r"-(\d+)\.html(?:$|\?)", urlparse(url).path)
    return match.group(1) if match else None
 def slug_from_url(url: str) -> str:
    name = Path(urlparse(url).path).name
    return re.sub(r"-\d+\.html$", "", name)
 def load_links(path: Path) -> list[str]:
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError(f"Expected a JSON list in {path}")
    return [str(item) for item in data if str(item).strip()]
 def soup_from_html(html: str) -> BeautifulSoup:
    return BeautifulSoup(html, "lxml")
 def remove_noise(root: Tag) -> None:
    for tag in root.select(
        "script, style, noscript, nav, header, footer, iframe, form, "
        ".modal, .adbl, .ad-video-fake, .breadcrumb, .piktograms"
    ):
        tag.decompose()
 def node_text(node: Tag) -> str:
    remove_noise(node)
    return clean_text(node.get_text("\n", strip=True))
 def parse_json_ld_product(soup: BeautifulSoup) -> dict[str, str | None]:
    for script in soup.find_all("script", {"type": "application/ld+json"}):
        raw = script.string or script.get_text()
        if not raw.strip():
            continue
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Product":
                return {
                    "name": item.get("name"),
                    "description": item.get("description"),
                    "image_url": item.get("image"),
                }
    return {}
 def parse_info_rows(soup: BeautifulSoup) -> dict[str, str]:
    fields: dict[str, str] = {}
    for row in soup.select(".pmi-info-row"):
        children = [child for child in row.find_all(recursive=False) if isinstance(child, Tag)]
        if len(children) >= 2:
            key = clean_text(children[0].get_text(" ", strip=True))
            value = clean_text(" ".join(child.get_text(" ", strip=True) for child in children[1:]))
        else:
            parts = [part.strip() for part in row.get_text("|", strip=True).split("|") if part.strip()]
            if len(parts) < 2:
                continue
            key, value = parts[0], " ".join(parts[1:])
        if key and value:
            fields[normalize_key(key)] = value
    for table in soup.find_all("table"):
        for tr in table.find_all("tr"):
            cells = [clean_text(c.get_text(" ", strip=True)) for c in tr.find_all(["th", "td"])]
            if len(cells) == 2 and len(cells[0]) <= 80 and cells[1]:
                fields.setdefault(normalize_key(cells[0]), cells[1])
    return fields
 def collect_until_next_section(header: Tag) -> str:
    parts: list[str] = []
    for sibling in header.next_siblings:
        if isinstance(sibling, Tag) and sibling.name == "h4" and "section-product" in sibling.get("class", []):
            break
        if not isinstance(sibling, Tag):
            continue
        clone = BeautifulSoup(str(sibling), "lxml")
        text = node_text(clone)
        if text and text != clean_text(header.get_text(" ", strip=True)):
            parts.append(text)
    return clean_text("\n".join(parts))
 def parse_detail_sections(soup: BeautifulSoup) -> dict[str, str]:
    sections: dict[str, str] = {}
    for header in soup.select("h4.section-product"):
        title = clean_text(header.get_text(" ", strip=True))
        key = DETAIL_SECTION_ALIASES.get(title, normalize_key(title))
        text = collect_until_next_section(header)
        if text:
            sections[key] = text
    return sections
 def parse_classification(soup: BeautifulSoup) -> list[dict[str, str]]:
    levels: list[dict[str, str]] = []
    box = soup.select_one(".classification-levels")
    if not box:
        return levels
    for tr in box.find_all("tr"):
        cells = [clean_text(c.get_text(" ", strip=True)) for c in tr.find_all("td")]
        if len(cells) >= 2:
            levels.append({"code": cells[0], "name": cells[1]})
    return levels
 def parse_detail_page(html: str, detail_url: str) -> dict:
    soup = soup_from_html(html)
    json_ld = parse_json_ld_product(soup)
    h1 = soup.find("h1")
    fields = parse_info_rows(soup)
    sections = parse_detail_sections(soup)
    return {
        "product_id": product_id_from_url(detail_url),
        "slug": slug_from_url(detail_url),
        "name": json_ld.get("name") or (clean_text(h1.get_text(" ", strip=True)) if h1 else None),
        "short_description": clean_text(str(json_ld.get("description") or "")) or None,
        "image_url": json_ld.get("image_url"),
        "detail_fields": fields,
        "sections": sections,
        "classification": parse_classification(soup),
        "active_substances": split_list_field(sections.get("active_substances") or ""),
        "indication_group": sections.get("indication_group"),
    }
 def split_list_field(value: str) -> list[str]:
    if not value:
        return []
    items = [clean_text(item) for item in re.split(r"\n|,|;", value) if clean_text(item)]
    return list(dict.fromkeys(items))
 def extract_article_text(html: str) -> str:
    soup = soup_from_html(html)
    article = soup.find("article")
    if article:
        return node_text(article)
    fallback = soup.find("div", id="product") or soup.body or soup
    return node_text(fallback)
 def split_by_numbered_pil_sections(text: str) -> dict[str, str]:
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    starts: list[tuple[str, int]] = []
    for idx, line in enumerate(lines):
        for key, pattern in PIL_SECTION_PATTERNS.items():
            if re.match(pattern, line, flags=re.IGNORECASE):
                starts.append((key, idx))
                break
    sections: dict[str, str] = {}
    for pos, (key, idx) in enumerate(starts):
        end = starts[pos + 1][1] if pos + 1 < len(starts) else len(lines)
        sections[key] = clean_text("\n".join(lines[idx:end]))
    return sections
 def split_pil_subsections(before_use_text: str) -> dict[str, str]:
    if not before_use_text:
        return {}
    lines = [line.strip() for line in before_use_text.splitlines() if line.strip()]
    starts: list[tuple[str, int]] = []
    for idx, line in enumerate(lines):
        for key, patterns in PIL_SUBSECTION_ALIASES.items():
            if any(re.match(pattern, line, flags=re.IGNORECASE) for pattern in patterns):
                starts.append((key, idx))
                break
    result: dict[str, str] = {}
    for pos, (key, idx) in enumerate(starts):
        end = starts[pos + 1][1] if pos + 1 < len(starts) else len(lines)
        result[key] = clean_text("\n".join(lines[idx:end]))
    return result
 def parse_pil_page(html: str) -> dict:
    text = extract_article_text(html)
    sections = split_by_numbered_pil_sections(text)
    subsections = split_pil_subsections(sections.get("before_use", ""))
    return {
        "full_text": text,
        "sections": sections,
        "subsections": subsections,
    }
 def build_lightrag_text(detail: dict, pil: dict | None, urls: ProductUrls) -> str:
    chunks: list[str] = []
    name = detail.get("name") or detail.get("slug") or urls.detail_url
    chunks.append(f"Liek: {name}")
    chunks.append(f"ADC detail URL: {urls.detail_url}")
    chunks.append(f"ADC PIL URL: {urls.pil_url}")
    fields = detail.get("detail_fields") or {}
    important_fields = [
        "registracne_cislo_produktu",
        "kod_statnej_autority_sukl",
        "nazov_produktu_podla_sukl",
        "aplikacna_forma",
        "vyrobca",
        "drzitel_rozhodnutia",
        "dodavatelia",
        "vydaj",
        "typ_produktu",
        "legislativne_zatriedenie",
    ]
    for key in important_fields:
        if fields.get(key):
            chunks.append(f"{key}: {fields[key]}")
    for section_key, title in [
        ("description_and_indications", "Popis a indikácie"),
        ("use_and_dosage", "Použitie a dávkovanie"),
        ("side_effects", "Nežiaduce účinky"),
        ("active_substances", "Účinné látky"),
        ("indication_group", "Indikačná skupina"),
        ("general_product_info", "Všeobecné informácie"),
    ]:
        text = (detail.get("sections") or {}).get(section_key)
        if text:
            chunks.append(f"\n{title}\n{text}")
    if pil:
        subsections = pil.get("subsections") or {}
        for key, title in [
            ("contraindications", "Kontraindikácie z PIL"),
            ("warnings", "Upozornenia z PIL"),
            ("interactions", "Interakcie z PIL"),
            ("pregnancy_breastfeeding", "Tehotenstvo a dojčenie z PIL"),
            ("driving", "Vedenie vozidiel z PIL"),
        ]:
            if subsections.get(key):
                chunks.append(f"\n{title}\n{subsections[key]}")
        for key, title in [
            ("what_is_it", "Na čo sa používa z PIL"),
            ("how_to_use", "Ako užívať z PIL"),
            ("side_effects", "Vedľajšie účinky z PIL"),
        ]:
            section_text = (pil.get("sections") or {}).get(key)
            if section_text:
                chunks.append(f"\n{title}\n{section_text}")
    return clean_text("\n\n".join(chunks))
 def build_graph_hints(detail: dict, pil: dict | None) -> dict:
    fields = detail.get("detail_fields") or {}
    sections = detail.get("sections") or {}
    pil_subsections = (pil or {}).get("subsections") or {}
    pil_sections = (pil or {}).get("sections") or {}
    return {
        "drug": detail.get("name"),
        "active_substances": detail.get("active_substances") or [],
        "dosage_form": fields.get("aplikacna_forma"),
        "manufacturer": fields.get("vyrobca"),
        "marketing_authorization_holder": fields.get("drzitel_rozhodnutia"),
        "supplier": fields.get("dodavatelia"),
        "sukl_code": fields.get("kod_statnej_autority_sukl"),
        "registration_number": fields.get("registracne_cislo_produktu"),
        "classification_codes": detail.get("classification") or [],
        "indications_text": sections.get("description_and_indications") or pil_sections.get("what_is_it"),
        "dosage_text": sections.get("use_and_dosage") or pil_sections.get("how_to_use"),
        "contraindications_text": pil_subsections.get("contraindications"),
        "warnings_text": pil_subsections.get("warnings"),
        "interactions_text": pil_subsections.get("interactions"),
        "side_effects_text": sections.get("side_effects") or pil_sections.get("side_effects"),
    }
 def build_record(detail_html: str, pil_html: str | None, urls: ProductUrls) -> dict:
    detail = parse_detail_page(detail_html, urls.detail_url)
    pil = parse_pil_page(pil_html) if pil_html else None
    scraped_at = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return {
        "source": "adc.sk",
        "scraped_at": scraped_at,
        "urls": {
            "detail": urls.detail_url,
            "pil": urls.pil_url,
            "spc": urls.spc_url,
        },
        "product": detail,
        "pil": pil,
        "graph_hints": build_graph_hints(detail, pil),
        "lightrag_text": build_lightrag_text(detail, pil, urls),
    }
 def fetch_requests(session: requests.Session, url: str, timeout: int, retries: int) -> str:
    last_error: Exception | None = None
    for attempt in range(1, retries + 1):
        try:
            response = session.get(url, headers=HEADERS, timeout=timeout)
            response.raise_for_status()
            response.encoding = response.apparent_encoding or "utf-8"
            return response.text
        except Exception as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(1.5 * attempt)
    raise RuntimeError(f"Failed to fetch {url}: {last_error}")
 def make_requests_fetcher(timeout: int, retries: int) -> Callable[[str], str]:
    session = requests.Session()
    return lambda url: fetch_requests(session, url, timeout, retries)
 def make_browser_fetcher() -> tuple[Callable[[str], str], Callable[[], None]]:
    try:
        from playwright.sync_api import sync_playwright
    except ImportError as exc:
        raise SystemExit(
            "Playwright is not installed. Run: pip install playwright; python -m playwright install chromium"
        ) from exc
    playwright = sync_playwright().start()
    browser = playwright.chromium.launch(headless=True)
    page = browser.new_page(
        user_agent=HEADERS["User-Agent"],
        locale="sk-SK",
        viewport={"width": 1366, "height": 900},
    )
    def fetch(url: str) -> str:
        response = page.goto(url, wait_until="domcontentloaded", timeout=60000)
        if response is None or response.status >= 400:
            status = response.status if response else "no-response"
            raise RuntimeError(f"HTTP {status} for {url}")
        return page.content()
    def close() -> None:
        browser.close()
        playwright.stop()
    return fetch, close
 def iter_links(links: Iterable[str], limit: int | None) -> Iterable[str]:
    count = 0
    for link in links:
        if limit is not None and count >= limit:
            break
        count += 1
        yield link
 def write_records_json(
    out_path: Path,
    links: list[str],
    fetch: Callable[[str], str],
    limit: int | None,
    delay: float,
    skip_failed: bool,
 ) -> list[dict[str, str]]:
    out_path.parent.mkdir(parents=True, exist_ok=True)
    failures: list[dict[str, str]] = []
    selected_links = list(iter_links(links, limit))
    with out_path.open("w", encoding="utf-8") as out:
        out.write("[\n")
        wrote_any = False
        for detail_url in tqdm(selected_links, desc="ADC products"):
            urls = product_urls(detail_url)
            try:
                detail_html = fetch(urls.detail_url)
                time.sleep(delay)
                pil_html = fetch(urls.pil_url)
                record = build_record(detail_html, pil_html, urls)
            except Exception as exc:
                failures.append({"url": detail_url, "error": str(exc)})
                tqdm.write(f"Failed product {detail_url}: {exc}")
                if not skip_failed:
                    raise
                continue
            if wrote_any:
                out.write(",\n")
            json.dump(record, out, ensure_ascii=False, indent=2)
            wrote_any = True
            out.flush()
            time.sleep(delay)
        out.write("\n]\n")
    return failures
 def main() -> None:
    parser = argparse.ArgumentParser(description="Scrape ADC product detail + PIL data into structured JSON.")
    parser.add_argument("--links", type=Path, default=DEFAULT_LINKS, help="Input JSON list with detail URLs.")
    parser.add_argument("--out", type=Path, default=DEFAULT_OUT, help="Output structured JSON file.")
    parser.add_argument("--limit", type=int, default=None, help="Scrape only the first N products.")
    parser.add_argument("--delay", type=float, default=0.25, help="Delay between page loads in seconds.")
    parser.add_argument("--timeout", type=int, default=30, help="HTTP timeout in seconds for requests mode.")
    parser.add_argument("--retries", type=int, default=3, help="Retries per URL in requests mode.")
    parser.add_argument("--browser", action="store_true", help="Use Playwright Chromium. Use this if ADC returns 403.")
    parser.add_argument("--stop-on-fail", action="store_true", help="Stop on first failed product.")
    args = parser.parse_args()
    links = load_links(args.links)
    close_browser: Callable[[], None] | None = None
    if args.browser:
        fetch, close_browser = make_browser_fetcher()
    else:
        fetch = make_requests_fetcher(args.timeout, args.retries)
    try:
        failures = write_records_json(
            out_path=args.out,
            links=links,
            fetch=fetch,
            limit=args.limit,
            delay=args.delay,
            skip_failed=not args.stop_on_fail,
        )
    finally:
        if close_browser:
            close_browser()
    print(f"Saved structured product data to {args.out}")
    if failures:
        failed_path = args.out.with_suffix(".failed.json")
        failed_path.write_text(json.dumps(failures, ensure_ascii=False, indent=2), encoding="utf-8")
        print(f"Failed products: {len(failures)}. Saved errors to {failed_path}")
 if __name__ == "__main__":
    main()
--- a/scripts/adc_scraper/scrape_adc_product_links.py
+++ b/scripts/adc_scraper/scrape_adc_product_links.py
@ -0,0 +1,182 @@
 """Scrape product detail links from ADC product listing pages.
 Example:
    python scripts/adc_scraper/scrape_adc_product_links.py --out data_adc_databaza/adc_scrape_2026_05_04/adc_product_links.json
 """
 from __future__ import annotations
 import argparse
 import json
 import time
 from pathlib import Path
 from urllib.parse import urljoin
 import requests
 from bs4 import BeautifulSoup
 from tqdm import tqdm
 BASE_URL = "https://www.adc.sk"
 LISTING_URL = "https://www.adc.sk/databazy/produkty?page={page}&ord=a1"
 DEFAULT_PAGES = 711
 HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "sk-SK,sk;q=0.9,cs;q=0.8,en;q=0.7",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache",
    "Referer": "https://www.adc.sk/databazy/produkty",
 }
 def fetch_page(session: requests.Session, page: int, timeout: int, retries: int) -> str:
    url = LISTING_URL.format(page=page)
    last_error: Exception | None = None
    for attempt in range(1, retries + 1):
        try:
            response = session.get(url, headers=HEADERS, timeout=timeout)
            response.raise_for_status()
            response.encoding = response.apparent_encoding or "utf-8"
            return response.text
        except Exception as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(1.5 * attempt)
    raise RuntimeError(f"Failed to fetch page {page}: {last_error}")
 def extract_product_links(html: str) -> list[str]:
    soup = BeautifulSoup(html, "lxml")
    links: list[str] = []
    for tag in soup.select('a.product[href^="/databazy/produkty/detail/"]'):
        href = tag.get("href")
        if not href:
            continue
        links.append(urljoin(BASE_URL, href))
    return links
 def scrape_with_requests(
    start_page: int,
    pages: int,
    delay: float,
    timeout: int,
    retries: int,
 ) -> tuple[list[str], list[int]]:
    session = requests.Session()
    seen: set[str] = set()
    all_links: list[str] = []
    failed_pages: list[int] = []
    end_page = start_page + pages - 1
    for page in tqdm(range(start_page, end_page + 1), desc="ADC pages"):
        try:
            html = fetch_page(session, page, timeout, retries)
            page_links = extract_product_links(html)
        except Exception as exc:
            tqdm.write(str(exc))
            failed_pages.append(page)
            continue
        for link in page_links:
            if link not in seen:
                seen.add(link)
                all_links.append(link)
        time.sleep(delay)
    return all_links, failed_pages
 def scrape_with_browser(start_page: int, pages: int, delay: float) -> tuple[list[str], list[int]]:
    try:
        from playwright.sync_api import sync_playwright
    except ImportError as exc:
        raise SystemExit(
            "Playwright is not installed. Run: pip install playwright; python -m playwright install chromium"
        ) from exc
    seen: set[str] = set()
    all_links: list[str] = []
    failed_pages: list[int] = []
    end_page = start_page + pages - 1
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=True)
        page_obj = browser.new_page(
            user_agent=HEADERS["User-Agent"],
            locale="sk-SK",
            viewport={"width": 1366, "height": 900},
        )
        for page in tqdm(range(start_page, end_page + 1), desc="ADC pages"):
            url = LISTING_URL.format(page=page)
            try:
                response = page_obj.goto(url, wait_until="domcontentloaded", timeout=60000)
                if response is None or response.status >= 400:
                    status = response.status if response else "no-response"
                    raise RuntimeError(f"HTTP {status}")
                html = page_obj.content()
                page_links = extract_product_links(html)
            except Exception as exc:
                tqdm.write(f"Failed page {page}: {exc}")
                failed_pages.append(page)
                continue
            for link in page_links:
                if link not in seen:
                    seen.add(link)
                    all_links.append(link)
            time.sleep(delay)
        browser.close()
    return all_links, failed_pages
 def main() -> None:
    parser = argparse.ArgumentParser(description="Scrape ADC product detail links.")
    parser.add_argument("--out", type=Path, required=True, help="Output JSON file.")
    parser.add_argument("--pages", type=int, default=DEFAULT_PAGES, help="Number of ADC listing pages.")
    parser.add_argument("--start-page", type=int, default=1, help="First page number.")
    parser.add_argument("--delay", type=float, default=0.25, help="Delay between requests in seconds.")
    parser.add_argument("--timeout", type=int, default=30, help="HTTP timeout in seconds.")
    parser.add_argument("--retries", type=int, default=3, help="Retries per page.")
    parser.add_argument(
        "--browser",
        action="store_true",
        help="Use Playwright Chromium instead of requests. Useful when ADC returns HTTP 403.",
    )
    args = parser.parse_args()
    if args.browser:
        all_links, failed_pages = scrape_with_browser(args.start_page, args.pages, args.delay)
    else:
        all_links, failed_pages = scrape_with_requests(
            args.start_page,
            args.pages,
            args.delay,
            args.timeout,
            args.retries,
        )
    args.out.parent.mkdir(parents=True, exist_ok=True)
    args.out.write_text(json.dumps(all_links, ensure_ascii=False, indent=2), encoding="utf-8")
    print(f"Saved {len(all_links)} unique product links to {args.out}")
    if failed_pages:
        print(f"Failed pages: {failed_pages}")
 if __name__ == "__main__":
    main()
--- a/scripts/adc_scraper/validate_adc_json.py
+++ b/scripts/adc_scraper/validate_adc_json.py
@ -0,0 +1,47 @@
 """Validate basic quality of structured ADC JSON."""
 from __future__ import annotations
 import argparse
 import json
 from collections import Counter
 from pathlib import Path
 def main() -> None:
    parser = argparse.ArgumentParser(description="Validate structured ADC JSON.")
    parser.add_argument("--input", type=Path, required=True)
    args = parser.parse_args()
    data = json.loads(args.input.read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise SystemExit("Input must be a JSON list.")
    missing = Counter()
    section_counter = Counter()
    total_pil_chars = 0
    total_spc_chars = 0
    for record in data:
        for key in ("source_url", "name", "pil_text", "spc_text", "sections"):
            if not record.get(key):
                missing[key] += 1
        total_pil_chars += len(record.get("pil_text") or "")
        total_spc_chars += len(record.get("spc_text") or "")
        for section_name, section_text in (record.get("sections") or {}).items():
            if section_text:
                section_counter[section_name] += 1
    print(f"Records: {len(data)}")
    print(f"Average PIL chars: {total_pil_chars // max(len(data), 1)}")
    print(f"Average SPC chars: {total_spc_chars // max(len(data), 1)}")
    print("Missing fields:")
    for key in ("source_url", "name", "pil_text", "spc_text", "sections"):
        print(f"  {key}: {missing[key]}")
    print("Detected sections:")
    for key, value in section_counter.most_common():
        print(f"  {key}: {value}")
 if __name__ == "__main__":
    main()
--- a/start_servers.py
+++ b/start_servers.py
@ -0,0 +1,94 @@
 """
 Запуск всех серверов для работы с графом знаний ADC.
 Запускает:
  1. Embedding server  — localhost:8010  (локальная модель, словацкий язык)
  2. LightRAG server   — localhost:9621  (граф + API + WebUI)
 Использование:
    python start_servers.py
 Остановка: Ctrl+C
 """
 import subprocess
 import sys
 import time
 import urllib.request
 import os
 from pathlib import Path
 ROOT = Path(__file__).parent
 LIGHTRAG_DIR = ROOT / "lightrag"
 EMBEDDING_SCRIPT = ROOT / "embedding_server.py"
 def wait_for(url, name, timeout=60):
    print(f"  Ожидаю {name}...", end="", flush=True)
    for _ in range(timeout):
        try:
            urllib.request.urlopen(url, timeout=2)
            print(" OK")
            return True
        except:
            print(".", end="", flush=True)
            time.sleep(1)
    print(" ТАЙМАУТ")
    return False
 def main():
    print("=" * 50)
    print("Запуск серверов LightRAG ADC")
    print("=" * 50)
    env = os.environ.copy()
    env["PYTHONUTF8"] = "1"
    # 1. Embedding server
    print("\n[1/2] Запуск Embedding server (порт 8010)...")
    embed_proc = subprocess.Popen(
        [sys.executable, str(EMBEDDING_SCRIPT)],
        env=env,
        cwd=str(ROOT),
    )
    if not wait_for("http://localhost:8010/health", "embedding server"):
        print("ОШИБКА: embedding server не запустился")
        embed_proc.terminate()
        sys.exit(1)
    # 2. LightRAG server
    print("\n[2/2] Запуск LightRAG server (порт 9621)...")
    lightrag_proc = subprocess.Popen(
        [sys.executable, "-m", "lightrag.api.lightrag_server"],
        env=env,
        cwd=str(LIGHTRAG_DIR),
    )
    if not wait_for("http://localhost:9621/health", "LightRAG server", timeout=30):
        print("ОШИБКА: LightRAG server не запустился")
        embed_proc.terminate()
        lightrag_proc.terminate()
        sys.exit(1)
    print("\n" + "=" * 50)
    print("Все серверы запущены!")
    print("  Embedding:  http://localhost:8010/health")
    print("  LightRAG:   http://localhost:9621/health")
    print("  WebUI:      http://localhost:9621/webui  (если собран)")
    print("=" * 50)
    print("\nCtrl+C для остановки\n")
    try:
        embed_proc.wait()
        lightrag_proc.wait()
    except KeyboardInterrupt:
        print("\nОстанавливаю серверы...")
        embed_proc.terminate()
        lightrag_proc.terminate()
        print("Готово.")
 if __name__ == "__main__":
    main()
		`@ -0,0 +1 @@`
							`"""ADC scraper scripts for the diploma project."""`