advafaeian/clinical-rag

🩺 ClinicalRAG — AI-Powered Physician Assistant

A RAG system that enables physicians to retrieve instant, source-cited answers from their own collection of medical textbooks.

Features

  • Hierarchical PDF parsing: fitz_hybrid detects heading levels from font size and weight and builds a full breadcrumb path (heading_path) for every chunk, so retrieved passages always carry their location in the book
  • Three parser backends: fitz_hybrid (local, multi-column aware, full hierarchy), pymupdf4llm (local, flat markdown), llamaparse (cloud, best for scanned/complex layouts)
  • Semantic chunking: each document is split first along detected headings, then a sentence splitter runs within each section; chunks never straddle section breaks
  • Multi-provider embeddings: BGE-M3 (Cloudflare, no rate limits), Voyage AI (highest quality), Gemini; each gets its own ChromaDB collection, so you can maintain and compare all three side by side
  • Provider auto-detection: collections are stamped at ingest time; load_index() reads the stamp and wires the correct embed model automatically, so query and document embeddings cannot be mismatched
  • Cohere reranker: cross-encoder reranking narrows the retrieval pool to the most relevant passages before the LLM sees them
  • Incremental ingestion: re-running ingest, or adding a single PDF with --pdf, is safe; already-indexed files are skipped automatically
  • Two LLM backends: Gemini 2.5 Flash or DeepSeek-R1-Distill-Qwen-32B (Cloudflare, uses the same credentials as BGE-M3)
  • Inline citations: every answer is grounded in source passages, shown with book title and page number
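
To make the heading_path idea concrete, here is a minimal sketch of how a breadcrumb could be built from headings in reading order. The function name build_heading_paths, the (level, title) input shape, and the " > " separator are assumptions for illustration, not the repository's actual code.

```python
from typing import List, Tuple


def build_heading_paths(headings: List[Tuple[int, str]]) -> List[str]:
    """Turn (level, title) pairs in reading order into breadcrumb strings."""
    stack: List[Tuple[int, str]] = []
    paths: List[str] = []
    for level, title in headings:
        # Headings at the same or a deeper level are no longer ancestors.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        paths.append(" > ".join(t for _, t in stack))
    return paths


headings = [
    (1, "Cardiology"),
    (2, "Heart Failure"),
    (3, "Treatment"),
    (2, "Arrhythmias"),
]
paths = build_heading_paths(headings)
```

Each chunk would then carry the path of the most recent heading above it, which is what lets a retrieved passage report where it sits in the book.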

Architecture

PDFs (medical textbooks)
    │
    ▼
Parser (choose one)
  ├── fitz_hybrid    ← local · multi-column · full heading hierarchy (DEFAULT)
  ├── pymupdf4llm    ← local · flat markdown only
  └── llamaparse     ← cloud · best for scanned/complex layouts
    │
    ▼
Semantic Chunker     ← heading-aware section split → sentence split within sections
    │
    ▼
Embedding Provider   ← choose one (or index all three):
  ├── BGE-M3         BAAI/bge-m3 via Cloudflare Workers AI  (1024-dim, recommended)
  ├── Voyage AI      voyage-4-large                          (2048-dim, best quality)
  └── Gemini         models/gemini-embedding-001             (768-dim, rate-limit prone)
    │
    ▼
ChromaDB (local)     ← provider-scoped collections (one per embedding model)
    │                   e.g. clinical_rag_bge_m3 / clinical_rag_voyage / clinical_rag_gemini
    ▼
Query
  → Auto-detect embedding  (reads collection stamp → no mismatch possible)
  → Cohere reranker        (cross-encoder, top-N precision)
  → LLM answer             (Gemini 2.5 Flash or DeepSeek-R1)
  → Inline citations       (source title + page number)
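
The chunking stage in the diagram can be sketched as follows. This is a simplified stand-in, not the project's chunker: it uses a plain regular expression to find sentence boundaries, and the names chunk_sections and max_chars are hypothetical. It shows the one property the diagram promises, that sentences are only packed together inside a single section.

```python
import re
from typing import Dict, List


def chunk_sections(sections: List[Dict[str, str]], max_chars: int = 200) -> List[Dict[str, str]]:
    """Pack sentences into chunks without ever crossing a section boundary."""
    chunks: List[Dict[str, str]] = []
    for section in sections:
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", section["text"].strip()) if s]
        current = ""
        for sentence in sentences:
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_chars:
                # Adding this sentence would exceed the budget: close the chunk.
                chunks.append({"heading_path": section["heading_path"], "text": current})
                current = sentence
            else:
                current = candidate
        if current:
            # Flush the remainder before moving to the next section.
            chunks.append({"heading_path": section["heading_path"], "text": current})
    return chunks


sections = [
    {"heading_path": "A", "text": "One. Two. Three."},
    {"heading_path": "B", "text": "Four."},
]
chunks = chunk_sections(sections, max_chars=10)
```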

Project Structure

clinical-rag/
├── config/
│   └── settings.py              # All config, env vars, collection naming
├── embeddings/
│   └── providers.py             # EmbeddingProvider enum, build/detect/stamp
├── ingestion/
│   ├── parser.py                # PDF → Documents (fitz_hybrid / pymupdf4llm / llamaparse)
│   ├── chunker.py               # Semantic chunking: heading-aware + sentence splitter
│   └── indexer.py               # Embed + store; deduplication; auto-detect on load
├── retrieval/
│   ├── retriever.py             # ChromaDB vector retrieval
│   └── reranker.py              # Cohere reranker wrapper
├── pipeline/
│   └── rag_pipeline.py          # End-to-end orchestration
├── evaluation/
│   └── eval_pipeline.py         # Faithfulness / Relevancy / Correctness evals
├── ui/
│   └── app.py                   # Streamlit UI with provider selector + chat interface
├── notebooks/
│   └── demo.ipynb               # Interactive walkthrough
├── data/
│   ├── pdfs/                    # Drop your medical textbook PDFs here
│   └── chroma_db/               # Auto-created, one sub-collection per provider
├── ingest.py                    # CLI: parse + embed + store
├── query.py                     # CLI: query the pipeline
├── requirements.txt
├── Makefile
└── .env.example

Setup

1. Install

pip install -r requirements.txt

2. Configure

cp .env.example .env
# Fill in your API keys

3. Drop PDFs

data/pdfs/harrisons_principles.pdf
data/pdfs/uptodate_export.pdf

Any medical textbook PDF works. fitz_hybrid handles multi-column layouts and complex heading hierarchies automatically.
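
As a rough illustration of detecting headings from font size and weight, the sketch below classifies a single text span. It assumes span dictionaries shaped like those PyMuPDF returns from page.get_text("dict"), where "size" is the font size and bit 4 of "flags" marks bold text. The function name, the thresholds, and the rule that bold body-size text counts as the deepest heading level are all assumptions, not the actual fitz_hybrid logic.

```python
from typing import Dict, List, Optional

BOLD_FLAG = 16  # bit 4 of a PyMuPDF span's "flags" field marks bold text


def heading_level(span: Dict, body_size: float, size_levels: List[float]) -> Optional[int]:
    """Return a 1-based heading level for a span, or None for body text.

    size_levels lists minimum font sizes from the largest heading downward.
    """
    size = span["size"]
    for level, threshold in enumerate(size_levels, start=1):
        if size >= threshold:
            return level
    # Assumption: bold text at body size acts as the deepest heading level.
    if span["flags"] & BOLD_FLAG and size >= body_size:
        return len(size_levels) + 1
    return None


levels = [
    heading_level({"size": 20.0, "flags": 0}, 10.0, [18.0, 14.0]),
    heading_level({"size": 14.0, "flags": 0}, 10.0, [18.0, 14.0]),
    heading_level({"size": 10.0, "flags": 16}, 10.0, [18.0, 14.0]),
    heading_level({"size": 10.0, "flags": 0}, 10.0, [18.0, 14.0]),
]
```

In a real document the thresholds would likely be derived from the font-size distribution of the book rather than hard-coded.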

4. Ingest

# Recommended — BGE-M3, no rate limits
python ingest.py --provider bge_m3

# Best retrieval quality
python ingest.py --provider voyage

# You can index all three — each gets its own collection
python ingest.py --provider gemini

# Add a single PDF (already-indexed files are skipped automatically)
python ingest.py --pdf data/pdfs/new_textbook.pdf --provider bge_m3

# Force re-ingest from scratch
python ingest.py --provider bge_m3 --reset

# Choose parser explicitly
python ingest.py --parser fitz_hybrid    # default: local, full heading hierarchy
python ingest.py --parser llamaparse     # cloud, best for scanned layouts
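
One way the "already-indexed files are skipped" behaviour could work is to compare candidate PDFs against the file names already recorded in the collection. The sketch below assumes that set of names is available; select_new_pdfs and indexed_names are hypothetical and do not describe how ingest.py actually tracks files.

```python
from pathlib import Path
from typing import Iterable, List, Set


def select_new_pdfs(pdf_paths: Iterable[Path], indexed_names: Set[str]) -> List[Path]:
    """Return only the PDFs whose file name has not been indexed yet."""
    return sorted(p for p in pdf_paths if p.name not in indexed_names)


candidates = [
    Path("data/pdfs/harrisons_principles.pdf"),
    Path("data/pdfs/new_textbook.pdf"),
]
already_indexed = {"harrisons_principles.pdf"}
to_ingest = select_new_pdfs(candidates, already_indexed)
```

Matching on file name alone would miss a PDF whose contents changed under the same name; a content hash would be the stricter variant.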

5. Query

# CLI
python query.py --provider bge_m3 "What is the first-line treatment for septic shock?"
python query.py --provider voyage --interactive

# Web UI (provider + LLM selector included)
streamlit run ui/app.py

6. Evaluate

python evaluation/eval_pipeline.py

API Keys

| Service | Key | Required for |
|---|---|---|
| Google AI | GEMINI_API_KEY | Gemini 2.5 Flash LLM (always) + Gemini embeddings |
| Cohere | COHERE_API_KEY | Reranker (always) |
| Cloudflare | CF_API_TOKEN + CF_ACCOUNT_ID | BGE-M3 embeddings + DeepSeek LLM |
| Voyage AI | VOYAGE_API_KEY | Voyage embeddings only |
| LlamaCloud | LLAMA_PARSE_API_KEY | Only when PARSER=llamaparse |
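
The key requirements above can be expressed as a small pre-flight check. This is a sketch under assumptions: the function missing_keys and the LLM identifiers "gemini" and "deepseek" are made up for illustration, and GEMINI_API_KEY is treated as always required because the table marks it "(always)".

```python
from typing import List, Mapping


def missing_keys(provider: str, llm: str, parser: str, env: Mapping[str, str]) -> List[str]:
    """List the environment variables that are required but unset."""
    # Per the table: the Gemini and Cohere keys are always needed.
    required = ["GEMINI_API_KEY", "COHERE_API_KEY"]
    if provider == "bge_m3" or llm == "deepseek":
        required += ["CF_API_TOKEN", "CF_ACCOUNT_ID"]
    if provider == "voyage":
        required.append("VOYAGE_API_KEY")
    if parser == "llamaparse":
        required.append("LLAMA_PARSE_API_KEY")
    return [key for key in required if not env.get(key)]


env = {"GEMINI_API_KEY": "x", "COHERE_API_KEY": "y"}
missing = missing_keys("bge_m3", "gemini", "fitz_hybrid", env)
```

Passing os.environ as env would run the same check against the real environment before ingesting or querying.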

Embedding Provider Comparison

| Provider | Dimensions | Notes |
|---|---|---|
| bge_m3 | 1024 | Multilingual, strong on medical text, no rate limits (Cloudflare) |
| voyage | 2048 | Best retrieval quality, 32k context window |
| gemini | 768 | May hit 429 rate limits; exponential backoff built-in |
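
For readers unfamiliar with exponential backoff, the sketch below shows the general pattern for retrying a rate-limited call. It is not the project's built-in implementation: RateLimitError is a stand-in for whatever exception the embedding client raises on HTTP 429, and with_backoff, max_retries, and base_delay are hypothetical names.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Stand-in for the exception an embedding client raises on HTTP 429."""


def with_backoff(
    call: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry call() on rate limits, doubling the wait after each failure."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


# Demonstration with a fake call that fails twice, then succeeds.
attempts: List[int] = []
delays: List[float] = []


def flaky_embed() -> str:
    attempts.append(1)
    if len(attempts) < 3:
        raise RateLimitError("429")
    return "ok"


result = with_backoff(flaky_embed, sleep=delays.append)
```

Injecting sleep as a parameter keeps the demonstration instant; production code would usually also add random jitter to the delay.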

Auto-Detection

At ingest time, each ChromaDB collection is stamped with the provider name. At query time, load_index() reads this stamp and automatically instantiates the matching embed model — query and document vectors are guaranteed to match without any manual tracking on your part.

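# Import path assumed from the project layout (pipeline/rag_pipeline.py)
from pipeline.rag_pipeline import ClinicalRAGPipeline
# This just works: no need to remember which model you used
pipeline = ClinicalRAGPipeline.from_existing_index(provider="voyage")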
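
The stamp-and-detect idea can be sketched with a plain dictionary standing in for a collection's metadata (ChromaDB collections do carry a metadata mapping, which is one place such a stamp could live). The key name embedding_provider and the functions stamp and detect_provider are assumptions for illustration, not the repository's API.

```python
from typing import Dict, Optional

STAMP_KEY = "embedding_provider"  # hypothetical metadata key


def stamp(metadata: Optional[Dict[str, str]], provider: str) -> Dict[str, str]:
    """Return a copy of the collection metadata with the provider recorded."""
    stamped = dict(metadata or {})
    stamped[STAMP_KEY] = provider
    return stamped


def detect_provider(metadata: Optional[Dict[str, str]], requested: Optional[str] = None) -> str:
    """Read the stamp back, refusing to proceed on a missing or conflicting one."""
    stamped = (metadata or {}).get(STAMP_KEY)
    if stamped is None:
        raise ValueError("collection has no provider stamp; re-ingest it")
    if requested is not None and requested != stamped:
        raise ValueError(f"collection was embedded with {stamped!r}, not {requested!r}")
    return stamped


collection_metadata = stamp({}, "voyage")      # at ingest time
provider = detect_provider(collection_metadata)  # at query time
```

Raising on a conflict, rather than silently switching models, is what keeps query vectors and document vectors in the same embedding space.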

About

Physician-focused RAG pipeline with hierarchical PDF parsing, semantic chunking, reranking, and source-cited answers from complex medical textbooks.
