🩺 ClinicalRAG — AI-Powered Physician Assistant

A retrieval-augmented generation (RAG) system that lets physicians retrieve instant, source-cited answers from their own collection of medical textbooks.

Features

Hierarchical PDF parsing — fitz_hybrid detects heading levels from font size and weight, builds a full breadcrumb path (heading_path) for every chunk, so retrieved passages always know where they came from in the book
Three parser backends — fitz_hybrid (local, multi-column aware, full hierarchy), pymupdf4llm (local, flat markdown), llamaparse (cloud, best for scanned/complex layouts)
Semantic chunking — text is split first along detected headings, then a sentence splitter runs within each section, so chunks never straddle section breaks
Multi-provider embeddings — BGE-M3 (Cloudflare, no rate limits), Voyage AI (highest quality), Gemini; each gets its own ChromaDB collection so you can maintain and compare all three simultaneously
Provider auto-detection — collections are stamped at ingest time; load_index() reads the stamp and wires the correct embed model automatically — no mismatch possible
Cohere reranker — cross-encoder reranking narrows the retrieval pool to the highest-precision passages before the LLM sees them
Incremental ingestion — re-running ingest or adding a single PDF with --pdf is safe; already-indexed files are skipped automatically
Two LLM backends — Gemini 2.5 Flash or DeepSeek-R1-Distill-Qwen-32B (Cloudflare, uses same credentials as BGE-M3)
Inline citations — every answer is grounded in source passages with book title and page number displayed
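
The breadcrumb idea behind heading_path can be illustrated with a small stack-based sketch. It assumes the parser has already labelled each block with a heading level (0 for body text). The names build_heading_paths and Chunk, and the (level, text) tuple format, are illustrative and not taken from the project's parser.py.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    heading_path: list  # e.g. ["Cardiology", "Heart Failure"]


def build_heading_paths(blocks):
    """Attach the current heading breadcrumb to every body block.

    blocks: iterable of (level, text); level 0 is body text,
    level 1..n is a heading of that depth (assumed input format).
    """
    stack = []   # open headings as (level, title), outermost first
    chunks = []
    for level, text in blocks:
        if level > 0:
            # A new heading closes every open heading at the same or deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, text))
        else:
            chunks.append(Chunk(text=text, heading_path=[t for _, t in stack]))
    return chunks


example_blocks = [
    (1, "Cardiology"),
    (2, "Heart Failure"),
    (0, "First body paragraph."),
    (2, "Arrhythmias"),
    (0, "Second body paragraph."),
    (1, "Nephrology"),
    (0, "Third body paragraph."),
]
example_chunks = build_heading_paths(example_blocks)
```

Each chunk carries its own path, so a retrieved passage can be displayed with its chapter and section without looking up its neighbours.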

Architecture

PDFs (medical textbooks)
    │
    ▼
Parser (choose one)
  ├── fitz_hybrid    ← local · multi-column · full heading hierarchy (DEFAULT)
  ├── pymupdf4llm    ← local · flat markdown only
  └── llamaparse     ← cloud · best for scanned/complex layouts
    │
    ▼
Semantic Chunker     ← heading-aware section split → sentence split within sections
    │
    ▼
Embedding Provider   ← choose one (or index all three):
  ├── BGE-M3         BAAI/bge-m3 via Cloudflare Workers AI  (1024-dim, recommended)
  ├── Voyage AI      voyage-4-large                          (2048-dim, best quality)
  └── Gemini         models/gemini-embedding-001             (768-dim, rate-limit prone)
    │
    ▼
ChromaDB (local)     ← provider-scoped collections (one per embedding model)
    │                   e.g. clinical_rag_bge_m3 / clinical_rag_voyage / clinical_rag_gemini
    ▼
Query
  → Auto-detect embedding  (reads collection stamp → no mismatch possible)
  → Cohere reranker        (cross-encoder, top-N precision)
  → LLM answer             (Gemini 2.5 Flash or DeepSeek-R1)
  → Inline citations       (source title + page number)
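
The Semantic Chunker stage in the diagram can be sketched as follows. This is a minimal illustration, assuming sections arrive as (heading_path, text) pairs and that chunk size is measured in characters. The regex sentence splitter and the name chunk_sections are stand-ins, not the project's chunker.py.

```python
import re


def chunk_sections(sections, max_chars=200):
    """Pack sentences into chunks without ever crossing a section boundary.

    sections: list of (heading_path, text) pairs (assumed input format).
    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    chunks = []
    for heading_path, text in sections:
        # Naive sentence split: break after ., ! or ? followed by whitespace.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
        current = ""
        for sentence in sentences:
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_chars:
                chunks.append({"heading_path": heading_path, "text": current})
                current = sentence
            else:
                current = candidate
        # Flush the remainder before moving on, so sections never mix.
        if current:
            chunks.append({"heading_path": heading_path, "text": current})
    return chunks


example_sections = [
    (["Sepsis"], "A. B. C."),
    (["Shock"], "D."),
]
small_chunks = chunk_sections(example_sections, max_chars=5)
```

Because the buffer is flushed at the end of every section, the last chunk of a section may be short. That is the price of keeping every chunk inside a single heading path.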

Project Structure

clinical-rag/
├── config/
│   └── settings.py              # All config, env vars, collection naming
├── embeddings/
│   └── providers.py             # EmbeddingProvider enum, build/detect/stamp
├── ingestion/
│   ├── parser.py                # PDF → Documents (fitz_hybrid / pymupdf4llm / llamaparse)
│   ├── chunker.py               # Semantic chunking: heading-aware + sentence splitter
│   └── indexer.py               # Embed + store; deduplication; auto-detect on load
├── retrieval/
│   ├── retriever.py             # ChromaDB vector retrieval
│   └── reranker.py              # Cohere reranker wrapper
├── pipeline/
│   └── rag_pipeline.py          # End-to-end orchestration
├── evaluation/
│   └── eval_pipeline.py         # Faithfulness / Relevancy / Correctness evals
├── ui/
│   └── app.py                   # Streamlit UI with provider selector + chat interface
├── notebooks/
│   └── demo.ipynb               # Interactive walkthrough
├── data/
│   ├── pdfs/                    # Drop your medical textbook PDFs here
│   └── chroma_db/               # Auto-created, one sub-collection per provider
├── ingest.py                    # CLI: parse + embed + store
├── query.py                     # CLI: query the pipeline
├── requirements.txt
├── Makefile
└── .env.example

Setup

1. Install

pip install -r requirements.txt

2. Configure

cp .env.example .env
# Fill in your API keys

3. Drop PDFs

data/pdfs/harrisons_principles.pdf
data/pdfs/uptodate_export.pdf

Any medical textbook PDF works. fitz_hybrid handles multi-column layouts and complex heading hierarchies automatically.
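
To give a feel for how heading levels can be inferred from font size and weight, here is a simplified sketch. It works on plain span dictionaries with "size", "flags", and "text" keys, the shape PyMuPDF produces for spans in page.get_text("dict"), and it treats flag bit 4 as bold. The heuristic itself (most common size is body text, larger sizes rank as headings, bold body-size spans form the deepest level) is an assumption for illustration and may differ from what fitz_hybrid does.

```python
from collections import Counter

BOLD_FLAG = 1 << 4  # bold bit in PyMuPDF span flags


def assign_heading_levels(spans):
    """Label each span with a heading level (0 = body text).

    Assumed heuristic: the size covering the most characters is body text,
    each larger size is a heading (largest = level 1), and bold spans at
    body size form one extra, deepest heading level.
    """
    weights = Counter()
    for span in spans:
        weights[round(span["size"], 1)] += len(span["text"])
    body_size = weights.most_common(1)[0][0]

    heading_sizes = sorted((s for s in weights if s > body_size), reverse=True)
    level_of = {size: rank + 1 for rank, size in enumerate(heading_sizes)}
    bold_level = len(heading_sizes) + 1

    labelled = []
    for span in spans:
        size = round(span["size"], 1)
        if size in level_of:
            level = level_of[size]
        elif size == body_size and span["flags"] & BOLD_FLAG:
            level = bold_level
        else:
            level = 0
        labelled.append((level, span["text"]))
    return labelled


example_spans = [
    {"size": 18.0, "flags": 16, "text": "Chapter Title"},
    {"size": 14.0, "flags": 0, "text": "Section Title"},
    {"size": 10.0, "flags": 0, "text": "A long run of ordinary body text in the section."},
    {"size": 10.0, "flags": 16, "text": "Bold Subheading"},
    {"size": 10.0, "flags": 0, "text": "More ordinary body text under the subheading."},
]
labelled_spans = assign_heading_levels(example_spans)
```

A real parser would also need to separate bold run-in words from true subheadings, for example by requiring the bold span to occupy a whole line. This sketch does not attempt that.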

4. Ingest

# Recommended — BGE-M3, no rate limits
python ingest.py --provider bge_m3

# Best retrieval quality
python ingest.py --provider voyage

# You can index all three — each gets its own collection
python ingest.py --provider gemini

# Add a single PDF (already-indexed files are skipped automatically)
python ingest.py --pdf data/pdfs/new_textbook.pdf --provider bge_m3

# Force re-ingest from scratch
python ingest.py --provider bge_m3 --reset

# Choose parser explicitly
python ingest.py --parser fitz_hybrid    # default: local, full heading hierarchy
python ingest.py --parser llamaparse     # cloud, best for scanned layouts
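
The skip-if-already-indexed behaviour can be pictured with a short sketch. It assumes files are recognised by name, which is a guess. The project's indexer.py may key on something else, such as a content hash, and select_new_pdfs is a hypothetical helper.

```python
from pathlib import Path


def select_new_pdfs(candidates, indexed_names):
    """Return only the PDFs whose file name has not been indexed yet.

    candidates: paths found on disk or passed via --pdf.
    indexed_names: file names already recorded in the collection
    (assumed to be stored as chunk metadata at ingest time).
    """
    seen = set(indexed_names)
    fresh = []
    for path in map(Path, candidates):
        if path.suffix.lower() != ".pdf":
            continue  # ignore non-PDF files
        if path.name in seen:
            continue  # already indexed, or listed twice in this run
        seen.add(path.name)
        fresh.append(path)
    return fresh


new_files = select_new_pdfs(
    ["data/pdfs/a.pdf", "data/pdfs/b.pdf", "notes.txt", "data/pdfs/b.pdf"],
    indexed_names={"a.pdf"},
)
```

Under this scheme a --reset run would simply start with an empty indexed_names set.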

5. Query

# CLI
python query.py --provider bge_m3 "What is the first-line treatment for septic shock?"
python query.py --provider voyage --interactive

# Web UI (provider + LLM selector included)
streamlit run ui/app.py
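
Answers come back with inline citations giving book title and page number. A minimal sketch of how such citations could be assembled from retrieved passages is shown below. The metadata keys "title" and "page" and the function format_citations are assumptions for illustration, not the pipeline's actual schema.

```python
def format_citations(passages):
    """Number each distinct (title, page) source and build a reference list.

    passages: list of dicts with "title" and "page" keys (assumed schema).
    Returns (markers, lines): one citation number per passage, plus the
    de-duplicated reference lines in first-seen order.
    """
    numbers = {}  # (title, page) -> citation number
    markers = []
    for passage in passages:
        key = (passage["title"], passage["page"])
        if key not in numbers:
            numbers[key] = len(numbers) + 1
        markers.append(numbers[key])
    lines = [f"[{n}] {title}, p. {page}" for (title, page), n in numbers.items()]
    return markers, lines


example_markers, example_lines = format_citations([
    {"title": "Book A", "page": 12},
    {"title": "Book B", "page": 3},
    {"title": "Book A", "page": 12},
])
```

Two passages from the same page share one citation number, which keeps the reference list short when the reranker returns neighbouring chunks.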

6. Evaluate

python evaluation/eval_pipeline.py

API Keys

| Service | Key | Required for |
|---|---|---|
| Google AI | `GEMINI_API_KEY` | Gemini 2.5 Flash LLM (always) + Gemini embeddings |
| Cohere | `COHERE_API_KEY` | Reranker (always) |
| Cloudflare | `CF_API_TOKEN` + `CF_ACCOUNT_ID` | BGE-M3 embeddings + DeepSeek LLM |
| Voyage AI | `VOYAGE_API_KEY` | Voyage embeddings only |
| LlamaCloud | `LLAMA_PARSE_API_KEY` | Only when `PARSER=llamaparse` |
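
The table can be turned into a quick pre-flight check. The sketch below encodes the rows as written, treating the Gemini and Cohere keys as always required. The helper names and the "gemini" / "deepseek" LLM identifiers are hypothetical and not taken from settings.py.

```python
import os


def required_keys(provider, llm="gemini", parser="fitz_hybrid"):
    """List the environment variables a given configuration needs."""
    keys = ["GEMINI_API_KEY", "COHERE_API_KEY"]  # marked "(always)" in the table
    if provider == "bge_m3" or llm == "deepseek":
        keys += ["CF_API_TOKEN", "CF_ACCOUNT_ID"]
    if provider == "voyage":
        keys.append("VOYAGE_API_KEY")
    if parser == "llamaparse":
        keys.append("LLAMA_PARSE_API_KEY")
    return keys


def missing_keys(env, provider, llm="gemini", parser="fitz_hybrid"):
    """Return the required variables that are unset or empty in env."""
    return [k for k in required_keys(provider, llm, parser) if not env.get(k)]


if __name__ == "__main__":
    # Check the real environment for a BGE-M3 run.
    absent = missing_keys(os.environ, provider="bge_m3")
    print("Missing keys:", ", ".join(absent) if absent else "none")
```

Running a check like this before ingest fails fast on a missing key instead of partway through embedding a large textbook.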

Embedding Provider Comparison

| Provider | Dimensions | Notes |
|---|---|---|
| `bge_m3` | 1024 | Multilingual, strong on medical text, no rate limits (Cloudflare) |
| `voyage` | 2048 | Best retrieval quality, 32k context window |
| `gemini` | 768 | May hit 429 rate limits; exponential backoff built-in |
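
Because each provider produces vectors of a different length, vectors from one model cannot be stored in or queried against another model's collection. The sketch below pairs each provider with the collection name and dimension listed in this README and rejects mismatched vectors. The PROVIDERS registry and both helpers are illustrative, not the project's API.

```python
PROVIDERS = {
    "bge_m3": {"collection": "clinical_rag_bge_m3", "dim": 1024},
    "voyage": {"collection": "clinical_rag_voyage", "dim": 2048},
    "gemini": {"collection": "clinical_rag_gemini", "dim": 768},
}


def collection_for(provider):
    """Map a provider name to its provider-scoped collection name."""
    try:
        return PROVIDERS[provider]["collection"]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None


def check_vector(provider, vector):
    """Raise if the vector length does not match the provider's dimension."""
    expected = PROVIDERS[provider]["dim"]
    if len(vector) != expected:
        raise ValueError(
            f"{provider} expects {expected}-dim vectors, got {len(vector)}"
        )
    return vector


ok_vector = check_vector("gemini", [0.0] * 768)
```

A guard of this kind only catches length mismatches. Two different models with the same dimension would pass it, which is why a stored provider stamp is the more reliable check.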

Auto-Detection

At ingest time, each ChromaDB collection is stamped with the provider name. At query time, load_index() reads this stamp and instantiates the matching embed model, so query and document vectors always come from the same model and you never have to track it manually.

# This just works — no need to remember which model you used:
pipeline = ClinicalRAGPipeline.from_existing_index(provider="voyage")
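
The stamp-and-detect mechanism can be sketched without any external service. ChromaDB collections accept a metadata dictionary at creation time. The in-memory stand-in below mimics only that one attribute, and the metadata key "embedding_provider" and the helper names are assumptions, not the project's actual identifiers.

```python
EMBED_MODELS = {
    "bge_m3": "BAAI/bge-m3",
    "voyage": "voyage-4-large",
    "gemini": "models/gemini-embedding-001",
}


class InMemoryCollection:
    """Stand-in for a vector-store collection; holds a name and metadata only."""

    def __init__(self, name, metadata=None):
        self.name = name
        self.metadata = dict(metadata or {})


def stamp_collection(name, provider):
    """Create a collection whose metadata records the embedding provider."""
    if provider not in EMBED_MODELS:
        raise ValueError(f"unknown provider: {provider!r}")
    return InMemoryCollection(name, metadata={"embedding_provider": provider})


def detect_embed_model(collection):
    """Read the stamp back and return (provider, model name)."""
    provider = collection.metadata.get("embedding_provider")
    if provider is None:
        raise ValueError(f"collection {collection.name!r} has no provider stamp")
    return provider, EMBED_MODELS[provider]


stamped = stamp_collection("clinical_rag_voyage", "voyage")
detected = detect_embed_model(stamped)
```

Collections created without a stamp raise an error here rather than falling back to a default model, since a silent fallback is exactly the mismatch the stamp is meant to prevent.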
