A RAG system that enables physicians to retrieve instant, source-cited answers from their own collection of medical textbooks.
- Hierarchical PDF parsing: `fitz_hybrid` detects heading levels from font size and weight and builds a full breadcrumb path (`heading_path`) for every chunk, so each retrieved passage records where in the book it came from
- Three parser backends: `fitz_hybrid` (local, multi-column aware, full hierarchy), `pymupdf4llm` (local, flat markdown), `llamaparse` (cloud, best for scanned/complex layouts)
- Semantic chunking: text is split first along detected headings, then a sentence splitter runs within each section; chunks never straddle section breaks
- Multi-provider embeddings: BGE-M3 (Cloudflare, no rate limits), Voyage AI (highest quality), Gemini; each gets its own ChromaDB collection so you can maintain and compare all three simultaneously
- Provider auto-detection: collections are stamped at ingest time; `load_index()` reads the stamp and wires the correct embed model automatically, so query and document embeddings cannot be mismatched
- Cohere reranker: cross-encoder reranking narrows the retrieval pool to the highest-precision passages before the LLM sees them
- Incremental ingestion: re-running ingest or adding a single PDF with `--pdf` is safe; already-indexed files are skipped automatically
- Two LLM backends: Gemini 2.5 Flash or DeepSeek-R1-Distill-Qwen-32B (Cloudflare, uses the same credentials as BGE-M3)
- Inline citations: every answer is grounded in source passages, with book title and page number displayed
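
To make the breadcrumb idea concrete, here is a minimal sketch of how a `heading_path` could be derived from font size and weight. It is not the project's `fitz_hybrid` code: the size-ratio thresholds, the `(size, bold, text)` span format, and the function names are assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

Span = Tuple[float, bool, str]  # (font size, is bold, text): hypothetical span format


def heading_level(size: float, bold: bool, body_size: float) -> Optional[int]:
    """Map a span to a heading level, or None for body text (thresholds are illustrative)."""
    ratio = size / body_size
    if ratio >= 1.6:
        return 1
    if ratio >= 1.3:
        return 2
    if ratio >= 1.1 or (bold and ratio >= 1.0):
        return 3
    return None


def heading_paths(spans: List[Span], body_size: float = 10.0) -> List[Tuple[str, List[str]]]:
    """Pair every body-text span with the breadcrumb of headings above it."""
    stack: List[Tuple[int, str]] = []  # open headings as (level, title)
    out = []
    for size, bold, text in spans:
        level = heading_level(size, bold, body_size)
        if level is None:
            out.append((text, [title for _, title in stack]))
            continue
        # A new heading closes every open heading at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, text))
    return out
```

The stack is what keeps the path correct when a new section starts: a level-2 heading replaces the previous level-2 heading and anything nested under it, while the enclosing chapter stays on the path.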
PDFs (medical textbooks)
│
▼
Parser (choose one)
├── fitz_hybrid ← local · multi-column · full heading hierarchy (DEFAULT)
├── pymupdf4llm ← local · flat markdown only
└── llamaparse ← cloud · best for scanned/complex layouts
│
▼
Semantic Chunker ← heading-aware section split → sentence split within sections
│
▼
Embedding Provider ← choose one (or index all three):
├── BGE-M3 BAAI/bge-m3 via Cloudflare Workers AI (1024-dim, recommended)
├── Voyage AI voyage-4-large (2048-dim, best quality)
└── Gemini models/gemini-embedding-001 (768-dim, rate-limit prone)
│
▼
ChromaDB (local) ← provider-scoped collections (one per embedding model)
│ e.g. clinical_rag_bge_m3 / clinical_rag_voyage / clinical_rag_gemini
▼
Query
→ Auto-detect embedding (reads collection stamp → no mismatch possible)
→ Cohere reranker (cross-encoder, top-N precision)
→ LLM answer (Gemini 2.5 Flash or DeepSeek-R1)
→ Inline citations (source title + page number)
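
The "section split, then sentence split" step in the diagram can be sketched as follows. This is an illustration only: the regex sentence splitter stands in for whatever sentence splitter the project uses, and `chunk_sections`, `max_chars`, and the chunk dict layout are assumed names.

```python
import re
from typing import Dict, List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sections(sections: List[Tuple[List[str], str]], max_chars: int = 200) -> List[Dict]:
    """Pack sentences into chunks of at most max_chars, one section at a time.

    Each section is (heading_path, text). The packing buffer is reset for
    every section, so a chunk can never span two sections.
    """
    chunks = []
    for heading_path, text in sections:
        current = ""
        for sentence in split_sentences(text):
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_chars:
                chunks.append({"heading_path": heading_path, "text": current})
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append({"heading_path": heading_path, "text": current})
    return chunks
```

A single sentence longer than `max_chars` is kept whole in this sketch rather than cut mid-sentence.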
clinical-rag/
├── config/
│ └── settings.py # All config, env vars, collection naming
├── embeddings/
│ └── providers.py # EmbeddingProvider enum, build/detect/stamp
├── ingestion/
│ ├── parser.py # PDF → Documents (fitz_hybrid / pymupdf4llm / llamaparse)
│ ├── chunker.py # Semantic chunking: heading-aware + sentence splitter
│ └── indexer.py # Embed + store; deduplication; auto-detect on load
├── retrieval/
│ ├── retriever.py # ChromaDB vector retrieval
│ └── reranker.py # Cohere reranker wrapper
├── pipeline/
│ └── rag_pipeline.py # End-to-end orchestration
├── evaluation/
│ └── eval_pipeline.py # Faithfulness / Relevancy / Correctness evals
├── ui/
│ └── app.py # Streamlit UI with provider selector + chat interface
├── notebooks/
│ └── demo.ipynb # Interactive walkthrough
├── data/
│ ├── pdfs/ # Drop your medical textbook PDFs here
│ └── chroma_db/ # Auto-created, one sub-collection per provider
├── ingest.py # CLI: parse + embed + store
├── query.py # CLI: query the pipeline
├── requirements.txt
├── Makefile
└── .env.example
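
The deduplication noted for `indexer.py` could, in its simplest form, compare file names against those already stored in the collection. The sketch below works under that assumption; how the project actually identifies an already-indexed file (name, hash, or something else) is not stated here, and `files_to_ingest` is a hypothetical helper.

```python
from pathlib import Path
from typing import Iterable, List


def files_to_ingest(pdf_paths: Iterable[str], indexed_names: Iterable[str]) -> List[str]:
    """Return the PDFs whose file names are not yet in the index."""
    seen = set(indexed_names)
    pending = []
    for path in pdf_paths:
        name = Path(path).name
        if name in seen:
            continue        # already indexed: skip
        seen.add(name)      # also guards against duplicates within this run
        pending.append(path)
    return pending
```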
pip install -r requirements.txt

cp .env.example .env
# Fill in your API keys

data/pdfs/harrisons_principles.pdf
data/pdfs/uptodate_export.pdf
Any medical textbook PDF works. fitz_hybrid handles multi-column layouts and complex heading hierarchies automatically.
# Recommended — BGE-M3, no rate limits
python ingest.py --provider bge_m3
# Best retrieval quality
python ingest.py --provider voyage
# You can index all three — each gets its own collection
python ingest.py --provider gemini
# Add a single PDF (already-indexed files are skipped automatically)
python ingest.py --pdf data/pdfs/new_textbook.pdf --provider bge_m3
# Force re-ingest from scratch
python ingest.py --provider bge_m3 --reset
# Choose parser explicitly
python ingest.py --parser fitz_hybrid # default: local, full heading hierarchy
python ingest.py --parser llamaparse # cloud, best for scanned layouts

# CLI
python query.py --provider bge_m3 "What is the first-line treatment for septic shock?"
python query.py --provider voyage --interactive
# Web UI (provider + LLM selector included)
streamlit run ui/app.py

python evaluation/eval_pipeline.py

| Service | Key | Required for |
|---|---|---|
| Google AI | `GEMINI_API_KEY` | Gemini 2.5 Flash LLM (always) + Gemini embeddings |
| Cohere | `COHERE_API_KEY` | Reranker (always) |
| Cloudflare | `CF_API_TOKEN` + `CF_ACCOUNT_ID` | BGE-M3 embeddings + DeepSeek LLM |
| Voyage AI | `VOYAGE_API_KEY` | Voyage embeddings only |
| LlamaCloud | `LLAMA_PARSE_API_KEY` | Only when `PARSER=llamaparse` |
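
The key requirements above can be checked before a run. The following is a sketch of such a check, not code from the repository: the function names and the `"deepseek"` LLM identifier are assumptions, while the provider and parser names and the environment variable names come from the table.

```python
import os
from typing import Mapping, Optional, Set


def required_env_keys(provider: str, llm: str = "gemini", parser: str = "fitz_hybrid") -> Set[str]:
    """Return the environment variables needed for a given configuration."""
    # The table marks the Gemini and Cohere keys as always required.
    keys = {"GEMINI_API_KEY", "COHERE_API_KEY"}
    if provider == "bge_m3" or llm == "deepseek":  # "deepseek" is an assumed identifier
        keys |= {"CF_API_TOKEN", "CF_ACCOUNT_ID"}
    if provider == "voyage":
        keys.add("VOYAGE_API_KEY")
    if parser == "llamaparse":
        keys.add("LLAMA_PARSE_API_KEY")
    return keys


def missing_env_keys(required: Set[str], env: Optional[Mapping[str, str]] = None) -> list:
    """List required keys that are unset or empty, in sorted order."""
    env = os.environ if env is None else env
    return sorted(k for k in required if not env.get(k))
```

Calling `missing_env_keys(required_env_keys("voyage"))` at startup would surface an incomplete `.env` before any API request is made.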
| Provider | Dimensions | Notes |
|---|---|---|
| `bge_m3` | 1024 | Multilingual, strong on medical text, no rate limits (Cloudflare) |
| `voyage` | 2048 | Best retrieval quality, 32k context window |
| `gemini` | 768 | May hit 429 rate limits; exponential backoff built-in |
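
For readers unfamiliar with the exponential backoff mentioned for the Gemini provider, here is a generic sketch of the pattern. It does not reproduce the project's retry logic: `RateLimitError`, `with_backoff`, and the default delays are placeholders, and a real client would raise its own exception type on HTTP 429.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Placeholder for the error a client raises on HTTP 429."""


def with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on RateLimitError with a doubling delay plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error
            delay = base_delay * 2 ** attempt
            # Up to 10% jitter so parallel workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")
```

The `sleep` parameter exists so the wait can be swapped out in tests.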
At ingest time, each ChromaDB collection is stamped with the provider name.
At query time, load_index() reads this stamp and automatically instantiates
the matching embed model — query and document vectors are guaranteed to match
without any manual tracking on your part.
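# Import path assumed from the project layout (pipeline/rag_pipeline.py)
from pipeline.rag_pipeline import ClinicalRAGPipeline

# This just works: the embed model is read from the collection stamp,
# so there is no need to remember which model you used.
pipeline = ClinicalRAGPipeline.from_existing_index(provider="voyage")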
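
The stamp-and-detect mechanism can be illustrated without ChromaDB by treating collection metadata as a plain dict. This is a sketch under assumptions: the `embedding_provider` metadata key and the function names are hypothetical, and only the `clinical_rag_<provider>` naming follows the collection names shown in the architecture diagram.

```python
from typing import Dict, Optional

COLLECTION_PREFIX = "clinical_rag"


def collection_name(provider: str) -> str:
    """Provider-scoped collection name, e.g. clinical_rag_bge_m3."""
    return f"{COLLECTION_PREFIX}_{provider}"


def stamp(metadata: Dict[str, str], provider: str) -> Dict[str, str]:
    """Return a copy of the metadata with the provider recorded (ingest time)."""
    return {**metadata, "embedding_provider": provider}  # key name is an assumption


def detect_provider(metadata: Dict[str, str], requested: Optional[str] = None) -> str:
    """Read the stamp at query time and refuse to proceed on a mismatch."""
    stamped = metadata.get("embedding_provider")
    if stamped is None:
        raise ValueError("collection has no provider stamp")
    if requested is not None and requested != stamped:
        raise ValueError(f"collection was built with {stamped!r}, not {requested!r}")
    return stamped
```

Failing loudly on a missing or conflicting stamp is the design choice worth noting: vectors from different embedding models are not comparable, so a silent fallback would return meaningless rankings.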