A fully-local, zero-dependency graph-RAG engine for incident investigation over runbooks / troubleshooting guides (TSGs).
Point it at a folder of markdown runbooks. It parses them, extracts the entities that connect them — services, Kusto/KQL tables, symptoms, mitigations — and builds a small knowledge graph. Given an incident description, it returns the most relevant runbooks plus the graph-linked siblings that flat keyword search would miss, along with the exact Kusto tables to check and the documented mitigations.
The "reasoning" layer is your own coding assistant (Claude Code, Cursor, etc.). This engine just does fast, explainable, local retrieval and graph traversal. It never calls an LLM and never touches the network.
Built as an open, adaptable blueprint. Drop your real runbooks into `data/runbooks/` and hand the repo to your assistant; `CLAUDE.md` tells it how to wire this into your environment.
Plain vector/keyword search finds runbooks whose text matches an incident. But incidents cascade: an auth-throttling event surfaces as payment 504s. If two runbooks query the same Kusto table or touch the same service, they're related even when their words differ. Modeling runbooks + entities as a graph captures exactly those links — the approach Microsoft itself has written up for GraphRAG-powered incident & change management and TSG recommendation.
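To make the shared-entity idea concrete, here is a minimal sketch, not the engine's actual code. The mini-corpus, the `kql_source_table` heuristic (take the identifier before the first pipe), and the function names are all assumptions made for illustration; only the `table:` and `service:` node prefixes come from the CLI examples in this README.

```python
import re
from collections import defaultdict
from itertools import combinations
from typing import Optional

# Hypothetical mini-corpus: runbook id -> (service, one KQL query).
RUNBOOKS = {
    "payment-timeouts": (
        "payments-api",
        'Requests | where Service == "payments-api" | summarize count() by ResultCode',
    ),
    "auth-throttling": (
        "auth-api",
        "Requests | where ResultCode == 429 | summarize count() by bin(Timestamp, 5m)",
    ),
    "disk-pressure": ("storage-api", "NodeMetrics | where DiskUsedPct > 90"),
}


def kql_source_table(query: str) -> Optional[str]:
    """Heuristic (assumption): a bare identifier before the first pipe is the table."""
    head = query.split("|", 1)[0].strip()
    return head if re.fullmatch(r"[A-Za-z_]\w*", head) else None


def build_entity_index(runbooks):
    """Map each entity node to the set of runbooks that mention it."""
    by_entity = defaultdict(set)
    for doc_id, (service, query) in runbooks.items():
        by_entity[f"service:{service}"].add(doc_id)
        table = kql_source_table(query)
        if table:
            by_entity[f"table:{table}"].add(doc_id)
    return by_entity


def siblings(by_entity):
    """Pairs of runbooks joined by at least one shared entity, with the reasons."""
    pairs = defaultdict(set)
    for entity, docs in by_entity.items():
        for a, b in combinations(sorted(docs), 2):
            pairs[(a, b)].add(entity)
    return dict(pairs)


links = siblings(build_entity_index(RUNBOOKS))
print(links)
```

In this toy corpus the payment and auth runbooks share no service, yet they end up linked because both query the `Requests` table, which is the kind of relation a purely lexical match on the incident text would not surface.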
- Standard library only for the core. No `pip install` needed to run, which matters in locked-down corporate environments.
- No external systems. No vector DB, no graph DB, no cloud, no LLM API. Everything is in-process over local files.
- No data egress. Your runbooks never leave the machine. Safe for internal/confidential TSGs.
- Hackable. The whole engine is ~600 lines across 6 small modules.
git clone https://github.com/rahulcommercial/runbook-graphrag
cd runbook-graphrag
# 1. find runbooks for an incident
python3 -m rag.cli query "checkout 504s and rising p99 latency, throttling upstream"
# 2. inspect an entity and everything connected to it
python3 -m rag.cli explain table:Requests
# 3. shortest path between two nodes (how is this runbook related to that service?)
python3 -m rag.cli path doc:payment-timeouts service:auth-api
# 4. recurring hotspots across all runbooks
python3 -m rag.cli godnodes
# 5. corpus + graph stats
python3 -m rag.cli stats

Feed a whole incident file:

python3 -m rag.cli query "$(cat data/incidents/sample-incident.md)" --json

runbooks (*.md)
│ rag/ingest.py parse frontmatter, sections, mitigations
│ rag/kql.py extract Kusto tables / clusters / functions
▼
documents ── rag/graph.py ─► knowledge graph (doc ↔ table ↔ service ↔ symptom)
│
▼
rag/retrieve.py BM25 lexical match + graph expansion (shared-entity siblings)
│
▼
ranked runbooks + "linked by" reasons + tables to check + mitigations
│
▼
your coding assistant reasons over the result (CLI or optional MCP server)
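The retrieval stage of the pipeline above can be sketched as follows. This is an illustration under assumptions, not the contents of `rag/retrieve.py`: it uses the textbook Okapi BM25 formula with common default parameters (`k1=1.5`, `b=0.75`), a naive tokenizer, and a hypothetical `retrieve` function and toy corpus.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens; a deliberately naive tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    doc_freq = Counter(term for t in tokens.values() for term in set(t))
    query_terms = set(tokenize(query))
    scores = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


def retrieve(query, docs, entities, top_k=1):
    """Top lexical hits first, then runbooks sharing an entity with a hit."""
    scores = bm25_scores(query, docs)
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits = [d for d in ranked if scores[d] > 0][:top_k]
    results = [(d, "lexical match") for d in hits]
    seen = set(hits)
    for hit in hits:
        for other, ents in entities.items():
            shared = entities.get(hit, set()) & ents
            if other not in seen and shared:
                seen.add(other)
                results.append((other, "linked by " + ", ".join(sorted(shared))))
    return results


# Hypothetical corpus and entity sets, for illustration only.
DOCS = {
    "payment-timeouts": "checkout returns 504 and p99 latency rises on the payment gateway",
    "auth-throttling": "token issuance is throttled; clients receive 429 responses",
    "disk-pressure": "node disk usage above 90 percent causes pod evictions",
}
ENTITIES = {
    "payment-timeouts": {"table:Requests", "service:payments-api"},
    "auth-throttling": {"table:Requests", "service:auth-api"},
    "disk-pressure": {"table:NodeMetrics"},
}

results = retrieve("checkout 504s and rising p99 latency", DOCS, ENTITIES)
for doc_id, reason in results:
    print(doc_id, "<-", reason)
```

Here the auth runbook shares no words with the query, so its BM25 score is zero, but it is still returned because it shares `table:Requests` with the top lexical hit. The "linked by" string mirrors the explainable reasons mentioned in the diagram.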
| Module | Responsibility | Deps |
|---|---|---|
| `rag/ingest.py` | runbook markdown → Document | stdlib |
| `rag/kql.py` | Kusto/KQL entity extraction | stdlib |
| `rag/graph.py` | typed knowledge graph, BFS path, god-nodes | stdlib |
| `rag/retrieve.py` | BM25 + graph expansion | stdlib |
| `rag/cli.py` | command-line interface | stdlib |
| `rag/store.py` | optional SQLite persistence | stdlib |
| `rag/mcp_server.py` | optional MCP server for assistants | `mcp` (optional) |
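As a rough illustration of the two graph operations listed for `rag/graph.py`, the sketch below implements a BFS shortest path and a degree-based hotspot ranking over an undirected graph. The `Graph` class and its method names are hypothetical, and treating "god-nodes" as the entity nodes with the most connections is an assumption about what the engine computes.

```python
from collections import defaultdict, deque


class Graph:
    """Tiny undirected graph keyed by 'type:name' node ids (hypothetical API)."""

    def __init__(self):
        self.adj = defaultdict(set)

    def add_edge(self, a, b):
        self.adj[a].add(b)
        self.adj[b].add(a)

    def shortest_path(self, start, goal):
        """Breadth-first search; returns the node list, or None if unreachable."""
        if start not in self.adj or goal not in self.adj:
            return None
        parent = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                path = []
                while node is not None:
                    path.append(node)
                    node = parent[node]
                return path[::-1]
            for nxt in sorted(self.adj[node]):  # sorted for deterministic output
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        return None

    def god_nodes(self, top=3):
        """Assumption: hotspots are the non-document nodes with the highest degree."""
        entities = [n for n in self.adj if not n.startswith("doc:")]
        return sorted(entities, key=lambda n: (-len(self.adj[n]), n))[:top]


g = Graph()
g.add_edge("doc:payment-timeouts", "service:payments-api")
g.add_edge("doc:payment-timeouts", "table:Requests")
g.add_edge("doc:auth-throttling", "service:auth-api")
g.add_edge("doc:auth-throttling", "table:Requests")

path = g.shortest_path("doc:payment-timeouts", "service:auth-api")
print(" -> ".join(path))
print(g.god_nodes(top=1))
```

On this toy graph the path runs through the shared `table:Requests` node, which is also the top hotspot because two runbooks reference it.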
Option A (shell, zero install): let your assistant run `python3 -m rag.cli query "..." --json` and read the result. Works anywhere.
Option B (MCP server, optional): `pip install "mcp[cli]"`, then register `python -m rag.mcp_server` as an MCP stdio server. It exposes `find_runbooks` and `explain` as tools. MCP here is local stdio only, so there is still no network access.
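For Option A, a script or agent harness can wrap the CLI call in a few lines. The sketch below is an assumption-laden helper, not part of the repo: the function name `find_runbooks` is borrowed from the MCP tool name, the `run` parameter exists only so the call can be stubbed, and the shape of the JSON output is not documented here, so the parsed value is returned as-is.

```python
import json
import subprocess
import sys


def find_runbooks(incident, run=subprocess.run, python=sys.executable):
    """Run `python -m rag.cli query <incident> --json` and parse its stdout.

    Assumes the current working directory is the repository root so that
    the `rag` package is importable.
    """
    cmd = [python, "-m", "rag.cli", "query", incident, "--json"]
    proc = run(cmd, capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)


# Example (run from the repo root):
#   result = find_runbooks("checkout 504s and rising p99 latency")
#   print(json.dumps(result, indent=2))
```

Passing the command as a list avoids shell quoting problems when the incident text contains quotes or newlines, and `check=True` turns a non-zero exit code into an exception instead of a silent empty result.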
Forgiving markdown. Frontmatter is optional; inline `**Field:**` lines work too.
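One way such forgiving parsing could work is sketched below. It is a simplified stand-in for whatever `rag/ingest.py` actually does: the `parse_metadata` function, the rule that frontmatter wins over inline fields, and the handling of `[a, b]` lists are all assumptions, and values are kept as strings rather than typed.

```python
import re

# Matches inline metadata such as "**Service:** payments-api".
FIELD_RE = re.compile(r"^\*\*(?P<key>[^:*]+):\*\*\s*(?P<value>.+)$")


def parse_value(raw):
    """Turn '[a, b]' into a list of strings; leave everything else as a string."""
    raw = raw.strip()
    if raw.startswith("[") and raw.endswith("]"):
        return [item.strip() for item in raw[1:-1].split(",") if item.strip()]
    return raw


def parse_metadata(text):
    """Return (metadata, body) using frontmatter if present, plus inline fields."""
    lines = text.splitlines()
    meta = {}
    body_start = 0
    if lines and lines[0].strip() == "---":
        for i, line in enumerate(lines[1:], start=1):
            if line.strip() == "---":
                body_start = i + 1
                break
            key, sep, value = line.partition(":")
            if sep:
                meta[key.strip().lower()] = parse_value(value)
        else:
            # No closing '---': treat the file as having no frontmatter.
            meta = {}
    body = lines[body_start:]
    for line in body:
        match = FIELD_RE.match(line.strip())
        if match:
            # Assumption: frontmatter values take precedence over inline ones.
            meta.setdefault(match["key"].strip().lower(), parse_value(match["value"]))
    return meta, "\n".join(body)


with_frontmatter = (
    "---\n"
    "title: Payment gateway timeouts\n"
    "service: payments-api\n"
    "symptoms: [checkout 504, p99 latency above 2s]\n"
    "---\n"
    "## Symptoms\n"
    "- Checkout returns 504\n"
)
inline_only = "# Auth throttling\n**Service:** auth-api\n**Severity:** 3\n"

meta_a, body_a = parse_metadata(with_frontmatter)
meta_b, body_b = parse_metadata(inline_only)
print(meta_a)
print(meta_b)
```

Both inputs yield a `service` key, so downstream graph construction would not need to care which style a given runbook uses. A complete runbook in the frontmatter style looks like this: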
---
title: Payment gateway timeouts
service: payments-api
severity: 2
symptoms: [checkout 504, p99 latency above 2s]
---
## Symptoms
- Checkout returns 504
## Diagnosis
```kql
Requests | where Service == "payments-api" | summarize count() by ResultCode
```
## Mitigation
- Failover to the secondary region

python3 tests/test_engine.py # or: python3 -m pytest

- Microsoft GraphRAG: the original graph-RAG approach
- LightRAG: lightweight local-first graph RAG
- nano-graphrag: hackable ~1.1k-line implementation
- RAG-based incident resolution for IT support (arXiv 2409.13707)
This project trades those frameworks' LLM-driven entity extraction for deterministic, dependency-free extraction tuned to runbooks + KQL — so it runs in environments where you can't install packages or send data out.
MIT — see LICENSE.