runbook-graphrag

A fully-local, zero-dependency graph-RAG engine for incident investigation over runbooks / troubleshooting guides (TSGs).

Point it at a folder of markdown runbooks. It parses them, extracts the entities that connect them — services, Kusto/KQL tables, symptoms, mitigations — and builds a small knowledge graph. Given an incident description, it returns the most relevant runbooks plus the graph-linked siblings that flat keyword search would miss, along with the exact Kusto tables to check and the documented mitigations.

The "reasoning" layer is your own coding assistant (Claude Code, Cursor, etc.). This engine just does fast, explainable, local retrieval and graph traversal. It never calls an LLM and never touches the network.

Built as an open, adaptable blueprint. Drop your real runbooks into data/runbooks/, and hand the repo to your assistant — CLAUDE.md tells it how to wire this into your environment.

Why graph-RAG for incidents?

Plain vector/keyword search finds runbooks whose text matches an incident. But incidents cascade: an auth-throttling event surfaces as payment 504s. If two runbooks query the same Kusto table or touch the same service, they're related even when their words differ. Modeling runbooks + entities as a graph captures exactly those links — the approach Microsoft itself has written up for GraphRAG-powered incident & change management and TSG recommendation.

Design constraints (deliberate)

Standard library only for the core. No pip install needed to run — important for locked-down corporate environments.
No external systems. No vector DB, no graph DB, no cloud, no LLM API. Everything is in-process over local files.
No data egress. Your runbooks never leave the machine. Safe for internal/confidential TSGs.
Hackable. The whole engine is ~600 lines across 6 small modules.

Quickstart

git clone https://github.com/rahulcommercial/runbook-graphrag
cd runbook-graphrag

# 1. find runbooks for an incident
python3 -m rag.cli query "checkout 504s and rising p99 latency, throttling upstream"

# 2. inspect an entity and everything connected to it
python3 -m rag.cli explain table:Requests

# 3. shortest path between two nodes (how is this runbook related to that service?)
python3 -m rag.cli path doc:payment-timeouts service:auth-api

# 4. recurring hotspots across all runbooks
python3 -m rag.cli godnodes

# 5. corpus + graph stats
python3 -m rag.cli stats
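
For intuition about what the `path` and `godnodes` commands compute, here is a minimal sketch rather than the repository's own code. It assumes an undirected adjacency-list graph and uses breadth-first search for the shortest path. It also assumes that a "god node" is simply a node with the highest degree. The graph contents and the names `GRAPH`, `shortest_path`, and `top_degree` are hypothetical.

```python
from collections import deque

# Hypothetical graph: node ids follow the "type:name" style used by the CLI examples.
GRAPH = {
    "doc:payment-timeouts": ["service:payments-api", "table:Requests"],
    "doc:auth-throttling": ["service:auth-api", "table:Requests"],
    "doc:checkout-errors": ["table:Requests"],
    "service:payments-api": ["doc:payment-timeouts"],
    "service:auth-api": ["doc:auth-throttling"],
    "table:Requests": [
        "doc:payment-timeouts",
        "doc:auth-throttling",
        "doc:checkout-errors",
    ],
}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list from start to goal, or None."""
    if start not in graph or goal not in graph:
        return None
    parents = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


def top_degree(graph, k=3):
    """Rank nodes by number of neighbours (assumed definition of a hotspot)."""
    return sorted(graph, key=lambda n: (-len(graph[n]), n))[:k]


print(shortest_path(GRAPH, "doc:payment-timeouts", "service:auth-api"))
# ['doc:payment-timeouts', 'table:Requests', 'doc:auth-throttling', 'service:auth-api']
print(top_degree(GRAPH, k=1))
# ['table:Requests']
```

The path runs through the shared `Requests` table, which is the kind of link a keyword match on the two runbooks' text would not reveal.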

Feed a whole incident file:

python3 -m rag.cli query "$(cat data/incidents/sample-incident.md)" --json
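
If you would rather drive the same command from a script than from the shell, something like the following could work. It is a sketch under two assumptions: that it runs from the repository root, and that `--json` prints a single JSON document to stdout. The output schema is not described here, so the result is returned as-is. The helper names `build_query_command` and `run_query` are hypothetical.

```python
import json
import subprocess
import sys
from pathlib import Path


def build_query_command(incident_text, python=sys.executable):
    """Build the argv for the documented `query ... --json` invocation."""
    return [python, "-m", "rag.cli", "query", incident_text, "--json"]


def run_query(incident_path, repo_root="."):
    """Run the CLI on an incident file and decode whatever JSON it prints.

    Assumes stdout contains exactly one JSON document; raises
    CalledProcessError if the CLI exits with a non-zero status.
    """
    text = Path(incident_path).read_text(encoding="utf-8")
    proc = subprocess.run(
        build_query_command(text),
        cwd=repo_root,
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(proc.stdout)


# Example call (needs the repository checked out locally):
# result = run_query("data/incidents/sample-incident.md")
```

Passing the incident text as a single argv element avoids the shell quoting issues that `$(cat ...)` can run into with unusual characters.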

How it works

runbooks (*.md)
   │  rag/ingest.py   parse frontmatter, sections, mitigations
   │  rag/kql.py      extract Kusto tables / clusters / functions
   ▼
documents ── rag/graph.py ─► knowledge graph (doc ↔ table ↔ service ↔ symptom)
   │
   ▼
rag/retrieve.py   BM25 lexical match  +  graph expansion (shared-entity siblings)
   │
   ▼
ranked runbooks + "linked by" reasons + tables to check + mitigations
   │
   ▼
your coding assistant reasons over the result  (CLI or optional MCP server)
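
The retrieval step in the diagram can be illustrated with a small standalone sketch. This is not the code in `rag/retrieve.py`. It assumes Okapi BM25 with the common defaults `k1=1.5` and `b=0.75`, a non-negative IDF variant, and an expansion rule that adds any runbook sharing an entity with a top lexical hit. The sample documents, entity sets, and function names are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` ({id: text}) against `query` with BM25."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    doc_freq = Counter(term for t in tokens.values() for term in set(t))
    query_terms = set(tokenize(query))
    scores = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


def expand(scores, entities, top_k=1):
    """Return the top lexical hits plus siblings that share an entity with them."""
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    hits = [doc_id for doc_id, score in ranked if score > 0][:top_k]
    siblings = {}
    for hit in hits:
        for other, ents in entities.items():
            shared = entities[hit] & ents
            if other not in hits and shared:
                siblings.setdefault(other, set()).update(shared)
    return hits, siblings


# Hypothetical corpus and extracted entities.
DOCS = {
    "payment-timeouts": "Checkout returns 504 and p99 latency rises on payments-api",
    "auth-throttling": "Token issuance is throttled and callers see 429 responses",
    "disk-pressure": "Node disk usage above 90 percent triggers eviction",
}
ENTITIES = {
    "payment-timeouts": {"table:Requests", "service:payments-api"},
    "auth-throttling": {"table:Requests", "service:auth-api"},
    "disk-pressure": {"table:NodeMetrics"},
}

scores = bm25_scores("checkout 504 latency", DOCS)
hits, siblings = expand(scores, ENTITIES)
print(hits)      # ['payment-timeouts']
print(siblings)  # {'auth-throttling': {'table:Requests'}}
```

The auth runbook scores zero lexically, yet it is still surfaced with `table:Requests` as the "linked by" reason.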

| Module | Responsibility | Deps |
| --- | --- | --- |
| `rag/ingest.py` | runbook markdown → `Document` | stdlib |
| `rag/kql.py` | Kusto/KQL entity extraction | stdlib |
| `rag/graph.py` | typed knowledge graph, BFS path, god-nodes | stdlib |
| `rag/retrieve.py` | BM25 + graph expansion | stdlib |
| `rag/cli.py` | command-line interface | stdlib |
| `rag/store.py` | optional SQLite persistence | stdlib |
| `rag/mcp_server.py` | optional MCP server for assistants | `mcp` (optional) |
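
To give a feel for deterministic Kusto table extraction, here is a regex-based sketch. It is a heuristic written for illustration and not the logic in `rag/kql.py`. It assumes queries live in fenced `kql` blocks, that the source table is an identifier at the start of a line followed by a pipe or a line break, and that `join` and `union` operands are plain table names. All names in the snippet are hypothetical.

```python
import re

TICKS = "`" * 3  # built at runtime so this example can sit inside a fenced block

KQL_FENCE = re.compile(TICKS + r"kql[ \t]*\n(.*?)" + TICKS, re.DOTALL)
# A line that starts with an identifier followed by a pipe or end of line.
SOURCE_TABLE = re.compile(r"^[ \t]*([A-Za-z_]\w*)[ \t]*(?:\||$)", re.MULTILINE)
# Operand of join/union, optionally preceded by kind=... and an opening paren.
JOINED_TABLE = re.compile(
    r"\b(?:join|union)\b(?:\s+kind\s*=\s*\w+)?\s*\(?\s*([A-Za-z_]\w*)"
)


def extract_tables(markdown):
    """Return table names found in kql fences, in first-seen order."""
    tables = []
    for block in KQL_FENCE.findall(markdown):
        for pattern in (SOURCE_TABLE, JOINED_TABLE):
            for name in pattern.findall(block):
                if name not in tables:
                    tables.append(name)
    return tables


# Hypothetical runbook fragment with two queries.
RUNBOOK = (
    "## Diagnosis\n"
    f"{TICKS}kql\n"
    'Requests | where Service == "payments-api" | summarize count() by ResultCode\n'
    f"{TICKS}\n"
    f"{TICKS}kql\n"
    "Dependencies\n"
    '| where Target == "auth-api"\n'
    "| join kind=inner (Exceptions) on OperationId\n"
    f"{TICKS}\n"
)

print(extract_tables(RUNBOOK))
# ['Requests', 'Dependencies', 'Exceptions']
```

A heuristic like this will miss tables referenced through `let` bindings or cluster-qualified names, so treat it as a starting point only.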

Connecting your assistant

Option A (shell, zero install): let your assistant run python3 -m rag.cli query "..." --json and read the result. This works anywhere Python 3 is available.

Option B (MCP server, optional): run pip install "mcp[cli]", then register python3 -m rag.mcp_server as an MCP stdio server. It exposes find_runbooks and explain as tools. MCP here runs over local stdio only, so the engine still makes no network calls.

Runbook format

Forgiving markdown. Frontmatter is optional; inline **Field:** lines work too.

---
title: Payment gateway timeouts
service: payments-api
severity: 2
symptoms: [checkout 504, p99 latency above 2s]
---

## Symptoms
- Checkout returns 504

## Diagnosis
```kql
Requests | where Service == "payments-api" | summarize count() by ResultCode
```

## Mitigation
- Failover to the secondary region
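
As a rough illustration of how such a forgiving format can be read with the standard library alone, consider the sketch below. It is not the parser in `rag/ingest.py`. It assumes that frontmatter is a block of simple `key: value` lines between two `---` markers, that bracketed values are comma-separated lists, and that inline fields look like `**Field:** value` at the start of a line. All values stay strings. The function names are hypothetical.

```python
import re

# Inline metadata such as "**Service:** auth-api" at the start of a line.
INLINE_FIELD = re.compile(r"^\*\*(\w[\w ]*):\*\*[ \t]*(.+)$", re.MULTILINE)


def parse_value(raw):
    """Turn "[a, b]" into a list of strings; leave everything else as a string."""
    raw = raw.strip()
    if raw.startswith("[") and raw.endswith("]"):
        return [item.strip() for item in raw[1:-1].split(",") if item.strip()]
    return raw


def parse_metadata(markdown):
    """Return (metadata dict, body text). Frontmatter wins over inline fields."""
    meta = {}
    body = markdown
    lines = markdown.splitlines()
    if lines and lines[0].strip() == "---":
        end = next(
            (i for i, line in enumerate(lines[1:], 1) if line.strip() == "---"),
            None,
        )
        if end is not None:
            for line in lines[1:end]:
                key, sep, value = line.partition(":")
                if sep:
                    meta[key.strip().lower()] = parse_value(value)
            body = "\n".join(lines[end + 1:])
    for key, value in INLINE_FIELD.findall(body):
        meta.setdefault(key.strip().lower(), parse_value(value))
    return meta, body


WITH_FRONTMATTER = (
    "---\n"
    "title: Payment gateway timeouts\n"
    "service: payments-api\n"
    "severity: 2\n"
    "symptoms: [checkout 504, p99 latency above 2s]\n"
    "---\n\n"
    "## Symptoms\n- Checkout returns 504\n"
)
INLINE_ONLY = "**Service:** auth-api\n**Severity:** 3\n\n## Mitigation\n- Raise the quota\n"

meta, _ = parse_metadata(WITH_FRONTMATTER)
print(meta["symptoms"])  # ['checkout 504', 'p99 latency above 2s']
inline_meta, _ = parse_metadata(INLINE_ONLY)
print(inline_meta)       # {'service': 'auth-api', 'severity': '3'}
```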

Tests

python3 tests/test_engine.py     # or: python3 -m pytest

Prior art / inspiration

Microsoft GraphRAG — the original graph-RAG approach
LightRAG — lightweight local-first graph RAG
nano-graphrag — hackable ~1.1k-line implementation
RAG-based incident resolution for IT support (arXiv 2409.13707)

This project trades those frameworks' LLM-driven entity extraction for deterministic, dependency-free extraction tuned to runbooks + KQL — so it runs in environments where you can't install packages or send data out.

License

MIT — see LICENSE.
