runbook-graphrag

A fully-local, zero-dependency graph-RAG engine for incident investigation over runbooks / troubleshooting guides (TSGs).

Point it at a folder of markdown runbooks. It parses them, extracts the entities that connect them — services, Kusto/KQL tables, symptoms, mitigations — and builds a small knowledge graph. Given an incident description, it returns the most relevant runbooks plus the graph-linked siblings that flat keyword search would miss, along with the exact Kusto tables to check and the documented mitigations.

The "reasoning" layer is your own coding assistant (Claude Code, Cursor, etc.). This engine just does fast, explainable, local retrieval and graph traversal. It never calls an LLM and never touches the network.

Built as an open, adaptable blueprint. Drop your real runbooks into data/runbooks/, and hand the repo to your assistant — CLAUDE.md tells it how to wire this into your environment.

Why graph-RAG for incidents?

Plain vector/keyword search finds runbooks whose text matches an incident. But incidents cascade: an auth-throttling event surfaces as payment 504s. If two runbooks query the same Kusto table or touch the same service, they're related even when their words differ. Modeling runbooks + entities as a graph captures exactly those links — the approach Microsoft itself has written up for GraphRAG-powered incident & change management and TSG recommendation.
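To make the shared-entity idea concrete, here is a minimal sketch (not the repo's actual code) that links runbooks through the entities they mention. The corpus, the `doc:auth-throttling` and `doc:disk-full` ids, and the function names are hypothetical; only the `type:name` node-id style is taken from the CLI examples in this README.

```python
from collections import defaultdict

# Hypothetical toy corpus: runbook id -> entities it mentions.
RUNBOOK_ENTITIES = {
    "doc:payment-timeouts": {"service:payments-api", "table:Requests"},
    "doc:auth-throttling": {"service:auth-api", "table:Requests"},
    "doc:disk-full": {"service:storage-api", "table:NodeMetrics"},
}


def build_entity_index(runbook_entities):
    """Invert doc -> entities into entity -> docs."""
    index = defaultdict(set)
    for doc, entities in runbook_entities.items():
        for entity in entities:
            index[entity].add(doc)
    return index


def siblings(doc, runbook_entities):
    """Return {other_doc: [shared entities]} for docs sharing at least one entity."""
    index = build_entity_index(runbook_entities)
    linked = defaultdict(set)
    for entity in runbook_entities[doc]:
        for other in index[entity]:
            if other != doc:
                linked[other].add(entity)
    return {other: sorted(shared) for other, shared in linked.items()}


print(siblings("doc:payment-timeouts", RUNBOOK_ENTITIES))
# {'doc:auth-throttling': ['table:Requests']}
```

The two runbooks share no service and little vocabulary, yet they are linked because both query the same table.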

Design constraints (deliberate)

  • Standard library only for the core. No pip install needed to run — important for locked-down corporate environments.
  • No external systems. No vector DB, no graph DB, no cloud, no LLM API. Everything is in-process over local files.
  • No data egress. Your runbooks never leave the machine. Safe for internal/confidential TSGs.
  • Hackable. The whole engine is ~600 lines across 6 small modules.

Quickstart

```bash
git clone https://github.com/rahulcommercial/runbook-graphrag
cd runbook-graphrag

# 1. find runbooks for an incident
python3 -m rag.cli query "checkout 504s and rising p99 latency, throttling upstream"

# 2. inspect an entity and everything connected to it
python3 -m rag.cli explain table:Requests

# 3. shortest path between two nodes (how is this runbook related to that service?)
python3 -m rag.cli path doc:payment-timeouts service:auth-api

# 4. recurring hotspots across all runbooks
python3 -m rag.cli godnodes

# 5. corpus + graph stats
python3 -m rag.cli stats
```
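As a rough illustration of what the `path` and `godnodes` commands compute, the sketch below runs a breadth-first search over a small undirected graph and ranks entity nodes by degree. The edge list is hypothetical, and treating "god nodes" as the highest-degree non-document nodes is an assumption about the engine, not something this README confirms.

```python
from collections import deque

# Hypothetical edges; node ids follow the type:name style used by the CLI.
EDGES = [
    ("doc:payment-timeouts", "service:payments-api"),
    ("doc:payment-timeouts", "table:Requests"),
    ("doc:auth-throttling", "table:Requests"),
    ("doc:auth-throttling", "service:auth-api"),
]


def adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbours."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes on a shortest path, or None."""
    if start not in graph or goal not in graph:
        return None
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in sorted(graph[node]):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


def god_nodes(graph, top=3):
    """Entity (non-document) nodes ranked by degree, ties broken by name."""
    entities = [n for n in graph if not n.startswith("doc:")]
    return sorted(entities, key=lambda n: (-len(graph[n]), n))[:top]


GRAPH = adjacency(EDGES)
print(shortest_path(GRAPH, "doc:payment-timeouts", "service:auth-api"))
# ['doc:payment-timeouts', 'table:Requests', 'doc:auth-throttling', 'service:auth-api']
print(god_nodes(GRAPH))
# ['table:Requests', 'service:auth-api', 'service:payments-api']
```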

Feed a whole incident file:

```bash
python3 -m rag.cli query "$(cat data/incidents/sample-incident.md)" --json
```

How it works

```text
runbooks (*.md)
   │  rag/ingest.py   parse frontmatter, sections, mitigations
   │  rag/kql.py      extract Kusto tables / clusters / functions
   ▼
documents ── rag/graph.py ─► knowledge graph (doc ↔ table ↔ service ↔ symptom)
   │
   ▼
rag/retrieve.py   BM25 lexical match  +  graph expansion (shared-entity siblings)
   │
   ▼
ranked runbooks + "linked by" reasons + tables to check + mitigations
   │
   ▼
your coding assistant reasons over the result  (CLI or optional MCP server)
```

| Module | Responsibility | Deps |
| --- | --- | --- |
| rag/ingest.py | runbook markdown → Document | stdlib |
| rag/kql.py | Kusto/KQL entity extraction | stdlib |
| rag/graph.py | typed knowledge graph, BFS path, god-nodes | stdlib |
| rag/retrieve.py | BM25 + graph expansion | stdlib |
| rag/cli.py | command-line interface | stdlib |
| rag/store.py | optional SQLite persistence | stdlib |
| rag/mcp_server.py | optional MCP server for assistants | mcp (optional) |
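The retrieval step (BM25 lexical match followed by graph expansion) can be sketched roughly as follows. This is an illustrative, standard-library-only approximation, not the contents of `rag/retrieve.py`: the toy documents, the `boost` factor, and the result shape are all assumptions, and the BM25 parameters are the commonly used defaults `k1=1.5`, `b=0.75`.

```python
import math
import re
from collections import Counter

# Hypothetical corpus and entity sets, for illustration only.
DOCS = {
    "payment-timeouts": "checkout returns 504 and p99 latency rises on payments-api",
    "auth-throttling": "auth-api throttling rejects token requests with 429",
    "disk-full": "node disk usage above 95 percent on storage hosts",
}
ENTITIES = {
    "payment-timeouts": {"table:Requests", "service:payments-api"},
    "auth-throttling": {"table:Requests", "service:auth-api"},
    "disk-full": {"table:NodeMetrics"},
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document against the query."""
    tokens = {doc: tokenize(text) for doc, text in docs.items()}
    n = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n
    df = Counter(term for t in tokens.values() for term in set(t))
    query_terms = set(tokenize(query))
    scores = {}
    for doc, terms in tokens.items():
        tf = Counter(terms)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(terms) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc] = score
    return scores


def retrieve(query, docs, entities, boost=0.5):
    """Lexical hits first, then siblings that share an entity with a hit."""
    hits = {d: s for d, s in bm25_scores(query, docs).items() if s > 0}
    results = {d: {"score": s, "linked_by": []} for d, s in hits.items()}
    for hit, score in hits.items():
        for other in docs:
            shared = entities.get(hit, set()) & entities.get(other, set())
            if other == hit or not shared:
                continue
            entry = results.setdefault(other, {"score": 0.0, "linked_by": []})
            entry["score"] += boost * score  # discounted credit for graph-linked docs
            entry["linked_by"].extend(sorted(shared))
    return sorted(results.items(), key=lambda kv: -kv[1]["score"])


for doc, info in retrieve("checkout 504 p99 latency", DOCS, ENTITIES):
    print(doc, info["linked_by"])
```

In this toy run only `payment-timeouts` matches the query text; `auth-throttling` is pulled in because it shares `table:Requests`, which is the kind of sibling a flat keyword search would miss.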

Connecting your assistant

Option A, shell (zero install): let your assistant run python3 -m rag.cli query "..." --json and read the result. This works anywhere Python 3 is available.

Option B, MCP server (optional): run pip install "mcp[cli]", then register python3 -m rag.mcp_server as an MCP stdio server. It exposes find_runbooks and explain as tools. MCP here runs over local stdio only, so the engine still makes no network calls.
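For the shell route, a thin wrapper along these lines is one way a script or assistant harness could call the CLI and parse its output. It assumes that `--json` prints a single JSON document to stdout and that the command is run from the repository root; the JSON schema is not documented here, so the sketch returns the parsed value as-is. `run_query` and `base_cmd` are hypothetical names.

```python
import json
import subprocess
import sys


def run_query(incident_text, base_cmd=None):
    """Run the query command with --json and return the parsed output.

    base_cmd lets callers substitute a different executable (useful in tests);
    by default it mirrors `python3 -m rag.cli query` from the Quickstart.
    """
    cmd = list(base_cmd) if base_cmd else [sys.executable, "-m", "rag.cli", "query"]
    completed = subprocess.run(
        cmd + [incident_text, "--json"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout)
```

From the repository root, `run_query("checkout 504s and rising p99 latency")` would then return whatever structure the CLI emits.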

Runbook format

Forgiving markdown. Frontmatter is optional; inline **Field:** lines work too.

````markdown
---
title: Payment gateway timeouts
service: payments-api
severity: 2
symptoms: [checkout 504, p99 latency above 2s]
---

## Symptoms
- Checkout returns 504

## Diagnosis
```kql
Requests | where Service == "payments-api" | summarize count() by ResultCode
```

## Mitigation
- Failover to the secondary region
````
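To show how a runbook in this format might be turned into structured data, here is a simplified parsing sketch. It is not the code in `rag/ingest.py` or `rag/kql.py`: the function names are hypothetical, only flat `key: value` frontmatter and `[a, b]` lists are handled, inline `**Field:**` lines are ignored, and the KQL heuristic only picks up an identifier that starts a line and is followed by a pipe or the end of the line.

```python
import re

FENCE = "`" * 3  # built indirectly so this example can itself sit inside a fenced block

RUNBOOK = "\n".join([
    "---",
    "title: Payment gateway timeouts",
    "service: payments-api",
    "severity: 2",
    "symptoms: [checkout 504, p99 latency above 2s]",
    "---",
    "",
    "## Symptoms",
    "- Checkout returns 504",
    "",
    "## Diagnosis",
    FENCE + "kql",
    'Requests | where Service == "payments-api" | summarize count() by ResultCode',
    FENCE,
    "",
    "## Mitigation",
    "- Failover to the secondary region",
])


def parse_frontmatter(text):
    """Return (metadata dict, remaining body). Values stay strings or lists of strings."""
    meta = {}
    match = re.match(r"---\n(.*?)\n---\n?", text, re.DOTALL)
    if not match:
        return meta, text
    for line in match.group(1).splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            value = [item.strip() for item in value[1:-1].split(",") if item.strip()]
        meta[key.strip()] = value
    return meta, text[match.end():]


def kql_tables(body):
    """Collect the leading identifier of each statement inside fenced kql blocks."""
    tables = []
    pattern = re.compile(FENCE + r"kql\n(.*?)" + FENCE, re.DOTALL)
    for block in pattern.findall(body):
        for line in block.splitlines():
            head = re.match(r"\s*([A-Za-z_]\w*)\s*(\||$)", line)
            if head and head.group(1) not in tables:
                tables.append(head.group(1))
    return tables


def section_bullets(body, heading):
    """Return the bullet items listed under a given '## ' heading."""
    bullets, active = [], False
    for line in body.splitlines():
        if line.startswith("## "):
            active = line[3:].strip().lower() == heading.lower()
        elif active and line.startswith("- "):
            bullets.append(line[2:].strip())
    return bullets


meta, body = parse_frontmatter(RUNBOOK)
print(meta["service"], kql_tables(body), section_bullets(body, "Mitigation"))
# payments-api ['Requests'] ['Failover to the secondary region']
```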

Tests

```bash
python3 tests/test_engine.py     # or: python3 -m pytest
```

Prior art / inspiration

This project trades the LLM-driven entity extraction of typical graph-RAG frameworks for deterministic, dependency-free extraction tuned to runbooks and KQL, so it runs in environments where you can't install packages or send data out.

License

MIT — see LICENSE.
