pi-keeper

A Pi extension that tries to keep the model's short-term memory from filling up during long sessions.

Every model has a context window — a fixed amount of text it can "see" at once (the system prompt, your messages, and every tool output so far). On a long task that window fills with big, mostly-useless tool dumps. When it gets too full, Pi has to summarize and drop old turns (compaction), and the local server often has to re-read the whole conversation from scratch (reprocessing / prefill) — which is slow.

pi-keeper's goal is to delay both of those by keeping the window lean. Whether it actually helps depends a lot on your setup — see Does this actually help? below, which is the honest version.

What it does

What	Needs the custom server?	Plain description
Spill big outputs	no	A tool output bigger than ~8000 characters is written to a file on disk and replaced in the chat with a one-line pointer + a short preview. The model can read the full thing back later with `keeper_recall`. This is the part that most reliably shrinks the window.
Durable notes	no	A plain file (`AGENTS.md`) is pasted into the system prompt every turn, so notes survive compaction. Today you fill that file yourself — there is no automatic "remember this" tool yet.
Off-context reading	optional	`keeper_read` / `keeper_debug` can do a heavy read or a reasoning pass in a separate server session, so the bulky text never lands in your main chat. Without the server they just fall back to a normal read / inline note.
Cache reuse on rewind	optional	When you rewind the conversation (`/keeper rollback`), it asks the server to reuse its saved progress instead of re-reading everything.

The slash command is /keeper and the tools are keeper_read, keeper_debug, keeper_recall (details below).
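
To make the spill-and-recall idea concrete, here is a minimal sketch. It is not pi-keeper's actual code: the names `spillIfLarge`, `recall`, and `SpillStore`, the pointer wording, and the preview length are all hypothetical, and an in-memory map stands in for the files the extension writes to disk. It also assumes `start` and `n` count characters, which the real `keeper_recall` may not.

```typescript
// Hypothetical sketch of spilling; not pi-keeper's real implementation.
const SPILL_CHARS = 8000;  // default threshold from the settings table
const PREVIEW_CHARS = 200; // assumed preview length (not from the README)

// Stand-in for files on disk: ref -> full tool output.
type SpillStore = Map<string, string>;

// Returns the text that should appear in the chat for a tool output.
function spillIfLarge(
  output: string,
  store: SpillStore,
  threshold: number = SPILL_CHARS,
): string {
  if (output.length <= threshold) return output; // small outputs stay inline
  const ref = `spill-${store.size + 1}`;         // hypothetical ref format
  store.set(ref, output);
  const preview = output.slice(0, PREVIEW_CHARS).replace(/\s+/g, " ");
  return `[spilled ${output.length} chars to ${ref}; use keeper_recall to read it] ${preview}...`;
}

// Reads a slice of a spilled output. Assumes start/n are character offsets.
function recall(store: SpillStore, ref: string, start = 0, n = 2000): string {
  const full = store.get(ref);
  if (full === undefined) throw new Error(`unknown ref: ${ref}`);
  return full.slice(start, start + n);
}
```

The point of the design is that the chat only ever carries the short pointer; the full text costs window space only when the model asks for a slice.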

Does this actually help?

Honest answer: it depends, and you should measure it rather than trust the marketing.

Spilling big outputs — yes, almost always. Replacing a 50 KB tool dump with a one-line pointer straightforwardly removes tokens from the window. Worst case the model has to call keeper_recall to get a slice back — one extra round-trip — but the window genuinely stays smaller and you hit compaction later.
Off-context reading — best with a spare slot, but it now works on a single slot too. On a single slot (--parallel 1), side work shares the slot with your main chat, so a naive side session would wipe the main chat's saved progress and force the next turn to re-read the whole conversation. pi-keeper avoids that in two ways. If your llama-server build exposes the RAM state stash (save-ram/restore-ram), pi-keeper snapshots the main conversation to host RAM, runs the side session on the slot, then restores the snapshot: no re-read, and it works even on recurrent / hybrid models (where the prompt cache is disabled). pi-keeper detects this automatically. Without the stash, single-slot side work quietly falls back to a normal inline read, so it never backfires. To get the off-context benefit in that case, run with --parallel 2 for a dedicated slot or, if you're sure your model keeps its prompt cache, force multiplexing with PI_KEEPER_MULTIPLEX=1.
Durable notes — only if something writes to AGENTS.md. Right now nothing fills it automatically, so out of the box this tier does nothing until you put notes in the file.
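
The off-context rules above amount to a small decision table. The sketch below is one way to write it down, under assumptions: `ServerCaps`, `SideMode`, and `pickSideMode` are hypothetical names, and the order of the checks (dedicated slot, then RAM stash, then forced multiplexing) is a guess at the priority, not something the extension is confirmed to do.

```typescript
// Hypothetical decision sketch; names and check order are assumptions.
type SideMode = "dedicated-slot" | "ram-stash" | "multiplex" | "inline-fallback";

interface ServerCaps {
  reachable: boolean;   // is the custom llama.cpp server responding?
  slots: number;        // value passed to --parallel
  hasRamStash: boolean; // save-ram / restore-ram available
}

function pickSideMode(caps: ServerCaps, forceMultiplex: boolean): SideMode {
  if (!caps.reachable) return "inline-fallback";  // no server: normal read
  if (caps.slots >= 2) return "dedicated-slot";   // spare slot, main cache untouched
  if (caps.hasRamStash) return "ram-stash";       // snapshot, run, restore
  if (forceMultiplex) return "multiplex";         // PI_KEEPER_MULTIPLEX=1
  return "inline-fallback";                       // safe default on a single slot
}
```

Every branch ends somewhere safe: when nothing better is available, the side work degrades to an inline read instead of clobbering the main chat's cache.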

How to measure it yourself

Run the same task twice — once normally, once with /keeper spill off, /keeper side off, /keeper pin off — and compare your llama-server log:

total prompt processing tokens (less re-reading = better),
how often compaction kicks in,
fill_pct in the completion responses (how full the window got).

If the numbers don't improve, the extension isn't earning its keep on your setup — that's useful to know, and the whole reason this section exists.
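
If you would rather not add up the log by hand, a few lines of code can total the prompt-processing tokens for each run. This sketch assumes the log contains lines of the form `prompt eval time = ... ms / N tokens`, as upstream llama-server prints; check your own log first, because the fork's output may differ. `sumPromptTokens` is a hypothetical helper, not part of pi-keeper.

```typescript
// Hypothetical helper: totals prompt-processing tokens from llama-server log text.
// Assumes timing lines like "prompt eval time =   812.40 ms /   512 tokens (...)".
function sumPromptTokens(logText: string): number {
  // A fresh regex per call avoids stale lastIndex state on the global flag.
  const re = /prompt eval time\s*=\s*[\d.]+\s*ms\s*\/\s*(\d+)\s*tokens/g;
  let total = 0;
  let m: RegExpExecArray | null;
  while ((m = re.exec(logText)) !== null) {
    total += Number(m[1]); // captured token count for this request
  }
  return total;
}

// Relative saving of the second run against the first (0.25 means 25% fewer tokens).
function promptTokenSaving(baselineLog: string, keeperLog: string): number {
  const baseline = sumPromptTokens(baselineLog);
  if (baseline === 0) return 0; // nothing to compare against
  return (baseline - sumPromptTokens(keeperLog)) / baseline;
}
```

Feed it the two log files as strings (one from the run with the features off, one with them on); a saving near zero or below is the signal that the extension is not helping on your setup.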

Install

Use Pi's package manager:

pi install https://github.com/nonml/pi-keeper     # from GitHub (also: git:github.com/nonml/pi-keeper)
# or, working from a local checkout:
pi install ./pi-keeper                            # add -l to install for the current project only

That registers it in Pi's settings — pi list shows it, pi remove <same source> (alias pi uninstall) removes it, and pi update pulls the latest. There's no build step: Pi loads the TypeScript directly and a bare index.ts at the repo root is all it needs.

Run /keeper to check status. The "no server" features (spilling, durable notes) work right away with any Pi provider.

Optional: the custom llama.cpp server (for the faster paths)

The off-context and cache-reuse features need a llama.cpp server that exposes a few extra endpoints (checkpoint / rollback / fork + a fill_pct gauge). Those live on the pi-keeper branch of the fork: https://github.com/nonml/llama.cpp.

Option A — build the fork directly (simplest)

git clone -b pi-keeper https://github.com/nonml/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON        # drop -DGGML_CUDA=ON for a CPU-only build
cmake --build build --config Release -j

Option B — add the patch to your own up-to-date llama.cpp

If you already track upstream and just want the one feature commit on top of it, cherry-pick it:

# from inside your existing (upstream) llama.cpp checkout
git remote add nonml https://github.com/nonml/llama.cpp
git fetch nonml pi-keeper
git cherry-pick nonml/pi-keeper      # the fetched branch lives under the remote's name
# if upstream has moved and it conflicts: fix the files, then `git cherry-pick --continue`
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

It's a single self-contained commit (the checkpoint/rollback/fork endpoints + fill_pct), so the cherry-pick is clean unless upstream changed the same server files.

Run it

./build/bin/llama-server -m your-model.gguf --slot-save-path ./slots --parallel 2

--slot-save-path is required for any of the slot features.
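--parallel 2 gives side sessions their own slot (see Off-context reading under Does this actually help? above). With --parallel 1, side sessions need the RAM state stash; without it they fall back to inline reads, so you may as well keep /keeper side off.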

The endpoint details are documented in the fork at tools/server/README-checkpoint.md.

Commands

One command, /keeper, with sub-commands. The autocomplete popup shows each one and its current on/off state, so you don't have to memorize them:

/keeper doctor — status: server, slots, what's on/off, where notes are stored (bare /keeper just lists the commands)
/keeper probe — re-check what the local server supports
/keeper rollback [n] — rewind to the n-th most recent message you sent (default: the last one)
/keeper spill on|off — toggle spilling big outputs to disk
/keeper pin on|off — toggle asking the server to reuse its cache on rewind
/keeper side on|off — toggle off-context reading/reasoning (on a single-slot server this only helps if the server has the RAM state stash or PI_KEEPER_MULTIPLEX=1 is set)
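
The `[n]` argument to /keeper rollback counts back through the messages you sent, not through every message in the chat. A minimal sketch of that lookup follows; `Msg` and `rollbackIndex` are hypothetical names, and what the extension does with the index afterwards (truncating history, asking the server to reuse its cache) is not shown.

```typescript
// Hypothetical sketch of locating the rollback target; not pi-keeper's real code.
interface Msg {
  role: "user" | "assistant" | "tool";
  text: string;
}

// Index of the n-th most recent user message (n = 1 is the last one sent),
// or -1 when the history holds fewer than n user messages.
function rollbackIndex(history: Msg[], n: number = 1): number {
  let seen = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    if (history[i].role === "user") {
      seen += 1;
      if (seen === n) return i;
    }
  }
  return -1;
}
```

Assistant and tool messages are skipped while counting, so `/keeper rollback 2` lands on your second-to-last message no matter how many tool calls happened in between.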

Tools (the model can call these)

keeper_read(path, goal) — read a file and return only what goal needs
keeper_debug(question) — a focused root-cause reasoning pass over recent context
keeper_recall(ref, start?, n?) — read part of a previously spilled output by its ref

Settings (all optional, via environment variables)

Variable	Default	Meaning
`PI_KEEPER_SERVER`	`http://127.0.0.1:8080`	llama.cpp server address (a `/v1` suffix is stripped automatically)
`PI_KEEPER_WORKDIR`	next to Pi's own session files	where spilled outputs and `AGENTS.md` are kept
`PI_KEEPER_SPILL_CHARS`	`8000`	spill tool outputs bigger than this many characters
`PI_KEEPER_SIDE_SLOT`	`1`	which server slot to use for side sessions (slots count from 0, so this needs `--parallel ≥ 2`)
`PI_KEEPER_MAIN_SLOT`	`0`	the slot your main chat is pinned to
`PI_KEEPER_MULTIPLEX`	`0`	set to `1` to allow single-slot side sessions even when pi-keeper can't confirm that the prompt cache is on. Leave at `0` unless you know your model keeps its cache
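
As an illustration of how a few of these variables could be resolved, here is a small sketch. `KeeperSettings` and `resolveSettings` are hypothetical names and the parsing rules (for example, ignoring values that are not integers) are assumptions; only the variable names, the defaults, and the `/v1` stripping come from the table above.

```typescript
// Hypothetical sketch of reading settings from the environment.
interface KeeperSettings {
  server: string;     // PI_KEEPER_SERVER with any /v1 suffix removed
  spillChars: number; // PI_KEEPER_SPILL_CHARS
  mainSlot: number;   // PI_KEEPER_MAIN_SLOT
  multiplex: boolean; // PI_KEEPER_MULTIPLEX === "1"
}

// Parses an integer, falling back to the default when unset or invalid (assumed behavior).
function intOr(value: string | undefined, fallback: number): number {
  const n = Number.parseInt(value ?? "", 10);
  return Number.isNaN(n) ? fallback : n;
}

function resolveSettings(env: Record<string, string | undefined>): KeeperSettings {
  const rawServer = env.PI_KEEPER_SERVER ?? "http://127.0.0.1:8080";
  return {
    server: rawServer.replace(/\/v1\/?$/, ""), // strip a trailing /v1 or /v1/
    spillChars: intOr(env.PI_KEEPER_SPILL_CHARS, 8000),
    mainSlot: intOr(env.PI_KEEPER_MAIN_SLOT, 0),
    multiplex: env.PI_KEEPER_MULTIPLEX === "1",
  };
}
```

In a Node-based host you would pass `process.env`; taking the environment as a parameter keeps the function easy to test.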

How it's laid out

pi-keeper/
├── index.ts    wires up Pi's hooks, the tools, and the /keeper command
├── server.ts   talks to the custom llama.cpp server (and degrades gracefully when it's absent)
├── state.ts    the on-disk notes + spilled outputs
└── README.md

License

MIT
