A Pi extension that tries to keep the model's short-term memory from filling up during long sessions.
Every model has a context window — a fixed amount of text it can "see" at once (the system prompt, your messages, and every tool output so far). On a long task that window fills with big, mostly-useless tool dumps. When it gets too full, Pi has to summarize and drop old turns (compaction), and the local server often has to re-read the whole conversation from scratch (reprocessing / prefill) — which is slow.
pi-keeper's goal is to delay both of those by keeping the window lean. Whether it actually helps depends a lot on your setup — see Does this actually help? below, which is the honest version.
| What | Needs the custom server? | Plain description |
|---|---|---|
| Spill big outputs | no | A tool output bigger than ~8000 characters is written to a file on disk and replaced in the chat with a one-line pointer + a short preview. The model can read the full thing back later with keeper_recall. This is the part that most reliably shrinks the window. |
| Durable notes | no | A plain file (AGENTS.md) is pasted into the system prompt every turn, so notes survive compaction. Today you fill that file yourself — there is no automatic "remember this" tool yet. |
| Off-context reading | optional | keeper_read / keeper_debug can do a heavy read or a reasoning pass in a separate server session, so the bulky text never lands in your main chat. Without the server they just fall back to a normal read / inline note. |
| Cache reuse on rewind | optional | When you rewind the conversation (/keeper rollback), it asks the server to reuse its saved progress instead of re-reading everything. |
The slash command is /keeper and the tools are keeper_read, keeper_debug,
keeper_recall (details below).
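The spill behaviour described in the table above can be sketched roughly as follows. This is a minimal illustration, not pi-keeper's actual code: `spillIfLarge`, `SpillStore`, the pointer wording, and the 200-character preview length are all assumptions made for the example, and an in-memory map stands in for files on disk.

```typescript
// Hypothetical sketch of the spill step; names and pointer format are illustrative only.
interface SpillStore {
  save(ref: string, text: string): void;
}

const SPILL_CHARS = 8000; // mirrors the ~8000-character threshold described above
const PREVIEW_CHARS = 200; // assumed preview length, not taken from pi-keeper

function spillIfLarge(
  output: string,
  ref: string,
  store: SpillStore,
  limit: number = SPILL_CHARS,
): string {
  // Small outputs stay in the chat untouched.
  if (output.length <= limit) return output;

  // Large outputs go to the store; the chat keeps a pointer plus a short preview.
  store.save(ref, output);
  const preview = output.slice(0, PREVIEW_CHARS).replace(/\s+/g, " ").trim();
  return `[spilled ${output.length} chars as ${ref}; use keeper_recall to read it] ${preview}`;
}

// In-memory store standing in for files on disk.
const saved = new Map<string, string>();
const memoryStore: SpillStore = {
  save: (ref, text) => {
    saved.set(ref, text);
  },
};

const small = spillIfLarge("ok", "out-1", memoryStore);
const large = spillIfLarge("x".repeat(50000), "out-2", memoryStore);
```

The point of the design is that the replacement string has a fixed, small size no matter how large the original output was, which is why this tier shrinks the window without needing any server support.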
Honest answer: it depends, and you should measure it rather than trust the marketing.
- Spilling big outputs: yes, almost always. Replacing a 50 KB tool dump with a one-line pointer straightforwardly removes tokens from the window. Worst case, the model has to call `keeper_recall` to get a slice back (one extra round-trip), but the window genuinely stays smaller and you hit compaction later.
- Off-context reading: best with a spare slot, but now works single-slot too. On a single slot (`--parallel 1`), side work shares the slot with your main chat, so it would wipe the main chat's saved progress and force the next turn to re-read the whole conversation. pi-keeper avoids that in two ways. If your llama-server build exposes the RAM state stash (`save-ram` / `restore-ram`), it snapshots the main conversation to host RAM, runs the side session on the slot, then restores it: no re-read, and it works even on recurrent / hybrid models (where the prompt cache is disabled). pi-keeper detects this automatically. Without it, single-slot side work quietly falls back to a normal inline read (so it never backfires); to get the off-context benefit there, run with `--parallel 2` for a dedicated slot, or, if you're sure your model keeps its prompt cache, force multiplexing with `PI_KEEPER_MULTIPLEX=1`.
- Durable notes: only if something writes to `AGENTS.md`. Right now nothing fills it automatically, so out of the box this tier does nothing until you put notes in the file.
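The off-context decision described in the list above can be summarised as a small function. This is a sketch under assumptions: the type and field names are hypothetical, and the relative priority of the RAM stash over forced multiplexing is a guess, since the text does not say which wins when both are available.

```typescript
// Hypothetical decision helper; names and the ordering of checks are illustrative only.
type SideStrategy = "dedicated-slot" | "ram-stash" | "multiplex" | "inline-fallback";

interface ServerCaps {
  parallel: number; // value passed to --parallel on the server
  hasRamStash: boolean; // server exposes save-ram / restore-ram
  forceMultiplex: boolean; // PI_KEEPER_MULTIPLEX=1 was set
}

function chooseSideStrategy(caps: ServerCaps): SideStrategy {
  // A spare slot means side work never touches the main chat's slot.
  if (caps.parallel >= 2) return "dedicated-slot";
  // Single slot: snapshot the main conversation to host RAM, run side work, restore.
  if (caps.hasRamStash) return "ram-stash";
  // Single slot, no stash: only share the slot if the user explicitly opted in.
  if (caps.forceMultiplex) return "multiplex";
  // Otherwise do a normal inline read so the main chat's cache is never wiped.
  return "inline-fallback";
}

const spareSlot = chooseSideStrategy({ parallel: 2, hasRamStash: false, forceMultiplex: false });
const stash = chooseSideStrategy({ parallel: 1, hasRamStash: true, forceMultiplex: false });
const plain = chooseSideStrategy({ parallel: 1, hasRamStash: false, forceMultiplex: false });
```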
Run the same task twice, once normally and once with `/keeper spill off`, `/keeper side off`, and `/keeper pin off`, then compare your llama-server logs:
- total prompt processing tokens (less re-reading = better),
- how often compaction kicks in,
- `fill_pct` in the completion responses (how full the window got).
If the numbers don't improve, the extension isn't earning its keep on your setup — that's useful to know, and the whole reason this section exists.
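Once you have read the three figures out of the logs by hand, a small helper can make the comparison explicit. This is only a sketch: `RunStats`, `compareRuns`, and the "improved" rule of thumb are assumptions for illustration, and the numbers in the example are placeholders, not measurements.

```typescript
// Hypothetical helper for comparing two runs; field names are illustrative only.
interface RunStats {
  promptTokens: number; // total prompt-processing tokens, read from the llama-server log
  compactions: number; // how many times compaction kicked in
  peakFillPct: number; // highest fill_pct seen in the completion responses
}

interface Comparison {
  promptTokensChangePct: number;
  compactionsDelta: number;
  peakFillDelta: number;
  improved: boolean;
}

function percentChange(before: number, after: number): number {
  if (before === 0) return after === 0 ? 0 : Infinity;
  return ((after - before) / before) * 100;
}

function compareRuns(keeperOff: RunStats, keeperOn: RunStats): Comparison {
  const promptTokensChangePct = percentChange(keeperOff.promptTokens, keeperOn.promptTokens);
  const compactionsDelta = keeperOn.compactions - keeperOff.compactions;
  const peakFillDelta = keeperOn.peakFillPct - keeperOff.peakFillPct;
  return {
    promptTokensChangePct,
    compactionsDelta,
    peakFillDelta,
    // Assumed rule of thumb: fewer prompt tokens and no additional compactions.
    improved: promptTokensChangePct < 0 && compactionsDelta <= 0,
  };
}

// Placeholder numbers purely to exercise the function; they are not measurements.
const result = compareRuns(
  { promptTokens: 1000, compactions: 4, peakFillPct: 90 },
  { promptTokens: 750, compactions: 2, peakFillPct: 60 },
);
```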
Use Pi's package manager:

    pi install https://github.com/nonml/pi-keeper   # from GitHub (also: git:github.com/nonml/pi-keeper)
    # or, working from a local checkout:
    pi install ./pi-keeper                          # add -l to install for the current project only

That registers it in Pi's settings. `pi list` shows it, `pi remove <same source>` (alias `pi uninstall`) removes it, and `pi update` pulls the latest. There's no build step: Pi loads the TypeScript directly, and a bare `index.ts` at the repo root is all it needs.
Run `/keeper doctor` to check status. The "no server" features (spilling, durable notes) work right away
with any Pi provider.
The off-context and cache-reuse features need a llama.cpp server that exposes a few extra
endpoints (checkpoint / rollback / fork + a fill_pct gauge). Those live on the pi-keeper
branch of the fork: https://github.com/nonml/llama.cpp.

    git clone -b pi-keeper https://github.com/nonml/llama.cpp
    cd llama.cpp
    cmake -B build -DGGML_CUDA=ON        # drop -DGGML_CUDA=ON for a CPU-only build
    cmake --build build --config Release -j

If you already track upstream and just want the one feature commit on top of it, cherry-pick it:

    # from inside your existing (upstream) llama.cpp checkout
    git remote add nonml https://github.com/nonml/llama.cpp
    git fetch nonml pi-keeper
    # the fetch creates only a remote-tracking ref, so name it with the remote prefix
    git cherry-pick nonml/pi-keeper
    # if upstream has moved and it conflicts: fix the files, then `git cherry-pick --continue`
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j

It's a single self-contained commit (the checkpoint/rollback/fork endpoints + `fill_pct`), so the
cherry-pick is clean unless upstream changed the same server files.

    ./build/bin/llama-server -m your-model.gguf --slot-save-path ./slots --parallel 2

- `--slot-save-path` is required for any of the slot features.
- `--parallel 2` gives side sessions their own slot (see the single-slot caveat above). Use `--parallel 1` only if you'll keep `/keeper side off`.
The endpoint details are documented in the fork at tools/server/README-checkpoint.md.
One command, /keeper, with sub-commands. The autocomplete popup shows each one and its current
on/off state, so you don't have to memorize them:
- `/keeper doctor`: shows status (server, slots, what's on/off, where notes are stored). Bare `/keeper` just lists the commands.
- `/keeper probe`: re-check what the local server supports
- `/keeper rollback [n]`: rewind to the n-th most recent message you sent (default: the last one)
- `/keeper spill on|off`: toggle spilling big outputs to disk
- `/keeper pin on|off`: toggle asking the server to reuse its cache on rewind
- `/keeper side on|off`: toggle off-context reading/reasoning (turn off on a single-slot server)
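To make the `rollback [n]` semantics concrete, here is a minimal sketch of how "the n-th most recent message you sent" could be located in a conversation. The `Message` shape and `rollbackIndex` are hypothetical names for illustration; how pi-keeper actually stores and rewinds turns is not shown here.

```typescript
// Hypothetical sketch; the message shape and function name are illustrative only.
type Role = "user" | "assistant" | "tool";

interface Message {
  role: Role;
  text: string;
}

// Returns the index of the n-th most recent user message, or -1 if there are fewer than n.
function rollbackIndex(messages: Message[], n: number = 1): number {
  if (!Number.isInteger(n) || n < 1) {
    throw new RangeError("n must be a positive integer");
  }
  let seen = 0;
  // Walk backwards so the first user message found is the most recent one.
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === "user") {
      seen += 1;
      if (seen === n) return i;
    }
  }
  return -1;
}

const chat: Message[] = [
  { role: "user", text: "first request" },
  { role: "assistant", text: "reply" },
  { role: "user", text: "second request" },
  { role: "tool", text: "tool output" },
  { role: "assistant", text: "reply" },
  { role: "user", text: "third request" },
  { role: "assistant", text: "reply" },
];

const lastSent = rollbackIndex(chat); // default n = 1
const secondLastSent = rollbackIndex(chat, 2);
```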
- `keeper_read(path, goal)`: read a file and return only what `goal` needs
- `keeper_debug(question)`: a focused root-cause reasoning pass over recent context
- `keeper_recall(ref, start?, n?)`: read part of a previously spilled output by its `ref`
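As an illustration of what reading "part of" a spilled output might look like, the sketch below slices stored text by line. Treating `start` and `n` as a zero-based line offset and a line count is an assumption (they could equally be character offsets), and `recallSlice` with its default of 50 lines is a hypothetical helper, not the tool's real implementation.

```typescript
// Hypothetical sketch: assumes start is a zero-based line offset and n is a line count.
function recallSlice(text: string, start: number = 0, n: number = 50): string {
  const lines = text.split("\n");
  // Clamp so out-of-range requests return an empty slice instead of throwing.
  const from = Math.max(0, Math.min(start, lines.length));
  const count = Math.max(0, n);
  return lines.slice(from, from + count).join("\n");
}

const spilled = "a\nb\nc\nd";
const middle = recallSlice(spilled, 1, 2);
const pastEnd = recallSlice(spilled, 10, 2);
```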
| Variable | Default | Meaning |
|---|---|---|
| `PI_KEEPER_SERVER` | `http://127.0.0.1:8080` | llama.cpp server address (a `/v1` suffix is stripped automatically) |
| `PI_KEEPER_WORKDIR` | next to Pi's own session files | where spilled outputs and `AGENTS.md` are kept |
| `PI_KEEPER_SPILL_CHARS` | `8000` | spill tool outputs bigger than this many characters |
| `PI_KEEPER_SIDE_SLOT` | `2` | which server slot to use for side sessions (needs `--parallel` ≥ 3) |
| `PI_KEEPER_MAIN_SLOT` | `0` | the slot your main chat is pinned to |
| `PI_KEEPER_MULTIPLEX` | `0` | allow single-slot side sessions even when the prompt cache can't be confirmed on; leave off unless you know your model keeps its cache (set to `1` to enable) |
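The table above maps naturally onto a small config loader. The following is a sketch only: `loadConfig`, `KeeperConfig`, and `intOr` are hypothetical names, and the handling of unparsable numbers (fall back to the default) is an assumption rather than documented behaviour. The environment is passed in as a plain object so the function stays easy to test; in Node you would pass `process.env`.

```typescript
// Hypothetical config loader mirroring the table above; names are illustrative only.
interface KeeperConfig {
  server: string;
  workdir: string | undefined; // undefined means "next to Pi's own session files"
  spillChars: number;
  sideSlot: number;
  mainSlot: number;
  multiplex: boolean;
}

type Env = Record<string, string | undefined>;

// Parses an integer, falling back to the default when the value is missing or invalid.
function intOr(value: string | undefined, fallback: number): number {
  const parsed = Number.parseInt(value ?? "", 10);
  return Number.isNaN(parsed) ? fallback : parsed;
}

function loadConfig(env: Env): KeeperConfig {
  const rawServer = env["PI_KEEPER_SERVER"] ?? "http://127.0.0.1:8080";
  return {
    // Strip a trailing /v1 (with or without a final slash), as the table describes.
    server: rawServer.replace(/\/v1\/?$/, ""),
    workdir: env["PI_KEEPER_WORKDIR"],
    spillChars: intOr(env["PI_KEEPER_SPILL_CHARS"], 8000),
    sideSlot: intOr(env["PI_KEEPER_SIDE_SLOT"], 2),
    mainSlot: intOr(env["PI_KEEPER_MAIN_SLOT"], 0),
    multiplex: env["PI_KEEPER_MULTIPLEX"] === "1",
  };
}

const defaults = loadConfig({});
const custom = loadConfig({
  PI_KEEPER_SERVER: "http://localhost:9000/v1",
  PI_KEEPER_SPILL_CHARS: "not-a-number",
  PI_KEEPER_MULTIPLEX: "1",
});
```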
pi-keeper/
├── index.ts wires up Pi's hooks, the tools, and the /keeper command
├── server.ts talks to the custom llama.cpp server (and degrades gracefully when it's absent)
├── state.ts the on-disk notes + spilled outputs
└── README.md
MIT