nonml/pi-keeper

pi-keeper

A Pi extension that tries to keep the model's short-term memory from filling up during long sessions.

Every model has a context window — a fixed amount of text it can "see" at once (the system prompt, your messages, and every tool output so far). On a long task that window fills with big, mostly-useless tool dumps. When it gets too full, Pi has to summarize and drop old turns (compaction), and the local server often has to re-read the whole conversation from scratch (reprocessing / prefill) — which is slow.

pi-keeper's goal is to delay both of those by keeping the window lean. Whether it actually helps depends a lot on your setup — see Does this actually help? below, which is the honest version.


What it does

| What | Needs the custom server? | Plain description |
|---|---|---|
| Spill big outputs | no | A tool output bigger than ~8000 characters is written to a file on disk and replaced in the chat with a one-line pointer + a short preview. The model can read the full thing back later with keeper_recall. This is the part that most reliably shrinks the window. |
| Durable notes | no | A plain file (AGENTS.md) is pasted into the system prompt every turn, so notes survive compaction. Today you fill that file yourself; there is no automatic "remember this" tool yet. |
| Off-context reading | optional | keeper_read / keeper_debug can do a heavy read or a reasoning pass in a separate server session, so the bulky text never lands in your main chat. Without the server they just fall back to a normal read / inline note. |
| Cache reuse on rewind | optional | When you rewind the conversation (/keeper rollback), it asks the server to reuse its saved progress instead of re-reading everything. |
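
The spill-and-recall idea can be sketched in a few lines. This is an illustration only, not pi-keeper's actual code: the in-memory `spillStore` stands in for the files on disk, the `spill-N` ref format and the 200-character preview are made up for the example, and `recall` is assumed to slice by characters (the README does not say what units `start` and `n` use).

```typescript
// Hypothetical in-memory stand-in for the on-disk spill files.
const spillStore: Record<string, string> = {};
let nextRef = 1;

const SPILL_CHARS = 8000;  // threshold described in the table above
const PREVIEW_CHARS = 200; // assumed preview length, not taken from pi-keeper

/** Returns the text that should appear in the chat for a tool output. */
function spillIfLarge(output: string, limit: number = SPILL_CHARS): string {
  if (output.length <= limit) return output; // small outputs pass through unchanged
  const ref = `spill-${nextRef++}`;
  spillStore[ref] = output; // the real extension writes a file instead
  const preview = output.slice(0, PREVIEW_CHARS);
  return `[spilled ${output.length} chars as ${ref}; read more with keeper_recall]\n${preview}`;
}

/** Reads n characters of a spilled output, starting at offset start. */
function recall(ref: string, start: number = 0, n: number = 2000): string {
  const full: string | undefined = spillStore[ref];
  if (full === undefined) throw new Error(`unknown ref: ${ref}`);
  return full.slice(start, start + n);
}
```

The point of the design is that the chat only ever holds the pointer line plus the preview, while the full text stays retrievable in slices.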

The slash command is /keeper and the tools are keeper_read, keeper_debug, keeper_recall (details below).


Does this actually help?

Honest answer: it depends, and you should measure it rather than trust the marketing.

  • Spilling big outputs — yes, almost always. Replacing a 50 KB tool dump with a one-line pointer straightforwardly removes tokens from the window. Worst case the model has to call keeper_recall to get a slice back — one extra round-trip — but the window genuinely stays smaller and you hit compaction later.

  • Off-context reading — best with a spare slot, but it now works on a single slot too. On a single slot (--parallel 1), side work shares the slot with your main chat; done naively, that would wipe the slot's saved progress and force the next turn to re-read the whole conversation. pi-keeper avoids this in one of two ways. If your llama-server build exposes the RAM state stash (save-ram/restore-ram), pi-keeper snapshots the main conversation to host RAM, runs the side session on the slot, and then restores the snapshot: no re-read, and it works even on recurrent / hybrid models (where the prompt cache is disabled). pi-keeper detects this automatically. Without the stash, single-slot side work quietly falls back to a normal inline read, so it never backfires. To get the off-context benefit in that case, run with --parallel 2 for a dedicated slot, or, if you're sure your model keeps its prompt cache, force multiplexing with PI_KEEPER_MULTIPLEX=1.

  • Durable notes — only if something writes to AGENTS.md. Right now nothing fills it automatically, so out of the box this tier does nothing until you put notes in the file.
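
The off-context decision described above can be summarized as a small function. This is a sketch under assumptions, not pi-keeper's implementation: the type and function names are hypothetical, and the precedence between the RAM stash and forced multiplexing is a guess (the text does not say which wins when both are available).

```typescript
// Hypothetical names; only the four outcomes come from the description above.
type SideMode = "dedicated-slot" | "ram-stash" | "multiplex" | "inline-fallback";

interface ServerInfo {
  parallelSlots: number; // the value passed to --parallel
  hasRamStash: boolean;  // build exposes save-ram / restore-ram
}

function chooseSideMode(server: ServerInfo, forceMultiplex: boolean): SideMode {
  // A spare slot is the best case: side work never touches the main slot.
  if (server.parallelSlots >= 2) return "dedicated-slot";
  // Single slot with the stash: snapshot to RAM, run side work, restore.
  if (server.hasRamStash) return "ram-stash";
  // Single slot, no stash: share the slot only if the user opted in (assumed order).
  if (forceMultiplex) return "multiplex";
  // Otherwise do a normal inline read so the main cache is never wiped.
  return "inline-fallback";
}
```

For example, `chooseSideMode({ parallelSlots: 1, hasRamStash: false }, false)` yields `"inline-fallback"`, which is the "never backfires" path.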

How to measure it yourself

Run the same task twice — once normally, once with /keeper spill off, /keeper side off, /keeper pin off — and compare your llama-server log:

  • total prompt processing tokens (less re-reading = better),
  • how often compaction kicks in,
  • fill_pct in the completion responses (how full the window got).
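
Once you have read the three numbers off the log for each run, the comparison itself is simple arithmetic. The helper below is a hypothetical convenience, not part of pi-keeper; it does not parse the llama-server log (its format is not described here), so you enter the values by hand. It works whether fill_pct is reported as 0 to 1 or 0 to 100, since it only takes a difference.

```typescript
interface RunStats {
  promptTokens: number; // total prompt processing tokens for the run
  compactions: number;  // how many times compaction kicked in
  peakFillPct: number;  // highest fill_pct seen in the completion responses
}

/** Negative values mean the run with pi-keeper did better than the baseline. */
function compareRuns(withKeeper: RunStats, baseline: RunStats) {
  const changePct = (a: number, b: number): number =>
    b === 0 ? 0 : ((a - b) / b) * 100;
  return {
    promptTokensChangePct: changePct(withKeeper.promptTokens, baseline.promptTokens),
    compactionsDelta: withKeeper.compactions - baseline.compactions,
    peakFillDelta: withKeeper.peakFillPct - baseline.peakFillPct,
  };
}
```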

If the numbers don't improve, the extension isn't earning its keep on your setup — that's useful to know, and the whole reason this section exists.


Install

Use Pi's package manager:

pi install https://github.com/nonml/pi-keeper     # from GitHub (also: git:github.com/nonml/pi-keeper)
# or, working from a local checkout:
pi install ./pi-keeper                            # add -l to install for the current project only

That registers it in Pi's settings — pi list shows it, pi remove <same source> (alias pi uninstall) removes it, and pi update pulls the latest. There's no build step: Pi loads the TypeScript directly and a bare index.ts at the repo root is all it needs.

Run /keeper doctor to check status. The "no server" features (spilling, durable notes) work right away with any Pi provider.


Optional: the custom llama.cpp server (for the faster paths)

The off-context and cache-reuse features need a llama.cpp server that exposes a few extra endpoints (checkpoint / rollback / fork + a fill_pct gauge). Those live on the pi-keeper branch of the fork: https://github.com/nonml/llama.cpp.

Option A — build the fork directly (simplest)

git clone -b pi-keeper https://github.com/nonml/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON        # drop -DGGML_CUDA=ON for a CPU-only build
cmake --build build --config Release -j

Option B — add the patch to your own up-to-date llama.cpp

If you already track upstream and just want the one feature commit on top of it, cherry-pick it:

# from inside your existing (upstream) llama.cpp checkout
git remote add nonml https://github.com/nonml/llama.cpp
git fetch nonml pi-keeper
git cherry-pick nonml/pi-keeper      # the fetched branch is a remote-tracking ref, not a local branch
# if upstream has moved and it conflicts: fix the files, then `git cherry-pick --continue`
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

It's a single self-contained commit (the checkpoint/rollback/fork endpoints + fill_pct), so the cherry-pick is clean unless upstream changed the same server files.

Run it

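./build/bin/llama-server -m your-model.gguf --slot-save-path ./slots --parallel 2

  • --slot-save-path is required for any of the slot features.
  • --parallel 2 gives side sessions their own slot (see Off-context reading under Does this actually help? above). With --parallel 1, side sessions only stay off-context if the build exposes the RAM state stash or you set PI_KEEPER_MULTIPLEX=1; otherwise they fall back to an inline read.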

The endpoint details are documented in the fork at tools/server/README-checkpoint.md.


Commands

One command, /keeper, with sub-commands. The autocomplete popup shows each one and its current on/off state, so you don't have to memorize them:

  • /keeper doctor — status: server, slots, what's on/off, where notes are stored (bare /keeper just lists the commands)
  • /keeper probe — re-check what the local server supports
  • /keeper rollback [n] — rewind to the n-th most recent message you sent (default: the last one)
  • /keeper spill on|off — toggle spilling big outputs to disk
  • /keeper pin on|off — toggle asking the server to reuse its cache on rewind
  • /keeper side on|off — toggle off-context reading/reasoning (on a single-slot server without the RAM state stash, side work falls back to inline reads anyway)
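
To make the meaning of n in /keeper rollback [n] concrete, here is a minimal sketch of locating "the n-th most recent message you sent". The `Message` shape and the function name are hypothetical, and the sketch only finds the position; what pi-keeper does with the history at that point is not shown.

```typescript
interface Message {
  role: "user" | "assistant" | "tool";
  text: string;
}

/** Index of the n-th most recent user message, or -1 if there are fewer than n. */
function rollbackIndex(messages: Message[], n: number = 1): number {
  let seen = 0;
  // Walk backwards so the most recent user message is counted first.
  for (let i = messages.length - 1; i >= 0; i--) {
    if (messages[i].role === "user" && ++seen === n) return i;
  }
  return -1;
}
```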

Tools (the model can call these)

  • keeper_read(path, goal) — read a file and return only what goal needs
  • keeper_debug(question) — a focused root-cause reasoning pass over recent context
  • keeper_recall(ref, start?, n?) — read part of a previously spilled output by its ref

Settings (all optional, via environment variables)

| Variable | Default | Meaning |
|---|---|---|
| PI_KEEPER_SERVER | http://127.0.0.1:8080 | llama.cpp server address (a /v1 suffix is stripped automatically) |
| PI_KEEPER_WORKDIR | next to Pi's own session files | where spilled outputs and AGENTS.md are kept |
| PI_KEEPER_SPILL_CHARS | 8000 | spill tool outputs bigger than this many characters |
| PI_KEEPER_SIDE_SLOT | 1 | which server slot to use for side sessions (needs --parallel ≥ 2) |
| PI_KEEPER_MAIN_SLOT | 0 | the slot your main chat is pinned to |
| PI_KEEPER_MULTIPLEX | 0 | set to 1 to allow single-slot side sessions even when pi-keeper can't confirm the prompt cache is on. Leave at 0 unless you know your model keeps its cache |
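
As an illustration of how such settings resolve, the sketch below reads a subset of the variables from an environment-like object, applies the defaults from the table, and strips a trailing /v1 from the server address. It is a guess at the behaviour, not the extension's code: the `KeeperSettings` shape and `readSettings` name are hypothetical, and the handling of a trailing slash after /v1 is an assumption.

```typescript
interface KeeperSettings {
  server: string;
  spillChars: number;
  mainSlot: number;
  multiplex: boolean;
}

function readSettings(env: Record<string, string | undefined>): KeeperSettings {
  const rawServer = env["PI_KEEPER_SERVER"] || "http://127.0.0.1:8080";
  // Drop a trailing "/v1" (or "/v1/") so slot endpoints hang off the server root.
  const server = rawServer.replace(/\/v1\/?$/, "");

  // Fall back to the default when the variable is missing or not a number.
  const toInt = (value: string | undefined, fallback: number): number => {
    const parsed = parseInt(value || "", 10);
    return isNaN(parsed) ? fallback : parsed;
  };

  return {
    server,
    spillChars: toInt(env["PI_KEEPER_SPILL_CHARS"], 8000),
    mainSlot: toInt(env["PI_KEEPER_MAIN_SLOT"], 0),
    multiplex: env["PI_KEEPER_MULTIPLEX"] === "1",
  };
}
```

In a Node-based host this would typically be called as `readSettings(process.env)`; taking the environment as a parameter keeps the function easy to test.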

How it's laid out

pi-keeper/
├── index.ts    wires up Pi's hooks, the tools, and the /keeper command
├── server.ts   talks to the custom llama.cpp server (and degrades gracefully when it's absent)
├── state.ts    the on-disk notes + spilled outputs
└── README.md

License

MIT
