feat: split localpi token status rates #14
Open
osolmaz wants to merge 2 commits into
Final report: Implemented the localpi token status phase split and pushed the branch. Validation:
PR is ready for human review/merge.
Summary
Localpi was showing one token speed number for a whole turn.
That mixed prompt processing time with output generation time.
This change makes the Pi token status extension report generation speed separately, and adds a final prefill rate when usage data is available.
It uses the first streamed assistant output as the phase boundary, because that is the only boundary Pi exposes to this extension.
What Changed
The token status extension now tracks when assistant output first appears.
Before that point, it treats the turn as prefill/time-to-first-output; after that point, it reports generation speed using output tokens over generation elapsed time.
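The split described above can be sketched roughly as follows. This is a minimal illustration, not the extension's actual code: only `firstOutputAt` is named in this PR, while `TurnTiming`, `startedAt`, `generationRate`, `prefillRate`, and the choice to count cached tokens in the prefill total are assumptions made for the example.

```typescript
// Hypothetical per-turn timing state. Only `firstOutputAt` comes from the PR;
// the other names are illustrative assumptions.
interface TurnTiming {
  startedAt: number; // ms timestamp when the turn began (assumed field)
  firstOutputAt?: number; // ms timestamp of the first streamed assistant output
}

// Generation speed: output tokens divided by time elapsed since first output.
// Returns undefined while the turn is still in the prefill phase.
function generationRate(
  state: TurnTiming,
  outputTokens: number,
  now: number,
): number | undefined {
  if (state.firstOutputAt === undefined) return undefined;
  const elapsedSec = (now - state.firstOutputAt) / 1000;
  if (elapsedSec <= 0) return undefined;
  return outputTokens / elapsedSec;
}

// Prefill rate: prompt-side tokens divided by time to first output.
// Assumption: input and cached tokens are summed; the real extension may differ.
function prefillRate(
  state: TurnTiming,
  inputTokens: number,
  cachedTokens: number,
): number | undefined {
  if (state.firstOutputAt === undefined) return undefined;
  const prefillSec = (state.firstOutputAt - state.startedAt) / 1000;
  if (prefillSec <= 0) return undefined;
  return (inputTokens + cachedTokens) / prefillSec;
}
```

For example, a turn that starts at 0 ms, produces its first output at 2000 ms, and has emitted 200 output tokens by 6000 ms would report 50 tok/s for generation under this sketch, instead of the roughly 33 tok/s a whole-turn average would show.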
- Adds `firstOutputAt` to per-turn token status state.
- Changes the live `tok/s` value to `gen ... tok/s` after output begins.
- Reports a final `prefill ... tok/s` when usage input/cache data is available.
Testing
The changed extension source transpiles, the focused extension tests pass, and the TypeScript project builds.
The full local check is blocked on this machine by live local model servers being discovered during an unrelated runtime test.
- `npm run format` passed.
- `npm run typecheck` passed.
- `npm test -- tests/extensions.test.ts tests/extension-source.test.ts` passed.
- `npm run build` passed.
- `npm run check` failed only in `tests/runtime.test.ts > runtime resolution > selects profile aliases for providers with discovery disabled`; the failure shows live LM Studio/vLLM models from `127.0.0.1:1234` and `127.0.0.1:8000` being included in catalog data.
Risks
This is a display-only change in the generated Pi extension.
The main limitation is that Pi does not expose DS4's internal prompt-sync boundary here, so localpi uses first streamed output as the observable boundary.
Without `message_update` events, final generation timing falls back to whole-turn elapsed time.
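The fallback can be pictured with a small sketch. It assumes a hypothetical `TurnTiming` shape and a hypothetical `finalGenerationRate` helper; only `firstOutputAt` is taken from this PR, and the real extension may structure this differently.

```typescript
// Hypothetical per-turn timing state; `startedAt` is an assumed field name.
interface TurnTiming {
  startedAt: number; // ms timestamp when the turn began
  firstOutputAt?: number; // unset if no streamed output was ever observed
}

// Final generation rate for a finished turn. When no first-output timestamp
// was recorded, the whole turn duration is used as the denominator.
function finalGenerationRate(
  state: TurnTiming,
  outputTokens: number,
  endedAt: number,
): number | undefined {
  const generationStart = state.firstOutputAt ?? state.startedAt;
  const elapsedSec = (endedAt - generationStart) / 1000;
  if (elapsedSec <= 0) return undefined;
  return outputTokens / elapsedSec;
}
```

In the fallback case the reported number again mixes prefill and generation time, so it will understate generation speed for turns with long prompts.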