
fix(core): load blocks sealed by split-KV peers into contiguous device layouts #382

Open
xiaguan wants to merge 3 commits into master from fix/contiguous-load-split-host-blocks

xiaguan (Collaborator) commented Jul 3, 2026

Problem

build_copy_descs pairs a Contiguous device block with only segment 0 of the host block and copies the full device span from it. A block sealed by a peer with a split-KV registration (e.g. the vLLM connector's KV-first (2, num_blocks, ...) layout) stores K and V as two separate host segments, so a contiguous-layout instance loading it reads past the end of the K allocation and silently restores garbage V data.

This is the missing half of an existing asymmetry: the Split branch already falls back gracefully when the host block is contiguous (v_ptr = k_ptr + k.bytes), but the Contiguous branch had no handling for a split host block.

Fix

When the host block carries two segments, emit one copy per segment targeting each half of the device span. Blocks whose segments don't exactly span the device block are rejected with an explicit error (same "incompatible KV layouts" family as the existing slot-count guard) instead of copying misaligned bytes.
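The rule above can be sketched as follows. This is a minimal illustration, not the code in this PR: HostSegment, CopyDesc and contiguous_copies are hypothetical names, addresses are modelled as plain u64 values, the size check on the single-segment case is an assumption, and the error text only approximates the "incompatible KV layouts" family.

```rust
/// Hypothetical host-side segment: base address and length in bytes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct HostSegment {
    ptr: u64,
    bytes: u64,
}

/// Hypothetical copy descriptor: one host-to-device transfer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CopyDesc {
    src: u64,
    dst: u64,
    bytes: u64,
}

/// Build the copies for one contiguous device block of `dev_bytes` at `dev_ptr`.
fn contiguous_copies(
    dev_ptr: u64,
    dev_bytes: u64,
    host: &[HostSegment],
) -> Result<Vec<CopyDesc>, String> {
    match host {
        // Host block is contiguous as well: a single copy covers the span.
        [seg] if seg.bytes == dev_bytes => Ok(vec![CopyDesc {
            src: seg.ptr,
            dst: dev_ptr,
            bytes: seg.bytes,
        }]),
        // Host block was sealed by a split-KV peer: K lands at the start of
        // the device span, V immediately after it.
        [k, v] if k.bytes.checked_add(v.bytes) == Some(dev_bytes) => Ok(vec![
            CopyDesc { src: k.ptr, dst: dev_ptr, bytes: k.bytes },
            CopyDesc { src: v.ptr, dst: dev_ptr + k.bytes, bytes: v.bytes },
        ]),
        // Anything else would copy misaligned bytes, so refuse explicitly.
        _ => {
            let total = host.iter().fold(0u64, |acc, s| acc.saturating_add(s.bytes));
            Err(format!(
                "incompatible KV layouts: {} host segment(s) totalling {} bytes \
                 but contiguous device block is {} bytes",
                host.len(),
                total,
                dev_bytes
            ))
        }
    }
}
```

With equally sized K and V segments the second copy starts exactly at the midpoint of the device block, which matches the "each half of the device span" wording. Returning an error instead of clamping keeps a layout mismatch visible at load time rather than surfacing later as corrupted attention output.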

Also carries a one-line #[allow(clippy::too_many_arguments)] on rdma_v1::read_async_indices: newer clippy fails the pre-commit hook on master, and the signature mirrors the NIXL FFI 1:1.

Validation

End-to-end P/D disaggregation on an 8-GPU H200 node (jz node 34):

  • P = vLLM + PegaKVConnector (Qwen3-8B, --block-size 16, per-layer split-KV registration, 36 slots)
  • D = openinfer decode instance (per-layer fused [K|V] pages, contiguous registration, 36 slots)
  • 3-prompt smoke (short / cross-block / block-aligned), temperature=0: all outputs byte-identical to D's local-prefill baseline.
  • RDMA fetch path exercised: 441 blocks / 992 MiB restored in 41.8 ms (23.7 GiB/s) for a 7k-token prompt.

Without this fix the same setup restores corrupted V segments.
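As a rough sanity check on the block count and byte total reported above, the per-block size can be estimated from the model shape. The sketch below assumes the published Qwen3-8B configuration (36 layers, 8 KV heads, head dimension 128) and a 2-byte KV dtype such as bf16. None of these values are stated in this PR apart from the block size of 16 and the 36 slots, and kv_block_bytes is a hypothetical helper.

```rust
/// Bytes held by one paged KV block across all layers (K and V together).
fn kv_block_bytes(
    layers: u64,
    kv_heads: u64,
    head_dim: u64,
    block_tokens: u64,
    dtype_bytes: u64,
) -> u64 {
    // One layer's K (or V) slice for `block_tokens` tokens.
    let per_layer_k = block_tokens * kv_heads * head_dim * dtype_bytes;
    // K and V for every layer.
    2 * per_layer_k * layers
}

/// Total bytes for `blocks` blocks, using the assumed Qwen3-8B shape.
fn assumed_total_bytes(blocks: u64) -> u64 {
    // Assumed: 36 layers, 8 KV heads, head_dim 128, 16 tokens per block, 2-byte dtype.
    blocks * kv_block_bytes(36, 8, 128, 16, 2)
}
```

Under those assumptions each block is 2,359,296 bytes (2.25 MiB), so 441 blocks come to 992.25 MiB and cover 441 × 16 = 7,056 tokens, in line with the reported 992 MiB for a 7k-token prompt.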

🤖 Generated with Claude Code

xiaguan and others added 2 commits July 3, 2026 14:08
Newer clippy flags the 8-argument binding; the signature mirrors the
NIXL descriptor-index read API 1:1, so collapsing it into a struct
would only obscure the FFI contract.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…e layouts

build_copy_descs paired a Contiguous device block with only the host
block's segment 0 for the full span. A block sealed by a peer with a
split-KV registration (e.g. the vLLM connector's KV-first layout) stores
K and V as two separate host segments, so a contiguous-layout instance
loading it read past the K allocation and restored garbage V data.

Mirror the Split branch's existing contiguous-host fallback: when the
host block carries two segments, emit one copy per segment targeting
each half of the device span, and reject blocks whose segments do not
exactly span the device block instead of copying misaligned bytes.

This is what lets an openinfer decode instance (per-layer fused [K|V]
pages) restore blocks prefilled and sealed by a vLLM prefill instance
(per-layer split K/V) in P/D disaggregation.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…pping

Add regression tests for build_copy_descs' Contiguous branch when
the host block carries two segments (split-KV peer): verify two
copies are emitted with correct device addresses and sizes, and that
mismatched segment spans are rejected as incompatible KV layouts
instead of copying misaligned bytes.

Also clarifies the rejection error message: 'N segments (k+v bytes)
but contiguous device block is M bytes' reads less ambiguously than
the prior 'N x k+v bytes' phrasing.