fix(core): register RDMA-fetched blocks to MetaServer after prefetch #378

Merged
xiaguan merged 2 commits into master from fix/rdma-prefetch-register-metaserver
Jul 1, 2026
@xiaguan xiaguan commented Jul 1, 2026

Collaborator

Problem

RDMA prefetch pulled blocks into the local read cache but never re-advertised them to the MetaServer. Once the original holder evicted those blocks (unregister), the MetaServer believed nobody owned them — even though the fetcher still held valid, RDMA-servable copies in pinned RAM.

Scenario that breaks:

  1. Node A saves blocks → registered to MetaServer as owned by A
  2. Node B RDMA-READs them into local pinned RAM → not registered (the bug)
  3. Node A evicts via LRU → try_unregister → MetaServer thinks nobody owns them
  4. Node C queries MetaServer → no hit → cannot discover blocks on Node B, even though they exist and are servable
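The four steps above can be replayed against a toy ownership index. This is a minimal sketch, not PegaFlow code: the `MetaServer` struct, its `register`/`unregister`/`query` methods, and the node names are all hypothetical stand-ins, and the real MetaServer is a remote service rather than an in-process map.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for the MetaServer index: block hash -> owning nodes.
#[derive(Default)]
struct MetaServer {
    owners: HashMap<u64, HashSet<&'static str>>,
}

impl MetaServer {
    fn register(&mut self, node: &'static str, block: u64) {
        self.owners.entry(block).or_default().insert(node);
    }

    fn unregister(&mut self, node: &'static str, block: u64) {
        if let Some(set) = self.owners.get_mut(&block) {
            set.remove(node);
            if set.is_empty() {
                self.owners.remove(&block);
            }
        }
    }

    /// Returns the owners of `block`, sorted for deterministic output.
    fn query(&self, block: u64) -> Vec<&'static str> {
        let mut nodes: Vec<&'static str> = match self.owners.get(&block) {
            Some(set) => set.iter().copied().collect(),
            None => Vec::new(),
        };
        nodes.sort();
        nodes
    }
}

/// Replays the scenario; `register_after_prefetch` toggles the fix.
fn owners_seen_by_node_c(register_after_prefetch: bool) -> Vec<&'static str> {
    let mut metaserver = MetaServer::default();
    let block = 0xABCD;

    metaserver.register("A", block); // 1. Node A saves the block
    if register_after_prefetch {
        metaserver.register("B", block); // 2. Node B RDMA-reads and re-advertises
    }
    metaserver.unregister("A", block); // 3. Node A evicts via LRU
    metaserver.query(block) // 4. Node C asks who owns the block
}
```

With `register_after_prefetch` set to `false`, Node C gets an empty owner list even though Node B holds the block; with `true`, Node B is returned.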

Fix

After RDMA prefetch completes and blocks are inserted into the read cache, the fetched block hashes are registered to the MetaServer via the same fire-and-forget path (try_register_namespace) used by the normal save path.

SSD prefetch is intentionally skipped: those blocks were already registered by this node's own save path, and eviction explicitly unregisters them — the SSD case is internally consistent.
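The shape of the fix can be sketched as follows. This is an illustration under stated assumptions, not the actual diff: `Source`, `FetchedBlock`, `on_prefetch_done`, and `try_register` are hypothetical names (the PR itself uses `try_register_namespace`, whose signature is not shown here), and a standard-library channel stands in for the fire-and-forget RPC.

```rust
use std::sync::mpsc::{channel, Sender};

/// Where a prefetched block came from (hypothetical enum).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Source {
    Rdma,
    Ssd,
}

struct FetchedBlock {
    hash: u64,
    source: Source,
}

/// Hypothetical client: requests are queued and never awaited.
struct MetaServerClient {
    tx: Sender<Vec<u64>>,
}

impl MetaServerClient {
    /// Fire-and-forget: a failed send is ignored rather than failing the prefetch.
    fn try_register(&self, hashes: Vec<u64>) {
        let _ = self.tx.send(hashes);
    }
}

struct PrefetchScheduler {
    /// Optional, mirroring the PR: without a client, registration is skipped.
    metaserver_client: Option<MetaServerClient>,
    /// Stand-in for the read cache; only hashes are tracked here.
    read_cache: Vec<u64>,
}

impl PrefetchScheduler {
    fn on_prefetch_done(&mut self, blocks: &[FetchedBlock]) {
        // Insert every fetched block into the read cache first.
        self.read_cache.extend(blocks.iter().map(|b| b.hash));

        let Some(client) = self.metaserver_client.as_ref() else {
            return;
        };
        // Only RDMA-sourced blocks are re-advertised; SSD blocks were
        // already registered by this node's own save path.
        let rdma_hashes: Vec<u64> = blocks
            .iter()
            .filter(|b| b.source == Source::Rdma)
            .map(|b| b.hash)
            .collect();
        if !rdma_hashes.is_empty() {
            client.try_register(rdma_hashes);
        }
    }
}

/// Runs one prefetch completion and returns the hashes sent for registration.
fn registered_after_prefetch(blocks: &[FetchedBlock]) -> Vec<u64> {
    let (tx, rx) = channel();
    let mut scheduler = PrefetchScheduler {
        metaserver_client: Some(MetaServerClient { tx }),
        read_cache: Vec::new(),
    };
    scheduler.on_prefetch_done(blocks);
    drop(scheduler); // closes the sender so the drain below terminates
    rx.iter().flatten().collect()
}
```

Registering after the cache insert means a block is only advertised once this node can actually serve it.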

Changes

  • pegaflow-core/src/storage/prefetch.rs: PrefetchScheduler now holds an optional MetaServerClient. In poll_existing, after read_cache.batch_insert, RDMA-sourced blocks are registered to the MetaServer.
  • pegaflow-core/src/storage/mod.rs: pass metaserver_client into PrefetchScheduler::new.
  • pegaflow-server/tests/p2p_rdma.rs: added wait_for_metaserver_ownership helper and a test assertion verifying the fetching node re-registers the blocks it pulled.
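Because registration is fire-and-forget, a test cannot assert ownership immediately after the prefetch returns; it has to poll. A helper like `wait_for_metaserver_ownership` might look roughly like the sketch below. The function name `wait_for_ownership`, its signature, and the closure-based query are assumptions for illustration; the real helper presumably talks to a running MetaServer.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Polls `query_owners` until `node` appears among the owners or `timeout`
/// elapses. Returns `true` if ownership was observed in time.
fn wait_for_ownership<F>(
    mut query_owners: F,
    node: &str,
    timeout: Duration,
    interval: Duration,
) -> bool
where
    F: FnMut() -> Vec<String>,
{
    let deadline = Instant::now() + timeout;
    loop {
        if query_owners().iter().any(|owner| owner.as_str() == node) {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}
```

A test would call it with a closure that queries the MetaServer for the block's owners and then assert on the returned `bool`.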

Verification

  • cargo build -p pegaflow-core --no-default-features --features cuda-13,rdma — passes
  • cargo clippy -p pegaflow-core --no-default-features --features cuda-13,rdma — clean
  • cargo test -p pegaflow-core --no-default-features --features cuda-13,rdma -- storage::prefetch — 2/2 pass
  • All pre-commit hooks pass
  • p2p_rdma integration test compiles (requires RDMA hardware to run)

xiaguan added 2 commits July 1, 2026 13:25
RDMA prefetch pulled blocks into the local read cache but never
re-advertised them to the MetaServer. Once the original holder evicted
those blocks (unregister), the MetaServer believed nobody owned them —
even though the fetcher still held valid, RDMA-servable copies.

After RDMA prefetch completes, the fetched block hashes are now
registered to the MetaServer (fire-and-forget, same path as the normal
save). SSD prefetch is skipped because those blocks were already
registered by this node's own save path and eviction explicitly
unregisters them.

The p2p_rdma integration test now asserts that the fetching node
re-registers the blocks it pulled.
@xiaguan xiaguan merged commit ce899cf into master Jul 1, 2026
12 checks passed
@xiaguan xiaguan deleted the fix/rdma-prefetch-register-metaserver branch July 1, 2026 06:11
2 participants