
KVM Dirty log ring interface #344

Open
davidkleymann wants to merge 1 commit into rust-vmm:main from asymmetric-research:main

Conversation

@davidkleymann

@davidkleymann davidkleymann commented Sep 3, 2025

Summary of the PR

Add support for dirty page tracking via KVM's dirty ring interface. Adds a KvmDirtyLogRing structure that keeps track of the ring indices and the base pointer to the shared memory buffer. Implements iteration over dirty pages, thereby harvesting them. Implements reset_dirty_rings on VmFd to let the kernel recycle dirty ring entries after they have been processed. Adds a dirty_log_ring field to VcpuFd.

This is a draft that still needs review and improvement; I'm asking for suggestions on the following remaining weaknesses:

  1. If any vCPU is running while dirty pages are being harvested, the user of KvmDirtyLogRing needs to call reset_dirty_rings before reading the contents of the dirty pages. This is currently the user's responsibility, which allows reset_dirty_rings to be called after all dirty pages have been read by the VMM when all vCPUs are halted, but it may make for a confusing interface in other cases.

More info on the interface:
https://www.kernel.org/doc/html/latest/virt/kvm/api.html#kvm-cap-dirty-log-ring-kvm-cap-dirty-log-ring-acq-rel
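
A rough usage sketch of the intended flow, put together from the names used in this PR (`enable_dirty_log_ring`, `dirty_log_ring_iter`, `reset_dirty_rings`); these methods are not in a released kvm-ioctls yet, so the PR-specific calls are left commented out and their exact signatures may differ:

```rust
use kvm_ioctls::Kvm;

fn main() -> Result<(), kvm_ioctls::Error> {
    let kvm = Kvm::new()?;
    let vm = kvm.create_vm()?;
    // Assumed: the per-vCPU ring size is configured on the VM before any
    // vCPU is created (KVM_CAP_DIRTY_LOG_RING).
    // vm.enable_dirty_log_ring(4096)?;
    let _vcpu = vm.create_vcpu(0)?;

    // ... run the vCPU and let the guest dirty memory ...

    // Harvest dirty (slot, offset) pairs published in the vCPU's ring:
    // for (slot, offset) in _vcpu.dirty_log_ring_iter() {
    //     /* record the dirty page */
    // }

    // Ask KVM to recycle the harvested entries. If any vCPU is still running,
    // this must happen *before* reading the dirty pages' contents (see the
    // first weakness listed above).
    // vm.reset_dirty_rings()?;
    Ok(())
}
```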

Requirements

Before submitting your PR, please make sure you addressed the following
requirements:

  • All commits in this PR have Signed-Off-By trailers (with
    git commit -s), and the commit message has max 60 characters for the
    summary and max 75 characters for each description line.
  • All added/changed functionality has a corresponding unit/integration
    test.
  • All added/changed public-facing functionality has entries in the "Upcoming
    Release" section of CHANGELOG.md (if no such section exists, please create one).
  • Any newly added unsafe code is properly documented.

Member

@roypat roypat left a comment


Please squash your commits so that later commits are not just fixups for previous commits.

Comment thread kvm-ioctls/CHANGELOG.md
Comment thread kvm-ioctls/src/ioctls/mod.rs Outdated
Comment thread kvm-ioctls/src/ioctls/vcpu.rs Outdated
Comment thread kvm-ioctls/src/ioctls/vm.rs
Comment thread kvm-ioctls/src/ioctls/vm.rs
Comment thread kvm-ioctls/src/ioctls/vcpu.rs Outdated
Comment thread kvm-ioctls/src/ioctls/mod.rs
Comment thread kvm-ioctls/CHANGELOG.md Outdated
Comment thread kvm-ioctls/src/ioctls/mod.rs Outdated
Comment thread kvm-ioctls/src/ioctls/vcpu.rs Outdated
@roypat
Member

roypat commented Sep 9, 2025

Also, do we need to do anything about KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP?

@roypat
Member

roypat commented Sep 9, 2025

Send and Sync are probably not safe on KvmDirtyLogRing because accesses are not stateless, due to the state we need to keep track of in user space (next_dirty)

As for this, next_dirty doesn't imply that it cannot be Send/Sync, because the traits don't mean that multiple threads can access the struct mutably at the same time (that's always unsound in Rust's memory model), just that you can transfer ownership between threads, and that read-only references can be shared between threads. A problem would only arise if we could also Clone this structure, because then the iterator could cause data races inside the mmap'd area from different Rust threads. But as long as we only allow creating a single instance of the iterator per vCPU (using safe code, that is), there are no problems.

Now, whether Send and Sync will actually be useful is a different matter, because I'm assuming you want them so that the dirty ring can be harvested while the vCPU is still running, but with the current API in this PR, you can only get a &mut reference to the ring buffer structure, which you cannot send to another thread anyway. So in this case, you'd need some API that lets you take ownership of the KvmDirtyLogRing structure (maybe store the one that gets created at vCPU creation time in an Option, and then have a function that just wraps .take() on that Option, keeping in mind that we must never allow safe code to get two owned versions of it for the same vCPU).
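
A minimal sketch of that Option/.take() idea; `KvmDirtyLogRing` is the struct from this PR, but the wrapper type, field, and method below are hypothetical illustrations rather than the PR's final API:

```rust
/// Stand-in for the ring struct added by this PR (it owns the mmap'd buffer
/// pointer, the ring size, and the userspace `next_dirty` index).
struct KvmDirtyLogRing;

/// Hypothetical vCPU wrapper; in the real crate this would be `VcpuFd`.
struct Vcpu {
    dirty_log_ring: Option<KvmDirtyLogRing>,
}

impl Vcpu {
    /// Hands out the single owner of the ring, e.g. to a harvesting thread,
    /// while the vCPU keeps running. Because the Option is emptied on the
    /// first call, safe code can never obtain two owned rings for one vCPU.
    fn take_dirty_log_ring(&mut self) -> Option<KvmDirtyLogRing> {
        self.dirty_log_ring.take()
    }
}
```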

@roypat
Member

roypat commented Sep 9, 2025

Also sorry for the late response, we were all preparing for / traveling to KVM Forum last week!

@davidkleymann
Author

davidkleymann commented Oct 7, 2025

Also, do we need to do anything about KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP?

Turns out we do, and if it is supported, we also need to enable KVM_CAP_DIRTY_LOG_RING_ACQ_REL prior to enabling _WITH_BITMAP. Correct me if I'm wrong.
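
A hedged sketch of that ordering, assuming the KVM_CAP_DIRTY_LOG_RING_* constants are exported by kvm-bindings and that VmFd::enable_cap is usable on the target architecture (which is part of what this PR changes); whether the ACQ_REL flavour is needed depends on the host architecture and kernel:

```rust
use kvm_bindings::{
    kvm_enable_cap, KVM_CAP_DIRTY_LOG_RING_ACQ_REL, KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP,
};
use kvm_ioctls::VmFd;

fn enable_dirty_ring_with_bitmap(vm: &VmFd, ring_bytes: u64) -> Result<(), kvm_ioctls::Error> {
    // 1. Enable the dirty ring itself (the ACQ_REL flavour where required),
    //    passing the per-vCPU ring size in bytes before any vCPU is created.
    let mut ring = kvm_enable_cap::default();
    ring.cap = KVM_CAP_DIRTY_LOG_RING_ACQ_REL;
    ring.args[0] = ring_bytes;
    vm.enable_cap(&ring)?;

    // 2. Only afterwards enable the bitmap fallback, which covers dirtying
    //    that does not go through any vCPU's ring.
    let mut bitmap = kvm_enable_cap::default();
    bitmap.cap = KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP;
    vm.enable_cap(&bitmap)?;
    Ok(())
}
```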

@davidkleymann davidkleymann force-pushed the main branch 5 times, most recently from 66f0d63 to 362f0f5 Compare October 7, 2025 11:33
@davidkleymann
Author

Why is enable_cap not available in aarch64?
#[cfg(any(target_arch = "x86_64", target_arch = "s390x", target_arch = "powerpc"))]

@davidkleymann davidkleymann force-pushed the main branch 8 times, most recently from 11c85a5 to ec0da8b Compare October 7, 2025 12:10
@roypat
Member

roypat commented Oct 8, 2025

Why is enable_cap not available in aarch64? #[cfg(any(target_arch = "x86_64", target_arch = "s390x", target_arch = "powerpc"))]

According to the docs, it doesn't exist on arm: https://docs.kernel.org/virt/kvm/api.html#kvm-enable-cap . At first glance it looks like the docs are wrong there, but they're not: there are two ioctls, one for vCPUs and one for VMs, with the vCPU one being arch-specific and the VM one documented as "architectures: all". There is a generic implementation in the kernel for some capabilities (including the dirty log related ones) which is available on all architectures. So kvm-ioctls is doing it wrong (probably whoever wrote it also misread the docs). Can you change Vm::enable_cap to be available always (i.e. without any cfgs) and add a changelog entry under "Fixed"?

@davidkleymann davidkleymann force-pushed the main branch 3 times, most recently from b00ead3 to 483342a Compare October 18, 2025 20:52
@davidkleymann
Author

The enable_cap doctest uses KVM_CAP_SPLIT_IRQCHIP as an example, which is x86-only. This breaks the test on other architectures, so I disabled the enable_cap call in the test there.
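
For reference, one way to keep such a doc example compiling everywhere is to cfg-gate only the x86-specific part (a sketch; the actual doctest in this PR may differ):

```rust
use kvm_ioctls::Kvm;

fn main() -> Result<(), kvm_ioctls::Error> {
    let kvm = Kvm::new()?;
    let vm = kvm.create_vm()?;

    // Only exercise the x86-only capability when building for x86_64; on
    // other architectures the example simply skips this call.
    #[cfg(target_arch = "x86_64")]
    {
        use kvm_bindings::{kvm_enable_cap, KVM_CAP_SPLIT_IRQCHIP};
        let mut cap = kvm_enable_cap::default();
        cap.cap = KVM_CAP_SPLIT_IRQCHIP;
        // args[0] is the number of IOAPIC pins reserved for the split irqchip.
        cap.args[0] = 24;
        vm.enable_cap(&cap)?;
    }
    Ok(())
}
```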

@davidkleymann
Author

Send and Sync are probably not safe on KvmDirtyLogRing because accesses are not stateless, due to the state we need to keep track of in user space (next_dirty)

As for this, next_dirty doesn't imply that it cannot be Send/Sync, because the traits don't mean that multiple threads can access the struct mutably at the same time (that's always unsound in Rust's memory model), just that you can transfer ownership between threads, and that read-only references can be shared between threads. A problem would only arise if we could also Clone this structure, because then the iterator could cause data races inside the mmap'd area from different Rust threads. But as long as we only allow creating a single instance of the iterator per vCPU (using safe code, that is), there are no problems.

Now, whether Send and Sync will actually be useful is a different matter, because I'm assuming you want them so that the dirty ring can be harvested while the vCPU is still running, but with the current API in this PR, you can only get a &mut reference to the ring buffer structure, which you cannot send to another thread anyway. So in this case, you'd need some API that lets you take ownership of the KvmDirtyLogRing structure (maybe store the one that gets created at vCPU creation time in an Option, and then have a function that just wraps .take() on that Option, keeping in mind that we must never allow safe code to get two owned versions of it for the same vCPU).

I needed Send and Sync to move the vCPUs to separate threads, but I only implemented what the Rust compiler suggested and am still looking for a better solution for this unrelated problem.

@davidkleymann davidkleymann force-pushed the main branch 5 times, most recently from 6ccb6da to 1d09123 Compare October 22, 2025 10:14
@davidkleymann davidkleymann force-pushed the main branch 2 times, most recently from a7fcd12 to c07248c Compare December 5, 2025 12:43
Comment thread kvm-ioctls/src/ioctls/vm.rs Outdated
Comment thread kvm-ioctls/src/ioctls/vm.rs Outdated
Comment thread kvm-ioctls/src/ioctls/vm.rs Outdated
@davidkleymann davidkleymann force-pushed the main branch 4 times, most recently from f9cf849 to 85f01fe Compare December 10, 2025 14:32
Comment thread .vscode/settings.json Outdated
ShadowCurse previously approved these changes Dec 12, 2025
roypat previously approved these changes Dec 22, 2025
@roypat roypat enabled auto-merge (rebase) December 22, 2025 09:18
@davidkleymann
Author

When will this be merged?

@roypat
Member

roypat commented Jan 12, 2026

I've just triggered dependabot, which should give us a PR fixing the CI failure, and then we can merge this immediately after! Sorry for the long wait :(

@davidkleymann
Author

Thanks. No worries, just making sure it gets merged eventually :)

Contributor

@ludfjig ludfjig left a comment


lgtm (the latest main branch merge commit should maybe have been a rebase instead?)

Comment thread kvm-ioctls/src/ioctls/vcpu.rs Outdated
Implement dirty log ring interface with `enable_dirty_log_ring` and
`dirty_log_ring_iter` methods. Enable `VmFd` `enable_cap` and ioctl imports
on all architectures. Add memory fences in iterator for proper
synchronization on weak memory consistency architectures.

Signed-off-by: David Kleymann <david@asymmetric.re>
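
For context on the memory fences mentioned in the commit message: KVM publishes each ring entry (a kvm_dirty_gfn with flags/slot/offset) and userspace harvests it by setting the reset flag. Below is a sketch of that per-entry protocol based on the kernel's dirty-ring documentation; it is not the exact iterator code from this PR:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const KVM_DIRTY_GFN_F_DIRTY: u32 = 1 << 0;
const KVM_DIRTY_GFN_F_RESET: u32 = 1 << 1;

// Layout from the kernel UAPI (struct kvm_dirty_gfn).
#[repr(C)]
struct KvmDirtyGfn {
    flags: u32,
    slot: u32,
    offset: u64,
}

/// Harvests one published entry from the mmap'd ring.
///
/// Safety: `entry` must point to a live entry inside a vCPU's mmap'd dirty ring.
unsafe fn harvest_entry(entry: *mut KvmDirtyGfn) -> Option<(u32, u64)> {
    // View the flags field as an atomic so we can use acquire/release accesses.
    let flags = &*(std::ptr::addr_of!((*entry).flags) as *const AtomicU32);
    // Load-acquire pairs with KVM's store-release when it publishes the entry,
    // so `slot`/`offset` below are only read after the entry is fully written.
    let f = flags.load(Ordering::Acquire);
    if f & KVM_DIRTY_GFN_F_DIRTY == 0 || f & KVM_DIRTY_GFN_F_RESET != 0 {
        return None; // Not yet published, or already harvested.
    }
    let slot = (*entry).slot;
    let offset = (*entry).offset;
    // Store-release marks the entry as harvested; the kernel recycles it on
    // the next KVM_RESET_DIRTY_RINGS ioctl.
    flags.store(f | KVM_DIRTY_GFN_F_RESET, Ordering::Release);
    Some((slot, offset))
}
```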
auto-merge was automatically disabled February 19, 2026 12:51

Head branch was pushed to by a user without write access

@davidkleymann davidkleymann dismissed stale reviews from ShadowCurse and roypat via 1b25571 February 19, 2026 12:51
@davidkleymann
Author

Rebased and removed an unnecessary #[allow(unused)].

phip1611 added a commit to phip1611/cloud-hypervisor that referenced this pull request Mar 11, 2026
This adds the basic plumbing for the DirtyLogWorker, which will fetch the
dirty log asynchronously in the background and aggregate the effective
MemoryRangeTable of dirtied memory.

# Motivation

- Performance: Fetching the dirty log, parsing the dirty bitmap, and
  aggregating the corresponding data structures is fairly costly. I just
  ran a VM with an active working set of 5 GiB (with 4 workers) and the
  measured overhead per iteration was 10-20ms. Given that we want
  downtimes to be as small as possible, we want that overhead to be
  close to zero for the final iteration.
- Accurate dirty rate: This way, we get a more fine-grained sampling of
  the dirty rate (dirtied 4 KiB pages per second), which is an
  interesting metric for observing the current workload's memory-write
  behavior.

# Bigger Picture / Outlook to KVM's Dirty Ring Interface

The most robust and performant mechanism for Cloud Hypervisor to obtain
dirtied pages in the future is KVM's dirty ring interface [0]. This
requires [1] to be merged first in rust-vmm/kvm. Experience has shown
that bumping any of the rust-vmm crates is a major challenge, as all of
them are highly interdependent and developed in individual repositories.
So it will take some time before we can even consider starting work on
that feature in CHV.

That being said: This design improves the current situation
significantly without blocking any future refactorings or replacements
with KVM's dirty ring interface.

# Design

I actively decided against Arc<Mutex<Vm>> in the DirtyLogWorker, as
this would be very invasive, make the migration code overly complicated
(many locks and unlocks at the right times) and, lastly, be a very big
change just to call `vm.dirty_log()` in the thread. Note that the latter
is just a thin wrapper around calling `cpu_manager.dirty_log()` and
`memory_manager.dirty_log()`.

[0] https://lwn.net/Articles/833206/
[1] rust-vmm/kvm#344

On-behalf-of: SAP philipp.schuster@sap.com
Signed-off-by: Philipp Schuster <philipp.schuster@cyberus-technology.de>
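
A very rough sketch of the worker pattern that commit message describes; DirtyLogWorker and MemoryRangeTable are names from the message, and everything below is hypothetical rather than Cloud Hypervisor code:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Placeholder for the aggregated dirty ranges (MemoryRangeTable in CHV):
/// (guest-physical address, length) pairs.
#[derive(Default)]
struct DirtyRanges(Vec<(u64, u64)>);

enum WorkerMsg {
    /// Final request: send back everything aggregated so far and stop.
    Finish(mpsc::Sender<DirtyRanges>),
}

/// Spawns a background worker that periodically fetches the dirty log and
/// aggregates it, so the final migration iteration only drains what is left.
/// The worker owns its fetch closure, avoiding any Arc<Mutex<Vm>>.
fn spawn_dirty_log_worker(
    fetch_dirty_log: impl Fn() -> Vec<(u64, u64)> + Send + 'static,
) -> mpsc::Sender<WorkerMsg> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let mut aggregated = DirtyRanges::default();
        loop {
            // Fetch and aggregate in the background between control messages.
            aggregated.0.extend(fetch_dirty_log());
            match rx.recv_timeout(Duration::from_millis(100)) {
                Ok(WorkerMsg::Finish(reply)) => {
                    let _ = reply.send(std::mem::take(&mut aggregated));
                    break;
                }
                Err(mpsc::RecvTimeoutError::Timeout) => continue,
                Err(mpsc::RecvTimeoutError::Disconnected) => break,
            }
        }
    });
    tx
}
```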
@davidkleymann davidkleymann requested a review from roypat April 14, 2026 09:48