perf(aarch64): sync M2 Asahi improvements from leanMultisig downstream #13
Closed
Barnadrot wants to merge 2 commits into
Ports the iter-8 + iter-19 wins from the leanMultisig M2 Asahi experiment
(commits 22fe0f88 and b342fa36) into upstream zk-alloc.
iter 8 over-allocates the slab region by 32 MiB, rounds REGION_BASE up to a
32 MiB hugepage boundary, sets MADV_HUGEPAGE, and writes one byte per THP
page across each slab during REGION_INIT.call_once. With the alignment +
hint, each touch fault is satisfied with a 32 MiB THP synchronously, making
the THP win deterministic instead of khugepaged-async-dependent. iter 7
saw the same signal but with p=0.019; iter 8 stabilises it. On M2 Asahi
the net win is roughly -2.5% on warm prove time.
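The alignment and pre-touch arithmetic described above can be sketched as follows. This is a minimal illustration, not the zk-alloc code: `align_up`, `touch_offsets`, and the example address are hypothetical names and values, and the 32 MiB THP size is taken from the text.

```rust
/// THP size assumed from the text above (32 MiB on M2 Asahi).
const THP_SIZE: usize = 32 << 20;

/// Round `addr` up to the next multiple of `align` (a power of two).
/// Hypothetical helper; the real REGION_BASE computation may differ.
fn align_up(addr: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    (addr + align - 1) & !(align - 1)
}

/// Offsets (relative to an aligned slab start) that a pre-touch pass
/// would write one byte to: one offset per THP-sized page.
fn touch_offsets(pretouch_bytes: usize) -> impl Iterator<Item = usize> {
    (0..pretouch_bytes).step_by(THP_SIZE)
}

fn main() {
    // Hypothetical mmap result: page-aligned but not THP-aligned.
    let raw_base: usize = 0x7f00_0040_4000;
    let region_base = align_up(raw_base, THP_SIZE);

    // The aligned base is never more than THP_SIZE - 1 bytes past the
    // raw base, which is why over-allocating by one THP is sufficient.
    assert_eq!(region_base % THP_SIZE, 0);
    assert!(region_base - raw_base < THP_SIZE);

    let touches = touch_offsets(1 << 30).count();
    println!("raw {raw_base:#x} -> aligned {region_base:#x}, {touches} touches per GiB");
}
```

In a real allocator each offset would be written through a pointer into the mapped slab after the MADV_HUGEPAGE hint is applied; the sketch only shows which addresses get touched.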
iter 19 makes the pre-touch budget runtime-adaptive:
pretouch_bytes = (MemTotal / max_threads / 3).clamp(THP_SIZE, 1 GiB)
A hard-coded 1 GiB × 14 slabs = 14 GiB pre-touch overshoots the 16 GiB
target M-series Macs (eval-gate prove_loop_cand was OOM-killed twice on
the Asahi M2 box on 2026-05-11 with anon-rss ~14.3 GiB). The adaptive
formula caps total pre-touch at MemTotal/3, leaving headroom for the workload's own
~10 GiB touched footprint and the rest of the process. On a
64 GiB Hetzner box the formula tops out at the 1 GiB ceiling, preserving
iter 8's exact behaviour there.
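As a worked example of the formula, the sketch below evaluates it for three MemTotal values. It assumes `max_threads` equals the 14 slabs mentioned above and treats MemTotal as exactly 16 or 64 GiB (real machines report slightly less). The function name and the guard against a zero thread count are illustrative additions, not taken from the zk-alloc source.

```rust
const THP_SIZE: u64 = 32 << 20; // 32 MiB, per the text above
const GIB: u64 = 1 << 30;

/// Per-slab pre-touch budget: MemTotal / max_threads / 3, clamped to
/// [THP_SIZE, 1 GiB]. The `.max(1)` guard is an addition for the sketch.
fn pretouch_bytes(mem_total: u64, max_threads: u64) -> u64 {
    (mem_total / max_threads.max(1) / 3).clamp(THP_SIZE, GIB)
}

fn main() {
    let threads = 14; // assumed equal to the slab count in the text
    for mem_gib in [0u64, 16, 64] {
        let per_slab = pretouch_bytes(mem_gib * GIB, threads);
        println!(
            "MemTotal {mem_gib} GiB: {} MiB per slab, {} MiB total",
            per_slab >> 20,
            (per_slab * threads) >> 20
        );
    }
}
```

With these inputs the 16 GiB case lands at roughly 390 MiB per slab (about 5.3 GiB in total), the 64 GiB case hits the 1 GiB ceiling, and a MemTotal of 0 falls to the THP_SIZE floor.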
MemTotal is sourced via an allocation-free fallback: `syscall::total_ram_bytes()`
returns 0 from the libc fallback arm (current aarch64-Linux path in this
base; the real sysinfo-syscall implementation will live in #11's raw-syscall
arm after merge). When it returns 0 the formula falls back to THP_SIZE
per slab — conservative but safe (no OOM, but loses most of iter 8's
THP-coverage benefit until #11 + this rollup are both on main).
All changes are cfg-gated to target_arch="aarch64"; x86_64 keeps the
existing MADV_NOHUGEPAGE hint and the unmodified region layout. Local
cargo fmt / clippy / test --workspace pass on x86_64 Hetzner Zen 4.
Pairing rule: iter 8 must not ship without iter 19. iter 8 alone OOMs
16 GiB Macs (Justin's deployment target). This commit ships both.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ports leanMultisig iter-10 (commit b211697d).
With iter 8's 32 MiB-THP arena landed in the previous commit, the 4096-byte
size-routing threshold leaves sub-page allocs in System where they hit
base-page TLB entries (16 KiB on M2 Asahi). Lowering to 256 routes the
256..4095 band into the THP-backed arena, buying the hugepage TLB benefit
for that mass: the original zk-alloc profile attributed ~1.30% of cycles
to glibc helpers servicing that band on M2.
Phase-crossing safety: ~1.5 KB Injector blocks now land in the arena. The
rayon-flush feature (default-on, src/lib.rs:225) drains the rayon injector
inside end_phase() before the next begin_phase() recycles the slab,
preventing the corruption case the original 4096 default guarded against.
Sticky-System realloc still protects Vecs grown across phases.
Kept cfg-gated to target_arch="aarch64": the iter-8 rationale (hugepage TLB)
doesn't apply on x86 where the historical NOHUGEPAGE hint stands and the
4096 default is a documented phase-crossing-safety choice. Users can still
override either path via ZK_ALLOC_MIN_BYTES.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
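The size-routing decision described in this commit can be pictured with the sketch below. `Route`, `route`, and `min_arena_bytes` are hypothetical names. The inclusive comparison is inferred from the "256..4095 band" wording. How the real allocator parses ZK_ALLOC_MIN_BYTES is not shown in this PR, so the parsing here is an assumption, and it uses allocating std calls that a global allocator could not use on its own hot path.

```rust
#[derive(Debug, PartialEq)]
enum Route {
    Arena,  // THP-backed slab
    System, // fall through to the system allocator
}

/// Default threshold as described above: 256 on aarch64, 4096 elsewhere.
const DEFAULT_MIN_ARENA_BYTES: usize =
    if cfg!(target_arch = "aarch64") { 256 } else { 4096 };

/// Resolve the threshold from an optional override string (assumed to
/// be the value of ZK_ALLOC_MIN_BYTES); bad input falls back to default.
fn min_arena_bytes(env_override: Option<&str>) -> usize {
    env_override
        .and_then(|s| s.trim().parse::<usize>().ok())
        .unwrap_or(DEFAULT_MIN_ARENA_BYTES)
}

/// Requests at or above the threshold go to the arena.
fn route(size: usize, min_bytes: usize) -> Route {
    if size >= min_bytes { Route::Arena } else { Route::System }
}

fn main() {
    let env = std::env::var("ZK_ALLOC_MIN_BYTES").ok();
    let min = min_arena_bytes(env.as_deref());
    for size in [64usize, 256, 1536, 4096] {
        println!("{size:>5} B with threshold {min}: {:?}", route(size, min));
    }
}
```

Under this model a ~1.5 KB rayon Injector block moves from System to the arena once the threshold drops to 256, which is the phase-crossing case the rayon-flush drain is meant to cover.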
DD found the original numbers were incorrect. Closing this now.
perf: sync M2 Asahi improvements from leanMultisig downstream
Summary
Pulls the two kept M2 Asahi wins from
`Barnadrot/leanMultisig:zk-alloc-m2-asahi` into upstream `zk-alloc` as a
rollup of cfg-gated commits. Both ports target `target_arch = "aarch64"`
exclusively; x86_64 paths are untouched. The combined M2 effect downstream
was ~-2.5% on warm prove time with OOM-safety on 16 GiB M-series Macs;
upstream re-measurement on Apple Silicon is deferred to a follow-up paired
benchmark.
Stacked on PR #12 (and depends on #11 for the full win on aarch64-Linux)
This branch is based on `fix/assert-flat-phase-contract` (PR #12). It assumes
that PR merges first; rebase to `main` after #12 lands.
The aarch64 wins also depend on PR #11's raw-syscall arm to access an
allocation-free `sysinfo(2)` for the adaptive PRETOUCH formula. In this
rollup `syscall::total_ram_bytes()` returns 0 from the libc-fallback arm,
which makes the adaptive logic fall back to one hugepage per slab: correct
and safe (no OOM), but it loses most of iter 8's THP-coverage benefit
until #11 lands. After both #11 and this PR are on `main`, a follow-up
should add the real `sysinfo` syscall implementation to the new
aarch64-Linux raw-syscall impl block in `src/syscall.rs`.
Commits in this rollup
- 44aa0ac: iter 8 (22fe0f88) + iter 19 (b342fa36). THP arena with the adaptive pre-touch budget (clamped to [THP_SIZE, 1 GiB]), via allocation-free `syscall::total_ram_bytes()`. Gates: `cargo fmt --check` + `cargo clippy --workspace --all-targets -- -D warnings` + `cargo test --workspace` all green on Hetzner Zen 4.
- 3968120: iter 10 (b211697d). Lowers `DEFAULT_MIN_ARENA_BYTES` from 4096 to 256 (route 256..4095 size band into THP-backed arena). Gates: `cargo fmt --check` + clippy + test green on Hetzner Zen 4.
Both commits keep the entire change set behind
`#[cfg(target_arch = "aarch64")]` so x86_64 retains its existing
MADV_NOHUGEPAGE hint, its 4096 size-routing threshold, and the unmodified
`ensure_region` layout.
Memory-adaptive pretouch: why iter 8 + iter 19 must ship together
iter 8 (THP arena) on its own pre-touches a hard-coded 1 GiB per slab × 14
slabs = 14 GiB of physical commit. That overshoots the 16 GiB RAM budget on
target M-series Macs. The eval gate's `prove_loop_cand` was OOM-killed twice
on the Asahi M2 box on 2026-05-11 with anon-rss ~14.3 GiB (two oom-kill
cascades in `journalctl`).
iter 19 fixes this by sizing the per-slab pre-touch budget against MemTotal:
pretouch_bytes = (MemTotal / max_threads / 3).clamp(THP_SIZE, 1 GiB)
On the 64 GiB Hetzner box the formula tops out at the 1 GiB ceiling,
preserving iter 8's exact pre-touch profile. On the 16 GiB M2 box the
formula leaves ~10 GiB headroom for the workload's own touched footprint
and the rest of the process. Justin's deployment target includes 16 GiB
M1/M2/M3 Macs, so iter 8 is unshippable without iter 19's cap. Both ship
in commit 44aa0ac.
Test plan
- test on macos-latest, aarch64-linux cross-build, clippy, rustfmt, MSRV 1.73 check; see `.github/workflows/ci.yml`
- `cargo test --workspace` passes after every commit (verified on Hetzner CCX33 Zen 4)
- `ensure_region` is byte-identical to the tip of PR #12 ("fix: enforce flat-phase contract via assert (no depth counter)")
- pre-touch is the 1 GiB ceiling per slab, matching pre-port behaviour
- Deferred to follow-up benchmark: currently `total_ram_bytes()` returns 0 in the libc-fallback arm; once #11 ("fix(syscall): MAP_NORESERVE on aarch64 Linux via raw syscall") and this rollup are both on main and a real sysinfo-syscall impl is added to the aarch64-Linux raw-syscall arm, re-validate on Asahi M2
What was deliberately not ported
- `f5e2299b` (Tom/Emile's rayon-flush fix): already upstream. `end_phase()` already drains the rayon crossbeam-deque injector when the default-on `rayon-flush` feature is enabled (`src/lib.rs:225-244`), and `tests/test_rayon.rs` already pulls the leanMultisig regression test verbatim (its docstring cites leanMultisig commit f5e2299b explicitly).
- `zk-alloc-m2-asahi` (iters 7, 9, 11–18, 20–23): these are paired with their own Revert "..." commits on the same branch, meaning the experiment rejected them.
Audit trail
- `experiment_logs/zk-alloc/downstream-sync-2026-05-11/ports.tsv`: one row per evaluated delta with decision, commit SHA, gate state, rationale.
- `sync-from-leanmultisig-m2-2026-05-11` on the executor's local clone (not pushed). Brain pushes + opens the PR.
Drafted by experiment agent on 2026-05-11. Stop criterion: every identified delta has a decision row.
🤖 Generated with Claude Code