perf(aarch64): sync M2 Asahi improvements from leanMultisig downstream #13

Closed
Barnadrot wants to merge 2 commits into fix/assert-flat-phase-contract from sync-from-leanmultisig-m2-2026-05-11


@Barnadrot
Owner

perf: sync M2 Asahi improvements from leanMultisig downstream

Summary

Pulls the two kept M2 Asahi wins from Barnadrot/leanMultisig:zk-alloc-m2-asahi
into upstream zk-alloc as a rollup of cfg-gated commits. Both ports target
target_arch = "aarch64" exclusively — x86_64 paths are untouched. The combined
M2 effect downstream was about -2.5% on warm prove time, with OOM safety on 16 GiB
M-series Macs; upstream re-measurement on Apple Silicon is deferred to a
follow-up paired benchmark.

Stacked on PR #12 (and depends on #11 for the full win on aarch64-Linux)

This branch is based on fix/assert-flat-phase-contract (PR #12). It assumes
that PR merges first; rebase to main after #12 lands.

The aarch64 wins also depend on PR #11's raw-syscall arm to access an
allocation-free sysinfo(2) for the adaptive PRETOUCH formula. In this
rollup syscall::total_ram_bytes() returns 0 from the libc-fallback arm,
which makes the adaptive logic fall back to one hugepage per slab —
correct and safe (no OOM) but loses most of iter 8's THP-coverage benefit
until #11 lands. After both #11 and this PR are on main, a follow-up
should add the real sysinfo syscall implementation to the new aarch64-Linux
raw-syscall imp block in src/syscall.rs.

Commits in this rollup

| Commit | Source | Change | Verified |
|---|---|---|---|
| 44aa0ac | leanMultisig:zk-alloc-m2-asahi iter 8 (22fe0f88) + iter 19 (b342fa36) | aarch64: 32 MiB-aligned mmap + MADV_HUGEPAGE + adaptive PRETOUCH_BYTES (capped MemTotal/max_threads/3, clamped [THP_SIZE, 1 GiB]), via allocation-free syscall::total_ram_bytes() | cargo fmt --check + cargo clippy --workspace --all-targets -- -D warnings + cargo test --workspace all green on Hetzner Zen 4 |
| 3968120 | leanMultisig:zk-alloc-m2-asahi iter 10 (b211697d) | aarch64: lower DEFAULT_MIN_ARENA_BYTES from 4096 to 256 (route 256..4095 size band into THP-backed arena) | cargo fmt --check + clippy + test green on Hetzner Zen 4 |

Both commits keep the entire change set behind #[cfg(target_arch = "aarch64")]
so x86_64 retains its existing MADV_NOHUGEPAGE hint, its 4096 size-routing
threshold, and the unmodified ensure_region layout.

Memory-adaptive pretouch — why iter 8 + iter 19 must ship together

iter 8 (THP arena) on its own pre-touches a hard-coded 1 GiB per slab × 14
slabs = 14 GiB of physical commit. That overshoots the 16 GiB RAM budget on
target M-series Macs. The eval gate's prove_loop_cand was OOM-killed twice
on the Asahi M2 box on 2026-05-11 with anon-rss ~14.3 GiB (two oom-kill
cascades in journalctl).

iter 19 fixes this by sizing the per-slab pre-touch budget against MemTotal:

pretouch_bytes = (MemTotal / max_threads / 3).clamp(THP_SIZE, 1 GiB)
  • 16 GiB / 14 / 3 ≈ 390 MiB per slab → ~5.3 GiB total pre-touch
  • 64 GiB / 14 / 3 ≈ 1.52 GiB → capped at the 1 GiB hard ceiling

On the 64 GiB Hetzner box the formula tops out at the 1 GiB ceiling,
preserving iter 8's exact pre-touch profile. On the 16 GiB M2 box the
formula leaves ~10 GiB headroom for the workload's own touched footprint
and the rest of the process. Justin's deployment target includes 16 GiB
M1/M2/M3 Macs, so iter 8 is unshippable without iter 19's cap. Both ship
in commit 44aa0ac.
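
A minimal sketch of the formula above, for checking the arithmetic. The 32 MiB THP_SIZE and the 1 GiB ceiling are taken from this PR; the function signature, the u64 types, and the zero-thread guard are assumptions, and the real code reads MemTotal through syscall::total_ram_bytes() rather than taking it as a parameter.

```rust
const MIB: u64 = 1 << 20;
const GIB: u64 = 1 << 30;

/// THP size on the aarch64 target described in this PR (32 MiB).
const THP_SIZE: u64 = 32 * MIB;
/// Hard per-slab ceiling that matches iter 8's original pre-touch.
const PRETOUCH_CEILING: u64 = GIB;

/// Per-slab pre-touch budget in bytes.
/// A `mem_total` of 0 (the libc-fallback case) clamps up to THP_SIZE,
/// i.e. one hugepage per slab.
fn pretouch_bytes(mem_total: u64, max_threads: u64) -> u64 {
    if max_threads == 0 {
        // Guard against division by zero; an assumption, not from the PR.
        return THP_SIZE;
    }
    (mem_total / max_threads / 3).clamp(THP_SIZE, PRETOUCH_CEILING)
}
```

With 14 threads, `pretouch_bytes(16 * GIB, 14)` comes out at roughly 390 MiB, `pretouch_bytes(64 * GIB, 14)` hits the 1 GiB ceiling, and `pretouch_bytes(0, 14)` degrades to a single 32 MiB hugepage per slab.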

Test plan

  • CI green on this PR (x86_64-linux build + test, aarch64-darwin build +
    test on macos-latest, aarch64-linux cross-build, clippy, rustfmt, MSRV 1.73
    check — see .github/workflows/ci.yml)
  • Local cargo test --workspace passes after every commit (verified
    on Hetzner CCX33 Zen 4)
  • No regression on x86_64 (every change is cfg-gated to aarch64; x86_64's
    ensure_region is byte-identical to the tip of PR #12, "fix: enforce
    flat-phase contract via assert (no depth counter)")
  • Adaptive PRETOUCH formula validation on Hetzner (64 GiB): expected
    pre-touch is the 1 GiB ceiling per slab, matching pre-port behaviour.
    Deferred to a follow-up benchmark, because total_ram_bytes() currently
    returns 0 in the libc-fallback arm. Once PR #11 ("fix(syscall):
    MAP_NORESERVE on aarch64 Linux via raw syscall") and this rollup are both
    on main and a real sysinfo-syscall impl is added to the aarch64-Linux
    raw-syscall arm, re-validate on Asahi M2.

What was deliberately not ported

  • leanMultisig f5e2299b (Tom/Emile's rayon-flush fix) — already upstream.
    end_phase() already drains the rayon crossbeam-deque injector when the
    default-on rayon-flush feature is enabled (src/lib.rs:225-244), and
    tests/test_rayon.rs already pulls the leanMultisig regression test verbatim
    (its docstring cites leanMultisig commit f5e2299b explicitly).
  • All reverted iterations on zk-alloc-m2-asahi (iters 7, 9, 11–18, 20–23) —
    these are paired with their own Revert "..." commits on the same branch,
    meaning the experiment rejected them.

Audit trail

  • experiment_logs/zk-alloc/downstream-sync-2026-05-11/ports.tsv — one row per
    evaluated delta with decision, commit SHA, gate state, rationale.
  • This branch is sync-from-leanmultisig-m2-2026-05-11 on the executor's local
    clone (not pushed). Brain pushes + opens the PR.

Drafted by experiment agent on 2026-05-11. Stop criterion: every identified delta has a decision row.

🤖 Generated with Claude Code

Barnadrot and others added 2 commits May 11, 2026 16:28
Ports the iter-8 + iter-19 wins from the leanMultisig M2 Asahi experiment
(commits 22fe0f88 and b342fa36) into upstream zk-alloc.

iter 8 over-allocates the slab region by 32 MiB, rounds REGION_BASE up to a
32 MiB hugepage boundary, sets MADV_HUGEPAGE, and writes one byte per THP
page across each slab during REGION_INIT.call_once. With the alignment +
hint, each touch fault is satisfied with a 32 MiB THP synchronously, making
the THP win deterministic instead of khugepaged-async-dependent. iter 7
saw the same signal but with p=0.019; iter 8 stabilises it. On M2 Asahi
the net win is roughly -2.5% on warm prove time.
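
The address arithmetic behind the over-allocate, align, and touch steps can be sketched as follows. Only the 32 MiB hugepage size and the one-byte-per-THP-page touch pattern come from this commit message; `mapping_len`, `align_up`, and `touch_offsets` are hypothetical helper names, and the actual mmap and madvise(MADV_HUGEPAGE) calls are left out.

```rust
/// 32 MiB hugepage size, as stated in the commit message.
const THP_SIZE: usize = 32 << 20;

/// Length to request from mmap so that an aligned region of `region_bytes`
/// always fits, wherever the kernel places the mapping.
fn mapping_len(region_bytes: usize) -> usize {
    region_bytes + THP_SIZE
}

/// Round `addr` up to the next multiple of `align` (a power of two).
fn align_up(addr: usize, align: usize) -> usize {
    debug_assert!(align.is_power_of_two());
    (addr + align - 1) & !(align - 1)
}

/// Offsets, relative to the slab start, of the bytes to write during
/// pre-touch: one per THP page, limited by the pre-touch budget.
fn touch_offsets(slab_bytes: usize, pretouch_bytes: usize) -> impl Iterator<Item = usize> {
    (0..slab_bytes.min(pretouch_bytes)).step_by(THP_SIZE)
}
```

Because the mapping is one hugepage longer than the region, rounding the base up by at most THP_SIZE - 1 bytes still leaves the whole aligned region inside the mapping.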

iter 19 makes the pre-touch budget runtime-adaptive:

    pretouch_bytes = (MemTotal / max_threads / 3).clamp(THP_SIZE, 1 GiB)

A hard-coded 1 GiB × 14 slabs = 14 GiB pre-touch overshoots the 16 GiB RAM
budget of the target M-series Macs (eval-gate prove_loop_cand was OOM-killed
twice on the Asahi M2 box on 2026-05-11 with anon-rss ~14.3 GiB). The
adaptive formula caps total pre-touch at MemTotal/3, leaving about 10 GiB of
headroom on a 16 GiB machine for the workload's own touched footprint and
the rest of the process. On a 64 GiB Hetzner box the formula tops out at the
1 GiB ceiling, preserving iter 8's exact behaviour there.

MemTotal is sourced via an allocation-free fallback: `syscall::total_ram_bytes()`
returns 0 from the libc fallback arm (current aarch64-Linux path in this
base; the real sysinfo-syscall implementation will live in #11's raw-syscall
arm after merge). When it returns 0 the formula falls back to THP_SIZE
per slab — conservative but safe (no OOM, but loses most of iter 8's
THP-coverage benefit until #11 + this rollup are both on main).

All changes are cfg-gated to target_arch="aarch64"; x86_64 keeps the
existing MADV_NOHUGEPAGE hint and the unmodified region layout. Local
cargo fmt / clippy / test --workspace pass on x86_64 Hetzner Zen 4.

Pairing rule: iter 8 must not ship without iter 19. iter 8 alone OOMs
16 GiB Macs (Justin's deployment target). This commit ships both.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ports leanMultisig iter-10 (commit b211697d). With iter 8's 32 MiB-THP
arena landed in the previous commit, the 4096-byte size-routing threshold
leaves sub-page allocs in System where they hit base-page TLB entries
(16 KiB on M2 Asahi). Lowering to 256 routes the 256..4095 band into
the THP-backed arena, buying the hugepage TLB benefit for that mass —
the original zk-alloc profile attributed ~1.30% of cycles to glibc
helpers servicing that band on M2.

Phase-crossing safety: ~1.5 KB Injector blocks now land in the arena.
The rayon-flush feature (default-on, src/lib.rs:225) drains the rayon
injector inside end_phase() before the next begin_phase() recycles the
slab, preventing the corruption case the original 4096 default guarded
against. Sticky-System realloc still protects Vecs grown across phases.

Kept cfg-gated to target_arch="aarch64": the iter-8 rationale (hugepage
TLB) doesn't apply on x86 where the historical NOHUGEPAGE hint stands
and the 4096 default is a documented phase-crossing-safety choice. Users
can still override either path via ZK_ALLOC_MIN_BYTES.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Barnadrot
Owner Author

DD found the original numbers were incorrect. Closing this now

@Barnadrot Barnadrot closed this May 11, 2026