perf(aarch64): sync M2 Asahi improvements from leanMultisig downstream by Barnadrot · Pull Request #13 · Barnadrot/zk-alloc

Barnadrot · 2026-05-11T14:34:52Z

perf: sync M2 Asahi improvements from leanMultisig downstream

Summary

Pulls the two kept M2 Asahi wins from Barnadrot/leanMultisig:zk-alloc-m2-asahi
into upstream zk-alloc as a rollup of cfg-gated commits. Both ports target
target_arch = "aarch64" exclusively — x86_64 paths are untouched. The combined
M2 effect downstream was ~-2.5% on warm prove time with OOM-safety on 16 GiB
M-series Macs; upstream re-measurement on Apple Silicon is deferred to a
follow-up paired benchmark.

Stacked on PR #12 (and depends on #11 for the full win on aarch64-Linux)

This branch is based on fix/assert-flat-phase-contract (PR #12). It assumes
that PR merges first; rebase to main after #12 lands.

The aarch64 wins also depend on PR #11's raw-syscall arm to access an
allocation-free sysinfo(2) for the adaptive PRETOUCH formula. In this
rollup syscall::total_ram_bytes() returns 0 from the libc-fallback arm,
which makes the adaptive logic fall back to one hugepage per slab —
correct and safe (no OOM) but loses most of iter 8's THP-coverage benefit
until #11 lands. After both #11 and this PR are on main, a follow-up
should add the real sysinfo syscall implementation to the new aarch64-Linux
raw-syscall imp block in src/syscall.rs.

Commits in this rollup

Commit	Source	Change	Verified
`44aa0ac`	leanMultisig:zk-alloc-m2-asahi iter 8 (`22fe0f88`) + iter 19 (`b342fa36`)	aarch64: 32 MiB-aligned mmap + MADV_HUGEPAGE + adaptive PRETOUCH_BYTES (capped MemTotal/max_threads/3, clamped `[THP_SIZE, 1 GiB]`), via allocation-free `syscall::total_ram_bytes()`	`cargo fmt --check` + `cargo clippy --workspace --all-targets -- -D warnings` + `cargo test --workspace` all green on Hetzner Zen 4
`3968120`	leanMultisig:zk-alloc-m2-asahi iter 10 (`b211697d`)	aarch64: lower `DEFAULT_MIN_ARENA_BYTES` from 4096 to 256 (route 256..4095 size band into THP-backed arena)	`cargo fmt --check` + clippy + test green on Hetzner Zen 4

Both commits keep the entire change set behind #[cfg(target_arch = "aarch64")]
so x86_64 retains its existing MADV_NOHUGEPAGE hint, its 4096 size-routing
threshold, and the unmodified ensure_region layout.

Memory-adaptive pretouch — why iter 8 + iter 19 must ship together

iter 8 (THP arena) on its own pre-touches a hard-coded 1 GiB per slab × 14
slabs = 14 GiB of physical commit. That overshoots the 16 GiB RAM budget on
target M-series Macs. The eval gate's prove_loop_cand was OOM-killed twice
on the Asahi M2 box on 2026-05-11 with anon-rss ~14.3 GiB (two oom-kill
cascades in journalctl).

iter 19 fixes this by sizing the per-slab pre-touch budget against MemTotal:

pretouch_bytes = (MemTotal / max_threads / 3).clamp(THP_SIZE, 1 GiB)

16 GiB / 14 / 3 ≈ 390 MiB per slab → ~5.4 GiB total pre-touch
64 GiB / 14 / 3 ≈ 1.56 GiB → capped at the 1 GiB hard ceiling

On the 64 GiB Hetzner box the formula tops out at the 1 GiB ceiling,
preserving iter 8's exact pre-touch profile. On the 16 GiB M2 box the
formula leaves ~10 GiB headroom for the workload's own touched footprint
and the rest of the process. Justin's deployment target includes 16 GiB
M1/M2/M3 Macs, so iter 8 is unshippable without iter 19's cap. Both ship
in commit 44aa0ac.

Test plan

CI green on this PR (x86_64-linux build + test, aarch64-darwin build +
test on macos-latest, aarch64-linux cross-build, clippy, rustfmt, MSRV 1.73
check — see .github/workflows/ci.yml)
Local cargo test --workspace passes after every commit (verified
on Hetzner CCX33 Zen 4)
No regression on x86_64 (every change is cfg-gated to aarch64; x86_64's
ensure_region is byte-identical to PR fix: enforce flat-phase contract via assert (no depth counter) #12 tip)
Adaptive PRETOUCH formula validated on Hetzner (64 GiB) — expected
pre-touch is the 1 GiB ceiling per slab, matching pre-port behaviour.
Deferred to follow-up benchmark (currently total_ram_bytes() returns 0
in the libc-fallback arm; once fix(syscall): MAP_NORESERVE on aarch64 Linux via raw syscall #11 + this rollup are both on main and a
real sysinfo-syscall impl is added to the aarch64-Linux raw-syscall arm,
re-validate on Asahi M2)

What was deliberately not ported

leanMultisig f5e2299b (Tom/Emile's rayon-flush fix) — already upstream.
end_phase() already drains the rayon crossbeam-deque injector when the
default-on rayon-flush feature is enabled (src/lib.rs:225-244), and
tests/test_rayon.rs already pulls the leanMultisig regression test verbatim
(its docstring cites leanMultisig commit f5e2299b explicitly).
All reverted iterations on zk-alloc-m2-asahi (iters 7, 9, 11–18, 20–23) —
these are paired with their own Revert "..." commits on the same branch,
meaning the experiment rejected them.

Audit trail

experiment_logs/zk-alloc/downstream-sync-2026-05-11/ports.tsv — one row per
evaluated delta with decision, commit SHA, gate state, rationale.
This branch is sync-from-leanmultisig-m2-2026-05-11 on the executor's local
clone (not pushed). Brain pushes + opens the PR.

Drafted by experiment agent on 2026-05-11. Stop criterion: every identified delta has a decision row.

🤖 Generated with Claude Code

Ports the iter-8 + iter-19 wins from the leanMultisig M2 Asahi experiment (commits 22fe0f88 and b342fa36) into upstream zk-alloc. iter 8 over-allocates the slab region by 32 MiB, rounds REGION_BASE up to a 32 MiB hugepage boundary, sets MADV_HUGEPAGE, and writes one byte per THP page across each slab during REGION_INIT.call_once. With the alignment + hint, each touch fault is satisfied with a 32 MiB THP synchronously, making the THP win deterministic instead of khugepaged-async-dependent. iter 7 saw the same signal but with p=0.019; iter 8 stabilises it. On M2 Asahi the net win is roughly -2.5% on warm prove time. iter 19 makes the pre-touch budget runtime-adaptive: pretouch_bytes = (MemTotal / max_threads / 3).clamp(THP_SIZE, 1 GiB) A hard-coded 1 GiB × 14 slabs = 14 GiB pre-touch overshoots the 16 GiB target M-series Macs (eval-gate prove_loop_cand was OOM-killed twice on the Asahi M2 box on 2026-05-11 with anon-rss ~14.3 GiB). The adaptive formula caps total pre-touch at MemTotal/3, leaving the workload's own ~10 GiB touched footprint and the rest of the process headroom. On a 64 GiB Hetzner box the formula tops out at the 1 GiB ceiling, preserving iter 8's exact behaviour there. MemTotal is sourced via an allocation-free fallback: `syscall::total_ram_bytes()` returns 0 from the libc fallback arm (current aarch64-Linux path in this base; the real sysinfo-syscall implementation will live in #11's raw-syscall arm after merge). When it returns 0 the formula falls back to THP_SIZE per slab — conservative but safe (no OOM, but loses most of iter 8's THP-coverage benefit until #11 + this rollup are both on main). All changes are cfg-gated to target_arch="aarch64"; x86_64 keeps the existing MADV_NOHUGEPAGE hint and the unmodified region layout. Local cargo fmt / clippy / test --workspace pass on x86_64 Hetzner Zen 4. Pairing rule: iter 8 must not ship without iter 19. iter 8 alone OOMs 16 GiB Macs (Justin's deployment target). This commit ships both. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Ports leanMultisig iter-10 (commit b211697d). With iter 8's 32 MiB-THP arena landed in the previous commit, the 4096-byte size-routing threshold leaves sub-page allocs in System where they hit base-page TLB entries (16 KiB on M2 Asahi). Lowering to 256 routes the 256..4095 band into the THP-backed arena, buying the hugepage TLB benefit for that mass — the original zk-alloc profile attributed ~1.30% of cycles to glibc helpers servicing that band on M2. Phase-crossing safety: ~1.5 KB Injector blocks now land in the arena. The rayon-flush feature (default-on, src/lib.rs:225) drains the rayon injector inside end_phase() before the next begin_phase() recycles the slab, preventing the corruption case the original 4096 default guarded against. Sticky-System realloc still protects Vecs grown across phases. Kept cfg-gated to target_arch="aarch64": the iter-8 rationale (hugepage TLB) doesn't apply on x86 where the historical NOHUGEPAGE hint stands and the 4096 default is a documented phase-crossing-safety choice. Users can still override either path via ZK_ALLOC_MIN_BYTES. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Barnadrot · 2026-05-11T17:04:49Z

DD found the original numbers were incorrect. Closing this now

Barnadrot and others added 2 commits May 11, 2026 16:28

Barnadrot closed this May 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(aarch64): sync M2 Asahi improvements from leanMultisig downstream#13

perf(aarch64): sync M2 Asahi improvements from leanMultisig downstream#13
Barnadrot wants to merge 2 commits into
fix/assert-flat-phase-contractfrom
sync-from-leanmultisig-m2-2026-05-11

Barnadrot commented May 11, 2026

Uh oh!

Barnadrot commented May 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Barnadrot commented May 11, 2026

perf: sync M2 Asahi improvements from leanMultisig downstream

Summary

Stacked on PR #12 (and depends on #11 for the full win on aarch64-Linux)

Commits in this rollup

Memory-adaptive pretouch — why iter 8 + iter 19 must ship together

Test plan

What was deliberately not ported

Audit trail

Uh oh!

Barnadrot commented May 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant