perf: sparse gather for LogUp aux trace when multiplicity is selector…#565
perf: sparse gather for LogUp aux trace when multiplicity is selector…#565diegokingston wants to merge 8 commits into
Conversation
…-gated Many bus interactions are gated by mutually-exclusive selector columns (e.g., AND/OR/XOR op flags, ADD/SUB/MUL op flags): when the selector is zero, the row's contribution to the term column is zero regardless of the fingerprint. We now detect Column/Sum/Sum3 multiplicities and, when fewer than 7/8 of a chunk's rows are active, compute fingerprints only at the active rows before batch-inverting. Inactive rows stay at zero from the initial `vec![zero; trace_len]` allocation. Soundness: `sparse_active_rows` marks a row as active whenever any contributing column is non-zero, which is an over-approximation of `m != 0` (cancellation can only cause us to process more rows, never fewer). Dense multiplicities (One, Negated, Diff, Linear) fall back to the unchanged contiguous path. Applies to both single (`compute_logup_term_column`) and paired (`compute_logup_batched_term_column`) term-column builders. Verified via all 71 stark bus soundness tests and 265 prover lib tests.
|
/bench 3 |
Benchmark — fib_iterative_8M (median of 3)Table parallelism: auto (cores / 3)
Commit: 7047d4d · Baseline: cached · Runner: self-hosted bench |
Codex Code ReviewFindings
No security vulnerabilities or correctness bugs found in the PR diff. I attempted |
| let active: Vec<usize> = (0..chunk_len) | ||
| .filter(|&i| c[chunk_start + i] != zero) | ||
| .collect(); | ||
| collect_if_sparse(active) |
There was a problem hiding this comment.
Medium – Wasted allocation on the dense fallback path
The full active: Vec<usize> is built (up to chunk_len elements) before the threshold check. When more than 7/8 of the rows are active (the dense case), the Vec is immediately discarded by collect_if_sparse returning None. This allocation+drop happens on every chunk for every interaction that turns out to be dense — exactly the hot path you're trying to keep fast.
A cheaper approach: count active rows first, bail early, then collect only when it's known to be worthwhile:
Multiplicity::Column(col) => {
let c = &main_segment_cols[*col];
let count = (0..chunk_len).filter(|&i| c[chunk_start + i] != zero).count();
if count * SPARSE_DEN > chunk_len * SPARSE_NUM {
return None;
}
let active: Vec<usize> = (0..chunk_len)
.filter(|&i| c[chunk_start + i] != zero)
.collect();
Some(active)
}(Same pattern for Sum and Sum3.) The double pass over chunk_len elements is cheap compared to a heap allocation + copy, and chunk_len = 1024 fits in L1.
| let indices_a: Vec<usize> = match &active_a { | ||
| Some(v) => v.clone(), | ||
| None => (0..chunk_len).collect(), | ||
| }; | ||
| let indices_b: Vec<usize> = match &active_b { | ||
| Some(v) => v.clone(), | ||
| None => (0..chunk_len).collect(), | ||
| }; |
There was a problem hiding this comment.
Medium – Unnecessary Vec clone
active_a and active_b are owned Option<Vec<usize>> values that are never used again after this point. Matching on &active_a forces a .clone() in the Some branch. Use unwrap_or_else to consume them directly:
| let indices_a: Vec<usize> = match &active_a { | |
| Some(v) => v.clone(), | |
| None => (0..chunk_len).collect(), | |
| }; | |
| let indices_b: Vec<usize> = match &active_b { | |
| Some(v) => v.clone(), | |
| None => (0..chunk_len).collect(), | |
| }; | |
| let indices_a = active_a.unwrap_or_else(|| (0..chunk_len).collect()); | |
| let indices_b = active_b.unwrap_or_else(|| (0..chunk_len).collect()); |
This avoids a heap allocation + copy for every chunk in the half-sparse case (one sparse, one dense interaction).
| /// is non-zero — which is an over-approximation of `m != 0` (we don't miss any | ||
| /// non-zero rows, we just might process some m=0 rows due to cancellation). | ||
| /// | ||
| /// The threshold check: if more than `SPARSE_ACTIVE_FRAC` of the chunk is active, |
There was a problem hiding this comment.
Low – Doc refers to a non-existent constant name
The comment says SPARSE_ACTIVE_FRAC but no such constant exists; the actual constants are SPARSE_NUM and SPARSE_DEN defined inside the function body. Update to match:
| /// The threshold check: if more than `SPARSE_ACTIVE_FRAC` of the chunk is active, | |
| /// The threshold check: if more than `SPARSE_NUM / SPARSE_DEN` (7/8) of the chunk is active, |
Review of perf/sparse-logup-aux-traceOverview: Adds a sparse fast-path to the LogUp auxiliary trace builder. When a multiplicity is structurally identifiable as sparse (Column/Sum/Sum3 gated by selectors with <7/8 of rows active), it skips fingerprint computation for zero-multiplicity rows and batch-inverts only the active subset. Soundness: The over-approximation invariant holds — sparse_active_rows always includes every row where m≠0, so no non-zero contribution is silently dropped. The dense fallback for One/Negated/Diff/Linear is correct. Correctness: The scatter-then-add pattern in the half-sparse batched path relies on result_chunk starting at zero, which is guaranteed by the Issues found (inline comments posted):
No security or correctness bugs found. The soundness argument is well-reasoned and the existing test suite (71 bus + 265 prover tests) covers the behavioral contract. One design note: the half-sparse case (one dense, one sparse interaction) still falls through to the "sparse path" which allocates a full |
…debug_assertions - `cargo fmt` on sparse LogUp changes - Polynomial import in constraints/evaluator.rs is only used inside `#[cfg(all(debug_assertions, not(feature = "parallel")))]`; match the import cfg to the use site so release builds without `parallel` don't warn. (Pre-existing on origin/main but surfaced by CI lint here.)
|
/bench |
|
/bench 5 |
Codex Code ReviewFindings
No security vulnerabilities or correctness bugs found in the changed logic. I attempted to run |
| let indices_a: Vec<usize> = match &active_a { | ||
| Some(v) => v.clone(), | ||
| None => (0..chunk_len).collect(), | ||
| }; | ||
| let indices_b: Vec<usize> = match &active_b { | ||
| Some(v) => v.clone(), | ||
| None => (0..chunk_len).collect(), | ||
| }; |
There was a problem hiding this comment.
Low: unnecessary clone + silent perf regression in the half-sparse case
active_a/active_b are not used after these bindings, so the .clone() is avoidable. More importantly, when exactly one side is None (e.g. A is dense, B is sparse), this path allocates a full (0..chunk_len).collect() and then processes the dense interaction with indirect indexing — which is strictly worse than the dedicated dense branch above that uses sequential access. The dense-path early-return only triggers when both are None; the half-sparse case silently regresses.
Suggested fixes:
- Consume the
Optionto avoid the clone:active_a.unwrap_or_else(|| (0..chunk_len).collect()). - Add dedicated half-sparse branches (
active_a.is_some() && active_b.is_none()and vice versa) that only scatter the sparse side, similar to howcompute_logup_term_columnhandles its sparse path. The dense side can use the same sequential loop as the dense-path branch.
Review: perf/sparse-logup-aux-traceThe optimization is sound. The over-approximation argument in the PR description is correct: marking a row active when any contributing column is non-zero guarantees every row with non-zero multiplicity is processed; field cancellation (e.g. Low: half-sparse inefficiency in No correctness or security issues found. |
…-gated
Many bus interactions are gated by mutually-exclusive selector columns (e.g., AND/OR/XOR op flags, ADD/SUB/MUL op flags): when the selector is zero, the row's contribution to the term column is zero regardless of the fingerprint. We now detect Column/Sum/Sum3 multiplicities and, when fewer than 7/8 of a chunk's rows are active, compute fingerprints only at the active rows before batch-inverting. Inactive rows stay at zero from the initial
vec![zero; trace_len]allocation.Soundness:
sparse_active_rowsmarks a row as active whenever any contributing column is non-zero, which is an over-approximation ofm != 0(cancellation can only cause us to process more rows, never fewer). Dense multiplicities (One, Negated, Diff, Linear) fall back to the unchanged contiguous path.Applies to both single (
compute_logup_term_column) and paired (compute_logup_batched_term_column) term-column builders. Verified via all 71 stark bus soundness tests and 265 prover lib tests.