Segmented sieve for primes, add eachprime iterator.#120
Conversation
10613fb to
58ad969
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #120 +/- ##
==========================================
- Coverage 94.38% 91.96% -2.42%
==========================================
Files 2 3 +1
Lines 463 585 +122
==========================================
+ Hits 437 538 +101
- Misses 26 47 +21 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
primes, add eachprime iterator.
Replace the prime sieve with a segmented Sieve of Eratosthenes built on a mod-30 wheel with one bit-lane per residue coprime to 30 (1,7,11,13,17,19, 23,29), rather than 8 bits packed per 30-block. Within a lane the multiples of a prime form a constant-stride-p progression, which makes cross-off uniform and lets output read the lanes word-parallel. Core (src/sieve.jl): - SegmentedSieve is an iterator of windows: each step fills the 8 lanes for the next window and yields the sieve itself. Boundary bits (< lo, > hi, and final-chunk padding) are masked, so consumers need no per-prime range checks. Window size scales with isqrt(hi) -- floored at an L2-sized window (best for the common emit path, where output reads all 8 lanes right after they are written) and grown for large hi to amortize the per-window setup, capped at ~8 MiB of scratch. - Cross-off is size-branched. Primes < 256 are removed by ANDing each lane with a precomputed periodic stride-p "comb" (this subsumes presieving); larger primes use a scalar stride bit-clear. Lanes are indexed in absolute block space, so a comb mask is a fixed function of (abs_chunk % p) and the comb tables are built once at module load. - The large-prime stride is division-free in the hot loop. A prime's start block and its p^2 floor come from per-residue-class tables keyed on p mod 30 and scaled by p/30, leaving only the unavoidable `mod p`. Only (p/30, residue class) is stored per prime -- p and p^2 reconstruct from those -- and the stride primes are enumerated once via eachprime, so the full base-prime vector is never materialized. - Output byte-transposes the 8 lane words into packed block-major words (a delta-swap network plus an 8x8 bit transpose) and scans them with trailing_zeros, emitting primes in ascending order. - Overflow-safe to typemax(Int): the large-prime math works in block units (p^2 / 30) and compares p against isqrt(win_max) rather than forming p^2, and boundary masks are phrased as residue comparisons. primes, primesmask, and eachprime are all routed through the core; eachprime is a new lazy iterator that buffers one window of primes at a time. 2, 3, and 5 are re-inserted by the wrappers (the wheel skips them). Counting is roughly 2x faster than the previous packed sieve, and prime-list generation is faster as well, with the largest gains at high limits where the division-free large-prime stride dominates. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
af8d992 to
00c7910
Compare
|
@s-celles and/or @haampie can you take a look at this? I think it's in pretty good shape. For some basic benchmarks vs master (PR on top): TLDR is we're between 50% faster for the worst cases to 10x faster for the best cases. |
|
If it were just me eyeballing this, I'd say LGTM — the segmented design is a solid improvement, But I ran it past my coding agent (Claude Code Fable 5) and it found a correctness bug worth fixing before merge. The bugFor a band of inputs, the sieve both returns composites as prime and misses real primes: length(primes(2, 1_100_000_000)) # 61_663_353, should be ~55.7M
# checking primes(2, 1.1e9) against isprime on just [7.9e6, 8.1e6]:
# 6_799 false positives, 5_637 real primes missingIt also propagates to very large ranges through the recursive base-prime enumeration ( primes(2^61, 2^61 + 3000) # returns 25 composites as "prime"so the "overflow-safe to Root cause_seg_blocks(hi) = clamp(isqrt(hi) << 3, 1 << 18, 1 << 24)Windows advance by first_chunk = win_block ÷ 64which assumes Two things conspire to hide it:
FixSince function SegmentedSieve(lo::Int, hi::Int; seg_blocks::Int = _seg_blocks(hi))
# win_block advances by seg_blocks and must stay chunk-aligned:
# comb_clear! derives its phase from win_block ÷ 64.
seg_blocks = max(64, seg_blocks & ~63)
...Rounding down is safe: I verified aligned windows of any size (64, 128, 265280) give exact results, and Suggested regression testsThe mechanism is cheap to pin down — deliberately unaligned window sizes over a small, brute-forceable range spanning many windows (fails on the current branch, passes with the fix): @testset "unaligned seg_blocks are sanitized" begin
ref = [n for n in 7:200_000 if isprime(n)]
for sb in (100, 137, 1000, 4095) # not multiples of 64
out = Int[]
for s in Primes.SegmentedSieve(7, 200_000; seg_blocks = sb)
Primes.each_lane_prime(s) do p
push!(out, p)
end
end
@test out == ref
end
endPlus a count-only cross-check of the default path over multiple windows above 2³⁰ (self-validating against a known-aligned window size, no external π constant needed): @testset "default path crosses windows correctly" begin
countprimes(lo, hi, sb) = begin
c = 0
for s in Primes.SegmentedSieve(lo, hi; seg_blocks = sb)
Primes.each_lane_prime(s) do _; c += 1; end
end
c
end
for hi in (1_100_000_000, 1_500_000_000) # isqrt(hi) << 3 is not a multiple of 64
@test countprimes(7, hi, Primes._seg_blocks(hi)) == countprimes(7, hi, 1 << 18)
end
end |
The window advance (win_block + seg_blocks) only preserved chunk alignment when seg_blocks was a multiple of 64, but _seg_blocks could return any isqrt(hi) << 3 between the clamp bounds; a misaligned window start shifted the comb phase (win_block ÷ 64) and silently left composites set from the second window on. Rather than clamping seg_blocks to a multiple of 64, make misalignment unrepresentable: store the window stride as seg_chunks (64-block chunks), iterate by win_chunk (deriving win_block = win_chunk << 6 once per window), and store nchunks instead of nblocks — its only consumer, each_lane_prime, works in chunks anyway since padding bits are already masked. Any user-supplied seg_chunks is now valid. Add a regression test sieving multi-window with odd seg_chunks. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Good catch! I think I've found a better solution than validating where we instead expose the interface as number of chunks (this makes it so all values are allowable) |
Before
_primesmask(2^30)took 2.726 seconds after it took2.358s. Although this is a relatively small improvement overall, it removes ~100% of the time for the small primes (250ms vs 30ms) which means that this will continue to show large gains once we optimize the larger primes.I've also separated sieving into it's own file since I expect the code will become more complex as we move to better sieves.
@haampie since this is essentially part 1 of #87.