Skip to content

Segmented sieve for primes, add eachprime iterator.#120

Open
oscardssmith wants to merge 3 commits into
mainfrom
oscardssmith-faster-sieve-part-1
Open

Segmented sieve for primes, add eachprime iterator.#120
oscardssmith wants to merge 3 commits into
mainfrom
oscardssmith-faster-sieve-part-1

Conversation

@oscardssmith

Copy link
Copy Markdown
Member

Before _primesmask(2^30) took 2.726 seconds after it took 2.358s. Although this is a relatively small improvement overall, it removes ~100% of the time for the small primes (250ms vs 30ms) which means that this will continue to show large gains once we optimize the larger primes.

I've also separated sieving into it's own file since I expect the code will become more complex as we move to better sieves.
@haampie since this is essentially part 1 of #87.

@oscardssmith oscardssmith changed the title Oscardssmith faster sieve part 1 faster prime sieve part 1 Jun 13, 2022
@oscardssmith oscardssmith reopened this Jul 3, 2023
@oscardssmith oscardssmith force-pushed the oscardssmith-faster-sieve-part-1 branch from 10613fb to 58ad969 Compare June 16, 2026 19:45
@codecov

codecov Bot commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 91.97861% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.96%. Comparing base (c5b95b9) to head (ad7147a).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/sieve.jl 91.84% 15 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #120      +/-   ##
==========================================
- Coverage   94.38%   91.96%   -2.42%     
==========================================
  Files           2        3       +1     
  Lines         463      585     +122     
==========================================
+ Hits          437      538     +101     
- Misses         26       47      +21     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@oscardssmith oscardssmith changed the title faster prime sieve part 1 Segmented sieve for primes, add eachprime iterator. Jun 21, 2026
oscardssmith and others added 2 commits June 30, 2026 14:21
Replace the prime sieve with a segmented Sieve of Eratosthenes built on a
mod-30 wheel with one bit-lane per residue coprime to 30 (1,7,11,13,17,19,
23,29), rather than 8 bits packed per 30-block. Within a lane the multiples
of a prime form a constant-stride-p progression, which makes cross-off
uniform and lets output read the lanes word-parallel.

Core (src/sieve.jl):

- SegmentedSieve is an iterator of windows: each step fills the 8 lanes for
  the next window and yields the sieve itself. Boundary bits (< lo, > hi, and
  final-chunk padding) are masked, so consumers need no per-prime range checks.
  Window size scales with isqrt(hi) -- floored at an L2-sized window (best for
  the common emit path, where output reads all 8 lanes right after they are
  written) and grown for large hi to amortize the per-window setup, capped at
  ~8 MiB of scratch.

- Cross-off is size-branched. Primes < 256 are removed by ANDing each lane with
  a precomputed periodic stride-p "comb" (this subsumes presieving); larger
  primes use a scalar stride bit-clear. Lanes are indexed in absolute block
  space, so a comb mask is a fixed function of (abs_chunk % p) and the comb
  tables are built once at module load.

- The large-prime stride is division-free in the hot loop. A prime's start
  block and its p^2 floor come from per-residue-class tables keyed on p mod 30
  and scaled by p/30, leaving only the unavoidable `mod p`. Only (p/30, residue
  class) is stored per prime -- p and p^2 reconstruct from those -- and the
  stride primes are enumerated once via eachprime, so the full base-prime
  vector is never materialized.

- Output byte-transposes the 8 lane words into packed block-major words (a
  delta-swap network plus an 8x8 bit transpose) and scans them with
  trailing_zeros, emitting primes in ascending order.

- Overflow-safe to typemax(Int): the large-prime math works in block units
  (p^2 / 30) and compares p against isqrt(win_max) rather than forming p^2,
  and boundary masks are phrased as residue comparisons.

primes, primesmask, and eachprime are all routed through the core; eachprime is
a new lazy iterator that buffers one window of primes at a time. 2, 3, and 5
are re-inserted by the wrappers (the wheel skips them).

Counting is roughly 2x faster than the previous packed sieve, and prime-list
generation is faster as well, with the largest gains at high limits where the
division-free large-prime stride dominates.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@oscardssmith oscardssmith force-pushed the oscardssmith-faster-sieve-part-1 branch from af8d992 to 00c7910 Compare June 30, 2026 21:31
@oscardssmith

Copy link
Copy Markdown
Member Author

@s-celles and/or @haampie can you take a look at this? I think it's in pretty good shape. For some basic benchmarks vs master (PR on top): TLDR is we're between 50% faster for the worst cases to 10x faster for the best cases.

julia> @time primes(10^7);
  0.002389 seconds (85 allocations: 5.351 MiB)
  0.018814 seconds (27 allocations: 7.634 MiB, 23.42% gc time)

julia> @time primes(10^9);
  0.498452 seconds (223 allocations: 389.628 MiB, 20.87% gc time)
  3.267264 seconds (7 allocations: 643.508 MiB, 1.92% gc time)

julia> @time primes(10^12, 10^12+10^9);
  0.641084 seconds (164 allocations: 695.490 MiB, 10.62% gc time)
  4.271945 seconds (9 allocations: 531.497 MiB, 1.79% gc time)

julia> @time primes(10^18, 10^18+10^9);
  6.492079 seconds (336 allocations: 1.513 GiB, 3.52% gc time)
  9.017591 seconds (9 allocations: 693.133 MiB, 1.38% gc time)

@s-celles

s-celles commented Jul 2, 2026

Copy link
Copy Markdown

If it were just me eyeballing this, I'd say LGTM — the segmented design is a solid improvement, eachprime is a welcome addition, and the speedups check out. 👍

But I ran it past my coding agent (Claude Code Fable 5) and it found a correctness bug worth fixing before merge.

The bug

For a band of inputs, the sieve both returns composites as prime and misses real primes:

length(primes(2, 1_100_000_000))   # 61_663_353, should be ~55.7M

# checking primes(2, 1.1e9) against isprime on just [7.9e6, 8.1e6]:
# 6_799 false positives, 5_637 real primes missing

It also propagates to very large ranges through the recursive base-prime enumeration (isqrt(hi) lands in the broken band):

primes(2^61, 2^61 + 3000)          # returns 25 composites as "prime"

so the "overflow-safe to typemax(Int)" claim doesn't hold yet. Results are correct below ~2³⁰ and for hi large enough that the window clamp kicks in, which is exactly why the test suite (max ~3M) doesn't catch it.

Root cause

_seg_blocks(hi) = clamp(isqrt(hi) << 3, 1 << 18, 1 << 24)

Windows advance by seg_blocks (iterate returns win_block + ss.seg_blocks), and comb_clear! derives its phase from

first_chunk = win_block ÷ 64

which assumes win_block is chunk-aligned. isqrt(hi) << 3 is generally not a multiple of 64 (e.g. hi = 1.1e9seg_blocks = 265328 ≡ 48 (mod 64)), so from the second window on, the division truncates and every small-prime comb is applied shifted by win_block % 64 bits — clearing prime bits (false negatives) and sparing composite bits (false positives). The large-prime stride path and the output transpose work in local offsets and are unaffected, which is consistent with what we observed: all direct-level false positives have a smallest factor < 256.

Two things conspire to hide it:

  • the clamp bounds (2¹⁸, 2²⁴) happen to be multiples of 64, so only the middle band 2³⁰ ≲ hi ≲ 2⁴² is affected;
  • the first window is always aligned (_first_block is ... << 6), so ranges that fit in one window are correct even in that band.

Fix

Since seg_blocks is also a user-facing kwarg, the constructor is the right chokepoint (covers the default path and explicit callers):

function SegmentedSieve(lo::Int, hi::Int; seg_blocks::Int = _seg_blocks(hi))
    # win_block advances by seg_blocks and must stay chunk-aligned:
    # comb_clear! derives its phase from win_block ÷ 64.
    seg_blocks = max(64, seg_blocks & ~63)
    ...

Rounding down is safe: I verified aligned windows of any size (64, 128, 265280) give exact results, and 265280 restores the correct count for hi = 1.1e9 (55,662,470). A one-line comment on the invariant would keep a future refactor from reintroducing it.

Suggested regression tests

The mechanism is cheap to pin down — deliberately unaligned window sizes over a small, brute-forceable range spanning many windows (fails on the current branch, passes with the fix):

@testset "unaligned seg_blocks are sanitized" begin
    ref = [n for n in 7:200_000 if isprime(n)]
    for sb in (100, 137, 1000, 4095)          # not multiples of 64
        out = Int[]
        for s in Primes.SegmentedSieve(7, 200_000; seg_blocks = sb)
            Primes.each_lane_prime(s) do p
                push!(out, p)
            end
        end
        @test out == ref
    end
end

Plus a count-only cross-check of the default path over multiple windows above 2³⁰ (self-validating against a known-aligned window size, no external π constant needed):

@testset "default path crosses windows correctly" begin
    countprimes(lo, hi, sb) = begin
        c = 0
        for s in Primes.SegmentedSieve(lo, hi; seg_blocks = sb)
            Primes.each_lane_prime(s) do _; c += 1; end
        end
        c
    end
    for hi in (1_100_000_000, 1_500_000_000)   # isqrt(hi) << 3 is not a multiple of 64
        @test countprimes(7, hi, Primes._seg_blocks(hi)) == countprimes(7, hi, 1 << 18)
    end
end

The window advance (win_block + seg_blocks) only preserved chunk
alignment when seg_blocks was a multiple of 64, but _seg_blocks could
return any isqrt(hi) << 3 between the clamp bounds; a misaligned window
start shifted the comb phase (win_block ÷ 64) and silently left
composites set from the second window on.

Rather than clamping seg_blocks to a multiple of 64, make misalignment
unrepresentable: store the window stride as seg_chunks (64-block
chunks), iterate by win_chunk (deriving win_block = win_chunk << 6 once
per window), and store nchunks instead of nblocks — its only consumer,
each_lane_prime, works in chunks anyway since padding bits are already
masked. Any user-supplied seg_chunks is now valid.

Add a regression test sieving multi-window with odd seg_chunks.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@oscardssmith

Copy link
Copy Markdown
Member Author

Good catch! I think I've found a better solution than validating where we instead expose the interface as number of chunks (this makes it so all values are allowable)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants