Segmented sieve for `primes`, add `eachprime` iterator. by oscardssmith · Pull Request #120 · JuliaMath/Primes.jl

oscardssmith · 2022-06-13T16:29:38Z

Before _primesmask(2^30) took 2.726 seconds after it took 2.358s. Although this is a relatively small improvement overall, it removes ~100% of the time for the small primes (250ms vs 30ms) which means that this will continue to show large gains once we optimize the larger primes.

I've also separated sieving into it's own file since I expect the code will become more complex as we move to better sieves.
@haampie since this is essentially part 1 of #87.

codecov · 2026-06-16T19:48:01Z

Codecov Report

❌ Patch coverage is 91.97861% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.96%. Comparing base (c5b95b9) to head (ad7147a).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
src/sieve.jl	91.84%	15 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #120      +/-   ##
==========================================
- Coverage   94.38%   91.96%   -2.42%     
==========================================
  Files           2        3       +1     
  Lines         463      585     +122     
==========================================
+ Hits          437      538     +101     
- Misses         26       47      +21

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Replace the prime sieve with a segmented Sieve of Eratosthenes built on a mod-30 wheel with one bit-lane per residue coprime to 30 (1,7,11,13,17,19, 23,29), rather than 8 bits packed per 30-block. Within a lane the multiples of a prime form a constant-stride-p progression, which makes cross-off uniform and lets output read the lanes word-parallel. Core (src/sieve.jl): - SegmentedSieve is an iterator of windows: each step fills the 8 lanes for the next window and yields the sieve itself. Boundary bits (< lo, > hi, and final-chunk padding) are masked, so consumers need no per-prime range checks. Window size scales with isqrt(hi) -- floored at an L2-sized window (best for the common emit path, where output reads all 8 lanes right after they are written) and grown for large hi to amortize the per-window setup, capped at ~8 MiB of scratch. - Cross-off is size-branched. Primes < 256 are removed by ANDing each lane with a precomputed periodic stride-p "comb" (this subsumes presieving); larger primes use a scalar stride bit-clear. Lanes are indexed in absolute block space, so a comb mask is a fixed function of (abs_chunk % p) and the comb tables are built once at module load. - The large-prime stride is division-free in the hot loop. A prime's start block and its p^2 floor come from per-residue-class tables keyed on p mod 30 and scaled by p/30, leaving only the unavoidable `mod p`. Only (p/30, residue class) is stored per prime -- p and p^2 reconstruct from those -- and the stride primes are enumerated once via eachprime, so the full base-prime vector is never materialized. - Output byte-transposes the 8 lane words into packed block-major words (a delta-swap network plus an 8x8 bit transpose) and scans them with trailing_zeros, emitting primes in ascending order. - Overflow-safe to typemax(Int): the large-prime math works in block units (p^2 / 30) and compares p against isqrt(win_max) rather than forming p^2, and boundary masks are phrased as residue comparisons. primes, primesmask, and eachprime are all routed through the core; eachprime is a new lazy iterator that buffers one window of primes at a time. 2, 3, and 5 are re-inserted by the wrappers (the wheel skips them). Counting is roughly 2x faster than the previous packed sieve, and prime-list generation is faster as well, with the largest gains at high limits where the division-free large-prime stride dominates. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

oscardssmith · 2026-06-30T21:53:39Z

@s-celles and/or @haampie can you take a look at this? I think it's in pretty good shape. For some basic benchmarks vs master (PR on top): TLDR is we're between 50% faster for the worst cases to 10x faster for the best cases.

julia> @time primes(10^7);
  0.002389 seconds (85 allocations: 5.351 MiB)
  0.018814 seconds (27 allocations: 7.634 MiB, 23.42% gc time)

julia> @time primes(10^9);
  0.498452 seconds (223 allocations: 389.628 MiB, 20.87% gc time)
  3.267264 seconds (7 allocations: 643.508 MiB, 1.92% gc time)

julia> @time primes(10^12, 10^12+10^9);
  0.641084 seconds (164 allocations: 695.490 MiB, 10.62% gc time)
  4.271945 seconds (9 allocations: 531.497 MiB, 1.79% gc time)

julia> @time primes(10^18, 10^18+10^9);
  6.492079 seconds (336 allocations: 1.513 GiB, 3.52% gc time)
  9.017591 seconds (9 allocations: 693.133 MiB, 1.38% gc time)

s-celles · 2026-07-02T06:43:28Z

If it were just me eyeballing this, I'd say LGTM — the segmented design is a solid improvement, eachprime is a welcome addition, and the speedups check out. 👍

But I ran it past my coding agent (Claude Code Fable 5) and it found a correctness bug worth fixing before merge.

The bug

For a band of inputs, the sieve both returns composites as prime and misses real primes:

length(primes(2, 1_100_000_000))   # 61_663_353, should be ~55.7M

# checking primes(2, 1.1e9) against isprime on just [7.9e6, 8.1e6]:
# 6_799 false positives, 5_637 real primes missing

It also propagates to very large ranges through the recursive base-prime enumeration (isqrt(hi) lands in the broken band):

primes(2^61, 2^61 + 3000)          # returns 25 composites as "prime"

so the "overflow-safe to typemax(Int)" claim doesn't hold yet. Results are correct below ~2³⁰ and for hi large enough that the window clamp kicks in, which is exactly why the test suite (max ~3M) doesn't catch it.

Root cause

_seg_blocks(hi) = clamp(isqrt(hi) << 3, 1 << 18, 1 << 24)

Windows advance by seg_blocks (iterate returns win_block + ss.seg_blocks), and comb_clear! derives its phase from

first_chunk = win_block ÷ 64

which assumes win_block is chunk-aligned. isqrt(hi) << 3 is generally not a multiple of 64 (e.g. hi = 1.1e9 → seg_blocks = 265328 ≡ 48 (mod 64)), so from the second window on, the division truncates and every small-prime comb is applied shifted by win_block % 64 bits — clearing prime bits (false negatives) and sparing composite bits (false positives). The large-prime stride path and the output transpose work in local offsets and are unaffected, which is consistent with what we observed: all direct-level false positives have a smallest factor < 256.

Two things conspire to hide it:

the clamp bounds (2¹⁸, 2²⁴) happen to be multiples of 64, so only the middle band 2³⁰ ≲ hi ≲ 2⁴² is affected;
the first window is always aligned (_first_block is ... << 6), so ranges that fit in one window are correct even in that band.

Fix

Since seg_blocks is also a user-facing kwarg, the constructor is the right chokepoint (covers the default path and explicit callers):

function SegmentedSieve(lo::Int, hi::Int; seg_blocks::Int = _seg_blocks(hi))
    # win_block advances by seg_blocks and must stay chunk-aligned:
    # comb_clear! derives its phase from win_block ÷ 64.
    seg_blocks = max(64, seg_blocks & ~63)
    ...

Rounding down is safe: I verified aligned windows of any size (64, 128, 265280) give exact results, and 265280 restores the correct count for hi = 1.1e9 (55,662,470). A one-line comment on the invariant would keep a future refactor from reintroducing it.

Suggested regression tests

The mechanism is cheap to pin down — deliberately unaligned window sizes over a small, brute-forceable range spanning many windows (fails on the current branch, passes with the fix):

@testset "unaligned seg_blocks are sanitized" begin
    ref = [n for n in 7:200_000 if isprime(n)]
    for sb in (100, 137, 1000, 4095)          # not multiples of 64
        out = Int[]
        for s in Primes.SegmentedSieve(7, 200_000; seg_blocks = sb)
            Primes.each_lane_prime(s) do p
                push!(out, p)
            end
        end
        @test out == ref
    end
end

Plus a count-only cross-check of the default path over multiple windows above 2³⁰ (self-validating against a known-aligned window size, no external π constant needed):

@testset "default path crosses windows correctly" begin
    countprimes(lo, hi, sb) = begin
        c = 0
        for s in Primes.SegmentedSieve(lo, hi; seg_blocks = sb)
            Primes.each_lane_prime(s) do _; c += 1; end
        end
        c
    end
    for hi in (1_100_000_000, 1_500_000_000)   # isqrt(hi) << 3 is not a multiple of 64
        @test countprimes(7, hi, Primes._seg_blocks(hi)) == countprimes(7, hi, 1 << 18)
    end
end

The window advance (win_block + seg_blocks) only preserved chunk alignment when seg_blocks was a multiple of 64, but _seg_blocks could return any isqrt(hi) << 3 between the clamp bounds; a misaligned window start shifted the comb phase (win_block ÷ 64) and silently left composites set from the second window on. Rather than clamping seg_blocks to a multiple of 64, make misalignment unrepresentable: store the window stride as seg_chunks (64-block chunks), iterate by win_chunk (deriving win_block = win_chunk << 6 once per window), and store nchunks instead of nblocks — its only consumer, each_lane_prime, works in chunks anyway since padding bits are already masked. Any user-supplied seg_chunks is now valid. Add a regression test sieving multi-window with odd seg_chunks. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

oscardssmith · 2026-07-02T14:45:43Z

Good catch! I think I've found a better solution than validating where we instead expose the interface as number of chunks (this makes it so all values are allowable)

oscardssmith changed the title ~~Oscardssmith faster sieve part 1~~ faster prime sieve part 1 Jun 13, 2022

oscardssmith mentioned this pull request Jul 3, 2023

I would like to re-use Primes.Factorization for polynomials #126

Closed

oscardssmith closed this Jul 3, 2023

oscardssmith reopened this Jul 3, 2023

oscardssmith force-pushed the oscardssmith-faster-sieve-part-1 branch from 10613fb to 58ad969 Compare June 16, 2026 19:45

oscardssmith changed the title ~~faster prime sieve part 1~~ Segmented sieve for primes, add eachprime iterator. Jun 21, 2026

oscardssmith and others added 2 commits June 30, 2026 14:21

drop compat to 1.6

00c7910

oscardssmith force-pushed the oscardssmith-faster-sieve-part-1 branch from af8d992 to 00c7910 Compare June 30, 2026 21:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Segmented sieve for `primes`, add `eachprime` iterator.#120

Segmented sieve for `primes`, add `eachprime` iterator.#120
oscardssmith wants to merge 3 commits into
mainfrom
oscardssmith-faster-sieve-part-1

oscardssmith commented Jun 13, 2022

Uh oh!

codecov Bot commented Jun 16, 2026 •

edited

Loading

Uh oh!

oscardssmith commented Jun 30, 2026

Uh oh!

s-celles commented Jul 2, 2026

Uh oh!

oscardssmith commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Uh oh!

Conversation

oscardssmith commented Jun 13, 2022

Uh oh!

codecov Bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

oscardssmith commented Jun 30, 2026

Uh oh!

s-celles commented Jul 2, 2026

The bug

Root cause

Fix

Suggested regression tests

Uh oh!

oscardssmith commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Jun 16, 2026 •

edited

Loading