Skip to content

fix(script): rebalance manifest backfill workers#2314

Merged
riderx merged 3 commits into
mainfrom
codex/parallel-manifest-backfill
May 21, 2026
Merged

fix(script): rebalance manifest backfill workers#2314
riderx merged 3 commits into
mainfrom
codex/parallel-manifest-backfill

Conversation

@riderx
Copy link
Copy Markdown
Member

@riderx riderx commented May 21, 2026

Summary (AI generated)

  • Replace static manifest ID range/window splitting with a shared manifest.id cursor used by all workers.
  • Force candidate reads to 1000 rows per batch, even when older commands pass a larger --batch-size.
  • Add a failed metadata CSV report so missing-size rows can be investigated after the run.

Motivation (AI generated)

The previous worker model could still waste workers on sparse ID regions. A shared cursor keeps every worker claiming the next 1000-row batch until no candidates remain, while preserving indexed manifest.id reads.

Business Impact (AI generated)

This should make the production backfill finish faster and produce a concrete CSV audit trail for files whose storage metadata cannot be resolved.

Test Plan (AI generated)

  • bun scripts/backfill_manifest_file_sizes.mjs --help
  • node --check scripts/backfill_manifest_file_sizes.mjs
  • bunx eslint --no-ignore scripts/backfill_manifest_file_sizes.mjs
  • git diff --check
  • bun lint

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 21, 2026

📝 Walkthrough

Walkthrough

Refactors scripts/backfill_manifest_file_sizes.mjs to claim bounded manifest.id windows instead of fixed ranges. Adds createIdWindowClaimer and --id-window-size, introduces processIdWindow for in-window pagination, updates runRangeWorker to repeatedly claim windows, and extends progress reporting with totalWindows/claimedWindows and per-worker current window fields.

Changes

Window-based manifest backfill refactoring

Layer / File(s) Summary
Window-claiming abstraction and CLI integration
scripts/backfill_manifest_file_sizes.mjs
Adds createIdWindowClaimer(minId,maxId,idWindowSize) that yields sequential {start,end,index} windows and a totalWindows count; adds --id-window-size CLI flag and parses idWindowSize with sensible defaults.
Progress report structure for window tracking
scripts/backfill_manifest_file_sizes.mjs
Extends top-level report with idWindowSize and claimedWindows; per-worker reports now include currentStartId/currentEndId, windows, and lastId; progress logs append claimedWindows/totalWindows when known.
In-window pagination (processIdWindow)
scripts/backfill_manifest_file_sizes.mjs
Introduces processIdWindow to paginate candidate rows within a claimed {start,end} window by advancing afterId, merging each page’s results, and writing progress after each page.
Worker loop and window claiming
scripts/backfill_manifest_file_sizes.mjs
Refactors runRangeWorker to repeatedly claimRange() windows, increment claimedWindows, update per-worker current window state, and delegate scanning to processIdWindow.
Worker spawning and single-worker wiring
scripts/backfill_manifest_file_sizes.mjs
--all path uses the claimer, sets report.totalWindows, initializes worker reports, and passes claimRange into workers; non---all path initializes window-related fields on the single worker report; runRangeWorker call site adjusted to accept workerIndex.
Example documentation update
scripts/backfill_manifest_file_sizes.mjs
Header CLI example updated to show higher default concurrency.

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • Cap-go/capgo#2313: Also refactors scripts/backfill_manifest_file_sizes.mjs worker partitioning and runRangeWorker-style loops.
  • Cap-go/capgo#2292: Earlier changes to the same backfill script that this PR further refactors.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: refactoring manifest backfill workers to use dynamic ID windows for better load distribution.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description is well-structured and covers the required sections: Summary, Motivation, Business Impact, and Test Plan with completed checks.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch

Comment @coderabbitai help to get the list of available commands and usage tips.

@codspeed-hq
Copy link
Copy Markdown
Contributor

codspeed-hq Bot commented May 21, 2026

Merging this PR will not alter performance

✅ 43 untouched benchmarks
⏩ 2 skipped benchmarks1


Comparing codex/parallel-manifest-backfill (fbb2420) with main (fa6727e)

Open in CodSpeed

Footnotes

  1. 2 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@riderx riderx marked this pull request as ready for review May 21, 2026 10:19
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/backfill_manifest_file_sizes.mjs`:
- Line 676: The default id window size calculation for idWindowSize (the
getNumberArg call using Math.max(batchSize * 25, 50000)) uses magic numbers 25
and 50000 without context; add a brief inline comment next to this expression
that documents why 25x batchSize is chosen (e.g., balances worker claim
frequency vs DB scan overhead) and why 50000 is the floor/threshold, referencing
the variables batchSize and the all flag so future maintainers understand the
tuning rationale.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c401b078-5a53-4d67-8aee-78ef1ca28e13

📥 Commits

Reviewing files that changed from the base of the PR and between 3492e44 and df7fdfb.

📒 Files selected for processing (1)
  • scripts/backfill_manifest_file_sizes.mjs

Comment thread scripts/backfill_manifest_file_sizes.mjs Outdated
@sonarqubecloud
Copy link
Copy Markdown

@riderx riderx merged commit 01e3575 into main May 21, 2026
42 checks passed
@riderx riderx deleted the codex/parallel-manifest-backfill branch May 21, 2026 12:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant