Skip to content

fix(script): fan out full manifest backfill batches#2318

Merged
riderx merged 3 commits into
mainfrom
codex/full-batch-manifest-backfill
May 21, 2026
Merged

fix(script): fan out full manifest backfill batches#2318
riderx merged 3 commits into
mainfrom
codex/full-batch-manifest-backfill

Conversation

@riderx
Copy link
Copy Markdown
Member

@riderx riderx commented May 21, 2026

Summary (AI generated)

  • Remove the storage metadata concurrency throttle from the manifest file-size backfill.
  • Let each worker fan out the full fixed 1000-row candidate batch at once.
  • Keep --concurrency accepted as a backward-compatible no-op so older commands still run.

Motivation (AI generated)

The backfill needs to process millions of manifest rows quickly. The DB/API candidate read remains bounded at 1000 rows, but storage metadata checks should use the full batch capacity per worker instead of a smaller extra throttle.

Business Impact (AI generated)

This increases production backfill throughput so missing manifest file sizes can be corrected faster, while preserving indexed manifest.id scans and bulk DB updates.

Test Plan (AI generated)

  • node --check scripts/backfill_manifest_file_sizes.mjs
  • bun scripts/backfill_manifest_file_sizes.mjs --help
  • bunx eslint --no-ignore scripts/backfill_manifest_file_sizes.mjs
  • git diff --check origin/main...HEAD
  • bun lint

Summary by CodeRabbit

  • Chores
    • Updated manifest file size backfill script's concurrency model to improve parallel processing efficiency of storage operations.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 21, 2026

Warning

Rate limit exceeded

@riderx has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 8 minutes and 32 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c04b2093-7f8d-4b4b-8238-108b43bf0877

📥 Commits

Reviewing files that changed from the base of the PR and between 5e3686e and 42567da.

📒 Files selected for processing (1)
  • scripts/backfill_manifest_file_sizes.mjs
📝 Walkthrough

Walkthrough

scripts/backfill_manifest_file_sizes.mjs refactors its storage concurrency model to use a fan-out-based approach derived from batch size and worker count, replacing the --concurrency option with derived storageFanoutPerWorker. The core processCandidates function now fetches S3 object sizes in parallel via Promise.all instead of limiting concurrency per worker. The obsolete mapWithConcurrency helper is removed, and all progress reporting and logs are updated to use fan-out terminology.

Changes

Storage concurrency refactor

Layer / File(s) Summary
CLI interface and concurrency model
scripts/backfill_manifest_file_sizes.mjs
Help text marks --concurrency as backward-compatible no-op; option parsing removes concurrency reading and computes storageFanoutPerWorker from batchSize and effectiveStorageFanout from worker count; usage example updated to --workers=16; per-worker storage concurrency option removed from processCandidates call.
Storage fetch refactoring
scripts/backfill_manifest_file_sizes.mjs
mapWithConcurrency helper removed; processCandidates refactored to fetch all candidate object sizes in parallel using Promise.all instead of per-worker concurrency limiting.
Report and logging updates
scripts/backfill_manifest_file_sizes.mjs
Report object fields and startup log updated to record storageFanoutPerWorker and effectiveStorageFanout instead of previous concurrency fields.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • Cap-go/capgo#2313: Refactors the same backfill_manifest_file_sizes.mjs script—reworking worker/concurrency handling and bulk backfill flow in parallel with this PR's concurrency model changes.
  • Cap-go/capgo#2292: Earlier PR that modified scripts/backfill_manifest_file_sizes.mjs; this PR extends that work by refactoring the concurrency and parallel fetch model.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: removing storage concurrency throttling to allow full batch fan-out during manifest backfill.
Description check ✅ Passed The description covers the main summary, motivation, and business impact well, with a comprehensive test plan included. However, the Screenshots and Checklist sections required by the template are missing or incomplete.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@codspeed-hq
Copy link
Copy Markdown
Contributor

codspeed-hq Bot commented May 21, 2026

Merging this PR will not alter performance

✅ 43 untouched benchmarks
⏩ 2 skipped benchmarks1


Comparing codex/full-batch-manifest-backfill (42567da) with main (a114796)

Open in CodSpeed

Footnotes

  1. 2 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@riderx riderx marked this pull request as ready for review May 21, 2026 13:06
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@scripts/backfill_manifest_file_sizes.mjs`:
- Line 654: Update the help text for the --concurrency option (the string
currently saying "Backward-compatible no-op; each worker checks its full
1000-row batch at once.") to also explain the rationale for its removal—e.g.,
note that storage requests now use full-batch parallelism for maximum throughput
and that concurrency was previously limited per-worker—so the help message
clarifies why --concurrency is a no-op.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 74782cf1-1b67-4e27-a7c4-d5051b9e01fd

📥 Commits

Reviewing files that changed from the base of the PR and between 9523f13 and 5e3686e.

📒 Files selected for processing (1)
  • scripts/backfill_manifest_file_sizes.mjs

Comment thread scripts/backfill_manifest_file_sizes.mjs Outdated
@sonarqubecloud
Copy link
Copy Markdown

Copy link
Copy Markdown

@hadessunxy-code hadessunxy-code left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the backfill change and the decoded-path fallback looks sensible, but I would be careful with the new fan-out behavior before merging.

--all defaults to 8 workers and each worker now Promise.alls a full 1000-row batch, so the default path can issue up to ~8,000 storage metadata requests at once. If R2/S3 starts returning transient 429/5xx responses, the retry loop will also sleep and retry those failed calls in large synchronized waves. That is a pretty big behavior change from the previous --concurrency cap, and it also makes the still-accepted --concurrency flag unable to protect production runs.

I would either keep a real storage concurrency limiter around the per-row getObjectSizeWithRetry calls, or make the full-batch fan-out opt-in / bounded by a new explicit high-watermark. That keeps the legacy backfill safe for production operators while still allowing the faster mode when the storage backend can absorb it.

@riderx riderx merged commit bb60c38 into main May 21, 2026
42 checks passed
@riderx riderx deleted the codex/full-batch-manifest-backfill branch May 21, 2026 14:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants