feat(local-review): add claude CLI (-p) review engine (alternative to codex) by anagnorisis2peripeteia · Pull Request #274 · openclaw/clawsweeper

anagnorisis2peripeteia · 2026-06-10T02:45:04Z

Summary

Adds a Claude CLI (-p) review engine to local-review, selected with --engine claude, as an alternative to the Codex engine. Codex stays the default, so existing runs are unchanged.

Unlike the Codex engine — which authenticates via Codex OAuth / CLAWSWEEPER_CODEX_LOGIN_METHOD (#251) and needs the Windows spawn handling in #250 — the Claude engine shells out to claude -p and relies on the CLI's own local auth, so it needs none of that login plumbing. The binary is configurable via --claude-bin / CLAUDE_BIN, so any claude -p-compatible CLI can be dropped in; --claude-model sets the model.

⚠️ Stacked on #253

This builds directly on #253 (feat/local-review-command) — the local-review command this extends — and is branched from it. Until #253 merges, this PR's diff includes #253's commit. Merge #253 first; the new content here is the claude CLI (-p) review engine + harden(...) commits.

How it works

Reuses the existing review prompt (buildReviewPrompt) and the Decision schema validation (parseDecision) unchanged — the engine only swaps the subprocess and the output recognition.
The decision schema travels in the prompt rather than via a CLI flag: claude's --json-schema is inline-only and the full schema (~36 KB) exceeds the platform argv limit, so it's appended to the prompt (stdin, unbounded) with an explicit output contract.
The terminal result record is recognised the same way a claude-stream-json stream is parsed (handles both the --output-format json array form and the stream-json line form), preferring the CLI's native structured_output and falling back to the result text.

Security hardening (addressing review feedback)

The security-relevant subprocess construction is extracted into src/claude-engine.ts and unit-tested (test/local-review-claude.test.ts, 7 tests):

Scrubbed environment. The spawned binary is given an env via claudeReviewEnv() that withholds GitHub/OpenAI/ClawSweeper-App credentials (GH_TOKEN, GITHUB_TOKEN, COMMIT_SWEEPER_TARGET_GH_TOKEN, CLAWSWEEPER_PROOF_INSPECTION_TOKEN, CLAWSWEEPER_APP_ID, CLAWSWEEPER_APP_PRIVATE_KEY, OPENAI_API_KEY, CODEX_API_KEY) — important because --claude-bin is operator-configurable. The Claude CLI authenticates from its own local credentials.
No permission bypass; read-only tools. Replaced --permission-mode bypassPermissions with an explicit --allowedTools Read Grep Glob allow-list — no Edit/Write/Bash/bypass. The dirty-checkout guard remains as defense-in-depth.

Real behavior proof

Ran the hardened engine end-to-end against a one-line planted bug (add() rewritten to subtract). It produced a valid Decision that flagged the contract break and passed strict parseDecision validation (paths redacted):

$ local-review --engine claude --target-dir <repo> --base master --repo openclaw/openclaw
[local-review] repo=openclaw/openclaw base=master files=1 diff=170 chars
[local-review] title: fix: subtract in add()
[local-review] engine=claude bin=claude
[local-review] decision written to <redacted-path>
--- decision (excerpt) ---
{
  "decision": "keep_open",
  "overallCorrectness": "patch is incorrect",
  "overallConfidenceScore": 0.97,
  "summary": "This patch is incorrect: it changes `add()` to return `a - b`, inverting the function's established and named behavior. The base commit shows `add()` returning `a + b` was the original intended behavior, and nothing in the commit message, history, or repository suggests addition was ever broken. Merging this would make `add(2, 3)` return `-1` instead of `5` for every caller. A shell check of the one-line diff is enough to block this as written."
}
reviewFindings[0].title: add() now subtracts, breaking its named contract for all callers
securityReview.status: cleared

Testing

tsgo build clean; node --test test/local-review-claude.test.ts → 7/7 pass (read-only argv, scrubbed env, result-record parsing in both array and NDJSON forms, JSON extraction).
End-to-end local-review --engine claude proof above.

Adds a 'local-review' CLI command that reviews the current branch diff against a base branch using Codex, without requiring a GitHub PR. Includes all prerequisites for local runs: - CODEX_BIN env var support (Windows .cmd wrapper compatibility) - shell: true on win32 for all Codex spawn sites - CLAWSWEEPER_CODEX_LOGIN_METHOD for ChatGPT Pro OAuth login - Applied to all 4 Codex subprocess launchers - Clears stale output before each review run Merge order: this PR subsumes openclaw#250 (Windows spawn) and openclaw#251 (OAuth login). If this lands first, close those as superseded. If either lands first, this PR needs a trivial rebase (identical changes).

Add `--engine claude` as an alternative to codex for the local-review command, shelling out to the Claude CLI in headless `-p` mode. The binary is configurable via `--claude-bin` / CLAUDE_BIN so any `claude -p`-compatible CLI can be dropped in, and `--claude-model` sets the model. Codex remains the default engine (no behaviour change for existing runs). Reuses the existing review prompt and the Decision schema validation; the schema travels in the prompt because the CLI's `--json-schema` is inline-only and the full schema exceeds the platform argv limit. The terminal result record is recognised the same way as a claude-stream-json stream (handles both the --output-format json array and the stream-json line forms), preferring the native structured_output and falling back to the result text. Read-only is enforced via the prompt contract, disabled edit tools, and a dirty-checkout guard.

clawsweeper · 2026-06-10T02:46:22Z

Codex review: needs maintainer review before merge. Reviewed June 11, 2026, 4:59 AM ET / 08:59 UTC.

Summary
Adds a Claude CLI (-p) backend for local-review with configurable binary and model selection, credential scrubbing, read-only tools, schema normalization, correction retries, and focused tests.

Reproducibility: not applicable. This PR adds a new optional review engine rather than fixing an existing documented failure; the relevant verification is its supplied end-to-end behavior proof.

Review metrics: 3 noteworthy metrics.

Current merge conflicts: 5 files conflicted. The conflicts affect central review, Codex environment, launcher, coverage-proof, and test paths rather than isolated documentation.
Claude-only surface: 4 files; 717 additions, 14 deletions. Removing the stacked foundation still leaves a substantial independent engine implementation to review.
Focused Claude coverage: 209 test lines added. The new helper tests cover command arguments, credential scrubbing, result parsing, extraction, and schema pruning.

Merge readiness
Overall: 🦐 gold shrimp
Proof: 🦞 diamond lobster
Patch quality: 🦐 gold shrimp
Result: needs maintainer review before merge.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

Resolve feat: add local-review command for offline branch review #253 and rebase onto current main while preserving newer Codex model and workflow-status changes.
Retain operator-facing documentation that --claude-bin names trusted executable code, then run the focused tests and pnpm run check on Node 24 or newer.

Risk before merge

[P1] The five-file conflict overlaps active current-main review and Codex launcher changes, so manual resolution could regress current model selection or review automation even if both branches pass independently.
[P1] --claude-bin and CLAUDE_BIN select executable code that receives the full repository review prompt and can use the operator's local Claude authentication; credential scrubbing and tool flags do not sandbox an arbitrary binary.
[P1] Maintainers must decide whether the project should maintain a second model-specific output-normalization and correction path alongside Codex.

Maintainer options:

Rebase and preserve current behavior (recommended)
Resolve the local-review foundation, rebase this PR, and verify that conflict resolution retains current Codex model, launcher, and review-status behavior.
Accept the local trust boundary
Support the engine only if operator documentation clearly states that the selected Claude-compatible executable is trusted code with access to repository context and local Claude authentication.
Defer the second engine
Pause or close this PR if the ongoing parser, subprocess, and trust-boundary maintenance is not worth the optional local capability.

Next step before merge

[P2] A maintainer must settle the stacked foundation, review the manual conflict resolution, and choose whether to support the new executable trust boundary; there is no narrow autonomous repair yet.

Security
Needs attention: The expected Claude CLI is meaningfully hardened, but an operator-selected executable remains trusted code rather than a sandboxed reviewer.

Review details

Best possible solution:

Resolve #253 first, rebase this PR while preserving newer current-main Codex and review behavior, and land the Claude engine only with clear documentation that the configured executable is trusted code.

Do we have a high-confidence way to reproduce the issue?

Not applicable: this PR adds a new optional review engine rather than fixing an existing documented failure; the relevant verification is its supplied end-to-end behavior proof.

Is this the best way to solve the issue?

Unclear: the hardened engine design is technically plausible, but the final rebased implementation and long-term decision to support a second model-specific runtime require maintainer judgment.

AGENTS.md: found and applied where relevant.

Codex review notes: model internal, reasoning high; reviewed against 429a73be116e.

Label changes

Label justifications:

P2: This is a useful but optional local automation feature with bounded user impact and significant maintainer-review needs.
merge-risk: 🚨 automation: The PR changes review subprocess execution, model-output interpretation, correction retries, and conflicts with newer current-main launcher behavior.
merge-risk: 🚨 security-boundary: Merging introduces an operator-configurable executable that receives repository content and can use local Claude authentication.
rating: 🦐 gold shrimp: Overall readiness is 🦐 gold shrimp; proof is 🦞 diamond lobster and patch quality is 🦐 gold shrimp.
status: ⏳ waiting on author: ClawSweeper has contributor-facing work open and is waiting for author action. Sufficient (terminal): The PR body includes redacted after-fix terminal output from the hardened engine showing a real local review detect a planted subtraction bug and produce a strictly validated Decision.
proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes redacted after-fix terminal output from the hardened engine showing a real local review detect a planted subtraction bug and produce a strictly validated Decision.

Evidence reviewed

Security concerns:

[medium] Treat the configured Claude binary as trusted code — src/clawsweeper.ts:17167
The executable receives the complete review prompt and inherits non-scrubbed local environment plus access to its own Claude credentials; the allow-list constrains a conforming Claude CLI but cannot constrain a malicious or incompatible replacement binary.
Confidence: 0.99

What I checked:

Unresolved stack: The branch explicitly depends on the still-open local-review foundation at feat: add local-review command for offline branch review #253, and its first commit contains that foundation rather than Claude-specific work. (src/clawsweeper.ts, 3569490e022d)
Concrete current-main conflicts: A three-way merge against current main reports conflicts in src/clawsweeper.ts, src/codex-env.ts, src/commit-sweeper.ts, src/pr-close-coverage-proof.ts, and test/clawsweeper.test.ts, including newer current-main Codex model plumbing. (src/clawsweeper.ts:24, 429a73be116e)
Claude-only scope: Even after excluding the foundational local-review commit, the feature adds 717 lines across four files, so final conflict resolution is a meaningful review surface rather than a trivial rebase. (src/claude-engine.ts:1, ac82b03e62f1)
Hardened subprocess: The current head scrubs named GitHub, ClawSweeper App, OpenAI, and Codex credentials and limits the expected Claude CLI to Read, Grep, and Glob without permission bypass. (src/claude-engine.ts:11, ac82b03e62f1)
Parser invariant repair: Commit ac82b03 restores strict shared Mantis recommendation validation and confines non-recommended empty-command normalization to the Claude engine path, addressing the previous correctness finding. (src/clawsweeper.ts:6509, ac82b03e62f1)
Clean review checkout: Read-only inspection left the target checkout unchanged. (429a73be116e)

Likely related people:

Peter Steinberger: Current-main history attributes the core review prompt and proof infrastructure plus recent autonomous-review changes in src/clawsweeper.ts to Peter Steinberger. (role: introduced and recently extended the central review pipeline; confidence: high; commits: 558017d383b7, 9209fa90cb3c, 55b06242dc49; files: src/clawsweeper.ts)
Dallin Romney: Recent history shows repeated work on review and proof command plumbing, retry behavior, and automation visibility in the same central file. (role: recent automation-area contributor; confidence: medium; commits: 40568e1fc23d, ebea216f184d, df47b34afea3; files: src/clawsweeper.ts)
Momo: The current main tip updates workflow status publishing in src/clawsweeper.ts, part of the file now conflicting with this branch. (role: most recent adjacent contributor; confidence: medium; commits: 429a73be116e; files: src/clawsweeper.ts)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Address review feedback on the claude engine: - Pass a scrubbed environment (claudeReviewEnv) to the spawned binary so the operator-configurable --claude-bin never receives GitHub/OpenAI/ClawSweeper App credentials. - Replace --permission-mode bypassPermissions with an explicit read-only --allowedTools allow-list (Read/Grep/Glob; no edit/write/bash). - Extract argv/env construction and result parsing into src/claude-engine.ts with unit tests (test/local-review-claude.test.ts).

anagnorisis2peripeteia · 2026-06-10T03:02:13Z

@clawsweeper re-review — addressed the P1s: scrubbed subprocess env (no GitHub/OpenAI/app creds reach the configurable binary), replaced bypassPermissions with a read-only --allowedTools Read Grep Glob allow-list, extracted the argv/env construction + parser into src/claude-engine.ts with unit tests, and added redacted real --engine claude terminal output to the PR body.

clawsweeper · 2026-06-10T03:02:15Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/27250248996
Updated: 2026-06-10T03:10:13.363Z

…ivergent output `local-review --engine claude` failed at the final validation step. The Claude engine only gets the decision schema as prompt guidance (Codex constrains generation via --output-schema; Claude's --json-schema is validate-or-drop and the full schema exceeds the Windows cmd argv limit through the claude.cmd shim). So the model intermittently (a) adds keys the schema does not define (e.g. `kind` on a review finding) that strict parseDecision rejects outright, and (b) omits a required field such as mantisRecommendation.maintainerComment. - pruneToSchema(value, schema) in claude-engine.ts: a pure, schema-driven walk that drops undefined keys at every nesting level (object/array nodes are recognised by `properties`/`items`, so union- or untyped nodes are handled), reporting the dropped paths. Absorbs the extra-key case with no model round-trip. - Bounded correction retry in runClaude: feed the exact validation error back to Claude (<=2 corrections) so it self-corrects omissions, without blind-defaulting critical fields. Hard errors (spawn/dirty-checkout) still propagate immediately. - Memoize the parsed decision schema. Tested: build + lint clean; new pruneToSchema unit tests (incl. union types) pass; live `local-review --engine claude` over a real diff returns valid, on-topic decisions, exercising both the prune and the retry. The 28 pre-existing apply-decisions suite failures are environmental on Windows and reproduce identically without this change.

…aude omission in the engine path Addresses ClawSweeper's openclaw#274 review (P2, conf 0.96). The prior commit defaulted mantisRecommendation.maintainerComment to "" in the SHARED parseDecision, which relaxed the contract for every engine and let a status:"recommended" result with no command pass validation despite being non-dispatchable downstream. - Restore strict requireString(maintainerComment) in parseMantisRecommendation for all engines, preserving the "recommended => needs a command" invariant. - Move the Claude-specific tolerance into the engine path: normalizeClaudeDecisionShape defaults maintainerComment to "" ONLY when status !== "recommended" (the case models systematically omit). A recommended result missing the comment is left to fail strict validation so the bounded correction retry has Claude supply it. - Tests: strict-required regression for both statuses; normalize-only-when-not-recommended. - Document the --claude-bin operator trust boundary.

anagnorisis2peripeteia · 2026-06-11T08:54:06Z

@clawsweeper re-review — addressed the P2 correctness finding (conf 0.96): reverted the shared-parser maintainerComment relaxation, so parseMantisRecommendation is strict requireString again for all engines (preserves the "recommended ⇒ needs a command" invariant). Moved the Claude-only tolerance into the engine path: normalizeClaudeDecisionShape defaults maintainerComment to "" only when status !== "recommended"; a recommended result with no command still fails strict validation and is fixed by the bounded correction retry. Added regression coverage (strict-required for both statuses; normalize-only-when-not-recommended) and an operator-facing --claude-bin trust-boundary note. Live e2e: first-try pass on the common not_recommended case. The #253 stack/rebase and explicit trust-boundary acceptance remain maintainer gates. (ac82b03)

clawsweeper · 2026-06-11T08:54:08Z

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

State: Complete
Detail: The targeted re-review finished, the durable review comment was updated, and the synced verdict was routed.
Run: https://github.com/openclaw/clawsweeper/actions/runs/27335543838
Updated: 2026-06-11T08:59:35.617Z

jason-allen-oneal · 2026-06-11T17:53:10Z

Review note from local harness testing:

Validated the #274 branch locally on Linux:

pnpm install --frozen-lockfile
pnpm run build
node --test test/local-review-claude.test.ts
pnpm run test:unit

Results:

Claude engine unit tests passed: 11/11
Full unit lane passed: 432/432

I did not live-test --engine claude because this machine does not currently have a claude CLI installed. The implementation is cleanly layered on top of #253, keeps Codex as the default engine, and adds useful safeguards around read-only tool access, env scrubbing, dirty-checkout detection, schema pruning, and bounded schema correction.

Recommended merge order: land #253 first, then rebase/merge #274.

anagnorisis2peripeteia added 2 commits June 4, 2026 12:14

jason-allen-oneal mentioned this pull request Jun 11, 2026

fix: resolve Windows Codex spawn failures for local review runs #250

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(local-review): add claude CLI (-p) review engine (alternative to codex)#274

feat(local-review): add claude CLI (-p) review engine (alternative to codex)#274
anagnorisis2peripeteia wants to merge 5 commits into
openclaw:mainfrom
anagnorisis2peripeteia:feat/local-review-claude-engine

anagnorisis2peripeteia commented Jun 10, 2026 •

edited

Loading

Uh oh!

clawsweeper Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

anagnorisis2peripeteia commented Jun 10, 2026

Uh oh!

clawsweeper Bot commented Jun 10, 2026 •

edited

Loading

Uh oh!

anagnorisis2peripeteia commented Jun 11, 2026

Uh oh!

clawsweeper Bot commented Jun 11, 2026 •

edited

Loading

Uh oh!

jason-allen-oneal commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

anagnorisis2peripeteia commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

⚠️ Stacked on #253

How it works

Security hardening (addressing review feedback)

Real behavior proof

Testing

Uh oh!

clawsweeper Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

anagnorisis2peripeteia commented Jun 10, 2026

Uh oh!

clawsweeper Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

anagnorisis2peripeteia commented Jun 11, 2026

Uh oh!

clawsweeper Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jason-allen-oneal commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

anagnorisis2peripeteia commented Jun 10, 2026 •

edited

Loading

clawsweeper Bot commented Jun 10, 2026 •

edited

Loading

clawsweeper Bot commented Jun 10, 2026 •

edited

Loading

clawsweeper Bot commented Jun 11, 2026 •

edited

Loading