Skip to content

feat(local-review): add claude CLI (-p) review engine (alternative to codex)#274

Open
anagnorisis2peripeteia wants to merge 5 commits into
openclaw:mainfrom
anagnorisis2peripeteia:feat/local-review-claude-engine
Open

feat(local-review): add claude CLI (-p) review engine (alternative to codex)#274
anagnorisis2peripeteia wants to merge 5 commits into
openclaw:mainfrom
anagnorisis2peripeteia:feat/local-review-claude-engine

Conversation

@anagnorisis2peripeteia

@anagnorisis2peripeteia anagnorisis2peripeteia commented Jun 10, 2026

Copy link
Copy Markdown

Summary

Adds a Claude CLI (-p) review engine to local-review, selected with --engine claude, as an alternative to the Codex engine. Codex stays the default, so existing runs are unchanged.

Unlike the Codex engine — which authenticates via Codex OAuth / CLAWSWEEPER_CODEX_LOGIN_METHOD (#251) and needs the Windows spawn handling in #250 — the Claude engine shells out to claude -p and relies on the CLI's own local auth, so it needs none of that login plumbing. The binary is configurable via --claude-bin / CLAUDE_BIN, so any claude -p-compatible CLI can be dropped in; --claude-model sets the model.

⚠️ Stacked on #253

This builds directly on #253 (feat/local-review-command) — the local-review command this extends — and is branched from it. Until #253 merges, this PR's diff includes #253's commit. Merge #253 first; the new content here is the claude CLI (-p) review engine + harden(...) commits.

How it works

  • Reuses the existing review prompt (buildReviewPrompt) and the Decision schema validation (parseDecision) unchanged — the engine only swaps the subprocess and the output recognition.
  • The decision schema travels in the prompt rather than via a CLI flag: claude's --json-schema is inline-only and the full schema (~36 KB) exceeds the platform argv limit, so it's appended to the prompt (stdin, unbounded) with an explicit output contract.
  • The terminal result record is recognised the same way a claude-stream-json stream is parsed (handles both the --output-format json array form and the stream-json line form), preferring the CLI's native structured_output and falling back to the result text.

Security hardening (addressing review feedback)

The security-relevant subprocess construction is extracted into src/claude-engine.ts and unit-tested (test/local-review-claude.test.ts, 7 tests):

  • Scrubbed environment. The spawned binary is given an env via claudeReviewEnv() that withholds GitHub/OpenAI/ClawSweeper-App credentials (GH_TOKEN, GITHUB_TOKEN, COMMIT_SWEEPER_TARGET_GH_TOKEN, CLAWSWEEPER_PROOF_INSPECTION_TOKEN, CLAWSWEEPER_APP_ID, CLAWSWEEPER_APP_PRIVATE_KEY, OPENAI_API_KEY, CODEX_API_KEY) — important because --claude-bin is operator-configurable. The Claude CLI authenticates from its own local credentials.
  • No permission bypass; read-only tools. Replaced --permission-mode bypassPermissions with an explicit --allowedTools Read Grep Glob allow-list — no Edit/Write/Bash/bypass. The dirty-checkout guard remains as defense-in-depth.

Real behavior proof

Ran the hardened engine end-to-end against a one-line planted bug (add() rewritten to subtract). It produced a valid Decision that flagged the contract break and passed strict parseDecision validation (paths redacted):

$ local-review --engine claude --target-dir <repo> --base master --repo openclaw/openclaw
[local-review] repo=openclaw/openclaw base=master files=1 diff=170 chars
[local-review] title: fix: subtract in add()
[local-review] engine=claude bin=claude
[local-review] decision written to <redacted-path>
--- decision (excerpt) ---
{
  "decision": "keep_open",
  "overallCorrectness": "patch is incorrect",
  "overallConfidenceScore": 0.97,
  "summary": "This patch is incorrect: it changes `add()` to return `a - b`, inverting the function's established and named behavior. The base commit shows `add()` returning `a + b` was the original intended behavior, and nothing in the commit message, history, or repository suggests addition was ever broken. Merging this would make `add(2, 3)` return `-1` instead of `5` for every caller. A shell check of the one-line diff is enough to block this as written."
}
reviewFindings[0].title: add() now subtracts, breaking its named contract for all callers
securityReview.status: cleared

Testing

  • tsgo build clean; node --test test/local-review-claude.test.ts → 7/7 pass (read-only argv, scrubbed env, result-record parsing in both array and NDJSON forms, JSON extraction).
  • End-to-end local-review --engine claude proof above.

Adds a 'local-review' CLI command that reviews the current branch diff
against a base branch using Codex, without requiring a GitHub PR.

Includes all prerequisites for local runs:
- CODEX_BIN env var support (Windows .cmd wrapper compatibility)
- shell: true on win32 for all Codex spawn sites
- CLAWSWEEPER_CODEX_LOGIN_METHOD for ChatGPT Pro OAuth login
- Applied to all 4 Codex subprocess launchers
- Clears stale output before each review run

Merge order: this PR subsumes openclaw#250 (Windows spawn) and openclaw#251 (OAuth
login). If this lands first, close those as superseded. If either
lands first, this PR needs a trivial rebase (identical changes).
Add `--engine claude` as an alternative to codex for the local-review
command, shelling out to the Claude CLI in headless `-p` mode. The binary is
configurable via `--claude-bin` / CLAUDE_BIN so any `claude -p`-compatible CLI
can be dropped in, and `--claude-model` sets the model. Codex remains the
default engine (no behaviour change for existing runs).

Reuses the existing review prompt and the Decision schema validation; the
schema travels in the prompt because the CLI's `--json-schema` is inline-only
and the full schema exceeds the platform argv limit. The terminal result
record is recognised the same way as a claude-stream-json stream (handles both
the --output-format json array and the stream-json line forms), preferring the
native structured_output and falling back to the result text. Read-only is
enforced via the prompt contract, disabled edit tools, and a dirty-checkout
guard.
@clawsweeper

clawsweeper Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed June 11, 2026, 4:59 AM ET / 08:59 UTC.

Summary
Adds a Claude CLI (-p) backend for local-review with configurable binary and model selection, credential scrubbing, read-only tools, schema normalization, correction retries, and focused tests.

Reproducibility: not applicable. This PR adds a new optional review engine rather than fixing an existing documented failure; the relevant verification is its supplied end-to-end behavior proof.

Review metrics: 3 noteworthy metrics.

  • Current merge conflicts: 5 files conflicted. The conflicts affect central review, Codex environment, launcher, coverage-proof, and test paths rather than isolated documentation.
  • Claude-only surface: 4 files; 717 additions, 14 deletions. Removing the stacked foundation still leaves a substantial independent engine implementation to review.
  • Focused Claude coverage: 209 test lines added. The new helper tests cover command arguments, credential scrubbing, result parsing, extraction, and schema pruning.

Merge readiness
Overall: 🦐 gold shrimp
Proof: 🦞 diamond lobster
Patch quality: 🦐 gold shrimp
Result: needs maintainer review before merge.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • Resolve feat: add local-review command for offline branch review #253 and rebase onto current main while preserving newer Codex model and workflow-status changes.
  • Retain operator-facing documentation that --claude-bin names trusted executable code, then run the focused tests and pnpm run check on Node 24 or newer.

Risk before merge

  • [P1] The five-file conflict overlaps active current-main review and Codex launcher changes, so manual resolution could regress current model selection or review automation even if both branches pass independently.
  • [P1] --claude-bin and CLAUDE_BIN select executable code that receives the full repository review prompt and can use the operator's local Claude authentication; credential scrubbing and tool flags do not sandbox an arbitrary binary.
  • [P1] Maintainers must decide whether the project should maintain a second model-specific output-normalization and correction path alongside Codex.

Maintainer options:

  1. Rebase and preserve current behavior (recommended)
    Resolve the local-review foundation, rebase this PR, and verify that conflict resolution retains current Codex model, launcher, and review-status behavior.
  2. Accept the local trust boundary
    Support the engine only if operator documentation clearly states that the selected Claude-compatible executable is trusted code with access to repository context and local Claude authentication.
  3. Defer the second engine
    Pause or close this PR if the ongoing parser, subprocess, and trust-boundary maintenance is not worth the optional local capability.

Next step before merge

  • [P2] A maintainer must settle the stacked foundation, review the manual conflict resolution, and choose whether to support the new executable trust boundary; there is no narrow autonomous repair yet.

Security
Needs attention: The expected Claude CLI is meaningfully hardened, but an operator-selected executable remains trusted code rather than a sandboxed reviewer.

Review details

Best possible solution:

Resolve #253 first, rebase this PR while preserving newer current-main Codex and review behavior, and land the Claude engine only with clear documentation that the configured executable is trusted code.

Do we have a high-confidence way to reproduce the issue?

Not applicable: this PR adds a new optional review engine rather than fixing an existing documented failure; the relevant verification is its supplied end-to-end behavior proof.

Is this the best way to solve the issue?

Unclear: the hardened engine design is technically plausible, but the final rebased implementation and long-term decision to support a second model-specific runtime require maintainer judgment.

AGENTS.md: found and applied where relevant.

Codex review notes: model internal, reasoning high; reviewed against 429a73be116e.

Label changes

Label justifications:

  • P2: This is a useful but optional local automation feature with bounded user impact and significant maintainer-review needs.
  • merge-risk: 🚨 automation: The PR changes review subprocess execution, model-output interpretation, correction retries, and conflicts with newer current-main launcher behavior.
  • merge-risk: 🚨 security-boundary: Merging introduces an operator-configurable executable that receives repository content and can use local Claude authentication.
  • rating: 🦐 gold shrimp: Overall readiness is 🦐 gold shrimp; proof is 🦞 diamond lobster and patch quality is 🦐 gold shrimp.
  • status: ⏳ waiting on author: ClawSweeper has contributor-facing work open and is waiting for author action. Sufficient (terminal): The PR body includes redacted after-fix terminal output from the hardened engine showing a real local review detect a planted subtraction bug and produce a strictly validated Decision.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body includes redacted after-fix terminal output from the hardened engine showing a real local review detect a planted subtraction bug and produce a strictly validated Decision.
Evidence reviewed

Security concerns:

  • [medium] Treat the configured Claude binary as trusted code — src/clawsweeper.ts:17167
    The executable receives the complete review prompt and inherits non-scrubbed local environment plus access to its own Claude credentials; the allow-list constrains a conforming Claude CLI but cannot constrain a malicious or incompatible replacement binary.
    Confidence: 0.99

What I checked:

Likely related people:

  • Peter Steinberger: Current-main history attributes the core review prompt and proof infrastructure plus recent autonomous-review changes in src/clawsweeper.ts to Peter Steinberger. (role: introduced and recently extended the central review pipeline; confidence: high; commits: 558017d383b7, 9209fa90cb3c, 55b06242dc49; files: src/clawsweeper.ts)
  • Dallin Romney: Recent history shows repeated work on review and proof command plumbing, retry behavior, and automation visibility in the same central file. (role: recent automation-area contributor; confidence: medium; commits: 40568e1fc23d, ebea216f184d, df47b34afea3; files: src/clawsweeper.ts)
  • Momo: The current main tip updates workflow status publishing in src/clawsweeper.ts, part of the file now conflicting with this branch. (role: most recent adjacent contributor; confidence: medium; commits: 429a73be116e; files: src/clawsweeper.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal priority bug or improvement with limited blast radius. merge-risk: 🚨 security-boundary 🚨 Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data. merge-risk: 🚨 automation 🚨 Merging this PR could break CI, automerge, proof capture, label sync, or automation. labels Jun 10, 2026
Address review feedback on the claude engine:
- Pass a scrubbed environment (claudeReviewEnv) to the spawned binary so the
  operator-configurable --claude-bin never receives GitHub/OpenAI/ClawSweeper
  App credentials.
- Replace --permission-mode bypassPermissions with an explicit read-only
  --allowedTools allow-list (Read/Grep/Glob; no edit/write/bash).
- Extract argv/env construction and result parsing into src/claude-engine.ts
  with unit tests (test/local-review-claude.test.ts).
@anagnorisis2peripeteia

Copy link
Copy Markdown
Author

@clawsweeper re-review — addressed the P1s: scrubbed subprocess env (no GitHub/OpenAI/app creds reach the configurable binary), replaced bypassPermissions with a read-only --allowedTools Read Grep Glob allow-list, extracted the argv/env construction + parser into src/claude-engine.ts with unit tests, and added redacted real --engine claude terminal output to the PR body.

@clawsweeper

clawsweeper Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added proof: sufficient Contributor real behavior proof is sufficient. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels Jun 10, 2026
…ivergent output

`local-review --engine claude` failed at the final validation step. The Claude
engine only gets the decision schema as prompt guidance (Codex constrains
generation via --output-schema; Claude's --json-schema is validate-or-drop and
the full schema exceeds the Windows cmd argv limit through the claude.cmd shim).
So the model intermittently (a) adds keys the schema does not define (e.g. `kind`
on a review finding) that strict parseDecision rejects outright, and (b) omits a
required field such as mantisRecommendation.maintainerComment.

- pruneToSchema(value, schema) in claude-engine.ts: a pure, schema-driven walk
  that drops undefined keys at every nesting level (object/array nodes are
  recognised by `properties`/`items`, so union- or untyped nodes are handled),
  reporting the dropped paths. Absorbs the extra-key case with no model round-trip.
- Bounded correction retry in runClaude: feed the exact validation error back to
  Claude (<=2 corrections) so it self-corrects omissions, without blind-defaulting
  critical fields. Hard errors (spawn/dirty-checkout) still propagate immediately.
- Memoize the parsed decision schema.

Tested: build + lint clean; new pruneToSchema unit tests (incl. union types) pass;
live `local-review --engine claude` over a real diff returns valid, on-topic
decisions, exercising both the prune and the retry. The 28 pre-existing
apply-decisions suite failures are environmental on Windows and reproduce
identically without this change.
@clawsweeper clawsweeper Bot added rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. and removed rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels Jun 11, 2026
…aude omission in the engine path

Addresses ClawSweeper's openclaw#274 review (P2, conf 0.96). The prior commit defaulted
mantisRecommendation.maintainerComment to "" in the SHARED parseDecision, which
relaxed the contract for every engine and let a status:"recommended" result with
no command pass validation despite being non-dispatchable downstream.

- Restore strict requireString(maintainerComment) in parseMantisRecommendation
  for all engines, preserving the "recommended => needs a command" invariant.
- Move the Claude-specific tolerance into the engine path:
  normalizeClaudeDecisionShape defaults maintainerComment to "" ONLY when
  status !== "recommended" (the case models systematically omit). A recommended
  result missing the comment is left to fail strict validation so the bounded
  correction retry has Claude supply it.
- Tests: strict-required regression for both statuses; normalize-only-when-not-recommended.
- Document the --claude-bin operator trust boundary.
@anagnorisis2peripeteia

Copy link
Copy Markdown
Author

@clawsweeper re-review — addressed the P2 correctness finding (conf 0.96): reverted the shared-parser maintainerComment relaxation, so parseMantisRecommendation is strict requireString again for all engines (preserves the "recommended ⇒ needs a command" invariant). Moved the Claude-only tolerance into the engine path: normalizeClaudeDecisionShape defaults maintainerComment to "" only when status !== "recommended"; a recommended result with no command still fails strict validation and is fixed by the bounded correction retry. Added regression coverage (strict-required for both statuses; normalize-only-when-not-recommended) and an operator-facing --claude-bin trust-boundary note. Live e2e: first-try pass on the common not_recommended case. The #253 stack/rebase and explicit trust-boundary acceptance remain maintainer gates. (ac82b03)

@clawsweeper

clawsweeper Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@jason-allen-oneal

Copy link
Copy Markdown

Review note from local harness testing:

Validated the #274 branch locally on Linux:

pnpm install --frozen-lockfile
pnpm run build
node --test test/local-review-claude.test.ts
pnpm run test:unit

Results:

  • Claude engine unit tests passed: 11/11
  • Full unit lane passed: 432/432

I did not live-test --engine claude because this machine does not currently have a claude CLI installed. The implementation is cleanly layered on top of #253, keeps Codex as the default engine, and adds useful safeguards around read-only tool access, env scrubbing, dirty-checkout detection, schema pruning, and bounded schema correction.

Recommended merge order: land #253 first, then rebase/merge #274.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 automation 🚨 Merging this PR could break CI, automerge, proof capture, label sync, or automation. merge-risk: 🚨 security-boundary 🚨 Merging this PR could weaken sandboxing, authorization, credentials, or sensitive data. P2 Normal priority bug or improvement with limited blast radius. proof: sufficient Contributor real behavior proof is sufficient. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants