This repository shows how to put a coding model inside a controlled CI loop instead of treating generation as the end of the workflow.
The codebase is intentionally left in a buggy baseline state. That is part of the demo. The job of the CI loop is to inspect the broken state, build a constrained failure-driven context snapshot, call the configured backend for a candidate change, render a patch, rerun validation, and decide whether the change should be accepted.
This repo is designed around two operating modes:
- `codex` is the default local development path, where a developer runs the loop before commit as a pre-hook style quality gate
- `openai_responses_api` is the backup remote CI path, where Jenkins or another pipeline triggers the loop after a commit lands on a UAT or prod-tagged branch
In both cases, ci_loop.py is the gatekeeper. It is the component that decides whether a generated patch is valid enough to count as a pass.
This workflow is currently failure-driven and test-first.
- It assumes the target repository already has meaningful executable tests for the behavior you care about.
- In practice, that means functional/integration scenario tests are the primary signal for acceptance or rejection.
- Unit tests are useful supporting coverage, but unit-only coverage is usually not enough for this gate to reflect real system behavior.
- If no relevant tests exist, the loop can still generate patches, but acceptance quality and confidence drop because the validation signal is weak.
- This workflow is not designed as a general code-review agent. It is a code-fix gate for buggy implementations detected by failing pre-defined tests.
- For repositories that do not define tests before implementation, this workflow has limited practical value.
- tdd + humans > vibe tests: If teams are shipping large amounts of AI-generated code, TDD has to be a first-class constraint. The system still needs a human to define expected behavior in executable functional test cases. Without that, "tests" degrade into weak synthetic checks that do not describe the actual product contract.
- zero-trust: Test-and-fix should be a workflow guard, not an optional developer habit. If the guard can be skipped, someone will skip it under time pressure.
- automation > manual process: Tooling beats process memory. A tracked pre-commit hook and a CI-enforced post-commit gate are stronger controls than a team norm that says "please remember to run the reviewer."
- model council > single model: Using a non-Anthropic-family model to check Anthropic-family generated code is deliberate. It reduces same-model bias and usually produces a healthier review surface than asking one model family to repeatedly validate itself.
- ralph-loop-for-fixing-ai-slop: Developers often use one AI to generate code and then ask the same or similar AI to fix the fallout manually. This repo pushes that repair step into a controlled validation loop so the model must satisfy the tests instead of asking the human to carry the debugging burden.
Those observations led to this repo: put the coding model inside a constrained repair loop, not at the end of the workflow.
- the loop detects failing tests
- it builds a failure-driven context snapshot from test results, repo delta, and local dependencies
- it asks the configured backend for a candidate patch
- it applies the patch, reruns validation, and decides whether to accept or reject it
- the model does not decide success; the loop does
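The steps above can be sketched as a small control loop. This is a minimal illustration, not the code in `ci_loop.py`: every callable passed in (`run_tests`, `build_context`, `generate_patch`, `apply_patch`, `revert_patch`) is a hypothetical stand-in, and treating `max_retries` as the total number of attempts is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    accepted: bool
    attempts: int


def repair_loop(
    run_tests: Callable[[], bool],         # hypothetical: True when validation passes
    build_context: Callable[[], str],      # hypothetical: returns the context snapshot
    generate_patch: Callable[[str], str],  # hypothetical: backend call, returns a patch
    apply_patch: Callable[[str], None],    # hypothetical: applies the patch
    revert_patch: Callable[[], None],      # hypothetical: restores the baseline
    max_retries: int = 2,                  # assumption: total attempts allowed
) -> LoopResult:
    """Accept a candidate patch only when validation passes after applying it."""
    if run_tests():
        return LoopResult(accepted=True, attempts=0)  # nothing is failing
    for attempt in range(1, max_retries + 1):
        patch = generate_patch(build_context())
        apply_patch(patch)
        if run_tests():
            return LoopResult(accepted=True, attempts=attempt)
        revert_patch()  # rejected: keep the working tree at baseline
    return LoopResult(accepted=False, attempts=max_retries)
```

The backend only ever returns a patch. The accept or reject decision comes from `run_tests`, which is the point of the last bullet.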
TDD helps only if the tests remain the source of truth. One common AI failure mode is changing the tests to match the code it just wrote, which can create a synthetic pass while the real behavior is still wrong. This repo assumes the tests are the contract and keeps the acceptance decision outside the model.
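One way to hold that line mechanically is to reject any candidate whose edits touch test files. The sketch below assumes edits arrive as repo-relative paths and that tests live under `tests/` in files named `test_*.py`; `edits_touching_tests` is a hypothetical helper, not a function this repo is known to define.

```python
from pathlib import PurePosixPath


def edits_touching_tests(edit_paths, protected_dirs=("tests",)):
    """Return the edit paths that would modify test files."""
    flagged = []
    for path in edit_paths:
        parsed = PurePosixPath(path)
        in_protected_dir = bool(parsed.parts) and parsed.parts[0] in protected_dirs
        looks_like_test = parsed.name.startswith("test_") and parsed.suffix == ".py"
        if in_protected_dir or looks_like_test:
            flagged.append(path)
    return flagged
```

A gate built on this would reject the candidate whenever the returned list is non-empty, before validation even runs.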
Code review tools optimize for quality, style, or risk signals. This workflow is narrower and more concrete: it is a code-fix gate for buggy implementations that are already failing pre-defined tests.
Ask a friend to identify your flaws and they will soften the feedback. Models have the same bias toward their own output, as anyone shipping AI-generated code to prod knows. Since I write most of the code using Anthropic models, OpenAI models were my preferred choice for the reviewer.
Most coding-agent demos stop at "the model wrote some code."
That is not the hard part in production.
The harder part is deciding whether a generated change is safe to merge.
This repo focuses on that boundary:
- build context from the current repo state
- ask the model for a constrained change
- render and apply a patch
- run validations
- accept or reject the change
+------------------+
|     Codebase     |
+--------+---------+
         |
         v
+------------------+
| Context Builder  |
+--------+---------+
         |
         v
+------------------+
|      Codex       |
|    (Diff Gen)    |
+--------+---------+
         |
         v
+------------------+
|   Patch (Diff)   |
+--------+---------+
         |
         v
+------------------+
| Validation Layer |
|   Tests / Lint   |
+--------+---------+
         |
+--------+---------+
|                  |
v                  v
Accept        Reject
Developer change
↓
Local hook or manual trigger
↓
python3 ci_loop.py run-all --backend codex
↓
Build Context (repo snapshot)
↓
Codex CLI worker (`codex exec`)
↓
response.md + patch.diff
↓
Validation Layer
- Scenario test
- Optional local lint/static checks
↓
Decision
- Accept candidate patch
- Reject and keep working tree at baseline
↓
Developer decides whether to commit
Remote commit to UAT/prod-tagged branch
↓
Jenkins post-commit build trigger
↓
python3 ci_loop.py run-all --backend openai_responses_api
↓
Build Context (repo snapshot)
↓
OpenAI Responses API
↓
response.json + patch.diff
↓
Validation Layer
- Scenario test
- Optional lint/static/security checks
↓
Decision Layer
- Accept
- Reject
- Retry
↓
Mark CI build success or failure
In this repository, backend selection is configurable. The shared loop is:
- `ci_loop.py` builds `context.txt` from the failing test run, the observed failure output, dynamically discovered local code dependencies, recent repo delta when that signal is relevant, and matched or candidate `test_scenarios/` knowledge depending on confidence
- `ci_loop.py` dispatches the scenario attempt through the configured backend
- the backend writes a raw artifact such as `response.json` or `response.md`
- the repo renders those edits into `patch.diff`
- the patch is applied and validated locally
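The rendering step can be illustrated with the standard library. This sketch assumes the backend returns full-file edits in the `{"edits": [{"path": ..., "content": ...}]}` shape used in the Responses API example later in this README; `render_patch` and its `read_current` argument are hypothetical names, and the renderer in `ci_loop.py` may work differently.

```python
import difflib


def render_patch(edits, read_current):
    """Render a unified diff from full-file edits.

    edits: list of {"path": str, "content": str} dicts
    read_current: hypothetical callable returning the current text for a path
    """
    chunks = []
    for edit in edits:
        old_lines = read_current(edit["path"]).splitlines(keepends=True)
        new_lines = edit["content"].splitlines(keepends=True)
        chunks.extend(
            difflib.unified_diff(
                old_lines,
                new_lines,
                fromfile="a/" + edit["path"],
                tofile="b/" + edit["path"],
            )
        )
    return "".join(chunks)


# Usage with an in-memory "repo" instead of real files.
files = {"orders.py": "tax = 5\n"}
patch = render_patch([{"path": "orders.py", "content": "tax = 0.05\n"}], files.__getitem__)
print(patch)
```

An edit that leaves a file unchanged contributes nothing to the diff, so an empty result is a cheap signal that the backend proposed no real change.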
If you want this demo to use a specific Codex-capable model on the OpenAI backup path, set OPENAI_MODEL accordingly. The Responses API is the mechanism for the backup route. The configured model behind that API is the reasoning engine.
The default backend and model are stored in ci_config.json:
{
  "backend": "codex",
  "openai_model": "gpt-4.1"
}
Backend resolution order is:
- `--backend` CLI flag
- `CI_LOOP_BACKEND` environment variable
- `ci_config.json`
- built-in fallback: `codex`
Model resolution order is:
- `--model` CLI flag
- `OPENAI_MODEL` environment variable
- `ci_config.json`
- built-in fallback: `gpt-4.1`
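Both resolution orders follow the same precedence pattern: CLI flag, then environment variable, then config file, then built-in fallback. A minimal sketch of that pattern, where `resolve_setting` is a hypothetical helper and treating empty values as "not set" is an assumption:

```python
def resolve_setting(cli_value, env, env_key, config, config_key, fallback):
    """Return the first non-empty value in precedence order."""
    for candidate in (cli_value, env.get(env_key), config.get(config_key)):
        if candidate:
            return candidate
    return fallback


config = {"backend": "codex", "openai_model": "gpt-4.1"}

# No CLI flag, but the environment overrides the config file.
backend = resolve_setting(
    None, {"CI_LOOP_BACKEND": "openai_responses_api"}, "CI_LOOP_BACKEND",
    config, "backend", "codex",
)

# No CLI flag and no environment variable: the config file wins.
model = resolve_setting(None, {}, "OPENAI_MODEL", config, "openai_model", "gpt-4.1")
```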
How to change it:
- Edit ci_config.json if you want to change the repo default backend or model.
- Set `CI_LOOP_BACKEND=...` if you want a per-environment backend override.
- Set `OPENAI_MODEL=...` if you want a per-environment model override.
- Pass `--model ...` if you want a one-off run override.
Supported backends today:
- `codex`: implemented via non-interactive `codex exec` and used by default for local developer-time execution
- `openai_responses_api`: implemented as the backup route for remote CI execution
Backend-specific runtime requirements:
- `codex` requires a working authenticated Codex CLI session and available Codex usage quota
- `openai_responses_api` requires `OPENAI_API_KEY` plus network access
Optional local viewer:
brew install glow
`glow` is not required to run the repo, but it makes `response.md` and `patch.diff` much easier to present in a terminal demo.
Install the tracked Git hook:
./install_git_hooks.sh
That configures `core.hooksPath=.githooks` and enables the local pre-commit gate in `.githooks/pre-commit`.
Hook control options:
- default: enabled through `ci_config.json` via `git_hooks.pre_commit_enabled`
- temporary local bypass: run with `SKIP_CI_GATEKEEPER_PRE_COMMIT=1`
That gives you a stable repo-level switch plus a one-off escape hatch for local testing.
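The interaction between the two switches can be sketched as follows. This is an illustration of the described behavior, not the hook's actual code: `pre_commit_gate_enabled` is a hypothetical name, and defaulting to enabled when the config key is missing is an assumption.

```python
def pre_commit_gate_enabled(config, env):
    """Decide whether the local pre-commit gate should run."""
    # The one-off escape hatch takes priority over the repo-level switch.
    if env.get("SKIP_CI_GATEKEEPER_PRE_COMMIT") == "1":
        return False
    # Assumption: the gate stays on unless the config turns it off.
    return bool(config.get("git_hooks", {}).get("pre_commit_enabled", True))
```

In a real hook, `config` would be the parsed `ci_config.json` and `env` would be `os.environ`.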
If git commit appears to "hang" or starts printing CI loop logs, that is expected in this repo: the tracked pre-commit hook is running the local gate.
Use this for a one-time bypass:
SKIP_CI_GATEKEEPER_PRE_COMMIT=1 git commit -m "<message>"
If you need to disable the hook repo-wide for local experimentation, set `git_hooks.pre_commit_enabled` to `false` in `ci_config.json` or unset `core.hooksPath`.
Important limit:
The default backend is now codex, so the repo can run without any OpenAI API key dependency when you use the Codex CLI path. The backup openai_responses_api route still calls https://api.openai.com/v1/responses, so it requires network access and an API key.
All scenario files under demo_scenarios/ map to runnable demo flows:
- `scenario_1_integration_bug`: Write/read inconsistency in `user_store.py`
- `scenario_2_wrong_fix_path`: Tempting local fix vs systemic fix in `user_registry.py`
- `scenario_3_refactor_bug`: Contract drift between `orders.py` and `pricing.py`
- `scenario_4_low_confidence`: Deliberately unmapped failure to demonstrate clarification/proposal artifact generation using `delivery_window.py`
Machine-usable recurring scenario knowledge now lives separately under test_scenarios/. That registry is used for automated matching and prompt enrichment; demo_scenarios/ remains human-facing demo documentation only.
The old scenarios/ folder is removed and no longer used.
The runtime now implements five phases.
- run the failing scenario test first
- normalize that result into `failed_tests`, `failure_summary`, `failure_output`, and likely repo-local modules
- place that normalized failure record at the top of `context.txt`
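A normalized failure record of this kind could look like the sketch below. The `failed_tests`, `failure_summary`, and `failure_output` fields come from the list above; the `likely_modules` field name, the `# FAILURE_RECORD` header, and the text layout are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailureRecord:
    failed_tests: List[str]
    failure_summary: str
    failure_output: str
    likely_modules: List[str] = field(default_factory=list)  # hypothetical field name


def render_context(record: FailureRecord, rest_of_context: str) -> str:
    """Place the normalized failure record ahead of everything else."""
    header = [
        "# FAILURE_RECORD",  # hypothetical section header
        "failed_tests: " + ", ".join(record.failed_tests),
        "failure_summary: " + record.failure_summary,
        "likely_modules: " + ", ".join(record.likely_modules),
        "failure_output:",
        record.failure_output,
    ]
    return "\n".join(header) + "\n\n" + rest_of_context
```

Putting the record first means the model reads what failed before it reads any code.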
- inspect recent changed Python files from the working tree or previous commit
- keep only delta that overlaps the failing test context
- attach that bounded signal under `# RECENT_REPO_DELTA` and `# RECENT_REPO_DELTA_DIFF`
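The overlap filter can be as simple as an intersection between changed files and the modules the failing test implicates. This sketch assumes both are available as lists of repo-relative paths; `overlapping_delta` is a hypothetical helper, and the repo's actual overlap rule may be broader.

```python
def overlapping_delta(changed_files, failing_modules):
    """Keep changed Python files that the failing test context implicates."""
    implicated = set(failing_modules)
    return [
        path
        for path in changed_files
        if path.endswith(".py") and path in implicated
    ]
```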
- load `test_scenarios/`
- attach high-confidence matches automatically under `# SCENARIO_MATCH` and `# TEST_SCENARIO_RECORD`
- attach medium-confidence candidates cautiously under `# SCENARIO_CANDIDATE`
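The confidence tiers map to context sections roughly like this. The section names come from the list above; the numeric score and both thresholds are hypothetical, since this README does not say how confidence is measured.

```python
HIGH_CONFIDENCE = 0.8    # hypothetical threshold
MEDIUM_CONFIDENCE = 0.5  # hypothetical threshold


def scenario_sections(score):
    """Return the context.txt section headers to attach for a match score."""
    if score >= HIGH_CONFIDENCE:
        return ["# SCENARIO_MATCH", "# TEST_SCENARIO_RECORD"]
    if score >= MEDIUM_CONFIDENCE:
        return ["# SCENARIO_CANDIDATE"]
    return []  # too weak to attach any scenario knowledge
```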
- if the failure cannot be classified confidently enough, stop before backend generation
- write `output/<scenario>/clarification_request.json`
- use that artifact to review the targeted contract questions before retrying repair generation
- when the failure looks new or only partially classified, auto-draft `output/<scenario>/scenario_proposal.json`
- do not persist that proposal silently
- approve it explicitly with:
python3 ci_loop.py approve-scenario-proposal --scenario scenario_1_integration_bug
List scenarios:
python3 ci_loop.py list-scenarios
Check the intentionally failing baseline:
python3 ci_loop.py test --scenario scenario_1_integration_bug
python3 ci_loop.py test --scenario scenario_2_wrong_fix_path
python3 ci_loop.py test --scenario scenario_3_refactor_bug
python3 ci_loop.py test --scenario scenario_4_low_confidence
Run one scenario:
python3 ci_loop.py run --scenario scenario_1_integration_bug
Demo-safe run (revert accepted fixes after validation):
python3 ci_loop.py run --scenario scenario_1_integration_bug --dryRun
Run the full sweep:
python3 ci_loop.py run-all --max-retries 2
Demo-safe full sweep:
python3 ci_loop.py run-all --max-retries 2 --dryRun
Note: `run-all` executes only gating scenarios (1 to 3). `scenario_4_low_confidence` is excluded on purpose so pre-commit and CI gate flows stay green while still allowing a dedicated low-confidence demo.
Run all scenarios including scenario 4 (default fail-closed clarification policy):
python3 ci_loop.py run-all --include-non-gating --max-retries 2 --dryRun
Run all scenarios including scenario 4 with interactive clarification:
python3 ci_loop.py run-all --include-non-gating --clarification-policy interactive --max-retries 2 --dryRun
Run scenario 4 with forced heuristic clarification options:
python3 ci_loop.py run --scenario scenario_4_low_confidence --clarification-policy interactive --clarifier-option-source heuristic --max-retries 1 --dryRun
Clarification policy behavior:
- `fail` (default): writes clarification artifacts and exits before low-confidence generation
- `interactive`: prompts the operator with explicit clarification questions plus recommended options, supports `edit`/`e` rounds for answer refinement, accepts `yes`/`y` to continue, records the interactive trace in `clarification_dialog.json`, and then proceeds using the resolved answers as runtime context. This works in a real terminal and also with piped stdin for scripted demos.
- The runtime log now prints concise mode banners and does not print the full clarifier prompt template.
Clarifier option source behavior (interactive mode only):
- `--clarifier-option-source backend` (default): options come from the configured backend clarifier
- `--clarifier-option-source heuristic`: backend clarifier calls are bypassed and deterministic heuristic options are used
Run with the default local development backend:
python3 ci_loop.py run-all --max-retries 2
Run explicitly with the remote CI backend:
python3 ci_loop.py run-all --backend openai_responses_api --max-retries 2
If you want to force the local path explicitly, you can still pass:
python3 ci_loop.py run-all --backend codex --max-retries 2
Generate context and any clarification artifacts without attempting a repair:
python3 ci_loop.py build-context --scenario scenario_2_wrong_fix_path
python3 ci_loop.py plan-clarification --scenario scenario_2_wrong_fix_path
python3 ci_loop.py plan-clarification --scenario scenario_4_low_confidence
Low-confidence artifact demo:
cat output/scenario_4_low_confidence/clarification_request.json | jq .
cat output/scenario_4_low_confidence/scenario_proposal.json | jq .
# interactive mode only:
cat output/scenario_4_low_confidence/clarification_dialog.json | jq .
Approve a reviewed proposal into `test_scenarios/`:
python3 ci_loop.py approve-scenario-proposal --scenario scenario_2_wrong_fix_path
With the default `codex` backend, `run` and `run-all` work without an `OPENAI_API_KEY` when the Codex CLI is available. The backup `openai_responses_api` route still requires a valid `OPENAI_API_KEY` in `.env` or the shell environment, plus network access to the OpenAI Responses API.
The shared repair-review prompt lives in ci_gatekeeper_reviewer.prompt. The shared interactive clarification template lives in ci_gatekeeper_clarifier.prompt.
Both backends use the repair-review prompt:
- `codex` uses it as the worker prompt passed to `codex exec`
- `openai_responses_api` uses the same prompt body as the repair request sent through the Responses API
That keeps the repair stance in one file instead of duplicating prompt logic across backends. The clarification template keeps reverse-prompting behavior consistent: recommended options first, explicit alternatives, and free-text fallback.
Use codex as the developer-side gate before code leaves the laptop. The intended pattern is:
- A developer changes code locally.
- A local hook or manual command runs `python3 ci_loop.py run-all --backend codex --max-retries 2`.
- Codex proposes a minimal patch and the loop validates it immediately.
- The developer only proceeds to commit if the generated repair path actually validates.
This is the right backend for a pre-commit or pre-push style workflow because it keeps the loop close to the developer, produces a readable response.md artifact for local inspection, and now has a tracked Git hook install path in this repo.
The intended deployment model for openai_responses_api is a CI job such as Jenkins. A typical productionized flow would be:
- A commit lands on the remote repository.
- Jenkins or another CI system triggers a build on a UAT branch, prod-tagged branch, or similar protected release path.
- The build runs `python3 ci_loop.py run-all --backend openai_responses_api --max-retries 2` or a scenario-specific command.
- The loop generates a candidate fix, validates it, and decides pass or fail.
- The CI job reports success only if the accepted change satisfies the validation layer.
In a real pipeline, you would usually add lint, static analysis, and security checks alongside the tests already shown here.
The current repo state has been re-verified on both backends:
- Broken baseline checks fail for all four scenarios, which is the intended demo starting state.
- `python3 ci_loop.py run-all --backend openai_responses_api --max-retries 2` passes end to end with `openai_responses_api`.
- `python3 ci_loop.py run-all --backend codex --max-retries 2` passes end to end with `codex`.
That means both the remote CI path and the local developer path are currently working in this repo.
Each scenario writes artifacts under output/<scenario>/:
- `context.txt`
- backend-specific raw artifact such as `response.json` or `response.md`
- `patch.diff`
- optional `clarification_request.json`
- optional `scenario_proposal.json`
- optional `clarification_dialog.json` (interactive mode only)
Use these concrete files when demoing so people can inspect actual artifacts instead of imagining the flow:
- context snapshot: output/scenario_4_low_confidence/context.txt
- codex raw output: output/scenario_4_low_confidence/response.md
- responses API raw output: output/scenario_4_low_confidence/response.json
- rendered patch: output/scenario_4_low_confidence/patch.diff
- low-confidence request artifact: output/scenario_4_low_confidence/clarification_request.json
- recurring-scenario draft artifact: output/scenario_4_low_confidence/scenario_proposal.json
- interactive clarification trace: output/scenario_4_low_confidence/clarification_dialog.json
These screenshots map 1:1 to the artifact files above:
- Context snapshot

- Codex raw output

- Responses API raw output

- Rendered patch

- Low-confidence request artifact

- Recurring-scenario draft artifact

- Interactive clarification trace

- `context.txt`: the failure-driven input snapshot sent to the model for that scenario. It includes the normalized failure record, raw failure output, dynamically discovered local code context, bounded recent repo delta when relevant, matched or candidate `test_scenarios/` knowledge when confidence warrants it, optional clarification metadata when the loop is blocked, and the static scenario fallback files.
- `response.json`: the raw OpenAI Responses API output for the `openai_responses_api` backend.
- `response.md`: the raw backend log for the implemented `codex` backend.
- `patch.diff`: the unified diff rendered locally from backend output. Use this to review, apply, or discuss the concrete code change.
- `clarification_request.json`: the confidence-gated question set the operator should review before allowing a low-confidence repair path.
- `scenario_proposal.json`: an auto-drafted `test_scenarios/` candidate that still requires explicit approval before it becomes durable repo knowledge.
- `clarification_dialog.json`: full interactive trace with suggested options, selected inputs, answer revisions, backend source, and response-thread ids (for `openai_responses_api`). The Responses API clarifier path chains `previous_response_id` across question turns so option generation retains conversation context.
- Open `context.txt` to see the exact failure-driven input state.
- Open the backend-specific raw artifact to inspect the generator output: `response.json` for `openai_responses_api`, `response.md` for `codex`.
- Open `patch.diff` to inspect the exact code change the loop will apply.
- Run `python3 ci_loop.py apply --scenario <scenario>` to apply `patch.diff`.
- Run `python3 ci_loop.py test --scenario <scenario>` to validate the patched result.
Short version:
- `context.txt` = failure record + failure output + relevant code input + recent repo delta + matched or candidate scenario knowledge
- `response.json` or `response.md` = raw backend output, depending on the selected backend mode
- `patch.diff` = concrete code change derived from that output
- `clarification_request.json` = stop-and-review signal for low-confidence failures
- `scenario_proposal.json` = reviewable recurring-scenario draft, not durable state yet
- `clarification_dialog.json` = detailed Q&A transcript for auditability and debugging
Verified current artifact targets:
- `scenario_1_integration_bug` -> `user_store.py`
- `scenario_2_wrong_fix_path` -> `user_registry.py`
- `scenario_3_refactor_bug` -> `orders.py`
These artifacts are intentionally kept in the repo so you have a fallback demo trail even if the live API call fails on stage.
Build context:
python3 ci_loop.py build-context --scenario scenario_3_refactor_bug
Generate and apply automatically:
python3 ci_loop.py generate-patch --scenario scenario_3_refactor_bug
python3 ci_loop.py apply --scenario scenario_3_refactor_bug
python3 ci_loop.py test --scenario scenario_3_refactor_bug
Or inspect the remote backup API path directly:
curl https://api.openai.com/v1/responses \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
"model": "gpt-4.1",
"instructions": "Return only strict JSON. Do not include markdown fences, prose, or commentary.",
"input": [
{
"role": "user",
"content": [
{
"type": "input_text",
"text": "Return only JSON matching this schema: {\"edits\":[{\"path\":\"relative/path.py\",\"content\":\"full updated file contents\"}]}. Only include files that need to change.\n\nFix the failing tests in tests/test_scenario_3_refactor_bug.py. Do not modify tests. Do not change the pricing.calculate_total contract. Only edit the minimum code needed, preferably in orders.py. Preserve the refactored percentage-based tax semantics.\n\nRepository context follows.\n\n'"$(jq -Rs . < output/example_scenario/context.txt | sed 's/^"//; s/"$//')"'"
}
]
}
]
}' | tee output/scenario_3_refactor_bug/response.json
For actual patch rendering/apply/validation, use the supported CLI path:
python3 ci_loop.py generate-patch --scenario scenario_3_refactor_bug --backend openai_responses_api
python3 ci_loop.py apply --scenario scenario_3_refactor_bug
python3 ci_loop.py test --scenario scenario_3_refactor_bug
- Expose this CI gatekeeper utility as a reusable skill inside Claude Code so teams can invoke it natively from their local agent workflow.
- Package this repository as a Claude Code plugin so the gate can be installed and invoked as a first-class plugin capability.
The Codex CLI is the default local path. The Responses API call is the backup route, used when the loop runs inside an enterprise CI setup (for example, on Jenkins-type servers) or when you explicitly choose `openai_responses_api`.
The model is not the system. The loop is the system.
See PLAYBOOK.md for the demo walkthrough and DEMO-COMMANDS.md for the live command sequence.
