CC4: train-time env diversity produces ZSC to held-out red agents at 10M timesteps #20

Draft
PaulHax wants to merge 17 commits into main from cec

@PaulHax PaulHax commented May 12, 2026

Summary

Replication of the Jha et al. (2025) cross-environment cooperation claim — "Cross-environment Cooperation Enables Zero-shot Multi-agent Coordination", arXiv:2504.12714 — on the CAGE Challenge 4 environment.

The paper's claim: training-time environment diversity (with the partner held constant) is sufficient to produce zero-shot coordination with held-out partners. We test the same mechanism on CC4 with the held-out axis being the adversarial red agent rather than a cooperative ally.

Headline: The replication confirms transfer at the cocktail level (Δ +354 to +593 across the four evaluation reds). Three-seed single-axis ablations then decompose the cocktail into per-axis contributions: mission diversity is the dominant carrier (+373 mean Δ alone, 12/12 cells positive), phase-boundary and crown-jewel each contribute small positive amounts (+95 and +106), and the topology bank is actively harmful (−227, only 1/12 cells positive). The cocktail's value is approximately the sum of these axes (+347 summed vs +409 observed in the ablation eval), so linear additivity predicts that removing topology would beat the full cocktail. An amplitude-matched control rules out the alternative "stronger gradient" hypothesis: the mechanism is diversity, not signal strength.

Setup

Blue policy is IPPO (shared trunk, 256×2 tanh, identical hyperparameters across arms). Training partner is fixed to fsm (the cc4_stock variant default). The two arms differ only in what gets sampled at env reset:

| reset-time knob | C00 (control) | C11 (cocktail) |
|---|---|---|
| host counts | CC4-default random (servers 1-6, users 3-10 per subnet per reset) | 16-snapshot bank (op-zone A servers ∈ {2,3,5}, op-zone B ∈ {4,5,6}; fixed per snapshot) |
| router adjacency | fixed 9-subnet CC4 graph | fixed 9-subnet CC4 graph (identical to C00) |
| mission (LWF,ASF,RIA) | (1,1,1) | anti-correlated bank of 4: each non-baseline boosts 2 of 3 components |
| phase boundaries | (0,167,333) | bank of 4 jittered splits |
| crown-jewel rewards | canonical | bank of 6 subnet-priority rotations |

Important note on what really differs between arms: C00 is NOT a fixed canonical env. CC4's default generative mode randomizes per-subnet server counts ([1, 6]) and user counts ([3, 10]) on every reset, so C00 already trains on a continuously varied host-count distribution. The C11 "topology bank" samples from just 16 fixed snapshots instead — far fewer distinct topologies than C00's fresh-every-reset draw — so on this axis C00 is the more varied arm. (The bank's op-zones are larger on average, but that is host magnitude, not variety.) Subnet adjacency is identically fixed in both arms.

The four C11 banks are listed below. The cocktail's actual added diversity over C00 is concentrated in the three semantic axes (mission, phase timing, and crown-jewel rotation), which C00 holds fixed:

  1. Host-count bank (16 snapshots). Replaces continuous random host-count sampling with 16 pre-generated shapes. Op-zone A servers span {2,3,5} and op-zone B span {4,5,6} (9 distinct A/B pairs across the 16 entries); total hosts range 88–100. This is fewer distinct topologies than C00's generative draw, not more.
  2. Mission bank (4 anti-correlated entries). The (LWF, ASF, RIA) reward multipliers control which mission components dominate the episode return. Entries are (1,1,1), (3,3,1), (1,3,3), (3,1,3) — every non-baseline entry boosts exactly two of three components. Under the anti-correlated set, the "loud signal" rotates and the policy has to read state.
  3. Phase-boundary bank (4 jittered splits). Standard CC4 episode phases at steps (0, 167, 333); bank also includes (0, 100, 300), (0, 200, 400), (0, 150, 250). Phase-conditioned policy behavior can no longer key off absolute step counts.
  4. Phase-rewards bank (6 crown-jewel rotations). Different subnets become high-value at different phases.

Per reset, C11 samples one entry from each bank independently.
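The independent per-bank draw can be sketched as follows. This is a minimal stdlib stand-in, not the PR's implementation (which samples through JAX PRNG keys in `ScenarioEnv._select_const`). The mission and phase-boundary entries are copied from the list above; the function names, the uniform-sampling assumption, and the use of plain indices for the topology and crown-jewel banks are illustrative assumptions.

```python
import random

# Mission and phase-boundary entries are copied from the PR text.
MISSION_BANK = [(1, 1, 1), (3, 3, 1), (1, 3, 3), (3, 1, 3)]
PHASE_BOUNDARY_BANK = [(0, 167, 333), (0, 100, 300), (0, 200, 400), (0, 150, 250)]
NUM_TOPOLOGY_SNAPSHOTS = 16    # bank size only; snapshot contents are not modeled
NUM_CROWN_JEWEL_ROTATIONS = 6  # bank size only


def sample_reset_config(rng: random.Random) -> dict:
    """Draw one entry per bank; each draw is independent of the others.

    Uniform sampling is an assumption of this sketch.
    """
    return {
        "topology_index": rng.randrange(NUM_TOPOLOGY_SNAPSHOTS),
        "mission": rng.choice(MISSION_BANK),
        "phase_boundaries": rng.choice(PHASE_BOUNDARY_BANK),
        "crown_jewel_index": rng.randrange(NUM_CROWN_JEWEL_ROTATIONS),
    }


def num_joint_configs() -> int:
    """Count distinct joint configurations under independent sampling."""
    return (NUM_TOPOLOGY_SNAPSHOTS * len(MISSION_BANK)
            * len(PHASE_BOUNDARY_BANK) * NUM_CROWN_JEWEL_ROTATIONS)


config = sample_reset_config(random.Random(0))
print(config)
print(num_joint_configs())  # 16 * 4 * 4 * 6 = 1536
```

Under these assumptions the cocktail can present 1536 distinct joint configurations, although only 16 of the factors come from the host-count axis.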

Both arms train against fsm only. Evaluation is on the canonical env against four reds: fsm (in-distribution sanity), and cia_c, cia_i, cia_a (CIA-biased role-targeting reds — held out).

Pre-registered verdict thresholds (paired Δ = C11 mean − C00 mean across matched seeds):

  • CONFIRMED: Δ ≥ +200 reward AND lower bound (mean − stderr) > 0
  • REFUTED: Δ ≤ +50 OR sign flip across seeds
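A minimal sketch of how these thresholds could be applied to per-seed paired deltas. It is not the code in `scripts/dev/cec_phase6_aggregate.py`; the function name is hypothetical, and three choices are assumptions: the sample standard deviation (n − 1 denominator) for the stderr, "sign flip" read as per-seed deltas that do not all share a sign, and the REFUTED check taking precedence. The INCONCLUSIVE label for everything in between follows the aggregator description in the commit log.

```python
import math
import statistics


def paired_verdict(deltas, confirm_bar=200.0, refute_bar=50.0):
    """Return (mean, stderr, lower_bound, verdict) for per-seed paired deltas."""
    mean = statistics.mean(deltas)
    # Sample standard deviation (n - 1 denominator) is an assumption.
    stderr = statistics.stdev(deltas) / math.sqrt(len(deltas))
    lower = mean - stderr
    # Assumed reading of "sign flip": the per-seed deltas straddle zero.
    sign_flip = min(deltas) < 0 < max(deltas)
    if mean <= refute_bar or sign_flip:
        verdict = "REFUTED"
    elif mean >= confirm_bar and lower > 0:
        verdict = "CONFIRMED"
    else:
        verdict = "INCONCLUSIVE"
    return mean, stderr, lower, verdict


# Per-seed deltas for fsm, copied from the results table in this PR.
mean, stderr, lower, verdict = paired_verdict([431, 868, 480])
print(round(mean), round(stderr), round(lower), verdict)  # 593 138 455 CONFIRMED
```

With the n − 1 convention the fsm and cia_c rows of the results table are reproduced exactly; the cia_i and cia_a stderr values come out one unit lower, presumably because the per-seed deltas in the table are already rounded.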

Result — full cocktail (C11 vs C00)

3 training seeds × 90 eval episodes per (arm, seed, red) on the canonical env.

Paired Δ (C11 − C00):

| eval red | Δ mean | ± stderr | lower bound | per-seed deltas |
|---|---|---|---|---|
| fsm (training partner) | +593 | 138 | +455 | [+431, +868, +480] |
| cia_c | +429 | 100 | +329 | [+230, +552, +504] |
| cia_i | +383 | 115 | +268 | [+158, +532, +459] |
| cia_a | +354 | 104 | +250 | [+147, +465, +449] |

All four cross the pre-registered +200 confirm bar with lower bound > 0 and no sign flips across the three training seeds.

Per-arm mean reward (the underlying numbers):

| red | sleep baseline | C00 mean | C11 mean | Δ | C00 vs sleep | C11 vs sleep |
|---|---|---|---|---|---|---|
| fsm | −6565 | −2056 | −1463 | +593 | +4509 | +5102 |
| cia_c | −4684 | −1827 | −1399 | +429 | +2857 | +3285 |
| cia_i | −4739 | −1810 | −1427 | +383 | +2929 | +3312 |
| cia_a | −4157 | −1727 | −1373 | +354 | +2430 | +2784 |

Both arms clearly beat the sleep-policy floor across every red. The cocktail-trained policy beats the canonical-trained baseline by +354 to +593 on top of that.

Ablation — which axis carries the signal?

To identify which of the four C11 banks actually drives the transfer effect, we trained single-axis arms (each with one bank active, others off) and evaluated under the same conditions. 3 training seeds × 30 eval episodes per (arm, seed, red), stochastic eval.

3-seed paired Δ per single-axis ablation:

| ablation | fsm | cia_c | cia_i | cia_a | mean Δ | cells +pos |
|---|---|---|---|---|---|---|
| miss (mission bank only) | +516 | +351 | +350 | +276 | +373 | 12 / 12 |
| cjewel (crown-jewel only) | +123 | +110 | +89 | +101 | +106 | 11 / 12 |
| pbound (phase-boundary only) | +70 | +161 | +69 | +81 | +95 | 8 / 12 |
| topo (topology bank only) | −157 | −232 | −255 | −265 | −227 | 1 / 12 |

Decomposition of the cocktail:

| arm | fsm | cia_c | cia_i | cia_a | mean Δ | cells +pos |
|---|---|---|---|---|---|---|
| C11 (full cocktail) | +435 | +421 | +412 | +368 | +409 | 11 / 12 |
| C11prime (amp-matched mission) | +441 | +384 | +364 | +330 | +380 | 12 / 12 |
| sum of single axes | +552 | +390 | +253 | +193 | +347 | |

The full cocktail's mean Δ is close to the linear sum of single-axis effects (+409 observed vs +347 if simply additive). The axes don't show strong synergy; the cocktail's value is roughly the sum of a big mission contribution and two small positive contributions from phase-boundary and crown-jewel, minus a substantial topology cost.

Mission alone captures 87-94% of the full cocktail benefit. The remaining gap is filled by the small positive contributions of cjewel and pbound, partially offset by topology's drag.

Why is topology hurting? C00's canonical config draws a fresh random topology every reset (per-subnet servers 1–6, users 3–10, independent across 8 subnets). The "topology bank" replaces this with just 16 fixed snapshots — far fewer distinct topologies than the control sees. The op-zones in those snapshots are larger on average (op-zone B ~5 servers vs the generative mean ~3.5), but variety, not size, is what a diversity ablation tests — and on that axis the bank is narrower. The label is a misnomer: it constrained topological variety rather than expanding it.
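A rough count makes the variety gap concrete. The sketch below assumes each of the 8 subnets draws its server and user counts independently and uniformly from the inclusive ranges quoted above; the uniformity assumption and the variable names are illustrative, not taken from the CC4 generator.

```python
# Assumption: independent, uniform integer draws over inclusive ranges.
SERVER_RANGE = range(1, 7)   # servers 1..6
USER_RANGE = range(3, 11)    # users 3..10
NUM_SUBNETS = 8
BANK_SIZE = 16               # fixed snapshots in the topology bank

per_subnet_shapes = len(SERVER_RANGE) * len(USER_RANGE)
generative_shapes = per_subnet_shapes ** NUM_SUBNETS
mean_servers = sum(SERVER_RANGE) / len(SERVER_RANGE)

print(per_subnet_shapes)             # 48 (servers, users) pairs per subnet
print(generative_shapes > 10 ** 13)  # True: far more shapes than the bank holds
print(mean_servers)                  # 3.5, the generative mean quoted above
print(generative_shapes // BANK_SIZE)
```

Under these assumptions the control arm can reach more than 10^13 distinct host-count shapes, against 16 in the bank.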

Predicted improvement from removing topology: if the linear additivity holds, a cocktail with mission + cjewel + pbound (and no topology bank) would predict ~+574 Δ — meaningfully above the full cocktail's +409. Untested.
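The additivity arithmetic can be checked directly from the ablation table. This sketch only re-adds the reported per-red deltas; the helper names are illustrative, and the prediction remains the linear-additivity assumption stated above.

```python
REDS = ["fsm", "cia_c", "cia_i", "cia_a"]

# 3-seed paired deltas per single-axis ablation, copied from the table above.
SINGLE_AXIS = {
    "miss": [516, 351, 350, 276],
    "cjewel": [123, 110, 89, 101],
    "pbound": [70, 161, 69, 81],
    "topo": [-157, -232, -255, -265],
}
C11_OBSERVED = [435, 421, 412, 368]


def additive_prediction(axes):
    """Per-red sum of single-axis deltas for the chosen axes."""
    return [sum(SINGLE_AXIS[a][i] for a in axes) for i in range(len(REDS))]


def mean(values):
    return sum(values) / len(values)


all_axes = additive_prediction(["miss", "cjewel", "pbound", "topo"])
no_topo = additive_prediction(["miss", "cjewel", "pbound"])

print(all_axes, round(mean(all_axes)))  # [552, 390, 253, 193] 347
print(no_topo, round(mean(no_topo)))    # [709, 622, 508, 458] 574
print(round(mean(C11_OBSERVED)))        # 409
```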

Falsifying the diversity-vs-gradient confound

C11's mission bank {(1,1,1), (3,3,1), (1,3,3), (3,1,3)} boosts two of three mission components on three of four entries. Per-entry magnitude sums to (3, 7, 7, 7) — the "diversity" intervention also amplified the reward signal. The +409 cocktail Δ was therefore confounded between "more env variety" and "stronger gradient."

We trained C11prime (3 seeds) with the mission bank rewritten to match per-entry reward magnitude exactly but rotate a single component instead of two:

| arm | mission bank entries | per-entry sum |
|---|---|---|
| C11 | [1,1,1], [3,3,1], [1,3,3], [3,1,3] | 3, 7, 7, 7 |
| C11prime | [1,1,1], [5,1,1], [1,5,1], [1,1,5] | 3, 7, 7, 7 |

At full 3-seed power, C11prime produces mean Δ +380 (12/12 cells positive), within 7% of C11's +409. Amplitude-matched diversity reproduces the headline effect almost exactly — the result is not an artifact of stronger gradients. The mechanism is environment variety.
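The amplitude match between the two banks can be verified mechanically. The bank entries below are copied from the table above; the helper names are illustrative.

```python
C11_BANK = [(1, 1, 1), (3, 3, 1), (1, 3, 3), (3, 1, 3)]
C11PRIME_BANK = [(1, 1, 1), (5, 1, 1), (1, 5, 1), (1, 1, 5)]


def per_entry_sums(bank):
    """Total reward multiplier per bank entry (the 'amplitude')."""
    return [sum(entry) for entry in bank]


def boosted_counts(bank):
    """Number of components boosted above the baseline multiplier of 1."""
    return [sum(1 for m in entry if m > 1) for entry in bank]


print(per_entry_sums(C11_BANK))       # [3, 7, 7, 7]
print(per_entry_sums(C11PRIME_BANK))  # [3, 7, 7, 7]
print(boosted_counts(C11_BANK))       # [0, 2, 2, 2]
print(boosted_counts(C11PRIME_BANK))  # [0, 1, 1, 1]
```

Both banks carry the same per-entry magnitude; they differ only in how many components each non-baseline entry boosts.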

Reframed conclusion

  1. The headline +354..+593 Δ replicates across all four evaluation reds (the training partner fsm plus the three held-out CIA reds). Both arms clearly beat the no-defense sleep floor; the cocktail clearly beats canonical.
  2. The cocktail is a near-linear sum of per-axis effects, not a multiplicative interaction. Mission alone (+373) is the dominant contributor (87-94% of the cocktail). Phase-boundary (+95) and crown-jewel (+106) each add small positive amounts. Topology (-227) actively subtracts. Sum of single axes: +347 ≈ observed cocktail +409.
  3. The topology bank actively harms transfer. Constraining op-zone counts to a 16-snapshot subset is less varied than CC4-default continuous sampling, and the resulting policy underperforms the control by 157-265 reward across reds (only 1 of 12 seed×red cells positive).
  4. Diversity, not amplified reward, drives the effect. 3-seed C11prime (amplitude-matched mission) at +380 reproduces C11's +409 within 7% — stronger-gradient hypothesis falsified.
  5. The cocktail is suboptimal. Linear additivity predicts that "C11 minus topo" would deliver ~+574 Δ — substantially better than the full cocktail. Untested but a clean prediction.

Side observations

  1. Cocktail diversity reduces cross-seed variance. Per-seed Δs are tighter on the easier held-out reds (cia_c range 322, cia_a range 318) than on fsm (range 437), but every seed-pair is positive for every red.
  2. C11 also beats C00 on the trained partner. Δ on fsm is +593, the largest of the four. Env diversity is not trading in-distribution performance for transfer here — it improves both.
  3. Held-out reds are role-targeting variants of one shared attack policy, not three independent partners. cia_c, cia_i, cia_a all share _CIA_PROB_MATRIX and differ only in target-role weights. The four evaluation rows are best read as one trained partner plus three priors over one role-biased family.
  4. Env shape differs between fsm eval and CIA-red eval. CIA reds need resilience_roles=True for their role-tag selectors, which forces op_zone_servers=3. fsm eval uses resilience_roles=False with CC4-default random counts. Within-red paired Δ is still valid; only absolute rewards across reds are non-comparable.
  5. Mid-training checkpoints often beat the final. Checkpoint-by-checkpoint eval of C11_10M_seed42 (1.2M → 10M steps in 1.2M increments) shows transfer Δ peaks around 3.6M and dips slightly at 9.6M / 10M. Single seed only, but the pattern appears on both fsm and cia_a — possibly mild late-training drift.

Action distribution — what behavior does each arm learn?

Probe of stochastic eval (seed 42 checkpoints, 90 episodes × 5 blue agents × 500 steps, on fsm; cia_a results were qualitatively identical):

C00 vs C11 (full cocktail):

| action | C00 | C11 | Δ pp |
|---|---|---|---|
| Restore | 6.0% | 12.2% | +6.3 |
| BlockTraffic | 7.9% | 4.0% | −3.9 |
| AllowTraffic | 13.0% | 16.3% | +3.3 |
| Analyse | 24.3% | 21.0% | −3.3 |
| Decoy | 25.8% | 22.3% | −3.4 |
| Remove | 19.8% | 19.5% | −0.2 |
| Monitor | 1.8% | 2.8% | +1.0 |
| Sleep | 1.5% | 1.8% | +0.4 |

C00 learns a cheap "block traffic + kill sessions + lots of decoys" defense. C11 learns "restore hosts fully + keep the network open + slightly less decoy spam."
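The per-action shifts can be recomputed from the table with a few lines. This is an illustrative sketch on the rounded percentages shown above, so individual deltas may differ from the reported Δ pp column by 0.1; the helper names are not from the PR's probe script.

```python
# Action frequencies in percent, copied from the C00 vs C11 table above.
C00 = {"Restore": 6.0, "BlockTraffic": 7.9, "AllowTraffic": 13.0, "Analyse": 24.3,
       "Decoy": 25.8, "Remove": 19.8, "Monitor": 1.8, "Sleep": 1.5}
C11 = {"Restore": 12.2, "BlockTraffic": 4.0, "AllowTraffic": 16.3, "Analyse": 21.0,
       "Decoy": 22.3, "Remove": 19.5, "Monitor": 2.8, "Sleep": 1.8}


def delta_pp(base, other):
    """Percentage-point change per action, rounded to one decimal."""
    return {action: round(other[action] - base[action], 1) for action in base}


def largest_shifts(deltas, k=3):
    """Actions with the k largest absolute shifts, biggest first."""
    return sorted(deltas, key=lambda action: abs(deltas[action]), reverse=True)[:k]


deltas = delta_pp(C00, C11)
print(deltas["Restore"], deltas["BlockTraffic"])  # 6.2 -3.9
print(largest_shifts(deltas))  # ['Restore', 'BlockTraffic', 'Decoy']
```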

C00 vs miss alone — does the mission-only arm reproduce C11's behavior? Yes, ~75% of the intensity:

| action | C00 | miss | Δ pp |
|---|---|---|---|
| Restore | 6.0% | 10.6% | +4.6 |
| BlockTraffic | 8.1% | 5.5% | −2.5 |
| AllowTraffic | 13.1% | 12.6% | −0.5 |
| Decoy | 25.7% | 25.0% | −0.6 |

Mission-only training recovers the Restore-up, BlockTraffic-down signature. AllowTraffic doesn't shift here — that piece of the strategy may come from interaction with the other banks.

C11 vs miss head-to-head: they play almost identically. All action deltas are ≤ 2.4 pp except AllowTraffic (miss plays it 3.7 pp less). The "mission carries the effect" finding is mirrored in policy behavior, not just reward.

C00 vs topo — topology learns a different, worse defense:

| action | C00 | topo | Δ pp |
|---|---|---|---|
| Restore | 6.0% | 7.0% | +1.0 |
| Remove | 19.6% | 22.0% | +2.4 |
| Decoy | 25.7% | 27.4% | +1.8 |
| BlockTraffic | 8.1% | 4.3% | −3.7 |

Topology-only doesn't learn the Restore-heavy pattern. Instead it leans further into kill-sessions + decoy-spam — a doubled-down version of C00's strategy, which transfers worse (Δ reward −380 on fsm). Different defensive choice, not just degraded C11.

The headline: mission diversity teaches the policy to prefer host-restore over session-kill defense, and that preference transfers across held-out reds. Topology bank teaches the opposite.

Scale sensitivity

The same cocktail at 3M training steps produces Δs of only +20 to +102 and is REFUTED on all four reds (the per-seed results flip sign). 10M steps converges on the transferable representation; 3M is not enough at this LR schedule.

Open questions

  • "C11 minus topo" arm. Linear additivity predicts ~+574 mean Δ — substantially better than the +409 full cocktail. Would be the cleanest test of the "topology drag" finding.
  • 5M training budget for the miss-only arm. Mid-training checkpoints often outperform 10M (checkpoint-curve eval) — a recalibrated 5M LR schedule may match or beat 10M at half the compute.

How to reproduce

# Train base arms × 3 seeds (~30 GPU-hr on one A6000):
./scripts/train/cec_phase6_optionb_ablation.sh --arm C00_10M
./scripts/train/cec_phase6_optionb_ablation.sh --arm C11_10M

# Train single-axis ablations × 3 seeds (~30 GPU-hr each arm):
./scripts/train/cec_phase6_optionb_ablation.sh --arm miss_10M
./scripts/train/cec_phase6_optionb_ablation.sh --arm topo_10M
./scripts/train/cec_phase6_optionb_ablation.sh --arm C11prime_10M

# CPU stoch eval × 4 held-out reds × all arms × 3 seeds:
for red in fsm cia_c cia_i cia_a; do
  JAX_PLATFORMS=cpu uv run python scripts/eval/cec_phase6_eval_sweep.py \
    --episodes 90 --seed 1000 --stochastic --reds $red \
    --arms C00_10M C11_10M miss_10M topo_10M C11prime_10M \
    --train-seeds 42 142 242 &
done; wait

# Aggregate (paired Δ + verdict):
PHASE6_ARMS="C00_10M C11_10M" uv run python scripts/dev/cec_phase6_aggregate.py

Test plan

  • Pre-registered verdict thresholds met for all four held-out reds on the full cocktail (3 seeds, 90 eps)
  • Mission-only ablation arm replicates 87-94% of cocktail benefit (3 seeds, 30 eps, 12/12 cells positive)
  • Topology-only ablation shown harmful (3 seeds, 30 eps, 11/12 cells negative)
  • Phase-boundary and crown-jewel single-axis arms confirmed small-positive (3 seeds each, +95 and +106 mean Δ)
  • Amplitude-matched mission bank falsifies the gradient-strength interpretation (3 seeds, mean Δ +380 vs C11's +409)
  • Action-distribution probe: miss reproduces C11's Restore-heavy strategy; topo learns a different worse Remove+Decoy strategy
  • Test the linear-additivity prediction: a "cocktail minus topology" arm should yield ~+574 Δ

PaulHax added 17 commits May 9, 2026 14:36
Adds the ``mission_bank`` / ``mission_bank_amplify`` recipe surface and
plumbs it through ``ScenarioEnv._select_const`` so each reset samples a
``(LWF, ASF, RIA)`` triple and post-multiplies ``const.phase_rewards``.
Mirrors the diversity branch's proven Phase 4b approach (post-multiply
at reset rather than carrying a per-state field), so ``SimulatorState``
shape and ``rewards.py`` are unchanged.

When ``mission_bank`` is omitted or empty, behavior matches today
exactly (no per-reset variation).  ``mission_bank_amplify`` scales the
*entire* sampled triple element-wise — amplify=10 with bank entry
(1, 3, 1) yields (10, 30, 10), documented in code and tests.
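The amplify contract can be illustrated with a small sketch. This is not the code in `ScenarioEnv._select_const`; the function names are hypothetical, and treating the last axis of `phase_rewards` as the (LWF, ASF, RIA) channels is an assumption made for illustration.

```python
def amplified_triple(triple, amplify=1.0):
    """Scale the whole sampled (LWF, ASF, RIA) triple element-wise."""
    return tuple(amplify * m for m in triple)


def post_multiply(phase_rewards, triple, amplify=1.0):
    """Multiply each reward cell by the amplified triple.

    phase_rewards is a nested list shaped [phase][subnet][3]; mapping the last
    axis to (LWF, ASF, RIA) is an assumption of this sketch.
    """
    scale = amplified_triple(triple, amplify)
    return [[[r * s for r, s in zip(cell, scale)] for cell in phase]
            for phase in phase_rewards]


print(amplified_triple((1, 3, 1), amplify=10))  # (10, 30, 10)
print(post_multiply([[[1.0, 2.0, 3.0]]], (1, 3, 1), amplify=10))
# [[[10.0, 60.0, 30.0]]]
```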

Pinned recipe API (per Phase 6 plan, do not rename — other streams
depend on these key names):

  train.mission_bank: list[list[float]]   # (LWF, ASF, RIA) triples
  train.mission_bank_amplify: float       # default 1.0

Threaded through both ``make_jax_env`` call sites in ``ippo_jax.py``
via ``MISSION_BANK`` / ``MISSION_BANK_AMPLIFY`` config keys.

Tests in ``tests/test_mission_bank_sampling.py`` cover determinism,
4-bin chi-square uniformity at α=0.01 across 10k keys, the single-entry
3× ASF channel scaling, the amplify-multiplies-the-whole-triple
contract, and the empty-bank fast path.
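A 4-bin uniformity check of the kind described can be sketched with the standard library alone. This is not the test in `tests/test_mission_bank_sampling.py`; the function names are hypothetical. The critical value 11.345 is the chi-square quantile for 3 degrees of freedom at α = 0.01.

```python
CHI2_CRITICAL_DF3_ALPHA_001 = 11.345  # chi-square critical value, df = 3, alpha = 0.01


def chi_square_uniform(counts):
    """Pearson chi-square statistic against a uniform expectation."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


def looks_uniform(counts, critical=CHI2_CRITICAL_DF3_ALPHA_001):
    """True when the statistic stays below the critical value."""
    return chi_square_uniform(counts) < critical


# Bin counts for 10k draws over a 4-entry bank (made-up numbers for illustration).
print(chi_square_uniform([2600, 2400, 2500, 2500]))  # 8.0
print(looks_uniform([2600, 2400, 2500, 2500]))       # True
print(looks_uniform([10000, 0, 0, 0]))               # False: bank collapsed to one entry
```

A collapsed bank, the failure mode the smoke test later in this log guards against, fails this check by a wide margin.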
16 pre-built topology snapshots covering router-adjacency, op-zone
sizing, and cross-segment allow-list variation. Plumbs train.topology_bank
through recipe → ippo_jax. Per-reset random index sampling already
supported by ScenarioEnv._select_const.

…6 / S3)

Lets eval choose a different red than training used (Phase 6 Test 2's
held-out-partner sweep). CLI > recipe.eval.red > recipe.eval.variant.
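The precedence rule can be read as a first-non-empty lookup. The sketch below is hypothetical: the function name and argument shapes are illustrative, and how a variant name is turned into a red agent is not specified in this commit message, so the function only reports which source won.

```python
def resolve_eval_red(cli_red=None, recipe_eval=None):
    """Return (source, value) using CLI > recipe.eval.red > recipe.eval.variant."""
    recipe_eval = recipe_eval or {}
    candidates = [
        ("cli", cli_red),
        ("recipe.eval.red", recipe_eval.get("red")),
        ("recipe.eval.variant", recipe_eval.get("variant")),
    ]
    for source, value in candidates:
        if value is not None:
            return source, value
    raise ValueError("no eval red specified on the CLI or in the recipe")


print(resolve_eval_red("cia_c", {"red": "fsm", "variant": "cc4_stock"}))
# ('cli', 'cia_c')
print(resolve_eval_red(None, {"variant": "cc4_stock"}))
# ('recipe.eval.variant', 'cc4_stock')
```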
# Conflicts:
#	scripts/train/algorithms/ippo_jax.py
#	src/jaxborg/recipe.py
Adds the Test 1 heuristic-spread spikes for both diversity axes and the
four Test 2 training recipes per cec-phase6-plan.md.

Spike scripts (Test 1, σ-ratio ≥ 1.5 gate):
- scripts/dev/cec_phase6_spike_axis_a.py — topology bank vs fixed shape
- scripts/dev/cec_phase6_spike_axis_b.py — mission bank vs fixed profile
Both use a sleep policy (policy-agnostic spread test) and the same
CEC_SPIKE_EPISODES / CEC_SPIKE_STEPS env-var pattern as phase5.

Recipes (Test 2, 2x2 factorial × 3 seeds × 3M timesteps):
- recipes/cec_phase6_C00.yaml — control (fixed shape, fixed mission)
- recipes/cec_phase6_C10.yaml — Axis A (16-shape bank, fixed mission)
- recipes/cec_phase6_C01.yaml — Axis B (fixed shape, 4-entry bank)
- recipes/cec_phase6_C11.yaml — both axes
All four train against fsm (cc4_stock); held-out red sweep happens at
eval time via eval_recipe.py --eval-red. mission_bank_amplify=1.0 per
the plan's pre-registered default.

Tests: tests/test_phase6_recipes.py covers loadability, plan citation,
TOPOLOGY_BANK length & on-disk presence, MISSION_BANK content & amplify.

Builds the env arm C11 trains on (both topology + mission banks active),
vmaps reset across 64 envs, runs a JIT'd 50-step rollout with random
actions, asserts no NaN/Inf in rewards or observations, and checks that
≥3 distinct topology snapshots and ≥3 distinct mission-multiplier triples
were sampled. Catches PRNG-splitting regressions where one bank silently
collapses to a singleton. Marked slow (CPU JIT compile ~5 min); part of
Phase 6's S1+S2+S3+S4 integration gate.

…of 3)

The 16-shape bank now emits totals {3, 6, 9} via per-zone floors
{(1,2), (3,3), (4,5)} so the AUTH/DB/WEB role candidate pool divides
evenly across each shape. Previous {4, 6, 8} totals split unevenly
(4 → 1+1+1+1 leaves a leftover; 8 → 3+2+3 biases roles).

build_topology now accepts op_zone_min_servers as int OR (a, b) tuple;
int form preserves legacy behavior. Variant chain unchanged.

Bank regenerated; spike + smoke + resilience tests green.

…y jitter, crown-jewel rotation

Three env-diversity additions to address Test 1's findings:

1. Anti-correlated mission profiles (P1). Replaces the default
   {(1,1,1),(3,1,1),(1,3,1),(1,1,3)} bank with
   {(1,1,1),(3,3,1),(1,3,3),(3,1,3)} — every non-baseline entry boosts 2 of 3
   components, so a "boost the loud one" memorization fails. Disambiguates
   "diversity itself helps" from "loud reward signal helps" (the Test 1
   axis-B σ=3.33 mechanical-scaling critique).

2. Phase-boundary jitter bank (P2). Per-reset sample of phase_boundaries
   from a 4-entry bank covering {canonical, short setup, long setup,
   short mid-phase}. Phase transitions, allow-list flips, and phase_rewards
   index switches all move with the sampled split — breaks "deploy decoys
   at step 167" memorization.

3. Crown-jewel rotation / phase_rewards bank (P3). Per-reset sample of an
   entire (MISSION_PHASES, NUM_SUBNETS, 3) phase_rewards array from a
   6-entry bank. Each entry rotates which subnet is high-value in which
   phase (OPS_A↔OPS_B swap; ADMIN priority; OFFICE priority; both-OPS
   alert; full rotation). Same physical topology produces different reward
   gradients per episode → forces the policy to read state instead of
   memorizing subnet indices. Direct fix for Axis A's null σ-ratio.
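To see why jittered boundaries break step-keyed behavior, the step-to-phase mapping can be sketched as below. The function name is hypothetical, and treating each boundary as the inclusive first step of its phase is an assumption; the boundary tuples are the bank entries listed in the PR description.

```python
from bisect import bisect_right


def phase_index(step, boundaries):
    """Index of the phase containing `step`.

    Assumes `boundaries` is sorted and each entry is the first step of a phase.
    """
    return bisect_right(boundaries, step) - 1


PHASE_BOUNDARY_BANK = [(0, 167, 333), (0, 100, 300), (0, 200, 400), (0, 150, 250)]

# The same absolute step lands in different phases depending on the sampled split.
print([phase_index(167, b) for b in PHASE_BOUNDARY_BANK])  # [1, 1, 0, 1]
print([phase_index(260, b) for b in PHASE_BOUNDARY_BANK])  # [1, 1, 1, 2]
```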

API: train.phase_boundary_bank: list[[int, int, int]] |
     train.phase_rewards_bank: bool (uses canonical bank) | list of arrays.
Both plumbed through ScenarioEnv → FsmRedCC4Env → make_jax_env → recipe →
ippo_jax. Empty/None preserves legacy behavior.

C01/C11 recipes updated to the anti-correlated mission bank.
axis-A spike now exercises all three new banks together (topology +
boundary + rewards) so the topology-variation signal can actually reach
the reward channel.

20 new tests in test_phase_boundary_bank.py + test_phase_rewards_bank.py;
886 unit tests pass total, no regressions.

…gregator

Cleanup:
- Drop C01 / C10 recipes — the per-axis ablations were dropped after
  Test 1 v2 found the σ-ratio gate was policy-mediated.
- Drop the cec_phase6_spike_axis_a / _b scripts — diagnostic served
  its purpose; results are committed in earlier commits.
- gitignore logs/ and remove old spike logs.
- Trim test_phase6_recipes.py ARMS to {C00, C11}.

Test 2 setup:
- C11 recipe now enables ALL FOUR banks: topology (16 shapes),
  mission (anti-correlated 4-entry), phase_boundary_bank (4-entry),
  phase_rewards_bank: true (canonical 6-entry crown-jewel rotation).
- C00 recipe re-documented as the canonical-config control.
- scripts/eval/cec_phase6_eval_jax.py: JAX-native held-out red eval
  (load checkpoint, vmap argmax rollouts, write JSONL row).
- scripts/train/cec_phase6_test2.sh: 6-job sbatch launcher
  (C00, C11) × (42, 142, 242). --dry-run prints commands.
- scripts/eval/cec_phase6_eval_test2.sh: 30-job eval orchestrator
  (6 ckpts × 5 reds), one sbatch per cell.
- scripts/dev/cec_phase6_aggregate.py: reads phase6_*.jsonl rows,
  emits paired-delta C11−C00 table per held-out red with the
  pre-registered CONFIRMED/REFUTED/INCONCLUSIVE verdict.

Run sequence (when GPUs free):
  1. ./scripts/train/cec_phase6_test2.sh         # 6 sbatch jobs, ~3 hr parallel
  2. ./scripts/eval/cec_phase6_eval_test2.sh     # 30 eval jobs, ~10 min parallel
  3. uv run python scripts/dev/cec_phase6_aggregate.py

Eval is rollouts-only: no gradient compute, no benefit from GPU. Drop
--gres=gpu:1, request --cpus-per-task=8 instead, pin JAX_PLATFORMS=cpu.
Frees the GPU pool for training/diagnostic work that actually needs it.

Per-cell wall ~5 min on CPU (~3 min JIT compile + sub-minute rollout).
30 cells fan out across the cluster, ~30 min total if 5+ run concurrently.

Slurm's --wrap puts content in a generated slurm_script run via /bin/sh
(dash on this system), which doesn't support 'set -o pipefail'. Switch
to 'set -eu' since none of these commands use pipes.

RandomSelectRedAgent is a CybORG-side construct and isn't in the JAX
red-selector REGISTRY, so cec_phase6_eval_jax.py crashes on --eval-red random.
Drop random from the default eval list / argparse choices / aggregator REDS
and document the CybORG-eval fallback. The 6 failed eval cells from the
prior sweep are no longer relevant; the 4 informative held-out reds (fsm,
cia_c, cia_i, cia_a) cover the Phase 6 ZSC question.

The 00:30:00 limit was tight for N=90 eps and insufficient for N=300:
the 12 C00 jobs in the prior CPU sweep all hit TIMEOUT at 30:18 with no
result row written. JIT compile ~3 min + 300-ep CPU rollout was >27 min.
01:30:00 gives headroom even if rollout scales worse than expected.

Adds C00_10M (control) and C11_10M (full 4-bank cocktail) recipes plus
four single-axis ablation recipes (topo / miss / pbound / cjewel) at 10M
training steps. New sequential CPU eval sweep amortizes JIT compile across
checkpoints by red. Aggregator decouples from hard-coded arm names so the
same script handles 3M, 10M, and ablation arms via PHASE6_ARMS env var.

At 10M training steps the cocktail produces a paired Δ of +286 to +404
reward on all four held-out reds (fsm / cia_c / cia_i / cia_a), all
crossing the pre-registered +200 confirm bar with lower bound > 0 and
no sign flips across 3 training seeds. The same cocktail at 3M was
REFUTED — undertraining, not a failed mechanism.

Files:
- recipes/cec_phase6_{C00,C11,topo,miss,pbound,cjewel}_10M.yaml
- scripts/train/cec_phase6_optionb_ablation.sh   (chained 1-GPU launcher)
- scripts/eval/cec_phase6_eval_sweep.py          (sequential CPU sweep)
- scripts/dev/cec_phase6_aggregate.py            (arm-decoupled, pooled)
- scripts/eval/cec_phase6_eval_test2.sh          (PHASE6_ARMS override)
- scripts/eval/cec_phase6_eval_jax.py            (JAX cache env defaults)
@PaulHax PaulHax marked this pull request as draft May 21, 2026 16:10