Adds the ``mission_bank`` / ``mission_bank_amplify`` recipe surface and plumbs it through ``ScenarioEnv._select_const`` so each reset samples a ``(LWF, ASF, RIA)`` triple and post-multiplies ``const.phase_rewards``. Mirrors the diversity branch's proven Phase 4b approach (post-multiply at reset rather than carrying a per-state field), so ``SimulatorState`` shape and ``rewards.py`` are unchanged. When ``mission_bank`` is omitted or empty, behavior matches today exactly (no per-reset variation).
``mission_bank_amplify`` scales the *entire* sampled triple element-wise: amplify=10 with bank entry (1, 3, 1) yields (10, 30, 10), documented in code and tests.
Pinned recipe API (per Phase 6 plan, do not rename; other streams depend on these key names):
    train.mission_bank: list[list[float]]   # (LWF, ASF, RIA) triples
    train.mission_bank_amplify: float       # default 1.0
Threaded through both ``make_jax_env`` call sites in ``ippo_jax.py`` via ``MISSION_BANK`` / ``MISSION_BANK_AMPLIFY`` config keys.
Tests in ``tests/test_mission_bank_sampling.py`` cover determinism, 4-bin chi-square uniformity at α=0.01 across 10k keys, the single-entry 3× ASF channel scaling, the amplify-multiplies-the-whole-triple contract, and the empty-bank fast path.
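The sample-then-post-multiply contract described above can be sketched in plain Python. This is only an illustration, not the JAX code in ``ScenarioEnv._select_const``: the helper names ``sample_multipliers`` and ``apply_multipliers`` are hypothetical, the stdlib ``random`` module stands in for JAX PRNG keys, and it assumes the last axis of ``phase_rewards`` is ordered (LWF, ASF, RIA).

```python
import random

# Default bank quoted later in this PR; each entry is an (LWF, ASF, RIA) triple.
MISSION_BANK = [(1.0, 1.0, 1.0), (3.0, 1.0, 1.0), (1.0, 3.0, 1.0), (1.0, 1.0, 3.0)]


def sample_multipliers(rng, bank, amplify=1.0):
    """Pick one triple uniformly and scale the whole triple by `amplify`.

    Returns None for an empty bank so the caller can skip the multiply
    (the "behavior matches today exactly" fast path).
    """
    if not bank:
        return None
    lwf, asf, ria = rng.choice(bank)
    return (lwf * amplify, asf * amplify, ria * amplify)


def apply_multipliers(phase_rewards, multipliers):
    """Post-multiply the last axis of a nested-list phase_rewards array.

    Assumes shape (phases, subnets, 3) with the last axis in
    (LWF, ASF, RIA) order; that ordering is an assumption of this sketch.
    """
    if multipliers is None:
        return phase_rewards
    return [
        [[value * m for value, m in zip(row, multipliers)] for row in phase]
        for phase in phase_rewards
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    triple = sample_multipliers(rng, MISSION_BANK, amplify=1.0)
    print(apply_multipliers([[[1.0, 1.0, 1.0]]], triple))
```

Doing the multiply once at reset, rather than storing the triple in per-step state, is what lets the state shape and the reward code stay unchanged.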
16 pre-built topology snapshots covering router-adjacency, op-zone sizing, and cross-segment allow-list variation. Plumbs train.topology_bank through recipe → ippo_jax. Per-reset random index sampling already supported by ScenarioEnv._select_const.
…6 / S3) Lets eval choose a different red than training used (Phase 6 Test 2's held-out-partner sweep). CLI > recipe.eval.red > recipe.eval.variant.
# Conflicts:
#	scripts/train/algorithms/ippo_jax.py
#	src/jaxborg/recipe.py
Adds the Test 1 heuristic-spread spikes for both diversity axes and the four Test 2 training recipes per cec-phase6-plan.md.
Spike scripts (Test 1, σ-ratio ≥ 1.5 gate):
- scripts/dev/cec_phase6_spike_axis_a.py: topology bank vs fixed shape
- scripts/dev/cec_phase6_spike_axis_b.py: mission bank vs fixed profile
Both use a sleep policy (policy-agnostic spread test) and the same CEC_SPIKE_EPISODES / CEC_SPIKE_STEPS env-var pattern as phase5.
Recipes (Test 2, 2x2 factorial × 3 seeds × 3M timesteps):
- recipes/cec_phase6_C00.yaml: control (fixed shape, fixed mission)
- recipes/cec_phase6_C10.yaml: Axis A (16-shape bank, fixed mission)
- recipes/cec_phase6_C01.yaml: Axis B (fixed shape, 4-entry bank)
- recipes/cec_phase6_C11.yaml: both axes
All four train against fsm (cc4_stock); held-out red sweep happens at eval time via eval_recipe.py --eval-red. mission_bank_amplify=1.0 per the plan's pre-registered default.
Tests: tests/test_phase6_recipes.py covers loadability, plan citation, TOPOLOGY_BANK length & on-disk presence, MISSION_BANK content & amplify.
Builds the env arm C11 trains on (both topology + mission banks active), vmaps reset across 64 envs, runs a JIT'd 50-step rollout with random actions, asserts no NaN/Inf in rewards or observations, and checks that ≥3 distinct topology snapshots and ≥3 distinct mission-multiplier triples were sampled. Catches PRNG-splitting regressions where one bank silently collapses to a singleton. Marked slow (CPU JIT compile ~5 min); part of Phase 6's S1+S2+S3+S4 integration gate.
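The "no bank silently collapses to a singleton" check can be illustrated without JAX. In this sketch two independently seeded stdlib generators stand in for split PRNG keys, and the function names are hypothetical; the real smoke test vmaps reset across 64 envs.

```python
import random


def sample_bank_indices(seed, num_envs, topo_size, mission_size):
    """Draw one topology index and one mission index per env.

    Two separately seeded streams play the role of two split PRNG keys,
    so the banks are sampled independently of each other.
    """
    topo_rng = random.Random(f"{seed}-topology")
    mission_rng = random.Random(f"{seed}-mission")
    topo = [topo_rng.randrange(topo_size) for _ in range(num_envs)]
    mission = [mission_rng.randrange(mission_size) for _ in range(num_envs)]
    return topo, mission


def assert_banks_not_collapsed(topo, mission, min_distinct=3):
    """Fail if either bank produced fewer than `min_distinct` distinct draws."""
    assert len(set(topo)) >= min_distinct, "topology bank collapsed"
    assert len(set(mission)) >= min_distinct, "mission bank collapsed"


if __name__ == "__main__":
    topo, mission = sample_bank_indices(seed=0, num_envs=64, topo_size=16, mission_size=4)
    assert_banks_not_collapsed(topo, mission)
    print(len(set(topo)), "topologies,", len(set(mission)), "mission triples")
```

A regression that reuses one key for every env would make each list constant, and the distinct-count assertion would trip immediately.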
…of 3)
The 16-shape bank now emits totals {3, 6, 9} via per-zone floors
{(1,2), (3,3), (4,5)} so the AUTH/DB/WEB role candidate pool divides
evenly across each shape. Previous {4, 6, 8} totals split unevenly
(4 → 1+1+1+1 leaves a leftover; 8 → 3+2+3 biases roles).
build_topology now accepts op_zone_min_servers as int OR (a, b) tuple;
int form preserves legacy behavior. Variant chain unchanged.
Bank regenerated; spike + smoke + resilience tests green.
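A minimal sketch of the int-or-tuple argument handling follows. The helper name is hypothetical, and it assumes the legacy int form means the same floor for both op-zones; the commit only states that the int form preserves legacy behavior.

```python
# Per-zone floors quoted in the commit message above.
FLOORS = [(1, 2), (3, 3), (4, 5)]
NUM_ROLES = 3  # AUTH / DB / WEB


def normalize_op_zone_min_servers(value):
    """Return an (a, b) pair of per-zone floors from an int or a 2-tuple."""
    if isinstance(value, bool):
        raise TypeError("op_zone_min_servers must be an int or an (a, b) pair")
    if isinstance(value, int):
        # Assumed legacy meaning: one floor applied to both op-zones.
        return (value, value)
    a, b = value  # raises ValueError if the sequence is not length 2
    return (int(a), int(b))


def splits_evenly(floors):
    """True when the total server count divides evenly across the roles."""
    return sum(normalize_op_zone_min_servers(floors)) % NUM_ROLES == 0


if __name__ == "__main__":
    for floors in FLOORS:
        print(floors, sum(floors), splits_evenly(floors))
```

Totals of 4 or 8 fail the same divisibility check, which matches the uneven splits the commit message describes.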
…y jitter, crown-jewel rotation
Three env-diversity additions to address Test 1's findings:
1. Anti-correlated mission profiles (P1). Replaces the default
{(1,1,1),(3,1,1),(1,3,1),(1,1,3)} bank with
{(1,1,1),(3,3,1),(1,3,3),(3,1,3)} — every non-baseline entry boosts 2 of 3
components, so a "boost the loud one" memorization fails. Disambiguates
"diversity itself helps" from "loud reward signal helps" (the Test 1
axis-B σ=3.33 mechanical-scaling critique).
2. Phase-boundary jitter bank (P2). Per-reset sample of phase_boundaries
from a 4-entry bank covering {canonical, short setup, long setup,
short mid-phase}. Phase transitions, allow-list flips, and phase_rewards
index switches all move with the sampled split — breaks "deploy decoys
at step 167" memorization.
3. Crown-jewel rotation / phase_rewards bank (P3). Per-reset sample of an
entire (MISSION_PHASES, NUM_SUBNETS, 3) phase_rewards array from a
6-entry bank. Each entry rotates which subnet is high-value in which
phase (OPS_A↔OPS_B swap; ADMIN priority; OFFICE priority; both-OPS
alert; full rotation). Same physical topology produces different reward
gradients per episode → forces the policy to read state instead of
memorizing subnet indices. Direct fix for Axis A's null σ-ratio.
API: train.phase_boundary_bank: list[[int, int, int]] |
train.phase_rewards_bank: bool (uses canonical bank) | list of arrays.
Both plumbed through ScenarioEnv → FsmRedCC4Env → make_jax_env → recipe →
ippo_jax. Empty/None preserves legacy behavior.
C01/C11 recipes updated to the anti-correlated mission bank.
axis-A spike now exercises all three new banks together (topology +
boundary + rewards) so the topology-variation signal can actually reach
the reward channel.
20 new tests in test_phase_boundary_bank.py + test_phase_rewards_bank.py;
886 unit tests pass total, no regressions.
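To see why sampled boundaries break step-count memorization, here is a small sketch of a phase lookup. It assumes each boundary is the first step of its phase, and the function name ``phase_at`` is hypothetical; the bank values are the ones listed in the PR description.

```python
from bisect import bisect_right

# Canonical split first, then the three alternatives listed in the PR description.
PHASE_BOUNDARY_BANK = [(0, 167, 333), (0, 100, 300), (0, 200, 400), (0, 150, 250)]


def phase_at(step, boundaries):
    """Return the 0-based phase index active at `step`.

    Assumes `boundaries` is sorted and each entry is the first step of
    its phase (an assumption of this sketch).
    """
    return bisect_right(boundaries, step) - 1


if __name__ == "__main__":
    # The same absolute step lands in different phases depending on the draw.
    for boundaries in PHASE_BOUNDARY_BANK:
        print(boundaries, "-> phase at step 167:", phase_at(167, boundaries))
```

Because the phase index also selects the allow-list and the phase_rewards slice, a policy keyed to an absolute step would act on the wrong phase under most draws.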
…gregator
Cleanup:
- Drop C01 / C10 recipes — the per-axis ablations were dropped after
Test 1 v2 found the σ-ratio gate was policy-mediated.
- Drop the cec_phase6_spike_axis_a / _b scripts — diagnostic served
its purpose; results are committed in earlier commits.
- gitignore logs/ and remove old spike logs.
- Trim test_phase6_recipes.py ARMS to {C00, C11}.
Test 2 setup:
- C11 recipe now enables ALL FOUR banks: topology (16 shapes),
mission (anti-correlated 4-entry), phase_boundary_bank (4-entry),
phase_rewards_bank: true (canonical 6-entry crown-jewel rotation).
- C00 recipe re-documented as the canonical-config control.
- scripts/eval/cec_phase6_eval_jax.py: JAX-native held-out red eval
(load checkpoint, vmap argmax rollouts, write JSONL row).
- scripts/train/cec_phase6_test2.sh: 6-job sbatch launcher
(C00, C11) × (42, 142, 242). --dry-run prints commands.
- scripts/eval/cec_phase6_eval_test2.sh: 30-job eval orchestrator
(6 ckpts × 5 reds), one sbatch per cell.
- scripts/dev/cec_phase6_aggregate.py: reads phase6_*.jsonl rows,
emits paired-delta C11−C00 table per held-out red with the
pre-registered CONFIRMED/REFUTED/INCONCLUSIVE verdict.
Run sequence (when GPUs free):
1. ./scripts/train/cec_phase6_test2.sh # 6 sbatch jobs, ~3 hr parallel
2. ./scripts/eval/cec_phase6_eval_test2.sh # 30 eval jobs, ~10 min parallel
3. uv run python scripts/dev/cec_phase6_aggregate.py
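The job counts in the run sequence follow from a simple cross product. The sketch below only enumerates the cells and does not reproduce the launcher scripts or their flags; the five red names are an assumption pieced together from a later commit in this PR that drops ``random`` and keeps the other four.

```python
from itertools import product

ARMS = ("C00", "C11")
SEEDS = (42, 142, 242)
# Assumed red list at this commit (see lead-in).
REDS = ("fsm", "cia_c", "cia_i", "cia_a", "random")

# One training job per (arm, seed); one eval cell per (arm, seed, red).
train_jobs = list(product(ARMS, SEEDS))
eval_cells = list(product(ARMS, SEEDS, REDS))

if __name__ == "__main__":
    print(len(train_jobs), "training jobs")
    print(len(eval_cells), "eval cells")
```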
Eval is rollouts-only — no gradient compute, no benefit from GPU. Drop --gres=gpu:1, request --cpus-per-task=8 instead, pin JAX_PLATFORMS=cpu. Frees the GPU pool for training/diagnostic work that actually needs it. Per-cell wall ~5 min on CPU (~3 min JIT compile + sub-minute rollout). 30 cells fan out across the cluster, ~30 min total if 5+ run concurrently.
Slurm's --wrap puts content in a generated slurm_script run via /bin/sh (dash on this system), which doesn't support 'set -o pipefail'. Switch to 'set -eu' since none of these commands use pipes.
RandomSelectRedAgent is a CybORG-side construct and isn't in the JAX red-selector REGISTRY, so cec_phase6_eval_jax.py crashes on --eval-red random. Drop random from the default eval list / argparse choices / aggregator REDS and document the CybORG-eval fallback. The 6 failed eval cells from the prior sweep are no longer relevant; the 4 informative held-out reds (fsm, cia_c, cia_i, cia_a) cover the Phase 6 ZSC question.
The 00:30:00 limit was tight for N=90 eps and insufficient for N=300: the 12 C00 jobs in the prior CPU sweep all hit TIMEOUT at 30:18 with no result row written. JIT compile ~3 min + 300-ep CPU rollout was >27 min. 01:30:00 gives headroom even if rollout scales worse than expected.
Adds C00_10M (control) and C11_10M (full 4-bank cocktail) recipes plus
four single-axis ablation recipes (topo / miss / pbound / cjewel) at 10M
training steps. New sequential CPU eval sweep amortizes JIT compile across
checkpoints by red. Aggregator decouples from hard-coded arm names so the
same script handles 3M, 10M, and ablation arms via PHASE6_ARMS env var.
At 10M training steps the cocktail produces a paired Δ of +286 to +404
reward on all four held-out reds (fsm / cia_c / cia_i / cia_a), all
crossing the pre-registered +200 confirm bar with lower bound > 0 and
no sign flips across 3 training seeds. The same cocktail at 3M was
REFUTED — undertraining, not a failed mechanism.
Files:
- recipes/cec_phase6_{C00,C11,topo,miss,pbound,cjewel}_10M.yaml
- scripts/train/cec_phase6_optionb_ablation.sh (chained 1-GPU launcher)
- scripts/eval/cec_phase6_eval_sweep.py (sequential CPU sweep)
- scripts/dev/cec_phase6_aggregate.py (arm-decoupled, pooled)
- scripts/eval/cec_phase6_eval_test2.sh (PHASE6_ARMS override)
- scripts/eval/cec_phase6_eval_jax.py (JAX cache env defaults)
Summary
Replication of the Jha et al. (2025) cross-environment cooperation claim — "Cross-environment Cooperation Enables Zero-shot Multi-agent Coordination", arXiv:2504.12714 — on the CAGE Challenge 4 environment.
The paper's claim: training-time environment diversity (with the partner held constant) is sufficient to produce zero-shot coordination with held-out partners. We test the same mechanism on CC4 with the held-out axis being the adversarial red agent rather than a cooperative ally.
Headline: The replication confirms transfer at the cocktail level (Δ +354 to +593). 3-seed single-axis ablations then decompose the cocktail into per-axis contributions: mission diversity is the dominant carrier (+373 Δ alone, 12/12 cells positive), phase-boundary and crown-jewel each contribute small positive amounts (+95 and +106), while the topology bank is actively harmful (−227, only 1/12 cells positive). The cocktail's value ≈ sum of these axes; removing topology would predict a stronger result than the full cocktail. Amplitude-matched control falsifies the alternative "stronger gradient" hypothesis — the mechanism is diversity, not signal strength.
Setup
Blue policy is IPPO (shared trunk, 256×2 tanh, identical hyperparameters across arms). Training partner is fixed to fsm (the cc4_stock variant default). The two arms differ only in what gets sampled at env reset. C00 uses:
- topology: … servers 1-6, users 3-10 per subnet per reset
- mission multipliers (LWF, ASF, RIA): (1,1,1)
- phase boundaries: (0,167,333)
Important note on what really differs between arms: C00 is NOT a fixed canonical env. CC4's default generative mode randomizes per-subnet server counts ([1, 6]) and user counts ([3, 10]) on every reset, so C00 already trains on a continuously varied host-count distribution. The C11 "topology bank" samples from just 16 fixed snapshots instead, far fewer distinct topologies than C00's fresh-every-reset draw, so on this axis C00 is the more varied arm. (The bank's op-zones are larger on average, but that is host magnitude, not variety.) Subnet adjacency is identically fixed in both arms.
The C11 cocktail's actual contribution over C00 is concentrated in the three semantic axes below (mission, phase timing, and crown-jewel rotation), which C00 fully fixes:
- … {2,3,5} and op-zone B span {4,5,6} (9 distinct A/B pairs across the 16 entries); total hosts range 88–100. This is fewer distinct topologies than C00's generative draw, not more.
- (LWF, ASF, RIA) reward multipliers control which mission components dominate the episode return. Entries are (1,1,1), (3,3,1), (1,3,3), (3,1,3); every non-baseline entry boosts exactly two of three components. Under the anti-correlated set, the "loud signal" rotates and the policy has to read state.
- … (0, 167, 333); bank also includes (0, 100, 300), (0, 200, 400), (0, 150, 250). Phase-conditioned policy behavior can no longer key off absolute step counts.
Per reset, C11 samples one entry from each bank independently.
Both arms train against fsm only. Evaluation is on the canonical env against four reds: fsm (in-distribution sanity), and cia_c, cia_i, cia_a (CIA-biased role-targeting reds, held out).
Pre-registered verdict thresholds (paired Δ = C11 mean − C00 mean across matched seeds):
Result — full cocktail (C11 vs C00)
3 training seeds × 90 eval episodes per (arm, seed, red) on the canonical env.
Paired Δ (C11 − C00):
- fsm: [+431, +868, +480]
- cia_c: [+230, +552, +504]
- cia_i: [+158, +532, +459]
- cia_a: [+147, +465, +449]
All four cross the pre-registered +200 confirm bar with lower bound > 0 and no sign flips across the three training seeds.
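The per-seed deltas can be summarized with a few lines of Python. This is a sketch rather than the aggregator script: the assignment of each row to a red is inferred from the means and ranges quoted elsewhere in this description (fsm mean +593 and range 437, cia_c range 322, cia_a range 318), and the confidence-interval lower bound is not recomputed because its method is not stated here.

```python
from statistics import mean

# Per-seed paired deltas (C11 minus C00); row-to-red mapping is inferred (see lead-in).
PER_SEED_DELTA = {
    "fsm": [431, 868, 480],
    "cia_c": [230, 552, 504],
    "cia_i": [158, 532, 459],
    "cia_a": [147, 465, 449],
}
CONFIRM_BAR = 200  # pre-registered confirm threshold on the mean delta


def summarize(deltas):
    """Mean, range, and sign consistency of one red's per-seed deltas."""
    return {
        "mean": mean(deltas),
        "range": max(deltas) - min(deltas),
        "no_sign_flip": all(d > 0 for d in deltas),
    }


if __name__ == "__main__":
    for red, deltas in PER_SEED_DELTA.items():
        s = summarize(deltas)
        clears_bar = s["mean"] >= CONFIRM_BAR and s["no_sign_flip"]
        print(f"{red}: mean {s['mean']:+.0f}, range {s['range']}, clears bar: {clears_bar}")
```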
Per-arm mean reward (the underlying numbers):
Both arms clearly beat the sleep-policy floor across every red. The cocktail-trained policy beats the canonical-trained baseline by +354 to +593 on top of that.
Ablation — which axis carries the signal?
To identify which of the four C11 banks actually drives the transfer effect, we trained single-axis arms (each with one bank active, others off) and evaluated under the same conditions. 3 training seeds × 30 eval episodes per (arm, seed, red), stochastic eval.
3-seed paired Δ per single-axis ablation:
- miss (mission bank only): mean Δ +373, 12/12 cells positive
- cjewel (crown-jewel only): mean Δ +106
- pbound (phase-boundary only): mean Δ +95
- topo (topology bank only): mean Δ −227, 1/12 cells positive
Decomposition of the cocktail:
- C11 (full cocktail): mean Δ +409
- C11prime (amp-matched mission): mean Δ +380, 12/12 cells positive
The full cocktail's per-red Δ is close to the linear sum of single-axis effects (+409 observed vs +347 if simply additive). The axes don't show strong synergy; the cocktail's value is roughly the sum of: a big mission contribution, two small positive contributions from phase-boundary and crown-jewel, minus a substantial topology cost.
Mission alone captures 87-94% of the full cocktail benefit. The remaining gap is filled by the small positive contributions of cjewel and pbound, partially offset by topology's drag.
Why is topology hurting? C00's canonical config draws a fresh random topology every reset (per-subnet servers 1–6, users 3–10, independent across 8 subnets). The "topology bank" replaces this with just 16 fixed snapshots, far fewer distinct topologies than the control sees. The op-zones in those snapshots are larger on average (op-zone B ~5 servers vs the generative mean ~3.5), but variety, not size, is what a diversity ablation tests, and on that axis the bank is narrower. The label is a misnomer: it constrained topological variety rather than expanding it.
Predicted improvement from removing topology: if the linear additivity holds, a cocktail with mission + cjewel + pbound (and no topology bank) would predict ~+574 Δ — meaningfully above the full cocktail's +409. Untested.
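The additivity arithmetic behind the +347 and ~+574 figures can be checked directly from the single-axis means quoted in the Summary. The variable names are illustrative only.

```python
# Mean single-axis deltas quoted in the Summary.
SINGLE_AXIS_DELTA = {"miss": 373, "cjewel": 106, "pbound": 95, "topo": -227}
FULL_COCKTAIL_DELTA = 409  # observed C11 mean delta

# Linear-additivity prediction for the full cocktail.
additive_prediction = sum(SINGLE_AXIS_DELTA.values())

# Same prediction with the topology bank removed.
without_topology = additive_prediction - SINGLE_AXIS_DELTA["topo"]

# Share of the observed cocktail delta that mission alone accounts for.
mission_share = SINGLE_AXIS_DELTA["miss"] / FULL_COCKTAIL_DELTA

if __name__ == "__main__":
    print("additive prediction:", additive_prediction)
    print("prediction without topology:", without_topology)
    print(f"mission share of cocktail: {mission_share:.0%}")
```

The no-topology figure is an extrapolation from the additive model, not a measured result, which is why the text marks it as untested.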
Falsifying the diversity-vs-gradient confound
C11's mission bank {(1,1,1), (3,3,1), (1,3,3), (3,1,3)} boosts two of three mission components on three of four entries. Per-entry magnitude sums to (3, 7, 7, 7), so the "diversity" intervention also amplified the reward signal. The +409 cocktail Δ was therefore confounded between "more env variety" and "stronger gradient."
We trained C11prime (3 seeds) with the mission bank rewritten to match per-entry reward magnitude exactly but rotate a single component instead of two:
- C11 bank: [1,1,1], [3,3,1], [1,3,3], [3,1,3]; per-entry sums 3, 7, 7, 7
- C11prime bank: [1,1,1], [5,1,1], [1,5,1], [1,1,5]; per-entry sums 3, 7, 7, 7
At full 3-seed power, C11prime produces mean Δ +380 (12/12 cells positive), within 7% of C11's +409. Amplitude-matched diversity reproduces the headline effect almost exactly; the result is not an artifact of stronger gradients. The mechanism is environment variety.
Reframed conclusion
Side observations
- … cia_c range 322, cia_a range 318) than on fsm (range 437), but every seed-pair is positive for every red.
- … fsm is +593, the largest of the four. Env diversity is not trading in-distribution performance for transfer here; it improves both.
- cia_c, cia_i, cia_a all share _CIA_PROB_MATRIX and differ only in target-role weights. The four held-out red rows are best read as one trained partner + three priors over one role-biased family.
- … resilience_roles=True for their role-tag selectors, which forces op_zone_servers=3. fsm eval uses resilience_roles=False with CC4-default random counts. Within-red paired Δ is still valid; only absolute rewards across reds are non-comparable.
Action distribution: what behavior does each arm learn?
Probe of stochastic eval (seed 42 checkpoints, 90 episodes × 5 blue agents × 500 steps, on fsm; cia_a results were qualitatively identical):
C00 vs C11 (full cocktail):
C00 learns a cheap "block traffic + kill sessions + lots of decoys" defense. C11 learns "restore hosts fully + keep the network open + slightly less decoy spam."
C00 vs miss alone: does the mission-only arm reproduce C11's behavior? Yes, at ~75% of the intensity:
Mission-only training recovers the Restore-up, BlockTraffic-down signature. AllowTraffic doesn't shift here; that piece of the strategy may come from interaction with the other banks.
C11 vs miss head-to-head: they play almost identically. All action deltas are ≤ 2.4pp except AllowTraffic (miss plays it 3.7pp less). The "mission carries the effect" finding is mirrored in policy behavior, not just reward.
C00 vs topo: topology learns a different, worse defense:
Topology-only doesn't learn the Restore-heavy pattern. Instead it leans further into kill-sessions + decoy-spam, a doubled-down version of C00's strategy, which transfers worse (Δ reward −380 on fsm). It is a different defensive choice, not just a degraded C11.
The headline: mission diversity teaches the policy to prefer host-restore over session-kill defense, and that preference transfers across held-out reds. Topology bank teaches the opposite.
Scale sensitivity
The same cocktail at 3M training steps produces Δs of only +20 to +102 and REFUTES on all four reds (sign-flipping single-seed results). 10M steps converges on the transferable representation; 3M is not enough at this LR schedule.
Open questions
How to reproduce
Test plan
miss reproduces C11's Restore-heavy strategy; topo learns a different, worse Remove+Decoy strategy