Gemma 4 26B: strong protocol-following regression vs Gemma 3 in bounded machine-facing evaluation

I’m testing `gemma4:26b` locally in a bounded Project Phoenix evaluation lane and want to report a concrete behavior difference versus `gemma3:27b`.

On an `RTX 3090`, `gemma4:26b` loads and runs cleanly at `100% GPU`, and it is fast:

- total bundle time: `99.406s`
- proxy stage: `43.158s`
- protocol stage: `56.127s`

However, on a bounded machine-facing protocol lane, the model failed all 6 protocol probes as `non_json` in every parsing mode (passes shown as passed/total):

- `strict`: `0/6`
- `wrapper`: `0/6`
- `safe_repair`: `0/6`
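
For readers unfamiliar with this kind of tiered check, here is a minimal sketch of what three increasingly lenient parsing modes could look like. This is an illustration only, not the actual Project Phoenix harness: the function names and the exact meaning assigned to `strict`, `wrapper`, and `safe_repair` are assumptions made for the example.

```python
import json
import re

# Hypothetical definitions of the three modes (assumed, not the real harness):
#   strict      - the whole reply must be a bare JSON document
#   wrapper     - JSON inside a Markdown code fence is also accepted
#   safe_repair - the first JSON object embedded in prose is also accepted
FENCE = "`" * 3
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_strict(text):
    """Return the parsed value, or None if the reply is not bare JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def parse_wrapper(text):
    """Like parse_strict, but unwrap a single Markdown code fence first."""
    match = FENCE_RE.search(text)
    return parse_strict(match.group(1) if match else text)


def parse_safe_repair(text):
    """Like parse_wrapper, but fall back to the first embedded JSON object."""
    result = parse_wrapper(text)
    if result is not None:
        return result
    decoder = json.JSONDecoder()
    for index, char in enumerate(text):
        if char == "{":
            try:
                obj, _end = decoder.raw_decode(text, index)
                return obj
            except json.JSONDecodeError:
                continue
    return None


def classify(text):
    """Name the strictest mode that accepts the reply, else 'non_json'."""
    modes = (
        ("strict", parse_strict),
        ("wrapper", parse_wrapper),
        ("safe_repair", parse_safe_repair),
    )
    for name, parser in modes:
        if parser(text) is not None:
            return name
    return "non_json"
```

Under these assumed definitions, a reply that fails even the most lenient mode contains no parseable JSON object at all, which is what a `0/6` result across all three modes would suggest. Note that the sketch treats a bare JSON `null` as a failure, since `None` doubles as the failure marker.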

By comparison, our current `gemma3:27b` results on the same lane are materially stronger:

- desktop: `3/6`, `5/6`, `5/6`
- laptop: `2/6`, `5/6`, `5/6`

So my early read is:

- `Gemma 4` appears faster and stronger in general reasoning / HITL use
- but `Gemma 3` is currently much safer in a strict machine-facing protocol / handoff setting

Questions:

- Is this kind of weak protocol-following / JSON-discipline behavior versus Gemma 3 expected in the current release?
- Is there a recommended patch, prompt pattern, runtime setting, or updated checkpoint that would improve it?
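
To make the second question concrete, this is the kind of runtime setting I have in mind. It is a minimal sketch that assumes the model is served through Ollama's REST API on its default local port and uses the `format` field of `/api/generate` to constrain output to JSON. The helper names `build_payload` and `generate_json` are hypothetical, and whether this setting actually fixes the behavior on `gemma4:26b` is untested here.

```python
import json
import urllib.request

# Assumption: a local Ollama server on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model, prompt):
    """Build a non-streaming request that asks the runtime for JSON output."""
    return {
        "model": model,
        "prompt": prompt,
        "format": "json",   # ask the runtime to constrain output to valid JSON
        "stream": False,    # return one response object instead of chunks
        "options": {"temperature": 0},  # reduce sampling variance for probes
    }


def generate_json(model, prompt, url=OLLAMA_URL, timeout=120):
    """Send the request and parse the model's reply as JSON.

    Raises json.JSONDecodeError if the reply is still not valid JSON.
    """
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return json.loads(body["response"])
```

A call such as `generate_json("gemma4:26b", "Reply with a JSON object containing a 'status' key.")` would then either return a parsed object or raise, which makes it easy to count failures per probe.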

I’m happy to provide more exact artifact details if useful.
