Bug: Gemma4Tokenizer missing FORBIDDEN_TOKENS — multimodal placeholder tokens can leak into text-only output

## Description

`Gemma4Tokenizer` does not define `FORBIDDEN_TOKENS`, unlike `Gemma3Tokenizer` and `Gemma3nTokenizer`, which both forbid the image start/end tokens from being generated during sampling.

As a result, when sampling with a Gemma 4 model in text-only mode, nothing stops the sampler from generating raw image/audio placeholder tokens (`<|image|>`, `<|image>`, `<image|>`, `<|audio|>`, `<|audio>`, `<audio|>`), which can corrupt the output.

## Comparison

```python
# Gemma3Tokenizer (line ~440) — correctly forbids image tokens:
FORBIDDEN_TOKENS = (
    special_tokens.START_OF_IMAGE,
    special_tokens.END_OF_IMAGE,
)

# Gemma3nTokenizer (line ~465) — same:
FORBIDDEN_TOKENS = (
    special_tokens.START_OF_IMAGE,
    special_tokens.END_OF_IMAGE,
)

# Gemma4Tokenizer (line ~475) — MISSING, inherits empty tuple from base:
# (no FORBIDDEN_TOKENS defined)
```

## How it's used

In `gemma/gm/text/_sampler.py:501`:
```python
forbidden_tokens += self.tokenizer.FORBIDDEN_TOKENS
```
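For `Gemma4Tokenizer`, `FORBIDDEN_TOKENS` is the empty tuple inherited from the base class, so this line adds nothing and the multimodal tokens are not masked out by default.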
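To illustrate why an empty `FORBIDDEN_TOKENS` matters, here is a minimal sketch of forbidden-token masking on a toy vocabulary. It is not the actual code from `_sampler.py`: the helpers `mask_forbidden` and `greedy_pick`, the logits, and the token ids are all hypothetical, and the real sampler presumably operates on arrays rather than Python lists.

```python
import math


def mask_forbidden(logits, forbidden_ids):
    """Return a copy of `logits` with every forbidden token id set to -inf."""
    masked = list(logits)
    for token_id in forbidden_ids:
        masked[token_id] = -math.inf
    return masked


def greedy_pick(logits):
    """Return the index of the largest logit (greedy decoding)."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy vocabulary of 5 tokens. Ids 3 and 4 stand in for multimodal
# placeholder tokens (hypothetical ids, for illustration only).
logits = [0.1, 1.5, 0.3, 2.0, 4.0]

# Empty forbidden set, as with Gemma4Tokenizer today: nothing is masked,
# so the highest-scoring placeholder token can be picked.
unmasked_choice = greedy_pick(mask_forbidden(logits, ()))

# Placeholders forbidden: the pick falls back to the best text token.
masked_choice = greedy_pick(mask_forbidden(logits, (3, 4)))
```

With the empty tuple, the pick lands on a placeholder id. Once the placeholder ids are forbidden, only text tokens remain eligible.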

## Proposed Fix

Add `FORBIDDEN_TOKENS` to `Gemma4Tokenizer` covering all multimodal placeholder tokens:

```python
class Gemma4Tokenizer(Tokenizer):
  ...
  FORBIDDEN_TOKENS = (
      special_tokens.IMAGE_PLACEHOLDER,
      special_tokens.START_OF_IMAGE,
      special_tokens.END_OF_IMAGE,
      special_tokens.AUDIO_PLACEHOLDER,
      special_tokens.START_OF_AUDIO,
      special_tokens.END_OF_AUDIO,
  )
```
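A regression check along the following lines could guard against the attribute going missing again. This is a sketch using stand-in classes, not the real `gemma` package: `FakeSpecialTokens`, `FakeBaseTokenizer`, `FakeGemma4Tokenizer`, the token ids, and `missing_forbidden` are hypothetical names. The only assumption carried over from this issue is that the base class defaults to an empty tuple.

```python
import enum


class FakeSpecialTokens(enum.IntEnum):
    """Stand-in for the real special-token enum; ids are made up."""
    IMAGE_PLACEHOLDER = 1
    START_OF_IMAGE = 2
    END_OF_IMAGE = 3
    AUDIO_PLACEHOLDER = 4
    START_OF_AUDIO = 5
    END_OF_AUDIO = 6


class FakeBaseTokenizer:
    # Mirrors the empty default described in this issue.
    FORBIDDEN_TOKENS = ()


class FakeGemma4Tokenizer(FakeBaseTokenizer):
    # Mirrors the proposed fix: forbid every multimodal token.
    FORBIDDEN_TOKENS = tuple(FakeSpecialTokens)


# Every token that must never appear in text-only output.
MULTIMODAL_TOKENS = frozenset(FakeSpecialTokens)


def missing_forbidden(tokenizer_cls, required):
    """Return the required tokens that `tokenizer_cls` fails to forbid."""
    return set(required) - set(tokenizer_cls.FORBIDDEN_TOKENS)


# A tokenizer that only inherits the empty default misses all six tokens.
unfixed_missing = missing_forbidden(FakeBaseTokenizer, MULTIMODAL_TOKENS)
# The tokenizer with the proposed fix misses none.
fixed_missing = missing_forbidden(FakeGemma4Tokenizer, MULTIMODAL_TOKENS)
```

In a real test, the stand-ins would be replaced by the actual tokenizer class and special-token definitions, with an assertion that the returned set is empty.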

## Location

- File: `gemma/gm/text/_tokenizer.py`
- Class: `Gemma4Tokenizer` (around line 475)
