Bug: Gemma4Tokenizer missing FORBIDDEN_TOKENS — multimodal placeholder tokens can leak into text-only output #613

@ac12644

Description

Gemma4Tokenizer does not define FORBIDDEN_TOKENS. Gemma3Tokenizer and Gemma3nTokenizer both define it, which prevents multimodal placeholder tokens from being generated during sampling.

As a result, when sampling from a Gemma 4 model in text-only mode, nothing prevents the sampler from generating raw image or audio placeholder tokens (<|image|>, <|image>, <image|>, <|audio|>, <|audio>, <audio|>), which would corrupt the output.

Comparison

# Gemma3Tokenizer (line ~440) — correctly forbids image tokens:
FORBIDDEN_TOKENS = (
    special_tokens.START_OF_IMAGE,
    special_tokens.END_OF_IMAGE,
)

# Gemma3nTokenizer (line ~465) — same:
FORBIDDEN_TOKENS = (
    special_tokens.START_OF_IMAGE,
    special_tokens.END_OF_IMAGE,
)

# Gemma4Tokenizer (line ~475) — MISSING, inherits empty tuple from base:
# (no FORBIDDEN_TOKENS defined)

How it's used

In gemma/gm/text/_sampler.py:501:

forbidden_tokens += self.tokenizer.FORBIDDEN_TOKENS

For Gemma4Tokenizer, FORBIDDEN_TOKENS is the empty tuple inherited from the base class, so this line adds nothing and the multimodal tokens are never masked out.
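
To illustrate why an empty tuple matters, here is a minimal sketch of logit masking. It is not the code in _sampler.py: the helper names `mask_forbidden` and `greedy_pick` and the toy token IDs are hypothetical, and the real sampler works on model arrays rather than Python lists.

```python
import math


def mask_forbidden(logits, forbidden_ids):
    """Return a copy of `logits` with every forbidden token ID set to -inf."""
    masked = list(logits)
    for token_id in forbidden_ids:
        masked[token_id] = -math.inf
    return masked


def greedy_pick(logits):
    """Return the index of the largest logit (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy vocabulary of 4 tokens; ID 2 stands in for a multimodal placeholder
# (hypothetical ID, chosen only for this illustration).
logits = [0.1, 1.5, 3.0, 0.7]

# Empty tuple, as inherited by Gemma4Tokenizer: nothing is masked.
unmasked_choice = greedy_pick(mask_forbidden(logits, ()))

# With the placeholder forbidden, the next-best token is chosen instead.
masked_choice = greedy_pick(mask_forbidden(logits, (2,)))
```

With an empty tuple the placeholder (ID 2) wins because it has the highest logit. Once it is listed as forbidden, the sampler falls back to ID 1.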

Proposed Fix

Add FORBIDDEN_TOKENS to Gemma4Tokenizer covering all multimodal placeholder tokens:

class Gemma4Tokenizer(Tokenizer):
  ...
  FORBIDDEN_TOKENS = (
      special_tokens.IMAGE_PLACEHOLDER,
      special_tokens.START_OF_IMAGE,
      special_tokens.END_OF_IMAGE,
      special_tokens.AUDIO_PLACEHOLDER,
      special_tokens.START_OF_AUDIO,
      special_tokens.END_OF_AUDIO,
  )
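
A regression check for this fix could look roughly like the sketch below. It uses stand-in classes rather than the real gemma package: `Gemma4TokenizerBefore`, `Gemma4TokenizerAfter`, and `missing_forbidden` are hypothetical names, and the placeholder strings from this report stand in for the `special_tokens` values.

```python
# Placeholder strings listed in this report, used here as stand-ins
# for the corresponding special_tokens values (assumption).
PLACEHOLDERS = (
    "<|image|>", "<|image>", "<image|>",
    "<|audio|>", "<|audio>", "<audio|>",
)


class Tokenizer:
    # Base-class default described in the report: nothing is forbidden.
    FORBIDDEN_TOKENS = ()


class Gemma4TokenizerBefore(Tokenizer):
    # Current behaviour: no override, so the empty tuple is inherited.
    pass


class Gemma4TokenizerAfter(Tokenizer):
    # Proposed behaviour: forbid all multimodal placeholders.
    FORBIDDEN_TOKENS = PLACEHOLDERS


def missing_forbidden(tokenizer_cls, required):
    """Return the required tokens that the tokenizer class does not forbid."""
    return tuple(t for t in required if t not in tokenizer_cls.FORBIDDEN_TOKENS)


before = missing_forbidden(Gemma4TokenizerBefore, PLACEHOLDERS)
after = missing_forbidden(Gemma4TokenizerAfter, PLACEHOLDERS)
```

Here `before` contains all six placeholders and `after` is empty, which is the condition a test for the fix would assert.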

Location

  • File: gemma/gm/text/_tokenizer.py
  • Class: Gemma4Tokenizer (around line 475)
