Skip to content

fix(embeddings): apply e5 query/passage instruction prefixes#260

Open
kzzalews wants to merge 1 commit into
rtk-ai:mainfrom
kzzalews:ai/fix/e5-instruction-prefixes
Open

fix(embeddings): apply e5 query/passage instruction prefixes#260
kzzalews wants to merge 1 commit into
rtk-ai:mainfrom
kzzalews:ai/fix/e5-instruction-prefixes

Conversation

@kzzalews
Copy link
Copy Markdown

Problem

ICM embeds all text raw, without the instruction prefixes that the
multilingual-e5 family is trained to require. Per the
intfloat/multilingual-e5 model card:

Each input text should start with "query: " or "passage: ", even for
non-English texts. … otherwise you will see a performance degradation.

Today every document is embedded as "<topic> <summary>" and every query as
the raw string (icm-core/src/memory.rs::embed_text(), icm-cli/.../main.rs:1695,
icm-mcp/src/tools.rs:1191). fastembed does not add the prefixes — it is the
caller's responsibility.

Fix

Apply the prefixes in the embedder, model-aware:

  • embed() / embed_batch() prepend the document prefix ("passage: ")
  • a new Embedder::embed_query() prepends the query prefix ("query: "),
    used by the CLI and MCP recall paths

Only the e5 family (MultilingualE5Small/Base/Large) is affected; every other
model is byte-identical to today. embed_query() defaults to delegating to
embed(), so non-e5 Embedder impls are unchanged.

Results

Deterministic IR eval on a real 424-memory store, 12 lexically-distinct
paraphrase queries (same queries and known targets in both arms — only the
prefix differs):

Metric before after
recall@1 8/12 9/12
recall@5 10/12 12/12
MRR 0.764 0.850

No regressions. One target missing from the top-10 was recovered to rank 5;
another rose from rank 6 to 2. Near-verbatim queries (already easy) are
unaffected, as expected.

Tests

  • New offline unit tests: e5_models_use_instruction_prefixes,
    non_e5_models_are_left_unprefixed.
  • cargo test --workspace green.

Migration ⚠️

Existing e5 stores must be re-embedded once after upgrading:

icm embed --force

A mix of prefixed (new) and unprefixed (old) vectors is worse than either
consistent state, so this is required. Queries pick up query: automatically.
A follow-up could bump an embedding-scheme version to auto-trigger this,
mirroring the model-change reset in icm-store/src/schema.rs.


Written by Aiko (Claude Opus 4.8) · Lv. 20.5 · 2026-05-29
Context: ICM (topics: preferences, errors-resolved) · Source: rtk-ai/icm @ v0.10.50 (main 804dac2)
Tools: git, cargo build/test, sqlite3, gh, binary/source inspection, deterministic IR eval (recall@k/MRR) on a copy of the live store
Limitations: eval N=12 on one mixed PL/EN corpus; MRR delta driven mainly by 2 of 12 queries; not an MTEB run; icm bench-recall deliberately not used (measures ICM-vs-no-ICM on a synthetic English corpus — unsuitable here)

🤖 Generated with Claude Code

multilingual-e5 models are trained to expect "query: " on search queries
and "passage: " on stored documents; omitting the prefixes degrades
retrieval (per the intfloat/multilingual-e5 model card). ICM embedded all
text raw, so e5 ran out-of-distribution on every store and recall.

Apply the prefixes in the embedder, model-aware:
- embed() / embed_batch() prepend the document ("passage: ") prefix
- a new Embedder::embed_query() prepends the query ("query: ") prefix,
  used by the CLI recall and MCP recall paths

Only e5-family models (MultilingualE5 Small/Base/Large) are affected;
every other model is byte-identical to before. The default impl of
embed_query() delegates to embed(), so non-e5 Embedder impls are unchanged.

Deterministic IR eval on a real 424-memory corpus (12 lexically-distinct
paraphrase queries, same targets, only the prefix differs):
recall@1 8->9/12, recall@5 10->12/12, MRR 0.764->0.850, no regressions.
Near-verbatim queries (already easy) are unaffected.

Migration: existing e5 stores must be re-embedded once with
`icm embed --force`; a mix of prefixed and unprefixed vectors is worse
than neither.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@kzzalews kzzalews changed the title [ai] fix(embeddings): apply e5 query/passage instruction prefixes fix(embeddings): apply e5 query/passage instruction prefixes May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant