fix(embeddings): apply e5 query/passage instruction prefixes by kzzalews · Pull Request #260 · rtk-ai/icm

kzzalews · 2026-05-29T11:18:16Z

Problem

ICM embeds all text raw, without the instruction prefixes that the
multilingual-e5 family is trained to require. Per the
intfloat/multilingual-e5 model card:

Each input text should start with "query: " or "passage: ", even for
non-English texts. … otherwise you will see a performance degradation.

Today every document is embedded as "<topic> <summary>" and every query as
the raw string (icm-core/src/memory.rs::embed_text(), icm-cli/.../main.rs:1695,
icm-mcp/src/tools.rs:1191). fastembed does not add the prefixes — it is the
caller's responsibility.

Fix

Apply the prefixes in the embedder, model-aware:

embed() / embed_batch() prepend the document prefix ("passage: ")
a new Embedder::embed_query() prepends the query prefix ("query: "),
used by the CLI and MCP recall paths

Only the e5 family (MultilingualE5Small/Base/Large) is affected; every other
model is byte-identical to today. embed_query() defaults to delegating to
embed(), so non-e5 Embedder impls are unchanged.

Results

Deterministic IR eval on a real 424-memory store, 12 lexically-distinct
paraphrase queries (same queries and known targets in both arms — only the
prefix differs):

Metric	before	after
recall@1	8/12	9/12
recall@5	10/12	12/12
MRR	0.764	0.850

No regressions. One target missing from the top-10 was recovered to rank 5;
another rose from rank 6 to 2. Near-verbatim queries (already easy) are
unaffected, as expected.

Tests

New offline unit tests: e5_models_use_instruction_prefixes,
non_e5_models_are_left_unprefixed.
cargo test --workspace green.

Migration ⚠️

Existing e5 stores must be re-embedded once after upgrading:

icm embed --force

A mix of prefixed (new) and unprefixed (old) vectors is worse than either
consistent state, so this is required. Queries pick up query: automatically.
A follow-up could bump an embedding-scheme version to auto-trigger this,
mirroring the model-change reset in icm-store/src/schema.rs.

Written by Aiko (Claude Opus 4.8) · Lv. 20.5 · 2026-05-29
Context: ICM (topics: preferences, errors-resolved) · Source: rtk-ai/icm @ v0.10.50 (main 804dac2)
Tools: git, cargo build/test, sqlite3, gh, binary/source inspection, deterministic IR eval (recall@k/MRR) on a copy of the live store
Limitations: eval N=12 on one mixed PL/EN corpus; MRR delta driven mainly by 2 of 12 queries; not an MTEB run; icm bench-recall deliberately not used (measures ICM-vs-no-ICM on a synthetic English corpus — unsuitable here)

🤖 Generated with Claude Code

multilingual-e5 models are trained to expect "query: " on search queries and "passage: " on stored documents; omitting the prefixes degrades retrieval (per the intfloat/multilingual-e5 model card). ICM embedded all text raw, so e5 ran out-of-distribution on every store and recall. Apply the prefixes in the embedder, model-aware: - embed() / embed_batch() prepend the document ("passage: ") prefix - a new Embedder::embed_query() prepends the query ("query: ") prefix, used by the CLI recall and MCP recall paths Only e5-family models (MultilingualE5 Small/Base/Large) are affected; every other model is byte-identical to before. The default impl of embed_query() delegates to embed(), so non-e5 Embedder impls are unchanged. Deterministic IR eval on a real 424-memory corpus (12 lexically-distinct paraphrase queries, same targets, only the prefix differs): recall@1 8->9/12, recall@5 10->12/12, MRR 0.764->0.850, no regressions. Near-verbatim queries (already easy) are unaffected. Migration: existing e5 stores must be re-embedded once with `icm embed --force`; a mix of prefixed and unprefixed vectors is worse than neither. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

kzzalews changed the title ~~[ai] fix(embeddings): apply e5 query/passage instruction prefixes~~ fix(embeddings): apply e5 query/passage instruction prefixes May 29, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(embeddings): apply e5 query/passage instruction prefixes#260

fix(embeddings): apply e5 query/passage instruction prefixes#260
kzzalews wants to merge 1 commit into
rtk-ai:mainfrom
kzzalews:ai/fix/e5-instruction-prefixes

kzzalews commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kzzalews commented May 29, 2026

Problem

Fix

Results

Tests

Migration ⚠️

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant