fix(embeddings): apply e5 query/passage instruction prefixes#260
Open
kzzalews wants to merge 1 commit into
Open
fix(embeddings): apply e5 query/passage instruction prefixes#260kzzalews wants to merge 1 commit into
kzzalews wants to merge 1 commit into
Conversation
multilingual-e5 models are trained to expect "query: " on search queries
and "passage: " on stored documents; omitting the prefixes degrades
retrieval (per the intfloat/multilingual-e5 model card). ICM embedded all
text raw, so e5 ran out-of-distribution on every store and recall.
Apply the prefixes in the embedder, model-aware:
- embed() / embed_batch() prepend the document ("passage: ") prefix
- a new Embedder::embed_query() prepends the query ("query: ") prefix,
used by the CLI recall and MCP recall paths
Only e5-family models (MultilingualE5 Small/Base/Large) are affected;
every other model is byte-identical to before. The default impl of
embed_query() delegates to embed(), so non-e5 Embedder impls are unchanged.
Deterministic IR eval on a real 424-memory corpus (12 lexically-distinct
paraphrase queries, same targets, only the prefix differs):
recall@1 8->9/12, recall@5 10->12/12, MRR 0.764->0.850, no regressions.
Near-verbatim queries (already easy) are unaffected.
Migration: existing e5 stores must be re-embedded once with
`icm embed --force`; a mix of prefixed and unprefixed vectors is worse
than neither.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
ICM embeds all text raw, without the instruction prefixes that the
multilingual-e5family is trained to require. Per theintfloat/multilingual-e5 model card:
Today every document is embedded as
"<topic> <summary>"and every query asthe raw string (
icm-core/src/memory.rs::embed_text(),icm-cli/.../main.rs:1695,icm-mcp/src/tools.rs:1191).fastembeddoes not add the prefixes — it is thecaller's responsibility.
Fix
Apply the prefixes in the embedder, model-aware:
embed()/embed_batch()prepend the document prefix ("passage: ")Embedder::embed_query()prepends the query prefix ("query: "),used by the CLI and MCP recall paths
Only the e5 family (
MultilingualE5Small/Base/Large) is affected; every othermodel is byte-identical to today.
embed_query()defaults to delegating toembed(), so non-e5Embedderimpls are unchanged.Results
Deterministic IR eval on a real 424-memory store, 12 lexically-distinct
paraphrase queries (same queries and known targets in both arms — only the
prefix differs):
No regressions. One target missing from the top-10 was recovered to rank 5;
another rose from rank 6 to 2. Near-verbatim queries (already easy) are
unaffected, as expected.
Tests
e5_models_use_instruction_prefixes,non_e5_models_are_left_unprefixed.cargo test --workspacegreen.Migration⚠️
Existing e5 stores must be re-embedded once after upgrading:
A mix of prefixed (new) and unprefixed (old) vectors is worse than either
consistent state, so this is required. Queries pick up
query:automatically.A follow-up could bump an embedding-scheme version to auto-trigger this,
mirroring the model-change reset in
icm-store/src/schema.rs.Written by Aiko (Claude Opus 4.8) · Lv. 20.5 · 2026-05-29
Context: ICM (topics: preferences, errors-resolved) · Source: rtk-ai/icm @ v0.10.50 (main
804dac2)Tools: git, cargo build/test, sqlite3, gh, binary/source inspection, deterministic IR eval (recall@k/MRR) on a copy of the live store
Limitations: eval N=12 on one mixed PL/EN corpus; MRR delta driven mainly by 2 of 12 queries; not an MTEB run;
icm bench-recalldeliberately not used (measures ICM-vs-no-ICM on a synthetic English corpus — unsuitable here)🤖 Generated with Claude Code