feat: support Jina AI v5 Matryoshka embedding models (nano + small)#135
metaphorics wants to merge 11 commits into
- Add MigrationStatus { dim_changed, old_dim, new_dim, affected_rows }
to icm-store/src/lib.rs (public export)
- Change init_db / init_db_with_dims to return IcmResult<MigrationStatus>
instead of IcmResult<()>; dim-change path populates all four fields
(affected_rows comes from conn.execute() return value); no-change path
returns MigrationStatus::default()
- Propagate through SqliteStore::with_dims / SqliteStore::new which now
return IcmResult<(Self, MigrationStatus)>
- Update open_store in icm-cli/src/main.rs to destructure the tuple;
fix all direct SqliteStore::new / SqliteStore::with_dims call sites in
main.rs and learn_tests.rs
- No behavior change to the migration logic itself; 125 icm-store tests pass
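
The return-value change above can be sketched roughly as follows. This is a minimal illustration, not the code in icm-store: the field types, the `migrate_dims` function, and the `clear_embeddings` closure (a stand-in for the `conn.execute()` call) are assumptions made for the example.

```rust
/// The four fields named in the commit message; `usize` is an assumed type.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct MigrationStatus {
    pub dim_changed: bool,
    pub old_dim: usize,
    pub new_dim: usize,
    pub affected_rows: usize,
}

/// Hypothetical stand-in for the dim-change branch of `init_db_with_dims`.
/// `stored_dim` is the dimension recorded in the database (None for a fresh
/// database); `clear_embeddings` plays the role of `conn.execute(...)` and
/// returns the number of rows whose embeddings were cleared.
pub fn migrate_dims(
    stored_dim: Option<usize>,
    requested_dim: usize,
    clear_embeddings: impl FnOnce() -> usize,
) -> MigrationStatus {
    match stored_dim {
        // Dim-change path: populate all four fields.
        Some(old) if old != requested_dim => MigrationStatus {
            dim_changed: true,
            old_dim: old,
            new_dim: requested_dim,
            affected_rows: clear_embeddings(),
        },
        // No-change path (or fresh database): all-default status.
        _ => MigrationStatus::default(),
    }
}
```

Returning a plain value like this is what lets the store stay unaware of embedders: the caller inspects `dim_changed` and decides what to do next.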
…t, model_name, license

Adds four new methods to the Embedder trait, all with default impls so existing implementors (FastEmbedder) keep working unchanged.

* embed_query / embed_document: hooks for asymmetric retrieval models. Default: delegate to embed(). Lets future models (e.g. jina-v5) use query/passage prefixes per HF convention.
* model_name / license: exposed for upcoming `icm config show` CLI output. Default: empty strings.

Recall paths in icm-cli (cmd_recall) and icm-mcp (icm_recall tool) now call embed_query() instead of embed(). Store paths keep embed() since embed_document defaults to embed anyway, so this is backward compatible for every model that does not override the new methods.
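
A minimal sketch of a trait extended this way, assuming a simplified signature (`Result<Vec<f32>, String>` rather than the crate's real result type). `LengthEmbedder` is a hypothetical implementor that overrides nothing, showing that existing implementors compile unchanged.

```rust
/// Simplified stand-in for the `Embedder` trait; the real signatures and
/// error type in icm-core may differ.
pub trait Embedder {
    fn embed(&self, text: &str) -> Result<Vec<f32>, String>;

    /// Default: symmetric models treat queries like any other text.
    fn embed_query(&self, text: &str) -> Result<Vec<f32>, String> {
        self.embed(text)
    }

    /// Default: symmetric models treat documents like any other text.
    fn embed_document(&self, text: &str) -> Result<Vec<f32>, String> {
        self.embed(text)
    }

    /// Default: empty string, as described in the commit message.
    fn model_name(&self) -> &str {
        ""
    }

    /// Default: empty string, as described in the commit message.
    fn license(&self) -> &str {
        ""
    }
}

/// Hypothetical implementor that only provides `embed`.
pub struct LengthEmbedder;

impl Embedder for LengthEmbedder {
    fn embed(&self, text: &str) -> Result<Vec<f32>, String> {
        // Toy "embedding": a single component holding the byte length.
        Ok(vec![text.len() as f32])
    }
}
```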
…nizers

Adds a new Jina v5 text-nano-retrieval embedder that runs ONNX inference locally via the ort crate and tokenizers, gated behind the new `jina-v5` Cargo feature. The model weights are downloaded from HuggingFace on first use into the standard hf-hub cache.

License: CC-BY-NC-4.0 (non-commercial). The new `Embedder::license()` method returns "CC-BY-NC-4.0" so future `icm config show` output can warn the user.

Workspace deps (each marked optional in icm-core/Cargo.toml so default builds pull none of them):

* `hf-hub` for model download into the per-user cache.
* `ort` 2.0.0-rc.9 with `load-dynamic` (uses system onnxruntime; no binary download at build time) and `ndarray` (for try_extract_tensor).
* `tokenizers` 0.21 with default features, which are required for the BPE/regex backends; the bare `http` feature alone fails tokenizers' own compile_error! gate.
* `ndarray` for input tensor construction.

`EmbeddingsConfig` gains a `backend` enum (`fastembed` | `jina-v5-nano`) and an optional `truncate_dim` (Matryoshka representation truncation: 32 / 64 / 128 / 256 / 512 / 768). Unsupported backend selections (e.g. config requests `jina-v5-nano` but binary was built without the feature) now produce an explicit error rather than silently falling back to no-embeddings.

The `init_embedder` factory returns `Result<Option<Box<dyn Embedder>>>` under every feature combination, so all CLI call sites now use `embedder.as_deref()` instead of cfg-gated `as &dyn Embedder` casts. The `MigrationStatus` returned by `open_store` is now logged when the embedding dim changes.

Tests: 4 new unit tests for the Matryoshka `truncate_and_renorm` helper (shape, unit-norm, n-clamping, zero-vector). All 291 existing tests still pass; clippy clean across all feature combos.
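
For readers unfamiliar with Matryoshka truncation, a helper of this kind can be sketched as below. The function name comes from the commit message, but the signature and body are assumptions chosen to match the four tested behaviours (shape, unit norm, n-clamping, zero vector), not the actual implementation.

```rust
/// Keep the first `n` components of `v` and rescale them to unit L2 norm.
/// Assumed behaviour: `n` is clamped to `v.len()`, and an all-zero prefix is
/// returned as-is to avoid dividing by zero.
pub fn truncate_and_renorm(v: &[f32], n: usize) -> Vec<f32> {
    let n = n.min(v.len());
    let mut out = v[..n].to_vec();
    let norm = out.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in &mut out {
            *x /= norm;
        }
    }
    out
}
```

Renormalizing after truncation matters because cosine similarity and dot-product search both assume unit-length vectors; a truncated prefix of a unit vector is generally shorter than 1.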
- New JinaV5SmallEmbedder in crates/icm-core/src/jina_v5_small.rs gated behind the existing jina-v5 Cargo feature
- HF model: jinaai/jina-embeddings-v5-text-small-retrieval
- Default dim: 1024; valid Matryoshka dims: [32,64,128,256,512,768,1024]
- Reuses crate::jina_v5_nano::truncate_and_renorm (not duplicated)
- Same mean-pool + L2-norm ONNX pipeline as nano (Qwen3 vs EuroBERT is internal to the model; inference code is identical from our side)
- embed_query / embed_document asymmetric prefix injection deferred to S-4
- 3 unit tests: truncate_correct_dim, truncate_max_dim, invalid_dim_rejected
- Exported from lib.rs as JinaV5SmallEmbedder behind jina-v5 feature

Note: shared-core extraction across nano+small deferred to a follow-up slice.
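
The dimension check implied by the `invalid_dim_rejected` test might look like the sketch below. The constant and function names are hypothetical; only the default of 1024 and the list of valid dims come from the commit message.

```rust
/// Valid Matryoshka dims for the small model, as listed above.
pub const SMALL_VALID_DIMS: [usize; 7] = [32, 64, 128, 256, 512, 768, 1024];
/// Default (untruncated) dim for the small model.
pub const SMALL_DEFAULT_DIM: usize = 1024;

/// Hypothetical helper: resolve an optional `truncate_dim` setting to the
/// output dimension, rejecting values outside the supported set.
pub fn resolve_truncate_dim(truncate_dim: Option<usize>) -> Result<usize, String> {
    match truncate_dim {
        None => Ok(SMALL_DEFAULT_DIM),
        Some(d) if SMALL_VALID_DIMS.contains(&d) => Ok(d),
        Some(d) => Err(format!(
            "invalid truncate_dim {}; expected one of {:?}",
            d, SMALL_VALID_DIMS
        )),
    }
}
```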
- EmbedderBackend::JinaV5Small added to config.rs enum (serde kebab-case
serializes as "jina-v5-small")
- Updated truncate_dim doc comment to mention jina-v5-small alongside nano
- init_embedder (active, cfg(any(embeddings, jina-v5))): two new arms
for JinaV5Small — cfg(jina-v5) success path + cfg(not(jina-v5)) error path
- init_embedder (no-features stub): one new arm for JinaV5Small with Err
- All three feature matrix builds pass clean:
cargo build -p icm-cli --features jina-v5
cargo build -p icm-cli --features jina-v5
cargo build -p icm-cli --no-default-features --features jina-v5
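
The success/error arms described above can be illustrated with a simplified model. In the real code the choice is made at compile time with `cfg(feature = ...)`; here a `BuildFeatures` struct stands in for that so the sketch runs on its own, and the function returns the backend name instead of a boxed embedder. `select_embedder`, `BuildFeatures`, and the treatment of `Fastembed` without the `embeddings` feature (returning `Ok(None)`) are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EmbedderBackend {
    Fastembed,
    JinaV5Nano,
    JinaV5Small,
}

/// Runtime stand-in for the Cargo features compiled into the binary.
#[derive(Debug, Clone, Copy)]
pub struct BuildFeatures {
    pub embeddings: bool,
    pub jina_v5: bool,
}

/// Hypothetical model of the factory: `Ok(Some(name))` when the backend is
/// available, `Ok(None)` when embeddings are simply off, and `Err` when the
/// config asks for a backend this binary was not built with.
pub fn select_embedder(
    backend: EmbedderBackend,
    built: BuildFeatures,
) -> Result<Option<&'static str>, String> {
    match backend {
        EmbedderBackend::Fastembed if built.embeddings => Ok(Some("fastembed")),
        // Assumption: the default backend without its feature means "no embeddings".
        EmbedderBackend::Fastembed => Ok(None),
        EmbedderBackend::JinaV5Nano if built.jina_v5 => Ok(Some("jina-v5-nano")),
        EmbedderBackend::JinaV5Small if built.jina_v5 => Ok(Some("jina-v5-small")),
        // Explicit error instead of a silent fallback.
        EmbedderBackend::JinaV5Nano | EmbedderBackend::JinaV5Small => Err(format!(
            "backend {:?} requires a binary built with the jina-v5 feature",
            backend
        )),
    }
}
```

Keeping one return type across every feature combination is what allows call sites to use `as_deref()` uniformly instead of cfg-gated casts.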
…embed
- Add --no-auto-reembed global CLI flag (default: false) that skips
automatic re-embedding when the embedder dimension changes between runs.
- Replace the warning-only dim-change block with auto re-embedding:
- When an embedder is active, cmd_embed(force=false) runs immediately
so NULL rows (cleared by migration) are repopulated.
- --no-auto-reembed suppresses this for interactive CLI commands but
is ignored for the MCP server (MCP always auto-reembeds).
- When no embedder is active, a manual hint is printed instead.
- Broaden cmd_embed gate from #[cfg(feature = "embeddings")] to
#[cfg(any(feature = "embeddings", feature = "jina-v5"))] so
jina-v5-only builds can use the function.
- Add migration_tests::test_dim_change_detection_and_nulling:
opens a store @384, stores 5 memories with embeddings, reopens @768,
verifies dim_changed=true, affected_rows=5, and all embeddings=NULL.
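
The decision rules in this commit can be condensed into a small pure function. This is an illustrative sketch only: `ReembedAction`, `decide_reembed`, and the boolean parameters are hypothetical names, and the real code calls `cmd_embed(force=false)` rather than returning an enum.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReembedAction {
    /// Embedding dimension did not change; nothing to do.
    Nothing,
    /// Re-embed the rows the migration set to NULL.
    AutoReembed,
    /// User opted out with --no-auto-reembed.
    Skipped,
    /// No embedder available; tell the user how to re-embed manually.
    PrintManualHint,
}

/// Hypothetical condensation of the rules listed above.
pub fn decide_reembed(
    dim_changed: bool,
    embedder_active: bool,
    no_auto_reembed: bool,
    is_mcp: bool,
) -> ReembedAction {
    if !dim_changed {
        ReembedAction::Nothing
    } else if !embedder_active {
        ReembedAction::PrintManualHint
    } else if no_auto_reembed && !is_mcp {
        // The opt-out flag only applies to interactive CLI commands.
        ReembedAction::Skipped
    } else {
        ReembedAction::AutoReembed
    }
}
```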
- F-1: registration-only patch built and committed locally; saved as durable artifact at docs/issues/jina-v5-fastembed-rs-F1.patch. Apply with `git am` to a fastembed-rs fork. Snapshot test deferred (catch-all panic signals 'fill-me-in' to CI / maintainer).
- F-2: cannot ship as registration-only. v5-text-small is Qwen3 decoder-based and Anush008/fastembed-rs main lacks the decoder ONNX plumbing (last-token pooling, position_ids/KV-cache injection). Closed PR rtk-ai#236 attempted this work. ICM's own ort integration in icm-core (S-2) covers the local consumer path.

Track 1 (ICM internal) is fully complete: S-store, S-1, S-2, S-3, S-4, S-5 all merged.
📊 Automated PR Analysis
Summary
Adds support for Jina AI v5 Matryoshka embedding models (nano and small) as opt-in backends behind a …
Review Checklist
Linked issues: #134
Analyzed automatically by wshm · This is an automated analysis, not a human review.
Thanks for this, genuinely one of the cleanest feature PRs we've had. The slice breakdown, the …

On licensing (CC BY-NC): we're comfortable shipping this as an opt-in, off-by-default backend where the user downloads the weights themselves. We won't enable it in any hosted/commercial deployment on our side, so it stays a self-hosted convenience. No blocker there; your surfacing of the license (config / …

A few things before we merge:

1. Rebase needed
2. Please add a CI job for the …
Thanks for the substantial work @metaphorics, happy to land this with …

Licensing: both …

Required guardrails before merge

Want to take another pass, or would you prefer I add the licence-gate …
Closes #134.
Adds support for Jina AI's v5 Matryoshka text-embedding models (`jina-embeddings-v5-text-{nano,small}-retrieval`) as selectable backends behind a `jina-v5` Cargo feature, gated OFF by default. License: CC BY-NC 4.0 (non-commercial), surfaced in README, `config/default.toml`, `icm config show`, and `icm recall` output.

What's in this PR
| Commits | Changes |
| --- | --- |
| 36b4030 | `MigrationStatus { dim_changed, old_dim, new_dim, affected_rows }` returned from schema migration; decouples `icm-store` from embedder logic |
| 4cf9462 + 1995d50 + ee977fa | `Embedder` trait extended with default `embed_query` / `embed_document` / `model_name` / `license`; `JinaV5NanoEmbedder` (EuroBERT-based, 768d) via `ort` 2.0.0-rc.9 + `tokenizers` + `hf-hub` |
| e5e394f + 818972d | `JinaV5SmallEmbedder` (Qwen3-based, 1024d); shared `truncate_and_renorm` Matryoshka helper |
| f46d179 | `--no-auto-reembed` opt-out flag; integration test |
| 32376f2 | `embed_query` prepends `"retrieval.query: "`, `embed_document` prepends `"retrieval.passage: "`; internal `TextEncoder` trait enables `MockTextEncoder` unit tests asserting exact prefix strings |
| fc528fd | `config/default.toml` license comment; `icm config show` prints `backend = jina-v5-nano` and `license = CC-BY-NC-4.0, non-commercial`; `icm recall` header includes `model: <hf-id>`; CHANGELOG.md created |

Verification
Architecture decisions
- `icm-store` has zero embedder references. The dim-change migration returns a `MigrationStatus` value; CLI/MCP own orchestration of the re-embed loop.
- `JinaV5{Nano,Small}Embedder` hold `Arc<dyn TextEncoder>` instead of an inline ort session. Production path uses `OrtEncoder`; tests use `MockTextEncoder` to capture exact strings passed to the encoder. This shape proves prefix correctness without requiring a real ONNX runtime in unit tests.
- … `truncate_dim` after.
- `EmbedderBackend::Fastembed` is still default; the `jina-v5` feature is opt-in.

Out of scope
- … `FastEmbedder` (not needed by multilingual-e5)
- … `Anush008/fastembed-rs`: separate parallel PR for the registration-only path on v5-text-nano

License
This change does NOT redistribute Jina v5 weights. Users who enable `jina-v5` agree to CC BY-NC 4.0 by downloading from HuggingFace at runtime. Surfaced everywhere a user would notice (config, status, recall header).
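
For reference, the `TextEncoder` / `MockTextEncoder` design described under "Architecture decisions" can be sketched as follows. The trait shape, the `PrefixingEmbedder` type, and the `seen` field are assumptions for illustration; only the two prefix strings and the `Arc<dyn TextEncoder>` indirection come from the PR description.

```rust
use std::sync::{Arc, Mutex};

/// Assumed shape of the internal encoder abstraction.
pub trait TextEncoder: Send + Sync {
    fn encode(&self, text: &str) -> Vec<f32>;
}

/// Test double that records every string it is asked to encode.
#[derive(Default)]
pub struct MockTextEncoder {
    pub seen: Mutex<Vec<String>>,
}

impl TextEncoder for MockTextEncoder {
    fn encode(&self, text: &str) -> Vec<f32> {
        self.seen.lock().unwrap().push(text.to_string());
        vec![0.0] // dummy vector; the tests only inspect `seen`
    }
}

/// Hypothetical embedder holding the encoder behind a trait object, so the
/// production ONNX-backed encoder and the mock are interchangeable.
pub struct PrefixingEmbedder {
    encoder: Arc<dyn TextEncoder>,
}

impl PrefixingEmbedder {
    pub fn new(encoder: Arc<dyn TextEncoder>) -> Self {
        Self { encoder }
    }

    /// Queries get the "retrieval.query: " prefix.
    pub fn embed_query(&self, text: &str) -> Vec<f32> {
        self.encoder.encode(&format!("retrieval.query: {}", text))
    }

    /// Documents get the "retrieval.passage: " prefix.
    pub fn embed_document(&self, text: &str) -> Vec<f32> {
        self.encoder.encode(&format!("retrieval.passage: {}", text))
    }
}
```

Because the mock captures the exact input strings, a unit test can assert on the prefixes without loading a model or an ONNX runtime.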