feat: support Jina AI v5 Matryoshka embedding models (nano + small) #135

Open
metaphorics wants to merge 11 commits into rtk-ai:main from metaphorics:feat/jina-v5-matryoshka

Conversation

@metaphorics

Closes #134.

Adds support for Jina AI's v5 Matryoshka text-embedding models (jina-embeddings-v5-text-{nano,small}-retrieval) as selectable backends behind a jina-v5 Cargo feature, gated OFF by default. License: CC BY-NC 4.0 (non-commercial) — surfaced in README, config/default.toml, icm config show, and icm recall output.

What's in this PR

| Slice | Commit | What |
| --- | --- | --- |
| S-store | 36b4030 | MigrationStatus { dim_changed, old_dim, new_dim, affected_rows } returned from schema migration; decouples icm-store from embedder logic |
| S-1 | 4cf9462 + 1995d50 + ee977fa | Embedder trait extended with default embed_query/embed_document/model_name/license; JinaV5NanoEmbedder (EuroBERT-based, 768d) via ort 2.0.0-rc.9 + tokenizers + hf-hub |
| S-2 | e5e394f + 818972d | JinaV5SmallEmbedder (Qwen3-based, 1024d); shared truncate_and_renorm Matryoshka helper |
| S-3 | f46d179 | Auto re-embed loop in CLI/MCP when stored embedding dim differs from active model; --no-auto-reembed opt-out flag; integration test |
| S-4 | 32376f2 | Asymmetric retrieval: embed_query prepends "retrieval.query: ", embed_document prepends "retrieval.passage: "; internal TextEncoder trait enables MockTextEncoder unit tests asserting exact prefix strings |
| S-5 | fc528fd | Docs/UX/license: README "Embedder backends" section with non-commercial warning; config/default.toml license comment; icm config show prints backend = jina-v5-nano and license = CC-BY-NC-4.0, non-commercial; icm recall header includes model: <hf-id>; CHANGELOG.md created |

Verification

cargo test --features "embeddings jina-v5"   # 310 passed (8 suites, 4.29s)
cargo build --features "embeddings jina-v5"  # exit 0
cargo clippy --features "embeddings jina-v5" -- -D warnings  # No issues found

Architecture decisions

  1. Store stays data-only. icm-store has zero embedder references. The dim-change migration returns a MigrationStatus value; CLI/MCP own orchestration of the re-embed loop.
  2. Asymmetric retrieval via DI. JinaV5{Nano,Small}Embedder hold Arc<dyn TextEncoder> instead of an inline ort session. Production path uses OrtEncoder; tests use MockTextEncoder to capture exact strings passed to the encoder. This shape proves prefix correctness without requiring a real ONNX runtime in unit tests.
  3. Matryoshka truncation lives at the embedder level, not inside the encoder, so the encoder returns full-dim vectors and each embedder applies its own truncate_dim after.
  4. Default backend unchanged. EmbedderBackend::Fastembed is still default; the jina-v5 feature is opt-in.
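
To illustrate decision 2, here is a minimal sketch of the dependency-injection shape, not the PR's actual code. The trait signature, the `PrefixingEmbedder` type, and the fixed dummy vector are assumptions made for illustration; only the names `TextEncoder` and `MockTextEncoder` and the two prefix strings come from the description above.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the PR's internal `TextEncoder` trait.
/// The real signature is not shown in this thread.
pub trait TextEncoder: Send + Sync {
    fn encode(&self, text: &str) -> Vec<f32>;
}

/// Test double that records every string it is asked to encode.
#[derive(Default)]
pub struct MockTextEncoder {
    pub seen: Mutex<Vec<String>>,
}

impl TextEncoder for MockTextEncoder {
    fn encode(&self, text: &str) -> Vec<f32> {
        self.seen.lock().unwrap().push(text.to_string());
        // Dummy vector; the mock only exists to capture its input.
        vec![1.0, 0.0, 0.0, 0.0]
    }
}

/// Hypothetical embedder that holds the encoder behind `Arc<dyn TextEncoder>`.
pub struct PrefixingEmbedder {
    encoder: Arc<dyn TextEncoder>,
}

impl PrefixingEmbedder {
    pub fn new(encoder: Arc<dyn TextEncoder>) -> Self {
        Self { encoder }
    }

    /// Queries get the query prefix named in slice S-4.
    pub fn embed_query(&self, text: &str) -> Vec<f32> {
        self.encoder.encode(&format!("retrieval.query: {text}"))
    }

    /// Stored documents get the passage prefix named in slice S-4.
    pub fn embed_document(&self, text: &str) -> Vec<f32> {
        self.encoder.encode(&format!("retrieval.passage: {text}"))
    }
}
```

Because the mock keeps the exact strings it received, a unit test can assert on the prefixes without loading an ONNX model.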

Out of scope

  • Jina v3 / v4 (different architecture)
  • Asymmetric prefix injection on FastEmbedder (not needed by multilingual-e5)
  • Upstream contribution to Anush008/fastembed-rs — separate parallel PR for the registration-only path on v5-text-nano

License

This change does NOT redistribute Jina v5 weights. Users who enable jina-v5 agree to CC BY-NC 4.0 by downloading from HuggingFace at runtime. Surfaced everywhere a user would notice (config, status, recall header).

- Add MigrationStatus { dim_changed, old_dim, new_dim, affected_rows }
  to icm-store/src/lib.rs (public export)
- Change init_db / init_db_with_dims to return IcmResult<MigrationStatus>
  instead of IcmResult<()>; dim-change path populates all four fields
  (affected_rows comes from conn.execute() return value); no-change path
  returns MigrationStatus::default()
- Propagate through SqliteStore::with_dims / SqliteStore::new which now
  return IcmResult<(Self, MigrationStatus)>
- Update open_store in icm-cli/src/main.rs to destructure the tuple;
  fix all direct SqliteStore::new / SqliteStore::with_dims call sites in
  main.rs and learn_tests.rs
- No behavior change to the migration logic itself; 125 icm-store tests pass
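
As a rough sketch of the value described in this commit: the field types and the `migration_status` helper below are assumptions for illustration (the commit only names the four fields and the default-on-no-change behavior).

```rust
/// Outcome of a schema migration. Field names follow the commit message;
/// the integer types are assumed.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct MigrationStatus {
    pub dim_changed: bool,
    pub old_dim: usize,
    pub new_dim: usize,
    pub affected_rows: usize,
}

/// Hypothetical helper: builds the status from the stored dimension, the
/// requested dimension, and the number of rows whose embeddings were cleared
/// (standing in for the `conn.execute()` return value).
pub fn migration_status(
    stored_dim: Option<usize>,
    requested_dim: usize,
    rows_cleared: usize,
) -> MigrationStatus {
    match stored_dim {
        // Dim-change path: populate all four fields.
        Some(old) if old != requested_dim => MigrationStatus {
            dim_changed: true,
            old_dim: old,
            new_dim: requested_dim,
            affected_rows: rows_cleared,
        },
        // Fresh database or unchanged dimension: report the default.
        _ => MigrationStatus::default(),
    }
}
```

Returning plain data like this is what lets the store stay free of embedder references while the caller decides what to do about a change.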
…t, model_name, license

Adds four new methods to the Embedder trait, all with default impls so
existing implementors (FastEmbedder) keep working unchanged.

* embed_query / embed_document: hooks for asymmetric retrieval models.
  Default: delegate to embed(). Lets future models (e.g. jina-v5) use
  query/passage prefixes per HF convention.
* model_name / license: exposed for upcoming `icm config show` CLI
  output. Default: empty strings.

Recall paths in icm-cli (cmd_recall) and icm-mcp (icm_recall tool) now
call embed_query() instead of embed(). Store paths keep embed() since
embed_document defaults to embed anyway, so this is backward compatible
for every model that does not override the new methods.
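
A minimal sketch of the default-method pattern this commit describes. The error type (`String`), the return types, and the `LegacyEmbedder` implementor are assumptions; the real trait in icm-core may differ.

```rust
/// Simplified stand-in for the `Embedder` trait with the four new methods.
pub trait Embedder {
    fn embed(&self, text: &str) -> Result<Vec<f32>, String>;

    /// Default: symmetric models treat queries like any other text.
    fn embed_query(&self, text: &str) -> Result<Vec<f32>, String> {
        self.embed(text)
    }

    /// Default: symmetric models treat documents like any other text.
    fn embed_document(&self, text: &str) -> Result<Vec<f32>, String> {
        self.embed(text)
    }

    /// Default: empty string, as stated in the commit message.
    fn model_name(&self) -> &str {
        ""
    }

    /// Default: empty string, as stated in the commit message.
    fn license(&self) -> &str {
        ""
    }
}

/// Hypothetical existing implementor that only provides `embed`.
pub struct LegacyEmbedder;

impl Embedder for LegacyEmbedder {
    fn embed(&self, text: &str) -> Result<Vec<f32>, String> {
        // Toy embedding so the example stays self-contained.
        Ok(vec![text.len() as f32])
    }
}
```

An implementor written before this change compiles untouched and gets identical results from `embed`, `embed_query`, and `embed_document`.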
…nizers

Adds a new Jina v5 text-nano-retrieval embedder that runs ONNX
inference locally via the ort crate and tokenizers, gated behind the
new `jina-v5` Cargo feature. The model weights are downloaded from
HuggingFace on first use into the standard hf-hub cache.

License: CC-BY-NC-4.0 (non-commercial). The new `Embedder::license()`
method returns "CC-BY-NC-4.0" so future `icm config show` output can
warn the user.

Workspace deps (each marked optional in icm-core/Cargo.toml so default
builds pull none of them):

* `hf-hub` for model download into the per-user cache.
* `ort` 2.0.0-rc.9 with `load-dynamic` (uses system onnxruntime; no
  binary download at build time) and `ndarray` (for try_extract_tensor).
* `tokenizers` 0.21 with default features — required for the BPE/regex
  backends; the bare `http` feature alone fails tokenizers' own
  compile_error! gate.
* `ndarray` for input tensor construction.

`EmbeddingsConfig` gains a `backend` enum (`fastembed` | `jina-v5-nano`)
and an optional `truncate_dim` (Matryoshka representation truncation:
32 / 64 / 128 / 256 / 512 / 768). Unsupported backend selections (e.g.
config requests `jina-v5-nano` but binary was built without the
feature) now produce an explicit error rather than silently falling
back to no-embeddings.

The `init_embedder` factory returns `Result<Option<Box<dyn Embedder>>>`
under every feature combination, so all CLI call sites now use
`embedder.as_deref()` instead of cfg-gated `as &dyn Embedder` casts.
The `MigrationStatus` returned by `open_store` is now logged when the
embedding dim changes.

Tests: 4 new unit tests for the Matryoshka `truncate_and_renorm`
helper (shape, unit-norm, n-clamping, zero-vector). All 291 existing
tests still pass; clippy clean across all feature combos.
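
For readers unfamiliar with Matryoshka truncation, a helper of this kind could look roughly like the sketch below. The signature and the zero-vector behavior are assumptions inferred from the test names listed above, not the PR's code.

```rust
/// Keep the first `n` components of `v` and rescale them to unit L2 norm.
/// `n` is clamped to the vector length; an all-zero prefix is returned as-is
/// to avoid dividing by zero.
pub fn truncate_and_renorm(v: &[f32], n: usize) -> Vec<f32> {
    let n = n.min(v.len());
    let mut out = v[..n].to_vec();
    let norm = out.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in &mut out {
            *x /= norm;
        }
    }
    out
}
```

Renormalizing after truncation matters because cosine or dot-product search assumes unit-length vectors, and a prefix of a unit vector is generally shorter than 1.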
- New JinaV5SmallEmbedder in crates/icm-core/src/jina_v5_small.rs
  gated behind the existing jina-v5 Cargo feature
- HF model: jinaai/jina-embeddings-v5-text-small-retrieval
- Default dim: 1024; valid Matryoshka dims: [32,64,128,256,512,768,1024]
- Reuses crate::jina_v5_nano::truncate_and_renorm (not duplicated)
- Same mean-pool + L2-norm ONNX pipeline as nano (Qwen3 vs EuroBERT
  is internal to the model; inference code is identical from our side)
- embed_query / embed_document asymmetric prefix injection deferred to S-4
- 3 unit tests: truncate_correct_dim, truncate_max_dim, invalid_dim_rejected
- Exported from lib.rs as JinaV5SmallEmbedder behind jina-v5 feature

Note: shared-core extraction across nano+small deferred — follow-up slice.
- EmbedderBackend::JinaV5Small added to config.rs enum (serde kebab-case
  serializes as "jina-v5-small")
- Updated truncate_dim doc comment to mention jina-v5-small alongside nano
- init_embedder (active, cfg(any(embeddings, jina-v5))): two new arms
  for JinaV5Small — cfg(jina-v5) success path + cfg(not(jina-v5)) error path
- init_embedder (no-features stub): one new arm for JinaV5Small with Err
- All three feature matrix builds pass clean:
    cargo build -p icm-cli --features jina-v5
    cargo build -p icm-cli --features jina-v5
    cargo build -p icm-cli --no-default-features --features jina-v5
…embed

- Add --no-auto-reembed global CLI flag (default: false) that skips
  automatic re-embedding when the embedder dimension changes between runs.
- Replace the warning-only dim-change block with auto re-embedding:
  - When an embedder is active, cmd_embed(force=false) runs immediately
    so NULL rows (cleared by migration) are repopulated.
  - --no-auto-reembed suppresses this for interactive CLI commands but
    is ignored by the MCP server (MCP always auto-reembeds).
  - When no embedder is active, a manual hint is printed instead.
- Broaden cmd_embed gate from #[cfg(feature = "embeddings")] to
  #[cfg(any(feature = "embeddings", feature = "jina-v5"))] so
  jina-v5-only builds can use the function.
- Add migration_tests::test_dim_change_detection_and_nulling:
  opens a store @384, stores 5 memories with embeddings, reopens @768,
  verifies dim_changed=true, affected_rows=5, and all embeddings=NULL.
- F-1: registration-only patch built and committed locally; saved as
  durable artifact at docs/issues/jina-v5-fastembed-rs-F1.patch.
  Apply with `git am` to a fastembed-rs fork. Snapshot test deferred
  (catch-all panic signals 'fill-me-in' to CI / maintainer).

- F-2: cannot ship as registration-only. v5-text-small is Qwen3
  decoder-based and Anush008/fastembed-rs main lacks the decoder
  ONNX plumbing (last-token pooling, position_ids/KV-cache
  injection). Closed PR rtk-ai#236 attempted this work. ICM's own ort
  integration in icm-core (S-2) covers the local consumer path.

Track 1 (ICM internal) is fully complete: S-store, S-1, S-2, S-3,
S-4, S-5 all merged.
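
The S-3 auto re-embed rules described earlier in this thread can be summarized as a small decision function. This is an illustrative sketch only: the enum, the function name, and the boolean parameters are hypothetical, and the real code calls cmd_embed rather than returning a value.

```rust
/// What to do after opening the store (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum ReembedAction {
    /// Dimension unchanged: nothing to do.
    Nothing,
    /// Re-embed the rows whose embeddings the migration set to NULL.
    Reembed,
    /// The user opted out with --no-auto-reembed.
    Skip,
    /// No embedder is active: print a manual hint instead.
    ManualHint,
}

pub fn reembed_action(
    dim_changed: bool,
    embedder_active: bool,
    no_auto_reembed: bool,
    is_mcp: bool,
) -> ReembedAction {
    if !dim_changed {
        return ReembedAction::Nothing;
    }
    if !embedder_active {
        return ReembedAction::ManualHint;
    }
    // The opt-out flag only applies to interactive CLI commands;
    // MCP always re-embeds.
    if no_auto_reembed && !is_mcp {
        return ReembedAction::Skip;
    }
    ReembedAction::Reembed
}
```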
@pszymkowiak
Contributor

wshm · Automated triage by AI

📊 Automated PR Analysis

Type feature
🟡 Risk medium

Summary

Adds support for Jina AI v5 Matryoshka embedding models (nano and small) as opt-in backends behind a jina-v5 Cargo feature flag. Includes ONNX-based local inference, asymmetric retrieval with query/document prefixes, auto re-embed on dimension change, Matryoshka truncation, and comprehensive license surfacing for the CC BY-NC 4.0 weights.

Review Checklist

  • Tests present
  • Breaking change
  • Docs updated

Linked issues: #134


Analyzed automatically by wshm · This is an automated analysis, not a human review.

@pszymkowiak
Contributor

Thanks for this — genuinely one of the cleanest feature PRs we've had. The slice breakdown, the TextEncoder DI for testable prefix assertions, the data-only MigrationStatus design, no unwrap() in production paths, and the feature-gating (zero new deps in default builds) are all excellent. 🙏

On licensing (CC BY-NC): we're comfortable shipping this as an opt-in, off-by-default backend where the user downloads the weights themselves. We won't enable it in any hosted/commercial deployment on our side, so it stays a self-hosted convenience. No blocker there — your surfacing of the license (config / config show / recall) is exactly right.

A few things before we merge:

1. Rebase needed

develop has moved (0.10.50 release + several merged PRs), so this now conflicts — mainly main.rs (touched by #242 / #233 / #231), Cargo.toml / Cargo.lock, and CHANGELOG.md. Please rebase onto develop (we target develop, not main, now). in_memory() on develop discards the migration status via ?;, so it should stay compatible with your with_dims tuple change.

2. Please add a CI job for the jina-v5 feature

CI runs cargo test --workspace / cargo clippy --workspace --all-targets, i.e. default features (embeddings + tui). Since jina-v5 is opt-in, none of the new code (jina_v5_nano/small, OrtEncoder, the auto-reembed path) is compiled or tested by CI — only your local run covers it, and it'll silently bit-rot on the next main.rs/store refactor. Could you add a job running:

cargo clippy -p icm-cli --features jina-v5 -- -D warnings
cargo test  -p icm-cli --features jina-v5

3. Can we see the retrieval gain measured?

The PR proves correctness but we couldn't find a quality measurement, and this isn't free at runtime (ort RC + a system ONNX Runtime requirement + 2–2.6× larger embeddings → bigger DB, slower vector search). A small recall@k (or nDCG@10) A/B on a realistic ICM memory set would let us justify turning it on.

Fair-baseline note: FastEmbedder.embed() currently sends raw text to multilingual-e5-base with no query: / passage: prefixes — but e5 requires them, so the "asymmetric prefix injection on FastEmbedder (not needed by multilingual-e5)" note isn't quite right. To compare apples-to-apples the baseline should be e5 with its prefixes, otherwise jina-v5's asymmetric retrieval is partly measured against a misconfigured e5. (We may land that e5-prefix fix separately — it benefits the default path for free.)
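
For reference, the recall@k metric requested here is simple to compute once each query has a ranked result list and a set of known-relevant memory ids. The sketch below assumes integer ids and unique entries in each ranking; the function names and data layout are hypothetical, not part of ICM.

```rust
use std::collections::HashSet;

/// Fraction of the relevant ids that appear in the top `k` ranked results.
/// Assumes `ranked` contains no duplicate ids. Returns 0.0 when there are
/// no relevant ids for the query.
pub fn recall_at_k(ranked: &[u64], relevant: &HashSet<u64>, k: usize) -> f64 {
    if relevant.is_empty() {
        return 0.0;
    }
    let hits = ranked
        .iter()
        .take(k)
        .filter(|id| relevant.contains(*id))
        .count();
    hits as f64 / relevant.len() as f64
}

/// Mean recall@k over a set of (ranking, relevant ids) pairs, one per query.
pub fn mean_recall_at_k(runs: &[(Vec<u64>, HashSet<u64>)], k: usize) -> f64 {
    if runs.is_empty() {
        return 0.0;
    }
    let total: f64 = runs
        .iter()
        .map(|(ranked, relevant)| recall_at_k(ranked, relevant, k))
        .sum();
    total / runs.len() as f64
}
```

Running the same query set through each backend (e5 with its prefixes, jina-v5-nano, jina-v5-small) and comparing the means would give the A/B number asked for above.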

4. Security hardening (low severity given opt-in/self-hosted, but worth it)

  • Pin the HF model revision. api.model(HF_MODEL_ID).get("onnx/model.onnx") fetches from the repo's main HEAD, not a fixed commit. A compromised/updated upstream repo would be downloaded and executed. Please pin a commit SHA (Repo::with_revision).
  • ort load-dynamic resolves the ONNX Runtime via ORT_DYLIB_PATH / system search → a dlopen of an attacker-influenced path is code execution. Worth a doc note (and ideally a known/validated path) since this runs on user machines.

5. Minor: config & TUI surface

  • Backend is selectable via [embeddings].backend in TOML only — no env override and no CLI flag. Not blocking; just flagging for consistency with the rest of ICM's config (e.g. ICM_WEB_PASSWORD exists as an env var).
  • ICM ships a tui dashboard (default feature, icm dashboard). Its Overview tab shows global stats but won't reflect the active embedder backend/license after this PR. Since the TUI has no settings screen, no selector is needed — but surfacing backend = jina-v5-nano + the license tag in Overview (mirroring S-5's config show) would keep it consistent. Optional / follow-up.

Great work overall — the architecture is solid; we just want the gain on record and CI coverage before turning it on. 🚀

@pszymkowiak
Contributor

Thanks for the substantial work @metaphorics — happy to land this with
license guardrails on top.

Licensing: both jina-embeddings-v5-text-nano-retrieval and
jina-embeddings-v5-text-small-retrieval are CC BY-NC 4.0 (per the
Hugging Face model cards). ICM ships under Apache-2.0 and is used in
commercial deployments — we can't silently download a non-commercial
model. The plumbing you've added is great independently of the
licensing question, so let's bolt on safety layers and we're good.

Required guardrails before merge

  1. Refuse auto-download by default. If the configured embedding
    model carries a non-commercial license, every code path that would
    trigger a download (icm extract-pending, first icm store,
    icm dashboard model switcher, etc.) must abort with a friendly
    error explaining the license and how to opt in. No silent fetch.

  2. Env-var opt-in. Setting ICM_ACCEPT_NONCOMMERCIAL_LICENSE=1
    (or equivalent — happy to bikeshed the name) bypasses the abort.
    Intended for CI / scripts / users who already read the license.

  3. Interactive confirmation in the TUI. First time a user picks
    a non-commercial model in icm dashboard (or any future model
    picker), show a modal:

    jina-embeddings-v5-text-small-retrieval is licensed under
    CC BY-NC 4.0 — non-commercial use only. Commercial deployments
    must license from Jina AI (sales@jina.ai).
    Type accept to confirm and download.
    Persisted answer goes into config so it's not re-asked each time.

  4. Persistent TUI banner. Whenever the active embedder is a
    non-commercial model, the TUI footer / status bar shows a small
    [NC license] tag so the user is reminded at every glance which
    regime their DB is locked into.

  5. README + --help mention. A short section listing the non-
    commercial bundled models and pointing to the env var.

  6. Rebase. PR is currently in conflict on 21 files since v0.10.50
    shipped (CHANGELOG, Cargo.lock, Cargo.toml, main.rs, …).

Want to take another pass, or would you prefer I add the license-gate
pieces in a follow-up commit on your branch?
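
To make guardrails 1 and 2 concrete, one possible shape for the gate is sketched below. Everything here is an assumption for discussion: the function names, the error text, and the "-NC" substring heuristic are hypothetical, and the env-var name is the one proposed above (still open to bikeshedding).

```rust
/// Env var proposed in this review; the name is not final.
pub const ACCEPT_ENV: &str = "ICM_ACCEPT_NONCOMMERCIAL_LICENSE";

/// Heuristic (assumed): treat any license tag containing "-NC" as
/// non-commercial, e.g. "CC-BY-NC-4.0".
pub fn is_non_commercial(license: &str) -> bool {
    license.to_ascii_uppercase().contains("-NC")
}

/// Decide whether a download may proceed. `accept` is the value of the
/// opt-in env var, if set. Taking it as a parameter keeps the check pure
/// and easy to test.
pub fn check_license_gate(license: &str, accept: Option<&str>) -> Result<(), String> {
    if !is_non_commercial(license) || accept == Some("1") {
        return Ok(());
    }
    Err(format!(
        "model license {license} is non-commercial; set {ACCEPT_ENV}=1 to accept it and download"
    ))
}

/// Thin wrapper that reads the real environment.
pub fn gate_from_env(license: &str) -> Result<(), String> {
    let value = std::env::var(ACCEPT_ENV).ok();
    check_license_gate(license, value.as_deref())
}
```

Every code path that could trigger a download would call `gate_from_env` first and surface the error instead of fetching.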
