Commit 6535aa9

{"schema":"cmsg/1","type":"feat","scope":"doc-extension","summary":"Dense-first doc retrieval with sparse_mode and domain/repo filters","intent":"Make doc retrieval robust across languages while preserving exact-match recall","impact":"Runs dense retrieval on every docs_search_l0; optionally enables BM25 via sparse_mode auto/on/off; adds domain+repo filters; applies additive recency boost; updates API/MCP contracts and doc extension specs.","breaking":false,"risk":"medium","refs":["gh:hack-ink/ELF#84"]}
1 parent c2b75d5 commit 6535aa9

7 files changed

Lines changed: 853 additions & 311 deletions

apps/elf-api/src/routes.rs

Lines changed: 6 additions & 0 deletions
@@ -106,6 +106,9 @@ struct DocsSearchL0Body {
     scope: Option<String>,
     status: Option<String>,
     doc_type: Option<String>,
+    sparse_mode: Option<String>,
+    domain: Option<String>,
+    repo: Option<String>,
     agent_id: Option<String>,
     thread_id: Option<String>,
     updated_after: Option<String>,
@@ -1035,6 +1038,9 @@ async fn docs_search_l0(
         scope: payload.scope,
         status: payload.status,
         doc_type: payload.doc_type,
+        sparse_mode: payload.sparse_mode,
+        domain: payload.domain,
+        repo: payload.repo,
         agent_id: payload.agent_id,
         thread_id: payload.thread_id,
         updated_after: payload.updated_after,
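
Note: the route forwards `sparse_mode` as a raw `Option<String>`, so validation has to happen further down the stack. The following is a minimal sketch of how such a string might be turned into a typed mode. `SparseMode` and `parse_sparse_mode` are hypothetical names that do not appear in this diff, and the case-sensitive matching is an assumption.

```rust
/// Retrieval fusion mode, mirroring the `auto|on|off` values named in the spec.
/// Hypothetical type for illustration; the commit may model this differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SparseMode {
    Auto,
    On,
    Off,
}

/// Hypothetical helper: map the optional request string to a typed mode.
/// A missing value falls back to `Auto`; any unknown string is returned as an
/// error so the caller can answer with `400`. Matching is case-sensitive here,
/// which is an assumption.
pub fn parse_sparse_mode(raw: Option<&str>) -> Result<SparseMode, String> {
    match raw {
        None | Some("auto") => Ok(SparseMode::Auto),
        Some("on") => Ok(SparseMode::On),
        Some("off") => Ok(SparseMode::Off),
        Some(other) => Err(format!(
            "invalid sparse_mode: {other:?} (expected auto|on|off)"
        )),
    }
}
```

A handler could call `parse_sparse_mode(payload.sparse_mode.as_deref())` before building the search request, which keeps the wire type a plain string while the rest of the code works with the enum.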

apps/elf-mcp/src/server.rs

Lines changed: 25 additions & 0 deletions
@@ -882,6 +882,12 @@ fn docs_search_l0_schema() -> Arc<JsonObject> {
             "ts_lte": { "type": ["string", "null"], "format": "date-time" },
             "top_k": { "type": ["integer", "null"] },
             "candidate_k": { "type": ["integer", "null"] },
+            "sparse_mode": {
+                "type": ["string", "null"],
+                "enum": ["auto", "on", "off", null]
+            },
+            "domain": { "type": ["string", "null"] },
+            "repo": { "type": ["string", "null"] },
             "explain": { "type": ["boolean", "null"] },
             "read_profile": { "type": ["string", "null"] }
         }
@@ -1555,6 +1561,9 @@ mod tests {
             "updated_before",
             "ts_gte",
             "ts_lte",
+            "sparse_mode",
+            "domain",
+            "repo",
             "explain",
         ];

@@ -1580,6 +1589,22 @@ mod tests {
                 serde_json::Value::Null,
             ])
         );
+        assert_eq!(
+            properties.get("sparse_mode").and_then(serde_json::Value::as_object).and_then(
+                |field| {
+                    field
+                        .get("enum")
+                        .and_then(serde_json::Value::as_array)
+                        .map(|vals| vals.to_vec())
+                }
+            ),
+            Some(vec![
+                serde_json::Value::String("auto".to_string()),
+                serde_json::Value::String("on".to_string()),
+                serde_json::Value::String("off".to_string()),
+                serde_json::Value::Null,
+            ])
+        );
     }
 
     #[test]

docs/spec/system_doc_extension_v1_filters.md

Lines changed: 16 additions & 0 deletions
@@ -24,8 +24,12 @@ Scope
 - `status` (optional string): defaults to `active` when omitted. Current implementation matches
   this value exactly against stored doc status (`active`/`deleted` in current schema).
 - `doc_type` (optional string): exact-match filter.
+- `sparse_mode` (optional string): retrieval fusion control mode:
+  `auto` (default), `on`, `off`.
 - `agent_id` (optional string): exact-match filter.
 - `thread_id` (optional string): exact-match filter for `thread_id` payload field.
+- `domain` (optional string): exact-match filter for `domain` payload field.
+- `repo` (optional string): exact-match filter for `repo` payload field.
 - `updated_after` (optional string): RFC3339 timestamp lower bound for `updated_at`.
 - `updated_before` (optional string): RFC3339 timestamp upper bound for `updated_at`.
 - `ts_gte` (optional string): RFC3339 timestamp lower bound for `doc_ts`.
@@ -41,8 +45,16 @@ Scope
 Filter evaluation:
 - Every supplied filter is combined with logical AND.
 - `status` defaults to `active` when omitted.
+- `sparse_mode` is validated as one of `auto|on|off` (default `auto`).
+- `domain` requires `doc_type=search` and is rejected with `400` when used with other
+  `doc_type` values or when `doc_type` is omitted.
+- `repo` requires `doc_type=dev` and is rejected with `400` when used with other
+  `doc_type` values or when `doc_type` is omitted.
 - Invalid date values or `updated_after >= updated_before` are rejected with `400`.
 - Invalid date values or `ts_gte >= ts_lte` are rejected with `400`.
+- In `auto` sparse mode, sparse retrieval is enabled only when the query is judged as
+  symbol-heavy / exact-match oriented; otherwise only dense retrieval is used.
+- `sparse_mode=on` runs both dense and sparse retrieval; `sparse_mode=off` runs dense-only.
 
 Response behavior:
 - `docs_search_l0` always returns `trace_id`.
@@ -60,6 +72,8 @@ Each point used by `docs_search_l0` MUST include payload fields:
 - `doc_type`
 - `agent_id`
 - `thread_id`
+- `domain`
+- `repo`
 - `updated_at`
 - `doc_ts`

@@ -75,6 +89,8 @@ Implementations MUST provision payload indexes for:
 - `doc_type` (keyword)
 - `agent_id` (keyword)
 - `thread_id` (keyword)
+- `domain` (keyword)
+- `repo` (keyword)
 - `updated_at` (datetime)
 - `doc_ts` (datetime)
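
Note: the filter-evaluation rules in this spec tie `domain` to `doc_type=search` and `repo` to `doc_type=dev`, rejecting every other combination with `400`. A minimal sketch of that check is given here. `BadRequest` and `validate_scoped_filters` are hypothetical names standing in for whatever error type and validation entry point the service actually uses, and the error messages are illustrative.

```rust
/// Hypothetical error type standing in for an HTTP `400` response.
#[derive(Debug, PartialEq, Eq)]
pub struct BadRequest(pub String);

/// Enforce the pairing rules from the spec:
/// - `domain` is only accepted together with `doc_type=search`
/// - `repo` is only accepted together with `doc_type=dev`
/// An omitted `doc_type` counts as a mismatch, as the spec requires.
pub fn validate_scoped_filters(
    doc_type: Option<&str>,
    domain: Option<&str>,
    repo: Option<&str>,
) -> Result<(), BadRequest> {
    if domain.is_some() && doc_type != Some("search") {
        return Err(BadRequest("domain requires doc_type=search".to_string()));
    }
    if repo.is_some() && doc_type != Some("dev") {
        return Err(BadRequest("repo requires doc_type=dev".to_string()));
    }
    Ok(())
}
```

Because the two filters require different `doc_type` values, supplying both `domain` and `repo` in one request can never pass this check.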

docs/spec/system_doc_extension_v1_trajectory.md

Lines changed: 46 additions & 4 deletions
@@ -47,7 +47,12 @@ Allowed/expected stage names (in order):
    Ensures returned vector size matches the configured model/vector size.
 
 4. `vector_search`
-   Raw candidate retrieval from Qdrant.
+   Dense and optional sparse retrieval from Qdrant.
+   Dense retrieval runs first on every request; sparse retrieval is controlled by
+   `sparse_mode` (`auto`, `on`, `off`).
+   - `auto`: sparse retrieval only for symbol-heavy / exact-match style queries.
+   - `on`: always run both dense and sparse retrieval.
+   - `off`: dense-only retrieval.
 
 5. `dedupe`
    Chunk-id deduplication between retrieval tiers.
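
Note: the `vector_search` description in this hunk says dense retrieval always runs and that `auto` adds sparse retrieval only for symbol-heavy queries, but the diff does not show how a query is judged. The sketch below illustrates the channel decision under an assumed heuristic. `looks_symbol_heavy` and `select_channels` are hypothetical names, and the token patterns are loosely borrowed from the examples in the spec's evaluation scenarios (URLs, `#`, hex, `ERR_...`); they are not the commit's actual rule.

```rust
/// Hypothetical mode type mirroring `auto|on|off`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SparseMode {
    Auto,
    On,
    Off,
}

/// Assumed heuristic (not taken from the commit): a query counts as
/// symbol-heavy when any whitespace-separated token looks like a URL,
/// a path-qualified identifier, an error code, an issue reference, or hex.
pub fn looks_symbol_heavy(query: &str) -> bool {
    query.split_whitespace().any(|token| {
        token.contains("://")
            || token.contains("::")
            || token.starts_with("ERR_")
            || token.starts_with('#')
            || token.starts_with("0x")
    })
}

/// Decide which retrieval channels to run. Dense is always present;
/// sparse is added for `On`, never for `Off`, and by heuristic for `Auto`.
pub fn select_channels(mode: SparseMode, query: &str) -> Vec<&'static str> {
    let mut channels = vec!["dense"];
    let run_sparse = match mode {
        SparseMode::On => true,
        SparseMode::Off => false,
        SparseMode::Auto => looks_symbol_heavy(query),
    };
    if run_sparse {
        channels.push("sparse");
    }
    channels
}
```

The returned list has the same shape as the `channels` array reported in the `vector_search` stage stats, so a trajectory writer could record it directly.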
@@ -56,7 +61,9 @@ Allowed/expected stage names (in order):
    Document/chunk metadata hydration from Postgres.
 
 7. `result_projection`
-   Final scored item projection and output truncation.
+   Final scored item projection and output truncation.
+   Implementations apply a recency tie-break using `updated_at` and expose the
+   policy knobs in stage stats when available (`recency_tau_days`, `tie_breaker_weight`).
 
 8. `level_selection` (excerpts only)
    `L0|L1|L2` selection and byte budget.
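
Note: the `result_projection` stage names two knobs, `recency_tau_days` and `tie_breaker_weight`, and the commit message calls the adjustment an additive recency boost, but no formula appears in this diff. The sketch below assumes an exponential decay over document age; the decay shape, the function name `recency_boosted_score`, and the parameter `age_days` are all assumptions made for illustration.

```rust
/// Assumed scoring rule (not confirmed by the commit):
///   boosted = fusion_score + tie_breaker_weight * exp(-age_days / recency_tau_days)
/// A non-positive tau or weight disables the boost and returns the score unchanged.
pub fn recency_boosted_score(
    fusion_score: f64,
    age_days: f64,
    recency_tau_days: f64,
    tie_breaker_weight: f64,
) -> f64 {
    if recency_tau_days <= 0.0 || tie_breaker_weight <= 0.0 {
        return fusion_score;
    }
    // Clamp negative ages (clock skew, future timestamps) to zero so the
    // boost never exceeds `tie_breaker_weight`.
    let age = age_days.max(0.0);
    fusion_score + tie_breaker_weight * (-age / recency_tau_days).exp()
}
```

Under this assumption the boost is bounded by `tie_breaker_weight`, so it can reorder chunks whose fusion scores are close but cannot lift a clearly weaker match above a much stronger one.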
@@ -89,17 +96,52 @@ and `stage_name` values should be non-empty and meaningful for downstream reader
     {
       "stage_order": 1,
       "stage_name": "vector_search",
-      "stats": { "raw_points": 12 }
+      "stats": {
+        "sparse_mode": "auto",
+        "channels": ["dense"],
+        "dense_raw_points": 24,
+        "sparse_raw_points": 0,
+        "raw_points": 24
+      }
     },
     {
       "stage_order": 2,
       "stage_name": "result_projection",
-      "stats": { "returned_items": 5, "pre_authorization_candidates": 8 }
+      "stats": {
+        "returned_items": 5,
+        "pre_authorization_candidates": 8,
+        "recency_tau_days": 60,
+        "tie_breaker_weight": 0.12
+      }
     }
   ]
 }
 ```
 
+==================================================
+5) Evaluation Scenarios
+==================================================
+
+- English dense-first over mixed-language docs (expected dense-first)
+  - Request `sparse_mode` omitted or `off` for a normal English query.
+  - Example: natural-language question with low symbol density from mixed `chat/dev` content.
+  - `trajectory.stages.vector_search` should show `channels=["dense"]` and `sparse_raw_points=0` (or absent).
+  - `trajectory.stages.result_projection` should show normal ranking output and no symbolic jump from sparse-only terms.
+
+- Exact-match cases (`auto` vs `on`)
+  - Query contains symbols / identifiers (`/`, `:`, `#`, hex, URLs, error codes like `ERR_...`, full stack traces, full identifiers).
+  - With `sparse_mode=auto`, expect `channels=["dense"]` for generic prose and `channels` may include `"sparse"` when the query is symbol-heavy.
+  - With `sparse_mode=on`, expect `channels` to include both `"dense"` and `"sparse"` even if `auto` would stay dense-only.
+  - Compare `vector_search.raw_points` and `result_projection` stability across modes for the same corpus; `sparse_mode=on` should improve retrieval of exact token patterns in symbol-heavy queries.
+
+- Recency bias checks
+  - Configure `cfg.ranking.recency_tau_days` and `cfg.ranking.tie_breaker_weight` > 0.
+  - In `trajectory.stages.result_projection`, verify fields:
+    - `recency_tau_days` (current effective value),
+    - `tie_breaker_weight` (current effective weight),
+    - `pre_authorization_candidates` and `returned_items`.
+  - Expected signal: newer `updated_at` chunks should move upward when fusion scores are close and tie-break is active.
+
 ```json
 {
   "schema": "doc_retrieval_trajectory/v1",
