feat(retrieve): use tags metadata for cross-subtree retrieval #1162

Merged
MaojiaSheng merged 6 commits into volcengine:main from 13ernkastel:codex/issue-1147-tags-retrieval
Apr 3, 2026
Conversation

@13ernkastel
Contributor

@13ernkastel 13ernkastel commented Apr 1, 2026

Summary

  • auto-extract and persist tags during resource vectorization, while also allowing user-supplied tags on resource ingestion and search requests
  • thread tags through the resource/search services, SDK clients, and vector index filtering so callers can explicitly constrain retrieval by tag
  • expand HierarchicalRetriever with bounded, down-weighted tag-based cross-subtree discovery before BFS traversal starts
  • add focused retriever and router tests, and make the shared server test fixture use an isolated local config/AGFS setup

Related Issue

Fixes #1147

Why

Issue #1147 asks for tags to act as a lateral discovery signal across semantically distant subtrees. This keeps the existing hierarchical retrieval flow, but gives it a controlled way to discover related branches that the initial semantic top-K would otherwise miss.

Impact

  • resources can store merged auto-extracted and user-provided tags
  • find/search requests can explicitly scope retrieval by tags
  • global semantic hits can seed additional related subtrees through shared tags with capped expansion and lower initial scores
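
The capped, down-weighted expansion described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `Hit`, `expand_by_tags`, the `tag_index` mapping, and the `max_expansions` and `penalty` parameters are hypothetical names chosen for the example, and the directory paths are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass
class Hit:
    """A global semantic hit (hypothetical shape, for illustration only)."""
    uri: str
    score: float
    tags: Set[str]


def expand_by_tags(
    seeds: List[Hit],
    tag_index: Dict[str, List[str]],
    max_expansions: int = 3,
    penalty: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return extra (uri, score) starting points that share a tag with a seed.

    The expansion is bounded by `max_expansions`, and each discovered subtree
    starts with the seed's score multiplied by `penalty`, so lateral
    discoveries rank below the hits that led to them.
    """
    seen = {hit.uri for hit in seeds}
    extra: List[Tuple[str, float]] = []
    # Visit the strongest seeds first so the cap favours the best evidence.
    for seed in sorted(seeds, key=lambda h: h.score, reverse=True):
        for tag in sorted(seed.tags):
            for uri in tag_index.get(tag, []):
                if len(extra) >= max_expansions:
                    return extra
                if uri in seen:
                    continue  # already a seed or already discovered
                seen.add(uri)
                extra.append((uri, seed.score * penalty))
    return extra


seeds = [Hit("docs/ml", 0.8, {"auto:pytorch"})]
tag_index = {"auto:pytorch": ["docs/ml", "notes/gpu", "papers/dl", "misc/old"]}
extra = expand_by_tags(seeds, tag_index, max_expansions=2)
```

In this sketch the extra starting points would be merged into the traversal queue before BFS begins; how the real retriever merges and scores them may differ.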

Validation

  • PYTHONPATH=/Users/lennonchia/Documents/Project/OpenViking /Users/lennonchia/Documents/Project/OpenViking/.venv/bin/python -m pytest -q tests/retrieve/test_hierarchical_retriever_target_dirs.py tests/retrieve/test_hierarchical_retriever_rerank.py tests/retrieve/test_hierarchical_retriever_tags.py tests/server/test_api_search.py::test_find_forwards_tags_to_service tests/server/test_api_resources.py::test_add_resource_forwards_tags_to_service

Notes

  • I attempted the broader tests/server/test_api_search.py suite locally, but the current local wheel is missing the native VectorDB PersistStore symbol, so full end-to-end vector search verification is still environment-dependent here.

@github-actions

github-actions bot commented Apr 1, 2026

Failed to generate code suggestions for PR

@13ernkastel 13ernkastel changed the title [codex] use tags metadata for cross-subtree retrieval [feat] use tags metadata for cross-subtree retrieval Apr 1, 2026
@13ernkastel 13ernkastel marked this pull request as ready for review April 1, 2026 16:35
@github-actions

github-actions bot commented Apr 1, 2026

Failed to generate code suggestions for PR

@13ernkastel 13ernkastel changed the title [feat] use tags metadata for cross-subtree retrieval feat(retrieve): use tags metadata for cross-subtree retrieval Apr 1, 2026
@MaojiaSheng
Collaborator

I suggest constraining how tag contents are named, for example by requiring each tag to declare its source or reason:
user:machine-learning;user:model-training;auto:pytorch

@qin-ctx qin-ctx requested a review from zhoujh01 April 2, 2026 07:29
@13ernkastel
Contributor Author

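@MaojiaSheng I have tightened this further along the lines of your suggestion.

In addition to the earlier change that moved tag writes and queries to user: / auto: naming, this revision adds another layer of hard constraints:

  • user-supplied resource tags are now forcibly normalized to user:<tag>; even if the caller explicitly passes a prefix such as auto:, it is not stored as-is
  • tags produced by auto-extraction / summary generation are forcibly normalized to auto:<tag>
  • added regression coverage for the resource write, tag helper, and vectorize paths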
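
A minimal sketch of the namespace normalization described in the list above. It is an illustration only, not the helper from commit 67f3069: `canonicalize_tags` and `KNOWN_NAMESPACES` are hypothetical names, and whitespace trimming and de-duplication are assumptions.

```python
from typing import Iterable, List

# Assumed set of recognised namespaces, taken from the user: / auto: convention.
KNOWN_NAMESPACES = ("user", "auto")


def canonicalize_tags(tags: Iterable[str], namespace: str) -> List[str]:
    """Force every tag into `<namespace>:<tag>` form.

    A known prefix supplied by the caller (for example `auto:`) is stripped
    and replaced, so the stored namespace always reflects the tag's source.
    """
    if namespace not in KNOWN_NAMESPACES:
        raise ValueError(f"unknown tag namespace: {namespace!r}")
    seen = set()
    result: List[str] = []
    for raw in tags:
        tag = raw.strip()
        head, sep, rest = tag.partition(":")
        if sep and head in KNOWN_NAMESPACES:
            tag = rest.strip()  # drop the caller-supplied namespace
        if not tag:
            continue  # skip empty tags such as "" or "user:"
        canonical = f"{namespace}:{tag}"
        if canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result


user_tags = canonicalize_tags(["auto:pytorch", " machine-learning ", "user:pytorch"], "user")
auto_tags = canonicalize_tags(["pytorch"], "auto")
```

Here a caller passing `auto:pytorch` as a user tag ends up with `user:pytorch`, which matches the behaviour described in the first bullet.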

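Pushed: 67f3069 (fix(tags): enforce canonical tag namespaces)

Local validation:

  • ruff format --check passes
  • ruff check passes
  • python3 -m py_compile passes

A fuller pytest run is still blocked in my local environment by the existing problem of missing AGFS build artifacts, so for now I ran validation focused on the changed files.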

@MaojiaSheng MaojiaSheng merged commit e72b614 into volcengine:main Apr 3, 2026
11 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Apr 3, 2026
@zhoujh01
Collaborator

zhoujh01 commented Apr 3, 2026

2026-04-03 14:28:03,771 - openviking - ERROR - Data validation failed: 1 validation error for DynamicData_my_test_context_v3 tags Input should be a valid string [type=string_type, input_value=[], input_type=list]

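There is a problem with this code: the type written for tags does not match the schema, which causes writes to fail. Because we are releasing today, I have reverted it for now in https://github.com/volcengine/OpenViking/pull/1200. @13ernkastel, please rework the fix against the latest code.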
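
The log line shows a list (`[]`) being written to a `tags` field that is validated as a string. One possible way to reconcile the two, sketched under the assumption that the field stays string-typed, is to serialize the tag list into a single delimited string before writing. `serialize_tags`, `parse_tags`, and the `;` delimiter (borrowed from the example earlier in this thread) are hypothetical and not necessarily how the follow-up fix works.

```python
from typing import Iterable, List

# Assumed delimiter, borrowed from the "user:a;user:b;auto:c" example above.
TAG_DELIMITER = ";"


def serialize_tags(tags: Iterable[str]) -> str:
    """Join tags into one string so a string-typed field accepts the value."""
    cleaned = [t.strip() for t in tags if t and t.strip()]
    for tag in cleaned:
        if TAG_DELIMITER in tag:
            raise ValueError(f"tag must not contain {TAG_DELIMITER!r}: {tag!r}")
    return TAG_DELIMITER.join(cleaned)


def parse_tags(value: str) -> List[str]:
    """Split a stored tag string back into a list, ignoring empty pieces."""
    return [t for t in value.split(TAG_DELIMITER) if t]


stored = serialize_tags(["user:machine-learning", "auto:pytorch"])
empty = serialize_tags([])  # an empty list becomes "", not []
```

The alternative is to change the field's declared type to a list of strings; which of the two fits better depends on the collection schema and on how the vector index filters on tags.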

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[Feature]: Use Tags metadata to enhance cross-subtree retrieval

3 participants