feat(embed): add embedding job drain #33

Open
GaosCode wants to merge 2 commits into steipete:main from GaosCode:feat/embedding-jobs
Summary

This PR is logically stacked on #32, but it targets main because this is a fork-based contribution.

It adds the storage and execution layer for queued message embeddings. sync --with-embeddings queues embedding work without calling the provider, and the new discrawl embed command explicitly drains pending jobs in bounded batches.
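To illustrate the bounded drain described above, here is a minimal sketch in Go. It is not the code from this PR: `Job`, `EmbedFunc`, and `Drain` are hypothetical names, the real jobs live in the database rather than a slice, and treating a negative limit as "no cap" is an assumption made for the example.

```go
package main

import "fmt"

// Job is a hypothetical pending embedding job (not the PR's actual schema).
type Job struct {
	MessageID string
	Text      string
}

// EmbedFunc stands in for the configured provider call (assumed shape).
type EmbedFunc func(texts []string) ([][]float32, error)

// Drain processes at most limit jobs, sending at most batchSize inputs per
// provider call. It returns the vectors keyed by message ID and the number of
// jobs completed. A negative limit means "no cap" in this sketch.
func Drain(pending []Job, limit, batchSize int, embed EmbedFunc) (map[string][]float32, int, error) {
	if batchSize <= 0 {
		return nil, 0, fmt.Errorf("batch size must be positive, got %d", batchSize)
	}
	if limit < 0 || limit > len(pending) {
		limit = len(pending)
	}
	vectors := make(map[string][]float32, limit)
	done := 0
	for start := 0; start < limit; start += batchSize {
		end := start + batchSize
		if end > limit {
			end = limit
		}
		batch := pending[start:end]
		texts := make([]string, len(batch))
		for i, j := range batch {
			texts[i] = j.Text
		}
		vecs, err := embed(texts)
		if err != nil {
			// Keep what was already embedded and report the failure.
			return vectors, done, err
		}
		if len(vecs) != len(batch) {
			return vectors, done, fmt.Errorf("provider returned %d vectors for %d inputs", len(vecs), len(batch))
		}
		for i, j := range batch {
			vectors[j.MessageID] = vecs[i]
		}
		done += len(batch)
	}
	return vectors, done, nil
}

func main() {
	jobs := []Job{{"1", "hello"}, {"2", "world"}, {"3", "again"}}
	// Fake provider: one-dimensional vector holding the input length.
	fake := func(texts []string) ([][]float32, error) {
		out := make([][]float32, len(texts))
		for i, s := range texts {
			out[i] = []float32{float32(len(s))}
		}
		return out, nil
	}
	_, done, err := Drain(jobs, 2, 1, fake)
	fmt.Println(done, err)
}
```

Separating the overall cap from the per-call batch size keeps a single drain run predictable in both total work and request size.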

Until #32 lands, this PR will also show the provider abstraction diff from that PR. Please review #32 first. After #32 is merged, this PR should collapse to the embedding job drain changes only. If the upstream merge strategy prevents that, I will rebase this branch.

Changes

  • Add schema version 2 migration for embedding job metadata and stored message vectors.
  • Add local message_embeddings storage keyed by message, provider, model, and input version.
  • Extend embedding_jobs with provider/model/input version, error tracking, and lock metadata.
  • Add discrawl embed with --limit, --batch-size, and --rebuild.
  • Drain pending embedding jobs through the configured provider with capped input text.
  • Keep sync --with-embeddings non-blocking by only queueing work.
  • Avoid requeueing completed embedding jobs when unchanged messages are synced again.
  • Treat provider HTTP 429 responses as rate limits and stop the current drain run cleanly.
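As a rough illustration of the rate-limit behavior in the last item, the sketch below maps HTTP 429 to a sentinel error and ends the drain run without treating it as a failure. All names (`ErrRateLimited`, `classifyStatus`, `drainBatches`, `DrainResult`) are hypothetical, and continuing past non-429 failures is an assumption of this example, not something the PR description states.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// ErrRateLimited is a hypothetical sentinel for provider rate limits.
var ErrRateLimited = errors.New("embedding provider rate limited")

// classifyStatus maps a provider HTTP status code to an error value.
func classifyStatus(code int) error {
	switch {
	case code == http.StatusTooManyRequests:
		return ErrRateLimited
	case code >= 200 && code < 300:
		return nil
	default:
		return fmt.Errorf("provider returned HTTP %d", code)
	}
}

// DrainResult summarizes one drain run (assumed shape).
type DrainResult struct {
	Embedded    int
	Failed      int
	RateLimited bool
}

// drainBatches calls send once per batch. A rate limit stops the run cleanly;
// other failures are counted and the run continues (an assumption here).
func drainBatches(batches [][]string, send func(batch []string) int) DrainResult {
	var res DrainResult
	for _, b := range batches {
		err := classifyStatus(send(b))
		switch {
		case errors.Is(err, ErrRateLimited):
			res.RateLimited = true
			return res // remaining batches stay pending for the next run
		case err != nil:
			res.Failed += len(b)
		default:
			res.Embedded += len(b)
		}
	}
	return res
}

func main() {
	statuses := []int{http.StatusOK, http.StatusTooManyRequests}
	i := 0
	send := func(batch []string) int {
		code := statuses[i]
		i++
		return code
	}
	res := drainBatches([][]string{{"a"}, {"b"}, {"c"}}, send)
	fmt.Printf("%+v\n", res)
}
```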

Testing

  • go test ./...

Added coverage for:

  • migrating unversioned v1 databases to schema v2
  • preserving completed embedding jobs across unchanged syncs
  • requeueing embedding work when normalized message content changes
  • draining pending jobs into stored vectors
  • handling empty content, provider failures, rebuilds, and rate limits
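The requeue cases covered above can be pictured with a small sketch. It assumes, purely for illustration, that a job records a hash of the normalized message content and that normalization collapses whitespace; the PR description does not say how the change is actually detected, and `JobState`, `normalize`, `inputHash`, and `shouldQueue` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// JobState is a hypothetical record of an existing embedding job.
type JobState struct {
	InputHash string // hash of the normalized content the job was created for
}

// normalize collapses runs of whitespace and trims the ends (assumed rule).
func normalize(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// inputHash returns a hex SHA-256 of the normalized content.
func inputHash(content string) string {
	sum := sha256.Sum256([]byte(normalize(content)))
	return hex.EncodeToString(sum[:])
}

// shouldQueue reports whether a synced message needs a new embedding job:
// yes if no job exists, or if the normalized content has changed.
func shouldQueue(existing *JobState, content string) bool {
	if existing == nil {
		return true
	}
	return existing.InputHash != inputHash(content)
}

func main() {
	job := &JobState{InputHash: inputHash("hello  world")}
	fmt.Println(shouldQueue(job, "hello world\n")) // same after normalization
	fmt.Println(shouldQueue(job, "hello there"))   // content changed
}
```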

Risks / Notes

  • This PR does not add semantic or hybrid search ranking yet. It only adds the local vector storage and explicit embedding queue drain needed by the follow-up search PR.
  • Documentation updates are intentionally left out of this PR and can be submitted separately.
  • Because this PR targets main while depending on #<PR2_NUMBER>, the initial diff may be noisy until #32 (feat(embed): add local embedding providers) is merged.
