feat(embed): add embedding job drain by GaosCode · Pull Request #33 · steipete/discrawl

GaosCode · 2026-04-12T10:26:12Z

Summary

This PR is logically stacked on #32, but it targets main because this is a fork-based contribution.

It adds the storage and execution layer for queued message embeddings. sync --with-embeddings queues embedding work without calling the provider, and the new discrawl embed command explicitly drains pending jobs in bounded batches.
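To illustrate what draining "in bounded batches" could look like, here is a minimal sketch and not the code in this PR. The function name `planBatches` and its parameters are hypothetical; `limit` and `batchSize` only stand in for the `--limit` and `--batch-size` flags, and treating a non-positive limit as "no cap" is an assumption made for the sketch.

```go
package main

import "fmt"

// planBatches is a hypothetical helper: it takes at most limit pending job
// IDs and splits them into batches of at most batchSize. A non-positive
// limit means "no cap" in this sketch; a non-positive batchSize falls back to 1.
func planBatches(pending []int64, limit, batchSize int) [][]int64 {
	if batchSize <= 0 {
		batchSize = 1
	}
	if limit > 0 && limit < len(pending) {
		pending = pending[:limit]
	}
	var batches [][]int64
	for start := 0; start < len(pending); start += batchSize {
		end := start + batchSize
		if end > len(pending) {
			end = len(pending)
		}
		batches = append(batches, pending[start:end])
	}
	return batches
}

func main() {
	// Seven queued jobs, drain at most five of them, two per provider call.
	jobs := []int64{1, 2, 3, 4, 5, 6, 7}
	for i, batch := range planBatches(jobs, 5, 2) {
		fmt.Printf("batch %d: %v\n", i, batch)
	}
}
```

Planning the batches up front keeps each provider call small and makes the total work of one run predictable, which is the property the explicit drain command is after.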

Until #32 lands, this PR will also show the provider abstraction diff from that PR. Please review #32 first. After #32 is merged, this PR should collapse to the embedding job drain changes only. If the upstream merge strategy prevents that, I will rebase this branch.

Changes

Add schema version 2 migration for embedding job metadata and stored message vectors.
Add local message_embeddings storage keyed by message, provider, model, and input version.
Extend embedding_jobs with provider/model/input version, error tracking, and lock metadata.
Add discrawl embed with --limit, --batch-size, and --rebuild.
Drain pending embedding jobs through the configured provider with capped input text.
Keep sync --with-embeddings non-blocking by only queueing work.
Avoid requeueing completed embedding jobs when unchanged messages are synced again.
Treat provider HTTP 429 responses as rate limits and stop the current drain run cleanly.
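The rate-limit and input-cap items above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation in this PR: `errRateLimited`, `classifyStatus`, `capInput`, and `drainBatches` are hypothetical names, the provider is reduced to a callback that returns an HTTP status code, and capping by rune count is an assumption (the PR may cap by bytes or tokens).

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// errRateLimited is a hypothetical sentinel for provider HTTP 429 responses.
var errRateLimited = errors.New("embedding provider rate limited")

// classifyStatus maps an HTTP status code to nil, errRateLimited, or a
// generic provider error.
func classifyStatus(status int) error {
	switch {
	case status == http.StatusTooManyRequests:
		return errRateLimited
	case status >= 200 && status < 300:
		return nil
	default:
		return fmt.Errorf("provider returned HTTP %d", status)
	}
}

// capInput truncates text to at most maxRunes runes so a multi-byte
// character is never split.
func capInput(text string, maxRunes int) string {
	if maxRunes < 0 {
		maxRunes = 0
	}
	runes := []rune(text)
	if len(runes) <= maxRunes {
		return text
	}
	return string(runes[:maxRunes])
}

// drainBatches sends batches in order. On a rate limit it stops without an
// error and reports how many inputs were processed; the remaining batches
// stay pending for a later run. Other failures are returned to the caller.
func drainBatches(batches [][]string, send func([]string) int) (int, bool, error) {
	processed := 0
	for _, batch := range batches {
		if err := classifyStatus(send(batch)); err != nil {
			if errors.Is(err, errRateLimited) {
				return processed, true, nil
			}
			return processed, false, err
		}
		processed += len(batch)
	}
	return processed, false, nil
}

func main() {
	calls := 0
	// Fake provider: the second call is rate limited.
	send := func([]string) int {
		calls++
		if calls == 2 {
			return http.StatusTooManyRequests
		}
		return http.StatusOK
	}
	batches := [][]string{{capInput("hello world", 5), "b"}, {"c"}, {"d"}}
	n, limited, err := drainBatches(batches, send)
	fmt.Println(n, limited, err)
}
```

Using a sentinel error lets the drain loop tell "stop and try again later" apart from a genuine provider failure with `errors.Is`, so a 429 ends the run without being treated as a failed job.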

Testing

go test ./...

Added coverage for:

migrating unversioned v1 databases to schema v2
preserving completed embedding jobs across unchanged syncs
requeueing embedding work when normalized message content changes
draining pending jobs into stored vectors
handling empty content, provider failures, rebuilds, and rate limits
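The requeue behaviour exercised by these tests can be pictured with the sketch below. It is only an illustration: the `embeddingKey` type, the whitespace-collapsing `normalize`, and the SHA-256 content hash are all assumptions, and the PR may detect changed content differently. Only the four key components (message, provider, model, input version) come from the description.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// embeddingKey mirrors the storage key named in the description. The field
// types are hypothetical.
type embeddingKey struct {
	MessageID    string
	Provider     string
	Model        string
	InputVersion string
}

// normalize is a stand-in normalization: it trims and collapses whitespace.
func normalize(content string) string {
	return strings.Join(strings.Fields(content), " ")
}

// contentHash fingerprints the normalized content.
func contentHash(content string) string {
	sum := sha256.Sum256([]byte(normalize(content)))
	return hex.EncodeToString(sum[:])
}

// needsRequeue reports whether a message must be embedded again: either no
// completed job exists for the key, or the normalized content has changed
// since that job completed.
func needsRequeue(done map[embeddingKey]string, key embeddingKey, content string) bool {
	prev, ok := done[key]
	return !ok || prev != contentHash(content)
}

func main() {
	key := embeddingKey{MessageID: "m1", Provider: "local", Model: "example-model", InputVersion: "1"}
	done := map[embeddingKey]string{key: contentHash("hello world")}

	fmt.Println(needsRequeue(done, key, "hello   world ")) // same content after normalization
	fmt.Println(needsRequeue(done, key, "hello there"))    // content changed
}
```

Comparing a fingerprint of the normalized content, rather than the raw text, is one way to keep completed jobs intact when a re-sync returns the same message with only cosmetic differences.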

Risks / Notes

This PR does not add semantic or hybrid search ranking yet. It only adds the local vector storage and explicit embedding queue drain needed by the follow-up search PR.
Documentation updates are intentionally left out of this PR and can be submitted separately.
Because this PR targets main while depending on #<PR2_NUMBER>, the initial diff may be noisy until #32 (feat(embed): add local embedding providers) is merged.

GaosCode added 2 commits April 12, 2026 16:23

feat(embed): add local embedding providers

b21dd05

feat(embed): add embedding job drain

81b704a

GaosCode wants to merge 2 commits into steipete:main from GaosCode:feat/embedding-jobs
