Skip to content

Improve Knowledge Base re-embeddings#5200

Merged
IvanIvanoff merged 1 commit into
masterfrom
improve-knowledge-embeddings
Jun 4, 2026
Merged

Improve Knowledge Base re-embeddings#5200
IvanIvanoff merged 1 commit into
masterfrom
improve-knowledge-embeddings

Conversation

@IvanIvanoff
Copy link
Copy Markdown
Member

@IvanIvanoff IvanIvanoff commented Jun 4, 2026

Changes

Ticket

Checklist:

  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have tried to find clearer solution before commenting hard-to-understand parts of code
  • I have added tests that prove my fix is effective or that my feature works

Summary by CodeRabbit

  • New Features

    • Added background capability to remove stale post embeddings automatically.
  • Improvements

    • Improved embedding processing to be more reliable and consistent by making batch updates atomic and isolating batch failures so reindexes continue.
  • Tests

    • Added tests covering embedding cleanup and batch processing behavior.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Jun 4, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

Adds prune_all_stale/0 to remove embeddings for posts no longer published; embed_all_posts/0 gains per-batch error handling; embed_posts_batch/1 now replaces delete+insert with an atomic Repo.transaction. A new DB-backed test asserts prune_all_stale/0 behavior.

Changes

Post Embedding Atomicity and Pruning

Layer / File(s) Summary
Per-batch error handling in embed_all_posts
lib/sanbase/insight/post_embedding.ex
embed_all_posts/0 wraps each embed_posts_batch/1 call in try/rescue, logging error tuples and crashes per batch without aborting the loop.
Prune stale embeddings function
lib/sanbase/insight/post_embedding.ex
Adds public prune_all_stale/0 that computes published post IDs, deletes embedding rows whose post_id is not in that set via a single query, logs the number deleted, and returns the count.
Atomic batch replacement (delete+insert)
lib/sanbase/insight/post_embedding.ex
Removes the standalone Repo.delete_all and performs deletion of existing embeddings for the batch together with insert_all inside a single Sanbase.Repo.transaction/1, with comments noting embedding generation happens before the transaction.
DB-backed test for prune_all_stale
test/sanbase/insight/post_embedding_prune_test.exs
New async Sanbase.DataCase test module asserting PostEmbedding.prune_all_stale/0 returns 0 when no stale embeddings exist.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 In burrows deep the embeddings keep,

I sweep the stale and guard each heap,
One transaction binds the old and new,
Per-batch I catch what slips from view,
A hopping patch, both tidy and true.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Improve Knowledge Base re-embeddings' directly relates to the main changes: the pull request enhances the post embedding workflow with error resilience, atomic batch operations, and stale embedding cleanup.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch improve-knowledge-embeddings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
test/sanbase/insight/post_embedding_prune_test.exs (1)

1-11: ⚡ Quick win

Consider adding a test case that verifies stale embeddings are actually pruned.

The current test is a useful smoke test confirming the query executes without error, but it doesn't verify the core pruning logic. A more complete test would:

  1. Insert a published post with embeddings (should be kept)
  2. Insert an unpublished/draft post with embeddings (should be pruned)
  3. Call prune_all_stale/0
  4. Assert the return count matches expected deletions
  5. Assert embeddings for the published post still exist

This would catch regressions in the NOT IN (subquery) logic.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/sanbase/insight/post_embedding_prune_test.exs` around lines 1 - 11, Add
a focused test to verify prune_all_stale/0 actually deletes embeddings for
stale/unpublished posts while preserving published ones: in
Sanbase.Insight.PostEmbeddingPruneTest insert one published post with embeddings
and one unpublished/draft post with embeddings, call
PostEmbedding.prune_all_stale(), assert the returned deleted count equals the
number of embeddings for the draft post, and finally assert embeddings for the
published post still exist (and embeddings for the draft post are gone) so the
NOT IN (subquery) pruning logic is validated.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@lib/sanbase/insight/post_embedding.ex`:
- Around line 152-159: The transaction result from Sanbase.Repo.transaction
(wrapping Sanbase.Repo.delete_all(from(pe in __MODULE__, where: pe.post_id in
^post_ids)) and Sanbase.Repo.insert_all(__MODULE__, data)) is currently
discarded, so failures are silent; change the call to return the transaction
result and propagate it up out of the function that performs the batch write,
then update embed_all_posts/0 (which currently uses Enum.each and ignores
returns) to handle the {:ok, _} and {:error, reason} tuples (e.g., log the error
and decide whether to continue or halt) so batch insert errors are visible and
handled.

---

Nitpick comments:
In `@test/sanbase/insight/post_embedding_prune_test.exs`:
- Around line 1-11: Add a focused test to verify prune_all_stale/0 actually
deletes embeddings for stale/unpublished posts while preserving published ones:
in Sanbase.Insight.PostEmbeddingPruneTest insert one published post with
embeddings and one unpublished/draft post with embeddings, call
PostEmbedding.prune_all_stale(), assert the returned deleted count equals the
number of embeddings for the draft post, and finally assert embeddings for the
published post still exist (and embeddings for the draft post are gone) so the
NOT IN (subquery) pruning logic is validated.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 19f67815-d5cc-4bce-a149-8ba375f192ab

📥 Commits

Reviewing files that changed from the base of the PR and between b154bde and 730996d.

📒 Files selected for processing (2)
  • lib/sanbase/insight/post_embedding.ex
  • test/sanbase/insight/post_embedding_prune_test.exs

Comment thread lib/sanbase/insight/post_embedding.ex
@IvanIvanoff IvanIvanoff force-pushed the improve-knowledge-embeddings branch from 730996d to 36f6322 Compare June 4, 2026 10:39
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@lib/sanbase/insight/post_embedding.ex`:
- Around line 168-171: Remove the inline block comment above the atomic replace
logic and instead document the behavior on the embed_posts_batch/1 function: add
a `@doc` string to embed_posts_batch/1 that explains embeddings are generated
before the transaction and that the delete+insert are performed atomically, or
extract the delete+insert logic into a descriptively named helper (e.g.,
replace_embeddings_atomically/2) and add a `@doc` to that helper so the behavior
is self-documenting rather than using an inline comment.
- Around line 56-58: Remove the inline comment inside embed_all_posts/0 and
instead add a short `@doc` for embed_all_posts/0 that documents the per-batch
error handling behavior (that failing batches are logged and skipped and DB
errors inside the transaction are rescued in addition to {:error, _} rollbacks).
Alternatively, make the code self-documenting by renaming local variables or
extracting a helper (e.g., handle_batch_result/1 or log_and_skip_batch/1) so the
behavior is clear without an inline comment; ensure the new `@doc` or helper name
mentions logging and skipping failing batches.
- Around line 94-104: Add an `@spec` for prune_all_stale/0 (e.g. `@spec`
prune_all_stale() :: {:ok, non_neg_integer()} | {:error, term()}) and change the
implementation to return a tagged tuple: wrap the DB call in a safe block (use
try/rescue or with/case around Sanbase.Repo.delete_all to catch exceptions) so
successful deletes return {:ok, deleted} and failures return {:error, reason};
update Logger calls to log on success and errors accordingly and ensure callers
that relied on an integer are updated to handle the {:ok, count} / {:error,
reason} tuple.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4207c7dd-22bd-429b-9a3b-4676a0119a9e

📥 Commits

Reviewing files that changed from the base of the PR and between 730996d and 36f6322.

📒 Files selected for processing (2)
  • lib/sanbase/insight/post_embedding.ex
  • test/sanbase/insight/post_embedding_prune_test.exs
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/sanbase/insight/post_embedding_prune_test.exs

Comment thread lib/sanbase/insight/post_embedding.ex
Comment thread lib/sanbase/insight/post_embedding.ex
Comment thread lib/sanbase/insight/post_embedding.ex
@IvanIvanoff IvanIvanoff merged commit c4d6f7e into master Jun 4, 2026
3 checks passed
@IvanIvanoff IvanIvanoff deleted the improve-knowledge-embeddings branch June 4, 2026 10:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant