Allow incremental together with clustered #33

Open. Dtenwolde wants to merge 6 commits into duckdb:main from Dtenwolde:incremental-clustered

Conversation

@Dtenwolde

Member

Follow-up to #32, where we temporarily disallowed using incremental=true and cluster_terms=true together.

This PR allows both options to be used together.

Compared to #32, clustered incremental maintenance has some extra cost because we preserve ordered access to terms across inserts and deletes, but the overhead stays modest.

Benchmark summary:

  • Small datasets: essentially parity, generally within noise / ~10%.
  • 100k dataset:
    • Inserts: roughly equal to non-clustered incremental.
    • Mixed workloads: roughly equal to non-clustered incremental.
    • Deletes: largest overhead.
      • Batch delete of 100 documents: ~1.6x slower.
      • Batch delete of 500 documents: ~1.3x slower.
      • Other delete cases: generally within ~1.1x slowdown.

Overall, clustered + incremental index maintenance performs similarly to non-clustered incremental for inserts and mixed workloads, with some delete overhead. All measured scenarios stayed below a 2x slowdown.

@Dtenwolde Dtenwolde mentioned this pull request Jul 2, 2026
@Dtenwolde

Member Author

I suppose running CI in reduced mode is fine?

Comment thread src/fts_indexing.cpp
"%%fts_schema%%.terms;\n"
"DROP TABLE %%fts_schema%%.terms;\n"
"CREATE VIEW %%fts_schema%%.terms AS SELECT termid, docid, fieldid FROM "
"%%fts_schema%%.%s ORDER BY termid, fieldid, docid;",

Member


Won't this ORDER BY make performance worse? The incrementally appended data (after the initial index creation) is not physically clustered, which I am OK with, but I don't think forcing an ORDER BY will improve performance.
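
To make the concern concrete, here is a toy C++ model of the situation, not the extension's actual code. It assumes a hypothetical `TermRow` struct and plain vectors standing in for the terms table. The initial load is stored in (termid, fieldid, docid) order, incrementally appended rows land at the end, and a reader that wants ordered output has to sort the full row set on each scan, which is roughly what an ORDER BY inside the view would ask for.

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical row layout for illustration only; the real terms table
// is created and managed by the FTS extension.
struct TermRow {
    int64_t termid;
    int64_t docid;
    int64_t fieldid;
};

// Sort key matching the ORDER BY in the quoted view: termid, fieldid, docid.
static bool ByTermFieldDoc(const TermRow &a, const TermRow &b) {
    return std::tie(a.termid, a.fieldid, a.docid) <
           std::tie(b.termid, b.fieldid, b.docid);
}

// True if the rows are already stored in (termid, fieldid, docid) order.
bool IsClustered(const std::vector<TermRow> &rows) {
    return std::is_sorted(rows.begin(), rows.end(), ByTermFieldDoc);
}

// Models an ordered read over unordered storage: copy all rows, then sort them.
std::vector<TermRow> ScanOrdered(const std::vector<TermRow> &rows) {
    std::vector<TermRow> out = rows;
    std::sort(out.begin(), out.end(), ByTermFieldDoc);
    return out;
}

// Builds a small example: a clustered initial load plus rows appended later.
std::vector<TermRow> BuildExample() {
    // Initial index creation: rows written in clustered order.
    std::vector<TermRow> rows = {{1, 10, 0}, {1, 11, 0}, {2, 10, 0}, {3, 12, 0}};
    // Incremental insert of a new document (docid 13) containing terms 1 and 2.
    rows.push_back({1, 13, 0});
    rows.push_back({2, 13, 0});
    return rows;
}
```

In this model, `IsClustered(BuildExample())` is false once the appended rows arrive, and `ScanOrdered` restores the order at the price of touching every row. Whether the real query plan pays a comparable cost depends on how the engine executes the view, which this sketch does not model.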
