psql_bm25s (Technical Report) is an
independent PostgreSQL extension for BM25-family lexical retrieval. It
is inspired by the public bm25s
work and implemented as a PostgreSQL-native access method.
The project keeps the BM25 contract explicit where it matters most:
- BM25-family semantics
- corpus-statistics-driven ranking
- query-first exact top-k retrieval
The mainline extension adds a PostgreSQL-specific storage and
maintenance design for database workloads with frequent INSERT,
UPDATE, and DELETE. No source code from the Python reference
implementation bm25s is vendored or copied into this repository.
Current mainline capabilities:
- native
CREATE INDEX USING psql_bm25s - indexed column types:
textvarchartext[]varchar[]int4[]
- multicolumn fusion indexes over
text[],varchar[],text, orvarcharcolumns - opt-in field-aware multicolumn indexes with query-time field weights
- canonical exact BM25 retrieval APIs:
psql_bm25s_query_tokens(...)psql_bm25s_query_ids(...)
- SQL-facing convenience surfaces:
psql_bm25s_query(...)psql_bm25s_query_prepared(...)tokens @@ 'query text'ORDER BY tokens <=> ... ASC LIMIT kORDER BY token_ids <=> ... ASC LIMIT k
- text preprocessing helpers:
- tokenization
- normalization
- optional stopwords
- optional English stemming
- optional Latin-diacritic folding
- token-stream and raw-text highlight and snippet helpers
- automatic mutable-workload maintenance:
- exact
INSERT/UPDATE - exact delete cleanup through PostgreSQL heap cleanup and index maintenance
- bounded deferred overlays for lower write cost
- exact
- maintenance introspection and recommendation helpers
- PostgreSQL-native durability and physical replication compatibility
- weighted multi-index query fusion helpers for field-aware retrieval
- C-backed hybrid BM25/vector late-fusion helpers without a hard vector extension dependency
The Python reference implementation bm25s project is the main public
reference for the eager sparse scoring formulation used as technical
background:
Related scoring concepts include:
robertsonluceneatirebm25lbm25+- eager sparse scoring
- CSC-style postings layout
- top-k retrieval over precomputed sparse scores
The extension does not try to reproduce the bm25s Python package interface.
It implements its own C and PostgreSQL storage, access-method, and
maintenance layers around BM25-family retrieval semantics.
bm25s is excellent for fast BM25 retrieval over static or externally
managed corpora. PostgreSQL adds different requirements:
- index persistence inside the database
- transactional writes
- crash recovery
- physical replication
- planner-visible SQL access
- operationally manageable index maintenance under row churn
psql_bm25s keeps BM25-family semantics explicit while adding the
storage and maintenance layers needed for the index to behave like a
database index rather than a one-shot offline artifact.
Current high-level read:
- the canonical rowset query path is the main performance path
- the current public benchmark reference is the PG18
15 x 5BEIR matrix - published BEIR reference targets are included in the project benchmark suite
- historical same-machine studies remain useful background, but they are no longer the default benchmark reference
- the mutable-maintenance design dramatically improves write-side maintenance cost relative to eager full rebuilds, while keeping exact reads and PostgreSQL durability/replication guarantees
Current performance matrix, with dataset size and QPS:
These numbers are benchmark context, not universal claims about any project. They depend on the measured versions, configuration, hardware, workload, and query settings. Third-party project names identify the measured engines for reproducibility.
| Dataset | Docs | Queries | Python reference implementation bm25s QPS |
psql_bm25s ids QPS |
psql_bm25s text[] QPS |
pg_search QPS |
vchord_bm25 QPS |
|---|---|---|---|---|---|---|---|
arguana |
8,674 | 1,406 | 1158.34 | 1402.63 | 1112.01 | 115.94 | 78.77 |
climate-fever |
5,416,593 | 1,535 | 3.04 | 57.78 | 50.75 | 2.84 | 5.25 |
cqadupstack |
457,199 | 13,145 | 111.56 | 443.13 | 438.42 | 13.99 | 60.51 |
dbpedia-entity |
4,635,922 | 467 | 3.47 | 128.19 | 91.19 | 5.21 | 23.66 |
fever |
5,416,568 | 123,142 | 3.15 | 97.56 | 80.15 | 5.62 | 12.13 |
fiqa |
57,638 | 6,648 | 810.51 | 1409.52 | 1186.41 | 17.76 | 190.57 |
hotpotqa |
5,233,329 | 97,852 | 4.16 | 55.40 | 49.86 | 3.42 | 9.31 |
msmarco |
8,841,823 | 509,962 | 1.61 | 96.67 | 82.13 | 4.44 | 18.20 |
nfcorpus |
3,633 | 3,237 | 3155.35 | 3373.94 | 3326.96 | 1132.17 | 1252.75 |
nq |
2,681,468 | 3,452 | 10.55 | 174.34 | 176.69 | 6.28 | 21.96 |
quora |
522,931 | 15,000 | 90.56 | 637.98 | 619.64 | 13.26 | 154.36 |
scidocs |
25,657 | 1,000 | 1203.09 | 1835.85 | 1614.92 | 17.89 | 367.04 |
scifact |
5,183 | 1,109 | 2964.86 | 2557.47 | 2240.18 | 500.04 | 629.42 |
trec-covid |
171,332 | 50 | 210.50 | 191.94 | 154.48 | 8.75 | 75.66 |
webis-touche2020 |
382,545 | 49 | 240.36 | 82.97 | 74.10 | 8.14 | 86.04 |
Trend view by dataset scale:
Current matrix readout:
- Median QPS ratios versus Python reference implementation
bm25sare3.97xforpsql_bm25s ids,3.93xforpsql_bm25s text[],0.54xforvchord_bm25, and0.17xforpg_search. - Dataset counts at or above Python reference implementation
bm25sare12/15forpsql_bm25s ids,11/15forpsql_bm25s text[],7/15forvchord_bm25, and3/15forpg_search. - On the largest workload,
msmarco, the measured QPS was96.67forpsql_bm25s ids,82.13forpsql_bm25s text[],18.20forvchord_bm25,4.44forpg_search, and1.61for the Python reference implementationbm25s. - Mutable-maintenance results are still important, but they are a separate dimension from this cross-engine read-performance matrix.
More Performance and Benchmark detail
You can get started in three common ways:
Download the package that matches your PostgreSQL major, OS, and architecture from:
Each release .zip is built with the PostgreSQL extension files staged
under their final install paths. A simple install flow is:
curl -L -o psql_bm25s.zip \
https://github.com/Intelligent-Internet/psql_bm25s/releases/latest/download/<package>.zip
unzip psql_bm25s.zip
sudo rsync -a <package>/ /Replace <package> with the actual archive directory name from the
release, for example psql_bm25s-vX.Y.Z-linux-x86_64-pg18.
After copying the files, restart PostgreSQL if needed, then enable the extension in the target database:
CREATE EXTENSION psql_bm25s;If you want a non-public extension schema, choose it at creation time with
CREATE EXTENSION psql_bm25s WITH SCHEMA ext;. The extension is not
relocatable after creation because SQL helper functions capture the extension
schema for safe wrapper resolution.
Prebuilt PostgreSQL 18 images are published to:
Pull either the floating PG18 tag or a versioned release tag:
docker pull ghcr.io/intelligent-internet/psql_bm25s:pg18
# or:
docker pull ghcr.io/intelligent-internet/psql_bm25s:pg18-vX.Y.ZStart PostgreSQL with the extension preinstalled:
docker run -d \
--name psql-bm25s-pg18 \
-e POSTGRES_PASSWORD=postgres \
-p 5432:5432 \
ghcr.io/intelligent-internet/psql_bm25s:pg18The image init scripts create psql_bm25s in postgres, template1,
and the optional POSTGRES_DB database on first boot.
For source builds, local validation, and release workflow notes, see Contribution.
After installing the extension, open psql and run the smallest useful
example: insert rows, index a text column with default options, and
run a top-k BM25 query.
CREATE EXTENSION psql_bm25s;
DROP TABLE IF EXISTS docs;
CREATE TABLE docs (
id integer primary key,
title text not null,
body text not null
);
INSERT INTO docs (id, title, body) VALUES
(1, 'red apple', 'fresh red apple fruit'),
(2, 'green apple', 'green apple slices'),
(3, 'orange citrus', 'orange citrus fruit'),
(4, 'cat guide', 'small cat animal care');
CREATE INDEX docs_bm25_idx
ON docs USING psql_bm25s (body);
SELECT d.id, d.title, h.score
FROM psql_bm25s_query(
'docs_bm25_idx'::regclass,
'apple fruit',
5
) AS h
JOIN docs AS d ON d.ctid = h.ctid
ORDER BY h.score DESC, d.id;With no WITH (...) options, the index uses Lucene-style BM25 and IDF
defaults with the realtime consistency policy.
The extension also supports varchar, pretokenized text[] and
varchar[], and integer token id arrays for applications that own
tokenization. Eventual and manual consistency are available for write-heavy
or batch-maintained corpora.
For SQL operators, multi-field search, hybrid vector/BM25 fusion, and
maintenance policy examples, start with:
- Supported Input Types
- Query Semantics
- Multi-Field Search
- Field-Aware Indexes
- Hybrid Vector/BM25 Search
- Index Policy
Use these as the default exact BM25 entry points:
psql_bm25s_query_ids(...)psql_bm25s_query_tokens(...)
These are the clearest semantic surface and the main benchmark path.
The extension exposes three layers of query surface:
- Canonical exact BM25 APIs
psql_bm25s_query_ids(...)psql_bm25s_query_tokens(...)
- SQL convenience retrieval
psql_bm25s_query(...)psql_bm25s_query_prepared(...)
- Planner/operator integration
@@<=>
The canonical exact BM25 contract is defined by rowset retrieval APIs
that return psql_bm25s_result_hit rows. They are the main benchmark path and
the clearest semantic surface.
Important semantic boundaries:
@@is a document-match predicate, not a ranking API<=>aligns with true BM25 ordering only when PostgreSQL executes a realpsql_bm25sindex scan- raw-query retrieval is exact for the supported surface, but grouped boolean and phrase queries may use bounded heap verification and are intentionally slower than simple array-based retrieval
For RAG-style ranking, psql_bm25s can combine BM25 candidates, vector
candidates, and other ranked SQL candidate sources into one weighted top-k
inside PostgreSQL. The fusion path is C-backed and keeps vector support
optional: pgvector, VectorChord, or another vector extension can own vector
retrieval, while psql_bm25s owns normalization, weighting, de-duplication,
and final ordering.
Use Hybrid Vector/BM25 Search for query examples and Hybrid Fusion Engine for implementation boundaries, performance guidance, and validation coverage.
psql_bm25s extends the original static bm25s storage idea with
PostgreSQL-native mutable-index maintenance. The goal is to keep BM25
semantics stable while making INSERT, UPDATE, DELETE, VACUUM,
restart, crash recovery, and physical replication operationally manageable.
The public maintenance switch is consistency:
realtime: default strong-freshness behavior for ordinary mutable tableseventual: query-first behavior for large knowledge bases that can tolerate short-term stale BM25 results while maintenance convergesmanual: explicit-refresh behavior for static corpora, benchmarks, and externally scheduled maintenance
The implementation records maintenance debt on the index, batches repeated writes, supports bounded overlays where appropriate, follows PostgreSQL's normal delete-cleanup lifecycle, and exposes automatic worker plus explicit maintenance helpers for operators.
For details:
- Index Policy explains the three consistency modes, write/query behavior, maintenance helpers, and scheduler guidance.
- Shared Generation Cache explains the
shared immutable cache tiers for large connection-pool deployments,
including the zero-configuration DSM path and the optional
shared_preload_librariesarena. - Index Parameters is the compact reference for
CREATE INDEX ... WITH (...)reloptions. - Architecture and Design summarizes the storage and access-method design behind mutable indexes.
- Online Maintenance Future Plan records the larger future direction for generationed publish semantics.
The current implementation is a native PostgreSQL index relation.
That means:
- index pages live in ordinary PostgreSQL index storage
- writes are WAL-logged
- the index survives restart and crash recovery
- the index participates in physical replication like other PostgreSQL indexes
Logical replication follows normal PostgreSQL behavior:
- table rows replicate
- index relations do not replicate as logical data objects
- indexes should be created or rebuilt on the subscriber side
Check these links for detailed documentation:
- Documentation Root
- Contribution
- Architecture and Design
- Technical Report
- API Reference
- Functions
- Index Parameters
- Supported Input Types
- Hybrid Vector/BM25 Search
- Hybrid Vector/BM25 Search Use-Case Design
- Hybrid Fusion Engine
- Query Semantics
- Multi-Field Search
- Field-Aware Indexes
- Index Policy
- Shared Generation Cache
- Online Maintenance Future Plan
- Performance and Benchmarks
- Testing and Validation
The remaining roadmap is broader project work rather than another round of local collector tweaks:
See also:
- deeper storage and representation work for lower read-path overhead
- stronger heap and
text[]fetch redesign, especially for phrase- and verification-heavy paths - more architectural query-state reuse where it can be made exact and stable
- stronger differential fuzzing and randomized correctness checks
- prepared and index-bound query values
- broader filtered ranked scans and clearer top-k diagnostics
- better native APIs for pretokenized corpora
- richer tokenizer and normalization profiles
- longer-run maintenance policy tuning and workload replay
- richer presentation helpers beyond token-stream output
