
[MongoDB] Direct BSON Buffer -> JSON conversion #599

Merged
rkistner merged 29 commits into main from custom-bson-to-json on Apr 13, 2026

Conversation

Contributor

@rkistner commented Apr 8, 2026

Builds on #598.

This provides an alternative implementation of rawToSqliteRow. For nested documents and arrays, it converts directly from the BSON Buffer -> JSON Buffer -> string, without an intermediate bson.deserialize or JSON(Big).stringify step. This can significantly reduce allocations and improve throughput in cases with large nested documents or arrays.
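As a rough illustration of the idea (this is not the PR's actual parser), a direct BSON-to-JSON pass walks the buffer element by element - type byte, field name, value - and emits JSON text without materializing intermediate JS objects. The sketch below handles only int32 (0x10) and string (0x02) elements:

```typescript
// Minimal sketch of direct BSON buffer -> JSON conversion. Only int32 and
// string elements are handled; everything else throws. Illustrative only.
function bsonToJson(buf: Buffer): string {
  const parts: string[] = [];
  let pos = 4; // skip the leading int32 document length
  while (buf[pos] !== 0x00) {
    const type = buf[pos++];
    const nameEnd = buf.indexOf(0, pos); // field name is a NUL-terminated cstring
    const name = buf.toString('utf8', pos, nameEnd);
    pos = nameEnd + 1;
    let value: string;
    if (type === 0x10) {
      // int32, little-endian
      value = String(buf.readInt32LE(pos));
      pos += 4;
    } else if (type === 0x02) {
      // string: int32 byte length (including trailing NUL), then UTF-8 bytes
      const len = buf.readInt32LE(pos);
      pos += 4;
      value = JSON.stringify(buf.toString('utf8', pos, pos + len - 1));
      pos += len;
    } else {
      throw new Error(`unsupported BSON type 0x${type.toString(16)}`);
    }
    parts.push(`${JSON.stringify(name)}:${value}`);
  }
  return `{${parts.join(',')}}`;
}

// Hand-built BSON bytes for {"a": 1, "b": "hi"} (22 bytes total)
const doc = Buffer.from([
  0x16, 0x00, 0x00, 0x00,                   // total document length = 22
  0x10, 0x61, 0x00, 0x01, 0x00, 0x00, 0x00, // int32 "a" = 1
  0x02, 0x62, 0x00, 0x03, 0x00, 0x00, 0x00, // string "b", byte length 3
  0x68, 0x69, 0x00,                         // "hi" + NUL
  0x00                                      // document terminator
]);
console.log(bsonToJson(doc)); // → {"a":1,"b":"hi"}
```

The real implementation has to cover all BSON types and the escaping rules of JSON strings, but the single-pass structure is the same.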

Initial micro-benchmarks comparing the new approach to the old one:

Scenario          Full doc    Event       Benchmark                               Ops/s       MiB/s
--------          --------    -----       ---------                               -----       -----
insert 1 KB       1.0 KB      1.3 KB      parseChangeDocument + rawToSqliteRow    148,970     183
                                          bson.deserialize + documentToSqliteRow  106,492     131
insert 10 KB      10 KB       10 KB       parseChangeDocument + rawToSqliteRow    133,342     1,336
                                          bson.deserialize + documentToSqliteRow  52,176      523
insert 100 KB     100 KB      100 KB      parseChangeDocument + rawToSqliteRow    75,847      7,426
                                          bson.deserialize + documentToSqliteRow  8,583       840
update 1 KB       1.0 KB      1.9 KB      parseChangeDocument + rawToSqliteRow    142,705     271
                                          bson.deserialize + documentToSqliteRow  81,868      155
update 10 KB      10 KB       20 KB       parseChangeDocument + rawToSqliteRow    121,012     2,357
                                          bson.deserialize + documentToSqliteRow  31,346      610
update 100 KB     100 KB      200 KB      parseChangeDocument + rawToSqliteRow    71,743      14,008
                                          bson.deserialize + documentToSqliteRow  4,483       875

End-to-end benchmarks compared to main (excludes #598 and #591). This is tested using documents of 100 KB+ for the initial snapshot, and small updates to documents of 2 MB+ for the change stream test. It uses a local bucket storage database on an NVMe disk, which significantly reduces the typical bucket storage overhead, focusing instead on CPU and memory overhead.

Type       Initial snapshot    Change stream    Peak memory usage (change stream)
----       ----------------    -------------    ---------------------------------
main       28.0 MB/s           26.1 MB/s        833 MB
this PR    57.8 MB/s           70.7 MB/s        484 MB

Implementation

The implementation uses a custom BSON parser. For each top-level value converted to JSON, we write the results into a Buffer, then convert that buffer to a string. In my early benchmarks, this was faster than using direct string concatenation.
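The buffer-backed writer idea can be sketched roughly like this (the class and method names are hypothetical, not from the PR): bytes are appended into a growable Buffer, and a single string is produced once at the end, instead of building it up via repeated string concatenation.

```typescript
// Hypothetical sketch of a buffer-backed JSON writer. Illustrative only.
class JsonBufferWriter {
  private buf = Buffer.allocUnsafe(256);
  private pos = 0;

  // Grow the backing buffer (doubling) when needed.
  private ensure(extra: number) {
    if (this.pos + extra > this.buf.length) {
      const next = Buffer.allocUnsafe(Math.max(this.buf.length * 2, this.pos + extra));
      this.buf.copy(next, 0, 0, this.pos);
      this.buf = next;
    }
  }

  // For structural JSON characters and pre-escaped fragments.
  writeAscii(s: string) {
    this.ensure(s.length);
    this.pos += this.buf.write(s, this.pos, 'ascii');
  }

  // For string values: JSON.stringify handles escaping and quoting.
  writeString(s: string) {
    const escaped = JSON.stringify(s);
    this.ensure(Buffer.byteLength(escaped));
    this.pos += this.buf.write(escaped, this.pos, 'utf8');
  }

  toString(): string {
    return this.buf.toString('utf8', 0, this.pos);
  }
}

const w = new JsonBufferWriter();
w.writeAscii('{"name":');
w.writeString('José');
w.writeAscii('}');
console.log(w.toString()); // → {"name":"José"}
```

The win over string concatenation comes from avoiding many short-lived intermediate strings; the UTF-8 decode happens exactly once, at the end.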

I attempted to optimize the common cases as much as possible. The more esoteric types, like regular expressions and DBPointer, are supported, but not specifically optimized for performance.

The implementation was largely built using AI-assisted development (Codex), but with significant manual effort to direct, review, and test it.

Since there are many edge cases, correctness relies on an extensive test suite, including checks that the output matches the old implementation in most cases.

The old implementation is still kept around for:

  1. Testing
  2. Sampling the source types for the schema API

Copying from the JSDoc:
This attempts to match the behavior of bson.deserialize -> constructAfterRecord -> applyRowContext for the most part, with some intentional differences:

  1. Regular expression patterns and options are preserved as-is, while the above normalizes them to JS RegExp values.
  2. Full UTF-8 validation is not performed - we attempt to continue using replacement characters, as long as the resulting output remains valid.
  3. bson.deserialize has a special-case handler for converting documents containing {$ref} to DBRef. We don't do that here.
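Point 2 leans on behavior Node already provides: Buffer#toString('utf8') substitutes invalid byte sequences with U+FFFD, so the decoded string - and therefore the JSON output - stays valid without a separate validation pass. A minimal demonstration:

```typescript
// Node's UTF-8 decoder replaces invalid byte sequences with U+FFFD (�),
// so decoding never produces an invalid string.
const bytes = Buffer.from([0x61, 0xff, 0x62]); // 'a', invalid byte, 'b'
const decoded = bytes.toString('utf8');
console.log(decoded === 'a\uFFFDb'); // → true
console.log(JSON.stringify(decoded)); // → "a�b" (still valid JSON)
```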

General principles followed:

  1. Correctness: Never produce invalid JSON.
  2. Performance: Optimize to be as performant as possible for common cases.
  3. Full BSON support: Support all valid BSON documents as input, including deprecated types, but without specifically optimizing for performance here.
  4. The source database is responsible for producing valid BSON - we don't test for all edge cases of invalid BSON.
  5. We do a best-effort attempt to support "degenerate" BSON cases as documented at https://specifications.readthedocs.io/en/latest/bson-corpus/bson-corpus/, since MongoDB can produce many of these cases.

Future optimizations

With these changes, CPU should be much less of a bottleneck for replicating from MongoDB. If we do need to optimize it further, there are some options:

  1. We can make the JSON serialization lazy - only triggering it on-demand when used by sync queries.
  2. We can take that further and only serialize specific sub-fields that are used in the sync queries, if relevant.
  3. We can use a native extension to do the conversion. In some early tests, a Rust + N-API implementation increased throughput by another 2x. However, that could add significant complexity to the build pipeline, so it may not be worth it.
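Option 1 above (lazy serialization) could be sketched along these lines; the class and field names here are purely illustrative, not part of the PR:

```typescript
// Hypothetical sketch of lazy JSON serialization: defer the work until the
// value is actually read by a sync query, and cache the result.
class LazyJsonValue {
  private cached: string | undefined;
  constructor(private readonly serialize: () => string) {}

  get json(): string {
    if (this.cached === undefined) {
      this.cached = this.serialize(); // runs at most once, on first access
    }
    return this.cached;
  }
}

let calls = 0;
const value = new LazyJsonValue(() => {
  calls++;
  return JSON.stringify({ nested: [1, 2, 3] });
});
console.log(calls); // → 0 (nothing serialized yet)
console.log(value.json); // → {"nested":[1,2,3]}
console.log(value.json); // cached, no second serialization
console.log(calls); // → 1
```

Rows whose JSON is never consulted by a sync query would then pay no serialization cost at all.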

@changeset-bot

changeset-bot bot commented Apr 8, 2026

⚠️ No Changeset found

Latest commit: 57e73c3

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.


@rkistner rkistner mentioned this pull request Apr 8, 2026
@rkistner rkistner changed the title [WIP] [MongoDB] Direct BSON Buffer -> JSON conversion [MongoDB] Direct BSON Buffer -> JSON conversion Apr 9, 2026
@rkistner rkistner marked this pull request as ready for review April 9, 2026 12:26
@rkistner rkistner requested a review from simolus3 April 9, 2026 13:03
Contributor

@simolus3 simolus3 left a comment


This also looks good to me; the only potential issue I see is UTF-16 surrogate pairs. It might make sense to add tests for those.

I didn't check testcases and the benchmark in detail, but the implementation makes sense to me.

@rkistner rkistner requested a review from simolus3 April 9, 2026 15:08
simolus3
simolus3 previously approved these changes Apr 9, 2026
@rkistner rkistner force-pushed the mongo-json-direct branch from 6d7e475 to b68bea1 on April 13, 2026 11:57
Base automatically changed from mongo-json-direct to main April 13, 2026 13:03
@rkistner rkistner dismissed simolus3’s stale review April 13, 2026 13:03

The base branch was changed.

@rkistner rkistner force-pushed the custom-bson-to-json branch from d02a866 to 57e73c3 on April 13, 2026 13:17
@rkistner rkistner merged commit e5074f0 into main Apr 13, 2026
44 checks passed
@rkistner rkistner deleted the custom-bson-to-json branch April 13, 2026 13:38
Sleepful added a commit that referenced this pull request Apr 14, 2026
Merges upstream main which includes PR #591 (raw change streams) and
PR #599 (direct BSON Buffer -> JSON conversion).

Auth fix conflicts (types.ts, config.test.ts) resolved — both sides
had the same fix, upstream also added database name decoding.

ChangeStream.ts has 11 unresolved conflicts — PR #591 replaced the
MongoDB driver ChangeStream with a custom RawChangeStream using raw
aggregate + getMore. Our Cosmos DB changes need to be re-applied to
the new code structure. Resolved in the next commit.
Sleepful added a commit that referenced this pull request Apr 14, 2026
Merges upstream main which includes PR #591 (raw change streams) and
PR #599 (direct BSON Buffer -> JSON conversion).

Auth fix conflicts (types.ts, config.test.ts) resolved — both sides
had the same fix, upstream also added database name decoding.

ChangeStream.ts has 11 unresolved conflicts — PR #591 replaced the
MongoDB driver ChangeStream with a custom RawChangeStream using raw
aggregate + getMore. Our Cosmos DB changes need to be re-applied to
the new code structure. Resolved in the next commit.

resolve: ChangeStream.ts merge conflicts for raw change streams

Re-applied all Cosmos DB changes to the new raw change stream code
structure from PR #591. The raw aggregate approach is better for
Cosmos DB: no lazy ChangeStream init, explicit cursor management,
$changeStream stage built directly in pipeline.

Changes applied to new structure:
- detectCosmosDb() calls in getSnapshotLsn, initReplication, streamChangesInternal
- getEventTimestamp() adapted to ProjectedChangeStreamDocument type
- Sentinel checkpoint with BSON.deserialize for fullDocument (raw Buffer)
- Pipeline guards: skip $changeStreamSplitLargeEvent and showExpandedEvents
- Cluster-level aggregate (admin db + allChangesForCluster) when isCosmosDb
- startAtOperationTime fix (startAfter != null)
- Keepalive guard for Cosmos DB resume tokens
- .lte() dedup guard skip on Cosmos DB
- wallTime tracking for replication lag
- Added changeset for @powersync/service-module-mongodb (minor)

Verified: 59/59 standard MongoDB tests pass.
Cosmos DB cluster is currently down — tests blocked by TLS timeout.
Code audit of RawChangeStream.ts found no compatibility issues:
cursor ID type auto-fixed by BigInt, postBatchResumeToken needs
empirical verification when cluster is back.