Background
MobilityDB (PostgreSQL) and MobilityDuck (DuckDB) are now a matched pair: MobilityDuck implements ~100% of MobilityDB's in-scope user-facing SQL surface (per MobilityDuck/doc/PARITY.md). MobilitySpark, by contrast, is currently a starter project:
| Dimension | MobilityDuck | MobilitySpark today |
|---|---|---|
| Source files | hundreds | 4 Java files |
| MEOS types as SQL types | All temporals + spans + sets + boxes | One demo class (`TimestampWithValue`, not even a MEOS type) |
| MEOS functions exposed | ~100% of in-scope surface | Two: `meos_initialize`, `meos_finalize` |
| UDFs registered | Hundreds | One demo (`PowerUDF`, computing x²) |
| Aggregators | Full surface | None |
| Format I/O | WKT / WKB / MFJSON | None |
| Parity manifest | `PARITY.md` + `PARITY-INVENTORY.md` | None |
This RFC asks the community to align on whether and how to drive MobilitySpark to MobilityDuck-equivalent parity.
Why this matters now
Three converging forces make this the right moment to decide:
MEOS-API (RFC #836) is being ratified as the machine-readable description of the MEOS C library. MobilitySpark parity is its largest single use case — hand-writing 487 UDF wrappers is unrealistic; auto-generating from meos-api.json is feasible. The parity work is also the strongest validation of the catalog at scale.
The MobilityDB ↔ MobilityDuck parity story has now stabilised; the conventions are documented; the patterns are reusable. MobilitySpark can adopt them rather than re-derive them.
Proposal — Pattern A (SQL-surface mirror), MEOS-API-driven
Mirror MobilityDuck's approach: register all MEOS types as Spark UDTs (with WKB-bytes serialisation), all MEOS functions as Spark UDFs/UDAFs, and document the divergences in a PARITY.md analogous to MobilityDuck's. Drive the bulk of the wiring from MEOS-API's meos-api.json catalog.
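As a rough illustration of what "driving the wiring from the catalog" could look like, the sketch below emits one line of Spark UDF registration source per catalog entry. The `FunctionEntry` fields, the `MeosBridge` class and the function names are hypothetical; the real `meos-api.json` schema may look quite different, and every argument is assumed to travel as a WKB byte array.

```java
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Sketch: emit Spark UDF registration source from catalog-like entries. */
final class UdfCodegenSketch {

    /** Hypothetical, simplified catalog entry; the real meos-api.json schema may differ. */
    static final class FunctionEntry {
        final String sqlName; // name exposed to Spark SQL
        final String cName;   // MEOS C function it forwards to
        final int arity;      // number of arguments, each a WKB byte array here

        FunctionEntry(String sqlName, String cName, int arity) {
            this.sqlName = sqlName;
            this.cName = cName;
            this.arity = arity;
        }
    }

    /** Returns one line of Java source registering the entry as a binary-in, binary-out UDF. */
    static String emitRegistration(FunctionEntry f) {
        String params = IntStream.range(0, f.arity)
                .mapToObj(i -> "a" + i)
                .collect(Collectors.joining(", "));
        // Spark's UDF<n> interfaces take n input types plus one return type.
        String typeArgs = String.join(", ", Collections.nCopies(f.arity + 1, "byte[]"));
        // MeosBridge is a hypothetical class wrapping the native calls.
        return String.format(
                "spark.udf().register(\"%s\", (UDF%d<%s>) (%s) -> MeosBridge.%s(%s), DataTypes.BinaryType);",
                f.sqlName, f.arity, typeArgs, params, f.cName, params);
    }

    public static void main(String[] args) {
        List<FunctionEntry> catalog = List.of(
                new FunctionEntry("exampleUnary", "meos_example_unary", 1),
                new FunctionEntry("exampleBinary", "meos_example_binary", 2));
        catalog.forEach(f -> System.out.println(emitRegistration(f)));
    }
}
```

The point of the sketch is that the per-function cost collapses into one template; adding a function means adding a catalog entry, not writing a wrapper.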
Architectural choices
| Choice | MobilityDuck precedent | MobilitySpark proposal |
|---|---|---|
| UDT serialisation form | WKB bytes in DuckDB `BLOB` columns | Spark `BinaryType` columns with the same WKB bytes; values round-trip cleanly between the two engines |
| Operator strategy | Most operators registered; DuckDB rejects `<<#` / `<<\|` / `<</` (`#`/`\|`/`/` adjacent to angle brackets) → named-function fallback for those | Spark SQL does not support custom operators at all. Every MobilityDB operator becomes a named-function-only call. Reuse MobilityDuck's named-function table verbatim and extend it to cover the X-axis operators (`<<`, `>>`, `&<`, `&>`) too |
| Aggregate naming | `Agg` suffix to avoid collisions with same-named scalars (DuckDB function-name resolution is case-insensitive) | Same `Agg` suffix. Spark's UDF/UDAF registry shares a name namespace, so the same collision exists |
| `geography` as separate type | DuckDB-spatial has no separate `geography`; MobilityDuck stores geographic statics as `geometry` with geodetic flag | Spark has no geometry at all. MobilitySpark depends on Apache Sedona for the static geometry surface (sister project to DuckDB-spatial). Geographic temporal values (`tgeography`, `tgeogpoint`) work as MEOS-typed UDTs |
| Composite-row returns (`frechetDistancePath`, etc.) | Scalars can't return `SETOF` → return `LIST<STRUCT(...)>` | Spark scalars can't return `SETOF` either → return `array<struct<...>>`, recover row semantics with `explode()` (Spark's `unnest()` analogue) |
| Indexes | DuckDB has no SP-GiST / GiST / GIN; MobilityDuck provides `TRTREE` | Spark has no indexes at all; bucketing/partitioning hints are the closest analog. Document as architectural block; don't try to replicate. The bbox-overlap pushdown that DuckDB does via `TRTREE` becomes a Catalyst-level filter pushdown in Spark, which is its own non-trivial design |
| Excluded type families | `tcbuffer`, `tnpoint`, `tpose`, `trgeo`, `th3index` (~5) | Same exclusion for v1; add later as MEOS exposes them |
| Version stamps | `mobilityduck_version()` + alias for `mobilitydb_version()` | `mobilityspark_version()` + same alias pattern |
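The composite-row convention above (a scalar returns an array of structs, and the caller flattens it back into rows) can be modelled without Spark. The sketch below is a plain-Java stand-in for what `explode()` does to such a column; the `Row`, `PathStep` and `ExplodedRow` classes and their fields are hypothetical, chosen only to make the shape visible.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: plain-Java model of flattening an array-of-structs column into rows. */
final class ExplodeSketch {

    /** One struct element of the returned array; field names are illustrative. */
    static final class PathStep {
        final int i;
        final int j;
        PathStep(int i, int j) { this.i = i; this.j = j; }
    }

    /** An input row whose second column holds the array a scalar function returned. */
    static final class Row {
        final String id;
        final List<PathStep> path;
        Row(String id, List<PathStep> path) { this.id = id; this.path = path; }
    }

    /** An output row: the original key paired with a single array element. */
    static final class ExplodedRow {
        final String id;
        final PathStep step;
        ExplodedRow(String id, PathStep step) { this.id = id; this.step = step; }
    }

    /** Emits one output row per array element; null or empty arrays yield no rows. */
    static List<ExplodedRow> explode(List<Row> rows) {
        List<ExplodedRow> out = new ArrayList<>();
        for (Row r : rows) {
            if (r.path == null) {
                continue;
            }
            for (PathStep s : r.path) {
                out.add(new ExplodedRow(r.id, s));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row("a", List.of(new PathStep(0, 0), new PathStep(1, 1))),
                new Row("b", List.of()));
        for (ExplodedRow e : explode(rows)) {
            System.out.println(e.id + " -> (" + e.step.i + ", " + e.step.j + ")");
        }
    }
}
```

Rows whose array is empty or null disappear from the output, which is also how Spark's `explode()` behaves; `explode_outer()` is the variant that keeps them.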
Phased implementation plan
| Phase | Work | Effort (human-time) | MEOS-API leverage |
|---|---|---|---|
| 0: Architectural alignment | This RFC ratifies; Pattern A confirmed; Apache Sedona dependency confirmed; MEOS-API codegen path confirmed | 1 day discussion | n/a |
| 1: Foundation framework | Generic `MeosTypeUDT<T>` base class with WKB serialisation; UDT registry that registers all types at SparkSession init; round-trip integration tests | ~1 week hand-written OR ~1 day codegen-driven from MEOS-API's `structs` block | high: every UDT's schema is derivable from the catalog |
| 3: UDF wrapper generation | Auto-generate Spark UDFs for the ~487 user-facing MEOS functions; register at SparkSession init | ~2 weeks (designing the codegen template is the bulk; running it is fast) | highest: this phase is the strongest validation of the catalog at scale |
| 4: Aggregators | Spark uses a separate `Aggregator` API; ~30 MEOS aggregates need typed-aggregator implementations. Hand-rolling required because Spark's `Aggregator` (`zero`/`reduce`/`merge`/`finish`) doesn't map cleanly onto MEOS aggregate-state functions | ~2 weeks | medium: catalog exposes aggregate signatures but the `zero`/`reduce`/`merge`/`finish` plumbing needs hand-design |
| 5: Tests | Mirror MobilityDB's regression tests as parity manifests, analogous to MobilityDuck's `test/queries/` | ~2 weeks | low: parity tests are reusable across MobilityDuck and MobilitySpark with engine-name substitution |
| 6: `PARITY.md` + `PARITY-INVENTORY.md` | Document the parity claim, the operator-rename table (for Spark, the entire MobilityDB operator surface becomes named-function-only), and the architectural mismatches | ~3 days | low: most of the prose is portable from MobilityDuck's |
| Total without MEOS-API | | ~9–10 weeks | |
| Total with MEOS-API | | ~3–4 weeks | |
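To make the Phase 4 plumbing concrete, here is a minimal sketch of the four-method contract that Spark's `Aggregator` imposes (`zero`, `reduce`, `merge`, `finish`), reproduced as a plain Java interface so it runs without Spark. The `BoundingSpan` aggregate and its `long[]{lower, upper}` state are illustrative assumptions, not MEOS's actual aggregate state; a real implementation would also have to supply Spark encoders for the buffer and output.

```java
import java.util.List;

/** Sketch: the zero/reduce/merge/finish shape, modelled without a Spark dependency. */
final class AggregatorSketch {

    /** Mirrors the four core methods of Spark's Aggregator<IN, BUF, OUT>. */
    interface Shape<IN, BUF, OUT> {
        BUF zero();                      // empty buffer for a partition
        BUF reduce(BUF buffer, IN in);   // fold one input row into the buffer
        BUF merge(BUF left, BUF right);  // combine buffers from two partitions
        OUT finish(BUF buffer);          // turn the final buffer into the result
    }

    /** Illustrative aggregate: smallest span covering all input spans {lower, upper}. */
    static final class BoundingSpan implements Shape<long[], long[], long[]> {
        @Override public long[] zero() {
            // Inverted bounds mark "no input seen yet".
            return new long[] {Long.MAX_VALUE, Long.MIN_VALUE};
        }
        @Override public long[] reduce(long[] b, long[] in) {
            return new long[] {Math.min(b[0], in[0]), Math.max(b[1], in[1])};
        }
        @Override public long[] merge(long[] l, long[] r) {
            return new long[] {Math.min(l[0], r[0]), Math.max(l[1], r[1])};
        }
        @Override public long[] finish(long[] b) {
            // No input at all maps to null, standing in for SQL NULL.
            return b[0] > b[1] ? null : b;
        }
    }

    /** Simulates a distributed run: reduce within each partition, then merge across them. */
    static <IN, BUF, OUT> OUT run(Shape<IN, BUF, OUT> agg, List<List<IN>> partitions) {
        BUF total = agg.zero();
        for (List<IN> partition : partitions) {
            BUF local = agg.zero();
            for (IN in : partition) {
                local = agg.reduce(local, in);
            }
            total = agg.merge(total, local);
        }
        return agg.finish(total);
    }

    public static void main(String[] args) {
        long[] span = run(new BoundingSpan(), List.of(
                List.of(new long[] {1, 5}, new long[] {3, 9}),
                List.of(new long[] {0, 2})));
        System.out.println("[" + span[0] + ", " + span[1] + "]");
    }
}
```

The hand-design effort mentioned in the table sits mostly in `zero` and `merge`: a MEOS aggregate needs a well-defined empty state and an associative way to combine two partial states before it fits this shape.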
What ships at the end of each phase
After Phase 1: a runnable mvn install builds the framework; one demo type works end-to-end (e.g., TstzSpan).
After Phase 2: all temporal/span/set/box types instantiable from SQL; values round-trip MobilityDB→MobilitySpark→MobilityDuck verifiable with WKB bytes.
After Phase 3: every MEOS scalar function callable from Spark SQL by its MobilityDB name (with the operator → named-function table documented).
After Phase 4: aggregates work with the Agg suffix.
After Phase 5: a parity-test gate runs in CI.
After Phase 6: a public PARITY claim of "~95–98% of in-scope surface" (the cap reflects architectural mismatches: no indexes, all operators are named functions).
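The Phase 1 deliverable (a generic base class plus a round-trip check for one demo type) could be prototyped along the following lines. This is only a sketch: `MeosTypeUdt` is a hypothetical stand-in that loosely mirrors the `serialize`/`deserialize` pair of Spark's `UserDefinedType`, and the 16-byte layout used for the demo span is a placeholder, not MEOS WKB.

```java
import java.nio.ByteBuffer;

/** Sketch: a generic UDT-like base class with a byte-level round-trip check. */
final class UdtRoundTripSketch {

    /** Hypothetical base class; a real one would extend Spark's UserDefinedType. */
    abstract static class MeosTypeUdt<T> {
        abstract byte[] serialize(T value);
        abstract T deserialize(byte[] bytes);

        /** True when decoding the encoded value yields an equal value. */
        boolean roundTrips(T value) {
            return value.equals(deserialize(serialize(value)));
        }
    }

    /** Demo value: a span given by two epoch-microsecond bounds. */
    static final class Span {
        final long lower;
        final long upper;
        Span(long lower, long upper) { this.lower = lower; this.upper = upper; }

        @Override public boolean equals(Object o) {
            return o instanceof Span && ((Span) o).lower == lower && ((Span) o).upper == upper;
        }
        @Override public int hashCode() {
            return Long.hashCode(lower) * 31 + Long.hashCode(upper);
        }
    }

    /** Placeholder encoding (two big-endian longs); real code would delegate to MEOS WKB. */
    static final class SpanUdt extends MeosTypeUdt<Span> {
        @Override byte[] serialize(Span v) {
            return ByteBuffer.allocate(2 * Long.BYTES).putLong(v.lower).putLong(v.upper).array();
        }
        @Override Span deserialize(byte[] bytes) {
            ByteBuffer buf = ByteBuffer.wrap(bytes);
            long lower = buf.getLong();
            long upper = buf.getLong();
            return new Span(lower, upper);
        }
    }

    public static void main(String[] args) {
        SpanUdt udt = new SpanUdt();
        System.out.println(udt.roundTrips(new Span(1_000_000L, 2_000_000L)));
    }
}
```

Keeping the round-trip check on the base class means every generated subtype inherits the same test hook, which fits the idea of a per-type integration test in Phase 1.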
Reusing MobilityDuck conventions verbatim
Most of MobilityDuck's PARITY.md text is portable. The mapping:
| MobilityDuck section | MobilitySpark equivalent | Effort to port |
|---|---|---|
| TL;DR | Lead with same "~95–98% of in-scope surface; most queries run unchanged" framing | rewrite for Spark specifics (Sedona, no indexes) |
| What I can paste in unchanged | Same shape; same SQL function names | port verbatim |
| Operator forms | Replace with named-function-only table; Spark has no custom operators | larger rewrite |
| Operators rejected | Subsumes the entire MobilityDB operator surface for Spark, not just `#`/`\|`/`/` axes | extend MobilityDuck's table |
| Aggregate naming | `Agg` suffix, same convention | port verbatim |
| Version stamps | `mobilityspark_*` + `mobilitydb_*` aliases | port verbatim |
| `geography` not separate | Same conclusion via different reason (Spark has no geometry; depend on Sedona) | rewrite the rationale |
| Composite-row returns | Same shape (`array<struct<...>>`); `explode()` instead of `unnest()` | small rewrite |
| Indexes | Architectural block: Spark has no index machinery; document filter-pushdown as future work | rewrite |
| Type families not included | Same (cbuffer / npoint / pose / rgeo / h3index) | port verbatim |
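The aggregate-naming convention is easy to state as code. The sketch below models a single, case-insensitive function namespace shared by scalars and aggregates, and shows why the `Agg` suffix avoids the clash. The registry class and the function names are hypothetical; this is a model of the naming problem, not Spark's actual function registry.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Sketch: one shared, case-insensitive namespace for scalar and aggregate functions. */
final class FunctionRegistrySketch {

    // Lowercased function name mapped to its kind ("scalar" or "aggregate").
    private final Map<String, String> kinds = new HashMap<>();

    /** Registers a function; returns false when the name is already taken, ignoring case. */
    boolean register(String name, String kind) {
        return kinds.putIfAbsent(name.toLowerCase(Locale.ROOT), kind) == null;
    }

    /** Applies the suffix convention to derive the aggregate's registered name. */
    static String aggregateName(String baseName) {
        return baseName + "Agg";
    }

    public static void main(String[] args) {
        FunctionRegistrySketch registry = new FunctionRegistrySketch();
        System.out.println(registry.register("exampleFn", "scalar"));                   // free name
        System.out.println(registry.register("EXAMPLEFN", "aggregate"));                // clashes
        System.out.println(registry.register(aggregateName("exampleFn"), "aggregate")); // suffixed
    }
}
```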
Alternatives considered
Pattern B (Dataset/DSL API): Dataset<Temporal> with idiomatic Java/Scala methods, no SQL-surface mirror. Pros: native Spark feel. Cons: does not satisfy "parity" semantically, and SQL users get nothing. Rejected as the v1 strategy; it could be layered on top once Pattern A is complete.
Partial parity — implement only the most-used N% of the MEOS surface, document the rest as TODO. Rejected because the codegen approach makes "all 487" no harder than "the most-used 50" — the increment is the codegen template, not the per-function effort.
No parity / leave MobilitySpark as demo — abandon the parity story; document MobilitySpark as a "starter project, not a parity target." Rejected unless the community signals this — strategic-direction analysis (project_post_parity_strategic_directions.md) treats the binding ecosystem as the H3-style architecture's load-bearing piece.
Sequencing
Phase 1 begins (foundation against stable JMEOS 1.3)
MEOS-API RFC #836 ratifies (or schema stabilises) → Phases 2–3 begin (codegen-driven types + UDFs)
Phases 1–4 complete → Phases 5–6 (tests + parity docs)
All complete → update libmeos.org's bindings and architecture page to list MobilitySpark as a peer of MobilityDuck (currently it is not surfaced because the surface is too thin)
Asking for
Directional sign-off on Pattern A (SQL-surface mirror, mirroring MobilityDuck) over Pattern B or partial parity.
Confirmation of Apache Sedona as the static-geometry dependency — MobilityDuck composes with DuckDB-spatial; the Spark-ecosystem analogue is Sedona. Any objection (e.g., a different geometry layer the project should adopt instead)?
Coordination with MEOS-API and JMEOS#8 — sequencing (Phases 2–3 depend on MEOS-API; everything depends on JMEOS#8). The community's preferred ordering between "parity work" and "MEOS-API ratification" should be settled here.
Owner identification — this is multi-month work. Who would drive it once unblocked? @GaspardMerten (JMEOS context) is one natural fit; a dedicated MobilitySpark contributor would be another. The RFC isn't asking the filer to do the work — it's asking the community to align on the approach.
Operator-handling acceptance — Spark's no-custom-operators constraint means the entire MobilityDB operator surface becomes named-function-only in MobilitySpark. Confirm that's acceptable as the parity story (the alternative is no parity at all on the operator axis).