RFC: Drive MobilitySpark to MobilityDB ↔ MobilityDuck parity (Pattern A, MEOS-API-driven)

## Background

MobilityDB (PostgreSQL) and MobilityDuck (DuckDB) are now a matched pair — MobilityDuck implements **~100% of MobilityDB's in-scope user-facing SQL surface** (per [`MobilityDuck/doc/PARITY.md`](https://github.com/MobilityDB/MobilityDuck/blob/doc/parity/doc/PARITY.md)). MobilitySpark, by contrast, is currently a **starter project**:

| Dimension | MobilityDuck | MobilitySpark today |
|---|---|---|
| Source files | hundreds | **4 Java files** |
| MEOS types as SQL types | All temporals + spans + sets + boxes | **One demo class** (`TimestampWithValue`, not even a MEOS type) |
| MEOS functions exposed | ~100% of in-scope surface | **Two**: `meos_initialize`, `meos_finalize` |
| UDFs registered | Hundreds | **One demo** (`PowerUDF` — `x²`) |
| Aggregators | Full surface | None |
| Format I/O | WKT / WKB / MFJSON | None |
| Parity manifest | `PARITY.md` + `PARITY-INVENTORY.md` | None |

This RFC asks the community to align on **whether and how to drive MobilitySpark to MobilityDuck-equivalent parity**.

## Why this matters now

Three converging forces make this the right moment to decide:

1. **JMEOS PR #8** ports the bindings to MEOS C API 1.3. Once it merges, MobilitySpark can build parity against a stable JMEOS surface instead of the soon-to-be-stale 1.0 bindings (`MobilitySpark#1` tracks the JMEOS 1.3 bump).
2. **MEOS-API (RFC #836)** is being ratified as the machine-readable description of the MEOS C library. **MobilitySpark parity is its largest single use case**: hand-writing ~487 UDF wrappers is unrealistic, whereas generating them from `meos-api.json` is feasible. The parity work is also the strongest validation of the catalog at scale.
3. **The MobilityDB ↔ MobilityDuck parity story** has now stabilised; the conventions are documented; the patterns are reusable. MobilitySpark can adopt them rather than re-derive them.

## Proposal — Pattern A (SQL-surface mirror), MEOS-API-driven

Mirror MobilityDuck's approach: register all MEOS types as Spark UDTs (with WKB-bytes serialisation), all MEOS functions as Spark UDFs/UDAFs, and document the divergences in a `PARITY.md` analogous to MobilityDuck's. Drive the bulk of the wiring from MEOS-API's `meos-api.json` catalog.
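
As a rough illustration of the codegen path, the sketch below renders Java UDF registration statements from catalog entries. The entry layout (`name`, `sqlName`, `args`, `returns`), the sample entries, the type mapping, and the `MeosUdfs` wrapper class are all assumptions made for this example; the real `meos-api.json` schema may look different.

```python
# Illustrative catalog entries; the real meos-api.json layout may differ.
CATALOG = {
    "functions": [
        {"name": "tfloat_start_value", "sqlName": "startValue",
         "args": ["tfloat"], "returns": "float"},
        {"name": "temporal_eq", "sqlName": "temporal_eq",
         "args": ["tfloat", "tfloat"], "returns": "bool"},
    ]
}

# Assumed mapping: catalog type -> (Java type, Spark SQL DataType constant).
# MEOS values travel as WKB bytes, hence byte[] / BinaryType.
TYPES = {
    "float": ("Double", "DataTypes.DoubleType"),
    "bool": ("Boolean", "DataTypes.BooleanType"),
    "tfloat": ("byte[]", "DataTypes.BinaryType"),
}


def registration_line(fn):
    """Render one Java statement that registers a Spark UDF for a catalog entry."""
    java_types = [TYPES[t][0] for t in fn["args"]] + [TYPES[fn["returns"]][0]]
    generics = ", ".join(java_types)
    spark_return = TYPES[fn["returns"]][1]
    # MeosUdfs is a hypothetical Java class holding the JMEOS-backed wrappers.
    return (
        f'spark.udf().register("{fn["sqlName"]}", '
        f'(UDF{len(fn["args"])}<{generics}>) MeosUdfs::{fn["name"]}, '
        f'{spark_return});'
    )


for fn in CATALOG["functions"]:
    print(registration_line(fn))
```

The per-function cost is one catalog entry; the design effort sits in the template and the type mapping, which is the argument for generating all wrappers rather than a hand-picked subset.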

### Architectural choices

| Choice | MobilityDuck precedent | MobilitySpark proposal |
|---|---|---|
| **UDT serialisation form** | WKB bytes in DuckDB `BLOB` columns | Spark `BinaryType` columns with the same WKB bytes — values round-trip cleanly between the two engines |
| **Operator strategy** | Most operators registered; DuckDB rejects `<<#` / `<<\|` / `<</` (`#`/`\|`/`/` adjacent to angle brackets) → named-function fallback for those | **Spark SQL does not support custom operators at all.** *Every* MobilityDB operator becomes a named-function-only call. Reuse MobilityDuck's named-function table verbatim and extend it to cover the X-axis operators (`<<`, `>>`, `&<`, `&>`) too |
| **Aggregate naming** | `Agg` suffix to avoid collisions with same-named scalars (DuckDB function-name resolution is case-insensitive) | **Same `Agg` suffix.** Spark registers UDFs and UDAFs in a single function namespace, so the same collision exists |
| **`geography` as separate type** | DuckDB-spatial has no separate `geography`; MobilityDuck stores geographic statics as `geometry` with a geodetic flag | **Spark has no geometry at all.** MobilitySpark depends on [Apache Sedona](https://sedona.apache.org/) for the static geometry surface (the Spark-ecosystem counterpart of DuckDB-spatial). Geographic temporal values (`tgeography`, `tgeogpoint`) work as MEOS-typed UDTs |
| **Composite-row returns** (`frechetDistancePath`, etc.) | DuckDB scalars can't return `SETOF` → return `LIST<STRUCT(...)>` | Spark scalars can't return `SETOF` either → return `array<struct<...>>`, recover row semantics with `explode()` (Spark's `unnest()` analogue) |
| **Indexes** | DuckDB has no SP-GiST / GiST / GIN; MobilityDuck provides `TRTREE` | **Spark has no indexes at all** — bucketing/partitioning hints are the closest analog. Document as architectural block; don't try to replicate. The bbox-overlap pushdown that DuckDB does via `TRTREE` becomes a Catalyst-level filter pushdown in Spark, which is its own non-trivial design |
| **Excluded type families** | `tcbuffer`, `tnpoint`, `tpose`, `trgeo`, `th3index` (~5) | **Same exclusion** for v1; add later as MEOS exposes them |
| **Version stamps** | `mobilityduck_version()` + alias for `mobilitydb_version()` | `mobilityspark_version()` + same alias pattern |
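
To make the operator row concrete, here is a minimal sketch of rewriting an operator expression into its named-function form. The function names in the table (`left`, `right`, `overleft`, `overright`) are placeholders for illustration; the authoritative names would come from MobilityDuck's named-function table once it is extended to the X-axis operators.

```python
# Placeholder names: the real mapping would be taken from MobilityDuck's
# named-function table, extended with the X-axis operators.
OPERATOR_TO_FUNCTION = {
    "<<": "left",
    ">>": "right",
    "&<": "overleft",
    "&>": "overright",
}


def to_named_call(lhs, op, rhs):
    """Rewrite `lhs op rhs` as a named-function call usable from Spark SQL."""
    try:
        fn = OPERATOR_TO_FUNCTION[op]
    except KeyError:
        # Fail loudly so a missing table entry is caught during porting.
        raise ValueError(f"no named-function equivalent recorded for {op!r}") from None
    return f"{fn}({lhs}, {rhs})"


print(to_named_call("a.trip", "&<", "b.trip"))
```

A table like this could also back the operator-rename section of `PARITY.md`, so the documentation and any query-porting helper stay in sync.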

### Phased implementation plan

| Phase | Work | Effort (human-time) | MEOS-API leverage |
|---|---|---|---|
| **0** — Architectural alignment | This RFC ratifies; Pattern A confirmed; Apache Sedona dependency confirmed; MEOS-API codegen path confirmed | 1 day discussion | n/a |
| **1** — Foundation framework | Generic `MeosTypeUDT<T>` base class with WKB serialisation; UDT registry that registers all types at SparkSession init; round-trip integration tests | ~1 week | low |
| **2** — UDTs for all MEOS types | ~28 UDT subclasses (8 temporals + 5 spans + 5 spansets + 8 sets + 2 boxes) | ~1 week hand-written **OR ~1 day codegen-driven from MEOS-API's `structs` block** | **high** — every UDT's schema is derivable from the catalog |
| **3** — UDF wrapper generation | Auto-generate Spark UDFs for the ~487 user-facing MEOS functions; register at SparkSession init | ~2 weeks (designing the codegen template is the bulk; running it is fast) | **highest** — this phase is the strongest validation of the catalog at scale |
| **4** — Aggregators | Spark uses a separate `Aggregator` API; ~30 MEOS aggregates need typed-aggregator implementations. Hand-rolling required because Spark's Aggregator (`zero`/`reduce`/`merge`/`finish`) doesn't map cleanly onto MEOS aggregate-state functions | ~2 weeks | medium — catalog exposes aggregate signatures but the `zero`/`reduce`/`merge`/`finish` plumbing needs hand-design |
| **5** — Tests | Mirror MobilityDB's regression tests as parity manifests, analogous to MobilityDuck's `test/queries/` | ~2 weeks | low — parity tests are reusable across MobilityDuck and MobilitySpark with engine-name substitution |
| **6** — `PARITY.md` + `PARITY-INVENTORY.md` | Doc the parity claim, the operator-rename table (for Spark, the entire MobilityDB operator surface becomes named-function-only), the architectural mismatches | ~3 days | low — most of the prose is portable from MobilityDuck's |
| **Total without MEOS-API** | | **~9–10 weeks** | |
| **Total with MEOS-API** | | **~3–4 weeks** | |
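
The Phase 4 contract can be sketched without Spark or MEOS. The toy aggregate below follows the same four steps (`zero`, `reduce`, `merge`, `finish`) over plain `(start, end)` tuples standing in for MEOS spans; the class name, the tuple state, and the `run` helper are assumptions for illustration only. It shows the part that needs hand-design: `merge` must combine partial states produced independently on different partitions, including empty ones.

```python
from functools import reduce as fold
from typing import Optional, Tuple

Span = Tuple[float, float]   # toy stand-in for a MEOS span value
State = Optional[Span]       # None means "no rows seen yet"


class ExtentAgg:
    """Toy extent aggregate with the same four steps as Spark's Aggregator."""

    def zero(self) -> State:
        return None

    def reduce(self, state: State, value: Span) -> State:
        # Fold one input row into the running state.
        if state is None:
            return value
        return (min(state[0], value[0]), max(state[1], value[1]))

    def merge(self, a: State, b: State) -> State:
        # Combine two partial states; either side may be empty.
        if a is None:
            return b
        if b is None:
            return a
        return (min(a[0], b[0]), max(a[1], b[1]))

    def finish(self, state: State) -> State:
        return state


def run(agg, partitions):
    """Simulate per-partition reduction followed by a cross-partition merge."""
    partials = [fold(agg.reduce, rows, agg.zero()) for rows in partitions]
    return agg.finish(fold(agg.merge, partials, agg.zero()))


print(run(ExtentAgg(), [[(1, 3), (2, 5)], [], [(0, 2)]]))
```

A real implementation would additionally supply buffer and output encoders, and the state would be a MEOS aggregate state rather than a tuple.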

### What ships at the end of each phase

- After **Phase 1**: `mvn install` builds the framework, and one demo type works end-to-end (e.g., `TstzSpan`).
- After **Phase 2**: all temporal/span/set/box types are instantiable from SQL, and values round-trip MobilityDB → MobilitySpark → MobilityDuck, verifiable by comparing WKB bytes.
- After **Phase 3**: every MEOS scalar function callable from Spark SQL by its MobilityDB name (with the operator → named-function table documented).
- After **Phase 4**: aggregates work with the `Agg` suffix.
- After **Phase 5**: a parity-test gate runs in CI.
- After **Phase 6**: a public PARITY claim of "~95–98% of in-scope surface" (the cap reflects architectural mismatches: no indexes, all operators are named functions).
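
The headline percentage could be computed from the inventory rather than estimated by hand. The sketch below assumes a hypothetical inventory of `(function name, status)` pairs with made-up status labels (`implemented`, `blocked`, `out_of_scope`); the actual `PARITY-INVENTORY.md` format may differ.

```python
def parity_percent(inventory):
    """Share of in-scope functions that are implemented, as a percentage."""
    in_scope = [status for _, status in inventory if status != "out_of_scope"]
    if not in_scope:
        raise ValueError("inventory has no in-scope entries")
    implemented = sum(1 for status in in_scope if status == "implemented")
    return 100.0 * implemented / len(in_scope)


# Hypothetical inventory: 19 implemented, 1 architecturally blocked,
# 5 belonging to excluded type families.
inventory = (
    [(f"fn_{i}", "implemented") for i in range(19)]
    + [("index_op", "blocked")]
    + [(f"excluded_{i}", "out_of_scope") for i in range(5)]
)
print(parity_percent(inventory))
```

Excluded type families drop out of the denominator, while architectural blocks such as indexes stay in it, which is what keeps the claim below 100%.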

## Reusing MobilityDuck conventions verbatim

Most of MobilityDuck's PARITY.md text is portable. The mapping:

| MobilityDuck section | MobilitySpark equivalent | Effort to port |
|---|---|---|
| TL;DR | Lead with same "~95–98% of in-scope surface; most queries run unchanged" framing | rewrite for Spark specifics (Sedona, no indexes) |
| What I can paste in unchanged | Same shape; same SQL function names | port verbatim |
| Operator forms | Replace with **named-function-only table** — Spark has no custom operators | larger rewrite |
| Operators rejected | Subsumes the entire MobilityDB operator surface for Spark, not just `#`/`\|`/`/` axes | extend MobilityDuck's table |
| Aggregate naming | `Agg` suffix — same convention | port verbatim |
| Version stamps | `mobilityspark_*` + `mobilitydb_*` aliases | port verbatim |
| `geography` not separate | Same conclusion via different reason (Spark has no geometry; depend on Sedona) | rewrite the rationale |
| Composite-row returns | Same shape (`array<struct<...>>`); `explode()` instead of `unnest()` | small rewrite |
| Indexes | **Architectural block** — Spark has no index machinery; document filter-pushdown as future work | rewrite |
| Type families not included | Same (cbuffer / npoint / pose / rgeo / h3index) | port verbatim |

## Alternatives considered

1. **Pattern B (Dataset/DSL API)** — `Dataset<Temporal>` with idiomatic Java/Scala methods, no SQL-surface mirror. Pros: native Spark feel; Cons: doesn't satisfy "parity" semantically; SQL users have nothing. **Rejected** as the v1 strategy. Could be added *additionally* after Pattern A is complete.

2. **Partial parity** — implement only the most-used N% of the MEOS surface, document the rest as TODO. **Rejected** because the codegen approach makes "all 487" no harder than "the most-used 50" — the increment is the codegen template, not the per-function effort.

3. **No parity / leave MobilitySpark as demo** — abandon the parity story; document MobilitySpark as a "starter project, not a parity target." **Rejected** unless the community signals this — strategic-direction analysis (`project_post_parity_strategic_directions.md`) treats the binding ecosystem as the H3-style architecture's load-bearing piece.

## Sequencing

| Trigger | Then |
|---|---|
| **This RFC ratified** | Phase 0 settled |
| **JMEOS PR #8 merges** | Phase 1 begins (foundation against stable JMEOS 1.3) |
| **MEOS-API RFC #836 ratifies (or schema stabilises)** | Phases 2–3 begin (codegen-driven types + UDFs) |
| **Phases 1–4 complete** | Phases 5–6 (tests + parity docs) |
| **All complete** | Update libmeos.org's bindings and architecture page to list MobilitySpark as a peer of MobilityDuck (currently it's not surfaced because the surface is too thin) |

## Asking for

1. **Directional sign-off on Pattern A** (SQL-surface mirror, mirroring MobilityDuck) over Pattern B or partial parity.
2. **Confirmation of Apache Sedona as the static-geometry dependency** — MobilityDuck composes with DuckDB-spatial; the Spark-ecosystem analogue is Sedona. Any objection (e.g., a different geometry layer the project should adopt instead)?
3. **Coordination with MEOS-API and JMEOS#8** — sequencing (Phases 2–3 depend on MEOS-API; everything depends on JMEOS#8). The community's preferred ordering between "parity work" and "MEOS-API ratification" should be settled here.
4. **Owner identification** — this is multi-month work. Who would drive it once unblocked? @GaspardMerten (JMEOS context) is one natural fit; a dedicated MobilitySpark contributor would be another. The RFC isn't asking the filer to do the work — it's asking the community to align on the approach.
5. **Operator-handling acceptance** — Spark's no-custom-operators constraint means the entire MobilityDB operator surface becomes named-function-only in MobilitySpark. Confirm that's acceptable as the parity story (the alternative is no parity at all on the operator axis).

## Cc

- @estebanzimanyi, @mahmsakr, @mschoema — MEOS / MobilityDB / MobilityDuck
- @nhungoc1508 — MobilityDuck (parity precedent author)
- @Diviloper — PyMEOS
- @Davichet-e — meos-rs
- @GaspardMerten — JMEOS / MobilityDB ecosystem (likely owner candidate)
- @JashanReel, @Nyuke235 — MEOS-API (catalog work that the codegen path depends on)

