Integration: BerlinMOD benchmark branch — 907/907 green on the b183b12 MEOS + th3index/h3 surface#16

Open
estebanzimanyi wants to merge 167 commits into MobilityDB:main from estebanzimanyi:integration/berlinmod-bench

@estebanzimanyi
Member

What this is

The accumulated BerlinMOD benchmark integration branch — it consumes the
currently-open feature PRs so the full Spark UDF surface can be exercised and
benchmarked together against a single MEOS pin. It is an integration/visibility
branch; the individual PRs below remain the unit of review.

Consumes (open): #5, #7, #8, #9, #10, #11, #12, #13, #14, #15.

State

Builds and passes the full suite — 907 tests, 0 failures, 0 errors — against
a JMEOS-1.4.jar regenerated from the estebanzimanyi/MobilityDB
fix/meos-pg-symbol-collision-plus-h3 MEOS API (th3index + static-geometry
h3_geo surface), with libmeos built H3=ON.

The h3 prefilter UDFs (geo_to_h3index_set, ever_eq_anyof_h3indexset_th3index)
are bound directly to libmeos via H3IndexJnrBindings; the rest go through the
regenerated functions facade.

Reconciliation carried here

The regenerated jar corrects several MEOS C-type resolutions the prior jar got
wrong: H3Index/int64 → long, bool → boolean (including subdir-header
functions), and an 8-byte TimestampTz * out-param that was under-allocated to
4 bytes — the cause of the stbox/tbox tmin/tmax and timestampN errors and
of the inflated native-heap reading in the leak test. UDF call sites were
reconciled to the corrected signatures (interpType enum, direct h3 binding,
acontains_geo_tgeo/tspatial_transform_pipeline renames, a no-exit MEOS error
handler, and out-param/array marshalling). One test was corrected: tnumber_trend
on a step-interpolated tint correctly returns null.
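
The out-param width problem can be illustrated outside JNR-FFI. The following is a minimal sketch, written in Python for brevity and assuming a little-endian layout; `read_with_width` is a hypothetical helper, not part of JMEOS or MobilitySpark. It only shows what a caller sees when an 8-byte TimestampTz is read back through a 4-byte slot (the real bug additionally lets native code write past the 4-byte allocation).

```python
import struct

def read_with_width(value_us: int, width: int) -> int:
    """Serialise a signed 64-bit value, then read back only `width` bytes."""
    raw = struct.pack("<q", value_us)        # 8 bytes, little-endian
    fmt = {4: "<i", 8: "<q"}[width]          # 4-byte vs 8-byte signed read
    return struct.unpack(fmt, raw[:width])[0]

# Hypothetical TimestampTz: 2024-01-01 00:00:00 UTC,
# expressed in microseconds since 2000-01-01 UTC.
ts = 757_382_400 * 1_000_000

full = read_with_width(ts, 8)       # round-trips
truncated = read_with_width(ts, 4)  # keeps only the low 32 bits
```

With the 8-byte read the value round-trips; with the 4-byte read only the low 32 bits survive, which is consistent with the wrong tmin/tmax values described above.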

Luis Alfredo Leon Villapun and others added 30 commits August 7, 2023 12:05
Three JNR-FFI call sites in BerlinMODUDFs referenced old pre-1.4 MEOS
function names that no longer exist in the runtime libmeos.so:

  overlaps_tpoint_stbox        → overlaps_tspatial_stbox
  tpoint_value_at_timestamptz  → temporal_value_at_timestamptz
  adisjoint_tpoint_tpoint      → adisjoint_tgeo_tgeo

JMEOS-1.4.jar updated to declare the new names in the JNR-FFI interface.
BerlinMODBench.java:
  - loadExistingTimings() reads a prior results JSON so a partial run can
    be resumed without losing already-collected query timings
  - JSON output now emits queries in canonical QUERY_ORDER regardless of
    run order
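
The resume-and-reorder behaviour can be sketched as follows. This is a hedged illustration in Python, not the Java code in BerlinMODBench.java; `merge_timings` and the shortened `QUERY_ORDER` list are assumptions made for the example.

```python
# Hypothetical, shortened canonical order; the real QUERY_ORDER lives in
# BerlinMODBench.java and is not reproduced here.
QUERY_ORDER = ["q01", "q02", "q03", "q04", "q05"]

def merge_timings(existing: dict, fresh: dict) -> dict:
    """Merge timings from a prior results file with a new partial run.

    Fresh measurements override older ones; the result is emitted in
    canonical order regardless of the order the queries were run in.
    """
    merged = {**existing, **fresh}
    ordered = {q: merged[q] for q in QUERY_ORDER if q in merged}
    # Keep any query not listed in QUERY_ORDER, after the canonical ones.
    for q in sorted(merged):
        ordered.setdefault(q, merged[q])
    return ordered

result = merge_timings({"q04": 1.0, "q02": 2.0}, {"q03": 3.0, "q02": 2.5})
```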

bench_mbdb.sh / bench_mduck.sh:
  - New --queries option accepts comma/range syntax (e.g. "q04", "q02-q05")
    to re-run only selected queries and merge into an existing output file
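
One way to read the comma/range syntax is sketched below. The shipped option is implemented in the bash runners; this Python version is only an assumed interpretation of the syntax shown in the examples ("q04", "q02-q05"), and `parse_queries` is a hypothetical name.

```python
import re

def parse_queries(spec: str) -> list[str]:
    """Expand a selector such as "q04,q02-q05" into an ordered query list."""
    selected: list[str] = []
    for part in (p.strip() for p in spec.split(",")):
        span = re.fullmatch(r"q(\d+)-q(\d+)", part)
        if span:
            lo, hi = int(span.group(1)), int(span.group(2))
            if lo > hi:
                raise ValueError(f"empty range: {part!r}")
            names = [f"q{n:02d}" for n in range(lo, hi + 1)]
        elif re.fullmatch(r"q\d+", part):
            names = [part]
        else:
            raise ValueError(f"unrecognised selector: {part!r}")
        for name in names:
            if name not in selected:   # de-duplicate, keep first position
                selected.append(name)
    return selected
```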
🎉 Complete coverage of the active addressable MobilityDB SQL surface.
907/907 unit tests green. Compare to MobilityDuck 79.3% (current).

Adds ~315 UDFs across 16 new files + extends 12 existing files.
Coverage trajectory: 51% → 100% across the parity push. All 51 active
sections now at 100%.

==== New UDF classes ====
- TPointSTBoxOpsUDFs: 42 cross-type STBox×TPoint positional/topological
- TBoxOpsUDFs: 39 cross-type TBox×TNumber positional/topological
- SpansetOpsUDFs: 23 cross-type Span/Spanset positional/topological
- TemporalCompUDFs: 26 temporal comparison ops (teq/tne/tlt/tle/tgt/tge)
- TemporalBoxOpsUDFs: 30 cross-type box predicates
- AlwaysSpatialRelsUDFs: 12 'always' spatial-relationship predicates
- SetOpsUDFs: set×set positional + topological + per-type distance
- IOAliasUDFs: 100+ typed *From{HexWKB,Binary,Text,EWKT,EWKB,MFJSON} aliases
- SubtypeConstructorUDFs: typed Inst/Seq/SeqSet aliases + accessors
- AccessorAliasUDFs: typed span/spanset width, dates, valueSpan, set-values
  arrays, tboxes/stboxes/spans (array-returning), bins, splits, valueSet,
  segmentMin/MaxDuration, box2d, box3d (PostGIS embedded in MEOS),
  mobilitydbVersion, avgValue, tgeometry/tgeography conversions, quadSplit,
  getBin/timestamptzGetBin
- BucketUDFs: floatBucket, intBucket
- GeoAffineUDFs: translate/translate3, rotate, rotateX/Y/Z, transscale, affine
- TileUDFs: complete multi-dimensional tiling for parallel processing —
  spaceBoxes / spaceTimeBoxes / valueTimeBoxesT{float,int} / time/value
  Boxes/Tiles/Splits, getTimeTile / getSpaceTile / getSpaceTimeTile /
  getStboxTimeTile / getValueTile / getValueTimeTile / getTBoxTimeTile,
  spaceTiles / spaceTimeTiles / stbox/tint/tfloatTimeTiles, makeSimple
  (Temporal** array of simple sub-tpoints), tfloat/tintValueTiles,
  tfloat/tintValueSplit (Temporal** with Datum vsize/vorigin via IEEE bits),
  tfloat/tintValueTimeSplit, geoMeasure (tpoint+tfloat → geometry),
  asMVTGeom (tpoint → array of WKT geometries clipped to STBox bounds)
- SeqSetGapsUDFs: tbool/tint/tfloat/ttext/tgeompoint/tgeogpoint/tgeometry/
  tgeographySeqSetGaps (closes long-standing user request from MobilityDB
  issue #187 — array-of-instants → tsequenceset_make_gaps with native
  TInstant** packing)

==== Extended existing UDF classes ====
- GeoUDFs, DistanceUDFs, GeoAnalyticsUDFs, STBoxUDFs, TBoxUDFs,
  SimilarityUDFs, TTextUDFs, TransformUDFs, BoolOpsUDFs, TemporalUDFs,
  AccessorUDFs, SpanAlgebraUDFs — see docs/parity-status.md for full per-
  section coverage

==== MeosNative.java (new) ====
Supplementary JNR-FFI interface for ~70 MEOS-1.4 symbols not yet in
JMEOS-1.4: nad/nai/shortestline_tgeo_*, {dir}_stbox_tspatial /
_tspatial_stbox, float/int_get_bin, t{float,int}box_expand,
tgeometry/tgeography_in/_from_mfjson, temporal_mem_size, tgeoinst_make,
temporal_before/after_timestamptz, textcat_ttext_*, mobilitydb_version,
intset/bigintset/floatset_value_n out-param accessors, tnumber_avg_value,
tgeo*-to-tgeo* conversions, span_expand/_bins, tnumber/tgeo_split_*_n_*,
tnumber_tboxes / tgeo_stboxes, tpoint_minus_geom / _direction /
_make_simple, temporal_dyntimewarp_path / _frechet_path, tgeo_affine,
temporal_time_bins / tstzspan_bins / t{int,float}_value_bins,
stbox_quad_split, timestamptz_get_bin, stbox_get_space/time/space_time_tile,
tgeo_space/space_time_boxes, tnumber_value_time_boxes (Datum via long),
temporal_time_split / tgeo_space_split / tgeo_space_time_split (Temporal**
+ bin out-params), temporal_values_p + set_make_free + temptype_basetype
(valueSet path), temporal_segm_duration, stbox_to_box3d / _to_gbox +
box3d_out / gbox_out (PostGIS BOX3D/BOX2D embedded in MEOS),
stbox_space/time/space_time_tiles, t{int,float}box_time/value/value_time
_tiles, tnumber_value_split / _value_time_split (Datum splits with IEEE
bit-packed vsize/vorigin), tbox_get_value_time_tile (single-tile lookup
with MeosType basetype/spantype enum dispatch), tpoint_tfloat_to_geomeas,
tpoint_as_mvtgeom, tnumber_to_tbox.

==== Audit infrastructure ====
scripts/parity-audit.py — regenerable. Match strategy: snake_case →
camelCase, type-prefix stripping, wrapper-style dispatcher recognition,
type-suffix matching. Out-of-scope buckets:
  - Section-level: GiST/SPGiST opclasses, set/span/spanset index files,
    019_geo_constructors (PG geometric types), 999_oid_cache
  - Suffix-level: PG plumbing (_in/_out/_recv/_send, _transfn/_combinefn/
    _finalfn/_serialize/_deserialize, _sel/_joinsel/_supportfn/_analyze,
    _typmod_in/_out, _cmp/_eq/_ne/_lt/_le/_gt/_ge/_hash/_hash_extended)
  - Exact name: range/multirange (PG range types, NOT in MEOS),
    create_trip (BerlinMOD generator, PG-only), transform_gk (SECONDO
    Gauss-Krüger projection)
  Note: box2d/box3d ARE addressable (PostGIS embedded in MEOS).
Deferred families: cbuffer, npoint, pose, rgeo.
docs/parity-status.md — per-section coverage report (regenerable).
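
The first two steps of the match strategy (snake_case → camelCase and type-prefix stripping) can be sketched as below. This is an assumption-laden illustration, not an excerpt from scripts/parity-audit.py: the prefix list and both function names are hypothetical.

```python
# Hypothetical prefix list; the real audit script may use a different set.
TYPE_PREFIXES = ("temporal_", "tnumber_", "tpoint_", "tgeo_")

def snake_to_camel(name: str) -> str:
    """temporal_value_at_timestamptz -> temporalValueAtTimestamptz"""
    head, *rest = name.split("_")
    return head + "".join(part[:1].upper() + part[1:] for part in rest)

def candidate_names(snake_name: str) -> set[str]:
    """Names under which a UDF may plausibly be registered."""
    candidates = {snake_to_camel(snake_name)}
    for prefix in TYPE_PREFIXES:
        if snake_name.startswith(prefix):
            # Type-prefix stripping: also try the name without its type prefix.
            candidates.add(snake_to_camel(snake_name[len(prefix):]))
    return candidates
```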
Ports MobilityDB's th3index temporal H3-cell index into MobilitySpark to
accelerate the BerlinMOD cross-join family (Q4/Q11/Q12/Q14 and similar)
that currently time out because Spark has no spatial index analogous to
MobilityDB's GiST or DuckDB's multi-dim index.

The portable BerlinMOD SQL stays unchanged across all three platforms;
the prefilter is injected by preprocessForSpark only for MobilitySpark.

## What lands

- src/main/java/org/mobilitydb/spark/h3/Th3IndexUDFs.java
    9 UDFs covering the BerlinMOD-relevant subset of meos_h3.h:
    tgeompointToTh3Index / tgeogpointToTh3Index   (load-time conversion)
    h3IndexFromText / h3IndexAsText               (cell I/O)
    geomToH3Cell                                   (point-geom → cell, via
                                                    tpointinst_make + th3index)
    everEqH3IndexTh3Index / alwaysEqH3IndexTh3Index (membership prefilter)
    th3IndexGetResolution                          (introspection)

- BerlinMODBench.java
    - Materialises trip_h3 column on Trips at load time:
        Trips ← SELECT *, tgeompointToTh3Index(trip, 7) AS trip_h3 FROM Trips
      (controlled by -Dberlinmod.bench.th3index.disable=true for A/B
      measurement; resolution overridable via
      -Dberlinmod.bench.th3index.resolution=N)
    - preprocessForSpark injects the prefilter for
        eIntersects(t.<col>, q.<col>)
      patterns:
        (COALESCE(everEqH3IndexTh3Index(geomToH3Cell(q.<col>, 7),
                                        t.trip_h3), TRUE)
         AND eIntersects(t.<col>, q.<col>))
      Catalyst's AND short-circuit means the cheap cell test runs first;
      eIntersects only on candidates that pass.  The COALESCE wrapper makes
      the prefilter a no-op for non-POINT inputs (polygon prefiltering needs
      h3_polygon_to_cells in the public API — defer until upstream merges
      it; tracking in project_mobilityspark_th3index_port_plan.md).

- MobilitySparkSession.java
    Th3IndexUDFs.registerAll(spark) added at the end of the registration
    chain.

- bench_mspark.sh
    Adds spark.sql.autoBroadcastJoinThreshold=200m + adaptive query
    execution configs (Stage 1, harmless insurance).  Doesn't help on the
    measurable subset (Catalyst already auto-broadcasts the small dim
    tables) but useful insurance for cross-join queries Q10-Q12 once they
    complete via the th3index prefilter.
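
The eIntersects rewrite described above can be sketched as a single regular-expression substitution. This is a hedged Python illustration of the textual rewrite, not the Java preprocessForSpark implementation; the pattern, the function name, and the assumption that both arguments have the shape alias.column are hypothetical.

```python
import re

# Matches eIntersects(<alias>.<col>, <alias>.<col>) only; anything more
# complex is left untouched in this sketch.
EINTERSECTS = re.compile(
    r"eIntersects\(\s*(\w+)\.(\w+)\s*,\s*(\w+)\.(\w+)\s*\)"
)

def inject_prefilter(sql: str, resolution: int = 7) -> str:
    """Wrap each matching eIntersects call with the H3 cell prefilter."""
    def rewrite(match: re.Match) -> str:
        t_alias, _t_col, q_alias, q_col = match.groups()
        return (
            f"(COALESCE(everEqH3IndexTh3Index("
            f"geomToH3Cell({q_alias}.{q_col}, {resolution}), "
            f"{t_alias}.trip_h3), TRUE) AND {match.group(0)})"
        )
    return EINTERSECTS.sub(rewrite, sql)

rewritten = inject_prefilter("WHERE eIntersects(t.trip, p.geom)")
```

A regex is enough here only because the portable queries use a fixed call shape; a rewrite over arbitrary SQL would need a parser.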

## Dependencies (per ecosystem policy feedback_issued_pr_treat_as_landed.md)

- MobilityDB PRs #807, #866, #893: th3index type implementation.  This
  branch targets the API surface from those PRs before they merge.
- JMEOS regen (parallel session's feat/regen-against-meos-1.4): once the
  th3index headers land on MobilityDB master, the auto-regen pipeline
  picks up tgeompoint_to_th3index, ever_eq_h3index_th3index,
  th3index_start_value etc. and this branch becomes buildable.

Until those land this branch is review-ready but cannot be CI-built.
The non-th3index changes (bench_mspark.sh broadcast tuning) are
buildable and measurable today.

## Expected payoff

Q4 (1620 trips × 100 query points): cell-membership test rejects ~99%
of pairs before the per-row eIntersects.  Estimated 10-50× speedup on
Q4 alone, scaling similarly to Q11/Q12/Q14.  Closes the structural gap
to DuckDB.  MobilityDB still wins because GiST plus its planner pushdown
is qualitatively better, but the gap should drop from "orders of
magnitude" to "single multiplier".

Refs project_mobilityspark_perf_session_2026_05_10.md,
     project_mobilityspark_th3index_port_plan.md.
…/Q10)

Adds everEqTh3IndexTh3Index UDF and three preprocessForSpark rules covering
the BerlinMOD trip×trip Cartesian-shape queries:

  Q5  nearestApproachDistance(t1.trip, t2.trip)  → CASE-wrapped
  Q6  eDwithin(t1.trip, t2.trip, dist)           → CASE-wrapped
  Q10 tDwithin(t1.trip, t2.trip, dist)           → CASE-wrapped

The wrapper uses Catalyst's CASE WHEN short-circuit — the cheap
everEqTh3IndexTh3Index runs first; only pairs whose th3index sequences
overlap in some H3 cell at a common instant trigger the expensive
temporal predicate.  The Q5 / Q10 IS NOT NULL / whenTrue clauses already
filter NULL results, so the wrapping is correctness-preserving.

## Resolution / soundness trade-off

At the default H3 resolution 7 (cell edge ≈ 1.2 km), the prefilter is
effectively sound for the BerlinMOD queries: the distance thresholds
(3 m / 10 m) are far below the cell edge, so almost every pair within
the threshold shares a cell at a common instant.  The residual risk is
at cell boundaries: pairs whose true minimum distance is small but whose
paths never share a cell at a common instant are excluded.  A coarser
resolution such as -Dberlinmod.bench.th3index.resolution=5 (cell edge
≈ 9 km) shrinks that risk, at the cost of selectivity.
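
The boundary case can be made concrete with a toy model. The sketch below uses a square grid as a stand-in for H3's hexagonal cells (an assumption made purely to keep the example self-contained; it does not call the H3 library), and all names are hypothetical.

```python
import math

def square_cell(x: float, y: float, edge_m: float) -> tuple[int, int]:
    """Toy cell id: index of the square grid cell containing (x, y)."""
    return (math.floor(x / edge_m), math.floor(y / edge_m))

FINE_EDGE = 1200.0     # roughly the resolution-7 edge length, in metres
COARSE_EDGE = 9000.0   # roughly the resolution-5 edge length, in metres

# Two positions at the same instant, 1 m apart, straddling a cell boundary.
a = (1199.5, 600.0)
b = (1200.5, 600.0)

distance = math.dist(a, b)
shares_fine_cell = square_cell(*a, FINE_EDGE) == square_cell(*b, FINE_EDGE)
shares_coarse_cell = square_cell(*a, COARSE_EDGE) == square_cell(*b, COARSE_EDGE)
```

Here the pair is well within a 3 m threshold yet falls in different fine cells, so a cell-equality prefilter would drop it. The coarse grid happens to keep this pair, but the same situation recurs at the coarse grid's own boundaries.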

## Coverage

| Query | Pattern | Prefilter applied |
|-------|---------|-------------------|
| Q2  | eIntersects(trip, region polygon) | yes (degrades to no-op via COALESCE) |
| Q4  | eIntersects(trip, point geom)     | yes (sound at any resolution) |
| Q5  | nearestApproachDistance(t1, t2)   | yes (this commit) |
| Q6  | eDwithin(t1, t2, 10.0)            | yes (this commit) |
| Q10 | tDwithin(t1, t2, 3.0)             | yes (this commit) |
| Q11 | trip && stbox(point, instant) AND valueAtTimestamp(...)=geom | bbox only (existing); th3index defer |
| Q12 | same shape as Q11                  | bbox only (existing); th3index defer |
| Q14 | trip && stbox(region, instant) AND ST_Contains(region, valueAtTimestamp(...)) | bbox only |

Q11/Q12/Q14 use a different pattern (= valueAtTimestamp) — adding a
th3index prefilter for those needs a richer rewrite that handles the
instant-correlated equality.  Defer until upstream exposes a scalar
valueAtTimestamp→H3 cell helper, or until polygon→cells (h3_polygon_to_cells)
becomes public.

Refs project_mobilityspark_th3index_port_plan.md.
…T/SP-GiST

Moves the th3index spatial prefilter from MobilitySpark-specific
preprocessing into the canonical portable BerlinMOD SQL, so all three
benchmarked platforms (MobilityDB, MobilityDuck, MobilitySpark) execute
identical SQL.  Adds GiST + SP-GiST indexes on the PostgreSQL side
(load_mbdb.sql) — the native PG analog of the columnar prefilter.

## Portable SQL changes (berlinmod/*.sql)

- q04.sql: eIntersects(t.trip, p.geom) gains a COALESCE(everEqH3IndexTh3Index
  (geomToH3Cell(p.geom, 7), t.trip_h3), TRUE) prefilter.
- q05.sql: WHERE clause gains everEqTh3IndexTh3Index(t1.trip_h3, t2.trip_h3)
  as the leading conjunct, in front of the existing
  nearestApproachDistance(...) IS NOT NULL guard.
- q06.sql: WHERE clause gains the same everEqTh3IndexTh3Index conjunct.
- q10.sql: WITH clause WHERE gains everEqTh3IndexTh3Index alongside the
  existing t2.trip && expandSpace(t1.trip, 3) bbox prefilter.

## Loader changes

- load_mbdb.sql:
  - Trips table gains a trip_h3 th3index column.
  - INSERT computes trip_h3 from the loaded tgeompoint at H3 resolution 7.
  - Two new indexes: the GiST index trips_trip_h3_gist_idx (the th3index
    bbox-overlap accelerator for the cell-membership predicate) and the
    SP-GiST index trips_trip_spgist_idx (kd-tree complement on trip).
  - The CSV scan reads only the first 3 columns, so it works with both the
    legacy 3-column trips.csv and the post-PR-A 4-column trips.csv.
- load_mduck.sql:
  - Trips gains trip_h3 (recomputed from the loaded tgeompoint).
  - Same explicit-column read_csv to ignore any 4th CSV column.

## BerlinMODBench changes

- preprocessForSpark drops the th3index injection rules (now redundant).
- The load-time tgeompointToTh3Index materialisation drops any pre-existing
  trip_h3 column in Trips before recomputing, so the resolution is
  consistent regardless of the CSV provenance.

## Dependencies

- MobilityDB PRs #807 / #866 / #893 (th3index) — open
- MobilityDB-BerlinMOD #24 (export trip_h3) — open (pairs with this PR)
- JMEOS regen against MEOS-with-th3index — parallel session
- MobilityDuck th3index port — parallel session, per
  project_mobilityduck_parity_scope.md

Per ecosystem policy feedback_issued_pr_treat_as_landed.md, this PR
proceeds with downstream work without waiting on upstream merge.

Refs project_berlinmod_th3index_unification.md.
Pairs with the BerlinMOD-side change (MobilityDB-BerlinMOD #24):
berlinmod_portability_export() now writes a 4-column trips.csv with
trip_h3.  setup/generate_data.sh in MobilitySpark previously overrode
the trips.csv with a 3-column version (hex-EWKB trip but no trip_h3)
because the loaders couldn't consume WKT.  Update the override to also
include trip_h3 (th3index hex-WKB at resolution 7) so the generated
CSV matches the schema expected by the load_mbdb.sql / load_mduck.sql
loaders.

Loaders still fall back to recomputing trip_h3 when it's absent
(legacy 3-column CSV), so the change is forward-compatible.
Expand Th3IndexUDFs from the 10-UDF BerlinMOD-relevant subset to the
full public h3 API surface — every extern function declared in

  meos/include/meos_h3.h         — th3index temporal type   (66 fns)
  meos/include/h3/h3index.h      — static H3Index scalar    (10 fns)
  meos/include/h3/h3index_sets.h — h3indexset (Set of cells) (9 fns)
  + 1 composed helper (geomToH3Cell, single POINT → H3Index)

now has a registered MobilitySpark UDF.  Audited via diff between
meos_h3.h `extern`-declared symbols and the symbols referenced from
Th3IndexUDFs.java — the diff is empty.

## Section-by-section coverage

- Static h3index ops (parse/format/compare/hash) — 12 UDFs
- h3indexset ops (gridDisk / gridRing / pathCells / cellToChildren /
  compactCells / uncompactCells / originToDirectedEdges /
  cellToVertexes / getIcosahedronFaces) — 9 UDFs
- th3index I/O + constructors (text inputs + scalar/array makers) — 6 UDFs
- Accessors (start/end/value_n/values/value_at_timestamp) — 5 UDFs
- MEOS-level conversions to/from tbigint — 2 UDFs
- Ever/always cell-side comparisons — 8 UDFs
- Ever/always trip×trip comparisons — 4 UDFs
- Temporal teq/tne (3 directions × 2 ops) — 6 UDFs
- Inspection (resolution / base cell / valid cell / class III / pentagon) — 5 UDFs
- Hierarchy (parent / center child / child pos × variants) — 6 UDFs
- Lat/Lng conversion (tgeo↔th3index, cell_to_boundary, geomToH3Cell) — 6 UDFs
- Directed edges (neighbor cells / cells_to_directed_edge / valid /
  origin / destination / boundary) — 6 UDFs
- Vertices (cell_to_vertex / vertex_to_latlng / valid_vertex) — 3 UDFs
- Grid traversal (grid_distance / cell_to_local_ij / local_ij_to_cell) — 3 UDFs
- Metrics (cell_area / edge_length / great_circle_distance) — 3 UDFs

Total: 86 registered UDFs.

## Implementation notes

- Common patterns extracted to private helpers: parseTs (Spark Timestamp /
  String → MEOS OffsetDateTime), tempHex / setHex (Pointer → hex-WKB +
  free), evCmp (8 ever/always cell-side variants), ttCmp (4 ever/always
  trip×trip variants), tempUnary (16 unary Temporal* → Temporal* ops).
  Keeps individual UDF bodies to ~3-5 lines each despite the volume.
- Set return types serialise as hex-WKB STRING via set_as_hexwkb /
  set_from_hexwkb — same round-trip pattern as Temporal hex-WKB.
- Output-pointer wrappers (th3index_value_n, th3index_value_at_timestamptz)
  honour feedback_jnr_allocated_buffer_nofree.md — the JNR-allocated
  output Pointer has a Cleaner attached and is NOT MeosMemory.free'd.
- Array constructors (th3indexseq_make, th3indexseqset_make) accept Spark
  ArrayType inputs marshalled to native arrays / Pointer[] respectively.
- th3index_values returns long[] mapped to Spark ArrayType(LongType).

## Why 100%

Per ecosystem policy (project_jmeos14_multiplatform.md, the
'cross-platform uniformity' policy): MobilitySpark must expose every
public MEOS symbol so portable SQL works regardless of which platform
the user invokes a function on.  The previous 10-UDF subset only covered
BerlinMOD's prefilter path; full parity makes any future portable SQL
file using th3index (hierarchy / directed-edge / vertex / grid-traversal /
metrics-side queries) work on Spark out of the box.

## Dependencies — same chain as the rest of PR MobilityDB#9

- MobilityDB PRs #807 / #866 / #893 — th3index in master
- JMEOS regen exposes the 86 referenced symbols
- MobilityDuck th3index port (parallel session)

Source-complete; CI builds will green once the binding catches up.
…eom→cells API

Consumer side of MobilityDB PR #938 (static-geometry → H3 cell set
public API).  Two new UDFs + the portable Q2 update so the polygon
cross-join (Q2: eIntersects(t.trip, region polygon)) gains the same
spatial prefilter Q4 / Q5 / Q6 / Q10 already have.

## What lands

- Th3IndexUDFs.geoToH3IndexSet(geomWkt, resolution) → STRING (hex-WKB
  h3indexset).  Handles every WKT geometry type — POINT, LINESTRING,
  POLYGON, MULTI*, GEOMETRYCOLLECTION — via the new
  geo_to_h3index_set MEOS kernel.

- Th3IndexUDFs.everIntersectsH3IndexSetTh3Index(cellSetHex, th3idx)
  → BOOLEAN.  Returns TRUE iff the trip's th3index sequence ever
  lies in any cell of the candidate set.  Wraps the new
  ever_eq_anyof_h3indexset_th3index predicate.

- berlinmod/q02.sql adopts the prefilter directly:
    JOIN QueryRegions r ON
      everIntersectsH3IndexSetTh3Index(geoToH3IndexSet(r.geom, 7),
                                        t.trip_h3)
      AND eIntersects(t.trip, r.geom)

  Every backend executes the same expression.  Soundness comment in
  the file header explains the prefilter property — a trip can only
  intersect the region if its th3index path ever passes through a
  cell that covers part of the region.

## Coverage matrix update

  Q2  eIntersects(trip, region polygon)    — yes (this commit)
  Q4  eIntersects(trip, point geom)        — yes (existing)
  Q5  nearestApproachDistance(trip, trip)  — yes (existing)
  Q6  eDwithin(trip, trip, 10.0)           — yes (existing)
  Q10 tDwithin(trip, trip, 3.0)            — yes (existing)

All five BerlinMOD cross-join queries now have the cross-platform
th3index prefilter applied directly in the portable SQL.

## Dependencies — feedback_issued_pr_treat_as_landed.md

Stacks on **MobilityDB PR #938** (static-geom→cells public API),
which itself stacks on the th3index branch (#807 / #866 / #893).
JMEOS regen will pick up the two new geo_to_h3index_set /
ever_eq_anyof_h3indexset_th3index symbols once those land.
Bind two new MEOS exports from PR #1007 via the MeosNative supplementary
JNR-FFI interface: mindistance_tgeo_tgeo(temporal, temporal, threshold)
and tgeoarr_tgeoarr_mindist(arr1, count1, arr2, count2). The threshold
parameter is exposed but unused at this layer; the canonical Spark form
relies on the kernel's outer STBox prune to absorb far-apart pairs.

Add two scalar UDFs in DistanceUDFs:
  minDistanceTgeoGeo(trip, geomWkt)   reuses the NAD kernel since NAD
                                       reduces to spatial-min when one
                                       argument has no time dimension
  minDistanceTgeoTgeo(trip1, trip2)   calls mindistance_tgeo_tgeo

Both are also registered under the bare name minDistance with the
(tgeo, tgeo) overload as the default, matching the MobilityDB SQL
surface used by the canonical BerlinMOD Q5.

Update berlinmod/q05.sql to use minDistance instead of
nearestApproachDistance. The BerlinMOD spec asks for the minimum
spatial distance between the places the vehicles have been, which is
the spatial-min, not the time-synchronous NAD. Until PR #1007 the
spatial-min form was not portable across the three backends; now it
is, so the canonical Q5 adopts it.
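
The difference between the two distances can be shown on sampled positions. This is a simplified sketch that compares only the sampled instants and ignores interpolation between them; it does not call the MEOS kernels, the function names are hypothetical, and it assumes the two trips share at least one instant.

```python
import math

Trip = dict[int, tuple[float, float]]   # instant -> (x, y) position

def nad_sampled(trip1: Trip, trip2: Trip) -> float:
    """Time-synchronous minimum: compare positions at common instants only."""
    common = trip1.keys() & trip2.keys()
    return min(math.dist(trip1[t], trip2[t]) for t in common)

def min_distance_sampled(trip1: Trip, trip2: Trip) -> float:
    """Spatial minimum: compare every visited place, regardless of time."""
    return min(math.dist(p, q) for p in trip1.values() for q in trip2.values())

# Vehicle 2 is at (10, 0) at instant 0; vehicle 1 reaches it at instant 1.
trip1 = {0: (0.0, 0.0), 1: (10.0, 0.0)}
trip2 = {0: (10.0, 0.0), 1: (20.0, 0.0)}

nad = nad_sampled(trip1, trip2)
spatial_min = min_distance_sampled(trip1, trip2)
```

The vehicles are never closer than 10 units at the same instant, yet both visit (10, 0), so the spatial minimum is 0. That is the quantity the BerlinMOD Q5 wording asks for.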
close() runs before spark.stop() in the standard try-with-resources
benchmark/usage pattern, so meos_finalize() tears down MEOS global and
per-thread TLS state while Spark executor threads are still alive; their
subsequent teardown then double-frees the already-finalized MEOS TLS,
aborting the JVM with double free or corruption (fasttop) during
shutdown. The OS reclaims native MEOS memory at JVM exit, so the
explicit finalize is unnecessary and unsafe in the Spark and surefire
lifecycles; it belongs only in a standalone main that owns the whole
JVM with no live MEOS-using threads at exit.
expandSpace and geoTimeStbox serialised the STBox with
stbox_as_hexwkb(box, (byte) 0, ...). WKB variant 0 omits the SRID, so
bboxOverlaps re-parsing it via stbox_from_hexwkb gets SRID 0;
overlaps_tspatial_stbox then compares an SRID-3812 trip against an
SRID-0 box, returns false for every pair, and Q10's
WHERE ... AND bboxOverlaps(t2.trip, expandSpace(t1.trip, 3)) silently
drops all matches (0 rows instead of the expected count). Serialise
with WKB_EXTENDED (0x04) so the SRID round-trips; Q10 then returns the
correct rows, matching MobilityDB's native && operator.
CI vendors $GITHUB_WORKSPACE/lib/libmeos.so for the unit tests
(.github/workflows/maven.yml + pom surefire -Djava.library.path).
The committed binary was a stale MEOS build predating the
ensure_linear_interp guard in tnumber_trend, so tnumber_trend on a
step-interpolated tint returned a computed trend instead of NULL,
deterministically failing MathUDFsExtTest.tnumberTrend_tint_step_returns_null
(expected null, got a tfloat hex-WKB). The test and the
AnalyticsUDFs.tnumberTrend wrapper are correct against current MEOS:
verified that the current libmeos returns NULL for that exact input
while the stale one returns non-null. Replace lib/libmeos.so with a
current MEOS 1.4 build that carries the guard.
…lityDB

State present coverage only (858/858 active addressable temporal+geo, 100%)
with the scope partition and deferred families shared with MobilityDuck;
drop dated-milestone and changelog narrative. parity-status.md regenerated
from scripts/parity-audit.py against current MobilityDB master.
…C #920)

Register all 29 portable bare-name operator UDFs from the single source
of truth — MobilityDB/MEOS-API meta/portable-aliases.json (RFC #920,
discussion MobilityDB#861, native in MobilityDB#1075) — vendored
read-only at meta/portable-aliases.json.

PortableOperatorAliasUDFs reuses each operator's own existing backing
field verbatim (equivalence by construction) at the MEOS superclass
entrypoint (*_temporal_temporal, *_tspatial_tspatial,
t*_temporal_temporal, tdistance_tgeo_tgeo, nad_tgeo_*), so all six type
families — temporal, geo, cbuffer, npoint, pose, rgeo — are covered by
construction; left/right/overleft/overright select between the existing
tnumber and tspatial backings. The type-qualified spellings they
supersede 1:1 (temporalBefore, tnumberLeft, teqTemporal, ...) are
dropped: the bare name is the portable contract.

scripts/portable_parity.py gates the dialect with the same prefix logic
as MobilityDB/MEOS-API portable_parity.py: 29/29 backed, 0 unbacked,
all six families; wired into Maven CI.

parity-audit.py no longer defers cbuffer/npoint/pose/rgeo
(DEFERRED_FAMILIES empty by invariant); docs/parity-status.md and
docs/parity-100.md regenerated/reframed so the four families are in
scope and never excluded from any headline (1353/1462 = 92.5% active
addressable; the remaining typed per-family surface is tracked as a gap
with a plan, not a stopping point).
…amilies

JMEOS 1.4 ships no cbuffer/npoint/pose/rgeo symbols, so the four sibling
temporal families had no typed UDF surface. Bind them via the sanctioned
MeosNative raw-FFI interface against the real MEOS C API (lowercase
meos.h symbols, every one verified exported by lib/libmeos.so with
nm -D), reusing each function's own symbol — no reimplementation.

- CbufferUDFs / NpointUDFs / PoseUDFs / RgeoUDFs: constructors,
  accessors, I/O (hex-WKB/WKB/EWKT/text/MFJSON), cbuffer spatial
  relationships, tnpoint route-id ops, trgeometry→tpose, Seq/SeqSet/
  SeqSetGaps — reusing generic functions.* helpers where they already
  exist, MeosNative for the family-specific symbols. Wired into
  MobilitySparkSession before the portable dialect.
- parity-audit.py: the five GiST/GIN opclass-support index sections
  (tcbuffer/tnpoint/tpose/trgeometry indexes + tnpoint_gin) classified
  OUT_OF_SCOPE — the same PG-only index-access-method class already
  excluded for temporal/geo *_gist/*_spgist; uniform, labelled, not a
  family exclusion. DEFERRED_FAMILIES stays empty.

Active addressable parity 92.5% -> 99.6% (1571/1577), all six families
counted in the headline. The only remainder is rgeo/133_trgeo_vclip
(6 functions): the v_clip user surface lives in the mobilitydb PG
extension, not libmeos.so — lib/libmeos.so exports only low-level
LWPOLY/Pose kernels with no GSERIALIZED/Temporal entry point reachable
from the hex-WKB convention. Recorded as a documented MEOS-library ABI
gap with a concrete upstream fix (export v_clip_trgeo_geo /
v_clip_trgeo_trgeo GSERIALIZED/Temporal wrappers), intentionally not
stubbed. Portable bare-name gate unchanged: 29/29, 0 unbacked.
Adds two consolidated improvements to the 3-platform BerlinMOD benchmark:

(1) Three-tier index framework
==============================

Each query can now be benchmarked under three configurations, isolating
the contribution of different acceleration mechanisms:

| Tier | What it measures | Platforms |
|---|---|---|
| 1 — baseline    | th3index columnar prefilter only        | PG, Duck, Spark |
| 2 — native-only | Engine-native spatial index in isolation | PG, Duck        |
| 3 — combined    | Best-of-platform, production-realistic   | PG, Duck        |

Spark is excluded from Tiers 2 / 3: Spark SQL has no native spatial index,
so forcing it to compete without th3index would measure lack-of-feature.
Tier 1 stays the only honest Spark measurement.

`bench_mbdb.sh` and `bench_mduck.sh` gain a `--tier {1,2,3}` flag.  The
runner toggles indexes after loading and writes the tier into the
results JSON so report.py can pivot by configuration.

For MobilityDuck, Tiers 2/3 use the new `TRTREE` multi-entry index
delivered by MobilityDuck PRs #143 (multi-entry index) + #144
(constant-geometry pushdown).  Run the bundle with the user's
MobilityDuck install pinned to those PR branches (or use the
preview/100-percent release).

(2) NxN mitigations on Spark (Q5, Q6, Q10, Q16)
==============================================

The four Trips × Trips queries don't fit Spark's no-spatial-index model
naturally — the row-by-row `everEqTh3IndexTh3Index(...)` prefilter
inside a non-equi join becomes O(N²) on the full Trips × Trips space.
Two complementary mitigations are wired:

  (a) Spark `/*+ BROADCAST(...) */` hints in the portable q05/q06/q10/q16
      SQL.  Spark recognises the hint and pins the listed small/filtered
      tables to every executor; PostgreSQL and DuckDB treat the
      `/*+ ... */` block as an ordinary comment, so the SQL stays
      byte-identical and portable.

  (b) Spark-optimised query variants `q05_spark.sql`, `q06_spark.sql`,
      `q10_spark.sql`, `q16_spark.sql`.  These explode each trip's
      th3index into one row per H3 cell via
      `explode(th3IndexValues(trip_h3))` and run the spatial prefilter
      as an **equi-join on the cell column** — which Spark accelerates
      natively (sort-merge / shuffle-hash).  The expensive
      `tDwithin` / `minDistance` / `eDwithin` then runs on the much
      smaller deduplicated candidate set.
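
The explode-then-equi-join idea in (b) can be sketched without Spark. The Python below is an illustrative model, not the contents of the `_spark.sql` files: it treats each trip's th3index as a plain set of cell ids (so the time dimension is ignored) and `candidate_pairs` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(trip_cells: dict[str, set[int]]) -> set[tuple[str, str]]:
    """Trip pairs sharing at least one cell, found via a join on the cell id."""
    by_cell: dict[int, set[str]] = defaultdict(set)
    for trip_id, cells in trip_cells.items():
        for cell in cells:                    # "explode": one row per cell
            by_cell[cell].add(trip_id)
    pairs: set[tuple[str, str]] = set()
    for trips in by_cell.values():            # equi-join on the cell column
        pairs.update(combinations(sorted(trips), 2))
    return pairs                              # the set removes duplicate pairs

pairs = candidate_pairs({
    "a": {1, 2},
    "b": {2, 3},
    "c": {9},
    "d": {2, 3},
})
```

Only pairs in the returned set would go on to the expensive temporal predicate; trip "c" shares no cell with any other trip and is never compared.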

The MobilitySpark runner (`bench_mspark.sh` → `BerlinMODBench.java`)
auto-prefers `<query>_spark.sql` over `<query>.sql` when present and
tags the result line with `[spark]` for clarity.  PG and DuckDB runners
always use the portable `<query>.sql`.
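
The file-selection rule can be sketched as below. This is a hedged Python illustration of the described behaviour, not the Java runner; the function name, the platform strings, and the injectable `exists` check are assumptions made so the example is self-contained.

```python
from pathlib import Path
from typing import Callable

def resolve_query_file(
    query_dir: Path,
    query: str,
    platform: str,
    exists: Callable[[Path], bool] = Path.is_file,
) -> tuple[Path, str]:
    """Pick the SQL file for a query and the tag for its result line."""
    portable = query_dir / f"{query}.sql"
    variant = query_dir / f"{query}_spark.sql"
    # Only the Spark runner prefers the variant, and only when it exists.
    if platform == "spark" and exists(variant):
        return variant, "[spark]"
    return portable, ""
```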

Variants are semantically equivalent to their portable counterparts —
same input, same output, same scientific question — just expressed in
the form Spark's engine can accelerate.

At default scale (~10K trips, ~50 cells per trip):
  * portable q10.sql on Spark : ~10⁸ candidate pairs row-by-row
  * q10_spark.sql via explode : ~5×10⁵ candidate pairs equi-joined
                                (200× reduction before tDwithin)

Files
=====

* `berlinmod/README.md` — adds "Three-tier index framework" and "NxN
  mitigations on Spark" sections + per-query tier-applicability table
* `berlinmod/bench/bench_mbdb.sh` + `bench_mduck.sh` — `--tier {1,2,3}`
  flag + tier-specific index DROP/CREATE
* `berlinmod/bench/bench_mspark.sh` — doc-only: explicit Tier 1, points
  at the Spark variants
* `berlinmod/q05.sql` / `q06.sql` / `q10.sql` / `q16.sql` — add
  `/*+ BROADCAST(...) */` hints (no-op comment on PG/Duck)
* `berlinmod/q05_spark.sql` / `q06_spark.sql` / `q10_spark.sql` /
  `q16_spark.sql` — new Spark-optimised variants
* `src/main/java/.../BerlinMODBench.java` — prefer `_spark.sql` when
  present + emit `tier: 1` in results JSON

Refs: MobilityDuck#143, MobilityDuck#144 (TRTREE multi-entry + pushdown).
…ayer)

Introduces `org.mobilitydb.spark.util.TimeUtil` as the single named
boundary-conversion layer between Spark / Java native timestamp types
(java.sql.Timestamp, java.time.OffsetDateTime, java.time.Instant) and
MEOS canonical `TimestampTz` (microseconds since 2000-01-01 UTC).

Per the ecosystem's closed-algebra-boundary rule: every binding owns
ONE thin layer that converts platform-native types to/from MEOS
canonical, and every call-site goes through that layer.  Inlining
the magic constant 946684800 (seconds between Unix epoch and PG
epoch) at individual call-sites scatters the conversion logic and
risks drift if one site is updated and another is missed.

Before this PR: 13 separate files contained inline `946684800L * 1000L`
(or `946684800L * 1000000L`) expressions, with three of them
re-declaring private static finals like `PG_UNIX_OFFSET_MS`
locally — making the magic number ambient across the codebase.

After this PR: every callsite uses `TimeUtil.PG_UNIX_EPOCH_OFFSET_{S,MS,US}`
or the named helpers `TimeUtil.toMeosTimestamp(...)` /
`TimeUtil.fromMeosTimestamp(...)`.  One place defines the constant,
every site references it.
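
A minimal sketch of what such a boundary layer can look like. The constant names and the two helper names come from the description above; the `Instant`-based signatures and the class name `TimeUtilSketch` are assumptions, and the real `TimeUtil` may accept other types such as `java.sql.Timestamp`.

```java
import java.time.Instant;

/** Hypothetical stand-in for TimeUtil; the signatures are assumptions. */
final class TimeUtilSketch {

    /** Seconds from the Unix epoch (1970-01-01) to the PG epoch (2000-01-01), both UTC. */
    static final long PG_UNIX_EPOCH_OFFSET_S = 946684800L;
    static final long PG_UNIX_EPOCH_OFFSET_MS = PG_UNIX_EPOCH_OFFSET_S * 1_000L;
    static final long PG_UNIX_EPOCH_OFFSET_US = PG_UNIX_EPOCH_OFFSET_S * 1_000_000L;

    /** Instant -> microseconds since 2000-01-01 UTC (sub-microsecond part truncated). */
    static long toMeosTimestamp(Instant t) {
        long unixMicros = Math.addExact(
                Math.multiplyExact(t.getEpochSecond(), 1_000_000L),
                t.getNano() / 1_000L);
        return unixMicros - PG_UNIX_EPOCH_OFFSET_US;
    }

    /** Microseconds since 2000-01-01 UTC -> Instant. */
    static Instant fromMeosTimestamp(long meosMicros) {
        long unixMicros = meosMicros + PG_UNIX_EPOCH_OFFSET_US;
        // floorDiv/floorMod keep the nanosecond part non-negative before 1970.
        return Instant.ofEpochSecond(
                Math.floorDiv(unixMicros, 1_000_000L),
                Math.floorMod(unixMicros, 1_000_000L) * 1_000L);
    }

    public static void main(String[] args) {
        Instant pgEpoch = Instant.parse("2000-01-01T00:00:00Z");
        System.out.println(toMeosTimestamp(pgEpoch));   // 0
        System.out.println(fromMeosTimestamp(0L));      // 2000-01-01T00:00:00Z
    }
}
```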

Files touched:

* src/main/java/org/mobilitydb/spark/util/TimeUtil.java  (new, 85 lines)
* 13 callsite files in spark/cbuffer/, spark/geo/, spark/npoint/,
  spark/pose/, spark/temporal/ — mechanical substitution of the
  inlined magic numbers / per-file constants with TimeUtil refs;
  added `import org.mobilitydb.spark.util.TimeUtil;` where used.

Companion: `MobilityDuck/src/include/time_util.hpp` plays the same
role on the DuckDB side; `GoMEOS/datetime_utils.go` on the Go side;
PyMEOS-CFFI / MEOS.NET / JMEOS route through `pg_timestamptz_in`
which is functionally equivalent.  This brings MobilitySpark into
alignment with the rest of the ecosystem.

No behaviour change — every substitution is value-preserving and
type-preserving (long → long; same numeric value of 946684800000).

Note on this branch's pre-existing build state: the `preview/100-percent`
base contains a small set of pre-existing JMEOS API-drift errors
(`temporal_lower_inc` / `tpoint_transform_pipeline` etc.) introduced
by recent JMEOS signature uniformization that haven't been propagated
to MobilitySpark yet.  Those errors are independent of this refactor
— they exist in files unaffected by this PR's diff.  Confirm by
running `git diff main` on any error-emitting file; this PR's diff
is purely the symbol substitution above.
…ure-tracked

Adds doc/spark-version.md as the canonical position statement on
Spark-version targeting:

- MobilitySpark targets Apache Spark 3.5.x (LTS through 2026).
- Spark 4.0 (May 2025 GA) is early-adoption, not production-default.
- The MEOS-C-library-version axis is orthogonal and handled inside
  JMEOS (one JMEOS jar per MEOS version; MobilitySpark pins one
  JMEOS jar).
- Five trigger signals enumerated for when to commit to Spark 4
  support; commit when two or more fire.
- The future Spark 4 effort, if/when needed, will mirror
  MobilityDuck's multi-DuckDB-version foundation (Maven profile per
  Spark version, CI matrix, per-version jar artefacts).

README updated to bump the Spark requirement from the stale 3.4.0
line to 3.5.x with a pointer to the new doc.

The position is asymmetric to MobilityDuck's multi-DuckDB-version
work: DuckDB v1.4/v1.5 are both production with split adopter base
today, whereas Spark 3.5 is the production line and 4.0 is the
early-adoption line.
… integration/berlinmod-bench

# Conflicts:
#	README.md
#	berlinmod/bench/bench_mbdb.sh
#	berlinmod/bench/bench_mduck.sh
#	berlinmod/bench/bench_mspark.sh
#	docs/parity-100.md
#	docs/parity-status.md
#	libs/JMEOS-1.4.jar
#	scripts/parity-audit.py
#	src/main/java/org/mobilitydb/spark/MobilitySparkSession.java
#	src/main/java/org/mobilitydb/spark/demo/BerlinMODUDFs.java
#	src/main/java/org/mobilitydb/spark/geo/AlwaysSpatialRelsUDFs.java
#	src/main/java/org/mobilitydb/spark/geo/DistanceUDFs.java
#	src/main/java/org/mobilitydb/spark/geo/GeoAffineUDFs.java
#	src/main/java/org/mobilitydb/spark/geo/GeoAnalyticsUDFs.java
#	src/main/java/org/mobilitydb/spark/geo/STBoxUDFs.java
#	src/main/java/org/mobilitydb/spark/geo/TPointSTBoxOpsUDFs.java
#	src/main/java/org/mobilitydb/spark/temporal/AccessorAliasUDFs.java
#	src/main/java/org/mobilitydb/spark/temporal/AggregateUDAFs.java
#	src/main/java/org/mobilitydb/spark/temporal/BucketUDFs.java
#	src/main/java/org/mobilitydb/spark/temporal/IOAliasUDFs.java
#	src/main/java/org/mobilitydb/spark/temporal/MoreAccessorUDFs.java
#	src/main/java/org/mobilitydb/spark/temporal/RestrictionUDFs.java
#	src/main/java/org/mobilitydb/spark/temporal/SimilarityUDFs.java
#	src/main/java/org/mobilitydb/spark/temporal/SpanAccessorUDFs.java
#	src/main/java/org/mobilitydb/spark/temporal/SpanAlgebraUDFs.java
#	src/main/java/org/mobilitydb/spark/temporal/SubtypeConstructorUDFs.java
#	src/main/java/org/mobilitydb/spark/temporal/TBoxUDFs.java
#	src/main/java/org/mobilitydb/spark/temporal/TTextUDFs.java
#	src/main/java/org/mobilitydb/spark/temporal/TemporalBoxOpsUDFs.java
#	src/main/java/org/mobilitydb/spark/temporal/TemporalCompUDFs.java
#	src/main/java/org/mobilitydb/spark/temporal/TileUDFs.java
#	src/main/java/org/mobilitydb/spark/temporal/TransformUDFs.java
#	src/test/java/org/mobilitydb/spark/temporal/AggregateUDAFsTest.java
#	src/test/java/org/mobilitydb/spark/temporal/MathUDFsExtTest.java
#	src/test/java/org/mobilitydb/spark/temporal/TBoxOpsUDFsTest.java
expandSpace and geoTimeStbox serialised the STBox with
stbox_as_hexwkb(box, (byte) 0, ...). WKB variant 0 omits the SRID, so
bboxOverlaps re-parsing it via stbox_from_hexwkb gets SRID 0;
overlaps_tspatial_stbox then compares an SRID-3812 trip against an
SRID-0 box, returns false for every pair, and Q10's
WHERE ... AND bboxOverlaps(t2.trip, expandSpace(t1.trip, 3)) silently
drops all matches (0 rows instead of the expected count). Serialise
with WKB_EXTENDED (0x04) so the SRID round-trips; Q10 then returns the
correct rows, matching MobilityDB's native && operator.
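
The failure mode can be reproduced with a toy model. Nothing below is MEOS or JMEOS code: `Box`, the text encoding, and `WKB_PLAIN` are invented for illustration, and only the `0x04` value of `WKB_EXTENDED` and the "SRID mismatch yields false" behaviour are taken from the description above.

```java
/** Toy model of an SRID lost on serialisation; not the MEOS WKB format. */
final class SridRoundTripSketch {
    static final byte WKB_PLAIN = 0x00;     // hypothetical name for variant 0
    static final byte WKB_EXTENDED = 0x04;  // value as stated above

    /** One-dimensional box with an SRID. */
    static final class Box {
        final double xmin, xmax;
        final int srid;
        Box(double xmin, double xmax, int srid) {
            this.xmin = xmin; this.xmax = xmax; this.srid = srid;
        }
    }

    /** Writes the SRID only when the extended flag is set. */
    static String serialise(Box b, byte variant) {
        String body = b.xmin + "," + b.xmax;
        return (variant & WKB_EXTENDED) != 0 ? "SRID=" + b.srid + ";" + body : body;
    }

    /** Reads the box back; a missing SRID prefix parses as SRID 0. */
    static Box parse(String s) {
        int srid = 0;
        if (s.startsWith("SRID=")) {
            int semi = s.indexOf(';');
            srid = Integer.parseInt(s.substring(5, semi));
            s = s.substring(semi + 1);
        }
        String[] xs = s.split(",");
        return new Box(Double.parseDouble(xs[0]), Double.parseDouble(xs[1]), srid);
    }

    /** Models the behaviour described above: an SRID mismatch yields false. */
    static boolean overlaps(Box a, Box b) {
        return a.srid == b.srid && a.xmin <= b.xmax && b.xmin <= a.xmax;
    }

    public static void main(String[] args) {
        Box trip = new Box(0, 10, 3812);
        Box expanded = new Box(-3, 13, 3812);
        System.out.println(overlaps(trip, parse(serialise(expanded, WKB_PLAIN))));    // false
        System.out.println(overlaps(trip, parse(serialise(expanded, WKB_EXTENDED)))); // true
    }
}
```
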
…ation/berlinmod-bench

# Conflicts:
#	.github/workflows/maven.yml
#	pom.xml
The accumulated integration branch carries PR MobilityDB#11's Th3IndexUDFs full API
(86 UDFs), 8 of which call JMEOS methods (always_eq_th3index_h3index etc.)
that don't exist in either bundled JMEOS jar (1.4 or 1.5).  The JMEOS regen
needs to happen against MEOS-with-th3index (the parallel JMEOS-session task)
before this branch will fully compile.

This note unblocks the benchmark-running parallel session — they can either
swap the JMEOS jar to a th3index-aware regen, or comment out the 8 call
sites (Th3IndexUDFs.java:491-498, 527-531) which disables only the Spark
Tier-1 h3-cell prefilter; Trips × Trips queries Q5/Q6/Q10/Q16 still run via
portable SQL + BROADCAST hints.
Build the accumulated integration branch against a JMEOS-1.4.jar
regenerated from the b183b12 MEOS API (th3index + h3_geo surface) and
reconcile the UDF call sites to the corrected signatures.

MEOS C-type fixes carried by the regenerated jar:
  - H3Index / h3index -> long  (was int)
  - int64             -> long  (was int)
  - bool              -> boolean (incl. subdir-header functions)
  - TimestampTz *out  -> 8-byte buffer (was a 4-byte under-allocation
    that overflowed on every stbox/tbox tmin/tmax and timestampN call)
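
Both classes of width error can be illustrated in plain Java, without any native call. The class `NativeWidthSketch` and its methods are hypothetical. A heap `ByteBuffer` throws on overflow, whereas an under-sized native buffer would instead let the write spill into adjacent memory.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;

/** Illustrates 32-bit vs 64-bit marshalling mistakes using only the JDK. */
final class NativeWidthSketch {

    /** True if an 8-byte value (such as a TimestampTz) fits a buffer of this size. */
    static boolean fitsTimestampTz(int bufferBytes) {
        try {
            ByteBuffer.allocate(bufferBytes).putLong(0L);
            return true;
        } catch (BufferOverflowException e) {
            return false;
        }
    }

    /** What a 64-bit value becomes when it is marshalled through a Java int. */
    static long throughInt(long value) {
        return (int) value; // keeps the low 32 bits, then sign-extends
    }

    public static void main(String[] args) {
        System.out.println(fitsTimestampTz(4)); // false
        System.out.println(fitsTimestampTz(8)); // true
        long cell = 0x8928308280fffffL;         // a 64-bit H3-style cell id
        System.out.println(Long.toHexString(cell));
        System.out.println(Long.toHexString(throughInt(cell)));
    }
}
```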

UDF reconciliation:
  - interpType: pass the int enum (3 = LINEAR) / interpToInt(), not a string
  - h3 prefilter (geo_to_h3index_set, ever_eq_anyof_h3indexset_th3index)
    bound directly via H3IndexJnrBindings, not functions.*
  - acontains_geo_tgeo (was acontains_geo_tpoint); tspatial_transform_pipeline
    (was tpoint_transform_pipeline); acovers_{geo_tgeo,tgeo_geo} now exposed
  - no-exit MEOS error handler via meos_initialize_error_handler with a
    kept-alive callback (MEOS default exit()s the JVM on error)
  - out-param/array marshalling: temporal_value_at_timestamptz (folded
    return), th3index_values, th3indexseq[set]_make, tpointinst_make
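
For the `interpType` item, a name-to-enum mapping could look like the sketch below. Only `3 = LINEAR` is stated above; the other three values follow the MEOS `interpType` enum as I understand it and should be treated as assumptions, as should the `String` parameter of this hypothetical `interpToInt`.

```java
import java.util.Locale;

/** Hypothetical helper; only LINEAR = 3 is confirmed by the notes above. */
final class InterpSketch {

    /** Maps an interpolation name to the integer passed to MEOS. */
    static int interpToInt(String name) {
        switch (name.trim().toLowerCase(Locale.ROOT)) {
            case "none":     return 0; // assumed
            case "discrete": return 1; // assumed
            case "step":     return 2; // assumed
            case "linear":   return 3; // stated above
            default:
                throw new IllegalArgumentException("Unknown interpolation: " + name);
        }
    }

    public static void main(String[] args) {
        System.out.println(interpToInt("Linear")); // 3
    }
}
```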

Test correction: tnumberTrend on a step-interpolated tint correctly
returns null (tnumber_trend requires linear interpolation).

JMEOS jar -> libs/JMEOS-1.4.jar; pom points at it.