diff --git a/docs/beta-testing-edge-to-cloud.md b/docs/beta-testing-edge-to-cloud.md new file mode 100644 index 00000000..36f037a1 --- /dev/null +++ b/docs/beta-testing-edge-to-cloud.md @@ -0,0 +1,218 @@ +# Beta-Testing Guide: Edge-to-Cloud Temporal Data Lake + +This guide has two sections. + +- **[Part 1 — For all beta testers](#part-1--for-all-beta-testers)**: what to + install, what to run, what to check, where to send feedback. +- **[Part 2 — For MobilityDB committers](#part-2--for-mobilitydb-committers)**: + PR / branch / implementation status, known engineering limitations. + +--- + +## Part 1 — For all beta testers + +### What you are testing + +The **edge-to-cloud pipeline** for MobilityDB temporal data: + +1. Raw GPS pings (CSV or inline values) are loaded into DuckDB. +2. They are assembled into typed `tgeogpointSeq` trajectories — geodetic + (spheroidal-metre) sequences backed by MEOS. +3. The trajectories are written to a **TemporalParquet** shard: a standard + Parquet file whose `BYTE_ARRAY` column carries MEOS-WKB values and whose + file footer contains a `temporal` metadata key describing each column's + type, encoding, and CRS. The writer additionally projects O(1) + native-scalar bound columns (`ts_min`/`ts_max` TIMESTAMPTZ, + `lon_min`/`lon_max`/`lat_min`/`lat_max` DOUBLE) from each trajectory's + inline `stbox`. These are additive ordinary columns — the footer still + describes only the WKB column, so the file stays spec-conformant — but + they carry real Parquet min/max statistics, so a time/space predicate + prunes row groups without decoding any MEOS-WKB byte. +4. The same shard is queryable on DuckDB, MobilityDB (PostgreSQL), and Spark — + using an identical named-function SQL dialect. + +### Time budget + +Scenario A (synthetic data, no CSV): **~15 minutes** including the build. +Scenario B (your own GPS CSV): add ~10 minutes. + +### Install + +```bash +git clone --recurse-submodules --branch feat/edge-to-cloud-quickstart \ + https://github.com/MobilityDB/MobilityDuck.git +cd MobilityDuck +make # first build: 5–10 min (downloads MEOS + dependencies) + # subsequent builds: ~30 s +``` + +After the build, a DuckDB shell with MobilityDuck pre-loaded is at +`./build/release/duckdb`. + +> A community extension (one-line `INSTALL`) is coming once this beta +> validates the feature. For now, build from source is required. + +### Scenario A — Zero-data quickstart + +Generates 5 synthetic North Sea vessels from inline data — no CSV, no +download. Demonstrates the full pipeline in under 2 seconds. + +```bash +TZ=UTC ./build/release/duckdb -c ".read examples/quickstart/quickstart.sql" +``` + +**Expected output:** + +Query A — geodetic distance and peak speed per vessel: +``` +┌───────────┬────────────┬──────────┬──────────────┐ +│ entity_id │ ping_count │ length_m │ max_speed_ms │ +├───────────┼────────────┼──────────┼──────────────┤ +│ 5 │ 12 │ 172001.0 │ 26.22 │ +│ 1 │ 12 │ 170169.0 │ 25.93 │ +│ 2 │ 12 │ 158771.0 │ 24.21 │ +│ 3 │ 12 │ 83644.0 │ 12.7 │ +│ 4 │ 12 │ 37155.0 │ 5.64 │ +└───────────┴────────────┴──────────┴──────────────┘ +``` + +Key checkpoints: +- Distances are in **metres**, not degrees (vessel 5 ≈ 172 km — not 1.55°). +- Vessel 3 (Skagerrak) is present here but must **not** appear in Query B. + +Query B — vessels that passed through the Copenhagen bounding box: +``` +┌───────────┐ +│ entity_id │ +├───────────┤ +│ 1 │ +│ 2 │ +│ 4 │ +│ 5 │ +└───────────┘ +``` + +Query C — trip duration (all 5 vessels: 12 pings × 10 min = 1 h 50 min): +``` +┌───────────┬───────────────┐ +│ entity_id │ trip_duration │ +├───────────┼───────────────┤ +│ 1 │ 01:50:00 │ (all five rows identical) +└───────────┴───────────────┘ +``` + +### Scenario B — Same queries on MobilityDB (portability check) + +```bash +psql -d -f examples/quickstart/quickstart_mobilitydb.sql +``` + +Queries A, B, and C must produce **identical values** to Scenario A. +This is the portability claim: one named-function SQL file, two platforms. + +### Scenario C — Your own GPS data + +```bash +# Edit the five CONFIGURE macros at the top of the template: +$EDITOR examples/generic-ingest/generic_ingest.sql +# Set: csv_path, col_entity_id, col_lon, col_lat, col_ts + +# Run: +TZ=UTC ./build/release/duckdb -c ".read examples/generic-ingest/generic_ingest.sql" +``` + +Output: `trajectories.parquet` in the current directory, readable by +MobilityDB, MobilitySpark, and PyMEOS without any MobilityDuck installation. + +### Scenario D — Real-world AIS data (optional, ~1 million pings) + +Download one day of Danish AIS data from the Maritime Authority: + + +Place the downloaded CSV (e.g. `aisdk-2026-02-26.csv`) at any convenient path, +then edit the path at the top of the demo file before running: + +```bash +# Set the CSV path in the demo file (one line to edit): +sed -i "s|../../meos/examples/data/aisdk-2026-02-26.csv|/path/to/your.csv|" \ + examples/ais-data-lake/ais_data_lake.sql + +TZ=UTC ./build/release/duckdb -c ".read examples/ais-data-lake/ais_data_lake.sql" +``` + +The demo filters to Class A vessels and a 1-hour window, so it completes in +under 30 seconds on a laptop even with a full-day file. + +### How to report feedback + +Open an issue or leave a comment on the beta thread: + + +Please include: +- Platform + OS version +- Output of `gcc --version` or `clang --version` +- For build failures: the last 20 lines of `make` output +- For wrong results: the full query + actual vs expected output +- Any ergonomic friction (confusing errors, missing functions, surprising behaviour) + +--- + +## Part 2 — For MobilityDB committers + +### PR and branch + +The feature lands via **MobilityDuck PR #113**: + + +Branch: `feat/edge-to-cloud-quickstart` (1 commit on top of `main`). + +Related RFC threads: +- [Issue #830](https://github.com/MobilityDB/MobilityDB/issues/830) — TemporalParquet spec +- [PR #911](https://github.com/MobilityDB/MobilityDB/pull/911) — TemporalParquet doc PR +- [PR #917](https://github.com/MobilityDB/MobilityDB/pull/917) — edge-to-cloud SQL portability RFC +- [Discussion #861](https://github.com/MobilityDB/MobilityDB/discussions/861) — portable SQL naming +- [Discussion #913](https://github.com/MobilityDB/MobilityDB/discussions/913) — Temporal Data Lake architecture + +### Test suite + +```bash +make test # 1446 assertions across 36 test files, all must pass +``` + +The new file `test/sql/tgeogpoint.test` (16 assertions) is the regression +guard for the SRID + geodetic-flag fix. + +### Implementation status + +| Function / type | Status | +|---|---| +| `TGEOGPOINT` construction + string parse | done | +| `TGEOGPOINT` Parquet round-trip (`asBinary` / `tgeogpointFromBinary`) | done | +| `eIntersects(GEOMETRY, tgeogpoint)` and all 12 `(GEOMETRY, temporal)` predicates | done (geodetic fix: `geom_to_geog()`) | +| `temporalFooter(MAP)` → TemporalParquet JSON | done | +| `asBinary`/`fromBinary` for spans, spansets, tgeometry, tcbuffer, tnpoint, tpose, th3index | **not yet wired** | +| Automatic footer injection on `COPY TO '*.parquet'` | **not yet** — call `KV_METADATA {'temporal': temporalFooter(...)}` explicitly | +| `tIntersects(GEOMETRY, tgeogpoint)` | **not yet** — MEOS roundoff error on geodetic sequences | +| `tDwithin(GEOMETRY, tgeogpoint, dist)` | **not yet** — only planar | + +### Geodetic fix — what changed + +Two bugs in `src/geo/tgeompoint_functions.cpp` (all 12 `(GEOMETRY, temporal)` functions): + +**Bug 1** — SRID was hardcoded 0 before the temporal value was deserialized. +`tgeogpoint` has SRID 4326; the geometry had SRID 0 → "Operation on mixed SRID". +Fix: `srid = tspatial_srid(tgeom)` after deserialization. + +**Bug 2** — `FLAGS_SET_GEODETIC(gs->gflags, 1)` alone corrupts the 2D bbox layout +(`FLAGS_NDIMS_BOX` changes from 2 → 3, shifting geometry data read offset by 16 bytes). +Fix: `geom_to_geog(gs)` (public MEOS API) properly rebuilds the GSERIALIZED with a +3D bounding box and GEODETIC=1, mirroring PostGIS's implicit `geometry → geography` cast. + +### Review checklist (committer) + +- [ ] `make test`: 1446 assertions pass +- [ ] `docs/tgeogpoint-design.md` — "Spatial Predicates" section is accurate +- [ ] `examples/quickstart/quickstart.sql` — readable and self-contained for a new user +- [ ] `examples/generic-ingest/generic_ingest.sql` — instructions clear, macros well-named +- [ ] No `Co-Authored-By` or internal planning references in commit messages +- [ ] Confirm `temporalFooter()` output matches the TemporalParquet spec in PR #911 diff --git a/docs/testing-tz-neutral-policy.md b/docs/testing-tz-neutral-policy.md new file mode 100644 index 00000000..3f44549a --- /dev/null +++ b/docs/testing-tz-neutral-policy.md @@ -0,0 +1,166 @@ +# Timezone testing policy — MEOS ecosystem + +**Applies to:** MobilityDB, MobilityDuck, PyMEOS, JMEOS, meos-rs, and any other binding or tool using MEOS. + +## Problem + +MEOS formats timestamps using the PostgreSQL-derived internal timezone, which is +thread-local and defaults to the system/POSIX timezone. A test that hardcodes the +UTC offset `+00` in its expected value will fail on any machine where the system +timezone differs: + +```sql +-- Written on a UTC machine; breaks on UTC+1, UTC+2, etc. +SELECT tint '[1@2000-01-01]'::TEXT +---- +[1@2000-01-01 00:00:00+00] ← passes only on UTC systems +``` + +`+00` is the worst choice because it looks like "no offset" but is actually a +hardcoded assumption. Any other fixed offset (`+01`, `-08`) at least makes it +obvious that a specific timezone is required. + +## The root fix: pin the test timezone to something non-UTC + +MEOS reads the `TZ` environment variable at first initialisation +(via `select_default_timezone → getenv("TZ")`). Setting it before the test +process starts is the correct, lightweight fix — no code changes, no per-thread +hacks, no effect on production behaviour. + +``` +TZ=Europe/Brussels # UTC+1 winter / UTC+2 summer +``` + +This is directly analogous to PostgreSQL's own practice of using +`America/Los_Angeles` (`PST8PDT`, UTC-8/UTC-7) for its regression tests. +Non-UTC offsets matter because they expose bugs that UTC silently hides +(sign errors, DST boundary logic, off-by-an-hour conversions, etc.). + +## Platform-specific approach + +Different test frameworks have different capabilities; apply the right +tool for each. + +### pg_regress (MobilityDB PostgreSQL) + +pg_regress compares plain-text expected files line by line. There is no +programmatic comparison hook, so hardcoded offsets are **unavoidable**. +The correct approach is: + +1. Set `PGTZ=Europe/Brussels` (or the project's chosen zone) in the regress + environment. +2. Use only winter dates in test fixtures to get a stable `+01` offset — avoid + dates that cross DST boundaries. +3. Expected files contain `+01` consistently; CI sets the same env variable. + +### DuckDB sqllogictest (MobilityDuck) + +Two approaches are available — prefer them in the order listed: + +#### a) Numeric/boolean value accessors (first choice) + +Accessor functions return non-timestamp types; the result never contains an +offset at all: + +```sql +SELECT minValue(tint '[1@2000-01-01, 2@2000-01-02, 3@2000-01-03]') -- 1 +SELECT maxValue(tint '[1@2000-01-01, 2@2000-01-02, 3@2000-01-03]') -- 3 +SELECT startValue(tbool '[t@2000-01-01, f@2000-01-02]') -- true +SELECT endValue(tbool '[t@2000-01-01, f@2000-01-02]') -- false +SELECT round(minValue(tfloat '[1.5@2000-01-01, 3.5@2000-01-02]'), 6) -- 1.5 +SELECT duration(ttext '[hello@2000-01-01, world@2000-01-02]') -- 1 day +``` + +#### b) Nosort cross-validation (DuckDB sqllogictest only) + +Both queries go through MEOS at the same timezone, so even if their output +contains an offset, the two sides are always equal — the comparison is +TZ-neutral: + +```sql +query IT nosort label +SELECT id, tintFromBinary(val)::VARCHAR FROM tbl ORDER BY id + +query IT nosort label +SELECT id, tintFromBinary(asBinary(tintFromBinary(val)))::VARCHAR FROM tbl ORDER BY id +``` + +This pattern is **not portable** to pg_regress, pytest, JUnit, or Spark tests. + +#### c) BLOB byte comparison (for metadata round-trips) + +```sql +SELECT value = expected_json_string::BLOB AS ok +FROM parquet_kv_metadata(file) +WHERE key = 'my_key'::BLOB +``` + +### pytest (PyMEOS) + +Use Python accessor methods that return non-timestamp types: + +```python +assert t.min_value() == 1 +assert t.max_value() == 3 +assert t.start_value() == True +assert t.duration() == timedelta(days=2) +``` + +### JUnit / Kotlin (JMEOS) + +Same principle — compare domain values, not formatted strings: + +```java +assertEquals(1, t.minValue()); +assertEquals(3, t.maxValue()); +assertEquals(Duration.ofDays(2), t.duration()); +``` + +### Spark + +Set `spark.sql.session.timeZone` to the project's chosen zone and use that +offset consistently in expected strings, or extract domain values with the +MEOS UDF equivalents before asserting. + +## What NOT to do + +```sql +-- ✗ Hardcoded UTC offset — fails on non-UTC systems +SELECT tint '[1@2000-01-01]'::TEXT +---- +[1@2000-01-01 00:00:00+00] + +-- ✗ Hardcoded single-TZ offset — fails everywhere else +SELECT tint '[1@2000-01-01]'::TEXT +---- +[1@2000-01-01 00:00:00+01] + +-- ✗ Forcing MEOS timezone in extension/binding code — breaks users +meos_initialize_timezone("UTC"); // per-thread in DuckDB wrapper → bad + +-- ✗ DuckDB-only nosort used in pg_regress or pytest test +``` + +## Using +00 in INPUT (always fine) + +Using `+00` in **input** literals anchors the absolute UTC time regardless of +the display timezone. This is correct and encouraged: + +```sql +-- ✓ Input literal anchors to UTC; only the *display* will shift with TZ +COPY (SELECT asBinary(tgeompoint '[POINT(1 2)@2026-01-01 00:00:00+00]') AS traj) +TO 'file.parquet' (FORMAT PARQUET) +``` + +## Migration status (2026-05-07) + +| File / test suite | Framework | Status | +|---|---|---| +| `test/sql/parquet/temporal_parquet.test` | DuckDB sqllogictest | ✓ Fully TZ-neutral (accessors + nosort) | +| `test/sql/parquet/` (new tests) | DuckDB sqllogictest | ✓ Policy in force | +| `test/sql/parity/*.test` | DuckDB sqllogictest | ✗ Still use `+00` — needs `TZ=Europe/Brussels` sweep | +| `test/sql/tint.test`, `tfloat.test`, … | DuckDB sqllogictest | ✗ Still use `+00` — needs sweep | +| `test/sql/tgeompoint.test`, `tgeometry.test` | DuckDB sqllogictest | ✗ Still use `+00` — needs sweep | +| `test/sql/stbox.test` | DuckDB sqllogictest | ✗ Still use `+00` — needs sweep | +| MobilityDB `expected/*.out` | pg_regress | ✓ Uses America/Los_Angeles (PST8PDT) — existing approach is correct | +| PyMEOS tests | pytest | ✗ Audit needed — accessor approach may not be consistently applied | diff --git a/examples/generic-ingest/generic_ingest.sql b/examples/generic-ingest/generic_ingest.sql new file mode 100644 index 00000000..3f820d88 --- /dev/null +++ b/examples/generic-ingest/generic_ingest.sql @@ -0,0 +1,161 @@ +-- generic_ingest.sql — TemporalParquet ingest template (bring your own data) +-- +-- Converts any lon/lat/timestamp CSV into a TemporalParquet shard. +-- Edit the CONFIGURE macros below, then run from the MobilityDuck root: +-- TZ=UTC ./build/release/duckdb -c ".read examples/generic-ingest/generic_ingest.sql" +-- +-- Output: a self-describing Parquet file with MEOS-WKB trajectory column and +-- TemporalParquet footer metadata, readable by MobilityDB, MobilitySpark, PyMEOS. +-- +-- ───────────────────────────────────────────────────────────────────────────── +-- PREREQUISITES (build from source required — community extension coming soon) +-- ───────────────────────────────────────────────────────────────────────────── +-- +-- git clone --recurse-submodules https://github.com/MobilityDB/MobilityDuck.git +-- cd MobilityDuck +-- make # installs vcpkg + MEOS, builds the extension +-- +-- The two LOAD lines below assume a local build at ../../build/release/. +-- ───────────────────────────────────────────────────────────────────────────── + +LOAD '../../build/release/extension/mobilityduck/mobilityduck.duckdb_extension'; +LOAD '../../build/release/extension/parquet/parquet.duckdb_extension'; + +-- ───────────────────────────────────────────────────────────────────────────── +-- CONFIGURE: data source +-- ───────────────────────────────────────────────────────────────────────────── + +-- Path to your CSV file (wildcards accepted: 'data/*.csv') +-- The CSV must have a header row. +CREATE OR REPLACE MACRO csv_path() AS 'your_data.csv'; + +-- Output Parquet shard path +CREATE OR REPLACE MACRO output_path() AS 'trajectories.parquet'; + +-- ───────────────────────────────────────────────────────────────────────────── +-- CONFIGURE: column mapping +-- Replace these macro bodies with your actual column names. +-- ───────────────────────────────────────────────────────────────────────────── + +-- Column in your CSV that uniquely identifies each moving object +-- (vessel MMSI, vehicle ID, user ID, sensor tag, …) +CREATE OR REPLACE MACRO col_entity_id() AS 'entity_id'; + +-- Column containing longitude in WGS-84 decimal degrees (−180 … 180) +CREATE OR REPLACE MACRO col_lon() AS 'longitude'; + +-- Column containing latitude in WGS-84 decimal degrees (−90 … 90) +CREATE OR REPLACE MACRO col_lat() AS 'latitude'; + +-- Column containing the observation timestamp +-- DuckDB parses ISO-8601, Unix epoch (integer), and most common formats. +CREATE OR REPLACE MACRO col_ts() AS 'timestamp'; + +-- Minimum number of pings per entity to include in output (filters sparse tracks) +CREATE OR REPLACE MACRO min_pings() AS 3; + +-- ───────────────────────────────────────────────────────────────────────────── +-- Step 1: load and validate raw pings +-- +-- Drops rows with out-of-range coordinates and deduplicates (entity, ts) pairs +-- (common in AIS/GPS feeds that emit duplicate messages at the same timestamp). +-- ───────────────────────────────────────────────────────────────────────────── + +CREATE OR REPLACE TABLE raw_pings AS +SELECT + CAST(columns(col_entity_id()) AS BIGINT) AS entity_id, + CAST(columns(col_lon()) AS DOUBLE) AS lon, + CAST(columns(col_lat()) AS DOUBLE) AS lat, + CAST(columns(col_ts()) AS TIMESTAMPTZ) AS ts +FROM read_csv_auto(csv_path(), header = true, nullstr = '') +WHERE TRY_CAST(columns(col_lon()) AS DOUBLE) BETWEEN -180 AND 180 + AND TRY_CAST(columns(col_lat()) AS DOUBLE) BETWEEN -90 AND 90 +QUALIFY ROW_NUMBER() OVER ( + PARTITION BY CAST(columns(col_entity_id()) AS BIGINT), + CAST(columns(col_ts()) AS TIMESTAMPTZ) + ORDER BY CAST(columns(col_ts()) AS TIMESTAMPTZ) +) = 1; + +SELECT count(*) AS raw_pings, count(DISTINCT entity_id) AS entities FROM raw_pings; + +-- ───────────────────────────────────────────────────────────────────────────── +-- Step 2: build tgeogpointSeq trajectories +-- +-- One geodetic sequence per entity, ordered by timestamp. +-- Entities with fewer than min_pings() observations are excluded. +-- ───────────────────────────────────────────────────────────────────────────── + +CREATE OR REPLACE TABLE trajectories AS +SELECT + entity_id, + tgeogpointSeq( + list(TGEOGPOINT(ST_Point(lon, lat), ts) ORDER BY ts) + ) AS traj +FROM raw_pings +GROUP BY entity_id +HAVING count(*) >= min_pings(); + +SELECT count(*) AS trajectories FROM trajectories; + +-- ───────────────────────────────────────────────────────────────────────────── +-- Step 3: write TemporalParquet shard +-- +-- The TemporalParquet footer (KV_METADATA 'temporal') declares traj as a +-- tgeogpoint column encoded with MEOS-WKB. Any MEOS-aware reader can +-- reconstruct the typed value from the BYTE_ARRAY column without a schema file. +-- +-- The native-scalar bound columns (projected O(1) from each trajectory's +-- inline stbox) are additive ordinary TIMESTAMPTZ / DOUBLE columns: they +-- carry real Parquet min/max statistics so a reader prunes row groups on a +-- time/space predicate without decoding any MEOS-WKB byte. The footer +-- still describes only the WKB column, so the file stays spec-conformant. +-- ───────────────────────────────────────────────────────────────────────────── + +COPY ( + SELECT + entity_id, + asBinary(traj) AS traj, + numInstants(traj) AS ping_count, + Tmin(stbox(traj)) AS ts_min, + Tmax(stbox(traj)) AS ts_max, + Xmin(stbox(traj)) AS lon_min, + Xmax(stbox(traj)) AS lon_max, + Ymin(stbox(traj)) AS lat_min, + Ymax(stbox(traj)) AS lat_max + FROM trajectories +) +TO output_path() ( + FORMAT PARQUET, + ROW_GROUP_SIZE 1000, + KV_METADATA {'temporal': temporalFooter(MAP {'traj': 'tgeogpoint'})} +); + +-- Verify schema: traj must appear as BYTE_ARRAY +SELECT name, type FROM parquet_schema(output_path()) +WHERE name NOT IN ('duckdb_schema'); + +-- Verify footer +SELECT value = temporalFooter(MAP {'traj': 'tgeogpoint'})::BLOB AS footer_ok +FROM parquet_kv_metadata(output_path()) +WHERE key = 'temporal'::BLOB; + +-- ───────────────────────────────────────────────────────────────────────────── +-- Step 4: quick sanity analytics on the written shard +-- These same queries run unchanged on MobilityDB and MobilitySpark. +-- ───────────────────────────────────────────────────────────────────────────── + +-- Top 10 entities by geodetic trajectory length (metres) +SELECT + entity_id, + ping_count, + round(length(tgeogpointFromBinary(traj))) AS length_m, + round(maxValue(speed(tgeogpointFromBinary(traj))), 2) AS max_speed_ms +FROM read_parquet(output_path()) +ORDER BY length_m DESC +LIMIT 10; + +-- Distribution of ping counts +SELECT ping_count, count(*) AS entities +FROM read_parquet(output_path()) +GROUP BY ping_count +ORDER BY ping_count; diff --git a/examples/quickstart/quickstart.sql b/examples/quickstart/quickstart.sql new file mode 100644 index 00000000..f7292683 --- /dev/null +++ b/examples/quickstart/quickstart.sql @@ -0,0 +1,201 @@ +-- quickstart.sql — Edge-to-Cloud Temporal Data Lake demo (no external data needed) +-- +-- Demonstrates the full TemporalParquet pipeline with synthetic GPS trajectories: +-- 1. Generate 5 vessels × 12 pings from inline VALUES (no CSV required) +-- 2. Build tgeogpointSeq trajectories — geodetic WGS-84, length in metres +-- 3. Write TemporalParquet shard: asBinary() + temporalFooter() metadata +-- 4. Query the shard: length, speed, region intersection, trip duration +-- +-- Companion file: quickstart_mobilitydb.sql — same queries on PostgreSQL/MobilityDB. +-- +-- ───────────────────────────────────────────────────────────────────────────── +-- PREREQUISITES (build from source required — community extension coming soon) +-- ───────────────────────────────────────────────────────────────────────────── +-- +-- git clone --recurse-submodules https://github.com/MobilityDB/MobilityDuck.git +-- cd MobilityDuck +-- make # installs vcpkg + MEOS, builds the extension +-- # first build ~5-10 min; subsequent builds ~30 s +-- +-- Then run this file from the MobilityDuck root: +-- TZ=UTC ./build/release/duckdb -c ".read examples/quickstart/quickstart.sql" +-- +-- Or from the examples/quickstart/ directory: +-- TZ=UTC ../../build/release/duckdb :memory: -f quickstart.sql +-- +-- The two LOAD lines below assume a local build at ../../build/release/. +-- Adjust the paths if your build directory differs. +-- ───────────────────────────────────────────────────────────────────────────── + +LOAD '../../build/release/extension/mobilityduck/mobilityduck.duckdb_extension'; +LOAD '../../build/release/extension/parquet/parquet.duckdb_extension'; + +-- ───────────────────────────────────────────────────────────────────────────── +-- Step 1: generate synthetic pings (no external data needed) +-- +-- Five vessels depart from different positions in the North Sea / Kattegat area +-- and move linearly for 12 pings at 10-minute intervals (1h50m total). +-- Coordinates are in WGS-84 decimal degrees. +-- +-- Vessel coverage of the Copenhagen bounding box (lon 11.5–13.5, lat 55.0–56.5): +-- 1 → approaches from west, enters box around ping 7 +-- 2 → approaches from east, enters box around ping 3 +-- 3 → stays in Skagerrak, never enters box ← useful negative case +-- 4 → starts inside box, stays inside throughout +-- 5 → approaches from southwest, enters box around ping 10 +-- ───────────────────────────────────────────────────────────────────────────── + +CREATE OR REPLACE TABLE raw_pings AS +SELECT + entity_id, + round(start_lon + delta_lon * step, 6) AS lon, + round(start_lat + delta_lat * step, 6) AS lat, + -- to_timestamp(unix_epoch) avoids TIMESTAMPTZ+INTERVAL operator ambiguity + -- 1768464000 = 2026-01-15 08:00:00 UTC + to_timestamp(1768464000 + step * 600) AS ts +FROM (VALUES + -- entity_id start_lon start_lat delta_lon delta_lat + (1, 10.00, 55.50, 0.23, 0.05), + (2, 14.00, 56.00, -0.18, -0.08), + (3, 8.50, 57.50, 0.06, -0.06), + (4, 12.10, 55.20, 0.04, 0.02), + (5, 9.50, 54.50, 0.22, 0.06) +) t(entity_id, start_lon, start_lat, delta_lon, delta_lat), +generate_series(0, 11) g(step); + +SELECT entity_id, count(*) AS pings FROM raw_pings GROUP BY entity_id ORDER BY entity_id; + +-- ───────────────────────────────────────────────────────────────────────────── +-- Step 2: build tgeogpointSeq trajectories +-- +-- TGEOGPOINT(geometry, timestamptz) creates a geodetic instant. +-- tgeogpointSeq(list(... ORDER BY ts)) assembles them into a linear sequence. +-- The geodetic flag means length() and speed() return metres and m/s. +-- ───────────────────────────────────────────────────────────────────────────── + +CREATE OR REPLACE TABLE trajectories AS +SELECT + entity_id, + tgeogpointSeq( + list(TGEOGPOINT(ST_Point(lon, lat), ts) ORDER BY ts) + ) AS traj +FROM raw_pings +GROUP BY entity_id; + +SELECT count(*) AS trajectories FROM trajectories; + +-- ───────────────────────────────────────────────────────────────────────────── +-- Step 3: write TemporalParquet shard +-- +-- asBinary() → portable MEOS-WKB BLOB (BYTE_ARRAY in Parquet) +-- temporalFooter() → TemporalParquet JSON metadata injected via KV_METADATA +-- +-- Any MEOS-WKB-aware reader (MobilityDB, MobilitySpark, PyMEOS) can decode +-- the traj column using the base_type declared in the footer. +-- +-- The temporal value stays an opaque MEOS-WKB BLOB (the ratified spec). +-- To let a reader skip Parquet row groups WITHOUT decoding any row, also +-- project O(1) native-scalar bound columns from the trajectory's inline +-- spatiotemporal box (stbox): plain TIMESTAMPTZ / DOUBLE columns carry real +-- Parquet min/max statistics, so engine predicate pushdown prunes on them. +-- These sidecars are additive ordinary columns; the footer still describes +-- only the WKB column, so the file stays spec-conformant. +-- ───────────────────────────────────────────────────────────────────────────── + +COPY ( + SELECT + entity_id, + asBinary(traj) AS traj, + numInstants(traj) AS ping_count, + Tmin(stbox(traj)) AS ts_min, -- native time-bound sidecars + Tmax(stbox(traj)) AS ts_max, + Xmin(stbox(traj)) AS lon_min, -- native space-bound sidecars + Xmax(stbox(traj)) AS lon_max, + Ymin(stbox(traj)) AS lat_min, + Ymax(stbox(traj)) AS lat_max + FROM trajectories +) +TO 'edge_to_cloud_demo.parquet' ( + FORMAT PARQUET, + ROW_GROUP_SIZE 1000, + KV_METADATA {'temporal': temporalFooter(MAP {'traj': 'tgeogpoint'})} +); + +-- Verify the Parquet schema: traj is an opaque BYTE_ARRAY; the sidecars +-- are native scalar columns (the prunable shape) +SELECT name, type +FROM parquet_schema('edge_to_cloud_demo.parquet') +WHERE name NOT IN ('duckdb_schema'); + +-- Verify the TemporalParquet footer is embedded correctly +SELECT value = temporalFooter(MAP {'traj': 'tgeogpoint'})::BLOB AS footer_ok +FROM parquet_kv_metadata('edge_to_cloud_demo.parquet') +WHERE key = 'temporal'::BLOB; + +-- ───────────────────────────────────────────────────────────────────────────── +-- Step 4: analytics on the Parquet shard +-- +-- All queries use tgeogpointFromBinary() to reconstruct the typed value from +-- the BLOB column. The same named-function queries run unchanged on +-- MobilityDB and MobilitySpark — see quickstart_mobilitydb.sql. +-- ───────────────────────────────────────────────────────────────────────────── + +-- Query A: total distance and maximum speed per vessel +-- length() returns geodetic metres (spheroidal WGS-84 via Vincenty/Haversine) +-- speed() returns a tfloat of instantaneous speed in m/s; maxValue() extracts the peak +SELECT + entity_id, + ping_count, + round(length(tgeogpointFromBinary(traj))) AS length_m, + round(maxValue(speed(tgeogpointFromBinary(traj))), 2) AS max_speed_ms +FROM read_parquet('edge_to_cloud_demo.parquet') +ORDER BY length_m DESC; + +-- Query B: vessels that entered the Copenhagen bounding box +-- lon 11.5–13.5, lat 55.0–56.5 (approx. Øresund / Danish straits region) +-- eIntersects returns true if the trajectory ever enters the polygon. +-- DuckDB GEOMETRY is automatically promoted to geodetic when matched against +-- a tgeogpoint, so no SRID annotation on the polygon is required. +SELECT entity_id +FROM ( + SELECT entity_id, tgeogpointFromBinary(traj) AS traj + FROM read_parquet('edge_to_cloud_demo.parquet') +) +WHERE eIntersects( + ST_GeomFromText('POLYGON((11.5 55.0,13.5 55.0,13.5 56.5,11.5 56.5,11.5 55.0))'), + traj +) +ORDER BY entity_id; + +-- Query C: trip duration (timezone-independent interval) +SELECT + entity_id, + duration(tgeogpointFromBinary(traj))::VARCHAR AS trip_duration +FROM read_parquet('edge_to_cloud_demo.parquet') +ORDER BY entity_id; + +-- ───────────────────────────────────────────────────────────────────────────── +-- Query D: the lake predicate-pushdown pattern +-- +-- Filter on the native-scalar sidecars FIRST (plain TIMESTAMPTZ / DOUBLE +-- comparisons). DuckDB reads each Parquet row group's min/max statistics +-- and skips groups that cannot match, WITHOUT decoding any MEOS-WKB byte. +-- tgeogpointFromBinary() then runs only on the surviving rows. On a real +-- lake (millions of rows, remote object storage) this is the difference +-- between scanning the whole dataset and touching a few row groups. +-- +-- Window: vessels active in the Øresund box during the first hour. +-- ───────────────────────────────────────────────────────────────────────────── +SELECT + entity_id, + round(length(tgeogpointFromBinary(traj))) AS length_m +FROM read_parquet('edge_to_cloud_demo.parquet') +WHERE ts_min >= TIMESTAMPTZ '2026-01-15 08:00:00+00' + AND ts_max < TIMESTAMPTZ '2026-01-15 09:00:00+00' + AND lon_min <= 13.5 AND lon_max >= 11.5 + AND lat_min <= 56.5 AND lat_max >= 55.0 +ORDER BY entity_id; + +-- To see the pruning in the plan: +-- EXPLAIN ANALYZE SELECT ... (the query above) +-- and look for the Parquet scan's pushed Filters and reduced row count. diff --git a/examples/quickstart/quickstart_mobilitydb.sql b/examples/quickstart/quickstart_mobilitydb.sql new file mode 100644 index 00000000..52408746 --- /dev/null +++ b/examples/quickstart/quickstart_mobilitydb.sql @@ -0,0 +1,159 @@ +-- quickstart_mobilitydb.sql — Edge-to-Cloud demo on PostgreSQL / MobilityDB +-- +-- Companion to quickstart.sql. Builds the same five synthetic trajectories and +-- runs the same three analytics queries — proving the portable named-function +-- SQL dialect produces identical results on both DuckDB and PostgreSQL. +-- +-- Requirements: PostgreSQL with MobilityDB and PostGIS installed. +-- Run as: psql -d -f quickstart_mobilitydb.sql +-- +-- Reading a TemporalParquet shard written by MobilityDuck: +-- Install pg_parquet (https://github.com/CrunchyData/pg_parquet), then: +-- CREATE EXTENSION pg_parquet; +-- SELECT tgeogpointFromBinary(traj), ping_count +-- FROM parquet.read('edge_to_cloud_demo.parquet'); +-- Replace the WITH trajs CTE below with that table expression. + +-- ───────────────────────────────────────────────────────────────────────────── +-- Construction (MobilityDB syntax) +-- +-- Key difference from DuckDB: +-- DuckDB: tgeogpointSeq(list(TGEOGPOINT(ST_Point(lon,lat), ts) ORDER BY ts)) +-- MobilityDB: tgeogpointseq(array_agg( +-- format('SRID=4326;POINT(%s %s)@%s',lon,lat,ts)::tgeogpoint +-- ORDER BY ts)) +-- +-- The analytics queries (A, B, C) below are identical on both platforms. +-- ───────────────────────────────────────────────────────────────────────────── + +WITH raw AS ( + SELECT + entity_id, + round((start_lon + delta_lon * s)::numeric, 6)::float8 AS lon, + round((start_lat + delta_lat * s)::numeric, 6)::float8 AS lat, + TIMESTAMPTZ '2026-01-15 08:00:00+00' + (s * INTERVAL '10 minutes') AS ts + FROM (VALUES + -- entity_id start_lon start_lat delta_lon delta_lat + (1, 10.00, 55.50, 0.23, 0.05), + (2, 14.00, 56.00, -0.18, -0.08), + (3, 8.50, 57.50, 0.06, -0.06), + (4, 12.10, 55.20, 0.04, 0.02), + (5, 9.50, 54.50, 0.22, 0.06) + ) t(entity_id, start_lon, start_lat, delta_lon, delta_lat), + generate_series(0, 11) g(s) +), +trajs AS ( + SELECT + entity_id, + tgeogpointseq( + array_agg( + format('SRID=4326;POINT(%s %s)@%s', lon, lat, ts)::tgeogpoint + ORDER BY ts + ) + ) AS traj + FROM raw + GROUP BY entity_id +) +-- ───────────────────────────────────────────────────────────────────────────── +-- Query A: total distance and maximum speed per vessel +-- Identical to DuckDB — length() returns geodetic metres, speed() returns m/s +-- ───────────────────────────────────────────────────────────────────────────── +SELECT + entity_id, + round(length(traj)) AS length_m, + round(maxValue(speed(traj))::numeric, 2) AS max_speed_ms +FROM trajs +ORDER BY length_m DESC; + +-- Expected (same as DuckDB): +-- entity_id | length_m | max_speed_ms +-- -----------+----------+-------------- +-- 5 | 172001 | 26.22 +-- 1 | 170169 | 25.93 +-- 2 | 158771 | 24.21 +-- 3 | 83644 | 12.70 +-- 4 | 37155 | 5.64 + + +-- ───────────────────────────────────────────────────────────────────────────── +-- Query B: vessels that entered the Copenhagen bounding box +-- MobilityDB uses ST_GeomFromText with SRID=4326 prefix (EWKT). +-- DuckDB requires GEODSTBOX workaround (DuckDB geometry type carries no SRID). +-- Both return the same four vessels. +-- ───────────────────────────────────────────────────────────────────────────── +WITH raw AS ( + SELECT + entity_id, + round((start_lon + delta_lon * s)::numeric, 6)::float8 AS lon, + round((start_lat + delta_lat * s)::numeric, 6)::float8 AS lat, + TIMESTAMPTZ '2026-01-15 08:00:00+00' + (s * INTERVAL '10 minutes') AS ts + FROM (VALUES + (1, 10.00, 55.50, 0.23, 0.05), + (2, 14.00, 56.00, -0.18, -0.08), + (3, 8.50, 57.50, 0.06, -0.06), + (4, 12.10, 55.20, 0.04, 0.02), + (5, 9.50, 54.50, 0.22, 0.06) + ) t(entity_id, start_lon, start_lat, delta_lon, delta_lat), + generate_series(0, 11) g(s) +), +trajs AS ( + SELECT entity_id, + tgeogpointseq(array_agg( + format('SRID=4326;POINT(%s %s)@%s', lon, lat, ts)::tgeogpoint + ORDER BY ts)) AS traj + FROM raw GROUP BY entity_id +) +SELECT entity_id +FROM trajs +WHERE eIntersects( + ST_GeomFromText('SRID=4326;POLYGON((11.5 55.0,13.5 55.0,13.5 56.5,11.5 56.5,11.5 55.0))'), + traj +) +ORDER BY entity_id; + +-- Expected (same as DuckDB): +-- entity_id +-- ----------- +-- 1 +-- 2 +-- 4 +-- 5 + + +-- ───────────────────────────────────────────────────────────────────────────── +-- Query C: trip duration (timezone-independent interval) +-- ───────────────────────────────────────────────────────────────────────────── +WITH raw AS ( + SELECT + entity_id, + round((start_lon + delta_lon * s)::numeric, 6)::float8 AS lon, + round((start_lat + delta_lat * s)::numeric, 6)::float8 AS lat, + TIMESTAMPTZ '2026-01-15 08:00:00+00' + (s * INTERVAL '10 minutes') AS ts + FROM (VALUES + (1, 10.00, 55.50, 0.23, 0.05), + (2, 14.00, 56.00, -0.18, -0.08), + (3, 8.50, 57.50, 0.06, -0.06), + (4, 12.10, 55.20, 0.04, 0.02), + (5, 9.50, 54.50, 0.22, 0.06) + ) t(entity_id, start_lon, start_lat, delta_lon, delta_lat), + generate_series(0, 11) g(s) +), +trajs AS ( + SELECT entity_id, + tgeogpointseq(array_agg( + format('SRID=4326;POINT(%s %s)@%s', lon, lat, ts)::tgeogpoint + ORDER BY ts)) AS traj + FROM raw GROUP BY entity_id +) +SELECT entity_id, duration(traj) AS trip_duration +FROM trajs +ORDER BY entity_id; + +-- Expected (same as DuckDB): +-- entity_id | trip_duration +-- -----------+--------------- +-- 1 | 01:50:00 +-- 2 | 01:50:00 +-- 3 | 01:50:00 +-- 4 | 01:50:00 +-- 5 | 01:50:00