Dataset API #5949
Conversation
* `vortex-iceberg` - Expose Iceberg tables as a Vortex Dataset
* `vortex-python` - Expose PyArrow Datasets as a Vortex Dataset
* `vortex-layout` - Expose a Vortex Layout as a Vortex Dataset
we should definitely write a `vortex-parquet`
```rust
#[derive(Debug, Clone, Default)]
pub struct ScanRequest {
    pub projection: Option<Expression>,
```
I don't love the fact that, in theory, someone could write Spark SQL which gets compiled to a Vortex `Expression` which gets compiled to a PyArrow `Expression`.

Hunting a semantic bug across those two compilers gives me the heebie-jeebies, but I don't have an alternative solution! It's better to hunt bugs in O(N+M) integrations than O(N·M), assuming I care about all the integrations.
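To make the O(N+M) intuition concrete, here's a toy sketch (not Vortex's actual `Expression` API; the IR and the PyArrow rendering are both invented for illustration). Each frontend lowers into one shared IR and each backend lowers out of it, so adding an engine or a source costs one translator rather than one per pair:

```rust
// Toy IR standing in for a Vortex Expression; the real type is far richer.
enum Expr {
    Column(String),
    Literal(i64),
    Gt(Box<Expr>, Box<Expr>),
}

// One backend translator (shared IR -> a PyArrow compute expression, rendered
// here as Python source). N frontends + M backends means N + M translators
// like this one, instead of N * M engine-to-source pairs.
fn to_pyarrow(e: &Expr) -> String {
    match e {
        Expr::Column(name) => format!("pc.field({name:?})"),
        Expr::Literal(v) => v.to_string(),
        Expr::Gt(lhs, rhs) => format!("({} > {})", to_pyarrow(lhs), to_pyarrow(rhs)),
    }
}
```

The heebie-jeebies are real, though: a semantic mismatch (e.g. null-comparison behavior) can now hide in either translator.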
```rust
/// Returns the next batch of splits to be processed.
///
/// This should not return _more_ than `max_batch_size` splits, but may return fewer.
```
`max_batch_size` isn't defined anywhere; should that be in `ScanRequest`?
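One way to resolve that question, sketched under assumptions (`Scan` and `SplitRef` are stand-in names, not the PR's): make the cap an explicit argument on the split-producing method rather than a field on `ScanRequest`:

```rust
pub trait Scan: Send + Sync {
    /// Returns the next batch of splits to be processed.
    ///
    /// Returns at most `max_batch_size` splits; an empty Vec signals exhaustion.
    async fn next_splits(&self, max_batch_size: usize) -> VortexResult<Vec<SplitRef>>;
}
```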
```rust
pub trait DataSourceProvider: 'static {
    /// URI schemes handled by this source provider.
    ///
    /// TODO(ngates): this might not be the right way to plug in sources.
```
Yeah, hmm, how do I read a PyArrow dataset? That's something sitting in memory in Python.

I'm not sure how DuckDB does it, but I assume it just tries every other source and then, failing that, starts rummaging around in the Python variable environment?

Suppose I'm writing neoduckdb and I want to support the same magic, but I'm delegating to Vortex for all my scanning. I feel like `init_source` should return `VortexResult<Option<DataSourceRef>>` so that my engine can check whether a data source exists. We don't usually use `VortexResult` for recoverable errors.
```rust
async fn init_source(&self, uri: String) -> VortexResult<DataSourceRef>;
```
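A sketch of the suggested shape, plus the engine-side loop it enables. The `resolve` helper is hypothetical, the error constructor assumes vortex-error's `vortex_err!` macro, and making the async trait object-safe (e.g. via async-trait) is also assumed:

```rust
// Ok(None) means "this provider doesn't recognise the URI", reserving the
// Err path for genuine failures.
async fn init_source(&self, uri: String) -> VortexResult<Option<DataSourceRef>>;

// Hypothetical engine-side resolution: try each registered provider in turn.
async fn resolve(
    providers: &[Arc<dyn DataSourceProvider>],
    uri: &str,
) -> VortexResult<DataSourceRef> {
    for provider in providers {
        if let Some(source) = provider.init_source(uri.to_string()).await? {
            return Ok(source);
        }
    }
    Err(vortex_err!("no registered provider handles URI: {uri}"))
}
```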
```rust
/// Serialize a source split to bytes.
async fn serialize_split(&self, split: &dyn Split) -> VortexResult<Vec<u8>>;
```
I guess the provider just assumes the split is a type it knows and downcasts it? Kinda seems like providers should have a `Split` associated type, and their `DataSource`s and `Scan`s should be parameterized by that type?
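A sketch of that associated-type alternative (hypothetical; `deserialize_split` is an invented counterpart to the quoted method):

```rust
// Each provider names its concrete split type, so serialization takes
// &Self::Split and no downcast from &dyn Split is needed.
pub trait DataSourceProvider: 'static {
    type Split: Split;

    async fn serialize_split(&self, split: &Self::Split) -> VortexResult<Vec<u8>>;
    async fn deserialize_split(&self, bytes: &[u8]) -> VortexResult<Self::Split>;
}
```

The trade-off: with an associated type the provider can no longer live behind `dyn DataSourceProvider`, which matters if the engine keeps a heterogeneous registry of providers.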
Or maybe splits can just serialize themselves? What's the case where a `Split` lacks sufficient information to serialize itself?
I see you raised this above in your PR comment.

Yeah, I kinda think the source and the split should both be serializable. The source can play the role of Spark's broadcast, whereas the split can play the role of Spark's `Partition`.
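A minimal sketch of that direction (hypothetical trait shapes, not the PR's code):

```rust
// The source plays Spark's broadcast (shipped once to every worker) and the
// split plays Spark's Partition (shipped once per task); both carry their own
// wire format, so no provider-side downcasting is needed.
pub trait Split: Send + Sync {
    fn serialize(&self) -> VortexResult<Vec<u8>>;
}

pub trait DataSource: Send + Sync {
    fn serialize(&self) -> VortexResult<Vec<u8>>;
}
```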
```rust
/// Serialize a source split to bytes.
async fn serialize_split(&self, split: &dyn Split) -> VortexResult<Vec<u8>>;
```
this is the plan, not the split data, right?
Dataset API
A unified API to sit between query engines and data sources that preserves support for late materialization, deferred decompression, and alternate device buffers.
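A rough end-to-end sketch of how the quoted pieces might compose. Only `ScanRequest`, `Split`, `DataSourceProvider`, `init_source`, and `serialize_split` appear in the diff above; `scan`, `next_splits`, and the batch size of 64 are assumptions for illustration:

```rust
// Hypothetical engine-side planning: open the source, plan a scan, and ship
// each split to a worker as opaque bytes, Spark-Partition style.
async fn plan_scan(
    provider: &dyn DataSourceProvider,
    uri: &str,
) -> VortexResult<Vec<Vec<u8>>> {
    let source = provider.init_source(uri.to_string()).await?;
    let scan = source.scan(ScanRequest::default()).await?;

    let mut tasks = Vec::new();
    for split in scan.next_splits(64).await? {
        tasks.push(provider.serialize_split(split.as_ref()).await?);
    }
    Ok(tasks)
}
```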
Open Questions