
Conversation

@gatesn gatesn commented Jan 14, 2026

Dataset API

A unified API to sit between query engines and data sources that preserves support for late materialization, deferred decompression, and alternate device buffers.
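For orientation, here's a rough sketch of the surface, assembled from the snippets discussed below; exact types, the async shape, and the elided fields are illustrative rather than normative:

pub trait DataSourceProvider: 'static {
    /// Resolve a URI into a data source.
    async fn init_source(&self, uri: String) -> VortexResult<DataSourceRef>;

    /// Serialize a source split to bytes so it can be shipped to a worker.
    async fn serialize_split(&self, split: &dyn Split) -> VortexResult<Vec<u8>>;
}

/// Per-scan options pushed down to the source.
#[derive(Debug, Clone, Default)]
pub struct ScanRequest {
    pub projection: Option<Expression>,
    // ...further pushdown fields elided...
}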

Open Questions

  • What does capabilities negotiation look like? e.g. for unsupported encodings?
  • What about shared dictionaries?
  • Is deserialize_split in the right place? Without serde, splits can share state in the DataSourceScan. With serde, there's nowhere for shared state to live. Perhaps we should reconstruct a datasource on the worker? Should the datasource be serializable?

Signed-off-by: Nicholas Gates <[email protected]>
@gatesn gatesn changed the title from Scan API to Dataset API Jan 14, 2026
Signed-off-by: Nicholas Gates <[email protected]>

codecov bot commented Jan 14, 2026

Codecov Report

❌ Patch coverage is 0% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.38%. Comparing base (f1684e6) to head (fad4095).
⚠️ Report is 2 commits behind head on develop.

Files with missing lines      | Patch % | Lines
vortex-scan/src/v2/reader.rs  | 0.00%   | 11 Missing ⚠️


Signed-off-by: Nicholas Gates <[email protected]>

* `vortex-iceberg` - Expose Iceberg tables as a Vortex Dataset
* `vortex-python` - Expose PyArrow Datasets as a Vortex Dataset
* `vortex-layout` - Expose a Vortex Layout as a Vortex Dataset
Contributor

we should definitely write a vortex-parquet


#[derive(Debug, Clone, Default)]
pub struct ScanRequest {
    pub projection: Option<Expression>,
Contributor

I don't love the fact that, in theory, someone could write Spark SQL that gets compiled to a vortex Expression, which in turn gets compiled to a PyArrow Expression.

Hunting a semantic bug across those two compilers gives me the heebie-jeebies, but I don't have an alternative solution! It's better to hunt bugs in O(N+M) integrations than O(N×M), assuming I care about all the integrations.
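To make the O(N+M) point concrete, the shared IR roughly buys the following shape; the trait names here are made up for illustration:

// One impl per engine (N of these): lower the engine's predicates into a
// vortex Expression.
trait ToVortexExpr {
    fn to_vortex(&self) -> Expression;
}

// One impl per source (M of these): lower a vortex Expression into the
// source's native predicate language, e.g. a PyArrow Expression.
trait FromVortexExpr: Sized {
    fn from_vortex(expr: &Expression) -> VortexResult<Self>;
}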


/// Returns the next batch of splits to be processed.
///
/// This should not return _more_ than max_batch_size splits, but may return fewer.
Contributor

max_batch_size isn't defined anywhere; should that be in ScanRequest?
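Two hypothetical places it could live, neither of which is in the PR as written (next_splits and SplitRef are illustrative names):

// Option A: carry it on the request.
pub struct ScanRequest {
    pub projection: Option<Expression>,
    pub max_batch_size: Option<usize>,
}

// Option B: make it an explicit argument on the split iterator.
async fn next_splits(&mut self, max_batch_size: usize) -> VortexResult<Vec<SplitRef>>;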

pub trait DataSourceProvider: 'static {
    /// URI schemes handled by this source provider.
    ///
    /// TODO(ngates): this might not be the right way to plug in sources.
Contributor

Yeah, hmm, how do I read a PyArrow dataset? That's something sitting in memory in Python.

I'm not sure how duckdb does it, but I assume it just tries every other source and then, failing that, starts rummaging around in the Python variable environment?

Suppose I'm writing neoduckdb and I want to support the same magic, but I'm delegating to Vortex for all my scanning. I feel like init_source should return VortexResult<Option<DataSourceRef>> so that my engine can check if a data source exists? We don't usually use VortexResult for recoverable errors.
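Purely as a sketch of the suggested shape:

/// Hypothetical variant: Ok(None) means "this provider doesn't recognize the
/// URI", so the engine can fall through to the next provider without treating
/// it as an error.
async fn init_source(&self, uri: String) -> VortexResult<Option<DataSourceRef>>;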

async fn init_source(&self, uri: String) -> VortexResult<DataSourceRef>;

/// Serialize a source split to bytes.
async fn serialize_split(&self, split: &dyn Split) -> VortexResult<Vec<u8>>;
Contributor

I guess the Provider just assumes the split is a type it knows and downcasts it? Kinda seems like Providers should have a Split associated type and their DataSources & Scans are parameterized by that type?
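Roughly this, as an illustration only (it would also force the DataSource and scan types to become generic over the provider's split):

pub trait DataSourceProvider: 'static {
    /// The concrete split type produced by this provider's sources.
    type Split;

    async fn init_source(&self, uri: String) -> VortexResult<DataSourceRef>;

    /// No downcasting needed: the split type is statically known here.
    async fn serialize_split(&self, split: &Self::Split) -> VortexResult<Vec<u8>>;
    async fn deserialize_split(&self, bytes: &[u8]) -> VortexResult<Self::Split>;
}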

Contributor

Or maybe splits can just serialize themselves? What's the case where a Split lacks sufficient information to serialize itself?
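i.e. something like this (hypothetical):

pub trait Split: Send + Sync {
    /// The split carries everything it needs to describe itself on the wire.
    fn serialize(&self) -> VortexResult<Vec<u8>>;
}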

Contributor

I see you raised this above in your PR comment.

Yeah I kinda think the source and the split should be serializable. The source can play the role of Spark's broadcast whereas the split can play the role of Spark's Partition.
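Under that model the worker-side flow would look roughly like this; deserialize_source and the exact call shapes are assumptions, not part of this PR:

// Broadcast-style: the serialized source is shipped once per worker...
let source = provider.deserialize_source(&source_bytes).await?;
// ...while each task reconstructs only its own split, like a Spark Partition.
let split = provider.deserialize_split(&source, &split_bytes).await?;
let stream = source.scan(scan_request, split).await?;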

Comment on lines +27 to +28
/// Serialize a source split to bytes.
async fn serialize_split(&self, split: &dyn Split) -> VortexResult<Vec<u8>>;
Contributor

this is the plan, not the split data, right?
