Dataset API #5949
Conversation
* `vortex-iceberg` - Expose Iceberg tables as a Vortex Dataset
* `vortex-python` - Expose PyArrow Datasets as a Vortex Dataset
* `vortex-layout` - Expose a Vortex Layout as a Vortex Dataset
we should definitely write a `vortex-parquet`
```rust
#[derive(Debug, Clone, Default)]
pub struct ScanRequest {
    pub projection: Option<Expression>,
```
I don't love the fact that, in theory, someone could write Spark SQL which gets compiled to a Vortex `Expression` which gets compiled to a PyArrow `Expression`.

Hunting a semantic bug across those two compilers gives me the heebie-jeebies, but I don't have an alternative solution! It's better to hunt bugs in O(N+M) integrations than O(N·M), assuming I care about all the integrations.
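To make the O(N+M) intuition concrete, here's a toy sketch (not Vortex's actual `Expression` API; the IR and the PyArrow rendering are both invented for illustration). Each frontend lowers into one shared IR and each backend lowers out of it, so adding an engine or a source costs one translator rather than one per pair:

```rust
// Toy IR standing in for a Vortex Expression; the real type is far richer.
enum Expr {
    Column(String),
    Literal(i64),
    Gt(Box<Expr>, Box<Expr>),
}

// One backend translator (shared IR -> a PyArrow compute expression, rendered
// here as Python source). N frontends + M backends means N + M translators
// like this one, instead of N * M engine-to-source pairs.
fn to_pyarrow(e: &Expr) -> String {
    match e {
        Expr::Column(name) => format!("pc.field({name:?})"),
        Expr::Literal(v) => v.to_string(),
        Expr::Gt(lhs, rhs) => format!("({} > {})", to_pyarrow(lhs), to_pyarrow(rhs)),
    }
}
```

The heebie-jeebies are real, though: a semantic mismatch (e.g. null-comparison behavior) can now hide in either translator.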
```rust
/// Returns the next batch of splits to be processed.
///
/// This should not return _more_ than `max_batch_size` splits, but may return fewer.
```
`max_batch_size` isn't defined anywhere; should that be in `ScanRequest`?
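One way to resolve that question, sketched under assumptions (`Scan` and `SplitRef` are stand-in names, not the PR's): make the cap an explicit argument on the split-producing method rather than a field on `ScanRequest`:

```rust
pub trait Scan: Send + Sync {
    /// Returns the next batch of splits to be processed.
    ///
    /// Returns at most `max_batch_size` splits; an empty Vec signals exhaustion.
    async fn next_splits(&self, max_batch_size: usize) -> VortexResult<Vec<SplitRef>>;
}
```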
```rust
pub trait DataSourceProvider: 'static {
    /// URI schemes handled by this source provider.
    ///
    /// TODO(ngates): this might not be the right way to plug in sources.
```
Yeah, hmm, how do I read a PyArrow dataset? That's something sitting in memory in Python.

I'm not sure how DuckDB does it, but I assume it just tries every other source and then, failing that, starts rummaging around in the Python variable environment?

Suppose I'm writing neoduckdb and I want to support the same magic, but I'm delegating to Vortex for all my scanning. I feel like `init_source` should return `VortexResult<Option<DataSourceRef>>` so that my engine can check whether a data source exists. We don't usually use `VortexResult` for recoverable errors.
```rust
async fn init_source(&self, uri: String) -> VortexResult<DataSourceRef>;
```
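A sketch of the suggested shape, plus the engine-side loop it enables. The `resolve` helper is hypothetical, the error constructor assumes vortex-error's `vortex_err!` macro, and making the async trait object-safe (e.g. via async-trait) is also assumed:

```rust
// Ok(None) means "this provider doesn't recognise the URI", reserving the
// Err path for genuine failures.
async fn init_source(&self, uri: String) -> VortexResult<Option<DataSourceRef>>;

// Hypothetical engine-side resolution: try each registered provider in turn.
async fn resolve(
    providers: &[Arc<dyn DataSourceProvider>],
    uri: &str,
) -> VortexResult<DataSourceRef> {
    for provider in providers {
        if let Some(source) = provider.init_source(uri.to_string()).await? {
            return Ok(source);
        }
    }
    Err(vortex_err!("no registered provider handles URI: {uri}"))
}
```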
```rust
/// Serialize a source split to bytes.
async fn serialize_split(&self, split: &dyn Split) -> VortexResult<Vec<u8>>;
```
I guess the provider just assumes the split is a type it knows and downcasts it? Kinda seems like providers should have a `Split` associated type, and their `DataSource`s and `Scan`s should be parameterized by that type?
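A sketch of that associated-type alternative (hypothetical; `deserialize_split` is an invented counterpart to the quoted method):

```rust
// Each provider names its concrete split type, so serialization takes
// &Self::Split and no downcast from &dyn Split is needed.
pub trait DataSourceProvider: 'static {
    type Split: Split;

    async fn serialize_split(&self, split: &Self::Split) -> VortexResult<Vec<u8>>;
    async fn deserialize_split(&self, bytes: &[u8]) -> VortexResult<Self::Split>;
}
```

The trade-off: with an associated type the provider can no longer live behind `dyn DataSourceProvider`, which matters if the engine keeps a heterogeneous registry of providers.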
Or maybe splits can just serialize themselves? What's the case where a `Split` lacks sufficient information to serialize itself?
I see you raised this above in your PR comment.

Yeah, I kinda think the source and the split should both be serializable. The source can play the role of Spark's broadcast, whereas the split can play the role of Spark's `Partition`.
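A minimal sketch of that direction (hypothetical trait shapes, not the PR's code):

```rust
// The source plays Spark's broadcast (shipped once to every worker) and the
// split plays Spark's Partition (shipped once per task); both carry their own
// wire format, so no provider-side downcasting is needed.
pub trait Split: Send + Sync {
    fn serialize(&self) -> VortexResult<Vec<u8>>;
}

pub trait DataSource: Send + Sync {
    fn serialize(&self) -> VortexResult<Vec<u8>>;
}
```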
```rust
/// Serialize a source split to bytes.
async fn serialize_split(&self, split: &dyn Split) -> VortexResult<Vec<u8>>;
```
this is the plan, not the split data, right?
Dataset API
A unified API to sit between query engines and data sources that preserves support for late materialization, deferred decompression, and alternate device buffers.
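A rough end-to-end sketch of how the quoted pieces might compose. Only `ScanRequest`, `Split`, `DataSourceProvider`, `init_source`, and `serialize_split` appear in the diff above; `scan`, `next_splits`, and the batch size of 64 are assumptions for illustration:

```rust
// Hypothetical engine-side planning: open the source, plan a scan, and ship
// each split to a worker as opaque bytes, Spark-Partition style.
async fn plan_scan(
    provider: &dyn DataSourceProvider,
    uri: &str,
) -> VortexResult<Vec<Vec<u8>>> {
    let source = provider.init_source(uri.to_string()).await?;
    let scan = source.scan(ScanRequest::default()).await?;

    let mut tasks = Vec::new();
    for split in scan.next_splits(64).await? {
        tasks.push(provider.serialize_split(split.as_ref()).await?);
    }
    Ok(tasks)
}
```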
Open Questions