RFC-8: Collections#343
#### `Collection` keys

* `"type"` (required). Value must be `"collection"`.
* `"nodes"` (required). Value must be an array of `CollectionNode` or `Collection` objects.
since every node has a unique name, why is this an array and not an object?
Yeah that could also work.
I wonder if representing an order may be desired, though. For example, https://ngff.openmicroscopy.org/latest/index.html#bf2raw states "Parsers like Bio-Formats define a strict, stable ordering of the images in a single container ...".
If it were an object the ordering would likely get lost in some JSON implementations. It could be represented through sortable node names, but that also seems less convenient.
order might also be useful for collections of layers in the context of an image visualization tool. Although you can always add an "order" field to the elements that's an integer (sort of the reverse of adding a "name" field that must be unique in the container).
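A minimal Python sketch of the trade-off discussed above: an array carries the order implicitly, while an object keyed by name needs an explicit field to keep it. The node names and the `order` field are hypothetical; the RFC text quoted here only specifies `nodes` as an array.

```python
import json

# Hypothetical nodes; only "name" and "type" appear in the RFC text above.
nodes_as_array = [
    {"name": "series_0", "type": "multiscale"},
    {"name": "series_1", "type": "multiscale"},
    {"name": "overview", "type": "multiscale"},
]


def check_unique_names(nodes):
    """Raise ValueError if two nodes in the same container share a name."""
    names = [node["name"] for node in nodes]
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        raise ValueError(f"duplicate node names: {duplicates}")


def to_object_with_order(nodes):
    """Object-keyed variant: the position moves into an explicit 'order' field."""
    check_unique_names(nodes)
    return {
        node["name"]: {**{k: v for k, v in node.items() if k != "name"}, "order": i}
        for i, node in enumerate(nodes)
    }


def to_array(nodes_by_name):
    """Recover the ordered array from the object-keyed variant."""
    ordered = sorted(nodes_by_name.items(), key=lambda item: item[1]["order"])
    return [
        {"name": name, **{k: v for k, v in node.items() if k != "order"}}
        for name, node in ordered
    ]


as_object = to_object_with_order(nodes_as_array)
# Simulate a JSON implementation that does not preserve the original key order.
shuffled = json.loads(json.dumps(as_object, sort_keys=True))
round_tripped = to_array(shuffled)
```

With the explicit `order` field the original sequence survives the reordering of keys, at the cost of one more field that writers have to keep consistent.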
### Metadata

This RFC defines two main objects for OME-Zarr: `Collection`, `CollectionNode`.
A CollectionNode can be a Collection, so it's a bit confusing to say that these are two objects unless you explain that "object" here means something like "interface" or "protocol"
What would be the best term here? Is it a class?
As I understand it, there are currently 3 entities that need to be defined:
- collection
- multiscales
- root
collection and multiscales can be discriminated based on their type field, and collection has attributes that multiscales does not, so regular inheritance from a base class doesn't express their relationship very well.
Maybe defining these as protocols would work? e.g., there's a core Node protocol, which has the fields {type, name, attributes}, and objects that implement Node can also implement Collection OR Multiscales (but not both, because of the requirement on the type key). Finally, there's a Root protocol which can only be implemented by a Collection.
Presumably bioformats2raw.layout and plate collections will still be around (not removed with this proposal). So a Node could be Collection or Multiscales or bioformats2raw or plate?
actually I was wrong, regular inheritance isn't problematic for Collection and Multiscales -- there's a base Node, Collection and Multiscales (and anything else) inherit from Node (totally fine for them to add new attributes as children).
As for the requirement that there be only 1 root node, I don't think that can be expressed in a type system easily as long as the root is structurally compatible with a Collection, but that can be added as a regular requirement.
If the requirement is just that the root node have version (weaker than requiring that only the root node have version), then this is a bit simpler.
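The inheritance idea from this thread can be sketched in Python as follows. This is only an illustration under assumptions: the class names mirror the entities named in the comments, the `"multiscale"` type string is taken from the JSON example elsewhere in this conversation, and `validate_root` is a hypothetical helper that implements the weaker rule (the root has a `version`) as an ordinary runtime check.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Node:
    """Base entity: the fields shared by every node (type, name, attributes)."""
    name: str
    type: str
    attributes: dict[str, Any] = field(default_factory=dict)


@dataclass
class Multiscales(Node):
    """A node whose type is fixed; it adds no child nodes in this sketch."""
    type: str = "multiscale"


@dataclass
class Collection(Node):
    """A node that contains other nodes; the root is assumed to carry a version."""
    type: str = "collection"
    nodes: list[Node] = field(default_factory=list)
    version: Optional[str] = None


def validate_root(root: Node) -> None:
    """Weaker rule from the thread: the root is a Collection that has a version."""
    if not isinstance(root, Collection):
        raise TypeError("root must be a Collection")
    if root.version is None:
        raise ValueError("root Collection must declare a version")
```

Because a root is structurally just a `Collection`, the "exactly one root" rule stays outside the type system here, which matches the point made above.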
> Presumably `bioformats2raw.layout` and `plate` collections will still be around (not removed with this proposal). So a `Node` could be `Collection` or `Multiscales` or `bioformats2raw` or `plate`?
The idea is to remove bioformats2raw.layout and plate as separate entities with this proposal and express the functionality through attributes in the collection nodes. We need to work more on these.
Could this work similar to how I proposed it for the coordinate transforms? In essence, the paths specified in the plate metadata could be allowed to contain a Collection, which would contain the reference to the path.
this is looking really cool!

Looks nice! As a quick initial comment, it would be super helpful to have a minimal example that demonstrates the new metadata structure being proposed - the webknossos examples are nice, but I'm struggling to distinguish what's required and optional in those files because there are lots of extra (I think?) attributes.
propose interface between rfc5 and rfc8
    }, {
      "name": "..",
      "type": "collection",
      "path": "./nested_collection.json"
The collection should be a directory that contains a zarr.json, right?
e.g. "path": "./nested_collection.zarr"
Ah, now I see that this standalone JSON file is proposed as part of this RFC. But that isn't covered until much later, under Examples ("Where is this collection metadata stored?"). Maybe that should be moved up above this point?
If an implementation is using e.g. zarr-python or another zarr library to retrieve zarr metadata, then it may be kinda painful to also support fetching of vanilla file.json files using a different mechanism? Don't know about other libs.
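To make the concern concrete, here is a small Python sketch of the branching a reader would need. It rests on an assumption that is not confirmed by the RFC text quoted here: a path ending in `.json` is treated as a standalone collection file, and any other path as a Zarr hierarchy whose metadata lives in `zarr.json`. The function name `metadata_location` is hypothetical.

```python
import posixpath


def metadata_location(node_path: str) -> str:
    """Return where a reader would look for the metadata of a referenced node.

    Assumption for illustration: a path ending in ".json" is a standalone
    collection file; anything else is a Zarr hierarchy whose metadata lives
    in "zarr.json" inside that directory.
    """
    if node_path.endswith(".json"):
        return posixpath.normpath(node_path)
    return posixpath.normpath(posixpath.join(node_path, "zarr.json"))
```

The first branch is the one a pure Zarr library would not cover by itself, since it has to fetch a plain file outside any Zarr hierarchy.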
I started a basic implementation of Collections spec for the validator at ome/ome-ngff-validator#62.
As part of the CCP-volumeEM OME-NGFF Hackathon at EMBL-EBI Hinxton, we wrote down the following user story:

User Story: Large-Scale Multi-Beam vEM Tiling

Definitions

User Story

Currently this is solved by stitching the sub-images to a single tile, each tile is saved as an individual pyramidal tiff containing the first 3 zoom levels. For optimized viewing the fourth zoom level is created by stitching 16 (?) tiles and saving that as an individual tiff. This raises 2 challenges:

To handle problem 1, the OME-NGFF metadata must scale to coordinateTransformation for each of the N (100 - 1600 if tiles are OME-Zarr, 6400 - 100k if sub-images are OME-Zarrs) sub-images to position them within a shared 2D physical space. This would allow a viewer to render a "stitched" multiscale global view that only exists where data was actually acquired, while still providing direct access to the underlying raw overlapping FOVs regardless of their local grid density.

Rough idea of the Example file structure, where each slice in a volume is saved as an OME-Zarr collection with shared low-resolution pyramid layers:
User story 1: combining independent segmentation zarrs with raw image zarrs.

We produce multiscale zarrs of our raw microscope images, using filtered downsampling. In viewers we want to give users an easy way of combining the two. In particular, our users are interested in seeing the data as if it were actually separate channels of the same volume. This may or may not be a viewer implementation detail, but it could be interesting if the spec supported this, pointing to two separate zarrs and treating them as consecutive channels. For our viewer, this only works if the spatial dimensions are the same, and can be transformed to the same origin (always trivially true for the data I describe).

User story 2: dataset releases.

Is it practical to have one single very large collection, as in 1000s of zarrs or more? We would likely produce collections of matched raw+segmentation zarrs as described in my user story 1.
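The precondition in user story 1 (same spatial dimensions before two zarrs can be shown as consecutive channels) could be checked roughly like this in Python. Everything here is an assumption for illustration: the axis names, the `"axes"`/`"shape"` dictionaries, and the function names are hypothetical, and the sketch ignores coordinate transforms entirely.

```python
SPATIAL_AXES = ("z", "y", "x")  # assumed axis names for this sketch


def spatial_shape(axes, shape):
    """Pick out the sizes of the spatial axes, in z/y/x order."""
    sizes = dict(zip(axes, shape))
    return tuple(sizes[a] for a in SPATIAL_AXES if a in sizes)


def combined_channel_shape(raw, segmentation):
    """Shape of a virtual (c, z, y, x) volume, or ValueError if the grids differ.

    Each argument is a hypothetical dict with "axes" and "shape" keys.
    """
    raw_spatial = spatial_shape(raw["axes"], raw["shape"])
    seg_spatial = spatial_shape(segmentation["axes"], segmentation["shape"])
    if raw_spatial != seg_spatial:
        raise ValueError(f"spatial shapes differ: {raw_spatial} vs {seg_spatial}")
    # An array without a channel axis counts as a single channel.
    channels = sum(
        dict(zip(a["axes"], a["shape"])).get("c", 1) for a in (raw, segmentation)
    )
    return (channels, *raw_spatial)
```

A viewer or a spec-level construct would additionally have to confirm that both arrays map to the same origin, which this check leaves out.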
Minor HCS spec updates
TODO:

This post is a bit stream of consciousness-y - I hope I manage to express the bump I am stumbling over with the current state of transforms in here. In the version of this RFC, when

In ome/ngff-spec#117, this was made more explicit, so that these

    "input": {
      "path": "./scale0",
      "node": "node_name",
      "name": "coordinate_system_name"
    }

And I think porting over this formalism is important, because instances of

This has implications. In RFC8, the transforms for

    {
      "ome": {
        "version": "0.x",
        "type": "collection",
        "name": "example",
        "attributes": {
          "coordinateSystems": [
            {
              "id": "world",
              "name": "world",
              "axes": [...]
            }
          ]
        },
        "nodes": [{
          "name": "raw",
          "type": "multiscale",
          "nodes": [{
            "id": "raw_0",
            "type": "singlescale",
            "path": {
              "type": "zarr",
              "path": "./raw/0"
            },
            "attributes": {
              "coordinateTransformations": [
                {
                  "type": "scale",
                  "scale": [1, 1, 1],
                  "input": {
                    "path": "raw_0"
                  },
                  "output": {
                    "name": "world",
                    "node": "raw_0"
                  }
                }
              ]
            }
          }, ...]
        }, ... ]
      }
    }

The question I'm stuck with now: If the Singlescale is not inlined - where does the

I don't have a good idea about which to prefer, though.
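Independently of where a transform ends up being stored, the by-name references in the example above can be checked mechanically. The following Python sketch walks a collection and reports transforms whose `output.name` is not declared in any `coordinateSystems` list. The function names are hypothetical, and the lookup rule (any declaration anywhere in the tree counts) is an assumption, not something the RFC states.

```python
def declared_coordinate_systems(node):
    """Collect coordinate system names declared on a node and all nested nodes."""
    attributes = node.get("attributes", {})
    names = {cs["name"] for cs in attributes.get("coordinateSystems", [])}
    for child in node.get("nodes", []):
        names |= declared_coordinate_systems(child)
    return names


def iter_transforms(node):
    """Yield (owner, transform) pairs for a node and all nested nodes."""
    owner = node.get("name", node.get("id"))
    for transform in node.get("attributes", {}).get("coordinateTransformations", []):
        yield owner, transform
    for child in node.get("nodes", []):
        yield from iter_transforms(child)


def dangling_outputs(root):
    """List (owner, name) pairs whose output coordinate system is undeclared."""
    known = declared_coordinate_systems(root)
    return [
        (owner, transform["output"]["name"])
        for owner, transform in iter_transforms(root)
        if transform["output"]["name"] not in known
    ]
```

A check like this only depends on the graph of names, which is one way to see that the coordinate system graph can be validated separately from the node layout.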
lubianat left a comment
(sorry, the approval was a misclick on GH mobile when hastily ok'ing ome/ngff-spec#128)
will-moore left a comment
Just adding comments, but seems I have to create a review...
Seems that adding comments to the changes page isn't working for me at the moment. So I'll add some here
Many of the collections I would like to represent with this spec contain images of different OME-Zarr versions. E.g. the figure at https://ome.github.io/omero-figure/?file=https://gist.githubusercontent.com/will-moore/75a7f0de5be0f7b4202d5f0229cadcc9/raw/ngff_images_figure.json or the list of samples at https://idr.github.io/ome-ngff-samples/ so this would be a blocker for many use-cases.
I'm not sure what the motivation is for
Thanks @will-moore!
Collections will likely be a feature of OME-Zarr 1.0. I don't think it is reasonable to referentially include all previous versions of the spec in the 1.0 release because of the burden that would put on implementations.
The motivation for
Multiscales are now collections of Singlescales. The field
Multiscales with a single Singlescale are not disallowed, but not required anymore. Users can just create Singlescales as Zarr arrays without the need for enclosing Zarr groups.

could you define the term "image" to mean "a Zarr array", and "multiscale image" to mean "a collection of images at different levels of detail". Starting with the more basic thing (a single array) and defining the collection in terms of that seems better than starting with the collection (multiscales) and defining the more basic thing in terms of it.
It feels like we have been working on RFC-5 for a long time and have finally reached a consensus on transforms and scenes etc. But even before v0.6 is released we are proposing to re-work all that again (and other core concepts like Multiscales.datasets that have been around since v0.1). Are we saying that OME.zarr data v0.6 and earlier are not expected to be supported by tools that read v1.0 because they are too different? That would discourage adoption of OME.zarr v0.6 because it's sunsetted even before it's released. My first impression of RFC-8 was that it's a way of grouping existing Multiscales images into Collections. But this proposal looks like starting from scratch and ditching previous work and support for existing data? I'm not even sure I fully understand @jo-mueller's question above, except that it shows all the hard RFC-5 discussions are going to need to be revisited again?

@will-moore thanks for the feedback. About my comment above, I think discussing intents and structure last week in Düsseldorf helped to structure my ideas for RFC8. I opened normanrz#4 with some suggestions that address some of my concerns.
I appreciate the design work that has gone into RFC-5 and I think RFC-8 is building on top of that. I'll review with @jo-mueller next week whether to bring back the scene metadata.
I think it is important to look at RFC-8 as part of the long-term vision of the 1.0 release. This probably warrants its own RFC, but in my view 1.0 is supposed to be a long-term release that carries us through the next decade without breaking changes. Up until now every release of OME-Zarr has been breaking and I think that needs to stop to foster serious adoption. That also means this is the last opportunity in a while to break things in order to make the OME-Zarr spec more consistent and extensible. Basically, take all the learnings from the 0.x releases and make a great long-term 1.0 release.
I definitely think that tools should be considered compliant with the 1.0-spec if they only support v1.0 and no previous versions. This is already the case with 0.x versions. Only very few tools understand 0.1-0.3 and some tools only understand 0.5 and not 0.4 anymore. I think that is totally fine, because they are 0.x releases. That being said, I think the extension mechanism could be used to include 0.x OME-Zarrs in 1.0 Collections. Just define an extension node type that references 0.x multiscales. Tools could voluntarily support that, if they find it useful. I want to add that 0.5 -> 0.6 -> 1.0 are metadata-only changes. I don't think it is unreasonable for users to consider migrating the metadata. This will be less of a lift than the 2024 NGFF challenge, where we actually converted the data.
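As a rough Python illustration of the "voluntary support" idea: a reader could recognise the core node types and opt in to an extension type for 0.x data. The extension type name `x-legacy-multiscales` and the function `readable_nodes` are invented for this sketch; the RFC's actual extension mechanism is not shown in this conversation.

```python
# Core types as they appear in the examples in this thread.
CORE_TYPES = {"collection", "multiscale", "singlescale"}
# Hypothetical extension type name for nodes that reference 0.x multiscales.
LEGACY_TYPE = "x-legacy-multiscales"


def readable_nodes(collection, extensions=frozenset()):
    """Return names of directly contained nodes that a reader can interpret.

    Nodes with an unsupported type are skipped rather than treated as errors,
    so a 1.0-only tool can still open the rest of the collection.
    """
    supported = CORE_TYPES | set(extensions)
    return [
        node["name"]
        for node in collection.get("nodes", [])
        if node["type"] in supported
    ]
```

A tool that only implements 1.0 calls `readable_nodes(collection)`, while a tool that chooses to support older data passes `{LEGACY_TYPE}` as `extensions`.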
seconding norman's POV. And a broader point about churn: churn during development is valuable if it buys a better released product. This churn affects devs for months, but users will interact with 1.0 for years. It would be unfortunate if they had to tolerate a deficient product because devs settled too early. Now is the time to fix stuff. It only gets harder later.
I think this is a super-useful discussion here. If anything, it will help RFC8 authors to get a feeling from which direction to expect feedback or sharpen RFC8 towards. I think there are two separate things to take from this discussion:

Minimally, I think the relationship between coordinate system and nodes needs to be clarified. To a degree, this already happened in 0.6.dev3 -> 0.6.dev4. The important thing to note here is that coordinate systems and transformations define their own graph like structure, that can be independent of the collection/node layout. Since a

The other thing is the following:

I'm not so sure about that. In 0.x, the smallest interpretable, indivisible aggregation of data and metadata is the

The introduction of the

Don't get me wrong, I'm not opposed to renaming

What I propose in normanrz#4 is simply a stratification and clarification of where metadata sits and what collections are expected to collect:

This is currently not necessarily the case with the

Imho, making this restriction doesn't take from the expressiveness and elegance of RFC8, but adds to the integrity and reliability of images - aka multiscales - as an essential concept in the spec.
This is the work-in-progress draft for RFC-8.
cc @jluethi @lorenzocerrone @tischi @perlman @matthewh-ebi