Skip to content

Support JWST and Roman#25

Merged
icweaver merged 12 commits into
mainfrom
strict
Apr 4, 2026
Merged

Support JWST and Roman#25
icweaver merged 12 commits into
mainfrom
strict

Conversation

@icweaver
Copy link
Copy Markdown
Member

@icweaver icweaver commented Mar 21, 2026

Hey folks, these are a few toggles I added to try and support some of the quirks I ran into "out in the wild". I've split them into the following three commits:

  1. Don't choke on unknown tags: dc40329
    1. This just provides a fallback for STScI's many asdf tags. Could be replaced by specific converters in the future.
    2. Also adds a metadata tag and a newer ndarray-1.1.0 tag that Roman seems to use now.
    3. This feature is behind an ASDF.load(<filename.asdf>; extensions = false) kwarg by default to preserve current behavior. Similar to the extensions kwarg in the Python impl.
  2. Checksum or not to checksum: 56a8c33
    1. Apparently some Roman data products store their file's checksum based on the uncompressed file instead of compressed (ASDF.jl's default). This breaks things.
    2. Added a flag to get around this, similarly to the Python impl.
    3. 2026-04-03: Update from ASDF office hours. This seems to be a genuine bug, so we will keep this checksum validation flag in on our end. Thanks for helping us confirm this, Brett!
  3. Blocks vs Frames: 0d425a0
    1. Roman seems to use the block layout for their Lz4 compression, instead of frame. This adds support for both, and automatically handles it using magic numbers.

Usage examples:

Does this seem reasonable to folks? I'd especially appreciate feedback for the last point, which I have little experience in and relied heavily on Claude for to come up with the needed incantations.

@icweaver icweaver mentioned this pull request Mar 21, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.74%. Comparing base (ff79be9) to head (ab17e30).
⚠️ Report is 13 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #25      +/-   ##
==========================================
+ Coverage   99.70%   99.74%   +0.03%     
==========================================
  Files           1        1              
  Lines         343      389      +46     
==========================================
+ Hits          342      388      +46     
  Misses          1        1              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread src/ASDF.jl
data[3] == 0x4D && data[4] == 0x18)
return decode(LZ4FrameCodec(), data)
else
# If the data was originally created from Python's ASDF, then it will be in block instead of frame layout,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also produce the block layout? Should we? Can Python handle the frame layout?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reminding me, will add that in. Looks like they can, but as a plug-in https://github.com/asdf-format/asdf-compression

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in e012b21

There is a layout option for NDArrayWrapper now to flip between frame and block. Does that seem like a reasonable place to control things? I guess it doesn't really make sense for other compression schemes, so I just set the default layout as default

I don't love from a maintainability point of view that there is now a hand-rolled Lz4-specific encode and decode path to accommodate Python's asdf scheme. There's also the matter of compatibility with Lz4 frame support on the Python side, but since that's still an experimental plugin, maybe that can be a problem for future us to deal with.

I am now squarely out of my comfort zone and would gladly accept any suggestions for simplifying things, haha. Thanks again for taking a look!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is only for lz4 then I would call it lz4_layout, with values :frame and :block. I am sure that other compression schemes will also want to have options in the future, e.g. specifying the compression level.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, just renamed here e5b6bd4

@eschnett
Copy link
Copy Markdown
Collaborator

The checksum bit sounds weird. There is a standard for checksums, and THE major player in the field gets the implementation wrong? Does this make checksums useless in practice? Well, the world is what it is...

@icweaver
Copy link
Copy Markdown
Member Author

icweaver commented Mar 21, 2026

Yea, the checksum bit was bothering me too. Here's what I am seeing for the sample Roman data I tried:

julia> using ASDF, MD5

julia> af = ASDF.load_file("docs/data/roman.asdf"; extensions = true);

julia> ndarray = af.metadata["roman"]["data"];

julia> header = ndarray.lazy_block_headers.block_headers[ndarray.source + 1]
ASDF.BlockHeader(IOStream(<file docs/data/roman.asdf>), 176347, UInt8[0xd3, 0x42, 0x4c, 0x4b], 0x0030, 0x00000000, UInt8[0x6c, 0x7a, 0x34, 0x00], 0x0000000003ffe53b, 0x0000000003ffe53b, 0x0000000003fc0100, UInt8[0xef, 0x4e, 0x63, 0x45, 0xc4, 0xd6, 0xcd, 0xa0, 0xed, 0x4d, 0x14, 0x27, 0x43, 0xa7, 0xb2, 0xbc], true)

julia> block_data_start = header.position + 6 + header.header_size
176401

julia> seek(header.io, block_data_start)
IOStream(<file docs/data/roman.asdf>)

julia> data = Array{UInt8}(undef, header.used_size);

julia> nb = readbytes!(header.io, data)
67102011

julia> data_decompressed = ASDF.decode_Lz4(data);

julia> md5(data_decompressed) == header.checksum
true

I wonder if this behavior could just be an uncompressed data streaming thing for this specific instance with Roman's data products. I don't work with AWS S3, but it seems that this a selling point for them I guess. fwiw, the JWST data I tried looks to behave as expected (it's just downloaded as a regular file). Here are worked examples for both

Doc previews:

@icweaver icweaver added this to the Release v2 milestone Mar 25, 2026
@eschnett
Copy link
Copy Markdown
Collaborator

eschnett commented Apr 4, 2026

You have some unchecked to-do items: Is this PR ready to be merged, or are you still working on it?

@icweaver icweaver merged commit 5e5154b into main Apr 4, 2026
22 checks passed
@icweaver icweaver deleted the strict branch April 4, 2026 17:38
@icweaver
Copy link
Copy Markdown
Member Author

icweaver commented Apr 4, 2026

Thanks Erik, just updated the top-level comment and merged. According to office hours yesterday, the checksum issue is a genuine bug on the Python side, so according to Brett it sounded like we are good to go over here.

Re: DKIST and newer Roman files, most things look to work in informal tests I tried, and the remaining bits are tracked over in #32

@icweaver icweaver mentioned this pull request Apr 4, 2026
13 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants