(retriever) Add .store() task for persisting extracted images (#1675) by jioffe502 · Pull Request #1714 · NVIDIA/NeMo-Retriever

jioffe502 · 2026-03-24T20:21:32Z

Summary

Adds a new .store() pipeline task to nemo_retriever that persists extracted images (full-page + sub-page table/chart/infographic/image assets) to local or cloud storage via fsspec/UPath.

Wire .store() into InProcessIngestor, BatchIngestor (map_batches), and batch CLI via --store-images-uri
Add StoreParams with per-content toggles, image_format, and strip_base64 (default False — opt-in memory reduction, preserves multimodal embedding compatibility)
Ensure direct-write file extensions match actual encoded bytes (declared encoding field + magic-byte sniff fallback); cropped outputs follow configured image_format
Fine-grained control (per-content-type toggles, public_base_url, storage_options) available via StoreParams in library use; CLI exposes --store-images-uri for common case

Test plan

test_io_image_store.py: page images, structured crops, natural images, format consistency, stripping, edge cases
test_multimodal_embed.py: verifies store → embed behavior for multimodal with/without stripping
Integration: bo20 dataset (20 PDFs, 496 pages) → 842 stored assets (496 page images + 346 table/chart crops)

uv run pytest tests/test_io_image_store.py tests/test_multimodal_embed.py -q

Known limitations (follow-ups)

Schema coupling: store targets singular structured columns (table, chart, infographic) matching OCR stage output
Corrupt image_b64 skip: if direct image_b64 exists but decode fails, we skip rather than falling back to bbox_xyxy_norm crop from page image
_safe_stem collision: two files with same basename but different directories share output subdirectory; hash-based naming is a follow-up
Observability: per-type write counters (pages, tables, charts, failures) deferred

Closes #1675

🤖 Generated with Claude Code

- Add store_extracted_images() with fsspec/UPath support for local and cloud storage - Wire .store() into InProcessIngestor, BatchIngestor, and batch_pipeline CLI - Add StoreParams model, unit tests, and fsspec/universal-pathlib deps Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>

…mn name fix - Fix column names (table/chart/infographic not plural) to match OCR output - Add magic byte sniffing so file extension matches actual image encoding - Make base64 stripping opt-in (strip_base64=False) to preserve multimodal compat - Add multimodal embed interaction tests and format consistency tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>

storage_uri to absolute Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>

jperez999

Would love to see an actual graph unit test for this store behavior. Perhaps we could run a few of the stages in a graph locally get to the OCR stage and export. Might be worth having tests that show we can export at multiple stages in the pipeline and get the things we are targeting at each store call.

jperez999 · 2026-03-30T23:53:33Z

nemo_retriever/src/nemo_retriever/params/models.py

    lancedb: LanceDbParams = Field(default_factory=LanceDbParams)


+class StoreParams(_ParamsModel):


Does this allow me to store full chunks also... I think that is the only permutation you dont have here, text..?

jperez999 · 2026-03-30T23:55:08Z

nemo_retriever/tests/test_io_image_store.py

+from nemo_retriever.params import StoreParams
+
+
+def _make_tiny_png_b64(width: int = 4, height: int = 4, color=(255, 0, 0)) -> str:


Some of these functions are doubled up in other files. We are going to need to consolidate that.

jioffe502 · 2026-04-01T14:57:15Z

`store_text=True` validation on bo20

Ran bo20 through StoreOperator (now a proper graph node) with store_text=True to validate text storage output. @jperez999 — wanted your eyes on whether this is the text output we expect before we finalize.

bo20 baseline: images only (`store_text=False`, default)

Content type	Files	Format
Page images	496	`.jpeg`
Table crops	147	`.png`
Chart crops	192	`.png`
Infographic crops	0	—
Natural images	0	—
Text files	0	—
Total	835	230 MB

bo20 with text: images + text (`store_text=True`)

Content type	Files	Format
Page images	496	`.jpeg`
Table crops	147	`.png`
Chart crops	192	`.png`
Page text	423	`.txt`
Table text	147	`.txt`
Chart text	192	`.txt`
Total	1,597	233 MB

Image counts are identical between runs — store_text only adds .txt files alongside existing content. Text adds ~3 MB (negligible).

What each text file contains

Page text (page_1.txt) — full OCR'd text for the page:

All Numbers in This Report
Have Been Rounded To
The Nearest Dollar
ANNUAL FINANCIAL REPORT
UPDATE DOCUMENT

Table text (page_97_table_0.txt) — markdown table output from OCR:

| TOWN | OF | Pawling |
| Schedule | of | Time | Deposits | and | Investments |
| For | the | Fiscal | Year | Ending | 2011 |

Chart text (page_3_chart_0.txt) — OCR'd chart description with data:

Caution Required: Comparing Canada's Debt to that of Other Countries
Figure 1: Net Debt as a Share of GDP (2020) for Select Advanced Countries

Notes

423 page text files vs 496 page images: 73 pages had empty/whitespace-only text (skipped by design)
Table/chart text counts match their image crop counts 1:1
store_text defaults to False — opt-in only, no behavior change for existing users
StoreOperator is now a graph node placed after OCR/caption, before embed. This also fixes the previous issue where store was silently skipped under graph runtime.

jioffe502 · 2026-04-01T15:28:51Z

Re: test helper consolidation

Investigated consolidating _make_tiny_png_b64 and its variants across the 5 test files. Hit a structural blocker:

tests/ sits outside src/, so nemo_retriever.tests isn't an importable package (pyproject.toml: [tool.setuptools.packages.find] where = ["src"])
conftest.py direct imports (from conftest import ...) also fail because tests/__init__.py makes pytest use package-based imports instead of adding conftest to sys.path

Will consolidate in a follow-up PR. Options being considered:

Add tests to pythonpath in pytest config
Move shared helpers into an installed nemo_retriever.testing submodule under src/

jioffe502 and others added 2 commits March 24, 2026 18:41

jioffe502 requested review from a team as code owners March 24, 2026 20:21

jioffe502 requested a review from edknv March 24, 2026 20:21

jioffe502 added 4 commits March 25, 2026 18:33

WIP: checkpoint before merge upstream/main (OOM-related fixes)

a36a1ea

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>

Merge upstream/main into jioffe/store-extracted-images

3a31444

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>

(retriever) Fix relative path URI error in .store() — resolve

fcfc67f

storage_uri to absolute Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>

Merge upstream/main into jioffe/store-extracted-images

654ba4e

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>

jperez999 reviewed Mar 30, 2026

View reviewed changes

randerzander mentioned this pull request Apr 1, 2026

[FEA]: Clean up returned results objects #1768

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

(retriever) Add .store() task for persisting extracted images (#1675)#1714

(retriever) Add .store() task for persisting extracted images (#1675)#1714
jioffe502 wants to merge 6 commits intomainfrom
jioffe/store-extracted-images

jioffe502 commented Mar 24, 2026 •

edited

Loading

Uh oh!

jperez999 left a comment

Uh oh!

jperez999 Mar 30, 2026

Uh oh!

jperez999 Mar 30, 2026

Uh oh!

jioffe502 commented Apr 1, 2026

Uh oh!

jioffe502 commented Apr 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		lancedb: LanceDbParams = Field(default_factory=LanceDbParams)


		class StoreParams(_ParamsModel):

		from nemo_retriever.params import StoreParams


		def _make_tiny_png_b64(width: int = 4, height: int = 4, color=(255, 0, 0)) -> str:

Conversation

jioffe502 commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Known limitations (follow-ups)

Uh oh!

jperez999 left a comment

Choose a reason for hiding this comment

Uh oh!

jperez999 Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

jperez999 Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

jioffe502 commented Apr 1, 2026

store_text=True validation on bo20

bo20 baseline: images only (store_text=False, default)

bo20 with text: images + text (store_text=True)

What each text file contains

Notes

Uh oh!

jioffe502 commented Apr 1, 2026

Re: test helper consolidation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jioffe502 commented Mar 24, 2026 •

edited

Loading

`store_text=True` validation on bo20

bo20 baseline: images only (`store_text=False`, default)

bo20 with text: images + text (`store_text=True`)