Skip to content

(retriever) Add .store() task for persisting extracted images (#1675)#1714

Open
jioffe502 wants to merge 6 commits intomainfrom
jioffe/store-extracted-images
Open

(retriever) Add .store() task for persisting extracted images (#1675)#1714
jioffe502 wants to merge 6 commits intomainfrom
jioffe/store-extracted-images

Conversation

@jioffe502
Copy link
Copy Markdown
Collaborator

@jioffe502 jioffe502 commented Mar 24, 2026

Summary

Adds a new .store() pipeline task to nemo_retriever that persists extracted images (full-page + sub-page table/chart/infographic/image assets) to local or cloud storage via fsspec/UPath.

  • Wire .store() into InProcessIngestor, BatchIngestor (map_batches), and batch CLI via --store-images-uri
  • Add StoreParams with per-content toggles, image_format, and strip_base64 (default False — opt-in memory reduction, preserves multimodal embedding compatibility)
  • Ensure direct-write file extensions match actual encoded bytes (declared encoding field + magic-byte sniff fallback); cropped outputs follow configured image_format
  • Fine-grained control (per-content-type toggles, public_base_url, storage_options) available via StoreParams in library use; CLI exposes --store-images-uri for common case

Test plan

  • test_io_image_store.py: page images, structured crops, natural images, format consistency, stripping, edge cases
  • test_multimodal_embed.py: verifies store → embed behavior for multimodal with/without stripping
  • Integration: bo20 dataset (20 PDFs, 496 pages) → 842 stored assets (496 page images + 346 table/chart crops)
uv run pytest tests/test_io_image_store.py tests/test_multimodal_embed.py -q

Known limitations (follow-ups)

  • Schema coupling: store targets singular structured columns (table, chart, infographic) matching OCR stage output
  • Corrupt image_b64 skip: if direct image_b64 exists but decode fails, we skip rather than falling back to bbox_xyxy_norm crop from page image
  • _safe_stem collision: two files with same basename but different directories share output subdirectory; hash-based naming is a follow-up
  • Observability: per-type write counters (pages, tables, charts, failures) deferred

Closes #1675

🤖 Generated with Claude Code

jioffe502 and others added 2 commits March 24, 2026 18:41
- Add store_extracted_images() with fsspec/UPath support for local and cloud storage
- Wire .store() into InProcessIngestor, BatchIngestor, and batch_pipeline CLI
- Add StoreParams model, unit tests, and fsspec/universal-pathlib deps

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
…mn name fix

- Fix column names (table/chart/infographic not plural) to match OCR output
- Add magic byte sniffing so file extension matches actual image encoding
- Make base64 stripping opt-in (strip_base64=False) to preserve multimodal compat
- Add multimodal embed interaction tests and format consistency tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
@jioffe502 jioffe502 requested review from a team as code owners March 24, 2026 20:21
@jioffe502 jioffe502 requested a review from edknv March 24, 2026 20:21
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
  storage_uri to absolute

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
Copy link
Copy Markdown
Collaborator

@jperez999 jperez999 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would love to see an actual graph unit test for this store behavior. Perhaps we could run a few of the stages in a graph locally get to the OCR stage and export. Might be worth having tests that show we can export at multiple stages in the pipeline and get the things we are targeting at each store call.

lancedb: LanceDbParams = Field(default_factory=LanceDbParams)


class StoreParams(_ParamsModel):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this allow me to store full chunks also... I think that is the only permutation you dont have here, text..?

from nemo_retriever.params import StoreParams


def _make_tiny_png_b64(width: int = 4, height: int = 4, color=(255, 0, 0)) -> str:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of these functions are doubled up in other files. We are going to need to consolidate that.

@jioffe502
Copy link
Copy Markdown
Collaborator Author

store_text=True validation on bo20

Ran bo20 through StoreOperator (now a proper graph node) with store_text=True to validate text storage output. @jperez999 — wanted your eyes on whether this is the text output we expect before we finalize.

bo20 baseline: images only (store_text=False, default)

Content type Files Format
Page images 496 .jpeg
Table crops 147 .png
Chart crops 192 .png
Infographic crops 0
Natural images 0
Text files 0
Total 835 230 MB

bo20 with text: images + text (store_text=True)

Content type Files Format
Page images 496 .jpeg
Table crops 147 .png
Chart crops 192 .png
Page text 423 .txt
Table text 147 .txt
Chart text 192 .txt
Total 1,597 233 MB

Image counts are identical between runs — store_text only adds .txt files alongside existing content. Text adds ~3 MB (negligible).

What each text file contains

Page text (page_1.txt) — full OCR'd text for the page:

All Numbers in This Report
Have Been Rounded To
The Nearest Dollar
ANNUAL FINANCIAL REPORT
UPDATE DOCUMENT

Table text (page_97_table_0.txt) — markdown table output from OCR:

| TOWN | OF | Pawling |
| Schedule | of | Time | Deposits | and | Investments |
| For | the | Fiscal | Year | Ending | 2011 |

Chart text (page_3_chart_0.txt) — OCR'd chart description with data:

Caution Required: Comparing Canada's Debt to that of Other Countries
Figure 1: Net Debt as a Share of GDP (2020) for Select Advanced Countries

Notes

  • 423 page text files vs 496 page images: 73 pages had empty/whitespace-only text (skipped by design)
  • Table/chart text counts match their image crop counts 1:1
  • store_text defaults to False — opt-in only, no behavior change for existing users
  • StoreOperator is now a graph node placed after OCR/caption, before embed. This also fixes the previous issue where store was silently skipped under graph runtime.

@jioffe502
Copy link
Copy Markdown
Collaborator Author

Re: test helper consolidation

Investigated consolidating _make_tiny_png_b64 and its variants across the 5 test files. Hit a structural blocker:

  • tests/ sits outside src/, so nemo_retriever.tests isn't an importable package (pyproject.toml: [tool.setuptools.packages.find] where = ["src"])
  • conftest.py direct imports (from conftest import ...) also fail because tests/__init__.py makes pytest use package-based imports instead of adding conftest to sys.path

Will consolidate in a follow-up PR. Options being considered:

  • Add tests to pythonpath in pytest config
  • Move shared helpers into an installed nemo_retriever.testing submodule under src/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA]: nemo_retriever library: Support storing extracted images to disk

2 participants