(retriever) Add .store() task for persisting extracted images (#1675)#1714
- Add store_extracted_images() with fsspec/UPath support for local and cloud storage
- Wire .store() into InProcessIngestor, BatchIngestor, and batch_pipeline CLI
- Add StoreParams model, unit tests, and fsspec/universal-pathlib deps
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
…mn name fix
- Fix column names (table/chart/infographic not plural) to match OCR output
- Add magic byte sniffing so file extension matches actual image encoding
- Make base64 stripping opt-in (strip_base64=False) to preserve multimodal compat
- Add multimodal embed interaction tests and format consistency tests
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
storage_uri to absolute
Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
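The magic byte sniffing mentioned in the commits above can be illustrated with a small standalone sketch. This is not the PR's implementation: the function names `sniff_image_extension` and `extension_for_b64`, the signature table, and the `default` fallback are all assumptions made for illustration. It only shows the general technique of choosing a file extension from the leading bytes of the decoded image rather than trusting a declared format.

```python
import base64

# Well-known file signatures (magic bytes) mapped to the extension to use.
# The set of formats covered here is an assumption, not taken from the PR.
_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
)


def sniff_image_extension(data: bytes, default: str = "png") -> str:
    """Return an extension based on the leading bytes, or `default` if unknown."""
    for signature, ext in _SIGNATURES:
        if data.startswith(signature):
            return ext
    # WebP is a RIFF container with the tag "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return default


def extension_for_b64(image_b64: str, default: str = "png") -> str:
    """Decode a base64 payload and sniff its extension (hypothetical helper)."""
    return sniff_image_extension(base64.b64decode(image_b64), default)
```

With a check like this, a page image that is actually JPEG-encoded would be written with a `.jpeg` extension even if a caller asked for PNG, which is consistent with the page images listed as `.jpeg` in the results tables.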
jperez999 left a comment:
Would love to see an actual graph unit test for this store behavior. Perhaps we could run a few of the stages in a graph locally, get to the OCR stage, and export. Might be worth having tests that show we can export at multiple stages in the pipeline and get the things we are targeting at each store call.
    lancedb: LanceDbParams = Field(default_factory=LanceDbParams)

    class StoreParams(_ParamsModel):
Does this allow me to store full chunks also? I think that is the only permutation you don't have here: text.
    from nemo_retriever.params import StoreParams

    def _make_tiny_png_b64(width: int = 4, height: int = 4, color=(255, 0, 0)) -> str:
Some of these functions are doubled up in other files. We are going to need to consolidate that.
| Content type | Files | Format |
|---|---|---|
| Page images | 496 | .jpeg |
| Table crops | 147 | .png |
| Chart crops | 192 | .png |
| Infographic crops | 0 | — |
| Natural images | 0 | — |
| Text files | 0 | — |
| Total | 835 | 230 MB |
bo20 with text: images + text (store_text=True)
| Content type | Files | Format |
|---|---|---|
| Page images | 496 | .jpeg |
| Table crops | 147 | .png |
| Chart crops | 192 | .png |
| Page text | 423 | .txt |
| Table text | 147 | .txt |
| Chart text | 192 | .txt |
| Total | 1,597 | 233 MB |
Image counts are identical between runs — store_text only adds .txt files alongside existing content. Text adds ~3 MB (negligible).
What each text file contains
Page text (page_1.txt) — full OCR'd text for the page:
All Numbers in This Report
Have Been Rounded To
The Nearest Dollar
ANNUAL FINANCIAL REPORT
UPDATE DOCUMENT
Table text (page_97_table_0.txt) — markdown table output from OCR:
| TOWN | OF | Pawling |
| Schedule | of | Time | Deposits | and | Investments |
| For | the | Fiscal | Year | Ending | 2011 |
Chart text (page_3_chart_0.txt) — OCR'd chart description with data:
Caution Required: Comparing Canada's Debt to that of Other Countries
Figure 1: Net Debt as a Share of GDP (2020) for Select Advanced Countries
Notes
- 423 page text files vs 496 page images: 73 pages had empty/whitespace-only text (skipped by design)
- Table/chart text counts match their image crop counts 1:1
- `store_text` defaults to `False`: opt-in only, no behavior change for existing users
- `StoreOperator` is now a graph node placed after OCR/caption, before embed. This also fixes the previous issue where store was silently skipped under graph runtime.
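The opt-in and skip-empty behavior described in these notes can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `write_text_if_nonempty`, `store_page_texts`, and `demo` are hypothetical names, and the `page_<n>.txt` naming is borrowed from the example file names above.

```python
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def write_text_if_nonempty(text: Optional[str], dest: Path) -> bool:
    """Write `text` to `dest` unless it is missing or whitespace-only."""
    if text is None or not text.strip():
        return False  # nothing worth persisting; no file is created
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(text, encoding="utf-8")
    return True


def store_page_texts(
    pages: Mapping[int, Optional[str]], out_dir: Path, store_text: bool = False
) -> int:
    """Return how many page_<n>.txt files were written (0 when store_text is off)."""
    if not store_text:
        return 0  # opt-in: the default leaves existing behavior unchanged
    return sum(
        write_text_if_nonempty(text, out_dir / f"page_{number}.txt")
        for number, text in pages.items()
    )


def demo() -> int:
    """Store three sample pages in a temp dir; only the non-empty one is written."""
    with tempfile.TemporaryDirectory() as tmp:
        pages = {1: "ANNUAL FINANCIAL REPORT", 2: "   \n", 3: None}
        return store_page_texts(pages, Path(tmp), store_text=True)
```

A rule of this shape would explain why the text-file count (423) can be lower than the page-image count (496) while the image counts stay the same between runs.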
Re: test helper consolidation
Investigated consolidating …
Will consolidate in a follow-up PR. Options being considered:
Summary
Adds a new `.store()` pipeline task to `nemo_retriever` that persists extracted images (full-page + sub-page table/chart/infographic/image assets) to local or cloud storage via fsspec/UPath.
- … `.store()` into `InProcessIngestor`, `BatchIngestor` (`map_batches`), and batch CLI via `--store-images-uri`
- … `StoreParams` with per-content toggles, `image_format`, and `strip_base64` (default `False`: opt-in memory reduction, preserves multimodal embedding compatibility)
- … `encoding` field + magic-byte sniff fallback); cropped outputs follow configured `image_format`
- … `public_base_url`, `storage_options`) available via `StoreParams` in library use; CLI exposes `--store-images-uri` for common case
Test plan
- `test_io_image_store.py`: page images, structured crops, natural images, format consistency, stripping, edge cases
- `test_multimodal_embed.py`: verifies store → embed behavior for multimodal with/without stripping
- `bo20` dataset (20 PDFs, 496 pages) → 842 stored assets (496 page images + 346 table/chart crops)
Known limitations (follow-ups)
- … `table`, `chart`, `infographic`) matching OCR stage output
- `image_b64` skip: if direct `image_b64` exists but decode fails, we skip rather than falling back to `bbox_xyxy_norm` crop from page image
- `_safe_stem` collision: two files with same basename but different directories share output subdirectory; hash-based naming is a follow-up
Closes #1675
🤖 Generated with Claude Code
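The `_safe_stem` collision listed under known limitations, and the hash-based naming proposed as a follow-up, can be illustrated with a short sketch. The actual `_safe_stem` implementation is not shown in this conversation, so `basename_stem` below is only a stand-in for any basename-only scheme. `hashed_stem`, the choice of SHA-256, and the 8-character digest length are likewise assumptions made for illustration.

```python
import hashlib
from pathlib import PurePosixPath


def basename_stem(source_path: str) -> str:
    """Basename-only stem: files in different directories can collide."""
    return PurePosixPath(source_path).stem


def hashed_stem(source_path: str, digest_chars: int = 8) -> str:
    """Stem plus a short SHA-256 digest of the full path, so directories matter."""
    digest = hashlib.sha256(source_path.encode("utf-8")).hexdigest()[:digest_chars]
    return f"{PurePosixPath(source_path).stem}_{digest}"
```

For example, `reports/2011/annual.pdf` and `reports/2012/annual.pdf` both map to `annual` under the basename-only scheme, while the hashed variant gives each a distinct, deterministic output subdirectory name. A short digest keeps names readable at the cost of a small residual collision probability.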