docs(governance): AGENTS.md steward network by lbliii · Pull Request #2005 · NVIDIA-NeMo/Curator

lbliii · 2026-05-20T20:26:02Z

Summary

Introduces an AGENTS.md steward network for NeMo Curator — a root constitution plus ten scoped, domain-specific guides that activate passively when an agent works in their directory (per the agents.md convention). The system encodes the convictions, invariants, and review hooks each domain cares about so that an agent traversing the project has the local expert's voice in scope from the moment it enters a directory.

This is not an automated reviewer that replaces humans. It's a coordination layer on top of .github/CODEOWNERS: AI stewards advise; CODEOWNERS approve. The system already produced real findings during validation (see Validation below), including a determinism bug and an embedder-import regression.

How it works

graph TB
    USER([Human])
    USER --> AGENT

    subgraph ALWAYS[Always loaded]
        ROOT["Root AGENTS.md<br/>━━━━━━━<br/>• Product + architecture pillars<br/>• Stop &amp; Ask<br/>• Inference Acceleration concerns<br/>• Known Regression Patterns<br/>• Steward Signal Format<br/>• Doc Autopilot triggers"]
    end

    AGENT["Implementing agent<br/>━━━━━━━<br/>synthesis &amp; decisions"]
    ROOT --> AGENT

    subgraph SCOPED[Scoped — activate on directory entry]
        direction TB
        S1[nemo_curator/<br/>Pipeline &amp; Stage Contract]
        S2[backends/<br/>Executor Parity]
        S3[stages/deduplication/<br/>Determinism]
        S4[stages/text/<br/>Text modality]
        S5[stages/video/<br/>All-GPU video]
        S6[stages/synthetic/<br/>SDG]
        S7[tests/<br/>Parity + coverage]
        S8[fern/<br/>Canonical docs]
        S9[benchmarking/<br/>Defensible perf]
        S10[tutorials/<br/>First impression]
    end

    AGENT -->|enters scope| SCOPED
    AGENT -.->|delegates parallel work| SWARM["Subagent swarm<br/>(returns Steward Signal Format)"]
    SWARM -.-> AGENT

    SCOPED -.->|routes review to| CO["@NVIDIA-NeMo CODEOWNERS"]

Three activation modes:

Passive — when an agent enters a directory, the closest scoped AGENTS.md loads automatically via the agents.md convention. Root is always loaded, so repo-wide rules (Inference Acceleration, Known Regression Patterns, Doc Autopilot) are always in scope.
Synthesis — the implementing agent is the integration point. Stewards advise; the agent decides. Cross-references between stewards (e.g., "apply the Inference Acceleration concerns in root AGENTS.md") are followed by the agent, not by the file system.
Delegation — for parallel investigation or large reads, the agent spawns subagents. Each subagent reads root plus its closest scoped file and returns findings in the Steward Signal Format. The implementing agent synthesizes and decides. Triggers: `ask stewards`, `bugbash`, `review swarm`, `steward synthesis`, `audit docs`, `content audit`.
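
The passive mode can be pictured as a nearest-ancestor lookup. The following is a minimal sketch, assuming the agent tooling resolves files by walking up from the edited path; the function name `steward_files` is hypothetical, and real agent runtimes may resolve scope differently.

```python
from pathlib import Path


def steward_files(repo_root: Path, target: Path, name: str = "AGENTS.md") -> list[Path]:
    """Return the root steward file plus the closest scoped one, if any."""
    repo_root = repo_root.resolve()
    target = target.resolve()
    # Start from the directory that contains the file being edited.
    current = target if target.is_dir() else target.parent

    found: list[Path] = []
    root_file = repo_root / name
    if root_file.is_file():
        found.append(root_file)  # root is always in scope

    # Walk upward, stopping before the repo root (already handled above).
    while current != repo_root and repo_root in current.parents:
        candidate = current / name
        if candidate.is_file():
            found.append(candidate)  # the closest scoped steward wins
            break
        current = current.parent
    return found
```

Under this assumption, an edit to `nemo_curator/stages/text/modifiers/modifier.py` would pick up root plus `nemo_curator/stages/text/AGENTS.md`, and would not load `nemo_curator/AGENTS.md` unless the text steward links to it.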

The stewards

| Path | What it owns | CODEOWNERS |
| --- | --- | --- |
| `AGENTS.md` (root) | Constitution: pillars, Stop & Ask, Inference Acceleration, Known Regression Patterns, Doc Autopilot, Steward Swarm protocol, Done Criteria | none |
| `nemo_curator/AGENTS.md` | `Task` / `ProcessingStage` / `Pipeline` / `Resources` ABI; per-stage resource ergonomics; task-centric / map-style / fault-tolerant pillars | curator_reviewers |
| `nemo_curator/backends/AGENTS.md` | Executor parity (Xenna / Ray Actor Pool / Ray Data); streaming + auto-balancing; backpressure; `runtime_env` honoring | @oyilmaz-nvidia @praateekmahajan @abhinavg4 @ayushdg |
| `nemo_curator/stages/deduplication/AGENTS.md` | exact / fuzzy / semantic determinism; ID-generator stability; RAPIDS dependency reality | @ayushdg @praateekmahajan |
| `nemo_curator/stages/text/AGENTS.md` | DocumentBatch contract; published Domain & Quality classifiers as public API; embedder surface | @sarahyurick @praateekmahajan @VibhuJawa (classifiers/embedders) |
| `nemo_curator/stages/video/AGENTS.md` | All-GPU end-to-end pipeline; speed-of-light per inference model; WebDataset output | @suiyoubi @abhinavg4 |
| `nemo_curator/stages/synthetic/AGENTS.md` | Prompts as public API; in-process vs server-endpoint deployment patterns; OpenAI-compat | @huvunvidia |
| `tests/AGENTS.md` | `@pytest.mark.gpu` discipline; shared Ray cluster fixture; modality selection rules; 80% changed-line coverage | curator_reviewers + per-modality |
| `fern/AGENTS.md` | Canonical docs site; 2026-05-20 write-freeze on `docs/`; agentic features (Ask AI, llms.txt, MCP); Doc Autopilot ritual | @NVIDIA-NeMo/docs_team |
| `benchmarking/AGENTS.md` | Reproducibility; hardware + software-version + model + serving-stack capture; cost-per-token framing | @rlratzel + curator_reviewers split |
| `tutorials/AGENTS.md` | Imports resolve; extras-name correctness (`<modality>_cpu` / `<modality>_cuda12`); cluster-scheduled examples | curator_reviewers (advocacy: add per-modality routing) |

Cross-cutting concerns (root-owned, always active)

Inference Acceleration. Speed-of-light per model (TensorRT-LLM, memory optimization, quantization); explicit in-process vs server-endpoint deployment pattern; vLLM canonical / Ray Serve preferred / Dynamo supported; benchmarks must capture model + serving stack + hardware.
Known Regression Patterns with verification recipes: fabricated CLI/config fields, stage-contract drift, executor parity drift, inference performance regression, deduplication CUDA gating, docs/ vs fern/ regression, doc-snippet rot, naming/counting drift, cross-page inconsistency, narrow-fix regression, unverified finding regression.
Convergence rule. When two or more stewards independently flag the same finding, it auto-promotes to P0 regardless of individual severity.
Global Sweep on accepted P0s. Wrong factual claims must be grep-corrected across the entire fern/ site plus tutorials/, .cursor/rules/, .github/copilot-instructions.md, README.md, api-design.md — narrow fixes are the dominant regression mode.
Docs-First Agent Artifact Evaluation. Before creating or expanding a cursor rule, Claude skill, MCP workflow, or prompt template, fix fern/ first if that would solve the problem for both humans and agents.
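
The Convergence rule above can be sketched as a small aggregation step. This is an illustration only: the `Finding` fields (`steward`, `key`, `severity`) are hypothetical stand-ins, since the Steward Signal Format itself is not reproduced in this description.

```python
from collections import defaultdict
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Finding:
    steward: str   # which scoped steward reported it
    key: str       # hypothetical normalized identifier for "the same finding"
    severity: str  # e.g. "P0", "P1", "P2"


def apply_convergence(findings: list[Finding]) -> list[Finding]:
    """Promote a finding to P0 when two or more distinct stewards flag it."""
    stewards_by_key: dict[str, set[str]] = defaultdict(set)
    for finding in findings:
        stewards_by_key[finding.key].add(finding.steward)

    return [
        replace(f, severity="P0") if len(stewards_by_key[f.key]) >= 2 else f
        for f in findings
    ]
```

Counting distinct stewards rather than raw reports keeps one steward repeating itself from triggering a promotion.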

Validation

The steward network produced real findings during three swarm passes before this PR landed:

Bootstrap audit — 12 P0 corrections to the initial mandate (fabricated paths, wrong extras names, stale class names, CODEOWNERS gap for fern/).
Post-tightening re-audit — 5 factual errors caught after a redundancy-cut rewrite, plus 1 false-positive that exemplified the "Unverified finding regression" Known Regression Pattern.
State-of-domain pass — first audit of code (not just the mandate). Surfaced real issues including:
- IdGenerator.hash_files is caller-order-dependent (silent ID drift hazard) — nemo_curator/stages/deduplication/id_generator.py:47-49
- Embedders eagerly import torch / sentence_transformers / transformers at parse time, breaking CPU-only installs — nemo_curator/stages/text/embedders/{__init__.py:15,base.py:19-23}
- Resources._get_gpu_memory_gb() returns a hardcoded 24 GB fallback that masks misconfiguration — nemo_curator/stages/resources.py:29
- CompositeStage.with_(dict) is LSP-incompatible with ProcessingStage.with_(name=, resources=, ...) — nemo_curator/stages/base.py:262 vs :381
- Lazy-import drift in three modalities (dedup top-level RAPIDS, text embedders, video ffmpeg subprocess paths) — convergent finding from three independent stewards
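
To make the first finding concrete, here is a minimal sketch of an order-independent hash over a file list. It is not the code in `id_generator.py`; the function name `hash_file_list` and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
from collections.abc import Iterable


def hash_file_list(paths: Iterable[str]) -> str:
    """Hash a set of file paths so that caller ordering does not matter."""
    digest = hashlib.sha256()
    # Sorting first removes the dependence on the order the caller passed.
    for path in sorted(str(p) for p in paths):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator so "ab","c" differs from "a","bc"
    return digest.hexdigest()
```

With a canonical ordering like this, two callers that list the same files in different orders would derive the same hash, which is the property the determinism steward is asking for.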

These are tracked as follow-up work; this PR ships the network, not the fixes.
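
For the lazy-import findings, one possible shape of a fix is to defer the heavy import until the stage is set up. The sketch below uses only `importlib.import_module` from the standard library; `load_optional`, `LazyEmbedder`, and the extra name `text_cuda12` are hypothetical and only follow the `<modality>_cuda12` naming pattern mentioned above.

```python
import importlib
from types import ModuleType


def load_optional(module_name: str, extra: str) -> ModuleType:
    """Import a heavy dependency on demand, with an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required here; install the '{extra}' extra."
        ) from exc


class LazyEmbedder:
    """Stand-in embedder: constructing it never touches the heavy package."""

    def __init__(self, model_id: str) -> None:
        self.model_id = model_id
        self._backend = None  # nothing imported yet

    def setup(self) -> None:
        # The import happens only when the stage is actually prepared to run.
        self._backend = load_optional("sentence_transformers", "text_cuda12")
```

With this pattern, importing the module that defines `LazyEmbedder` would succeed on a CPU-only install, and the missing dependency would surface only when `setup()` runs.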

Commits in this PR

Bootstrap of the network and first-audit P0 corrections
Tightening pass to remove redundancy (-34% lines)
Second-audit P0/P1 corrections
Motivation reshape + Inference Acceleration cross-cutting concern
Drop redundant bootnote (the agents.md convention already says "read this file")
Switch scoped stewards to second person; reframe Inference Acceleration as a concern, not a separately-scoped steward
One steward-discovered fix to a text/modifiers/ docstring (cross-page-inconsistency pattern)

Open follow-ups

Wire one CI trigger per domain (Doc Autopilot merge gate; tutorial static-import smoke; backend parity test; benchmark regression detection).
Close CODEOWNERS gaps on tutorials/ and several tests/stages/<modality>/ subtrees.
Pin owned fern/ doc paths in each scoped steward's Own list (currently deferred to first Doc Autopilot sweep).
Fix the validated bugs above as separate PRs against their CODEOWNERS.

Test plan

markdownlint-cli2 clean across all 11 AGENTS.md files
No internal customer / team / number / roadmap leaks (grep-verified)
Spot-checked 6 of the highest-impact state-of-domain claims against source
Steward activation tested via a small edit in nemo_curator/stages/text/modifiers/modifier.py — relevant stewards activated, irrelevant stewards stayed quiet
First real PR review through the system (requires another PR to route through it)

🤖 Generated with Claude Code

Introduce a root constitution plus ten scoped AGENTS.md stewards covering the pipeline/stage contract, executor parity, deduplication, text/video/synthetic modalities, tests, the canonical fern docs site, benchmarking, and tutorials. Each scoped steward follows the same Point-of-View / Protect / Contract Checklist / Advocate / Serve Peers / Do Not / Own operating model and routes review to the relevant CODEOWNERS team. The root file encodes the Swarm, Content Audit, Convergence, Global Sweep, and Doc Autopilot protocols.

Bootstrap fixes shipped alongside (validated by the inaugural steward swarm audit before commit):
- Add fern/ to .github/CODEOWNERS (was uncovered by docs_team review; fell through to the default reviewers team)
- Remove the stray AGENTS.md line from .gitignore (sat in the macOS Files block; no policy reason found)
- Fix phantom path nemo_curator/examples/quickstart.py -> tutorials/quickstart.py in api-design.md and .github/copilot-instructions.md (Global Sweep on convergent P0)

The Docs Steward declares a write-freeze on docs/ effective today (2026-05-20); ongoing release notes and product docs land in fern/. The Deduplication Steward describes current top-level RAPIDS imports honestly (the package requires deduplication_cuda12 to import today) and tracks lazy-import work as open advocacy.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

Cut redundancy across all 11 AGENTS.md files: 1894 -> 1254 lines, a 34% reduction with no loss of mandate. The same facts had been stated 3-7 times across sections; tightening lets each file speak once.

Operating-model change in root: Do Not and Serve Peers are now optional sections, included only when they carry weight a careful reader can't infer from the rest of the file. Most scoped stewards either drop them entirely or keep one or two non-obvious entries.

Other cuts:
- Stop restating root AGENTS.md content in scoped files; link instead.
- Collapse adjacent overlapping sections in root (When To Consult + Ask Stewards merged into Steward Swarms; Stakes folded into Governance Alignment; Review Notes merged into Done Criteria).
- Shorten 2-3 paragraph scoped-file intros to 1-2 sentences.
- Remove "canonical paths to be pinned in the next audit" placeholders where the concrete fern/ paths are known to belong to the Docs Steward's first autopilot sweep.

All 11 files lint clean (markdownlint-cli2). Mandate facts unchanged - this is editorial density, not a policy update.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

Re-audit of the tightened mandates surfaced a few real factual errors that crept in during the rewrite:
- text steward asserted `nemo_curator/stages/text/filters/` doesn't exist; it does (contains doc_filter.py, score_filter.py, and the fasttext/heuristic/histogram/token subpackages). Re-added to the subpackage list everywhere.
- tutorials steward claimed per-modality CODEOWNERS routing for tutorial PRs; `.github/CODEOWNERS` has no `tutorials/` entries today. Reworded to "default only; adding routes is open advocacy".
- pipeline steward cited `backends/utils.py` as the home of `None` tolerance for filter stages; it actually lives in adapters (e.g. xenna/adapter.py's process_data returns list[Task] | None).
- fern steward asserted "matching directories" for all five versions; `latest.yml` is redirect-only with no `latest/` directory. Clarified.
- synthetic Own list missed nested nemotron_cc/nemo_data_designer/{base,nemotron_cc}.py.

One reported P0 (dedup `fuzzy/lsh/lsh.py:20` "isn't a RAPIDS import") was a steward false positive: line 20 IS `import cudf`. The dedup audit was wrong, which is exactly the "Unverified finding regression" Known Regression Pattern in root AGENTS.md. Not acted on.

All 11 files lint clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

…rests

Three changes:

1. Root AGENTS.md gets the real product pillars (Higher Accuracy / Faster Processing / Scalability / Classifier Models / Deploy Anywhere) plus the three architecture pillars (Task-centric / Map-style / Fault tolerant), drawn from the public README and Fern site, not internal sourcing.
2. New cross-cutting Inference Acceleration Steward, encoded as an inline section in root AGENTS.md (the same shape as elbysodic's Surface Contract Steward, not a separate file). Coordinates the modality, backends, synthetic, and dedup stewards whenever a change touches an inference-bearing stage or the model-serving surface. Adds an "Inference performance regression" Known Regression Pattern and a Done Criterion requiring model + serving-stack + hardware context on inference benchmarks.
3. Every scoped steward now leads with motivation ("This domain exists because…") instead of description. Protect / Contract Checklist / Advocate follow from the motivation rather than restating it.

Filter applied: no internal customer names, no internal team names, no internal numbers (token counts, GPU counts, internal benchmarks), no internal roadmap quotes, no internal sourcing ("X asked"). All references are to publicly observable facts (README, Fern site, HuggingFace download counts, public RAPIDS / Ray / vLLM / NVENC / NVDEC dependencies).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

The agents.md convention already establishes that agents read the relevant AGENTS.md files when working in a scope. Restating it in a blockquote at the top added noise without value. The Architecture Boundaries table already serves as the steward inventory. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com>

Two changes:

1. Stewards address the agent directly. The AGENTS.md convention is that an agent loads the relevant file and embodies the role. Second person ("you defend X") makes that role-taking explicit; third-person description ("this domain defends X") leaves the agent at arm's length. Each scoped steward's opening + Point Of View now uses second person; Protect / Contract Checklist / Advocate stay impersonal (those are facts and surfaces). Root stays in first-person plural ("we protect") since the constitution speaks for the project to all agents.
2. Inference Acceleration is no longer framed as a "Steward." A steward is bound to a scope by its AGENTS.md file. The root agent IS the steward of cross-cutting concerns, so a sub-steward inside root with no directory of its own was confused. Reframed as a cross-cutting concern that root owns; scoped stewards reference it directly when their changes are inference-bearing.

Also trimmed motivation paragraphs across all 10 scoped files. The fern intro that described what Fern is (chat, MCP, llms.txt) read like marketing copy and got cut; anyone reading the file is already in the repo and knows. Total: 1458 -> 1286 lines.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

Found while testing the AGENTS.md steward activation: the inline docstring and a TypeError message referenced `output_field` (singular), but the actual parameter is `output_fields` (plural). The "cross-page inconsistency" Known Regression Pattern in root AGENTS.md named exactly this shape of drift. No behavior change; docstring and error-message corrections only. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Lawrence Lane <llane@nvidia.com>

Two additions to root AGENTS.md:

1. Anti-Patterns is now a Don't / Do instead table (shape borrowed from PR NVIDIA-NeMo#1769). Concrete pairs covering: pip vs uv; @property on stage attrs vs class attribute; _name/_resources/_batch_size override (the @final trap); model loading in __init__ vs setup(); batch_size > 1 without process_batch; _metadata/_stage_perf drop on fan-out; print vs loguru.logger; unsigned commits; the TestThisClassFunction over-classification pattern.
2. Seven new Known Regression Patterns the swarm hunts by default:
   - Stage lifecycle drift (__init__ / setup_on_node / setup)
   - process_batch mis-use
   - Metadata / _stage_perf propagation drop (especially fan-out)
   - Ray Data spec missing (IS_ACTOR_STAGE, IS_FANOUT_STAGE)
   - Ray Actor Pool mis-attribution (general backend vs dedup-only)
   - Test over-classification (TestThisClassFunction)
   - EmptyTask first-stage worker waste (missing max_workers_per_node)

All patterns name a verification recipe so they're machine-checkable in autopilot mode.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

lbliii · 2026-05-21T14:04:25Z

Pushed three commits responding to @sarahyurick's review and integrating Praateek's developer guide:

`docs(governance): apply Sarah's PR review — Ray Actor Pool is dedup-only`: reframed Ray Actor Pool as the dedup-batch executor across root, backends, benchmarking, video, tutorials. Dropped `ALM_BENCHMARK.md` / `AUDIO_PROFILING.md` references. Tutorials get ruff, not CI execution.
`docs(governance): seed stewards with Praateek's architectural discipline`: encoded the patterns from the NeMo Curator Developer Guide into the relevant stewards. The Pipeline & Stage Contract steward now carries the `__init__` / `setup_on_node` / `setup` lifecycle discipline, `process` vs `process_batch` rules, `_metadata` / `_stage_perf` propagation (especially fan-out), task-size sweet spot, `with_()` for variable resources, `CompositeStage`, and `Workflow`. Text / Video / SDG stewards cross-reference the setup discipline for model loading. Dedup steward gains the `Workflow` callout and explicit Ray Actor Pool scope. Tests steward gains the "no `TestThisClassFunction` over-classification" rule.
`docs(governance): integrate Praateek's What-NOT-to-Do + new KRPs`: root `AGENTS.md` Anti-Patterns is now a Don't / Do table (shape from #1769), plus seven new Known Regression Patterns the swarm hunts by default: stage lifecycle drift, `process_batch` mis-use, `_metadata` / `_stage_perf` drop on fan-out, Ray Data spec missing (`IS_ACTOR_STAGE`, `IS_FANOUT_STAGE`), Ray Actor Pool mis-attribution, test over-classification, and `EmptyTask` first-stage worker waste. Each pattern names a verification recipe so it's machine-checkable in autopilot mode.

The `What NOT to Do` shape and several concrete rules are borrowed from #1769 — the two PRs are complementary (#1769 = root-level developer onboarding; #2005 = scoped domain network). When both land they reinforce each other.

Verifying lint clean across all 11 `AGENTS.md` files.

…gation

Path lists go stale on IA refactors and miss new pages added after pinning. Replace them with the discovery method itself.

Root AGENTS.md gains an Impacted-Docs Discovery section: derive search terms from the diff (renamed classes, changed defaults, new extras, user-visible labels) and grep `fern/`, `tutorials/`, root markdown, and agent artifacts. Each hit either updates in the same PR, carries no-impact:<reason>, or escalates to the Docs Steward.

Each scoped steward now lists its domain-specific search-term vocabulary instead of a frozen file path list:
- Pipeline: ProcessingStage, Task, Pipeline, Resources, process / process_batch / setup / setup_on_node / with_, modality task names
- Backends: XennaExecutor, RayDataExecutor, ray_data_stage_spec, IS_ACTOR_STAGE, IS_FANOUT_STAGE, max_workers_per_node
- Dedup: IdGenerator, MinHash, LSH, SemanticDedup, deduplication_cuda12, cuDF/cuGraph/cuML, TextRemovalWorkflow
- Text: classifier / embedder class names, HuggingFace IDs, Quality rubric labels (High/Medium/Low), Domain Classifier taxonomy
- Video: VideoTask/Video/Clip, PyNvVideoCodec/CvCuda/pyav/NVDEC/NVENC, WebDataset output fields
- Synthetic: prompt constants, model server identifiers (vLLM/NIM/Ray Serve/Dynamo), OpenAI-compat
- Tests: GPU markers, shared_ray_cluster, L0 scripts, coverage refs
- Benchmarking: test-paths.yaml / nightly-benchmark.yaml, hardware references, cost-per-token claims
- Tutorials: filenames, extras names, cluster-orchestration patterns

Fern steward gains a Delegation destination block: when other stewards escalate abstraction-level changes (reshaped concept, terminology shift) it performs cross-page consistency, IA, and conceptual-page discovery the calling steward couldn't do via grep.

Self-grep is the fast path. Delegation is the escalation when the change isn't symbol-derivable.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
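
The self-grep fast path described in this commit could look roughly like the sketch below. It is an assumption-laden illustration: `find_impacted_docs`, the default directory list, and the suffix filter are hypothetical, and the real discovery method lives in root AGENTS.md as prose.

```python
from pathlib import Path


def find_impacted_docs(
    repo_root: str,
    terms: list[str],
    search_dirs: tuple[str, ...] = ("fern", "tutorials"),
    suffixes: tuple[str, ...] = (".md", ".mdx", ".yml", ".py", ".ipynb"),
) -> dict[str, list[str]]:
    """Map each file that mentions a search term to the terms it contains."""
    root = Path(repo_root)
    hits: dict[str, list[str]] = {}
    for directory in search_dirs:
        base = root / directory
        if not base.is_dir():
            continue  # tolerate repos that lack one of the directories
        for path in sorted(base.rglob("*")):
            if not path.is_file() or path.suffix not in suffixes:
                continue
            text = path.read_text(encoding="utf-8", errors="ignore")
            matched = sorted(term for term in terms if term in text)
            if matched:
                hits[path.relative_to(root).as_posix()] = matched
    return hits
```

Each returned path would then be updated in the same PR, annotated with `no-impact:<reason>`, or escalated to the Docs Steward, as the commit message describes.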

greptile-apps · 2026-05-21T14:30:13Z

+  authoritative for Fern. `docs/broken_links_*.json` is the
+  deprecated Sphinx site — ignore for Fern.
+- **Variable substitution.** `fern/substitute_variables.py` rewrites
+  `{{ current_release }}` / `<release/>`. Don't hand-pin versions
+  where substitution would apply.


Fabricated version file — latest.yml does not exist

The Protect section lists fern/versions/{latest,main,v25.09,v26.02,v26.04}.yml as the canonical set of version files, but fern/versions/latest.yml does not exist in the repository (only main.yml, v25.09.yml, v26.02.yml, and v26.04.yml are present). This is the exact pattern the root AGENTS.md defines as "Fabricated CLI / config fields" — and this steward file is the first place an agent will look when managing fern/ versioning. An agent that trusts this inventory will attempt to read, validate, or update a file that doesn't exist, then treat the missing file as a regression rather than an authoring error.

Disagreeing — fern/versions/latest.yml does exist on this branch:

$ ls fern/versions/
latest.yml main main.yml v25.09 v25.09.yml v26.02 v26.02.yml v26.04 v26.04.yml

The file is referenced as the redirect-only manifest (no matching latest/ directory, which the steward already calls out two lines later: "latest.yml is redirect-only — no latest/ directory"). This is the second "Unverified finding regression" pattern on this PR from this tool — the Known Regression Pattern is in root AGENTS.md for a reason. No code change needed.

sarahyurick · 2026-05-21T15:05:49Z

+   against.
+3. The nightly cron runs all entries in `nightly-benchmark.yaml` on
+   4×A100; results post to the rapids-workflows-nightly-tests
+   channel.


This seems to be referencing an internal Slack channel.

Good catch — dropped the internal Slack channel name. Now reads "team's results sink" in 02f6c75.

sarahyurick · 2026-05-21T15:06:14Z

+- Hardware references (H100, L40S, A100, GB200) tied to specific
+  workloads
+- Cost-per-token / cost-per-hour-of-video claims
+- Headline speedup numbers (e.g., RedPajama-v2 fuzzy-dedup figures


I thought it was CC not red pajama?

The current public README headline is RedPajama v2: "16× faster fuzzy deduplication on 8 TB RedPajama v2 (1.78 trillion tokens)". Nemotron-CC is cited separately as the end-to-end recipe. To avoid pinning a specific corpus that may drift, I softened the steward to point at README first as the source of truth (02f6c75). If we're swapping the headline benchmark to Nemotron-CC, the README + the public site update first, and the steward then auto-discovers via grep.

sarahyurick · 2026-05-21T15:11:16Z

+    (`self.model = AutoModel.from_pretrained(...).to("cuda")`).
+    GPU-resident state lives here.
+  - A stage that overrides `setup()` is auto-routed to a Ray Data
+    **Actor**; stateless stages become Ray Data **Tasks**. Overriding


I know this is true for text filters and modifiers but I don't think this auto detection is true across the board right? I think you normally have to specify if you want it to be an actor (appropriate for stages that persist some type of object in memory per actor, like a tokenizer or model).

Good correction. Praateek's deck described auto-routing as a general behavior; you're saying it's opportunistic in practice. Reframed in 02f6c75: auto-detection works for simpler text filter/modifier stages but isn't reliable across the board, and any stage holding a tokenizer / model / lookup table in memory should explicitly set RayStageSpecKeys.IS_ACTOR_STAGE: True in ray_stage_spec.

sarahyurick · 2026-05-21T15:13:32Z

+- **Fault tolerant** — stages survive preemption and reschedule; partial state is recoverable
+
+The same pipeline definition must run unchanged across Xenna, Ray Actor
+Pool, and Ray Data executors.


Lingering comment should not group Ray Actor Pool with Xenna and Ray Data.

Fixed in 02f6c75 — root now reads: "The same pipeline definition must run unchanged across the streaming executors (Xenna and Ray Data). Ray Actor Pool is the dedup-batch executor — appropriate for shuffle-based dedup and full-data-state passes, not for general streaming workloads."

sarahyurick · 2026-05-21T15:21:27Z

+  `setup()`. Loading in `__init__` serializes the model to every
+  replica. Downloading weights belongs in `setup_on_node()`. See the
+  setup-discipline rule in [parent](../../AGENTS.md).
+- **Tokenizer and model artifact handling.** HuggingFace-pinned


This isn't necessarily text-specific but since it is common for text, I would like us to specify that tokenization and model inference should always be split into 2 stages and defined together in a composite stage. Like in general, if we are loading some type of model object in memory, it is usually better for each model to be its own stage. This is especially true for the model forward pass which has a huge advantage when it has more GPU resources available in the pipeline, as opposed to having to share the time and resources with the tokenization step.

Strong guidance; added in 02f6c75 as a new Protect bullet on the text steward:

> Tokenization and model inference are always separate stages. When a pipeline needs both (classifier inference, embedder inference), split them into two stages and compose them with CompositeStage. Tokenization is CPU-bound; inference is GPU-bound. Keeping them in one stage forces the GPU stage to share its replicas with tokenization, hurting throughput. Each model loaded in memory deserves its own stage so the GPU stage can scale independently.

I kept it on the text steward since that's where it activates most often, but the convention is general, so I'm happy to move it to the Pipeline steward if you'd prefer it broader.
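
The split described in this thread can be illustrated with plain Python stand-ins. None of the classes below are the NeMo Curator API: `Resources`, `TokenizeStage`, `InferenceStage`, `ClassifierComposite`, and `run` are hypothetical, and the "model" is a placeholder that counts tokens.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    """Hypothetical stand-in for a per-stage resource request."""
    cpus: float = 1.0
    gpus: float = 0.0


class TokenizeStage:
    resources = Resources(cpus=1.0)  # CPU-bound work

    def process(self, texts: list[str]) -> list[list[str]]:
        return [text.lower().split() for text in texts]


class InferenceStage:
    resources = Resources(gpus=1.0)  # GPU-bound work, scaled on its own

    def setup(self) -> None:
        # Placeholder for loading a model once per worker, not in __init__.
        self.model = len

    def process(self, token_lists: list[list[str]]) -> list[int]:
        return [self.model(tokens) for tokens in token_lists]


class ClassifierComposite:
    """Stand-in composite: one user-facing unit, two schedulable stages."""

    def decompose(self) -> list:
        return [TokenizeStage(), InferenceStage()]


def run(stages: list, batch):
    """Toy sequential runner; a real executor would schedule stages separately."""
    for stage in stages:
        if hasattr(stage, "setup"):
            stage.setup()
        batch = stage.process(batch)
    return batch
```

Because each stand-in stage declares its own resources, a scheduler could give the inference stage GPU replicas without tying them up during tokenization, which is the throughput argument made in the review comment.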

sarahyurick · 2026-05-21T15:24:08Z

Maybe worth mentioning, if you are adding a model test that requires a HF_TOKEN, then you will need to coordinate with the automation team to get the model configured with the token in CI/CD.

Added in 02f6c75:

> HF_TOKEN-gated tests require automation-team coordination. If a test loads a HuggingFace model that needs an access token, the test will fail in CI without the token configured. Coordinate with @NVIDIA-NeMo/automation to add the token (and any model-specific access grants) to the CI secrets before merging. Use pytest.skip if the token is absent locally, never pytest.fail, which hard-blocks contributors without the secret.

The steward swarm flagged three test files using pytest.fail on missing HF_TOKEN; those are tracked as follow-up fixes.

Seven correctness fixes from Sarah's inline comments on PR NVIDIA-NeMo#2005:

1. Root AGENTS.md: rephrase Xenna/Ray Actor Pool/Ray Data grouping. Streaming is Xenna + Ray Data; Ray Actor Pool is dedup-batch.
2. nemo_curator/AGENTS.md: Ray Data auto-detection of stateful stages based on setup() is opportunistic, not guaranteed. Stages holding tokenizers/models/lookup tables in memory should explicitly set RayStageSpecKeys.IS_ACTOR_STAGE: True in ray_stage_spec.
3. benchmarking/AGENTS.md: drop the internal Slack channel name and replace it with "team's results sink".
4. benchmarking/AGENTS.md: soften the RedPajama-specific reference; the canonical fuzzy-dedup and Nemotron-CC end-to-end recipe are both cited in README. Discoverer should grep README first.
5. nemo_curator/stages/text/AGENTS.md: text/modules/ is utility code (add_id, joiner, splitter), not filter/modifier code. Remove from the filter/modifier grouping.
6. nemo_curator/stages/text/AGENTS.md: add architectural guidance that tokenization and model inference are always separate stages, composed via CompositeStage. Tokenization is CPU-bound; inference is GPU-bound; shared stages force GPU replicas to share with tokenization.
7. tests/AGENTS.md: HF_TOKEN-gated tests require automation-team coordination; use pytest.skip when the token is absent locally, never pytest.fail.

Greptile's P1 on fern/AGENTS.md:47 (latest.yml fabricated) is a false positive: latest.yml exists (verified via ls fern/versions/). Will reply on the PR thread; no code change needed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

sarahyurick · 2026-06-01T19:28:40Z

+
+## Do Not
+
+- Use real network calls (HuggingFace download, remote APIs) in CPU


I think small Hugging Face downloads are okay. We use them for classifier and embedder tests.

sarahyurick · 2026-06-01T19:30:03Z

+- **Programmatic embedder registry** mirroring
+  `classifiers/__init__.py:_LAZY` so docs don't hand-count.
+  Auto-generate `fern/` reference pages from `__all__`.
+- **Clearer separation** between `text/experimental/` and stable


Maybe let's remove this point. I think text/experimental/ will not always be there.

lbliii and others added 7 commits May 20, 2026 14:59

lbliii requested review from a team, abhinavg4, ayushdg, huvunvidia, oyilmaz-nvidia, praateekmahajan, rlratzel, sarahyurick and suiyoubi as code owners May 20, 2026 20:26

greptile-apps Bot reviewed May 21, 2026

sarahyurick reviewed May 21, 2026

lbliii and others added 2 commits May 21, 2026 17:08

Merge branch 'main' into lbliii/refine-pasted-prompt

b03dc78

sarahyurick mentioned this pull request Jun 1, 2026

docs: add AGENTS.md for AI agent and contributor guidance #1769

Closed

sarahyurick reviewed Jun 1, 2026
