docs(governance): AGENTS.md steward network #2005

Open
lbliii wants to merge 13 commits into NVIDIA-NeMo:main from lbliii:lbliii/refine-pasted-prompt

lbliii commented May 20, 2026

Summary

Introduces an AGENTS.md steward network for NeMo Curator — a root constitution plus ten scoped, domain-specific guides that activate passively when an agent works in their directory (per the agents.md convention). The system encodes the convictions, invariants, and review hooks each domain cares about so that an agent traversing the project has the local expert's voice in scope from the moment it enters a directory.

This is not an automated reviewer that replaces humans. It's a coordination layer on top of .github/CODEOWNERS: AI stewards advise; CODEOWNERS approve. The system already produced real findings during validation (see Validation below), including a determinism bug and an embedder-import regression.

How it works

```mermaid
graph TB
    USER([Human])
    USER --> AGENT

    subgraph ALWAYS[Always loaded]
        ROOT["Root AGENTS.md<br/>━━━━━━━<br/>• Product + architecture pillars<br/>• Stop &amp; Ask<br/>• Inference Acceleration concerns<br/>• Known Regression Patterns<br/>• Steward Signal Format<br/>• Doc Autopilot triggers"]
    end

    AGENT["Implementing agent<br/>━━━━━━━<br/>synthesis &amp; decisions"]
    ROOT --> AGENT

    subgraph SCOPED[Scoped — activate on directory entry]
        direction TB
        S1[nemo_curator/<br/>Pipeline &amp; Stage Contract]
        S2[backends/<br/>Executor Parity]
        S3[stages/deduplication/<br/>Determinism]
        S4[stages/text/<br/>Text modality]
        S5[stages/video/<br/>All-GPU video]
        S6[stages/synthetic/<br/>SDG]
        S7[tests/<br/>Parity + coverage]
        S8[fern/<br/>Canonical docs]
        S9[benchmarking/<br/>Defensible perf]
        S10[tutorials/<br/>First impression]
    end

    AGENT -->|enters scope| SCOPED
    AGENT -.->|delegates parallel work| SWARM["Subagent swarm<br/>(returns Steward Signal Format)"]
    SWARM -.-> AGENT

    SCOPED -.->|routes review to| CO["@NVIDIA-NeMo CODEOWNERS"]
```

Three activation modes:

  1. Passive — when an agent enters a directory, the closest scoped AGENTS.md loads automatically via the agents.md convention. Root is always loaded, so repo-wide rules (Inference Acceleration, Known Regression Patterns, Doc Autopilot) are always in scope.
  2. Synthesis — the implementing agent is the integration point. Stewards advise; the agent decides. Cross-references between stewards (e.g., "apply the Inference Acceleration concerns in root AGENTS.md") are followed by the agent, not by the file system.
  3. Delegation — for parallel investigation or large reads, the agent spawns subagents. Each subagent reads root plus its closest scoped file and returns findings in the Steward Signal Format. The implementing agent synthesizes and decides. Triggers: `ask stewards`, `bugbash`, `review swarm`, `steward synthesis`, `audit docs`, `content audit`.
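The passive mode above can be sketched as a directory walk. This is only an illustration of the "root plus closest scoped file" rule. The function name `resolve_stewards` is hypothetical, and treating "closest" as the nearest ancestor directory that holds an `AGENTS.md` is an assumption, not something the agents.md convention is quoted as specifying here.

```python
from pathlib import Path


def resolve_stewards(repo_root: Path, target: Path) -> list[Path]:
    """Return the AGENTS.md files in scope for `target`: root first, then the closest scoped file.

    Assumption: "closest" means the nearest ancestor directory of `target`
    (inside `repo_root`) that contains an AGENTS.md.
    """
    repo_root = repo_root.resolve()
    target = target.resolve()
    in_scope: list[Path] = []

    root_file = repo_root / "AGENTS.md"
    if root_file.is_file():
        in_scope.append(root_file)  # root is always loaded

    # Walk upward from the target's directory until we reach the repo root.
    current = target if target.is_dir() else target.parent
    while current != repo_root and repo_root in current.parents:
        candidate = current / "AGENTS.md"
        if candidate.is_file():
            in_scope.append(candidate)  # nearest scoped steward; stop here
            break
        current = current.parent
    return in_scope
```

Under these assumptions, an edit to `nemo_curator/stages/text/modifiers/modifier.py` would resolve to the root file plus `nemo_curator/stages/text/AGENTS.md`, and the intermediate `nemo_curator/AGENTS.md` would not be returned because a closer file exists.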

The stewards

| Path | What it owns | CODEOWNERS |
| --- | --- | --- |
| AGENTS.md (root) | Constitution: pillars, Stop & Ask, Inference Acceleration, Known Regression Patterns, Doc Autopilot, Steward Swarm protocol, Done Criteria | |
| nemo_curator/AGENTS.md | Task / ProcessingStage / Pipeline / Resources ABI; per-stage resource ergonomics; task-centric / map-style / fault-tolerant pillars | curator_reviewers |
| nemo_curator/backends/AGENTS.md | Executor parity (Xenna / Ray Actor Pool / Ray Data); streaming + auto-balancing; backpressure; runtime_env honoring | @oyilmaz-nvidia @praateekmahajan @abhinavg4 @ayushdg |
| nemo_curator/stages/deduplication/AGENTS.md | exact / fuzzy / semantic determinism; ID-generator stability; RAPIDS dependency reality | @ayushdg @praateekmahajan |
| nemo_curator/stages/text/AGENTS.md | DocumentBatch contract; published Domain & Quality classifiers as public API; embedder surface | @sarahyurick @praateekmahajan @VibhuJawa (classifiers/embedders) |
| nemo_curator/stages/video/AGENTS.md | All-GPU end-to-end pipeline; speed-of-light per inference model; WebDataset output | @suiyoubi @abhinavg4 |
| nemo_curator/stages/synthetic/AGENTS.md | Prompts as public API; in-process vs server-endpoint deployment patterns; OpenAI-compat | @huvunvidia |
| tests/AGENTS.md | @pytest.mark.gpu discipline; shared Ray cluster fixture; modality selection rules; 80% changed-line coverage | curator_reviewers + per-modality |
| fern/AGENTS.md | Canonical docs site; 2026-05-20 write-freeze on docs/; agentic features (Ask AI, llms.txt, MCP); Doc Autopilot ritual | @NVIDIA-NeMo/docs_team |
| benchmarking/AGENTS.md | Reproducibility; hardware + software-version + model + serving-stack capture; cost-per-token framing | @rlratzel + curator_reviewers split |
| tutorials/AGENTS.md | Imports resolve; extras-name correctness (<modality>_cpu / <modality>_cuda12); cluster-scheduled examples | curator_reviewers (advocacy: add per-modality routing) |

Cross-cutting concerns (root-owned, always active)

  • Inference Acceleration. Speed-of-light per model (TensorRT-LLM, memory optimization, quantization); explicit in-process vs server-endpoint deployment pattern; vLLM canonical / Ray Serve preferred / Dynamo supported; benchmarks must capture model + serving stack + hardware.
  • Known Regression Patterns with verification recipes: fabricated CLI/config fields, stage-contract drift, executor parity drift, inference performance regression, deduplication CUDA gating, docs/ vs fern/ regression, doc-snippet rot, naming/counting drift, cross-page inconsistency, narrow-fix regression, unverified finding regression.
  • Convergence rule. When two or more stewards independently flag the same finding, it auto-promotes to P0 regardless of individual severity.
  • Global Sweep on accepted P0s. Wrong factual claims must be grep-corrected across the entire fern/ site plus tutorials/, .cursor/rules/, .github/copilot-instructions.md, README.md, api-design.md — narrow fixes are the dominant regression mode.
  • Docs-First Agent Artifact Evaluation. Before creating or expanding a cursor rule, Claude skill, MCP workflow, or prompt template, fix fern/ first if that would solve the problem for both humans and agents.
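As a rough illustration of the convergence rule in the list above, the following sketch promotes a finding to P0 once two or more distinct stewards report it. The `Finding` shape and its `key` field are hypothetical. How the real Steward Signal Format identifies "the same finding" is not stated in this PR, so the sketch assumes a normalized string key.

```python
from collections import defaultdict
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Finding:
    steward: str   # which steward reported it
    key: str       # hypothetical normalized identifier for the finding
    severity: str  # "P0", "P1", "P2", ...


def apply_convergence(findings: list[Finding]) -> list[Finding]:
    """Promote a finding to P0 when two or more distinct stewards report the same key."""
    stewards_by_key: dict[str, set[str]] = defaultdict(set)
    for finding in findings:
        stewards_by_key[finding.key].add(finding.steward)

    return [
        replace(f, severity="P0") if len(stewards_by_key[f.key]) >= 2 else f
        for f in findings
    ]
```

Counting distinct steward names, rather than raw reports, is a deliberate choice in this sketch: one steward repeating itself is not independent confirmation.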

Validation

The steward network produced real findings during three swarm passes before this PR landed:

  • Bootstrap audit — 12 P0 corrections to the initial mandate (fabricated paths, wrong extras names, stale class names, CODEOWNERS gap for fern/).
  • Post-tightening re-audit — 5 factual errors caught after a redundancy-cut rewrite, plus 1 false-positive that exemplified the "Unverified finding regression" Known Regression Pattern.
  • State-of-domain pass — first audit of code (not just the mandate). Surfaced real issues including:
    • IdGenerator.hash_files is caller-order-dependent (silent ID drift hazard) — nemo_curator/stages/deduplication/id_generator.py:47-49
    • Embedders eagerly import torch / sentence_transformers / transformers at parse time, breaking CPU-only installs — nemo_curator/stages/text/embedders/{__init__.py:15,base.py:19-23}
    • Resources._get_gpu_memory_gb() returns a hardcoded 24 GB fallback that masks misconfiguration — nemo_curator/stages/resources.py:29
    • CompositeStage.with_(dict) is LSP-incompatible with ProcessingStage.with_(name=, resources=, ...) — nemo_curator/stages/base.py:262 vs :381
    • Lazy-import drift in three modalities (dedup top-level RAPIDS, text embedders, video ffmpeg subprocess paths) — convergent finding from three independent stewards
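To make the first finding concrete, here is a minimal sketch of why a caller-order-dependent hash is a drift hazard and how sorting removes the dependency. It is not the code in `id_generator.py`. Both function names are hypothetical, and hashing path strings with SHA-256 is an assumption chosen only to keep the example self-contained.

```python
import hashlib


def hash_files_order_sensitive(paths: list[str]) -> str:
    """Digest that changes when the same files are passed in a different order."""
    digest = hashlib.sha256()
    for path in paths:
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return digest.hexdigest()


def hash_files_order_independent(paths: list[str]) -> str:
    """Same digest for any ordering of the same files: sort before hashing."""
    return hash_files_order_sensitive(sorted(paths))
```

With the first function, two callers that list the same directory in different orders would derive different identifiers without any error being raised. The second function gives both callers the same value.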

These are tracked as follow-up work; this PR ships the network, not the fixes.

Commits in this PR

  1. Bootstrap of the network and first-audit P0 corrections
  2. Tightening pass to remove redundancy (-34% lines)
  3. Second-audit P0/P1 corrections
  4. Motivation reshape + Inference Acceleration cross-cutting concern
  5. Drop redundant bootnote (the agents.md convention already says "read this file")
  6. Switch scoped stewards to second person; reframe Inference Acceleration as a concern, not a separately-scoped steward
  7. One steward-discovered fix to a text/modifiers/ docstring (cross-page-inconsistency pattern)

Open follow-ups

  • Wire one CI trigger per domain (Doc Autopilot merge gate; tutorial static-import smoke; backend parity test; benchmark regression detection).
  • Close CODEOWNERS gaps on tutorials/ and several tests/stages/<modality>/ subtrees.
  • Pin owned fern/ doc paths in each scoped steward's Own list (currently deferred to first Doc Autopilot sweep).
  • Fix the validated bugs above as separate PRs against their CODEOWNERS.

Test plan

  • markdownlint-cli2 clean across all 11 AGENTS.md files
  • No internal customer / team / number / roadmap leaks (grep-verified)
  • Spot-checked 6 of the highest-impact state-of-domain claims against source
  • Steward activation tested via a small edit in nemo_curator/stages/text/modifiers/modifier.py — relevant stewards activated, irrelevant stewards stayed quiet
  • First real PR review through the system (requires another PR to route through it)

🤖 Generated with Claude Code

lbliii and others added 7 commits May 20, 2026 14:59
Introduce a root constitution plus ten scoped AGENTS.md stewards
covering the pipeline/stage contract, executor parity, deduplication,
text/video/synthetic modalities, tests, the canonical fern docs site,
benchmarking, and tutorials. Each scoped steward follows the same
Point-of-View / Protect / Contract Checklist / Advocate / Serve Peers
/ Do Not / Own operating model and routes review to the relevant
CODEOWNERS team. The root file encodes the Swarm, Content Audit,
Convergence, Global Sweep, and Doc Autopilot protocols.

Bootstrap fixes shipped alongside (validated by the inaugural steward
swarm audit before commit):

- Add fern/ to .github/CODEOWNERS (was uncovered by docs_team review;
  fell through to the default reviewers team)
- Remove the stray AGENTS.md line from .gitignore (sat in the macOS
  Files block; no policy reason found)
- Fix phantom path nemo_curator/examples/quickstart.py -> tutorials/
  quickstart.py in api-design.md and .github/copilot-instructions.md
  (Global Sweep on convergent P0)

The Docs Steward declares a write-freeze on docs/ effective today
(2026-05-20); ongoing release notes and product docs land in fern/.
The Deduplication Steward describes current top-level RAPIDS imports
honestly (the package requires deduplication_cuda12 to import today)
and tracks lazy-import work as open advocacy.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Cut redundancy across all 11 AGENTS.md files: 1894 -> 1254 lines, a
34% reduction with no loss of mandate. The same facts had been stated
3-7 times across sections; tightening lets each file speak once.

Operating-model change in root: Do Not and Serve Peers are now
optional sections, included only when they carry weight a careful
reader can't infer from the rest of the file. Most scoped stewards
either drop them entirely or keep one or two non-obvious entries.

Other cuts:
- Stop restating root AGENTS.md content in scoped files; link instead.
- Collapse adjacent overlapping sections in root (When To Consult +
  Ask Stewards merged into Steward Swarms; Stakes folded into
  Governance Alignment; Review Notes merged into Done Criteria).
- Shorten 2-3 paragraph scoped-file intros to 1-2 sentences.
- Remove "canonical paths to be pinned in the next audit" placeholders
  where the concrete fern/ paths are known to belong to the Docs
  Steward's first autopilot sweep.

All 11 files lint clean (markdownlint-cli2). Mandate facts unchanged
- this is editorial density, not a policy update.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Re-audit of the tightened mandates surfaced a few real factual errors
that crept in during the rewrite:

- text steward asserted `nemo_curator/stages/text/filters/` doesn't
  exist; it does (contains doc_filter.py, score_filter.py, and the
  fasttext/heuristic/histogram/token subpackages). Re-added to the
  subpackage list everywhere.
- tutorials steward claimed per-modality CODEOWNERS routing for
  tutorial PRs; `.github/CODEOWNERS` has no `tutorials/` entries
  today. Reworded to "default only; adding routes is open advocacy".
- pipeline steward cited `backends/utils.py` as the home of `None`
  tolerance for filter stages; it actually lives in adapters (e.g.
  xenna/adapter.py's process_data returns list[Task] | None).
- fern steward asserted "matching directories" for all five versions;
  `latest.yml` is redirect-only with no `latest/` directory.
  Clarified.
- synthetic Own list missed nested
  nemotron_cc/nemo_data_designer/{base,nemotron_cc}.py.

One reported P0 (dedup `fuzzy/lsh/lsh.py:20` "isn't a RAPIDS import")
was a steward false positive: line 20 IS `import cudf`. The dedup
audit was wrong — exactly the "Unverified finding regression" Known
Regression Pattern in root AGENTS.md. Not acted on.
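A verification step for this kind of finding can be as small as reading the cited line before acting on the claim. The helper below is a hypothetical sketch, not the recipe text from root AGENTS.md.

```python
from pathlib import Path


def line_contains(path: Path, line_number: int, expected: str) -> bool:
    """Return True if the 1-indexed `line_number` of `path` contains `expected`."""
    lines = path.read_text(encoding="utf-8").splitlines()
    if not 1 <= line_number <= len(lines):
        return False  # the cited line does not exist at all
    return expected in lines[line_number - 1]
```

For the case described here, a check along the lines of `line_contains(Path("fuzzy/lsh/lsh.py"), 20, "import cudf")`, run from the deduplication directory, would have contradicted the reported finding before it reached a reviewer.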

All 11 files lint clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
…rests

Three changes:

1. Root AGENTS.md gets the real product pillars (Higher Accuracy /
   Faster Processing / Scalability / Classifier Models / Deploy
   Anywhere) plus the three architecture pillars (Task-centric /
   Map-style / Fault tolerant) — drawn from the public README and
   Fern site, not internal sourcing.

2. New cross-cutting Inference Acceleration Steward, encoded as an
   inline section in root AGENTS.md (the same shape as elbysodic's
   Surface Contract Steward — not a separate file). Coordinates the
   modality, backends, synthetic, and dedup stewards whenever a
   change touches an inference-bearing stage or the model-serving
   surface. Adds an "Inference performance regression" Known
   Regression Pattern and a Done Criterion requiring model +
   serving-stack + hardware context on inference benchmarks.

3. Every scoped steward now leads with motivation ("This domain
   exists because…") instead of description. Protect / Contract
   Checklist / Advocate follow from the motivation rather than
   restating it.

Filter applied: no internal customer names, no internal team names,
no internal numbers (token counts, GPU counts, internal benchmarks),
no internal roadmap quotes, no internal sourcing ("X asked"). All
references are to publicly observable facts (README, Fern site,
HuggingFace download counts, public RAPIDS / Ray / vLLM / NVENC /
NVDEC dependencies).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
The agents.md convention already establishes that agents read the
relevant AGENTS.md files when working in a scope. Restating it in a
blockquote at the top added noise without value. The Architecture
Boundaries table already serves as the steward inventory.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Two changes:

1. Stewards address the agent directly. The AGENTS.md convention is
   that an agent loads the relevant file and embodies the role.
   Second person ("you defend X") makes that role-taking explicit;
   third-person description ("this domain defends X") leaves the
   agent at arm's length. Each scoped steward's opening + Point Of
   View now uses second person; Protect / Contract Checklist /
   Advocate stay impersonal (those are facts and surfaces). Root
   stays in first-person plural ("we protect") since the constitution
   speaks for the project to all agents.

2. Inference Acceleration is no longer framed as a "Steward." A
   steward is bound to a scope by its AGENTS.md file. The root agent
   IS the steward of cross-cutting concerns, so a sub-steward inside
   root with no directory of its own was confused. Reframed as a
   cross-cutting concern that root owns; scoped stewards reference
   it directly when their changes are inference-bearing.

Also trimmed motivation paragraphs across all 10 scoped files. The
fern intro that described what Fern is (chat, MCP, llms.txt) read
like marketing copy and got cut — anyone reading the file is already
in the repo and knows. Total: 1458 -> 1286 lines.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Found while testing the AGENTS.md steward activation: the inline
docstring and a TypeError message referenced `output_field` (singular),
but the actual parameter is `output_fields` (plural). The
"cross-page inconsistency" Known Regression Pattern in root AGENTS.md
named exactly this shape of drift.

No behavior change; docstring and error-message corrections only.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Two additions to root AGENTS.md:

1. Anti-Patterns is now a Don't / Do instead table (shape borrowed
   from PR NVIDIA-NeMo#1769). Concrete pairs covering: pip vs uv; @property on
   stage attrs vs class attribute; _name/_resources/_batch_size
   override (the @final trap); model loading in __init__ vs setup();
   batch_size > 1 without process_batch; _metadata/_stage_perf drop
   on fan-out; print vs loguru.logger; unsigned commits; the
   TestThisClassFunction over-classification pattern.

2. Seven new Known Regression Patterns the swarm hunts by default:
   - Stage lifecycle drift (__init__ / setup_on_node / setup)
   - process_batch mis-use
   - Metadata / _stage_perf propagation drop (especially fan-out)
   - Ray Data spec missing (IS_ACTOR_STAGE, IS_FANOUT_STAGE)
   - Ray Actor Pool mis-attribution (general backend vs dedup-only)
   - Test over-classification (TestThisClassFunction)
   - EmptyTask first-stage worker waste (missing max_workers_per_node)

All patterns name a verification recipe so they're machine-checkable
in autopilot mode.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>

lbliii commented May 21, 2026

Pushed three commits responding to @sarahyurick's review and integrating Praateek's developer guide:

  1. `docs(governance): apply Sarah's PR review — Ray Actor Pool is dedup-only` — reframed Ray Actor Pool as the dedup-batch executor across root, backends, benchmarking, video, tutorials. Dropped `ALM_BENCHMARK.md` / `AUDIO_PROFILING.md` references. Tutorials get ruff, not CI execution.

  2. `docs(governance): seed stewards with Praateek's architectural discipline` — encoded the patterns from the NeMo Curator Developer Guide into the relevant stewards. The Pipeline & Stage Contract steward now carries the `__init__` / `setup_on_node` / `setup` lifecycle discipline, `process` vs `process_batch` rules, `_metadata` / `_stage_perf` propagation (especially fan-out), task-size sweet spot, `with_()` for variable resources, `CompositeStage`, and `Workflow`. Text / Video / SDG stewards cross-reference the setup discipline for model loading. Dedup steward gains the `Workflow` callout and explicit Ray Actor Pool scope. Tests steward gains the "no `TestThisClassFunction` over-classification" rule.

  3. `docs(governance): integrate Praateek's What-NOT-to-Do + new KRPs` — root `AGENTS.md` Anti-Patterns is now a Don't / Do table (shape from #1769, "docs: add AGENTS.md for AI agent and contributor guidance"), and seven new Known Regression Patterns the swarm hunts by default: stage lifecycle drift, `process_batch` mis-use, `_metadata` / `_stage_perf` drop on fan-out, Ray Data spec missing (`IS_ACTOR_STAGE`, `IS_FANOUT_STAGE`), Ray Actor Pool mis-attribution, test over-classification, and `EmptyTask` first-stage worker waste. Each pattern names a verification recipe so it's machine-checkable in autopilot mode.
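The lifecycle discipline named in the list above (constructor, `setup_on_node`, `setup`) can be illustrated with a toy stage. Everything in this sketch is a stand-in: `ClassifierStage` and `FakeModel` are hypothetical, the class does not inherit from the real `ProcessingStage`, and the method signatures are assumptions made to keep the example runnable without NeMo Curator installed.

```python
class FakeModel:
    """Stand-in for a large model object."""

    def __init__(self, model_id: str) -> None:
        self.model_id = model_id
        self.weights = bytes(1_000_000)  # pretend these are weights


class ClassifierStage:
    """Hypothetical stage that keeps heavy state out of the constructor."""

    def __init__(self, model_id: str) -> None:
        # Cheap configuration only, so the stage object stays small.
        self.model_id = model_id
        self.model = None
        self.weights_cached = False

    def setup_on_node(self) -> None:
        # Once per node: fetch weights to local disk (simulated here).
        self.weights_cached = True

    def setup(self) -> None:
        # Once per worker: load the model into memory.
        self.model = FakeModel(self.model_id)

    def process(self, text: str) -> dict[str, str]:
        if self.model is None:
            raise RuntimeError("setup() must run before process()")
        return {"text": text, "model": self.model.model_id}
```

The point of the split is that the object created by the constructor carries only a model identifier. The megabyte of fake weights appears only after `setup()` runs on the worker.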

The `What NOT to Do` shape and several concrete rules are borrowed from #1769 — the two PRs are complementary (#1769 = root-level developer onboarding; #2005 = scoped domain network). When both land they reinforce each other.

Verifying lint clean across all 11 `AGENTS.md` files.

…gation

Path lists go stale on IA refactors and miss new pages added after
pinning. Replace them with the discovery method itself.

Root AGENTS.md gains an Impacted-Docs Discovery section: derive
search terms from the diff (renamed classes, changed defaults, new
extras, user-visible labels) and grep `fern/`, `tutorials/`, root
markdown, and agent artifacts. Each hit either updates in the same
PR, carries no-impact:<reason>, or escalates to the Docs Steward.

Each scoped steward now lists its domain-specific search-term
vocabulary instead of a frozen file path list:

- Pipeline: ProcessingStage, Task, Pipeline, Resources, process /
  process_batch / setup / setup_on_node / with_, modality task names
- Backends: XennaExecutor, RayDataExecutor, ray_data_stage_spec,
  IS_ACTOR_STAGE, IS_FANOUT_STAGE, max_workers_per_node
- Dedup: IdGenerator, MinHash, LSH, SemanticDedup, deduplication_cuda12,
  cuDF/cuGraph/cuML, TextRemovalWorkflow
- Text: classifier / embedder class names, HuggingFace IDs, Quality
  rubric labels (High/Medium/Low), Domain Classifier taxonomy
- Video: VideoTask/Video/Clip, PyNvVideoCodec/CvCuda/pyav/NVDEC/NVENC,
  WebDataset output fields
- Synthetic: prompt constants, model server identifiers (vLLM/NIM/
  Ray Serve/Dynamo), OpenAI-compat
- Tests: GPU markers, shared_ray_cluster, L0 scripts, coverage refs
- Benchmarking: test-paths.yaml / nightly-benchmark.yaml, hardware
  references, cost-per-token claims
- Tutorials: filenames, extras names, cluster-orchestration patterns

Fern steward gains a Delegation destination block: when other
stewards escalate abstraction-level changes (reshaped concept,
terminology shift) it performs cross-page consistency, IA, and
conceptual-page discovery the calling steward couldn't do via grep.

Self-grep is the fast path. Delegation is the escalation when the
change isn't symbol-derivable.
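A minimal sketch of the self-grep fast path follows. The function name `find_doc_hits` is hypothetical, and the default suffix filter (`.md` and `.mdx`) is an assumption about which files count as docs. Terms are matched as plain substrings rather than regular expressions.

```python
from pathlib import Path


def find_doc_hits(
    terms: list[str],
    roots: list[Path],
    suffixes: tuple[str, ...] = (".md", ".mdx"),
) -> list[tuple[Path, int, str]]:
    """Return (file, 1-indexed line number, term) for every line that mentions a term."""
    hits: list[tuple[Path, int, str]] = []
    for root in roots:
        if not root.exists():
            continue  # a missing root is not an error in this sketch
        files = [root] if root.is_file() else sorted(root.rglob("*"))
        for file in files:
            if not file.is_file() or file.suffix not in suffixes:
                continue
            text = file.read_text(encoding="utf-8", errors="ignore")
            for number, line in enumerate(text.splitlines(), start=1):
                for term in terms:
                    if term in line:
                        hits.append((file, number, term))
    return hits
```

Each returned hit would then be handled as the commit describes: updated in the same PR, marked `no-impact:<reason>`, or escalated to the Docs Steward.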

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Comment thread fern/AGENTS.md
Comment on lines +43 to +47
authoritative for Fern. `docs/broken_links_*.json` is the
deprecated Sphinx site — ignore for Fern.
- **Variable substitution.** `fern/substitute_variables.py` rewrites
`{{ current_release }}` / `<release/>`. Don't hand-pin versions
where substitution would apply.
Contributor

P1 Fabricated version file — latest.yml does not exist

The Protect section lists fern/versions/{latest,main,v25.09,v26.02,v26.04}.yml as the canonical set of version files, but fern/versions/latest.yml does not exist in the repository (only main.yml, v25.09.yml, v26.02.yml, and v26.04.yml are present). This is the exact pattern the root AGENTS.md defines as "Fabricated CLI / config fields" — and this steward file is the first place an agent will look when managing fern/ versioning. An agent that trusts this inventory will attempt to read, validate, or update a file that doesn't exist, then treat the missing file as a regression rather than an authoring error.

Contributor Author

Disagreeing — fern/versions/latest.yml does exist on this branch:

$ ls fern/versions/
latest.yml
main
main.yml
v25.09
v25.09.yml
v26.02
v26.02.yml
v26.04
v26.04.yml

The file is referenced as the redirect-only manifest (no matching latest/ directory, which the steward already calls out two lines later: "latest.yml is redirect-only — no latest/ directory"). This is the second "Unverified finding regression" pattern on this PR from this tool — the Known Regression Pattern is in root AGENTS.md for a reason. No code change needed.
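The distinction drawn in this reply (a manifest with no matching directory) is easy to check mechanically. The helper below is a hypothetical sketch that lists `.yml` manifests lacking a same-named sibling directory.

```python
from pathlib import Path


def redirect_only_manifests(versions_dir: Path) -> list[str]:
    """Return names of *.yml manifests that have no same-named sibling directory."""
    return sorted(
        manifest.name
        for manifest in versions_dir.glob("*.yml")
        if not (versions_dir / manifest.stem).is_dir()
    )
```

Run against a directory shaped like the `ls fern/versions/` listing above, it would report only `latest.yml`.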

Comment thread benchmarking/AGENTS.md Outdated
against.
3. The nightly cron runs all entries in `nightly-benchmark.yaml` on
4×A100; results post to the rapids-workflows-nightly-tests
channel.
Contributor

This seems to be referencing an internal Slack channel.

Contributor Author

Good catch — dropped the internal Slack channel name. Now reads "team's results sink" in 02f6c75.

Comment thread benchmarking/AGENTS.md Outdated
- Hardware references (H100, L40S, A100, GB200) tied to specific
workloads
- Cost-per-token / cost-per-hour-of-video claims
- Headline speedup numbers (e.g., RedPajama-v2 fuzzy-dedup figures
Contributor

I thought it was CC not red pajama?

Contributor Author

The current public README headline is RedPajama v2: "16× faster fuzzy deduplication on 8 TB RedPajama v2 (1.78 trillion tokens)". Nemotron-CC is cited separately as the end-to-end recipe. To avoid pinning a specific corpus that may drift, I softened the steward to point at README first as the source of truth (02f6c75). If we're swapping the headline benchmark to Nemotron-CC, the README + the public site update first, and the steward then auto-discovers via grep.

Comment thread nemo_curator/AGENTS.md Outdated
(`self.model = AutoModel.from_pretrained(...).to("cuda")`).
GPU-resident state lives here.
- A stage that overrides `setup()` is auto-routed to a Ray Data
**Actor**; stateless stages become Ray Data **Tasks**. Overriding
Contributor

I know this is true for text filters and modifiers but I don't think this auto detection is true across the board right? I think you normally have to specify if you want it to be an actor (appropriate for stages that persist some type of object in memory per actor, like a tokenizer or model).

Contributor Author

Good correction. Praateek's deck described auto-routing as a general behavior; you're saying it's opportunistic in practice. Reframed in 02f6c75: auto-detection works for simpler text filter/modifier stages but isn't reliable across the board, and any stage holding a tokenizer / model / lookup table in memory should explicitly set RayStageSpecKeys.IS_ACTOR_STAGE: True in ray_stage_spec.
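The explicit declaration described in this reply might look roughly like the sketch below. It runs without NeMo Curator, so `RayStageSpecKeys` is redefined locally as a stand-in. The enum's string value, the choice to model `ray_stage_spec` as a method, and the `TokenizerStage` class are all assumptions for illustration, not the library's confirmed API.

```python
from enum import Enum
from typing import Any


class RayStageSpecKeys(str, Enum):
    """Local stand-in for the real enum; the string value is an assumption."""

    IS_ACTOR_STAGE = "is_actor_stage"


class TokenizerStage:
    """Hypothetical stage that keeps a tokenizer in memory between calls."""

    def setup(self) -> None:
        # Stand-in for loading a real tokenizer once per worker.
        self.tokenizer = str.split

    def ray_stage_spec(self) -> dict[str, Any]:
        # Declare actor placement explicitly instead of relying on auto-detection.
        return {RayStageSpecKeys.IS_ACTOR_STAGE: True}

    def process(self, text: str) -> list[str]:
        return self.tokenizer(text)
```

Stating the flag in the stage itself means placement does not depend on whether the executor happens to notice the `setup()` override.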

Comment thread AGENTS.md Outdated
- **Fault tolerant** — stages survive preemption and reschedule; partial state is recoverable

The same pipeline definition must run unchanged across Xenna, Ray Actor
Pool, and Ray Data executors.
Contributor

Lingering comment should not group Ray Actor Pool with Xenna and Ray Data.

Contributor Author

Fixed in 02f6c75 — root now reads: "The same pipeline definition must run unchanged across the streaming executors (Xenna and Ray Data). Ray Actor Pool is the dedup-batch executor — appropriate for shuffle-based dedup and full-data-state passes, not for general streaming workloads."

Comment thread nemo_curator/stages/text/AGENTS.md Outdated
`setup()`. Loading in `__init__` serializes the model to every
replica. Downloading weights belongs in `setup_on_node()`. See the
setup-discipline rule in [parent](../../AGENTS.md).
- **Tokenizer and model artifact handling.** HuggingFace-pinned
Contributor

This isn't necessarily text-specific but since it is common for text, I would like us to specify that tokenization and model inference should always be split into 2 stages and defined together in a composite stage. Like in general, if we are loading some type of model object in memory, it is usually better for each model to be its own stage. This is especially true for the model forward pass which has a huge advantage when it has more GPU resources available in the pipeline, as opposed to having to share the time and resources with the tokenization step.

Contributor Author

Strong guidance — added in 02f6c75 as a new Protect bullet on the text steward:

> Tokenization and model inference are always separate stages. When a pipeline needs both (classifier inference, embedder inference), split them into two stages and compose them with CompositeStage. Tokenization is CPU-bound; inference is GPU-bound. Keeping them in one stage forces the GPU stage to share its replicas with tokenization, hurting throughput. Each model loaded in memory deserves its own stage so the GPU stage can scale independently.

I kept it on the text steward since that's where it activates most often, but the convention is general — happy to move it to Pipeline steward if you'd prefer it broader.
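A toy version of that split is sketched below. None of these classes come from NeMo Curator. `TokenizeStage`, `InferenceStage`, and `ClassifierComposite` are hypothetical, the `resources` dictionaries are illustrative, and the `decompose()` method name is an assumption about how a composite expands into its parts.

```python
class TokenizeStage:
    """CPU-bound step (hypothetical): turns text into toy token ids."""

    resources = {"cpus": 1.0, "gpus": 0.0}

    def process(self, text: str) -> list[int]:
        return [len(word) for word in text.split()]


class InferenceStage:
    """GPU-bound step (hypothetical): scores token ids with a toy model."""

    resources = {"cpus": 1.0, "gpus": 1.0}

    def process(self, token_ids: list[int]) -> float:
        return sum(token_ids) / max(len(token_ids), 1)


class ClassifierComposite:
    """User-facing wrapper that expands into the two stages above."""

    def decompose(self) -> list[object]:
        return [TokenizeStage(), InferenceStage()]


def run(composite: ClassifierComposite, text: str):
    """Feed the output of each decomposed stage into the next."""
    value = text
    for stage in composite.decompose():
        value = stage.process(value)
    return value
```

Because each stage declares its own resources, a scheduler could scale the GPU stage's replicas without also multiplying tokenizer workers, which is the throughput argument made in the quoted bullet.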

Comment thread tests/AGENTS.md
Contributor

Maybe worth mentioning, if you are adding a model test that requires a HF_TOKEN, then you will need to coordinate with the automation team to get the model configured with the token in CI/CD.

Contributor Author

Added in 02f6c75:

> HF_TOKEN-gated tests require automation-team coordination. If a test loads a HuggingFace model that needs an access token, the test will fail in CI without the token configured. Coordinate with @NVIDIA-NeMo/automation to add the token (and any model-specific access grants) to the CI secrets before merging. Use pytest.skip if the token is absent locally — never pytest.fail, which hard-blocks contributors without the secret.

The steward swarm flagged three test files using pytest.fail on missing HF_TOKEN — those are tracked as follow-up fixes.
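The skip-not-fail behaviour can be sketched with a small guard. To keep the example runnable with only the standard library, it raises `unittest.SkipTest` as a stand-in. In a pytest suite the equivalent call would be `pytest.skip(...)`. The helper name `require_hf_token` and the test class are hypothetical.

```python
import os
import unittest


def require_hf_token() -> str:
    """Return HF_TOKEN, or skip the calling test when it is not set."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        # Skipping keeps contributors without the secret unblocked.
        raise unittest.SkipTest("HF_TOKEN not set; skipping gated-model test")
    return token


class GatedModelTest(unittest.TestCase):
    def test_gated_model_loads(self) -> None:
        token = require_hf_token()
        self.assertTrue(token)  # a real test would load the gated model here
```

A contributor without the secret sees the test reported as skipped, while CI, where the token is configured, still exercises it.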

lbliii and others added 2 commits May 21, 2026 17:08
Seven correctness fixes from Sarah's inline comments on PR NVIDIA-NeMo#2005:

1. Root AGENTS.md: rephrase Xenna/Ray Actor Pool/Ray Data grouping —
   streaming is Xenna + Ray Data; Ray Actor Pool is dedup-batch.

2. nemo_curator/AGENTS.md: Ray Data auto-detection of stateful stages
   based on setup() is opportunistic, not guaranteed. Stages holding
   tokenizers/models/lookup tables in memory should explicitly set
   RayStageSpecKeys.IS_ACTOR_STAGE: True in ray_stage_spec.

3. benchmarking/AGENTS.md: drop the internal Slack channel name —
   replace with "team's results sink".

4. benchmarking/AGENTS.md: soften the RedPajama-specific reference;
   the canonical fuzzy-dedup and Nemotron-CC end-to-end recipe are
   both cited in README. Discoverer should grep README first.

5. nemo_curator/stages/text/AGENTS.md: text/modules/ is utility code
   (add_id, joiner, splitter), not filter/modifier code. Remove from
   the filter/modifier grouping.

6. nemo_curator/stages/text/AGENTS.md: add architectural guidance —
   tokenization and model inference are always separate stages, composed
   via CompositeStage. Tokenization is CPU-bound; inference is GPU-bound;
   shared stages force GPU replicas to share with tokenization.

7. tests/AGENTS.md: HF_TOKEN-gated tests require automation-team
   coordination; use pytest.skip when token absent locally — never
   pytest.fail.

Greptile's P1 on fern/AGENTS.md:47 (latest.yml fabricated) is a false
positive — latest.yml exists (verified via ls fern/versions/). Will
reply on the PR thread; no code change needed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Comment thread tests/AGENTS.md

## Do Not

- Use real network calls (HuggingFace download, remote APIs) in CPU
Contributor

I think small Hugging Face downloads are okay. We use them for classifier and embedder tests.

- **Programmatic embedder registry** mirroring
`classifiers/__init__.py:_LAZY` so docs don't hand-count.
Auto-generate `fern/` reference pages from `__all__`.
- **Clearer separation** between `text/experimental/` and stable
Contributor

Maybe let's remove this point. I think text/experimental/ will not always be there.
