docs(governance): AGENTS.md steward network#2005
Introduce a root constitution plus ten scoped AGENTS.md stewards covering
the pipeline/stage contract, executor parity, deduplication,
text/video/synthetic modalities, tests, the canonical fern docs site,
benchmarking, and tutorials. Each scoped steward follows the same
Point-of-View / Protect / Contract Checklist / Advocate / Serve Peers /
Do Not / Own operating model and routes review to the relevant
CODEOWNERS team. The root file encodes the Swarm, Content Audit,
Convergence, Global Sweep, and Doc Autopilot protocols.
Bootstrap fixes shipped alongside (validated by the inaugural steward
swarm audit before commit):
- Add fern/ to .github/CODEOWNERS (was uncovered by docs_team review;
fell through to the default reviewers team)
- Remove the stray AGENTS.md line from .gitignore (sat in the macOS
Files block; no policy reason found)
- Fix phantom path nemo_curator/examples/quickstart.py ->
tutorials/quickstart.py in api-design.md and
.github/copilot-instructions.md (Global Sweep on convergent P0)
The Docs Steward declares a write-freeze on docs/ effective today
(2026-05-20); ongoing release notes and product docs land in fern/.
The Deduplication Steward describes current top-level RAPIDS imports
honestly (the package requires deduplication_cuda12 to import today)
and tracks lazy-import work as open advocacy.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Cut redundancy across all 11 AGENTS.md files: 1894 -> 1254 lines, a 34%
reduction with no loss of mandate. The same facts had been stated 3-7
times across sections; tightening lets each file speak once.
Operating-model change in root: Do Not and Serve Peers are now optional
sections, included only when they carry weight a careful reader can't
infer from the rest of the file. Most scoped stewards either drop them
entirely or keep one or two non-obvious entries.
Other cuts:
- Stop restating root AGENTS.md content in scoped files; link instead.
- Collapse adjacent overlapping sections in root (When To Consult + Ask
Stewards merged into Steward Swarms; Stakes folded into Governance
Alignment; Review Notes merged into Done Criteria).
- Shorten 2-3 paragraph scoped-file intros to 1-2 sentences.
- Remove "canonical paths to be pinned in the next audit" placeholders
where the concrete fern/ paths are known to belong to the Docs
Steward's first autopilot sweep.
All 11 files lint clean (markdownlint-cli2). Mandate facts unchanged -
this is editorial density, not a policy update.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Re-audit of the tightened mandates surfaced a few real factual errors
that crept in during the rewrite:
- text steward asserted `nemo_curator/stages/text/filters/` doesn't
exist; it does (contains doc_filter.py, score_filter.py, and the
fasttext/heuristic/histogram/token subpackages). Re-added to the
subpackage list everywhere.
- tutorials steward claimed per-modality CODEOWNERS routing for
tutorial PRs; `.github/CODEOWNERS` has no `tutorials/` entries
today. Reworded to "default only; adding routes is open advocacy".
- pipeline steward cited `backends/utils.py` as the home of `None`
tolerance for filter stages; it actually lives in adapters (e.g.
xenna/adapter.py's process_data returns list[Task] | None).
- fern steward asserted "matching directories" for all five versions;
`latest.yml` is redirect-only with no `latest/` directory.
Clarified.
- synthetic Own list missed nested
nemotron_cc/nemo_data_designer/{base,nemotron_cc}.py.
One reported P0 (dedup `fuzzy/lsh/lsh.py:20` "isn't a RAPIDS import")
was a steward false positive: line 20 IS `import cudf`. The dedup
audit was wrong — exactly the "Unverified finding regression" Known
Regression Pattern in root AGENTS.md. Not acted on.
All 11 files lint clean.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
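The false positive described above (a claim about `fuzzy/lsh/lsh.py:20`) is the kind of finding a small check can confirm or reject before anyone acts on it. A minimal sketch, assuming a finding is expressed as a path, a 1-based line number, and the text the line is claimed to contain; `line_matches` is a hypothetical helper, not something in the repo, and the demo runs against a throwaway file rather than the real module:

```python
from pathlib import Path
import tempfile


def line_matches(path: Path, line_no: int, expected: str) -> bool:
    """Return True if the 1-based line `line_no` of `path` contains `expected`."""
    if not path.is_file():
        return False
    lines = path.read_text(encoding="utf-8").splitlines()
    if not 1 <= line_no <= len(lines):
        return False
    return expected in lines[line_no - 1]


# Demo on a temporary file standing in for the real module.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "lsh.py"
    demo.write_text("# header\n" * 19 + "import cudf\n", encoding="utf-8")
    claim_holds = line_matches(demo, 20, "import cudf")
    missing_file = line_matches(Path(tmp) / "nope.py", 1, "import cudf")
```

A swarm finding that cites `path:line` could be run through a check like this first, so that a wrong citation is dropped instead of being escalated as a P0.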
…rests
Three changes:
1. Root AGENTS.md gets the real product pillars (Higher Accuracy /
Faster Processing / Scalability / Classifier Models / Deploy
Anywhere) plus the three architecture pillars (Task-centric /
Map-style / Fault tolerant) — drawn from the public README and
Fern site, not internal sourcing.
2. New cross-cutting Inference Acceleration Steward, encoded as an
inline section in root AGENTS.md (the same shape as elbysodic's
Surface Contract Steward — not a separate file). Coordinates the
modality, backends, synthetic, and dedup stewards whenever a
change touches an inference-bearing stage or the model-serving
surface. Adds an "Inference performance regression" Known
Regression Pattern and a Done Criterion requiring model +
serving-stack + hardware context on inference benchmarks.
3. Every scoped steward now leads with motivation ("This domain
exists because…") instead of description. Protect / Contract
Checklist / Advocate follow from the motivation rather than
restating it.
Filter applied: no internal customer names, no internal team names,
no internal numbers (token counts, GPU counts, internal benchmarks),
no internal roadmap quotes, no internal sourcing ("X asked"). All
references are to publicly observable facts (README, Fern site,
HuggingFace download counts, public RAPIDS / Ray / vLLM / NVENC /
NVDEC dependencies).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
The agents.md convention already establishes that agents read the
relevant AGENTS.md files when working in a scope. Restating it in a
blockquote at the top added noise without value. The Architecture
Boundaries table already serves as the steward inventory.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Two changes:
1. Stewards address the agent directly. The AGENTS.md convention is
that an agent loads the relevant file and embodies the role.
Second person ("you defend X") makes that role-taking explicit;
third-person description ("this domain defends X") leaves the
agent at arm's length. Each scoped steward's opening + Point Of
View now uses second person; Protect / Contract Checklist /
Advocate stay impersonal (those are facts and surfaces). Root
stays in first-person plural ("we protect") since the constitution
speaks for the project to all agents.
2. Inference Acceleration is no longer framed as a "Steward." A
steward is bound to a scope by its AGENTS.md file. The root agent
IS the steward of cross-cutting concerns, so a sub-steward inside
root with no directory of its own was confused. Reframed as a
cross-cutting concern that root owns; scoped stewards reference
it directly when their changes are inference-bearing.
Also trimmed motivation paragraphs across all 10 scoped files. The
fern intro that described what Fern is (chat, MCP, llms.txt) read
like marketing copy and got cut — anyone reading the file is already
in the repo and knows. Total: 1458 -> 1286 lines.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Found while testing the AGENTS.md steward activation: the inline
docstring and a TypeError message referenced `output_field` (singular),
but the actual parameter is `output_fields` (plural). The "cross-page
inconsistency" Known Regression Pattern in root AGENTS.md named exactly
this shape of drift.
No behavior change; docstring and error-message corrections only.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Two additions to root AGENTS.md:
1. Anti-Patterns is now a "Don't / Do instead" table (shape borrowed
from PR NVIDIA-NeMo#1769). Concrete pairs covering: pip vs uv;
@property on stage attrs vs class attribute;
_name/_resources/_batch_size override (the @Final trap); model loading
in __init__ vs setup(); batch_size > 1 without process_batch;
_metadata/_stage_perf drop on fan-out; print vs loguru.logger; unsigned
commits; the TestThisClassFunction over-classification pattern.
2. Seven new Known Regression Patterns the swarm hunts by default:
- Stage lifecycle drift (__init__ / setup_on_node / setup)
- process_batch mis-use
- Metadata / _stage_perf propagation drop (especially fan-out)
- Ray Data spec missing (IS_ACTOR_STAGE, IS_FANOUT_STAGE)
- Ray Actor Pool mis-attribution (general backend vs dedup-only)
- Test over-classification (TestThisClassFunction)
- EmptyTask first-stage worker waste (missing max_workers_per_node)
All patterns name a verification recipe so they're machine-checkable in
autopilot mode.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
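The "model loading in __init__ vs setup()" pair and the stage-lifecycle pattern can be illustrated with a toy class. This is a sketch under stated assumptions: `SketchModelStage` is a stand-in that does not subclass the real `ProcessingStage`, the method signatures are simplified (no arguments), and a JSON file plays the role of model weights.

```python
import json
import tempfile
from pathlib import Path


class SketchModelStage:
    """Stand-in for a stage; it does NOT subclass the real ProcessingStage."""

    def __init__(self, weights_path: str) -> None:
        # Configuration only: nothing heavy is created or held here.
        self.weights_path = weights_path
        self.model = None

    def setup_on_node(self) -> None:
        # Once per node: fetch artifacts (here, write a fake weights file).
        Path(self.weights_path).write_text(json.dumps({"scale": 2}), encoding="utf-8")

    def setup(self) -> None:
        # Once per worker: load the heavy object into memory.
        self.model = json.loads(Path(self.weights_path).read_text(encoding="utf-8"))

    def process(self, value: int) -> int:
        if self.model is None:
            raise RuntimeError("setup() must run before process()")
        return value * self.model["scale"]


with tempfile.TemporaryDirectory() as tmp:
    stage = SketchModelStage(str(Path(tmp) / "weights.json"))
    model_before_setup = stage.model  # still None: __init__ loaded nothing
    stage.setup_on_node()
    stage.setup()
    result = stage.process(21)
```

Because `__init__` only records a path, the object that gets shipped to workers stays small; the heavy state appears only after `setup()` runs on the worker.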
Pushed three commits responding to @sarahyurick's review and integrating Praateek's developer guide:
The `What NOT to Do` shape and several concrete rules are borrowed from #1769; the two PRs are complementary (#1769 = root-level developer onboarding; #2005 = scoped domain network). When both land they reinforce each other. Verifying lint clean across all 11 `AGENTS.md` files.
…gation
Path lists go stale on IA refactors and miss new pages added after
pinning. Replace them with the discovery method itself.
Root AGENTS.md gains an Impacted-Docs Discovery section: derive search
terms from the diff (renamed classes, changed defaults, new extras,
user-visible labels) and grep `fern/`, `tutorials/`, root markdown, and
agent artifacts. Each hit either updates in the same PR, carries
no-impact:<reason>, or escalates to the Docs Steward.
Each scoped steward now lists its domain-specific search-term
vocabulary instead of a frozen file path list:
- Pipeline: ProcessingStage, Task, Pipeline, Resources, process /
process_batch / setup / setup_on_node / with_, modality task names
- Backends: XennaExecutor, RayDataExecutor, ray_data_stage_spec,
IS_ACTOR_STAGE, IS_FANOUT_STAGE, max_workers_per_node
- Dedup: IdGenerator, MinHash, LSH, SemanticDedup,
deduplication_cuda12, cuDF/cuGraph/cuML, TextRemovalWorkflow
- Text: classifier / embedder class names, HuggingFace IDs, Quality
rubric labels (High/Medium/Low), Domain Classifier taxonomy
- Video: VideoTask/Video/Clip, PyNvVideoCodec/CvCuda/pyav/NVDEC/NVENC,
WebDataset output fields
- Synthetic: prompt constants, model server identifiers (vLLM/NIM/Ray
Serve/Dynamo), OpenAI-compat
- Tests: GPU markers, shared_ray_cluster, L0 scripts, coverage refs
- Benchmarking: test-paths.yaml / nightly-benchmark.yaml, hardware
references, cost-per-token claims
- Tutorials: filenames, extras names, cluster-orchestration patterns
Fern steward gains a Delegation destination block: when other stewards
escalate abstraction-level changes (reshaped concept, terminology
shift) it performs cross-page consistency, IA, and conceptual-page
discovery the calling steward couldn't do via grep.
Self-grep is the fast path. Delegation is the escalation when the
change isn't symbol-derivable.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
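The self-grep fast path described in this commit can be approximated in a few lines. A minimal sketch, assuming plain substring matching is good enough and that search terms have already been derived from the diff by hand; `find_impacted_docs` is a hypothetical helper, and the demo builds a throwaway directory tree instead of touching the real repo:

```python
from pathlib import Path
import tempfile


def find_impacted_docs(repo_root: Path, terms: list[str], search_dirs: list[str]) -> dict[str, list[str]]:
    """Map each search term to the repo-relative paths of files that mention it."""
    hits: dict[str, list[str]] = {term: [] for term in terms}
    for rel_dir in search_dirs:
        base = repo_root / rel_dir
        if not base.is_dir():
            continue  # tolerate directories that do not exist in this checkout
        for path in sorted(base.rglob("*")):
            if not path.is_file():
                continue
            try:
                text = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # skip binaries and unreadable files
            for term in terms:
                if term in text:
                    hits[term].append(path.relative_to(repo_root).as_posix())
    return hits


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "fern").mkdir()
    (root / "tutorials").mkdir()
    (root / "fern" / "executors.mdx").write_text("Use XennaExecutor here.", encoding="utf-8")
    (root / "tutorials" / "README.md").write_text("No symbols mentioned.", encoding="utf-8")
    hits = find_impacted_docs(root, ["XennaExecutor", "IS_FANOUT_STAGE"], ["fern", "tutorials", "missing"])
```

Each non-empty entry in `hits` would then need one of the three dispositions named above: update in the same PR, a `no-impact:<reason>` note, or escalation to the Docs Steward.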
| authoritative for Fern. `docs/broken_links_*.json` is the
| deprecated Sphinx site — ignore for Fern.
| - **Variable substitution.** `fern/substitute_variables.py` rewrites
| `{{ current_release }}` / `<release/>`. Don't hand-pin versions
| where substitution would apply.
Fabricated version file: latest.yml does not exist
The Protect section lists fern/versions/{latest,main,v25.09,v26.02,v26.04}.yml as the canonical set of version files, but fern/versions/latest.yml does not exist in the repository (only main.yml, v25.09.yml, v26.02.yml, and v26.04.yml are present). This is the exact pattern the root AGENTS.md defines as "Fabricated CLI / config fields" — and this steward file is the first place an agent will look when managing fern/ versioning. An agent that trusts this inventory will attempt to read, validate, or update a file that doesn't exist, then treat the missing file as a regression rather than an authoring error.
Disagreeing — fern/versions/latest.yml does exist on this branch:
$ ls fern/versions/
latest.yml
main
main.yml
v25.09
v25.09.yml
v26.02
v26.02.yml
v26.04
v26.04.yml
The file is referenced as the redirect-only manifest (no matching latest/ directory, which the steward already calls out two lines later: "latest.yml is redirect-only — no latest/ directory"). This is the second "Unverified finding regression" pattern on this PR from this tool — the Known Regression Pattern is in root AGENTS.md for a reason. No code change needed.
| against.
| 3. The nightly cron runs all entries in `nightly-benchmark.yaml` on
| 4×A100; results post to the rapids-workflows-nightly-tests
| channel.
This seems to be referencing an internal Slack channel.
Good catch — dropped the internal Slack channel name. Now reads "team's results sink" in 02f6c75.
| - Hardware references (H100, L40S, A100, GB200) tied to specific
| workloads
| - Cost-per-token / cost-per-hour-of-video claims
| - Headline speedup numbers (e.g., RedPajama-v2 fuzzy-dedup figures
I thought it was CC not red pajama?
The current public README headline is RedPajama v2: "16× faster fuzzy deduplication on 8 TB RedPajama v2 (1.78 trillion tokens)". Nemotron-CC is cited separately as the end-to-end recipe. To avoid pinning a specific corpus that may drift, I softened the steward to point at README first as the source of truth (02f6c75). If we're swapping the headline benchmark to Nemotron-CC, the README + the public site update first, and the steward then auto-discovers via grep.
| (`self.model = AutoModel.from_pretrained(...).to("cuda")`).
| GPU-resident state lives here.
| - A stage that overrides `setup()` is auto-routed to a Ray Data
| **Actor**; stateless stages become Ray Data **Tasks**. Overriding
I know this is true for text filters and modifiers but I don't think this auto detection is true across the board right? I think you normally have to specify if you want it to be an actor (appropriate for stages that persist some type of object in memory per actor, like a tokenizer or model).
Good correction. Praateek's deck described auto-routing as a general behavior; you're saying it's opportunistic in practice. Reframed in 02f6c75: auto-detection works for simpler text filter/modifier stages but isn't reliable across the board, and any stage holding a tokenizer / model / lookup table in memory should explicitly set RayStageSpecKeys.IS_ACTOR_STAGE: True in ray_stage_spec.
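The explicit declaration recommended in this reply might look like the sketch below. Everything here is a local stand-in written for illustration: the `RayStageSpecKeys` enum only mirrors the key name quoted in the thread (its string value is an assumption), `runs_as_actor` is a hypothetical scheduler-side check, and none of it imports the real NeMo Curator classes.

```python
from enum import Enum
from typing import Any


class RayStageSpecKeys(str, Enum):
    """Local stand-in mirroring the key name quoted in the thread."""

    IS_ACTOR_STAGE = "is_actor_stage"  # value is an assumption


class TokenizerHoldingStage:
    """Stand-in stage that keeps a tokenizer or lookup table in memory."""

    def ray_stage_spec(self) -> dict[str, Any]:
        # Declare actor placement explicitly instead of relying on
        # setup()-based auto-detection.
        return {RayStageSpecKeys.IS_ACTOR_STAGE: True}


class StatelessStage:
    """Stand-in stage that declares no spec at all."""


def runs_as_actor(stage: object) -> bool:
    """Hypothetical check: treat a stage as an actor only when its spec says so."""
    spec_fn = getattr(stage, "ray_stage_spec", None)
    spec = spec_fn() if callable(spec_fn) else {}
    return bool(spec.get(RayStageSpecKeys.IS_ACTOR_STAGE, False))


explicit = runs_as_actor(TokenizerHoldingStage())
implicit = runs_as_actor(StatelessStage())
```

The point of the explicit key is that placement no longer depends on whether a backend happens to notice an overridden `setup()`.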
| - **Fault tolerant** — stages survive preemption and reschedule; partial state is recoverable
|
| The same pipeline definition must run unchanged across Xenna, Ray Actor
| Pool, and Ray Data executors.
Lingering comment should not group Ray Actor Pool with Xenna and Ray Data.
Fixed in 02f6c75 — root now reads: "The same pipeline definition must run unchanged across the streaming executors (Xenna and Ray Data). Ray Actor Pool is the dedup-batch executor — appropriate for shuffle-based dedup and full-data-state passes, not for general streaming workloads."
| `setup()`. Loading in `__init__` serializes the model to every
| replica. Downloading weights belongs in `setup_on_node()`. See the
| setup-discipline rule in [parent](../../AGENTS.md).
| - **Tokenizer and model artifact handling.** HuggingFace-pinned
This isn't necessarily text-specific but since it is common for text, I would like us to specify that tokenization and model inference should always be split into 2 stages and defined together in a composite stage. Like in general, if we are loading some type of model object in memory, it is usually better for each model to be its own stage. This is especially true for the model forward pass which has a huge advantage when it has more GPU resources available in the pipeline, as opposed to having to share the time and resources with the tokenization step.
Strong guidance - added in 02f6c75 as a new Protect bullet on the text steward:

> Tokenization and model inference are always separate stages. When a pipeline needs both (classifier inference, embedder inference), split them into two stages and compose them with CompositeStage. Tokenization is CPU-bound; inference is GPU-bound. Keeping them in one stage forces the GPU stage to share its replicas with tokenization, hurting throughput. Each model loaded in memory deserves its own stage so the GPU stage can scale independently.

I kept it on the text steward since that's where it activates most often, but the convention is general - happy to move it to Pipeline steward if you'd prefer it broader.
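The split described in that bullet can be sketched with toy stages. This is an illustration only: `TokenizeStage`, `InferenceStage`, and `SketchCompositeStage` are stand-ins defined here, not the real NeMo Curator classes; the `cpus` / `gpus` fields and the `decompose` method are assumptions chosen to show that each stage keeps its own resource request.

```python
from dataclasses import dataclass


@dataclass
class TokenizeStage:
    """CPU-bound step: turns text into token ids (toy whitespace tokenizer)."""

    cpus: float = 1.0
    gpus: float = 0.0

    def process(self, text: str) -> list[int]:
        return [len(word) for word in text.split()]


@dataclass
class InferenceStage:
    """GPU-bound step: scores token ids (toy model: mean token length)."""

    cpus: float = 1.0
    gpus: float = 1.0

    def process(self, token_ids: list[int]) -> float:
        return sum(token_ids) / len(token_ids) if token_ids else 0.0


class SketchCompositeStage:
    """Stand-in composite: it only groups stages; each keeps its own resources."""

    def __init__(self, stages: list) -> None:
        self.stages = stages

    def decompose(self) -> list:
        return list(self.stages)


def run(composite: SketchCompositeStage, text: str):
    """Feed the output of each decomposed stage into the next one."""
    value = text
    for stage in composite.decompose():
        value = stage.process(value)
    return value


classifier = SketchCompositeStage([TokenizeStage(), InferenceStage()])
score = run(classifier, "split tokenizer from model")
gpu_requests = [stage.gpus for stage in classifier.decompose()]
```

Because the composite decomposes into two stages with separate requests, a scheduler could scale the GPU stage without also replicating the tokenizer.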
Maybe worth mentioning, if you are adding a model test that requires a HF_TOKEN, then you will need to coordinate with the automation team to get the model configured with the token in CI/CD.
Added in 02f6c75:

> HF_TOKEN-gated tests require automation-team coordination. If a test loads a HuggingFace model that needs an access token, the test will fail in CI without the token configured. Coordinate with @NVIDIA-NeMo/automation to add the token (and any model-specific access grants) to the CI secrets before merging. Use pytest.skip if the token is absent locally, never pytest.fail, which hard-blocks contributors without the secret.

The steward swarm flagged three test files using pytest.fail on missing HF_TOKEN; those are tracked as follow-up fixes.
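A skip-on-missing-token guard along the lines of that bullet could look like this. It is a sketch rather than code from the repo: `require_hf_token` and `test_gated_model_loads` are hypothetical names, and the test body is a placeholder for a real model load.

```python
import os

import pytest


def require_hf_token() -> str:
    """Return HF_TOKEN, or skip the calling test when it is not set."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        # Skip rather than fail so contributors without the secret are not blocked.
        pytest.skip("HF_TOKEN not set; skipping gated-model test")
    return token


def test_gated_model_loads() -> None:
    token = require_hf_token()
    # A real test would pass `token` to the model loader here.
    assert token
```

In CI, where the secret is configured, the same test runs normally; locally it reports as skipped instead of failed.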
Seven correctness fixes from Sarah's inline comments on PR
NVIDIA-NeMo#2005:
1. Root AGENTS.md: rephrase Xenna/Ray Actor Pool/Ray Data grouping:
streaming is Xenna + Ray Data; Ray Actor Pool is dedup-batch.
2. nemo_curator/AGENTS.md: Ray Data auto-detection of stateful stages
based on setup() is opportunistic, not guaranteed. Stages holding
tokenizers/models/lookup tables in memory should explicitly set
RayStageSpecKeys.IS_ACTOR_STAGE: True in ray_stage_spec.
3. benchmarking/AGENTS.md: drop the internal Slack channel name and
replace it with "team's results sink".
4. benchmarking/AGENTS.md: soften the RedPajama-specific reference; the
canonical fuzzy-dedup and Nemotron-CC end-to-end recipe are both cited
in README. Discoverer should grep README first.
5. nemo_curator/stages/text/AGENTS.md: text/modules/ is utility code
(add_id, joiner, splitter), not filter/modifier code. Remove from the
filter/modifier grouping.
6. nemo_curator/stages/text/AGENTS.md: add architectural guidance:
tokenization and model inference are always separate stages, composed
via CompositeStage. Tokenization is CPU-bound; inference is GPU-bound;
shared stages force GPU replicas to share with tokenization.
7. tests/AGENTS.md: HF_TOKEN-gated tests require automation-team
coordination; use pytest.skip when token absent locally, never
pytest.fail.
Greptile's P1 on fern/AGENTS.md:47 (latest.yml fabricated) is a false
positive: latest.yml exists (verified via ls fern/versions/). Will
reply on the PR thread; no code change needed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
| ## Do Not
|
| - Use real network calls (HuggingFace download, remote APIs) in CPU
I think small Hugging Face downloads are okay. We use them for classifier and embedder tests.
| - **Programmatic embedder registry** mirroring
| `classifiers/__init__.py:_LAZY` so docs don't hand-count.
| Auto-generate `fern/` reference pages from `__all__`.
| - **Clearer separation** between `text/experimental/` and stable
Maybe let's remove this point. I think text/experimental/ will not always be there.
Summary
Introduces an `AGENTS.md` steward network for NeMo Curator: a root constitution plus ten scoped, domain-specific guides that activate passively when an agent works in their directory (per the agents.md convention). The system encodes the convictions, invariants, and review hooks each domain cares about so that an agent traversing the project has the local expert's voice in scope from the moment it enters a directory.

This is not an automated reviewer that replaces humans. It's a coordination layer on top of `.github/CODEOWNERS`: AI stewards advise; CODEOWNERS approve. The system already produced real findings during validation (see Validation below), including a determinism bug and an embedder-import regression.

How it works

    graph TB
      USER([Human])
      USER --> AGENT
      subgraph ALWAYS[Always loaded]
        ROOT["Root AGENTS.md<br/>━━━━━━━<br/>• Product + architecture pillars<br/>• Stop & Ask<br/>• Inference Acceleration concerns<br/>• Known Regression Patterns<br/>• Steward Signal Format<br/>• Doc Autopilot triggers"]
      end
      AGENT["Implementing agent<br/>━━━━━━━<br/>synthesis & decisions"]
      ROOT --> AGENT
      subgraph SCOPED[Scoped — activate on directory entry]
        direction TB
        S1[nemo_curator/<br/>Pipeline & Stage Contract]
        S2[backends/<br/>Executor Parity]
        S3[stages/deduplication/<br/>Determinism]
        S4[stages/text/<br/>Text modality]
        S5[stages/video/<br/>All-GPU video]
        S6[stages/synthetic/<br/>SDG]
        S7[tests/<br/>Parity + coverage]
        S8[fern/<br/>Canonical docs]
        S9[benchmarking/<br/>Defensible perf]
        S10[tutorials/<br/>First impression]
      end
      AGENT -->|enters scope| SCOPED
      AGENT -.->|delegates parallel work| SWARM["Subagent swarm<br/>(returns Steward Signal Format)"]
      SWARM -.-> AGENT
      SCOPED -.->|routes review to| CO["@NVIDIA-NeMo CODEOWNERS"]

Three activation modes:
- … `AGENTS.md` loads automatically via the agents.md convention. Root is always loaded, so repo-wide rules (Inference Acceleration, Known Regression Patterns, Doc Autopilot) are always in scope.

The stewards
- `AGENTS.md` (root)
- `nemo_curator/AGENTS.md`: Task/ProcessingStage/Pipeline/Resources ABI; per-stage resource ergonomics; task-centric / map-style / fault-tolerant pillars
- `nemo_curator/backends/AGENTS.md`: … `runtime_env` honoring
- `nemo_curator/stages/deduplication/AGENTS.md`
- `nemo_curator/stages/text/AGENTS.md`
- `nemo_curator/stages/video/AGENTS.md`
- `nemo_curator/stages/synthetic/AGENTS.md`
- `tests/AGENTS.md`: `@pytest.mark.gpu` discipline; shared Ray cluster fixture; modality selection rules; 80% changed-line coverage
- `fern/AGENTS.md`: … `docs/`; agentic features (Ask AI, llms.txt, MCP); Doc Autopilot ritual
- `benchmarking/AGENTS.md`
- `tutorials/AGENTS.md`: … `<modality>_cpu` / `<modality>_cuda12`); cluster-scheduled examples

Cross-cutting concerns (root-owned, always active)
- … `docs/` vs `fern/` regression, doc-snippet rot, naming/counting drift, cross-page inconsistency, narrow-fix regression, unverified finding regression.
- … `fern/` site plus `tutorials/`, `.cursor/rules/`, `.github/copilot-instructions.md`, `README.md`, `api-design.md` - narrow fixes are the dominant regression mode.
- … `fern/` first if that would solve the problem for both humans and agents.

Validation
The steward network produced real findings during three swarm passes before this PR landed:
- … `fern/`).
- `IdGenerator.hash_files` is caller-order-dependent (silent ID drift hazard) - `nemo_curator/stages/deduplication/id_generator.py:47-49`
- … `torch` / `sentence_transformers` / `transformers` at parse time, breaking CPU-only installs - `nemo_curator/stages/text/embedders/{__init__.py:15,base.py:19-23}`
- `Resources._get_gpu_memory_gb()` returns a hardcoded 24 GB fallback that masks misconfiguration - `nemo_curator/stages/resources.py:29`
- `CompositeStage.with_(dict)` is LSP-incompatible with `ProcessingStage.with_(name=, resources=, ...)` - `nemo_curator/stages/base.py:262` vs `:381`
- … `ffmpeg` subprocess paths) - convergent finding from three independent stewards

These are tracked as follow-up work; this PR ships the network, not the fixes.
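To make the caller-order finding concrete, here is one way an order-independent digest over file paths could be written. This is a sketch only, not the actual `IdGenerator.hash_files` implementation and not necessarily the fix the follow-up will choose; `order_independent_hash` is a hypothetical name, and deduplicating paths with `set` is an assumption about the desired behavior.

```python
import hashlib


def order_independent_hash(filepaths: list[str]) -> str:
    """Digest a collection of file paths so that caller ordering does not matter."""
    digest = hashlib.sha256()
    for path in sorted(set(filepaths)):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")  # separator so ["ab", "c"] and ["a", "bc"] differ
    return digest.hexdigest()


a = order_independent_hash(["part-1.jsonl", "part-0.jsonl"])
b = order_independent_hash(["part-0.jsonl", "part-1.jsonl"])
```

Sorting before hashing means two callers that list the same files in different orders get the same key, which is the property the finding says is currently missing.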
Commits in this PR
- … `text/modifiers/` docstring (cross-page-inconsistency pattern)

Open follow-ups
- … `tutorials/` and several `tests/stages/<modality>/` subtrees.
- … `fern/` doc paths in each scoped steward's Own list (currently deferred to first Doc Autopilot sweep).

Test plan
- `markdownlint-cli2` clean across all 11 `AGENTS.md` files
- … `nemo_curator/stages/text/modifiers/modifier.py` - relevant stewards activated, irrelevant stewards stayed quiet

🤖 Generated with Claude Code