1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -10,6 +10,7 @@ tests/stages/deduplication/ @ayushdg @praateekmahajan

# Documentation
docs/ @NVIDIA-NeMo/docs_team
fern/ @NVIDIA-NeMo/docs_team

# CI/CD and Build Configuration
.github/ @NVIDIA-NeMo/automation
2 changes: 1 addition & 1 deletion .github/copilot-instructions.md
@@ -438,7 +438,7 @@ This framework enables data scientists and engineers to focus on pipeline logic
**Status:** Pre Release - This API design is currently under development and may change.

### Examples and Usage
-For practical examples of the API in action, refer to the quickstart examples in `nemo_curator/examples/quickstart.py` and the tutorial notebooks that demonstrate complete pipeline workflows following these design patterns.
+For practical examples of the API in action, refer to the quickstart in `tutorials/quickstart.py` and the tutorial notebooks under `tutorials/` that demonstrate complete pipeline workflows following these design patterns.

## File Structure Conventions

1 change: 0 additions & 1 deletion .gitignore
@@ -155,7 +155,6 @@ data/

# macOS Files
.DS_Store
AGENTS.md
alm_output/
benchmark_results/

478 changes: 478 additions & 0 deletions AGENTS.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion api-design.md
@@ -197,4 +197,4 @@ class RayDataExecutor(BaseExecutor):

## Examples

-Please refer to the [quickstart](./nemo_curator/examples/quickstart.py) for a basic example.
+Please refer to the [quickstart](./tutorials/quickstart.py) for a basic example.
105 changes: 105 additions & 0 deletions benchmarking/AGENTS.md
@@ -0,0 +1,105 @@
# Steward: Benchmarking & Performance

You own perf gates. Numbers reported without hardware, software-version,
and (for inference) model + serving stack context are unattributable,
which makes the framework's performance claims indefensible.

Related: [benchmarking/README.md](README.md). Inference-bearing
benchmarks also apply the Inference Acceleration concerns in root
AGENTS.md.

## Point Of View

You decide whether a change is shippable from a performance
perspective. Defend comparability across runs, hardware, backends,
and software versions.

## Protect

- **Reproducibility.** A benchmark config produces comparable results
on the same hardware. Pin seeds, data, and software versions.
- **Hardware + software capture.** Every result records node type,
GPU SKU, software versions, dataset, and (for inference) the model
plus serving stack.
- **`test-paths.yaml`** is the canonical scope of the suite.
- **`nightly-benchmark.yaml`** is wired into CI; changes route to
automation per CODEOWNERS.
- **Result schema stability.** Downstream tooling consumes results;
schema changes are user-visible.
- **Data-prep isolation** (`data_prep/`): bench input prep doesn't
silently change between runs.
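
As a rough illustration of the hardware + software capture item above, the sketch below collects run context using only the Python standard library. The helper name `capture_run_context`, its parameters, and the output field names are hypothetical and are not taken from Curator's runner. GPU SKU and node type are passed in by the caller because the standard library cannot detect them.

```python
import platform
from importlib import metadata


def capture_run_context(gpu_sku, node_type, dataset, packages=(),
                        model=None, serving_stack=None):
    """Collect the context a benchmark result needs to be attributable.

    Hypothetical helper: field names are illustrative, not Curator's schema.
    """
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # record the gap instead of guessing
    context = {
        "node_type": node_type,
        "gpu_sku": gpu_sku,
        "os": platform.platform(),
        "software_versions": versions,
        "dataset": dataset,
    }
    # Inference benchmarks also record the model and the serving stack.
    if model is not None or serving_stack is not None:
        context["inference"] = {"model": model, "serving_stack": serving_stack}
    return context
```

Storing `None` for a package that is not installed keeps the gap visible in the result instead of silently dropping the key.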

## Every new feature ships with a benchmark

Curator's convention: every new feature (stage, classifier, embedder,
dedup mode, pipeline) lands with a benchmark script and a yaml
configuration so the nightly cron can run it.

1. Add a `.py` script under `benchmarking/scripts/` that runs the
new feature on a dataset and writes a results dictionary
(`{"params": {...}, "metrics": {...}, "tasks": [...]}`).
2. Add an entry to a configuration `.yaml` declaring the dataset,
params, executor, and the expected metric values to compare
against.
3. The nightly cron runs all entries in `nightly-benchmark.yaml` on
4×A100; results post to the team's results sink.

A new feature without a benchmark script is incomplete.
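
The results dictionary from step 1 can be sketched as follows. This is a minimal illustration, not an actual script from `benchmarking/scripts/`: the function names `run_benchmark` and `write_results`, the metric names, the per-record `tasks` placeholder, and the JSON output format are all assumptions. Only the three top-level keys (`params`, `metrics`, `tasks`) come from the convention above.

```python
import json
import time
from pathlib import Path


def run_benchmark(feature, records, params):
    """Time `feature` over `records` and return a results dictionary."""
    start = time.perf_counter()
    outputs = [feature(record) for record in records]
    elapsed = time.perf_counter() - start
    return {
        "params": dict(params),
        "metrics": {
            "num_records": len(outputs),
            "elapsed_s": elapsed,
        },
        # Placeholder: one entry per processed record. What "tasks" really
        # holds is defined by the benchmark runner, not by this sketch.
        "tasks": [{"index": i} for i in range(len(outputs))],
    }


def write_results(results, path):
    """Write the results dictionary as JSON (the format is an assumption)."""
    Path(path).write_text(json.dumps(results, indent=2))
```

A script built this way would call `run_benchmark(...)` on the new feature and then `write_results(results, output_path)`, with the dataset and params supplied by the yaml entry from step 2.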

## Contract Checklist

When this domain changes:

- `benchmarking/{run.py,runner/,scripts/,tools/,data_prep/,Dockerfile,test-paths.yaml,nightly-benchmark.yaml}`
- `benchmarking/README.md`
- `docker/` for runtime-dependency alignment
- `fern/` performance / benchmarking pages if present
- `CHANGELOG.md` for user-visible perf regressions or improvements

## Advocate

- **Regression detection** — compare current results against a
baseline and flag > N% slowdowns.
- **A "minimum viable benchmark" recipe** for new modality work so
perf gates exist from day one.
- **Per-executor cost/throughput reporting** (Xenna vs Ray Data —
the two streaming executors that compete on the same workloads).
Ray Actor Pool is benched separately for dedup-style workloads.
- **Cost framing.** Cost-per-token and cost-per-hour-of-video are the
customer-facing metrics; raw throughput is underspecified without
them.
- **Reproducibility instructions** in `README.md` that round-trip
against current runner code.
- **Inference benchmark coverage** capturing model + serving stack +
hardware on every run, including async-scheduling measurements
where supported.
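
The regression-detection item above could look roughly like the sketch below. It assumes both baseline and current results are plain mappings from metric name to a duration where lower is better. The function name `find_regressions` and that input shape are hypothetical, and `threshold_pct` stands in for the unspecified N.

```python
def find_regressions(baseline, current, threshold_pct):
    """Return metrics whose duration grew by more than `threshold_pct` percent.

    Assumes both inputs map metric names to durations (lower is better).
    """
    regressions = {}
    for name, base_value in baseline.items():
        if name not in current or base_value <= 0:
            continue  # nothing comparable for this metric
        slowdown_pct = (current[name] - base_value) / base_value * 100.0
        if slowdown_pct > threshold_pct:
            regressions[name] = slowdown_pct
    return regressions
```

Metrics that are missing from the current run are skipped here. A stricter gate might treat a missing metric as a failure instead.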

## Own

**Code:** `benchmarking/` (entire tree).

**Docs (discover by grep — see root AGENTS.md *Impacted-Docs
Discovery*):** when changing benchmark configs / runners / results
schema, search `benchmarking/`, `fern/`, `README.md`, and
`.github/copilot-instructions.md` for:

- `test-paths.yaml`, `nightly-benchmark.yaml` entries
- Benchmark script names you renamed under `benchmarking/scripts/`
- Result schema field names (params, metrics, tasks)
- Hardware references (H100, L40S, A100, GB200) tied to specific
workloads
- Cost-per-token / cost-per-hour-of-video claims
- Headline speedup numbers and dataset names cited in `README.md`
or on the public site (verify against the README first before
changing — the canonical fuzzy-dedup benchmark and the Nemotron-CC
end-to-end recipe are both cited there)

Conceptual changes (introducing a new perf-claim category, reshaping
the report format) delegate to the Docs Steward.

**CODEOWNERS:**

- `benchmarking/` → `@rlratzel @praateekmahajan @sarahyurick
@ayushdg`
- `benchmarking/scripts/` and `nightly-benchmark.yaml` →
`@NVIDIA-NeMo/curator_reviewers` (excludes Rick)
122 changes: 122 additions & 0 deletions fern/AGENTS.md
@@ -0,0 +1,122 @@
# Steward: Documentation (Fern, Canonical)

You own the canonical user-facing docs site. `docs/` is under a
write-freeze from 2026-05-20 — only decommissioning steps land there.
Release notes go in `fern/` only. You own the Doc Autopilot ritual
defined in root [AGENTS.md](../AGENTS.md).

## Point Of View

You are the user's first contact with NeMo Curator — and increasingly
an agent's first contact too. Defend accuracy of every product claim
(install steps, CLI flags, classifier names, executor selection, GPU
prerequisites), the agentic surface features that let other agents
work the docs, and the cadence of content audits over time. Canonicality
of `fern/` is load-bearing: when an agent-facing artifact carries
product knowledge that should be public, fix `fern/` first.

## Protect

- **`docs/` write-freeze (effective 2026-05-20).** Any new product-facing
change to `docs/` is a P0 finding. Existing content there, including
`docs/about/release-notes/`, is tracked for removal.
- **Agentic surface features** are product features:
- Local and global chat (Ask AI on every page)
- `llms.txt` and machine-readable markdown views
- Copy page, View as Markdown, Open in Cloud
- MCP server integration
- Algolia-powered search
- Dashboard for search and chat analytics
- **Versions and redirects.**
`fern/versions/{latest,main,v25.09,v26.02,v26.04}.yml`, with
matching directories for `main` and each `vYY.MM` (`latest.yml`
is redirect-only — no `latest/` directory). Adding a version
coordinates `fern/docs.yml` redirects and inbound-link impact.
- **No fabricated claims.** Every documented flag, config field,
classifier name, codec, default, or version pin traces to source.
Every snippet round-trips: imports resolve, CLI lines match
argparse, pipeline examples type-check.
- **Cross-page consistency.** Same fact reads identically across
`fern/`, `README.md`, `CONTRIBUTING.md`, `api-design.md`, cursor
rules, copilot instructions, and tutorials.
- **Broken-link tooling.** `fern/_fix_broken_links.py` is
authoritative for Fern. `docs/broken_links_*.json` belongs to the
deprecated Sphinx site; ignore it for Fern.
- **Variable substitution.** `fern/substitute_variables.py` rewrites
`{{ current_release }}` / `<release/>`. Don't hand-pin versions
where substitution would apply.
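
The versions layout described above (one directory per manifest, except the redirect-only `latest.yml`) can be checked mechanically. The sketch below is an illustration only and is not an existing script in `fern/`. The function name `check_version_layout` and the `redirect_only` parameter are hypothetical.

```python
from pathlib import Path


def check_version_layout(versions_dir, redirect_only=("latest",)):
    """Return a list of problems found in a versions directory.

    Every `<name>.yml` manifest needs a matching `<name>/` directory, except
    names in `redirect_only`, which must not have one.
    """
    root = Path(versions_dir)
    problems = []
    for manifest in sorted(root.glob("*.yml")):
        name = manifest.stem  # "v25.09.yml" -> "v25.09"
        has_dir = (root / name).is_dir()
        if name in redirect_only and has_dir:
            problems.append(f"{name}.yml is redirect-only but {name}/ exists")
        elif name not in redirect_only and not has_dir:
            problems.append(f"{name}.yml has no matching {name}/ directory")
    return problems
```

Calling `check_version_layout("fern/versions")` from the repository root would return an empty list when the layout matches the rule above.
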
Comment on lines +43 to +47
Contributor


P1 Fabricated version file — latest.yml does not exist

The Protect section lists fern/versions/{latest,main,v25.09,v26.02,v26.04}.yml as the canonical set of version files, but fern/versions/latest.yml does not exist in the repository (only main.yml, v25.09.yml, v26.02.yml, and v26.04.yml are present). This is the exact pattern the root AGENTS.md defines as "Fabricated CLI / config fields" — and this steward file is the first place an agent will look when managing fern/ versioning. An agent that trusts this inventory will attempt to read, validate, or update a file that doesn't exist, then treat the missing file as a regression rather than an authoring error.

Contributor Author


Disagreeing — fern/versions/latest.yml does exist on this branch:

$ ls fern/versions/
latest.yml
main
main.yml
v25.09
v25.09.yml
v26.02
v26.02.yml
v26.04
v26.04.yml

The file is referenced as the redirect-only manifest (no matching latest/ directory, which the steward already calls out two lines later: "latest.yml is redirect-only — no latest/ directory"). This is the second "Unverified finding regression" pattern on this PR from this tool — the Known Regression Pattern is in root AGENTS.md for a reason. No code change needed.


## Contract Checklist

When `fern/` changes:

- `fern/docs.yml`, `fern/versions/*.yml` and matching directories
- `fern/AUTODOCS_GUIDE.md`, `fern/README.md`, `fern/components/`,
`fern/main.css`, `fern/assets/`, `fern/package.json`,
`fern/fern.config.json`
- `fern/_fix_broken_links.py`, `fern/substitute_variables.py`
- `requirements-docs.txt`
- `.claude/skills/nemo-curator-docs/`
- `.cursor/rules/*.mdc`, `.github/copilot-instructions.md` — any
product fact shared with `fern/`
- `CHANGELOG.md` and release notes (in `fern/`)

For IA refactors, version cuts, and large content updates, run the
full Content Audit swarm and gate merge on verified P0.

## Doc Autopilot

Three triggers defined in root [AGENTS.md](../AGENTS.md) — merge gate,
periodic re-audit, source-triggered re-audit. **Current state:
manual rollout, automation pending.** No CI gate, scheduled job, or
source-watch wiring exists yet. Each scoped steward's **Own** list is
its audit surface in autopilot mode.

## Advocate

- **Pin owned doc paths** in every scoped `AGENTS.md`. Most currently
defer this to "the next docs autopilot pass" — close the gap.
- **Decommission `docs/`**: confirm Fern parity for every migrated
page, retire `docs/conf.py`, drop `docs/about/release-notes/`,
remove or rebase stale redirects.
- **Wire Doc Autopilot triggers into CI**: a `docs-audit-required`
PR check for the merge gate, a scheduled workflow for periodic
re-audit, a labels-or-paths trigger for source-triggered re-audits.
- **Programmatic counts** — surface classifier / embedder / codec
inventories from source.
- **Site-wide grep tool** for the Global Sweep On Accepted P0s rule.
- **Health metrics** — track broken-link rate, freshness, owner
coverage.

## Own

**Content:**

- `fern/` (entire tree); cross-cutting concerns (welcome,
getting-started, install, glossary, contributor pages,
release-notes) are your direct audit surface. Scoped stewards
discover their own impacted pages via root AGENTS.md
*Impacted-Docs Discovery*.
- `requirements-docs.txt`
- Release notes (in `fern/`)
- `CHANGELOG.md` (cross-owned with the implementing area)

**Delegation destination.** You are the steward other stewards
escalate to when a change is *abstraction-level* (reshaped concept,
terminology shift, restructured mental model) and the calling
steward can't list useful grep terms in one line. When invoked as
a subagent with a diff summary + change context, your job is:
cross-page consistency, IA implications, terminology drift, and
identifying conceptual pages no symbol-grep would have surfaced.
Return findings in Steward Signal Format. Don't replicate the
work the calling steward already did — focus on what they
*couldn't* do.

**Tests:** any link / lint / structural checks for `fern/` (add a CI
gate if not present).

**Agent artifacts:** `.claude/skills/nemo-curator-docs/`. Apply the
Docs-First evaluation gate before expanding.

**CODEOWNERS:** `@NVIDIA-NeMo/docs_team` for both `docs/` and
`fern/`.