docs: rewrite README and expand CONTRIBUTING per PM review and OSS audit by lbliii · Pull Request #2030 · NVIDIA-NeMo/Curator

lbliii · 2026-05-26T16:54:21Z

Summary

Addresses the PM's README review and the OSS Community-Friendly audit. Touches three files at the repo root.

README.md — trust-first rewrite

Quick Start split into three paths: CPU smoke test (text_cpu, no GPU), GPU text (text_cuda12, with prereqs), and Docker (recommended for video/audio).
Canonical install URLs — replaced latest/admin/installation.html with latest/get-started/installation everywhere; dropped .html suffixes.
Benchmark claims moved into a sourced table (workload, hardware, baseline, result, source link to the throughput docs) with a caveat about the per-panel token-count discrepancy.
Version drift fixed — removed the stale "Ray 2.54" mention; removed Python-version copy so pyproject.toml (>=3.11,<3.14) and the PyPI badge are the single source of truth.
Product story reframed — hero now leads with who + job + outcome; added "Use NeMo Curator when…" bullets; Nemotron proof pulled up.
Structure collapsed to 8 sections; dropped the duplicated per-modality feature catalog (lives in docs).
Image refs switched to absolute raw.githubusercontent.com URLs so they render on PyPI.
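
As a rough illustration of the image-ref change, the sketch below rewrites relative Markdown image targets to absolute raw.githubusercontent.com URLs. The `main` branch in the base URL, the `absolutize_images` helper name, and the `docs/logo.png` path in the usage note are assumptions for illustration; none of them is taken from the PR diff.

```python
import re

# Assumed base URL: the repository's default branch is taken to be "main".
RAW_BASE = "https://raw.githubusercontent.com/NVIDIA-NeMo/Curator/main/"

# Matches Markdown image syntax ![alt](target); the target ends at whitespace or ')'.
IMAGE_RE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")

# Matches targets that already carry a URL scheme such as https://.
SCHEME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*://")


def absolutize_images(markdown: str, base: str = RAW_BASE) -> str:
    """Rewrite relative image targets to absolute URLs under `base` (hypothetical helper)."""

    def repl(match: re.Match) -> str:
        alt, target = match.group(1), match.group(2)
        if SCHEME_RE.match(target):
            return match.group(0)  # already absolute, leave untouched
        path = target.removeprefix("./").lstrip("/")
        return f"![{alt}]({base}{path})"

    return IMAGE_RE.sub(repl, markdown)
```

For example, `absolutize_images("![logo](./docs/logo.png)")` would yield an image reference under the assumed `main` branch, while references that already start with `https://` pass through unchanged.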

Audit must-haves now satisfied

README (was 5/8, now 8/8):

Apache-2.0 named in text (new ## License section).
Prominent local CONTRIBUTING.md link (no longer routes to NVIDIA/NeMo).
## Getting Help with channel-boundary table and explicit best-effort/no-SLA wording.
## Roadmap linking GitHub Milestones, project board, and release notes.

CONTRIBUTING (was 4/8, now 8/8):

## Ways to Contribute defines 7 contribution types.
## Your First Contribution links good first issue / help wanted queries and gives starter areas.
## Asking Questions and Discussing Changes maps needs to Discussions vs Issues.
## Code of Conduct section linking the new CODE_OF_CONDUCT.md.

CODE_OF_CONDUCT.md (new)

Adopts Contributor Covenant 2.1 by reference (canonical URL) rather than reproducing the full text inline. Covers Pledge, Standard, Scope, Reporting, Enforcement, and Attribution.

⚠️ Action needed before merge

CODE_OF_CONDUCT.md reports to a <TODO: confirm reporting contact — e.g., sw-coc@nvidia.com> placeholder. OSPO/legal needs to confirm the real address before this lands; the audit item is structurally satisfied but the channel isn't real yet.

Open follow-ups (PM PoA P4 — out of scope for this PR)

Confirm production-default executor (Cosmos-Xenna vs Ray Actor Pool vs Ray Data).
GA vs tutorial-level support per modality.
"Latest" version naming convention.

Test plan

Render README.md on GitHub and confirm all images load.
Validate that README.md renders for PyPI (e.g. run twine check on the built distribution, or use a PyPI-style Markdown preview) and confirm images load and link targets resolve.
Click every link in README.md, CONTRIBUTING.md, and CODE_OF_CONDUCT.md; confirm no 404s.
Confirm good first issue and help wanted label query URLs return results.
OSPO sign-off on the reporting contact placeholder in CODE_OF_CONDUCT.md.
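
The link-related items in the test plan could be partly automated. The following is a minimal sketch, not the check that was actually run for this PR: it extracts link targets from Markdown text, separates external URLs from relative paths, and reports relative paths that do not exist on disk. The helper names are hypothetical, and verifying that external URLs return 200 is left out because it needs network access.

```python
import re
from pathlib import Path

# Matches [text](target) and ![alt](target); nested badge links such as
# [![badge](img)](link) only yield the inner target in this simple sketch.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def extract_links(markdown: str) -> list[str]:
    """Return every link or image target found in the Markdown text."""
    return LINK_RE.findall(markdown)


def split_links(links: list[str]) -> tuple[list[str], list[str]]:
    """Split targets into external URLs and repo-relative paths."""
    external = [u for u in links if u.startswith(("http://", "https://"))]
    relative = [
        u for u in links
        if not u.startswith(("http://", "https://", "#", "mailto:"))
    ]
    return external, relative


def missing_relative(relative: list[str], root: Path) -> list[str]:
    """Return relative targets whose file or directory is absent under `root`."""
    return [p for p in relative if not (root / p.split("#")[0]).exists()]
```

Running `missing_relative` against the repo root for README.md, CONTRIBUTING.md, and CODE_OF_CONDUCT.md would cover the relative-path half of the check; the external half still needs an HTTP client or a manual click-through.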

🤖 Generated with Claude Code

README: trust-first rewrite. Quick Start split into CPU smoke, GPU text, and Docker paths. Canonical install URLs (get-started/installation). Benchmark claims moved to a sourced table with caveats. Removed stale Python/Ray version mentions; pyproject is the single source of truth. Added Getting Help (channel boundaries, no SLA), Roadmap (Milestones + project board), explicit Apache-2.0 License section, and prominent local Contributing link. Image refs use absolute raw.githubusercontent URLs so they render on PyPI.

CONTRIBUTING: added Ways to Contribute (contribution types), Your First Contribution (good first issue + help wanted), Asking Questions / Discussing Changes (Discussions vs Issues), and Code of Conduct sections.

CODE_OF_CONDUCT.md: new file adopting Contributor Covenant 2.1 by reference. Reporting contact is a flagged placeholder pending OSPO confirmation.

Closes audit gaps:
- README must-haves: license-by-name, prominent contributing link, getting-help boundaries, roadmap link.
- CONTRIBUTING must-haves: contribution types, good-first-issue guidance, Code of Conduct link, questions/discussions channel.

Signed-off-by: Lawrence Lane <llane@nvidia.com>

greptile-apps · 2026-05-26T16:57:14Z

Greptile Summary

This PR rewrites README.md, expands CONTRIBUTING.md, and adds a new CODE_OF_CONDUCT.md to satisfy a PM review and OSS community-friendly audit. The changes are documentation-only with no code impact.

README.md is restructured around a three-path Quick Start (CPU/GPU/Docker), a sourced benchmark table, a new "How It Works" architecture section, and new ## Roadmap, ## Getting Help, and ## License sections; image refs switched to absolute raw.githubusercontent.com URLs for PyPI compatibility.
CONTRIBUTING.md gains four new sections (Ways to Contribute, Your First Contribution, Asking Questions, Code of Conduct) while preserving the existing technical setup and PR guidelines.
CODE_OF_CONDUCT.md (new) adopts Contributor Covenant 2.1 by reference and routes violation reports through GitHub-native channels pending OSPO confirmation of a dedicated email alias.

Confidence Score: 4/5

Safe to merge once the executor default claim is resolved; all other changes are additive documentation improvements.

The README now asserts XennaExecutor (Cosmos-Xenna) is the production default as a settled fact, while the PR description simultaneously lists confirming the production-default executor as an open follow-up out of scope for this PR. A user who reads this README today and picks their deployment stack based on that statement will be acting on an unconfirmed claim. Everything else in the three files — the Quick Start paths, benchmark table, CONTRIBUTING expansion, and CODE_OF_CONDUCT — looks correct and clean.

README.md line 123 (executor default claim) and lines 92/138 (Nemotron-CC URL path change from data_curation/ to data/curation/ — needs link verification against the Nemotron repo).

Important Files Changed

| Filename | Overview |
|----------|----------|
| README.md | Major rewrite: new hero copy, three-path Quick Start, sourced benchmark table, condensed architecture section, added Roadmap/Getting Help/License sections. One confirmed content-accuracy issue (executor default claim vs. open follow-up) and one unverified URL path change for Nemotron-CC links. |
| CONTRIBUTING.md | Significantly expanded: added Ways to Contribute, Your First Contribution, Asking Questions, and Code of Conduct sections; section headings capitalised; existing technical content (style, setup, tests, PR guidelines) unchanged. |
| CODE_OF_CONDUCT.md | New file adopting Contributor Covenant 2.1 by reference; reporting placeholder resolved via GitHub-native channels with a note that a dedicated email will follow OSPO confirmation. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User opens README Quick Start] --> B{Choose path}
    B --> C[Path A — CPU smoke test]
    B --> D[Path B — GPU text pipeline]
    B --> E[Path C — Docker]

    C --> C1["uv venv && source .venv/bin/activate"]
    C1 --> C2["uv pip install nemo-curator[text_cpu]"]
    C2 --> C3["python -c 'import nemo_curator; print(version)'"]

    D --> D1{Prerequisites met?}
    D1 -->|CUDA 12, Linux x86_64, ~16 GB VRAM| D2["uv venv && source .venv/bin/activate"]
    D1 -->|Missing| D_fail[Install CUDA toolkit / NVIDIA driver]
    D2 --> D3["uv pip install nemo-curator[text_cuda12]"]
    D3 --> D4["python tutorials/quickstart.py"]

    E --> E1[Pull nemo-curator image from NGC]
    E1 --> E2[Run container with codec libraries pre-configured]
    E2 --> E3[Run video / audio pipelines]

    C3 & D4 & E3 --> F[Full Installation Guide]

Reviews (6): Last reviewed commit: "Merge branch 'main' into lbliii/update-r..."

greptile-apps · 2026-05-26T16:57:18Z

+All reports will be reviewed and investigated promptly and fairly. Maintainers are obligated to respect the privacy and security of the reporter of any incident.
+
+## Enforcement
+


Unresolved TODO placeholder blocks actionable reporting

The reporting contact is still <TODO: confirm reporting contact — e.g., sw-coc@nvidia.com>. A CoC with no reachable enforcement channel is worse than none — reporters cannot escalate violations, and the file ships with raw angle-bracket template syntax visible to every contributor. The PR description flags OSPO sign-off as required; this should be a hard merge gate, not a follow-up.

Fixed in 06748dd — removed the <TODO> placeholder. Reports now route through GitHub-native channels (Report abuse + direct maintainer contact), which are reachable today without OSPO sign-off. A dedicated email alias can be swapped in later when confirmed; no longer a merge blocker.

greptile-apps · 2026-05-26T16:57:19Z

+### Path B — GPU text pipeline (CUDA 12, supported Linux)

-Process and curate high-quality text datasets for large language model (LLM) training with multilingual support.
+The bundled quickstart starts Ray, downloads a Hugging Face model, and runs a sentiment classification pipeline on GPU.

-| Category | Features | Documentation |
-|----------|----------|---------------|
-| **Data Sources** | Common Crawl • Wikipedia • ArXiv • Custom datasets | [Load Data](https://docs.nvidia.com/nemo/curator/latest/curate-text/load-data/index.html) |
-| **Quality Filtering** | 30+ heuristic filters • fastText classification • GPU-accelerated classifiers for domain, quality, safety, and content type | [Quality Assessment](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/heuristic.html) |
-| **Deduplication** | Exact • Fuzzy (MinHash LSH) • Semantic (GPU-accelerated) | [Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/index.html) |
-| **Processing** | Text cleaning • Language identification | [Content Processing](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/content-processing/text-cleaning.html) |
+**Prerequisites:** CUDA 12 toolkit, NVIDIA driver supporting CUDA 12, Linux x86_64, ~16 GB GPU memory, network access to Hugging Face.

---
+```bash
+python -m venv .venv && source .venv/bin/activate
+uv pip install "nemo-curator[text_cuda12]"
+python tutorials/quickstart.py
+```


Path B requires uv but lists no prerequisite for it

Path A uses plain pip, while Path B switches to uv pip install without telling the reader they need uv installed first. A first-time user who follows Path B after reading Path A will get command not found: uv and have no hint about how to proceed. Either add a short uv install note, or use plain pip here for consistency with Path A.

Suggested change

### Path B — GPU text pipeline (CUDA 12, supported Linux)

The bundled quickstart starts Ray, downloads a Hugging Face model, and runs a sentiment classification pipeline on GPU.

**Prerequisites:** CUDA 12 toolkit, NVIDIA driver supporting CUDA 12, Linux x86_64, ~16 GB GPU memory, network access to Hugging Face.

> **Note:** Path B uses [`uv`](https://github.com/astral-sh/uv) for faster dependency resolution. Install it first with `pip install uv` or follow the [uv install guide](https://docs.astral.sh/uv/getting-started/installation/).

```bash
python -m venv .venv && source .venv/bin/activate
uv pip install "nemo-curator[text_cuda12]"
python tutorials/quickstart.py
```

Fixed in 06748dd — added an inline prereq note pointing readers to pip install uv and the uv install guide. Kept uv because it's what the path optimizes for; the note prevents the command not found cliff.

greptile-apps · 2026-05-26T16:57:20Z

+## Roadmap

-**Results:** Progressive improvements in zero-shot downstream task performance through text cleaning, deduplication, and quality filtering stages.
+Release work is tracked through versioned [GitHub Milestones](https://github.com/NVIDIA-NeMo/Curator/milestones). Larger initiatives live on the [project board](https://github.com/orgs/NVIDIA-NeMo/projects). For shipped changes, see the [release notes](https://docs.nvidia.com/nemo/curator/latest/about/release-notes).


Roadmap links to all org projects instead of the Curator-specific board

https://github.com/orgs/NVIDIA-NeMo/projects lands on every project in the NVIDIA-NeMo org, not the Curator project board. Readers won't know which board to look at. Link directly to the Curator project (e.g. /projects/1) or remove the board link until the specific URL is confirmed.

Suggested change

Release work is tracked through versioned [GitHub Milestones](https://github.com/NVIDIA-NeMo/Curator/milestones). Larger initiatives live on the [project board](https://github.com/orgs/NVIDIA-NeMo/projects). For shipped changes, see the [release notes](https://docs.nvidia.com/nemo/curator/latest/about/release-notes).

Release work is tracked through versioned [GitHub Milestones](https://github.com/NVIDIA-NeMo/Curator/milestones). For shipped changes, see the [release notes](https://docs.nvidia.com/nemo/curator/latest/about/release-notes).

Fixed in 06748dd — dropped the org-level project board link. Couldn't confirm a Curator-specific board URL (token lacks read:project scope), so following the suggestion to remove it until the specific URL is known. Milestones + release notes still link.

- CODE_OF_CONDUCT.md: drop <TODO> placeholder reporting contact. Route reports through GitHub-native channels (Report abuse + direct maintainer contact) that are reachable today, with a note that a dedicated alias will be added once OSPO confirms one.
- README.md: add prerequisite note for `uv` to Path B Quick Start so first-time readers don't hit `command not found: uv`.
- README.md: drop org-level project board link from the Roadmap section (it lands on all NVIDIA-NeMo projects, not Curator); Milestones + release notes remain.

Signed-off-by: Lawrence Lane <llane@nvidia.com>

sarahyurick · 2026-06-01T19:02:20Z

+New to the project? Start here:
+
+1. Browse issues labeled [`good first issue`](https://github.com/NVIDIA-NeMo/Curator/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) — these are scoped to be approachable without deep familiarity with the codebase.
+2. For slightly larger work, look at [`help wanted`](https://github.com/NVIDIA-NeMo/Curator/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22).


I don't think we use this tag.

Fixed in 58889bf — removed the help wanted reference. Confirmed the repo has 0 issues ever under that label (vs 33 under good first issue), so it's just good first issue now.
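
For reference, the label query URL quoted in the diff above can be rebuilt from its search terms. This is a small sketch using only the standard library; `issue_query_url` is a hypothetical helper name, and the query string format is inferred from the URL in the diff rather than from any GitHub API documentation.

```python
from urllib.parse import quote_plus


def issue_query_url(repo: str, label: str) -> str:
    """Build a GitHub issues search URL for open issues carrying `label`."""
    # The search terms mirror the query embedded in the CONTRIBUTING link.
    query = f'is:issue is:open label:"{label}"'
    return f"https://github.com/{repo}/issues?q={quote_plus(query)}"


good_first_issue_url = issue_query_url("NVIDIA-NeMo/Curator", "good first issue")
```

Building the URL this way keeps the percent-encoding of colons, quotes, and spaces consistent if the label name ever changes.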

sarahyurick · 2026-06-01T19:03:44Z

-## Features by Modality
+```bash
+python -m venv .venv && source .venv/bin/activate
+pip install "nemo-curator[text_cpu]"


We should encourage uv instead.

Fixed in 58889bf — Quick Start now standardizes on uv for both Path A and Path B, with a one-time install step up top.

sarahyurick · 2026-06-01T19:04:23Z

-| **Quality Filtering** | 30+ heuristic filters • fastText classification • GPU-accelerated classifiers for domain, quality, safety, and content type | [Quality Assessment](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/heuristic.html) |
-| **Deduplication** | Exact • Fuzzy (MinHash LSH) • Semantic (GPU-accelerated) | [Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/index.html) |
-| **Processing** | Text cleaning • Language identification | [Content Processing](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/content-processing/text-cleaning.html) |
+**Prerequisites:** CUDA 12 toolkit, NVIDIA driver supporting CUDA 12, Linux x86_64, ~16 GB GPU memory, network access to Hugging Face. This path uses [`uv`](https://docs.astral.sh/uv/) for faster dependency resolution — install it first with `pip install uv` (or follow the [uv install guide](https://docs.astral.sh/uv/getting-started/installation/)).


Install uv with curl right?

Right — fixed in 58889bf. Added a one-time curl -LsSf https://astral.sh/uv/install.sh | sh step at the top of Quick Start, and both paths now use uv venv / uv pip.

sarahyurick · 2026-06-01T19:06:12Z

+NeMo Curator powers the data pipelines behind [NVIDIA Nemotron](https://developer.nvidia.com/nemotron) models. The [Nemotron-4 pre-training dataset](https://arxiv.org/abs/2402.16819) was curated using NeMo Curator's text pipeline across 8+ trillion tokens of multilingual web data — quality filtering, deduplication, and domain classification at scale.

---
+The [Nemotron-CC curation pipeline](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) uses NeMo Curator end-to-end — from Common Crawl extraction through language ID, exact/fuzzy/substring deduplication, ensemble quality classification, and LLM-based synthetic data generation — to reproduce the [Nemotron-CC datasets](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2). The SDG stage is available as an [in-repo tutorial](tutorials/synthetic/nemotron_cc/).


It looks like the first link is broken.

Fixed in 58889bf — the Nemotron-CC recipe path was wrong (data_curation should be data/curation). Corrected both occurrences; verified 200.

sarahyurick · 2026-06-01T19:07:02Z

+| Metric | Workload | Hardware | Baseline | NeMo Curator | Source |
+|--------|----------|---------|----------|--------------|--------|
+| Fuzzy dedupe speedup | RedPajama v2 subset | 3× H100 80 GB nodes | CPU-based alternative | 10.7 h → 0.65 h (**~16×**) | [Throughput docs](https://docs.nvidia.com/nemo/curator/latest/about/concepts/scaling/throughput) |
+| Total cost of ownership | RedPajama v2 subset | 3× H100 80 GB nodes | CPU-based alternative | $315 → $190 (**~40% lower**) | [Throughput docs](https://docs.nvidia.com/nemo/curator/latest/about/concepts/scaling/throughput) |
+| GPU scaling (1→4 nodes) | RedPajama v2 subset | 1, 2, 4 × H100 80 GB nodes | Single-node run | 2.05 h → 1.01 h → 0.50 h | [Throughput docs](https://docs.nvidia.com/nemo/curator/latest/about/concepts/scaling/throughput) |


We probably don't need to list the same link 3 times here.

Fixed in 58889bf — dropped the Source column entirely. The section intro already links the throughput docs once, so the per-row repetition is gone.
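
The rounded headline figures in the quoted table can be reproduced from its raw numbers. The sketch below only redoes the arithmetic on the values shown in the diff; the function names are illustrative, and "scaling efficiency" (speedup divided by node count) is a derived quantity introduced here, not a metric the README reports.

```python
def speedup(baseline_hours: float, new_hours: float) -> float:
    """Ratio of baseline runtime to new runtime."""
    return baseline_hours / new_hours


def percent_reduction(baseline: float, new: float) -> float:
    """Relative reduction from baseline to new, in percent."""
    return 100.0 * (baseline - new) / baseline


# Fuzzy dedupe row: 10.7 h -> 0.65 h, reported as "~16x".
fuzzy_speedup = speedup(10.7, 0.65)

# Total cost of ownership row: $315 -> $190, reported as "~40% lower".
tco_reduction = percent_reduction(315, 190)

# GPU scaling row: node count -> hours.
scaling_hours = {1: 2.05, 2: 1.01, 4: 0.50}

# Speedup over the single-node run divided by node count;
# values near 1.0 indicate near-linear scaling.
scaling_efficiency = {
    nodes: speedup(scaling_hours[1], hours) / nodes
    for nodes, hours in scaling_hours.items()
}
```

With these inputs the speedup comes out a little above 16 and the cost reduction a little under 40%, consistent with the rounded claims in the table.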

sarahyurick · 2026-06-01T19:09:53Z

+|----------|------|
+| Installation guide (CPU, GPU, Docker, source) | [docs.nvidia.com/nemo/curator/latest/get-started/installation](https://docs.nvidia.com/nemo/curator/latest/get-started/installation) |
+| Container image | [nemo-curator on NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) |
+| Infrastructure (Slurm, Kubernetes, multi-node) | [Infrastructure docs](https://docs.nvidia.com/nemo/curator/latest/reference/infrastructure/index) |


Broken link.

Fixed in 58889bf — reference/infrastructure/index → reference/infra (canonical path per the live site's llms.txt index); verified 200.

sarahyurick · 2026-06-01T19:10:31Z

+| Installation guide (CPU, GPU, Docker, source) | [docs.nvidia.com/nemo/curator/latest/get-started/installation](https://docs.nvidia.com/nemo/curator/latest/get-started/installation) |
+| Container image | [nemo-curator on NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) |
+| Infrastructure (Slurm, Kubernetes, multi-node) | [Infrastructure docs](https://docs.nvidia.com/nemo/curator/latest/reference/infrastructure/index) |
+| API reference | [API docs](https://docs.nvidia.com/nemo/curator/latest/apidocs/index) |


Broken link.

Fixed in 58889bf — apidocs/index → api/reference/api-reference; verified 200.

sarahyurick · 2026-06-01T19:10:41Z

+| Container image | [nemo-curator on NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) |
+| Infrastructure (Slurm, Kubernetes, multi-node) | [Infrastructure docs](https://docs.nvidia.com/nemo/curator/latest/reference/infrastructure/index) |
+| API reference | [API docs](https://docs.nvidia.com/nemo/curator/latest/apidocs/index) |
+| Concepts | [Concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/index) |


Broken link.

Fixed in 58889bf — about/concepts/index → about/concepts; verified 200.

sarahyurick · 2026-06-01T19:14:24Z

+| Concepts | [Concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/index) |

-### Quality Improvements
+Supported versions are defined in [`pyproject.toml`](pyproject.toml) and exposed on the PyPI badge above; the README does not duplicate them to avoid drift.


I don't think the pyproject lists Curator versions right? Just dependencies?

Good catch — fixed in 58889bf. pyproject pins Python and dependency versions, not the Curator release version. Reworded to 'Supported Python and dependency versions are defined in pyproject.toml.'

sarahyurick · 2026-06-01T19:15:08Z

+## Roadmap

-**Results:** Progressive improvements in zero-shot downstream task performance through text cleaning, deduplication, and quality filtering stages.
+Release work is tracked through versioned [GitHub Milestones](https://github.com/NVIDIA-NeMo/Curator/milestones). For shipped changes, see the [release notes](https://docs.nvidia.com/nemo/curator/latest/about/release-notes).


Milestones looks empty to me.

Fixed in 58889bf — the milestones are indeed all empty (0 issues). Reframed the Roadmap section to point to the release notes for shipped work and Issues/Discussions for planned direction, and dropped the empty Milestones link.

README:
- Standardize Quick Start on uv: add one-time `curl` installer step and use `uv venv`/`uv pip` in both Path A and Path B (was plain pip in A, pip-installed uv in B).
- Fix broken doc links to canonical paths verified against the live site's llms.txt index:
  - infrastructure: reference/infrastructure/index -> reference/infra
  - API reference: apidocs/index -> api/reference/api-reference
  - concepts: about/concepts/index -> about/concepts
- Fix broken Nemotron-CC recipe link (data_curation -> data/curation), both occurrences.
- Remove the Source column from the benchmark table (the same throughput link was repeated 3x); the section intro already links the source once.
- Correct the executor claim: XennaExecutor (Cosmos-Xenna) is the production default with experimental Ray backends, per the API docs (resolves the unconfirmed "default for video" claim Greptile flagged).
- Clarify the pyproject note: it pins Python/dependency versions, not Curator release versions.
- Roadmap: point to release notes (shipped) + Issues/Discussions (planned); drop the empty Milestones link.

CONTRIBUTING:
- Remove the `help wanted` label reference; the repo has never used it (0 issues). Keep `good first issue` (33 issues).

Signed-off-by: Lawrence Lane <llane@nvidia.com>

copy-pr-bot · 2026-06-02T19:15:10Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

lbliii · 2026-06-02T19:16:05Z

Thanks @sarahyurick — all review comments addressed in 58889bf. Summary:

uv standardization: Quick Start now has a one-time curl installer and both paths use uv venv / uv pip.
Broken links fixed (all verified 200 against the live site's llms.txt index):
- reference/infrastructure/index → reference/infra
- apidocs/index → api/reference/api-reference
- about/concepts/index → about/concepts
- Nemotron-CC recipe: data_curation → data/curation (both occurrences)
Benchmark table: dropped the repeated Source column (intro links the source once).
pyproject note: reworded — it pins Python/dependency versions, not Curator release versions.
Roadmap: milestones are empty, so reframed to release notes (shipped) + Issues/Discussions (planned); dropped the empty Milestones link.
help wanted: removed from CONTRIBUTING (0 issues ever; kept good first issue with 33).
Executor claim (Greptile): corrected to "XennaExecutor (Cosmos-Xenna) is the production default, with experimental Ray backends," per the API docs.

I ran a link check across README, CONTRIBUTING, and CODE_OF_CONDUCT — all external links return 200 and all relative paths exist. Ready for another look.
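
The docs link fixes listed above amount to a small mapping from old paths to canonical ones. As a hedged sketch of how such a mapping could be applied to a Markdown file in one pass (the `fix_doc_links` helper and the plain string replacement are assumptions, not the method used in 58889bf):

```python
DOCS_BASE = "https://docs.nvidia.com/nemo/curator/latest/"

# Old path -> canonical path, as listed in the review summary.
PATH_FIXES = {
    "reference/infrastructure/index": "reference/infra",
    "apidocs/index": "api/reference/api-reference",
    "about/concepts/index": "about/concepts",
}


def fix_doc_links(text: str) -> str:
    """Replace known-broken docs URLs with their canonical equivalents."""
    for old, new in PATH_FIXES.items():
        text = text.replace(DOCS_BASE + old, DOCS_BASE + new)
    return text
```

Anchoring each replacement on the full docs base URL keeps unrelated paths, such as about/concepts/scaling/throughput, untouched.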

lbliii · 2026-06-03T15:28:42Z

/ok to test 8a09b67

sarahyurick · 2026-06-03T16:49:12Z

/ok to test 56e3d06

lbliii requested a review from a team as a code owner May 26, 2026 16:54

lbliii requested review from suiyoubi and removed request for a team May 26, 2026 16:54

copy-pr-bot Bot temporarily deployed to public May 26, 2026 16:54 Inactive

greptile-apps Bot reviewed May 26, 2026

lbliii requested a review from arhamm1 May 26, 2026 16:58

lbliii self-assigned this May 27, 2026

sarahyurick requested changes Jun 1, 2026

Merge branch 'main' into lbliii/update-readme-pm-review

4a50ef0

lbliii requested a review from sarahyurick June 2, 2026 19:16

sarahyurick approved these changes Jun 2, 2026

Merge branch 'main' into lbliii/update-readme-pm-review

8a09b67

lbliii enabled auto-merge (squash) June 3, 2026 15:16

sarahyurick added the docs-only label Jun 3, 2026

Merge branch 'main' into lbliii/update-readme-pm-review

56e3d06

lbliii merged commit 9bd9023 into NVIDIA-NeMo:main Jun 3, 2026
24 checks passed
