docs: rewrite README and expand CONTRIBUTING per PM review and OSS audit #2030
README: trust-first rewrite.
- Quick Start split into CPU smoke, GPU text, and Docker paths.
- Canonical install URLs (get-started/installation).
- Benchmark claims moved to a sourced table with caveats.
- Removed stale Python/Ray version mentions; pyproject is the single source of truth.
- Added Getting Help (channel boundaries, no SLA), Roadmap (Milestones + project board), explicit Apache-2.0 License section, and prominent local Contributing link.
- Image refs use absolute raw.githubusercontent URLs so they render on PyPI.
CONTRIBUTING: added Ways to Contribute (contribution types), Your First Contribution (good first issue + help wanted), Asking Questions / Discussing Changes (Discussions vs Issues), and Code of Conduct sections.
CODE_OF_CONDUCT.md: new file adopting Contributor Covenant 2.1 by reference. Reporting contact is a flagged placeholder pending OSPO confirmation.
Closes audit gaps:
- README must-haves: license-by-name, prominent contributing link, getting-help boundaries, roadmap link.
- CONTRIBUTING must-haves: contribution types, good-first-issue guidance, Code of Conduct link, questions/discussions channel.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Greptile Summary
This PR rewrites
Confidence Score: 4/5
Safe to merge once the executor default claim is resolved; all other changes are additive documentation improvements.
The README now asserts XennaExecutor (Cosmos-Xenna) is the production default as a settled fact, while the PR description simultaneously lists confirming the production-default executor as an open follow-up out of scope for this PR. A user who reads this README today and picks their deployment stack based on that statement will be acting on an unconfirmed claim. Everything else in the three files (the Quick Start paths, benchmark table, CONTRIBUTING expansion, and CODE_OF_CONDUCT) looks correct and clean.
README.md line 123 (executor default claim) and lines 92/138 (Nemotron-CC URL path change from data_curation/ to data/curation/; needs link verification against the Nemotron repo).
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[User opens README Quick Start] --> B{Choose path}
B --> C[Path A — CPU smoke test]
B --> D[Path B — GPU text pipeline]
B --> E[Path C — Docker]
C --> C1["uv venv && source .venv/bin/activate"]
C1 --> C2["uv pip install nemo-curator[text_cpu]"]
C2 --> C3["python -c 'import nemo_curator; print(version)'"]
D --> D1{Prerequisites met?}
D1 -->|CUDA 12, Linux x86_64, ~16 GB VRAM| D2["uv venv && source .venv/bin/activate"]
D1 -->|Missing| D_fail[Install CUDA toolkit / NVIDIA driver]
D2 --> D3["uv pip install nemo-curator[text_cuda12]"]
D3 --> D4["python tutorials/quickstart.py"]
E --> E1[Pull nemo-curator image from NGC]
E1 --> E2[Run container with codec libraries pre-configured]
E2 --> E3[Run video / audio pipelines]
C3 & D4 & E3 --> F[Full Installation Guide]
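The `print(version)` in the Path A node is shorthand. A minimal sketch of an equivalent smoke check, assuming the distribution is published under the name `nemo-curator` used in the install commands; the helper name `installed_version` is hypothetical and not part of the repo:

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "nemo-curator") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("nemo-curator is not installed in this environment")
    else:
        print(f"nemo-curator {version}")
```

Reading the version from package metadata avoids depending on whether the module exposes a `__version__` attribute.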
Reviews (6): Last reviewed commit: "Merge branch 'main' into lbliii/update-r..."
| All reports will be reviewed and investigated promptly and fairly. Maintainers are obligated to respect the privacy and security of the reporter of any incident. | ||
| ## Enforcement | ||
Unresolved TODO placeholder blocks actionable reporting
The reporting contact is still <TODO: confirm reporting contact — e.g., sw-coc@nvidia.com>. A CoC with no reachable enforcement channel is worse than none — reporters cannot escalate violations, and the file ships with raw angle-bracket template syntax visible to every contributor. The PR description flags OSPO sign-off as required; this should be a hard merge gate, not a follow-up.
Fixed in 06748dd — removed the <TODO> placeholder. Reports now route through GitHub-native channels (Report abuse + direct maintainer contact), which are reachable today without OSPO sign-off. A dedicated email alias can be swapped in later when confirmed; no longer a merge blocker.
| ### Path B — GPU text pipeline (CUDA 12, supported Linux) | ||
| Process and curate high-quality text datasets for large language model (LLM) training with multilingual support. | ||
| The bundled quickstart starts Ray, downloads a Hugging Face model, and runs a sentiment classification pipeline on GPU. | ||
| | Category | Features | Documentation | | ||
| |----------|----------|---------------| | ||
| | **Data Sources** | Common Crawl • Wikipedia • ArXiv • Custom datasets | [Load Data](https://docs.nvidia.com/nemo/curator/latest/curate-text/load-data/index.html) | | ||
| | **Quality Filtering** | 30+ heuristic filters • fastText classification • GPU-accelerated classifiers for domain, quality, safety, and content type | [Quality Assessment](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/heuristic.html) | | ||
| | **Deduplication** | Exact • Fuzzy (MinHash LSH) • Semantic (GPU-accelerated) | [Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/index.html) | | ||
| | **Processing** | Text cleaning • Language identification | [Content Processing](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/content-processing/text-cleaning.html) | | ||
| **Prerequisites:** CUDA 12 toolkit, NVIDIA driver supporting CUDA 12, Linux x86_64, ~16 GB GPU memory, network access to Hugging Face. | ||
| --- | ||
| ```bash | ||
| python -m venv .venv && source .venv/bin/activate | ||
| uv pip install "nemo-curator[text_cuda12]" | ||
| python tutorials/quickstart.py | ||
| ``` |
Path B requires `uv` but lists no prerequisite for it
Path A uses plain `pip`, while Path B switches to `uv pip install` without telling the reader they need `uv` installed first. A first-time user who follows Path B after reading Path A will get `command not found: uv` and have no hint about how to proceed. Either add a short `uv` install note, or use plain `pip` here for consistency with Path A.
| ### Path B — GPU text pipeline (CUDA 12, supported Linux) | |
| Process and curate high-quality text datasets for large language model (LLM) training with multilingual support. | |
| The bundled quickstart starts Ray, downloads a Hugging Face model, and runs a sentiment classification pipeline on GPU. | |
| | Category | Features | Documentation | | |
| |----------|----------|---------------| | |
| | **Data Sources** | Common Crawl • Wikipedia • ArXiv • Custom datasets | [Load Data](https://docs.nvidia.com/nemo/curator/latest/curate-text/load-data/index.html) | | |
| | **Quality Filtering** | 30+ heuristic filters • fastText classification • GPU-accelerated classifiers for domain, quality, safety, and content type | [Quality Assessment](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/heuristic.html) | | |
| | **Deduplication** | Exact • Fuzzy (MinHash LSH) • Semantic (GPU-accelerated) | [Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/index.html) | | |
| | **Processing** | Text cleaning • Language identification | [Content Processing](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/content-processing/text-cleaning.html) | | |
| **Prerequisites:** CUDA 12 toolkit, NVIDIA driver supporting CUDA 12, Linux x86_64, ~16 GB GPU memory, network access to Hugging Face. | |
| --- | |
| ```bash | |
| python -m venv .venv && source .venv/bin/activate | |
| uv pip install "nemo-curator[text_cuda12]" | |
| python tutorials/quickstart.py | |
| ``` | |
| ### Path B — GPU text pipeline (CUDA 12, supported Linux) | |
| The bundled quickstart starts Ray, downloads a Hugging Face model, and runs a sentiment classification pipeline on GPU. | |
| **Prerequisites:** CUDA 12 toolkit, NVIDIA driver supporting CUDA 12, Linux x86_64, ~16 GB GPU memory, network access to Hugging Face. | |
| > **Note:** Path B uses [`uv`](https://github.com/astral-sh/uv) for faster dependency resolution. Install it first with `pip install uv` or follow the [uv install guide](https://docs.astral.sh/uv/getting-started/installation/). | |
| ```bash | |
| python -m venv .venv && source .venv/bin/activate | |
| uv pip install "nemo-curator[text_cuda12]" | |
| python tutorials/quickstart.py |
Fixed in 06748dd — added an inline prereq note pointing readers to pip install uv and the uv install guide. Kept uv because it's what the path optimizes for; the note prevents the command not found cliff.
| ## Roadmap | ||
| **Results:** Progressive improvements in zero-shot downstream task performance through text cleaning, deduplication, and quality filtering stages. | ||
| Release work is tracked through versioned [GitHub Milestones](https://github.com/NVIDIA-NeMo/Curator/milestones). Larger initiatives live on the [project board](https://github.com/orgs/NVIDIA-NeMo/projects). For shipped changes, see the [release notes](https://docs.nvidia.com/nemo/curator/latest/about/release-notes). |
Roadmap links to all org projects instead of the Curator-specific board
https://github.com/orgs/NVIDIA-NeMo/projects lands on every project in the NVIDIA-NeMo org, not the Curator project board. Readers won't know which board to look at. Link directly to the Curator project (e.g. /projects/1) or remove the board link until the specific URL is confirmed.
| Release work is tracked through versioned [GitHub Milestones](https://github.com/NVIDIA-NeMo/Curator/milestones). Larger initiatives live on the [project board](https://github.com/orgs/NVIDIA-NeMo/projects). For shipped changes, see the [release notes](https://docs.nvidia.com/nemo/curator/latest/about/release-notes). | |
| Release work is tracked through versioned [GitHub Milestones](https://github.com/NVIDIA-NeMo/Curator/milestones). For shipped changes, see the [release notes](https://docs.nvidia.com/nemo/curator/latest/about/release-notes). |
Fixed in 06748dd — dropped the org-level project board link. Couldn't confirm a Curator-specific board URL (token lacks read:project scope), so following the suggestion to remove it until the specific URL is known. Milestones + release notes still link.
- CODE_OF_CONDUCT.md: drop <TODO> placeholder reporting contact. Route reports through GitHub-native channels (Report abuse + direct maintainer contact) that are reachable today, with a note that a dedicated alias will be added once OSPO confirms one.
- README.md: add prerequisite note for `uv` to Path B Quick Start so first-time readers don't hit `command not found: uv`.
- README.md: drop org-level project board link from the Roadmap section (it lands on all NVIDIA-NeMo projects, not Curator); Milestones + release notes remain.
Signed-off-by: Lawrence Lane <llane@nvidia.com>
| New to the project? Start here: | ||
| 1. Browse issues labeled [`good first issue`](https://github.com/NVIDIA-NeMo/Curator/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) — these are scoped to be approachable without deep familiarity with the codebase. | ||
| 2. For slightly larger work, look at [`help wanted`](https://github.com/NVIDIA-NeMo/Curator/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22). |
I don't think we use this tag.
Fixed in 58889bf — removed the help wanted reference. Confirmed the repo has 0 issues ever under that label (vs 33 under good first issue), so it's just good first issue now.
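For reference, the label query URL kept in CONTRIBUTING can be rebuilt from the label name, which makes it easy to sanity-check a label before linking it. A small sketch using only the standard library; `label_query_url` is a hypothetical helper, not something in the repo:

```python
from urllib.parse import urlencode

REPO_ISSUES = "https://github.com/NVIDIA-NeMo/Curator/issues"


def label_query_url(label: str) -> str:
    """Build the GitHub issue-search URL for open issues carrying `label`."""
    query = f'is:issue is:open label:"{label}"'
    # urlencode percent-encodes ':' and '"' and turns spaces into '+'.
    return f"{REPO_ISSUES}?{urlencode({'q': query})}"


if __name__ == "__main__":
    print(label_query_url("good first issue"))
```

Opening the generated URL and seeing zero results is a cheap signal that a label is unused, which is the situation the `help wanted` link was in.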
| ## Features by Modality | ||
| ```bash | ||
| python -m venv .venv && source .venv/bin/activate | ||
| pip install "nemo-curator[text_cpu]" |
We should encourage uv instead.
Fixed in 58889bf — Quick Start now standardizes on uv for both Path A and Path B, with a one-time install step up top.
| | **Quality Filtering** | 30+ heuristic filters • fastText classification • GPU-accelerated classifiers for domain, quality, safety, and content type | [Quality Assessment](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/heuristic.html) | | ||
| | **Deduplication** | Exact • Fuzzy (MinHash LSH) • Semantic (GPU-accelerated) | [Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/index.html) | | ||
| | **Processing** | Text cleaning • Language identification | [Content Processing](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/content-processing/text-cleaning.html) | | ||
| **Prerequisites:** CUDA 12 toolkit, NVIDIA driver supporting CUDA 12, Linux x86_64, ~16 GB GPU memory, network access to Hugging Face. This path uses [`uv`](https://docs.astral.sh/uv/) for faster dependency resolution — install it first with `pip install uv` (or follow the [uv install guide](https://docs.astral.sh/uv/getting-started/installation/)). |
Install uv with curl right?
Right, fixed in 58889bf. Added a one-time `curl -LsSf https://astral.sh/uv/install.sh | sh` step at the top of Quick Start, and both paths now use `uv venv` / `uv pip`.
| NeMo Curator powers the data pipelines behind [NVIDIA Nemotron](https://developer.nvidia.com/nemotron) models. The [Nemotron-4 pre-training dataset](https://arxiv.org/abs/2402.16819) was curated using NeMo Curator's text pipeline across 8+ trillion tokens of multilingual web data — quality filtering, deduplication, and domain classification at scale. | ||
| --- | ||
| The [Nemotron-CC curation pipeline](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) uses NeMo Curator end-to-end — from Common Crawl extraction through language ID, exact/fuzzy/substring deduplication, ensemble quality classification, and LLM-based synthetic data generation — to reproduce the [Nemotron-CC datasets](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2). The SDG stage is available as an [in-repo tutorial](tutorials/synthetic/nemotron_cc/). |
It looks like the first link is broken.
Fixed in 58889bf — the Nemotron-CC recipe path was wrong (data_curation should be data/curation). Corrected both occurrences; verified 200.
| | Metric | Workload | Hardware | Baseline | NeMo Curator | Source | | ||
| |--------|----------|---------|----------|--------------|--------| | ||
| | Fuzzy dedupe speedup | RedPajama v2 subset | 3× H100 80 GB nodes | CPU-based alternative | 10.7 h → 0.65 h (**~16×**) | [Throughput docs](https://docs.nvidia.com/nemo/curator/latest/about/concepts/scaling/throughput) | | ||
| | Total cost of ownership | RedPajama v2 subset | 3× H100 80 GB nodes | CPU-based alternative | $315 → $190 (**~40% lower**) | [Throughput docs](https://docs.nvidia.com/nemo/curator/latest/about/concepts/scaling/throughput) | | ||
| | GPU scaling (1→4 nodes) | RedPajama v2 subset | 1, 2, 4 × H100 80 GB nodes | Single-node run | 2.05 h → 1.01 h → 0.50 h | [Throughput docs](https://docs.nvidia.com/nemo/curator/latest/about/concepts/scaling/throughput) | |
We probably don't need to list the same link 3 times here.
Fixed in 58889bf — dropped the Source column entirely. The section intro already links the throughput docs once, so the per-row repetition is gone.
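As a sanity check on the rounded figures in the benchmark table, the speedup, cost reduction, and scaling ratios can be recomputed from the raw numbers quoted in the rows. This is only arithmetic on the values shown in the diff; the function names are hypothetical:

```python
def speedup(baseline_hours: float, new_hours: float) -> float:
    """How many times faster the new run is than the baseline."""
    return baseline_hours / new_hours


def percent_reduction(baseline: float, new: float) -> float:
    """Relative reduction from baseline to new, in percent."""
    return (baseline - new) / baseline * 100.0


if __name__ == "__main__":
    # Fuzzy dedupe: 10.7 h on the CPU baseline vs 0.65 h with NeMo Curator.
    print(f"fuzzy dedupe speedup: {speedup(10.7, 0.65):.1f}x")
    # Total cost of ownership: $315 vs $190.
    print(f"cost reduction: {percent_reduction(315, 190):.1f}%")
    # GPU scaling: 1, 2, and 4 nodes at 2.05 h, 1.01 h, and 0.50 h.
    for nodes, hours in [(2, 1.01), (4, 0.50)]:
        print(f"{nodes} nodes: {speedup(2.05, hours):.2f}x over one node")
```

The results land at roughly 16.5x and 39.7%, consistent with the "~16×" and "~40% lower" wording, and the node-scaling ratios come out close to linear.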
| |----------|------| | ||
| | Installation guide (CPU, GPU, Docker, source) | [docs.nvidia.com/nemo/curator/latest/get-started/installation](https://docs.nvidia.com/nemo/curator/latest/get-started/installation) | | ||
| | Container image | [nemo-curator on NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) | | ||
| | Infrastructure (Slurm, Kubernetes, multi-node) | [Infrastructure docs](https://docs.nvidia.com/nemo/curator/latest/reference/infrastructure/index) | |
Fixed in 58889bf — reference/infrastructure/index → reference/infra (canonical path per the live site's llms.txt index); verified 200.
| | Installation guide (CPU, GPU, Docker, source) | [docs.nvidia.com/nemo/curator/latest/get-started/installation](https://docs.nvidia.com/nemo/curator/latest/get-started/installation) | | ||
| | Container image | [nemo-curator on NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) | | ||
| | Infrastructure (Slurm, Kubernetes, multi-node) | [Infrastructure docs](https://docs.nvidia.com/nemo/curator/latest/reference/infrastructure/index) | | ||
| | API reference | [API docs](https://docs.nvidia.com/nemo/curator/latest/apidocs/index) | |
Fixed in 58889bf — apidocs/index → api/reference/api-reference; verified 200.
| | Container image | [nemo-curator on NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) | | ||
| | Infrastructure (Slurm, Kubernetes, multi-node) | [Infrastructure docs](https://docs.nvidia.com/nemo/curator/latest/reference/infrastructure/index) | | ||
| | API reference | [API docs](https://docs.nvidia.com/nemo/curator/latest/apidocs/index) | | ||
| | Concepts | [Concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/index) | |
Fixed in 58889bf — about/concepts/index → about/concepts; verified 200.
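The three doc-path corrections in this thread follow one pattern, so they can be applied mechanically. A minimal sketch, assuming the old and new paths are exactly the ones quoted in the replies; `PATH_FIXES` and `rewrite_doc_links` are hypothetical names, not part of the repo:

```python
import re

DOCS_BASE = "https://docs.nvidia.com/nemo/curator/latest/"

# Old path -> canonical path, as stated in the review replies.
PATH_FIXES = {
    "reference/infrastructure/index": "reference/infra",
    "apidocs/index": "api/reference/api-reference",
    "about/concepts/index": "about/concepts",
}


def rewrite_doc_links(text: str) -> str:
    """Replace stale doc URLs (with or without a .html suffix) in `text`."""
    for old, new in PATH_FIXES.items():
        # Match the full URL only when it ends at whitespace, a closing
        # bracket or quote, or end of input, so longer paths are untouched.
        pattern = re.escape(DOCS_BASE + old) + r"(?:\.html)?(?=[\s)\"'>]|$)"
        text = re.sub(pattern, lambda _m, new=new: DOCS_BASE + new, text)
    return text
```

A rewrite like this only fixes the text; each resulting URL still needs a live check before claiming it returns 200.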
| | Concepts | [Concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/index) | | ||
| ### Quality Improvements | ||
| Supported versions are defined in [`pyproject.toml`](pyproject.toml) and exposed on the PyPI badge above; the README does not duplicate them to avoid drift. |
I don't think the pyproject lists Curator versions right? Just dependencies?
Good catch — fixed in 58889bf. pyproject pins Python and dependency versions, not the Curator release version. Reworded to 'Supported Python and dependency versions are defined in pyproject.toml.'
| ## Roadmap | ||
| **Results:** Progressive improvements in zero-shot downstream task performance through text cleaning, deduplication, and quality filtering stages. | ||
| Release work is tracked through versioned [GitHub Milestones](https://github.com/NVIDIA-NeMo/Curator/milestones). For shipped changes, see the [release notes](https://docs.nvidia.com/nemo/curator/latest/about/release-notes). |
Milestones looks empty to me.
Fixed in 58889bf — the milestones are indeed all empty (0 issues). Reframed the Roadmap section to point to the release notes for shipped work and Issues/Discussions for planned direction, and dropped the empty Milestones link.
README:
- Standardize Quick Start on uv: add one-time `curl` installer step and use `uv venv`/`uv pip` in both Path A and Path B (was plain pip in A, pip-installed uv in B).
- Fix broken doc links to canonical paths verified against the live site's llms.txt index:
  - infrastructure: reference/infrastructure/index -> reference/infra
  - API reference: apidocs/index -> api/reference/api-reference
  - concepts: about/concepts/index -> about/concepts
- Fix broken Nemotron-CC recipe link (data_curation -> data/curation), both occurrences.
- Remove the Source column from the benchmark table (the same throughput link was repeated 3x); the section intro already links the source once.
- Correct the executor claim: XennaExecutor (Cosmos-Xenna) is the production default with experimental Ray backends, per the API docs (resolves the unconfirmed "default for video" claim Greptile flagged).
- Clarify the pyproject note: it pins Python/dependency versions, not Curator release versions.
- Roadmap: point to release notes (shipped) + Issues/Discussions (planned); drop the empty Milestones link.
CONTRIBUTING:
- Remove the `help wanted` label reference; the repo has never used it (0 issues). Keep `good first issue` (33 issues).
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Thanks @sarahyurick — all review comments addressed in 58889bf. Summary:
I ran a link check across README, CONTRIBUTING, and CODE_OF_CONDUCT: all external links return 200 and all relative paths exist. Ready for another look.
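For anyone who wants to repeat the check, a minimal sketch of one way to do it with the standard library. This is not the tool that was actually run; the regex only covers inline `[text](target)` links, and all function names are hypothetical:

```python
import re
from pathlib import Path
from typing import List, Tuple
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Inline Markdown links: [text](target). Reference-style links are not handled.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def extract_links(markdown: str) -> List[str]:
    """Return every inline link target found in the Markdown text."""
    return LINK_RE.findall(markdown)


def split_links(links: List[str]) -> Tuple[List[str], List[str]]:
    """Separate http(s) URLs from repo-relative paths; skip anchors and mailto."""
    external = [l for l in links if l.startswith(("http://", "https://"))]
    relative = [
        l for l in links
        if not l.startswith(("http://", "https://", "#", "mailto:"))
    ]
    return external, relative


def missing_relative(links: List[str], root: Path) -> List[str]:
    """Return the relative targets that do not exist under `root`."""
    return [l for l in links if not (root / l.split("#", 1)[0]).exists()]


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for a HEAD request (some servers may reject HEAD)."""
    request = Request(url, method="HEAD", headers={"User-Agent": "link-check-sketch"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code
```

Run from the repo root, this would mean calling `extract_links` on each of the three files, passing the relative targets to `missing_relative(..., Path("."))`, and flagging any external URL whose `http_status` is not 200.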
/ok to test 8a09b67
/ok to test 56e3d06
Summary
Addresses the PM's README review and the OSS Community-Friendly audit. Touches three files at the repo root.
README.md — trust-first rewrite
- Quick Start split into CPU smoke (`text_cpu`, no GPU), GPU text (`text_cuda12`, with prereqs), and Docker (recommended for video/audio).
- Replaced `latest/admin/installation.html` with `latest/get-started/installation` everywhere; dropped `.html` suffixes.
- Removed stale Python/Ray version mentions; `pyproject.toml` (`>=3.11,<3.14`) and the PyPI badge are the single source of truth.
- Image refs use absolute `raw.githubusercontent.com` URLs so they render on PyPI.
Audit must-haves now satisfied
README (was 5/8, now 8/8):
- License by name (`## License` section).
- Prominent local `CONTRIBUTING.md` link (no longer routes to `NVIDIA/NeMo`).
- `## Getting Help` with channel-boundary table and explicit best-effort/no-SLA wording.
- `## Roadmap` linking GitHub Milestones, project board, and release notes.
CONTRIBUTING (was 4/8, now 8/8):
- `## Ways to Contribute` defines 7 contribution types.
- `## Your First Contribution` links `good first issue` / `help wanted` queries and gives starter areas.
- `## Asking Questions and Discussing Changes` maps needs to Discussions vs Issues.
- `## Code of Conduct` section linking the new `CODE_OF_CONDUCT.md`.
CODE_OF_CONDUCT.md (new)
Adopts Contributor Covenant 2.1 by reference (canonical URL) rather than reproducing the full text inline. Covers Pledge, Standard, Scope, Reporting, Enforcement, and Attribution.
`CODE_OF_CONDUCT.md` reports to a `<TODO: confirm reporting contact — e.g., sw-coc@nvidia.com>` placeholder. OSPO/legal needs to confirm the real address before this lands; the audit item is structurally satisfied but the channel isn't real yet.
Open follow-ups (PM PoA P4, out of scope for this PR)
Test plan
- twine check) and confirm images load and link targets resolve.
- `good first issue` and `help wanted` label query URLs return results.
- `CODE_OF_CONDUCT.md`.
🤖 Generated with Claude Code