docs: rewrite README and expand CONTRIBUTING per PM review and OSS audit #2030

Merged
lbliii merged 7 commits into NVIDIA-NeMo:main from lbliii:lbliii/update-readme-pm-review
Jun 3, 2026

Conversation

@lbliii lbliii commented May 26, 2026

Summary

Addresses the PM's README review and the OSS Community-Friendly audit. Touches three files at the repo root.

README.md — trust-first rewrite

  • Quick Start split into three paths: CPU smoke test (text_cpu, no GPU), GPU text (text_cuda12, with prereqs), and Docker (recommended for video/audio).
  • Canonical install URLs — replaced latest/admin/installation.html with latest/get-started/installation everywhere; dropped .html suffixes.
  • Benchmark claims moved into a sourced table (workload, hardware, baseline, result, source link to the throughput docs) with a caveat about the per-panel token-count discrepancy.
  • Version drift fixed — removed the stale "Ray 2.54" mention; removed Python-version copy so pyproject.toml (>=3.11,<3.14) and the PyPI badge are the single source of truth.
  • Product story reframed — hero now leads with who + job + outcome; added "Use NeMo Curator when…" bullets; Nemotron proof pulled up.
  • Structure collapsed to 8 sections; dropped the duplicated per-modality feature catalog (lives in docs).
  • Image refs switched to absolute raw.githubusercontent.com URLs so they render on PyPI.
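
The image-ref change in the last bullet can be sketched as a small rewrite step. This is a minimal illustration, not the script used in the PR: the function name, the default `main` ref, and the sample image path are assumptions.

```python
import re

# Markdown image syntax: ![alt](target). Sketch only: assumes no spaces or
# nested parentheses inside the target.
IMAGE_RE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def absolutize_images(markdown: str, owner: str, repo: str, ref: str = "main") -> str:
    """Rewrite relative image targets to absolute raw.githubusercontent.com URLs."""
    base = f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/"

    def fix(match: re.Match) -> str:
        alt, target = match.group(1), match.group(2)
        if target.startswith(("http://", "https://")):
            return match.group(0)  # already absolute, leave untouched
        path = target.removeprefix("./").lstrip("/")
        return f"![{alt}]({base}{path})"

    return IMAGE_RE.sub(fix, markdown)


# Hypothetical image path, for illustration only.
print(absolutize_images("![diagram](./docs/example.png)", "NVIDIA-NeMo", "Curator"))
```

PyPI renders the long description outside the repository, so relative paths have nothing to resolve against there; absolute URLs work in both places.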

Audit must-haves now satisfied

README (was 5/8, now 8/8):

  • Apache-2.0 named in text (new ## License section).
  • Prominent local CONTRIBUTING.md link (no longer routes to NVIDIA/NeMo).
  • ## Getting Help with channel-boundary table and explicit best-effort/no-SLA wording.
  • ## Roadmap linking GitHub Milestones, project board, and release notes.

CONTRIBUTING (was 4/8, now 8/8):

  • ## Ways to Contribute defines 7 contribution types.
  • ## Your First Contribution links good first issue / help wanted queries and gives starter areas.
  • ## Asking Questions and Discussing Changes maps needs to Discussions vs Issues.
  • ## Code of Conduct section linking the new CODE_OF_CONDUCT.md.

CODE_OF_CONDUCT.md (new)

Adopts Contributor Covenant 2.1 by reference (canonical URL) rather than reproducing the full text inline. Covers Pledge, Standard, Scope, Reporting, Enforcement, and Attribution.

⚠️ Action needed before merge

CODE_OF_CONDUCT.md reports to a <TODO: confirm reporting contact — e.g., sw-coc@nvidia.com> placeholder. OSPO/legal needs to confirm the real address before this lands; the audit item is structurally satisfied but the channel isn't real yet.

Open follow-ups (PM PoA P4 — out of scope for this PR)

  • Confirm production-default executor (Cosmos-Xenna vs Ray Actor Pool vs Ray Data).
  • GA vs tutorial-level support per modality.
  • "Latest" version naming convention.

Test plan

  • Render README.md on GitHub and confirm all images load.
  • Check how README.md will render on PyPI (e.g. run twine check on the built distribution, which validates the long description) and confirm images load and link targets resolve.
  • Click every link in README.md, CONTRIBUTING.md, and CODE_OF_CONDUCT.md; confirm no 404s.
  • Confirm good first issue and help wanted label query URLs return results.
  • OSPO sign-off on the reporting contact placeholder in CODE_OF_CONDUCT.md.
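
The link-clicking step in the plan above could be automated along these lines. This is a hedged sketch, not the check that was actually run: `extract_links`, `http_status`, and `check_links` are hypothetical helpers, and the HEAD request is only a first pass because some servers reject HEAD.

```python
import re
from pathlib import Path
from typing import Callable
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Matches [text](target) and, as a side effect, the link part of ![alt](target).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def extract_links(markdown: str) -> list[str]:
    """Return link targets in document order, without duplicates."""
    seen: dict[str, None] = {}
    for target in LINK_RE.findall(markdown):
        seen.setdefault(target, None)
    return list(seen)


def http_status(url: str, timeout: float = 10.0) -> int:
    """Issue a HEAD request and return the HTTP status code.

    Network-level failures (URLError, timeouts) are not caught here.
    """
    request = Request(url, method="HEAD", headers={"User-Agent": "link-check-sketch"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code


def check_links(
    markdown: str,
    root: Path,
    status_of: Callable[[str], int] = http_status,
) -> dict[str, bool]:
    """Map each link to True (resolves) or False (broken)."""
    results: dict[str, bool] = {}
    for target in extract_links(markdown):
        if target.startswith(("http://", "https://")):
            results[target] = status_of(target) == 200
        elif target.startswith(("#", "mailto:")):
            continue  # in-page anchors and mail links are out of scope here
        else:
            # Relative link: drop any #fragment and test the path on disk.
            results[target] = (root / target.split("#", 1)[0]).exists()
    return results
```

Passing `status_of` as a parameter keeps the checker testable offline; for a real run, something like `check_links(Path("README.md").read_text(), Path("."))` from the repo root would use live requests.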

🤖 Generated with Claude Code

README: trust-first rewrite. Quick Start split into CPU smoke, GPU text,
and Docker paths. Canonical install URLs (get-started/installation).
Benchmark claims moved to a sourced table with caveats. Removed stale
Python/Ray version mentions; pyproject is the single source of truth.
Added Getting Help (channel boundaries, no SLA), Roadmap (Milestones +
project board), explicit Apache-2.0 License section, and prominent local
Contributing link. Image refs use absolute raw.githubusercontent URLs so
they render on PyPI.

CONTRIBUTING: added Ways to Contribute (contribution types), Your First
Contribution (good first issue + help wanted), Asking Questions /
Discussing Changes (Discussions vs Issues), and Code of Conduct sections.

CODE_OF_CONDUCT.md: new file adopting Contributor Covenant 2.1 by
reference. Reporting contact is a flagged placeholder pending OSPO
confirmation.

Closes audit gaps:
- README must-haves: license-by-name, prominent contributing link,
  getting-help boundaries, roadmap link.
- CONTRIBUTING must-haves: contribution types, good-first-issue
  guidance, Code of Conduct link, questions/discussions channel.

Signed-off-by: Lawrence Lane <llane@nvidia.com>
@lbliii lbliii requested a review from a team as a code owner May 26, 2026 16:54
@lbliii lbliii requested review from suiyoubi and removed request for a team May 26, 2026 16:54

greptile-apps Bot commented May 26, 2026

Greptile Summary

This PR rewrites README.md, expands CONTRIBUTING.md, and adds a new CODE_OF_CONDUCT.md to satisfy a PM review and OSS community-friendly audit. The changes are documentation-only with no code impact.

  • README.md is restructured around a three-path Quick Start (CPU/GPU/Docker), a sourced benchmark table, a new "How It Works" architecture section, and new ## Roadmap, ## Getting Help, and ## License sections; image refs switched to absolute raw.githubusercontent.com URLs for PyPI compatibility.
  • CONTRIBUTING.md gains four new sections (Ways to Contribute, Your First Contribution, Asking Questions, Code of Conduct) while preserving the existing technical setup and PR guidelines.
  • CODE_OF_CONDUCT.md (new) adopts Contributor Covenant 2.1 by reference and routes violation reports through GitHub-native channels pending OSPO confirmation of a dedicated email alias.

Confidence Score: 4/5

Safe to merge once the executor default claim is resolved; all other changes are additive documentation improvements.

The README now asserts XennaExecutor (Cosmos-Xenna) is the production default as a settled fact, while the PR description simultaneously lists confirming the production-default executor as an open follow-up out of scope for this PR. A user who reads this README today and picks their deployment stack based on that statement will be acting on an unconfirmed claim. Everything else in the three files — the Quick Start paths, benchmark table, CONTRIBUTING expansion, and CODE_OF_CONDUCT — looks correct and clean.

README.md line 123 (executor default claim) and lines 92/138 (Nemotron-CC URL path change from data_curation/ to data/curation/ — needs link verification against the Nemotron repo).

Important Files Changed

| Filename | Overview |
|----------|----------|
| README.md | Major rewrite: new hero copy, three-path Quick Start, sourced benchmark table, condensed architecture section, added Roadmap/Getting Help/License sections. One confirmed content-accuracy issue (executor default claim vs. open follow-up) and one unverified URL path change for Nemotron-CC links. |
| CONTRIBUTING.md | Significantly expanded: added Ways to Contribute, Your First Contribution, Asking Questions, and Code of Conduct sections; section headings capitalised; existing technical content (style, setup, tests, PR guidelines) unchanged. |
| CODE_OF_CONDUCT.md | New file adopting Contributor Covenant 2.1 by reference; reporting placeholder resolved via GitHub-native channels with a note that a dedicated email will follow OSPO confirmation. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[User opens README Quick Start] --> B{Choose path}
    B --> C[Path A — CPU smoke test]
    B --> D[Path B — GPU text pipeline]
    B --> E[Path C — Docker]

    C --> C1["uv venv && source .venv/bin/activate"]
    C1 --> C2["uv pip install nemo-curator[text_cpu]"]
    C2 --> C3["python -c 'import nemo_curator; print(version)'"]

    D --> D1{Prerequisites met?}
    D1 -->|CUDA 12, Linux x86_64, ~16 GB VRAM| D2["uv venv && source .venv/bin/activate"]
    D1 -->|Missing| D_fail[Install CUDA toolkit / NVIDIA driver]
    D2 --> D3["uv pip install nemo-curator[text_cuda12]"]
    D3 --> D4["python tutorials/quickstart.py"]

    E --> E1[Pull nemo-curator image from NGC]
    E1 --> E2[Run container with codec libraries pre-configured]
    E2 --> E3[Run video / audio pipelines]

    C3 & D4 & E3 --> F[Full Installation Guide]

Reviews (6): Last reviewed commit: "Merge branch 'main' into lbliii/update-r..."

Comment thread CODE_OF_CONDUCT.md
All reports will be reviewed and investigated promptly and fairly. Maintainers are obligated to respect the privacy and security of the reporter of any incident.

## Enforcement

Contributor

P1 Unresolved TODO placeholder blocks actionable reporting

The reporting contact is still <TODO: confirm reporting contact — e.g., sw-coc@nvidia.com>. A CoC with no reachable enforcement channel is worse than none — reporters cannot escalate violations, and the file ships with raw angle-bracket template syntax visible to every contributor. The PR description flags OSPO sign-off as required; this should be a hard merge gate, not a follow-up.

Contributor Author

Fixed in 06748dd — removed the <TODO> placeholder. Reports now route through GitHub-native channels (Report abuse + direct maintainer contact), which are reachable today without OSPO sign-off. A dedicated email alias can be swapped in later when confirmed; no longer a merge blocker.

Comment thread README.md
Comment on lines +57 to +67
### Path B — GPU text pipeline (CUDA 12, supported Linux)

Process and curate high-quality text datasets for large language model (LLM) training with multilingual support.
The bundled quickstart starts Ray, downloads a Hugging Face model, and runs a sentiment classification pipeline on GPU.

| Category | Features | Documentation |
|----------|----------|---------------|
| **Data Sources** | Common Crawl • Wikipedia • ArXiv • Custom datasets | [Load Data](https://docs.nvidia.com/nemo/curator/latest/curate-text/load-data/index.html) |
| **Quality Filtering** | 30+ heuristic filters • fastText classification • GPU-accelerated classifiers for domain, quality, safety, and content type | [Quality Assessment](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/heuristic.html) |
| **Deduplication** | Exact • Fuzzy (MinHash LSH) • Semantic (GPU-accelerated) | [Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/index.html) |
| **Processing** | Text cleaning • Language identification | [Content Processing](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/content-processing/text-cleaning.html) |
**Prerequisites:** CUDA 12 toolkit, NVIDIA driver supporting CUDA 12, Linux x86_64, ~16 GB GPU memory, network access to Hugging Face.

---
```bash
python -m venv .venv && source .venv/bin/activate
uv pip install "nemo-curator[text_cuda12]"
python tutorials/quickstart.py
```
Contributor

P1 Path B requires uv but lists no prerequisite for it

Path A uses plain pip, while Path B switches to uv pip install without telling the reader they need uv installed first. A first-time user who follows Path B after reading Path A will get command not found: uv and have no hint about how to proceed. Either add a short uv install note, or use plain pip here for consistency with Path A.

Suggested change
### Path B — GPU text pipeline (CUDA 12, supported Linux)
Process and curate high-quality text datasets for large language model (LLM) training with multilingual support.
The bundled quickstart starts Ray, downloads a Hugging Face model, and runs a sentiment classification pipeline on GPU.
| Category | Features | Documentation |
|----------|----------|---------------|
| **Data Sources** | Common Crawl • Wikipedia • ArXiv • Custom datasets | [Load Data](https://docs.nvidia.com/nemo/curator/latest/curate-text/load-data/index.html) |
| **Quality Filtering** | 30+ heuristic filters • fastText classification • GPU-accelerated classifiers for domain, quality, safety, and content type | [Quality Assessment](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/heuristic.html) |
| **Deduplication** | Exact • Fuzzy (MinHash LSH) • Semantic (GPU-accelerated) | [Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/index.html) |
| **Processing** | Text cleaning • Language identification | [Content Processing](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/content-processing/text-cleaning.html) |
**Prerequisites:** CUDA 12 toolkit, NVIDIA driver supporting CUDA 12, Linux x86_64, ~16 GB GPU memory, network access to Hugging Face.
---
```bash
python -m venv .venv && source .venv/bin/activate
uv pip install "nemo-curator[text_cuda12]"
python tutorials/quickstart.py
```
### Path B — GPU text pipeline (CUDA 12, supported Linux)
The bundled quickstart starts Ray, downloads a Hugging Face model, and runs a sentiment classification pipeline on GPU.
**Prerequisites:** CUDA 12 toolkit, NVIDIA driver supporting CUDA 12, Linux x86_64, ~16 GB GPU memory, network access to Hugging Face.
> **Note:** Path B uses [`uv`](https://github.com/astral-sh/uv) for faster dependency resolution. Install it first with `pip install uv` or follow the [uv install guide](https://docs.astral.sh/uv/getting-started/installation/).
```bash
python -m venv .venv && source .venv/bin/activate
uv pip install "nemo-curator[text_cuda12]"
python tutorials/quickstart.py
```

Contributor Author

Fixed in 06748dd — added an inline prereq note pointing readers to pip install uv and the uv install guide. Kept uv because it's what the path optimizes for; the note prevents the command not found cliff.

Comment thread README.md Outdated
## Roadmap

**Results:** Progressive improvements in zero-shot downstream task performance through text cleaning, deduplication, and quality filtering stages.
Release work is tracked through versioned [GitHub Milestones](https://github.com/NVIDIA-NeMo/Curator/milestones). Larger initiatives live on the [project board](https://github.com/orgs/NVIDIA-NeMo/projects). For shipped changes, see the [release notes](https://docs.nvidia.com/nemo/curator/latest/about/release-notes).
Contributor

P2 Roadmap links to all org projects instead of the Curator-specific board

https://github.com/orgs/NVIDIA-NeMo/projects lands on every project in the NVIDIA-NeMo org, not the Curator project board. Readers won't know which board to look at. Link directly to the Curator project (e.g. /projects/1) or remove the board link until the specific URL is confirmed.

Suggested change
Release work is tracked through versioned [GitHub Milestones](https://github.com/NVIDIA-NeMo/Curator/milestones). Larger initiatives live on the [project board](https://github.com/orgs/NVIDIA-NeMo/projects). For shipped changes, see the [release notes](https://docs.nvidia.com/nemo/curator/latest/about/release-notes).
Release work is tracked through versioned [GitHub Milestones](https://github.com/NVIDIA-NeMo/Curator/milestones). For shipped changes, see the [release notes](https://docs.nvidia.com/nemo/curator/latest/about/release-notes).

Contributor Author

Fixed in 06748dd — dropped the org-level project board link. Couldn't confirm a Curator-specific board URL (token lacks read:project scope), so following the suggestion to remove it until the specific URL is known. Milestones + release notes still link.

- CODE_OF_CONDUCT.md: drop <TODO> placeholder reporting contact.
  Route reports through GitHub-native channels (Report abuse + direct
  maintainer contact) that are reachable today, with a note that a
  dedicated alias will be added once OSPO confirms one.
- README.md: add prerequisite note for `uv` to Path B Quick Start so
  first-time readers don't hit `command not found: uv`.
- README.md: drop org-level project board link from the Roadmap
  section (it lands on all NVIDIA-NeMo projects, not Curator);
  Milestones + release notes remain.

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Comment thread CONTRIBUTING.md Outdated
New to the project? Start here:

1. Browse issues labeled [`good first issue`](https://github.com/NVIDIA-NeMo/Curator/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) — these are scoped to be approachable without deep familiarity with the codebase.
2. For slightly larger work, look at [`help wanted`](https://github.com/NVIDIA-NeMo/Curator/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted%22).
Contributor

I don't think we use this tag.

Contributor Author

Fixed in 58889bf — removed the help wanted reference. Confirmed the repo has 0 issues ever under that label (vs 33 under good first issue), so it's just good first issue now.
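
For reference, the label query URL quoted in the diff above can be reproduced from its search terms. This is a small sketch; `label_query_url` is a hypothetical helper, and only the resulting URL shape comes from the PR.

```python
from urllib.parse import urlencode


def label_query_url(owner: str, repo: str, label: str) -> str:
    """Build the GitHub issue-search URL for open issues carrying `label`."""
    query = f'is:issue is:open label:"{label}"'
    encoded = urlencode({"q": query})  # spaces become "+", ":" and quotes are percent-encoded
    return f"https://github.com/{owner}/{repo}/issues?{encoded}"


print(label_query_url("NVIDIA-NeMo", "Curator", "good first issue"))
# https://github.com/NVIDIA-NeMo/Curator/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22
```

Checking that such a URL returns results still needs a browser or an API call; the sketch only shows how the query string is encoded.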

Comment thread README.md Outdated
## Features by Modality
```bash
python -m venv .venv && source .venv/bin/activate
pip install "nemo-curator[text_cpu]"
```
Contributor

We should encourage uv instead.

Contributor Author

Fixed in 58889bf — Quick Start now standardizes on uv for both Path A and Path B, with a one-time install step up top.

Comment thread README.md Outdated
| **Quality Filtering** | 30+ heuristic filters • fastText classification • GPU-accelerated classifiers for domain, quality, safety, and content type | [Quality Assessment](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/heuristic.html) |
| **Deduplication** | Exact • Fuzzy (MinHash LSH) • Semantic (GPU-accelerated) | [Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/index.html) |
| **Processing** | Text cleaning • Language identification | [Content Processing](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/content-processing/text-cleaning.html) |
**Prerequisites:** CUDA 12 toolkit, NVIDIA driver supporting CUDA 12, Linux x86_64, ~16 GB GPU memory, network access to Hugging Face. This path uses [`uv`](https://docs.astral.sh/uv/) for faster dependency resolution — install it first with `pip install uv` (or follow the [uv install guide](https://docs.astral.sh/uv/getting-started/installation/)).
Contributor

Install uv with curl right?

Contributor Author

Right — fixed in 58889bf. Added a one-time curl -LsSf https://astral.sh/uv/install.sh | sh step at the top of Quick Start, and both paths now use uv venv / uv pip.

Comment thread README.md Outdated
NeMo Curator powers the data pipelines behind [NVIDIA Nemotron](https://developer.nvidia.com/nemotron) models. The [Nemotron-4 pre-training dataset](https://arxiv.org/abs/2402.16819) was curated using NeMo Curator's text pipeline across 8+ trillion tokens of multilingual web data — quality filtering, deduplication, and domain classification at scale.

---
The [Nemotron-CC curation pipeline](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/data_curation/nemotron-cc) uses NeMo Curator end-to-end — from Common Crawl extraction through language ID, exact/fuzzy/substring deduplication, ensemble quality classification, and LLM-based synthetic data generation — to reproduce the [Nemotron-CC datasets](https://huggingface.co/datasets/nvidia/Nemotron-CC-v2). The SDG stage is available as an [in-repo tutorial](tutorials/synthetic/nemotron_cc/).
Contributor

It looks like the first link is broken.

Contributor Author

Fixed in 58889bf — the Nemotron-CC recipe path was wrong (data_curation should be data/curation). Corrected both occurrences; verified 200.

Comment thread README.md Outdated
Comment on lines +92 to +96
| Metric | Workload | Hardware | Baseline | NeMo Curator | Source |
|--------|----------|---------|----------|--------------|--------|
| Fuzzy dedupe speedup | RedPajama v2 subset | 3× H100 80 GB nodes | CPU-based alternative | 10.7 h → 0.65 h (**~16×**) | [Throughput docs](https://docs.nvidia.com/nemo/curator/latest/about/concepts/scaling/throughput) |
| Total cost of ownership | RedPajama v2 subset | 3× H100 80 GB nodes | CPU-based alternative | $315 → $190 (**~40% lower**) | [Throughput docs](https://docs.nvidia.com/nemo/curator/latest/about/concepts/scaling/throughput) |
| GPU scaling (1→4 nodes) | RedPajama v2 subset | 1, 2, 4 × H100 80 GB nodes | Single-node run | 2.05 h → 1.01 h → 0.50 h | [Throughput docs](https://docs.nvidia.com/nemo/curator/latest/about/concepts/scaling/throughput) |
Contributor

We probably don't need to list the same link 3 times here.

Contributor Author

Fixed in 58889bf — dropped the Source column entirely. The section intro already links the throughput docs once, so the per-row repetition is gone.
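
As a sanity check on the table quoted in this thread, the bolded figures follow from simple ratios of the listed values. The helper names below are illustrative; the inputs are the numbers from the table rows.

```python
def speedup(baseline: float, accelerated: float) -> float:
    """How many times faster the accelerated run is than the baseline."""
    return baseline / accelerated


def relative_saving(baseline: float, new: float) -> float:
    """Fraction of the baseline cost that is saved."""
    return (baseline - new) / baseline


# Fuzzy dedupe: 10.7 h -> 0.65 h
print(round(speedup(10.7, 0.65), 2))        # 16.46, reported as ~16x

# Total cost of ownership: $315 -> $190
print(round(relative_saving(315, 190), 3))  # 0.397, reported as ~40% lower

# GPU scaling: 2.05 h on 1 node, 1.01 h on 2 nodes, 0.50 h on 4 nodes
for nodes, hours in [(1, 2.05), (2, 1.01), (4, 0.50)]:
    print(nodes, round(speedup(2.05, hours), 2))  # 1.0, 2.03, 4.1
```

The scaling row works out to roughly 2.03x at 2 nodes and 4.1x at 4 nodes, which is close to linear over the range shown.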

Comment thread README.md Outdated
|----------|------|
| Installation guide (CPU, GPU, Docker, source) | [docs.nvidia.com/nemo/curator/latest/get-started/installation](https://docs.nvidia.com/nemo/curator/latest/get-started/installation) |
| Container image | [nemo-curator on NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) |
| Infrastructure (Slurm, Kubernetes, multi-node) | [Infrastructure docs](https://docs.nvidia.com/nemo/curator/latest/reference/infrastructure/index) |
Contributor

Broken link.

Contributor Author

Fixed in 58889bf: reference/infrastructure/index -> reference/infra (canonical path per the live site's llms.txt index); verified 200.

Comment thread README.md Outdated
| Installation guide (CPU, GPU, Docker, source) | [docs.nvidia.com/nemo/curator/latest/get-started/installation](https://docs.nvidia.com/nemo/curator/latest/get-started/installation) |
| Container image | [nemo-curator on NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) |
| Infrastructure (Slurm, Kubernetes, multi-node) | [Infrastructure docs](https://docs.nvidia.com/nemo/curator/latest/reference/infrastructure/index) |
| API reference | [API docs](https://docs.nvidia.com/nemo/curator/latest/apidocs/index) |
Contributor

Broken link.

Contributor Author

Fixed in 58889bf: apidocs/index -> api/reference/api-reference; verified 200.

Comment thread README.md Outdated
| Container image | [nemo-curator on NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo-curator) |
| Infrastructure (Slurm, Kubernetes, multi-node) | [Infrastructure docs](https://docs.nvidia.com/nemo/curator/latest/reference/infrastructure/index) |
| API reference | [API docs](https://docs.nvidia.com/nemo/curator/latest/apidocs/index) |
| Concepts | [Concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/index) |
Contributor

Broken link.

Contributor Author

Fixed in 58889bf: about/concepts/index -> about/concepts; verified 200.

Comment thread README.md Outdated
| Concepts | [Concepts](https://docs.nvidia.com/nemo/curator/latest/about/concepts/index) |

### Quality Improvements
Supported versions are defined in [`pyproject.toml`](pyproject.toml) and exposed on the PyPI badge above; the README does not duplicate them to avoid drift.
Contributor

I don't think the pyproject lists Curator versions right? Just dependencies?

Contributor Author

Good catch — fixed in 58889bf. pyproject pins Python and dependency versions, not the Curator release version. Reworded to 'Supported Python and dependency versions are defined in pyproject.toml.'

Comment thread README.md Outdated
## Roadmap

**Results:** Progressive improvements in zero-shot downstream task performance through text cleaning, deduplication, and quality filtering stages.
Release work is tracked through versioned [GitHub Milestones](https://github.com/NVIDIA-NeMo/Curator/milestones). For shipped changes, see the [release notes](https://docs.nvidia.com/nemo/curator/latest/about/release-notes).
Contributor

Milestones looks empty to me.

Contributor Author

Fixed in 58889bf — the milestones are indeed all empty (0 issues). Reframed the Roadmap section to point to the release notes for shipped work and Issues/Discussions for planned direction, and dropped the empty Milestones link.

README:
- Standardize Quick Start on uv: add one-time `curl` installer step and
  use `uv venv`/`uv pip` in both Path A and Path B (was plain pip in A,
  pip-installed uv in B).
- Fix broken doc links to canonical paths verified against the live
  site's llms.txt index:
  - infrastructure: reference/infrastructure/index -> reference/infra
  - API reference: apidocs/index -> api/reference/api-reference
  - concepts: about/concepts/index -> about/concepts
- Fix broken Nemotron-CC recipe link (data_curation -> data/curation),
  both occurrences.
- Remove the Source column from the benchmark table (the same throughput
  link was repeated 3x); the section intro already links the source once.
- Correct the executor claim: XennaExecutor (Cosmos-Xenna) is the
  production default with experimental Ray backends, per the API docs
  (resolves the unconfirmed "default for video" claim Greptile flagged).
- Clarify the pyproject note: it pins Python/dependency versions, not
  Curator release versions.
- Roadmap: point to release notes (shipped) + Issues/Discussions
  (planned); drop the empty Milestones link.

CONTRIBUTING:
- Remove the `help wanted` label reference; the repo has never used it
  (0 issues). Keep `good first issue` (33 issues).

Signed-off-by: Lawrence Lane <llane@nvidia.com>

copy-pr-bot Bot commented Jun 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


lbliii commented Jun 2, 2026

Thanks @sarahyurick — all review comments addressed in 58889bf. Summary:

  • uv standardization: Quick Start now has a one-time curl installer and both paths use uv venv / uv pip.
  • Broken links fixed (all verified 200 against the live site's llms.txt index):
    • reference/infrastructure/index -> reference/infra
    • apidocs/index -> api/reference/api-reference
    • about/concepts/index -> about/concepts
    • Nemotron-CC recipe: data_curation -> data/curation (both occurrences)
  • Benchmark table: dropped the repeated Source column (intro links the source once).
  • pyproject note: reworded — it pins Python/dependency versions, not Curator release versions.
  • Roadmap: milestones are empty, so reframed to release notes (shipped) + Issues/Discussions (planned); dropped the empty Milestones link.
  • help wanted: removed from CONTRIBUTING (0 issues ever; kept good first issue with 33).
  • Executor claim (Greptile): corrected to "XennaExecutor (Cosmos-Xenna) is the production default, with experimental Ray backends," per the API docs.

I ran a link check across README, CONTRIBUTING, and CODE_OF_CONDUCT — all external links return 200 and all relative paths exist. Ready for another look.
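
The docs path corrections in this summary amount to a small old-to-new mapping. A sketch of applying such a mapping to README text follows; the mapping values come from the commit message, while `apply_path_fixes` and the plain substring replacement are assumptions about how one might do it.

```python
DOCS_BASE = "https://docs.nvidia.com/nemo/curator/latest/"

# Old path -> canonical path, as listed in the commit message.
PATH_FIXES = {
    "reference/infrastructure/index": "reference/infra",
    "apidocs/index": "api/reference/api-reference",
    "about/concepts/index": "about/concepts",
}


def apply_path_fixes(text: str, base: str = DOCS_BASE, fixes: dict[str, str] = PATH_FIXES) -> str:
    """Replace each stale docs URL with its canonical form.

    Plain substring replacement: a longer URL that merely starts with a
    stale path would also be rewritten, so review the diff afterwards.
    """
    for old, new in fixes.items():
        text = text.replace(base + old, base + new)
    return text


line = "| API reference | [API docs](https://docs.nvidia.com/nemo/curator/latest/apidocs/index) |"
print(apply_path_fixes(line))
```

Anchoring each replacement on the full base URL keeps unrelated occurrences of strings such as `apidocs/index` elsewhere in the file untouched.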

@lbliii lbliii requested a review from sarahyurick June 2, 2026 19:16
@lbliii lbliii enabled auto-merge (squash) June 3, 2026 15:16

lbliii commented Jun 3, 2026

/ok to test 8a09b67

@sarahyurick
Contributor

/ok to test 56e3d06

@lbliii lbliii merged commit 9bd9023 into NVIDIA-NeMo:main Jun 3, 2026
24 checks passed