[benchmarking] Multiple fixes to stabilize the nightly benchmark suite #2035

Draft
rlratzel wants to merge 14 commits into NVIDIA-NeMo:main from rlratzel:2606-update_benchmark_env_check

@rlratzel rlratzel commented May 27, 2026

Summary

Bundles eight independent fixes uncovered while triaging the Curator
nightly benchmark suite against main (reference nemo-ci pipelines
52840568,
53098506,
53323567,
and 53333039).

Pass-rate progression: 23/37 → 28/37 → 6/8 (scoped) → 1/2 (scoped) → expected
2/2 with the latest commit.

1. Make undefined env vars in config non-fatal by default (b737202d)

  • resolve_env_vars previously raised ValueError on the first undefined
    ${VAR} reference, halting the entire benchmark session even when the
    missing var was used by only a few entries.
  • Default behavior is now to substitute an empty string and log a warning
    so unrelated entries can still run.
  • Adds --strict-config-check CLI flag to run.py to restore the old
    fail-fast behavior.
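
A minimal sketch of the two modes described above, for illustration only. The function name `resolve_env_vars_sketch`, the `strict` and `env` parameters, and the `${VAR}` regular expression are assumptions; the actual resolve_env_vars in the benchmarking code may be structured differently.

```python
import logging
import os
import re

logger = logging.getLogger(__name__)

# Matches ${VAR} references. The exact pattern used by the real resolver is an assumption.
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_vars_sketch(value, strict=False, env=None):
    """Replace ${VAR} references in a string (hypothetical stand-in for resolve_env_vars)."""
    env = os.environ if env is None else env

    def _substitute(match):
        name = match.group(1)
        if name in env:
            return env[name]
        if strict:
            # Fail-fast mode, corresponding to --strict-config-check.
            raise ValueError(f"Environment variable {name} is not defined")
        # Default mode: warn and continue so unrelated entries can still run.
        logger.warning("Undefined environment variable %s; substituting an empty string", name)
        return ""

    return _ENV_REF.sub(_substitute, value)
```

With `strict=False`, a config value such as `token=${HF_SECRET_KEY}` resolves to `token=` when the variable is unset; with `strict=True` the same input raises ValueError.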

2. New --entries-exact flag, exact-match entry filtering (ab1d4385, e8c6a52f, 7b351693)

Fixes a substring-aliasing bug in CI per-job invocations:

  • --entries uses pytest's -k substring expression evaluator, so
    --entries audio_tagging_tts_xenna also matches
    audio_tagging_tts_xenna_repeat. In CI, where each per-job script runs
    --entries <entry-name>, this caused the non-_repeat job to also
    execute the _repeat entry, polluting that entry's per-entry results
    dir and crashing the subsequent legitimate _repeat SLURM job with
    Capture file ... already exists at Ray cluster setup.
  • Adds --entries-exact accepting a comma-separated list of exact entry
    names. Every supplied name must match a configured (enabled) entry,
    otherwise the run aborts with a ValueError listing the unknown names
    alongside the available entry names.
  • Mutually exclusive with --entries (CLI and Session.from_dict both
    enforce this).
  • benchmarking/tools/ci_benchmark_launcher.sh switched from --entries
    to --entries-exact.
  • Interactive --entries substring expression semantics are unchanged.
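
The difference between the two selection modes can be sketched as follows. Both function names are hypothetical, and `select_entries_substring` is a deliberately simplified stand-in for pytest's -k evaluator (it handles only a single bare identifier), so treat this as an illustration of the aliasing problem rather than the code in session.py.

```python
def select_entries_substring(expr, configured):
    """Simplified stand-in for -k matching on one bare identifier: substring match."""
    return [name for name in configured if expr in name]


def select_entries_exact(requested, configured):
    """Return the configured entries whose names exactly match `requested`.

    Duplicates in `requested` collapse, and the result follows the order of
    `configured` (the YAML order). Unknown names are a hard error.
    """
    wanted = set(requested)
    unknown = sorted(wanted - set(configured))
    if unknown:
        raise ValueError(
            f"Unknown entry name(s): {unknown}; available entries: {list(configured)}"
        )
    return [name for name in configured if name in wanted]


configured_entries = [
    "audio_tagging_tts_xenna",
    "audio_tagging_tts_xenna_repeat",
    "ndd_ray_serve_dp4",
]
# Substring matching also picks up the _repeat sibling.
substring_hits = select_entries_substring("audio_tagging_tts_xenna", configured_entries)
# Exact matching selects only the named entry.
exact_hits = select_entries_exact(["audio_tagging_tts_xenna"], configured_entries)
```

Here `substring_hits` contains both audio_tagging_tts_xenna entries, while `exact_hits` contains only the one that was requested.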

3. Bump too-tight timeout_s values for several entries

Multiple nightly-benchmark.yaml entries had timeout_s values that
didn't cover the actual wall time on the EOS H100 runner. These split
into two categories:

  • Pre-existing tight limits that overshot during Ray teardown (section 3.1).
  • "No-op" timeouts: values set back when the entry effectively did no work
    (e.g. rpv2 data was unreadable, or fuzzy_id_generator artifacts were
    stale and the test failed fast); once the underlying issue was fixed
    and the test ran for real, the historical ceiling was too tight
    (sections 3.2 and 3.3).

3.1 ndd_ray_serve_dp4: 700 → 1200 (ab1d4385)
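
For reference, the commit message for this change notes that timeout_s: 700 becomes a SLURM limit of --time=00:11:40. A small sketch of that seconds-to-HH:MM:SS conversion is shown here; the helper name `slurm_time_limit` is hypothetical, and how the launcher actually derives the --time value is an assumption.

```python
def slurm_time_limit(timeout_s):
    """Format a timeout in seconds as an HH:MM:SS string suitable for SLURM --time."""
    hours, remainder = divmod(int(timeout_s), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


old_limit = slurm_time_limit(700)   # "00:11:40"
new_limit = slurm_time_limit(1200)  # "00:20:00"
```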

3.2 exact_dedup_identification: 500 → 1500 (21647b4f)

  • The 500s value was set when the test was effectively a no-op (the rpv2
    dataset was unreadable due to filesystem permissions, so the test failed
    fast on stat()). Once rpv2 access was restored, the test SLURM-killed
    at 68% (516/755) of the "Inserting into shuffler" phase.
  • Shuffler runs at ~1.83 it/s, so 755 items take ~410s on EOS H100; adding
    Ray cluster setup, dataset stat, post-shuffle dedup compute, and cleanup
    gives a realistic wall of 800-1000s.
  • 1500s gives ~50% headroom over the 1000s upper estimate.
  • Reference failed job:
    https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/329434352
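
The budget arithmetic for this entry can be reproduced with a short sketch. The helper names are hypothetical; the inputs (755 items, ~1.83 it/s, a 1000s upper wall estimate, and the 1500s timeout) are the figures quoted in this section.

```python
def phase_seconds(items, rate_per_second):
    """Time for a phase that processes `items` at a steady rate."""
    return items / rate_per_second


def headroom(timeout_s, expected_wall_s):
    """Fractional margin of the timeout over the expected wall time."""
    return timeout_s / expected_wall_s - 1.0


shuffler_s = phase_seconds(755, 1.83)  # roughly 410 s for the shuffler phase alone
margin = headroom(1500, 1000)          # 0.5, i.e. ~50% headroom
```

The shuffler phase alone consumes most of the old 500s budget, which is why the job was killed partway through that phase once the dataset became readable.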

3.3 dedup_removal_*: 1100/1500 → 1800/1800 (c2f3a6b0)

Once the upstream fuzzy_id_generator artifacts were refreshed (see
section 5; the delete_scratch: false fix made this possible), the
dedup_removal tests actually ran the workload for the first time:

  • dedup_removal_raydata: observed wall ~1419s (92% complete when killed
    at the prior 1100s ceiling) → 1800s (~20% headroom over the estimated
    1500s full wall, ~27% over the observed 1419s).
  • dedup_removal_xenna: passed at 1523s wall with the prior 1500s ceiling
    (zero margin) → 1800s as a preemptive bump to match raydata and absorb
    run-to-run variance.

Reference: nemo-ci pipeline 53333039 / leaf 53333328 — raydata killed by
SLURM TIME LIMIT during normal pipeline execution while xenna finished
with zero headroom.

4. Install lynx in the CI benchmark launcher (d0eac901)

The math benchmarks (math_preprocess, math_preprocess_classifier,
math_preprocess_llm_cleanup) shell out to the lynx text browser via
nemo_curator/stages/math/download/html_extractors/lynx.py for HTML
extraction. lynx is not in the Curator container, so those benchmarks
fail with RuntimeError: lynx executable not found in PATH.

lynx is GPL-licensed, so we deliberately do not bake it into the
redistributable Curator image. Instead it is installed transiently in the
existing benchmark container at CI run time, used during the run, and
discarded with the container. The published image stays GPL-free; the
apt-installed lynx only lives for the lifetime of each CI container.
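
The failure mode above amounts to a PATH lookup that finds nothing. A minimal sketch of such a check follows; `require_executable` is a hypothetical helper, and the actual logic in lynx.py is not shown in this PR, so the structure here is an assumption.

```python
import shutil


def require_executable(name):
    """Return the full path of `name` on PATH, or raise if it is not installed."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(f"{name} executable not found in PATH")
    return path
```

Calling `require_executable("lynx")` inside a container without lynx raises the RuntimeError quoted above; after the CI launcher installs the package, the same call returns the binary's path.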

5. (Reverted) Preserve scratch dir for fuzzy_dedup_identification (63691509 then a3c38ac6)

Initial commit 63691509 added delete_scratch: false to the
fuzzy_dedup_identification entry to keep its scratch artifacts around for
the downstream dedup_removal_* benchmarks. In practice the consumption
path is the canonical dataset path:

{datasets_path}/cleaned_exact_dedup_all_cc_fuzzy_output_nightly_container_paths/
  fuzzy_id_generator.json
  FuzzyDuplicateIds/

Operators promote a known-good fuzzy_dedup output to this path once, and
the dedup_removal entries consume it from there on every subsequent
pipeline — there is no same-pipeline dependency. Given that workflow, the
per-entry delete_scratch: false override is unnecessary and just leaves
unused data on lustre across runs. Reverted by a3c38ac6; the entry now
uses the session-level default (delete_scratch: true).

6. Relax domain_label_games_count to a +/-5% range (0c375b7e)

Both domain_classification_raydata and domain_classification_xenna
had a requirement on domain_label_games_count with exact_value: 149816.
Run-to-run output of the classifier drifts by several hundred to a few
thousand classifications, causing repeated false failures while the actual
benchmark (throughput, total docs, number of domains predicted) is
healthy.

Loosens the metric to a +/-5% range (min_value: 142325,
max_value: 157307) which still catches genuine regressions of the
classifier output while tolerating normal run-to-run variability.
domain_label_news_count is left at exact_value: 2817 pending further
investigation.
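
The bounds follow directly from the previous exact value. The sketch below recomputes them; the helper names `tolerance_range` and `within_range` are hypothetical and are not part of the benchmarking framework.

```python
def tolerance_range(expected, tolerance):
    """Return (min_value, max_value) for a symmetric relative tolerance, rounded to integers."""
    return round(expected * (1 - tolerance)), round(expected * (1 + tolerance))


def within_range(value, min_value, max_value):
    """Inclusive range check, as a min_value/max_value requirement would apply it."""
    return min_value <= value <= max_value


min_value, max_value = tolerance_range(149816, 0.05)  # (142325, 157307)
```

A drift of a few thousand classifications around 149816 stays inside this window, while a change of more than about 7,500 in either direction still fails the requirement.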

Verification

Across multiple nemo-ci pipelines:

  • 53098506, first verification of commits 1, 2, 3.1, 3.2 (and #4 below): 28/37
    pass (up from 23/37 baseline). Confirmed
    audio_tagging_tts_xenna_repeat works (--entries-exact), ndd_ray_serve_dp4
    works (timeout bump), exact_dedup_identification works (rpv2 + timeout),
    audio_tagging_tts_xenna works without HF_SECRET_KEY set.
  • 53323567
    — scoped 8-entry run of commits 4, 5, 6: 6/8 pass (math_preprocess*
    all pass thanks to lynx install; domain_classification_* pass with the
    metric range; fuzzy_dedup_identification passes). The two failures
    (dedup_removal_*) were caused by a same-pipeline race condition with
    fuzzy_dedup_identification — fixed by operators manually copying the
    fresh fuzzy_id_generator.json artifacts to the canonical dataset path
    between runs.
  • 53333039
    — scoped 2-entry run with fresh artifacts in place: dedup_removal_xenna
    passes (1523s, no margin); dedup_removal_raydata SLURM-killed by TIME
    LIMIT at 92% completion → addressed by commit c2f3a6b0 (section 3.3).

A fresh pipeline against the latest tip (which includes c2f3a6b0) is
required to confirm both dedup_removal entries pass cleanly.

Test plan

Undefined env vars

  • audio_tagging_tts_xenna ran successfully in 53098506 with
    HF_SECRET_KEY unset.
  • Run with --strict-config-check; confirm it still exits with the
    original ValueError.
  • Run an unmodified config (all env vars defined); confirm no
    behavior change.

--entries-exact

  • CI per-job invocation ran exactly one entry per job in 53098506
    and 53323567; no cross-entry pollution observed.
  • --entries-exact <typo> exits with an error listing the unknown
    name and the available entry names.
  • --entries-exact a,b,c runs only those three entries in YAML order.
  • Passing both --entries and --entries-exact exits with a
    "mutually exclusive" error.
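
The CLI behavior exercised by these checks could be wired up roughly as follows. This is a sketch using argparse's built-in mutually exclusive groups; whether run.py uses this mechanism is an assumption, and `parse_exact_names` and `build_parser` are hypothetical names.

```python
import argparse


def parse_exact_names(raw):
    """Split a comma-separated value into entry names, rejecting empty input."""
    names = [part.strip() for part in raw.split(",") if part.strip()]
    if not names:
        raise argparse.ArgumentTypeError("--entries-exact requires at least one entry name")
    return names


def build_parser():
    parser = argparse.ArgumentParser(prog="run.py")
    # argparse reports an error and exits if both options in the group are supplied.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--entries", help="substring expression for interactive use")
    group.add_argument(
        "--entries-exact",
        type=parse_exact_names,
        help="comma-separated list of exact entry names",
    )
    return parser


args = build_parser().parse_args(["--entries-exact", "a, b,,c"])
```

After parsing, `args.entries_exact` is a list of the three names with whitespace and empty items dropped; validating those names against the configured entries would happen later, when the session is built.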

Timeout bumps

  • ndd_ray_serve_dp4 finished without TIME LIMIT cancellation in
    53098506.
  • exact_dedup_identification finished without TIME LIMIT in 53098506.
  • After c2f3a6b0 lands or in a follow-up run, confirm
    dedup_removal_raydata and dedup_removal_xenna both finish without
    TIME LIMIT cancellation. (Both passed in 53398187.)

lynx install

  • math_preprocess* entries all passed in 53323567 — confirmed
    RuntimeError: lynx executable not found no longer fires.

fuzzy_dedup scratch retention (now reverted)

  • One-time artifact promotion completed: fresh fuzzy_id_generator.json +
    FuzzyDuplicateIds/ written to the canonical dataset path
    ({datasets_path}/cleaned_exact_dedup_all_cc_fuzzy_output_nightly_container_paths/).
  • dedup_removal_* consume the promoted artifacts in a clean
    pipeline (verified in 53398187).
  • After artifact promotion, the per-entry delete_scratch: false
    override on fuzzy_dedup_identification is no longer needed and has
    been reverted (commit a3c38ac6).

domain_label_games_count range

  • domain_classification_* both passed in 53323567 with the new
    range satisfied.

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot Bot commented May 27, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rlratzel rlratzel changed the title [benchmarking] Make undefined env vars in config non-fatal by default [benchmarking] Nightly-suite fixes: undefined env vars, --entries-exact, ndd_ray_serve_dp4 timeout May 28, 2026
  --engine-kwargs='{"tensor_parallel_size": 1}'
  --autoscaling-config='{"min_replicas": 4, "max_replicas": 4}'
- timeout_s: 700
+ timeout_s: 1200 # warm-run wall ~700s observed; headroom added for cold vLLM model load (cf. ndd_dynamo_dp4: 2700)
Contributor

I'm wondering why this happened now?

Contributor

Because even dynamo should not take 2700s right now, since we upgraded versions to 1.1.0 (i.e., we can reduce that).

Contributor Author

I need to investigate that. I know we had success on the DGX-A100 machine with 700s, but I don't know yet if a larger timeout is needed for the other machine running nightlies because it's slower in general, or if something else is causing a longer runtime on the other machine.

rlratzel and others added 4 commits May 29, 2026 13:23
`resolve_env_vars` previously raised `ValueError` on the first undefined
`${VAR}` reference, halting the entire session even when the missing var
was only used by a few entries. The default is now to substitute an empty
string and log a warning. Pass `--strict-config-check` to `run.py` to
restore the old fail-fast behavior.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
In nemo-ci pipeline 52840568 / leaf 52841002, ndd_ray_serve_dp4
was SLURM-killed by TIME LIMIT at ~11:30 wall — only seconds
after its benchmark subprocess succeeded (Output 853/853,
benchmark wall 249s, total subprocess wall ~450s).

The existing timeout_s: 700 converts to SLURM --time=00:11:40,
giving no headroom for Ray teardown or cold vLLM model load.

Bump to 1200s (20 min):
- ~70% headroom over the observed warm-run wall
- Still well below ndd_dynamo_dp4's 2700s ceiling, which is
  documented for cold flash-attn / gpt-oss-20b loads

Reference failed job:
https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/327777542

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
The existing --entries flag uses pytest's "-k" expression evaluator,
which does substring matching on bare identifiers. This is correct
for interactive use but dangerous for automated callers that target
a single known entry: passing --entries foo also selects foo_repeat,
foo_extra, etc.

Concrete failure: in nemo-ci pipeline 52840568 / leaf 52841002, the
SLURM job for entry "audio_tagging_tts_xenna" was invoked with
--entries audio_tagging_tts_xenna, which also matched the sibling
"audio_tagging_tts_xenna_repeat". That entry was executed within the
non-_repeat SLURM job, leaving a logs/ray.log file in the _repeat
entry's per-entry results dir. The legitimate _repeat SLURM job then
crashed at Ray cluster setup because logs/ is preserved by design
(run.py:175-178) and the stale ray.log capture file collided.

Changes:
* benchmarking/run.py: add --entry-exact-name argparse flag (mutually
  exclusive with --entries); pass through to Session.from_dict.
* benchmarking/runner/session.py: extend Session.from_dict to accept
  entry_exact_name; when set, filter by exact entry-name equality
  (takes precedence over entry_filter_expr; passing both raises
  ValueError).
* benchmarking/tools/ci_benchmark_launcher.sh: switch CI per-job
  invocation from --entries to --entry-exact-name. ENTRY_NAME is
  already populated by the per-entry CI job generator with the exact
  entry name, so no value change is needed.

Interactive --entries behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
Brings the new flag in line with --entries in both look and feel:

* Accepts a comma-separated list of one or more exact entry names,
  not just a single name. This matches the mental model of --entries
  (which conceptually selects a set of entries) and lets a single
  invocation target any subset by exact name.
* Every name in the list must exactly match a configured (enabled)
  entry; otherwise the run aborts with a ValueError that lists the
  unknown names alongside the available entry names. This makes
  typos a hard error rather than a silent no-op.
* Duplicates in the input are collapsed; result order follows the
  YAML, matching how --entries behaves.

* benchmarking/run.py: rename argparse flag --entry-exact-name to
  --entries-exact; parse comma-separated value into list[str]; reject
  empty / whitespace-only inputs; wrap Session.from_dict in
  try/except to surface ValueError as a clean CLI error.
* benchmarking/runner/session.py: rename parameter entry_exact_name
  to entries_exact (list[str]); add strict validation that every
  requested name matches a configured entry; error message lists
  both missing and available names.
* benchmarking/tools/ci_benchmark_launcher.sh: rename flag in the CI
  per-job invocation; ENTRY_NAME is a single name today so this
  works as a single-element list.

Interactive --entries semantics unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
@rlratzel rlratzel force-pushed the 2606-update_benchmark_env_check branch from ed1855e to 7b35169 Compare May 29, 2026 18:23
In nemo-ci pipeline 53092974 / leaf 53093663, exact_dedup_identification
was SLURM-killed by TIME LIMIT at 68% (516/755) into the "Inserting
into shuffler" phase. The shuffler ran at ~1.83 it/s on EOS H100, so
the shuffler phase alone needs ~410s, plus Ray cluster setup, dataset
stat, post-shuffle dedup compute, and cleanup — total wall ~800-1000s,
not fitting in a 500s budget.

The previous 500s value was set when the test was effectively a no-op
(the rpv2 dataset was unreadable due to filesystem permissions, so the
test failed fast on stat() before doing any real work — see PR
description for the rpv2 access fix story). 500s was also reportedly
sufficient on a faster system; EOS may simply be slower for this
workload. Worth investigating after the test is unblocked.

Bump to 1500s (25 min):
- ~50% headroom over the estimated 1000s realistic wall on EOS
- Parallel in spirit to the ndd_ray_serve_dp4 700 -> 1200 bump
  earlier in this PR

Reference failed job:
https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/329434352

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
@rlratzel rlratzel changed the title [benchmarking] Nightly-suite fixes: undefined env vars, --entries-exact, ndd_ray_serve_dp4 timeout [benchmarking] Nightly-suite fixes: undefined env vars, --entries-exact, timeout adjustments May 29, 2026
rlratzel and others added 3 commits June 1, 2026 14:06
The math benchmarks (math_preprocess, math_preprocess_classifier,
math_preprocess_llm_cleanup) shell out to the lynx text browser via
nemo_curator/stages/math/download/html_extractors/lynx.py for HTML
extraction. lynx is not present in the Curator benchmark container,
so those benchmarks currently fail with:

  RuntimeError: lynx executable not found in PATH

lynx is GPL-licensed, so we deliberately do not bake it into the
redistributable Curator image. Instead it is installed transiently in
the existing benchmark container at CI run time. The image we publish
stays GPL-free; the apt-installed lynx lives only for the lifetime of
the CI container.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
Adds `delete_scratch: false` to the fuzzy_dedup_identification entry so
its scratch directory (under session_entry_dir/scratch/{cache,output})
is retained after the entry finishes. The downstream dedup_removal_*
benchmarks read these artifacts at known paths, so the prior default
session-level cleanup (delete_scratch: true) was wiping them out
before they could be consumed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
Both domain_classification_raydata and domain_classification_xenna
have a requirement on domain_label_games_count with exact_value:
149816. Run-to-run output of the classifier drifts by several
hundred to a few thousand classifications, causing repeated false
failures while the actual benchmark (throughput, total docs, number
of domains predicted) is healthy.

Loosens the metric to a +/- 5% range (142325 .. 157307) which still
catches genuine regressions of the classifier output while tolerating
normal run-to-run variability. domain_label_news_count is left at
exact_value: 2817 pending further investigation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
@rlratzel rlratzel changed the title [benchmarking] Nightly-suite fixes: undefined env vars, --entries-exact, timeout adjustments [benchmarking] Multiple fixes to stabilize the nightly benchmark suite Jun 1, 2026
rlratzel and others added 6 commits June 2, 2026 05:12
…rk_env_check

Signed-off-by: rlratzel <rratzel@nvidia.com>
Both dedup_removal entries had timeout_s values that fit when the test
was effectively a no-op (it failed fast on stale fuzzy_id_generator
artifacts before doing real work). Once the upstream
fuzzy_id_generator data was refreshed, the actual benchmarks ran for
real:

* dedup_removal_raydata observed wall ~1419s (92% complete when killed
  at the prior 1100s ceiling) -> 1800s (~20% headroom over the
  estimated 1500s full wall, ~27% over the observed 1419s)
* dedup_removal_xenna passed at 1523s wall with the prior 1500s
  ceiling (no margin) -> 1800s as a preemptive bump to match raydata
  and absorb run-to-run variance

Reference: nemo-ci pipeline 53333039 / leaf 53333328 — raydata killed
by SLURM TIME LIMIT during normal pipeline execution while xenna
finished successfully but with zero headroom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
The earlier delete_scratch: false override on fuzzy_dedup_identification
was added so the entry's scratch/output artifacts could be picked up
by the downstream dedup_removal_* benchmarks within the same pipeline.

In practice the artifacts are consumed via the canonical dataset path:

  {datasets_path}/cleaned_exact_dedup_all_cc_fuzzy_output_nightly_container_paths/
    fuzzy_id_generator.json
    FuzzyDuplicateIds/

Operators promote a known-good fuzzy_dedup output to this path once,
and dedup_removal_raydata / dedup_removal_xenna consume it from there
on every subsequent pipeline (no same-pipeline dependency). With that
workflow in place, the per-entry delete_scratch override is no longer
needed and just leaves unused data on lustre across runs. Revert to
the session-level default (delete_scratch: true).

This reverts commit 6369150.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
For cross-host benchmark comparisons (e.g. A100 vs H100) the host's
/dev/shm size differs (~550 GB vs ~1 TB) and the container inherits the
host default. Provide an opt-in env var to remount /dev/shm at a chosen
size. Unset preserves prior behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
Background `nvidia-smi` poller writes one CSV per entry covering all GPUs on
the node, independent of CUDA_VISIBLE_DEVICES. Lets us verify post-run
whether Ray/Xenna actually honored the visible-device mask (any nonzero util
on masked indices ⇒ leakage).

Unset → no polling, preserving prior behavior. Subprocess is killed on EXIT
via trap so a python crash doesn't leave it orphaned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
The benchmarking container cap of 1 TiB on the A100 host shrinks the
container's visible memory to ~50% of host and shm to ~25%, which makes
env.json report 1024 GiB / 512 GiB even though the host has ~2 TiB /
~1008 GiB. EOS reports the host values directly, so the A100 vs EOS
comparison shows an artificial environmental mismatch. Raising the cap
to 2 TiB lets A100's container see the host's full memory, matching
EOS.

Also remove the CURATOR_SHM_SIZE_BYTES env-var block from
ci_benchmark_launcher.sh: pyxis-on-EOS does not grant CAP_SYS_ADMIN, so
the remount silently fell back to the WARNING branch and never applied.
With A100 raised to match EOS, the toggle is no longer needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>