[benchmarking] Multiple fixes to stabilize the nightly benchmark suite #2035
Draft
rlratzel wants to merge 14 commits into
  --engine-kwargs='{"tensor_parallel_size": 1}'
  --autoscaling-config='{"min_replicas": 4, "max_replicas": 4}'
- timeout_s: 700
+ timeout_s: 1200 # warm-run wall ~700s observed; headroom added for cold vLLM model load (cf. ndd_dynamo_dp4: 2700)
Contributor
I'm wondering why this happened now?
Contributor
Because even dynamo should not take 2700s right now, since we upgraded versions to 1.1.0 (i.e. we can reduce that).
Contributor
Author
I need to investigate that. I know we had success on the DGX-A100 machine with 700s, but I don't know yet if a larger timeout is needed for the other machine running nightlies because it's slower in general, or if something else is causing a longer runtime on the other machine.
praateekmahajan
approved these changes
May 29, 2026
`resolve_env_vars` previously raised `ValueError` on the first undefined
`${VAR}` reference, halting the entire session even when the missing var
was only used by a few entries. The default is now to substitute an empty
string and log a warning. Pass `--strict-config-check` to `run.py` to
restore the old fail-fast behavior.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
In nemo-ci pipeline 52840568 / leaf 52841002, ndd_ray_serve_dp4 was SLURM-killed by TIME LIMIT at ~11:30 wall, only seconds after its benchmark subprocess succeeded (Output 853/853, benchmark wall 249s, total subprocess wall ~450s). The existing timeout_s: 700 converts to SLURM --time=00:11:40, giving no headroom for Ray teardown or cold vLLM model load.

Bump to 1200s (20 min):
- ~70% headroom over the observed warm-run wall
- Still well below ndd_dynamo_dp4's 2700s ceiling, which is documented for cold flash-attn / gpt-oss-20b loads

Reference failed job: https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/327777542

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
The existing --entries flag uses pytest's "-k" expression evaluator, which does substring matching on bare identifiers. This is correct for interactive use but dangerous for automated callers that target a single known entry: passing --entries foo also selects foo_repeat, foo_extra, etc.

Concrete failure: in nemo-ci pipeline 52840568 / leaf 52841002, the SLURM job for entry "audio_tagging_tts_xenna" was invoked with --entries audio_tagging_tts_xenna, which also matched the sibling "audio_tagging_tts_xenna_repeat". That entry was executed within the non-_repeat SLURM job, leaving a logs/ray.log file in the _repeat entry's per-entry results dir. The legitimate _repeat SLURM job then crashed at Ray cluster setup because logs/ is preserved by design (run.py:175-178) and the stale ray.log capture file collided.

Changes:
* benchmarking/run.py: add --entry-exact-name argparse flag (mutually exclusive with --entries); pass through to Session.from_dict.
* benchmarking/runner/session.py: extend Session.from_dict to accept entry_exact_name; when set, filter by exact entry-name equality (takes precedence over entry_filter_expr; passing both raises ValueError).
* benchmarking/tools/ci_benchmark_launcher.sh: switch CI per-job invocation from --entries to --entry-exact-name. ENTRY_NAME is already populated by the per-entry CI job generator with the exact entry name, so no value change is needed.

Interactive --entries behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
Brings the new flag in line with --entries in both look and feel:
* Accepts a comma-separated list of one or more exact entry names, not just a single name. This matches the mental model of --entries (which conceptually selects a set of entries) and lets a single invocation target any subset by exact name.
* Every name in the list must exactly match a configured (enabled) entry; otherwise the run aborts with a ValueError that lists the unknown names alongside the available entry names. This makes typos a hard error rather than a silent no-op.
* Duplicates in the input are collapsed; result order follows the YAML, matching how --entries behaves.

Changes:
* benchmarking/run.py: rename argparse flag --entry-exact-name to --entries-exact; parse comma-separated value into list[str]; reject empty / whitespace-only inputs; wrap Session.from_dict in try/except to surface ValueError as a clean CLI error.
* benchmarking/runner/session.py: rename parameter entry_exact_name to entries_exact (list[str]); add strict validation that every requested name matches a configured entry; error message lists both missing and available names.
* benchmarking/tools/ci_benchmark_launcher.sh: rename flag in the CI per-job invocation; ENTRY_NAME is a single name today so this works as a single-element list.

Interactive --entries semantics unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
ed1855e to 7b35169
In nemo-ci pipeline 53092974 / leaf 53093663, exact_dedup_identification was SLURM-killed by TIME LIMIT at 68% (516/755) into the "Inserting into shuffler" phase. The shuffler ran at ~1.83 it/s on EOS H100, so the shuffler phase alone needs ~410s, plus Ray cluster setup, dataset stat, post-shuffle dedup compute, and cleanup: total wall ~800-1000s, not fitting in a 500s budget.

The previous 500s value was set when the test was effectively a no-op (the rpv2 dataset was unreadable due to filesystem permissions, so the test failed fast on stat() before doing any real work; see PR description for the rpv2 access fix story). 500s was also reportedly sufficient on a faster system; EOS may simply be slower for this workload. Worth investigating after the test is unblocked.

Bump to 1500s (25 min):
- ~50% headroom over the estimated 1000s realistic wall on EOS
- Parallel in spirit to the ndd_ray_serve_dp4 700 -> 1200 bump earlier in this PR

Reference failed job: https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/329434352

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
The math benchmarks (math_preprocess, math_preprocess_classifier, math_preprocess_llm_cleanup) shell out to the lynx text browser via nemo_curator/stages/math/download/html_extractors/lynx.py for HTML extraction. lynx is not present in the Curator benchmark container, so those benchmarks currently fail with:

RuntimeError: lynx executable not found in PATH

lynx is GPL-licensed, so we deliberately do not bake it into the redistributable Curator image. Instead it is installed transiently in the existing benchmark container at CI run time. The image we publish stays GPL-free; the apt-installed lynx lives only for the lifetime of the CI container.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
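The failure mode described in this commit is a missing executable on PATH. A minimal sketch of that kind of pre-flight check follows, using only the standard library. The helper name `require_executable` is hypothetical and chosen for illustration; the actual check in `lynx.py` is not shown in this PR and may differ.

```python
import shutil


def require_executable(name: str) -> str:
    """Return the resolved path of `name`, or raise if it is not on PATH.

    Hypothetical helper for illustration only; it mirrors the error text
    quoted in the commit message, not the real lynx.py implementation.
    """
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(f"{name} executable not found in PATH")
    return path
```

Calling `require_executable("lynx")` before the first extraction would surface the missing dependency once, up front, rather than per document.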
Adds `delete_scratch: false` to the fuzzy_dedup_identification entry so
its scratch directory (under session_entry_dir/scratch/{cache,output})
is retained after the entry finishes. The downstream dedup_removal_*
benchmarks read these artifacts at known paths, so the prior default
session-level cleanup (delete_scratch: true) was wiping them out
before they could be consumed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
Both domain_classification_raydata and domain_classification_xenna have a requirement on domain_label_games_count with exact_value: 149816. Run-to-run output of the classifier drifts by several hundred to a few thousand classifications, causing repeated false failures while the actual benchmark (throughput, total docs, number of domains predicted) is healthy.

Loosens the metric to a +/- 5% range (142325 .. 157307) which still catches genuine regressions of the classifier output while tolerating normal run-to-run variability.

domain_label_news_count is left at exact_value: 2817 pending further investigation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
…rk_env_check Signed-off-by: rlratzel <rratzel@nvidia.com>
Both dedup_removal entries had timeout_s values that fit when the test was effectively a no-op (it failed fast on stale fuzzy_id_generator artifacts before doing real work). Once the upstream fuzzy_id_generator data was refreshed, the actual benchmarks ran for real:
* dedup_removal_raydata observed wall ~1419s (92% complete when killed at the prior 1100s ceiling) -> 1800s (20% headroom over the estimated 1500s full wall, ~27% over the observed 1419s)
* dedup_removal_xenna passed at 1523s wall with the prior 1500s ceiling (no margin) -> 1800s as a preemptive bump to match raydata and absorb run-to-run variance

Reference: nemo-ci pipeline 53333039 / leaf 53333328. raydata was killed by SLURM TIME LIMIT during normal pipeline execution, while xenna finished successfully but with zero headroom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
The earlier delete_scratch: false override on fuzzy_dedup_identification
was added so the entry's scratch/output artifacts could be picked up
by the downstream dedup_removal_* benchmarks within the same pipeline.
In practice the artifacts are consumed via the canonical dataset path:
{datasets_path}/cleaned_exact_dedup_all_cc_fuzzy_output_nightly_container_paths/
fuzzy_id_generator.json
FuzzyDuplicateIds/
Operators promote a known-good fuzzy_dedup output to this path once,
and dedup_removal_raydata / dedup_removal_xenna consume it from there
on every subsequent pipeline (no same-pipeline dependency). With that
workflow in place, the per-entry delete_scratch override is no longer
needed and just leaves unused data on lustre across runs. Revert to
the session-level default (delete_scratch: true).
This reverts commit 6369150.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
For cross-host benchmark comparisons (e.g. A100 vs H100) the host's /dev/shm size differs (~550 GB vs ~1 TB) and the container inherits the host default. Provide an opt-in env var to remount /dev/shm at a chosen size. Unset preserves prior behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
Background `nvidia-smi` poller writes one CSV per entry covering all GPUs on the node, independent of CUDA_VISIBLE_DEVICES. Lets us verify post-run whether Ray/Xenna actually honored the visible-device mask (any nonzero util on masked indices ⇒ leakage). Unset → no polling, preserving prior behavior. Subprocess is killed on EXIT via trap so a python crash doesn't leave it orphaned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
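The post-run leakage check described here could look roughly like the sketch below. It assumes the poller's CSV rows are `index, utilization.gpu` pairs (as `nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits` would emit) and that CUDA_VISIBLE_DEVICES holds integer indices rather than UUIDs. The actual CSV columns written by the launcher are not shown in this PR, and `find_leaked_gpus` is a hypothetical name.

```python
import csv
import io


def find_leaked_gpus(csv_text: str, visible_devices: str) -> dict[int, int]:
    """Map each masked GPU index to the peak utilization (%) seen on it.

    Assumes rows of the form "index, utilization" and an integer-index
    CUDA_VISIBLE_DEVICES string such as "0,1"; both are assumptions.
    """
    visible = {int(tok) for tok in visible_devices.split(",") if tok.strip()}
    peaks: dict[int, int] = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        index, util = int(row[0]), int(row[1])
        if index not in visible and util > 0:
            peaks[index] = max(util, peaks.get(index, 0))
    return peaks


# Two polling samples on a 4-GPU node where only GPUs 0 and 1 are visible.
samples = "0, 87\n1, 0\n2, 12\n3, 0\n0, 91\n2, 5\n"
print(find_leaked_gpus(samples, "0,1"))  # {2: 12}
```

An empty result means no masked GPU showed nonzero utilization in the sampled window; it does not rule out activity between samples.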
The benchmarking container cap of 1 TiB on the A100 host shrinks the container's visible memory to ~50% of host and shm to ~25%, which makes env.json report 1024 GiB / 512 GiB even though the host has ~2 TiB / ~1008 GiB. EOS reports the host values directly, so the A100 vs EOS comparison shows an artificial environmental mismatch. Raising the cap to 2 TiB lets A100's container see the host's full memory, matching EOS.

Also remove the CURATOR_SHM_SIZE_BYTES env-var block from ci_benchmark_launcher.sh: pyxis-on-EOS does not grant CAP_SYS_ADMIN, so the remount silently fell back to the WARNING branch and never applied. With A100 raised to match EOS, the toggle is no longer needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: rlratzel <rratzel@nvidia.com>
Summary
Bundles eight independent fixes uncovered while triaging the Curator
nightly benchmark suite against `main` (reference nemo-ci pipelines 52840568,
53098506, 53323567, and 53333039).
Pass-rate progression: 23/37 → 28/37 → 6/8 (scoped) → 1/2 (scoped) → expected
2/2 with the latest commit.
1. Make undefined env vars in config non-fatal by default (b737202d)
`resolve_env_vars` previously raised `ValueError` on the first undefined `${VAR}` reference, halting the entire benchmark session even when the missing var was used by only a few entries.
- The default is now to substitute an empty string and log a warning, so unrelated entries can still run.
- Adds a `--strict-config-check` CLI flag to `run.py` to restore the old fail-fast behavior.
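The lenient-by-default behavior can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: the function name `resolve_env_vars_sketch`, the `strict` and `env` parameters, and the exact `${VAR}` regex are all hypothetical.

```python
import logging
import os
import re

logger = logging.getLogger(__name__)

# Matches ${NAME} references; the exact syntax accepted by the real
# resolve_env_vars is an assumption here.
_VAR_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_vars_sketch(value: str, strict: bool = False, env=None) -> str:
    """Substitute ${VAR} references in `value`.

    strict=False: undefined vars become "" and a warning is logged.
    strict=True:  the first undefined var raises ValueError (old behavior).
    """
    env = os.environ if env is None else env

    def _substitute(match: re.Match) -> str:
        name = match.group(1)
        if name in env:
            return env[name]
        if strict:
            raise ValueError(f"Environment variable {name!r} is not defined")
        logger.warning("Undefined env var %s; substituting empty string", name)
        return ""

    return _VAR_PATTERN.sub(_substitute, value)
```

With `strict=False`, a config value such as `${HF_SECRET_KEY}` resolves to an empty string, so only the entries that actually need the secret fail later, instead of the whole session aborting at load time.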
2. New `--entries-exact` flag, exact-match entry filtering (ab1d4385, e8c6a52f, 7b351693)
Fixes a substring-aliasing bug in CI per-job invocations:
- `--entries` uses pytest's `-k` substring expression evaluator, so `--entries audio_tagging_tts_xenna` also matches `audio_tagging_tts_xenna_repeat`. In CI, where each per-job script runs `--entries <entry-name>`, this caused the non-`_repeat` job to also execute the `_repeat` entry, polluting that entry's per-entry results dir and crashing the subsequent legitimate `_repeat` SLURM job with `Capture file ... already exists` at Ray cluster setup.
- Adds `--entries-exact`, accepting a comma-separated list of exact entry names. Every supplied name must match a configured (enabled) entry, otherwise the run aborts with a `ValueError` listing the unknown names alongside the available entry names.
- Mutually exclusive with `--entries` (CLI and `Session.from_dict` both enforce this).
- `benchmarking/tools/ci_benchmark_launcher.sh` switched from `--entries` to `--entries-exact`. `--entries` substring expression semantics are unchanged.

3. Bump too-tight `timeout_s` values for several entries
Multiple `nightly-benchmark.yaml` entries had `timeout_s` values that didn't cover the actual wall time on the EOS H100 runner. These split into two categories:
- The ceiling covered the benchmark itself but left no headroom for Ray teardown or cold model load (3.1).
- The ceiling was set while the test was effectively a no-op (e.g. rpv2 data was unreadable, or fuzzy_id_generator artifacts were stale and the test failed fast); once the underlying issue was fixed and the test ran for real, the historical ceiling was too tight (3.2, 3.3).
3.1 `ndd_ray_serve_dp4`: 700 → 1200 (ab1d4385)
- SLURM-killed at the `--time` ceiling ~13s after its benchmark subprocess succeeded; Ray teardown overhead pushed total wall over the limit.
- Reference failed job: https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/327777542
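For reference, the `timeout_s` to SLURM `--time` mapping mentioned in the commit message (700 → 00:11:40) is a plain seconds-to-HH:MM:SS conversion. The sketch below assumes the launcher passes the value through without extra padding, which is what the quoted 00:11:40 suggests; the function name is hypothetical.

```python
def seconds_to_slurm_time(timeout_s: int) -> str:
    """Format a timeout in seconds as HH:MM:SS for SLURM's --time option."""
    hours, remainder = divmod(timeout_s, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


for timeout_s in (700, 1200, 2700):
    print(timeout_s, seconds_to_slurm_time(timeout_s))
# 700 00:11:40
# 1200 00:20:00
# 2700 00:45:00
```

A job that reaches ~11:30 wall under a 00:11:40 limit has about ten seconds left for teardown, which matches the kill observed here.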
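3.2 `exact_dedup_identification`: 500 → 1500 (21647b4f)
- The previous 500s value was set when the test was effectively a no-op (the rpv2 dataset was unreadable due to filesystem permissions, so the test failed fast on `stat()`). Once rpv2 access was restored, the test was SLURM-killed at 68% (516/755) of the "Inserting into shuffler" phase.
- The shuffler ran at ~1.83 it/s on EOS H100, so the shuffler phase alone needs ~410s; adding Ray cluster setup + dataset stat + post-shuffle dedup compute + cleanup gives a realistic wall of 800-1000s.
- Reference failed job: https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/jobs/329434352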
3.3 `dedup_removal_*`: 1100/1500 → 1800/1800 (c2f3a6b0)
Once the upstream `fuzzy_id_generator` artifacts were refreshed (see section 5, the `delete_scratch: false` fix), the dedup_removal tests actually ran the workload for the first time:
- `dedup_removal_raydata`: observed wall ~1419s (92% complete when killed at the prior 1100s ceiling) → 1800s (20% headroom over the estimated 1500s full wall, ~27% over the observed 1419s).
- `dedup_removal_xenna`: passed at 1523s wall with the prior 1500s ceiling (zero margin) → 1800s as a preemptive bump to match raydata and absorb run-to-run variance.

Reference: nemo-ci pipeline 53333039 / leaf 53333328. raydata was killed by SLURM TIME LIMIT during normal pipeline execution, while xenna finished with zero headroom.
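The headroom figures quoted across section 3 can be reproduced with a few lines of arithmetic. The helper names below are hypothetical; the inputs are the numbers stated in this PR.

```python
def phase_seconds(items: int, rate_per_s: float) -> float:
    """Estimated duration of a phase processing `items` at `rate_per_s`."""
    return items / rate_per_s


def headroom_pct(timeout_s: float, wall_s: float) -> float:
    """Headroom of `timeout_s` over `wall_s`, as a percentage of `wall_s`."""
    return (timeout_s / wall_s - 1.0) * 100.0


# 3.2: shuffler phase, 755 items at ~1.83 it/s
print(round(phase_seconds(755, 1.83)))   # 413

# New ceiling vs. the reference wall time for each bump
print(round(headroom_pct(1200, 700)))    # 71  (3.1, observed warm-run wall)
print(round(headroom_pct(1500, 1000)))   # 50  (3.2, estimated wall)
print(round(headroom_pct(1800, 1500)))   # 20  (3.3, estimated full wall)
print(round(headroom_pct(1800, 1419)))   # 27  (3.3, observed wall at kill)
```

For 3.3 the two reference points give different percentages, so it matters whether headroom is stated against the observed 1419s or the estimated 1500s.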
4. Install `lynx` in the CI benchmark launcher (d0eac901)
The math benchmarks (`math_preprocess`, `math_preprocess_classifier`, `math_preprocess_llm_cleanup`) shell out to the `lynx` text browser via `nemo_curator/stages/math/download/html_extractors/lynx.py` for HTML extraction. `lynx` is not in the Curator container, so those benchmarks fail with `RuntimeError: lynx executable not found in PATH`.
`lynx` is GPL-licensed, so we deliberately do not bake it into the redistributable Curator image. Instead it is installed transiently in the existing benchmark container at CI run time, used during the run, and discarded with the container. The published image stays GPL-free; the apt-installed `lynx` only lives for the lifetime of each CI container.

5. (Reverted) Preserve scratch dir for `fuzzy_dedup_identification` (63691509 then a3c38ac6)
Initial commit 63691509 added `delete_scratch: false` to the `fuzzy_dedup_identification` entry to keep its scratch artifacts around for the downstream `dedup_removal_*` benchmarks. In practice the consumption path is the canonical dataset path:
{datasets_path}/cleaned_exact_dedup_all_cc_fuzzy_output_nightly_container_paths/
  fuzzy_id_generator.json
  FuzzyDuplicateIds/
Operators promote a known-good fuzzy_dedup output to this path once, and
the dedup_removal entries consume it from there on every subsequent
pipeline — there is no same-pipeline dependency. Given that workflow, the
per-entry `delete_scratch: false` override is unnecessary and just leaves unused data on lustre across runs. Reverted by a3c38ac6; the entry now uses the session-level default (`delete_scratch: true`).

6. Range `domain_label_games_count` metric +/-5% (0c375b7e)
Both `domain_classification_raydata` and `domain_classification_xenna` had a requirement on `domain_label_games_count` with `exact_value: 149816`. Run-to-run output of the classifier drifts by several hundred to a few thousand classifications, causing repeated false failures while the actual benchmark (throughput, total docs, number of domains predicted) is healthy.
Loosens the metric to a +/-5% range (`min_value: 142325`, `max_value: 157307`), which still catches genuine regressions of the classifier output while tolerating normal run-to-run variability. `domain_label_news_count` is left at `exact_value: 2817` pending further investigation.
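The `min_value` / `max_value` bounds follow directly from the former exact value. A small sketch of that derivation, with hypothetical helper names (the benchmark framework's own requirement checker is not shown in this PR):

```python
def tolerance_range(expected: int, pct: float) -> tuple[int, int]:
    """Return (min_value, max_value) for `expected` +/- `pct` percent,
    rounded to the nearest integer."""
    delta = expected * pct / 100.0
    return round(expected - delta), round(expected + delta)


def within_range(value: int, bounds: tuple[int, int]) -> bool:
    """True if `value` lies inside the inclusive bounds."""
    low, high = bounds
    return low <= value <= high


bounds = tolerance_range(149816, 5)
print(bounds)                         # (142325, 157307)
print(within_range(151200, bounds))   # True: drift of a few thousand passes
print(within_range(120000, bounds))   # False: a large drop still fails
```

The sample values 151200 and 120000 are made up for illustration; only 149816 and the resulting bounds come from the PR.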
Verification
Across multiple nemo-ci pipelines:
- First verification of commits 1, 2, 3.1, 3.2 (and #4 below): 28/37 pass (up from 23/37 baseline). Confirmed `audio_tagging_tts_xenna_repeat` works (`--entries-exact`), `ndd_ray_serve_dp4` works (timeout bump), `exact_dedup_identification` works (rpv2 + timeout), `audio_tagging_tts_xenna` works without `HF_SECRET_KEY` set.
- Scoped 8-entry run of commits 4, 5, 6: 6/8 pass (`math_preprocess*` all pass thanks to lynx install; `domain_classification_*` pass with the metric range; `fuzzy_dedup_identification` passes). The two failures (`dedup_removal_*`) were caused by a same-pipeline race condition with fuzzy_dedup_identification, fixed by operators manually copying the fresh `fuzzy_id_generator.json` artifacts to the canonical dataset path between runs.
- Scoped 2-entry run with fresh artifacts in place: `dedup_removal_xenna` passes (1523s, no margin); `dedup_removal_raydata` SLURM-killed by TIME LIMIT at 92% completion → addressed by commit c2f3a6b0 (3.3).

A fresh pipeline against the latest tip (which includes c2f3a6b0) is required to confirm both dedup_removal entries pass cleanly.
Test plan
Undefined env vars
- `audio_tagging_tts_xenna` ran successfully in 53098506 with `HF_SECRET_KEY` unset.
- Run with `--strict-config-check`; confirm it still exits with the original `ValueError`.
- … behavior change.

--entries-exact
- … and 53323567; no cross-entry pollution observed.
- `--entries-exact <typo>` exits with an error listing the unknown name and the available entry names.
- `--entries-exact a,b,c` runs only those three entries in YAML order.
- Passing both `--entries` and `--entries-exact` exits with a "mutually exclusive" error.
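The behaviors checked above (exact match only, unknown names are a hard error, duplicates collapsed, YAML order preserved) can be sketched as below. This is an illustration, not the code in `run.py` or `session.py`; the function names and error wording are hypothetical.

```python
def parse_entries_exact(raw: str) -> list[str]:
    """Split a comma-separated --entries-exact value into entry names."""
    names = [part.strip() for part in raw.split(",")]
    names = [name for name in names if name]
    if not names:
        raise ValueError("--entries-exact requires at least one entry name")
    return names


def select_entries_exact(configured: list[str], requested: list[str]) -> list[str]:
    """Return the configured entries named in `requested`, in config order.

    Raises ValueError listing unknown names alongside the available ones.
    """
    wanted = set(requested)  # collapses duplicates
    unknown = sorted(wanted - set(configured))
    if unknown:
        raise ValueError(f"Unknown entries: {unknown}; available: {configured}")
    return [name for name in configured if name in wanted]


configured = [
    "audio_tagging_tts_xenna",
    "audio_tagging_tts_xenna_repeat",
    "ndd_ray_serve_dp4",
]

# Substring matching (the --entries pitfall) selects the sibling too.
substring_hits = [n for n in configured if "audio_tagging_tts_xenna" in n]
print(substring_hits)
# ['audio_tagging_tts_xenna', 'audio_tagging_tts_xenna_repeat']

# Exact matching selects only the requested entry.
exact_hits = select_entries_exact(
    configured, parse_entries_exact("audio_tagging_tts_xenna")
)
print(exact_hits)
# ['audio_tagging_tts_xenna']
```

Validating against the full configured list before filtering is what turns a typo into an error instead of an empty run.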
Timeout bumps
- `ndd_ray_serve_dp4` finished without TIME LIMIT cancellation in 53098506.
- `exact_dedup_identification` finished without TIME LIMIT in 53098506.
- After c2f3a6b0 lands or in a follow-up run, confirm `dedup_removal_raydata` and `dedup_removal_xenna` both finish without TIME LIMIT cancellation. (Both passed in 53398187.)

lynx install
- `math_preprocess*` entries all passed in 53323567; confirmed `RuntimeError: lynx executable not found` no longer fires.

fuzzy_dedup scratch retention (now reverted)
- `fuzzy_id_generator.json` + `FuzzyDuplicateIds/` written to the canonical dataset path (`{datasets_path}/cleaned_exact_dedup_all_cc_fuzzy_output_nightly_container_paths/`).
- `dedup_removal_*` consume the promoted artifacts in a clean pipeline (verified in 53398187).
- `delete_scratch: false` override on `fuzzy_dedup_identification` is no longer needed and has been reverted (commit a3c38ac6).

domain_label_games_count range
- `domain_classification_*` both passed in 53323567 with the new range satisfied.
🤖 Generated with Claude Code