caption quality evaluation by weijiac0619 · Pull Request #1980 · NVIDIA-NeMo/Curator

weijiac0619 · 2026-05-13T20:09:59Z

Description

Usage

# Add snippet demonstrating usage

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2026-05-13T20:10:03Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-05-13T20:20:22Z

Greptile Summary

This PR introduces a new eval/video/ module for evaluating video caption quality using a "Summarize-then-Align" approach: clips are compressed by an LLM summarizer and then scored against CosmosEmbed1 video embeddings via cosine similarity.

build_benchmark_dataset.py: Samples ~3 000 source videos, runs the NeMo Curator embedding pipeline, applies K-means (K=200) on CosmosEmbed1-224p embeddings, and symlinks one representative clip per cluster into a reusable benchmark directory.
caption_clipscore.py: Loads pre-computed video embeddings, batch-summarizes captions with a vLLM-hosted LLM, encodes summaries with CosmosEmbed1, and writes per-clip cosine-similarity scores to a CSV alongside per-model mean statistics.
README.md: Documents the full four-step workflow (build dataset → generate captions → score → evaluate new model) with CLI examples and expected baseline scores.

Confidence Score: 3/5

The two new scripts work end-to-end for the happy path but have correctness gaps that can silently produce wrong benchmark composition and wrong scores.

The fallback branch in _kmeans_select never records the selected clip's source video in used_sources, so a later cluster can legitimately pick a clip from the same source, violating the one-per-source-video invariant without any error or warning. Combined with the previously noted zero-norm division in _cosine_sim and the basename-collision bug in _build_output, there are multiple independent paths that produce a quietly incorrect benchmark or incorrect scores.

eval/video/build_benchmark_dataset.py (_kmeans_select fallback logic) and eval/video/caption_clipscore.py (_cosine_sim zero-norm guard)
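The zero-norm concern in _cosine_sim can be illustrated with a minimal sketch. It uses plain Python lists rather than the torch tensors the script works with, and the function name, the eps parameter, and the choice to return NaN are all assumptions made for illustration, not the PR's implementation.

```python
import math


def cosine_sim(a, b, eps=1e-12):
    """Cosine similarity with a zero-norm guard (hypothetical helper).

    Returns NaN instead of dividing by zero when either vector has
    (near-)zero length, so a degenerate embedding is visible downstream.
    """
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a < eps or norm_b < eps:
        return float("nan")  # undefined similarity, flagged explicitly
    return dot / (norm_a * norm_b)
```

Returning NaN makes a per-model mean turn into NaN as well, which is loud by design; a caller that prefers to skip such clips could filter them with math.isnan before averaging.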

Important Files Changed

| Filename | Overview |
| --- | --- |
| eval/__init__.py | Empty init file; marks the eval/ directory as a Python package. |
| eval/video/__init__.py | Empty init file; marks eval/video/ as a Python package. |
| eval/video/README.md | Documentation for the caption quality evaluation workflow; covers dataset construction, regression workflow, and CLI usage examples. |
| eval/video/build_benchmark_dataset.py | New script that builds a 200-clip benchmark via K-means on CosmosEmbed1 embeddings; the fallback path in _kmeans_select omits the selected clip's source from used_sources, allowing duplicate-source clips to appear in the benchmark. |
| eval/video/caption_clipscore.py | New script that evaluates caption quality via LLM summarization + CosmosEmbed1 cosine similarity; empty captions produce meaningless scores silently, and _cosine_sim lacks a zero-norm guard (already flagged in a prior review outside the diff). |

Sequence Diagram

sequenceDiagram
    participant U as User
    participant B as build_benchmark_dataset.py
    participant P as NeMo Curator Pipeline
    participant KM as KMeans (_kmeans_select)
    participant C as caption_clipscore.py
    participant LLM as vLLM Summarizer
    participant CE as CosmosEmbed1

    U->>B: --video-dir, --output-dir, --num-clusters
    B->>B: _sample_videos() → 3000 mp4 symlinks
    B->>P: _run_embedding_pipeline()
    P-->>B: "ce1_embd/*.pickle, clips/*.mp4, metas/v0/*.json"
    B->>KM: "_kmeans_select(emb_dir, meta_dir, K=200)"
    KM-->>B: selected[(uid, meta)] x 200
    B->>B: _build_output() → symlinks + selected_uids.txt
    U->>C: --embedding-dir, --caption-dirs, --uid-list
    C->>C: _load_uid_list() → common_uuids
    C->>C: _collect_tasks() → (uid, label, caption) tuples
    alt No cached summaries
        C->>LLM: _summarize_captions(tasks)
        LLM-->>C: summaries[]
    else --load-summaries
        C->>C: Load from JSON cache
    end
    C->>CE: _score_summaries() → get_text_embedding per summary
    CE-->>C: text embeddings
    C->>C: _cosine_sim(vid_emb, text_emb) per clip
    C-->>U: results.csv + per-model mean scores

Reviews (5): Last reviewed commit: "add CosmosEmbed1 variant descriptions"

greptile-apps · 2026-05-13T20:20:26Z

+        # Symlink embedding
+        src = f"{emb_dir}/{uid}.pickle"
+        dst = f"{output_dir}/ce1_embd/{uid}.pickle"
+        if os.path.exists(src) and not os.path.exists(dst):
+            os.symlink(os.path.abspath(src), dst)
+
+        # Symlink clip
+        src = f"{clip_dir}/{uid}.mp4"
+        dst = f"{output_dir}/clips/{uid}.mp4"
+        if os.path.exists(src) and not os.path.exists(dst):
+            os.symlink(os.path.abspath(src), dst)
+
+        # Symlink source video
+        src_video = meta.get("source_video", "")
+        if src_video and os.path.exists(src_video):
+            vid_name = os.path.basename(src_video)
+            dst = f"{output_dir}/input/{vid_name}"
+            if vid_name not in input_videos_linked and not os.path.exists(dst):
+                os.symlink(os.path.abspath(src_video), dst)
+                input_videos_linked.add(vid_name)


os.path.exists() returns False for broken symlinks, so if a prior run left a broken symlink at dst (e.g. the pipeline output was later cleaned up), the guard passes and os.symlink() immediately raises FileExistsError: [Errno 17] File exists. This affects all three symlink creation blocks (lines 211, 217, 225) and would crash a restart of the script. Replace os.path.exists(dst) with os.path.lexists(dst), which returns True for both valid and broken symlinks.

Suggested change

         # Symlink embedding
         src = f"{emb_dir}/{uid}.pickle"
         dst = f"{output_dir}/ce1_embd/{uid}.pickle"
-        if os.path.exists(src) and not os.path.exists(dst):
+        if os.path.exists(src) and not os.path.lexists(dst):
             os.symlink(os.path.abspath(src), dst)

         # Symlink clip
         src = f"{clip_dir}/{uid}.mp4"
         dst = f"{output_dir}/clips/{uid}.mp4"
-        if os.path.exists(src) and not os.path.exists(dst):
+        if os.path.exists(src) and not os.path.lexists(dst):
             os.symlink(os.path.abspath(src), dst)

         # Symlink source video
         src_video = meta.get("source_video", "")
         if src_video and os.path.exists(src_video):
             vid_name = os.path.basename(src_video)
             dst = f"{output_dir}/input/{vid_name}"
-            if vid_name not in input_videos_linked and not os.path.exists(dst):
+            if vid_name not in input_videos_linked and not os.path.lexists(dst):
                 os.symlink(os.path.abspath(src_video), dst)
                 input_videos_linked.add(vid_name)
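Since the same guard is repeated in several places, one option is to factor it into a small helper. The sketch below is only an illustration: ensure_symlink and its replace_broken flag are hypothetical names that do not appear in the PR, and replacing a stale link goes one step beyond what the suggestion above proposes.

```python
import os


def ensure_symlink(src: str, dst: str, replace_broken: bool = False) -> bool:
    """Create dst -> src if dst is free. Returns True when a link was created.

    os.path.lexists() is True for valid AND broken symlinks, so os.symlink()
    is never called on an occupied path (which would raise FileExistsError).
    """
    if not os.path.exists(src):
        return False  # nothing to link to
    if os.path.lexists(dst):
        is_broken_link = os.path.islink(dst) and not os.path.exists(dst)
        if not (replace_broken and is_broken_link):
            return False  # leave whatever is already there untouched
        os.unlink(dst)  # drop the stale link left by an earlier run
    os.symlink(os.path.abspath(src), dst)
    return True
```

With replace_broken left at its default, the helper behaves like the suggested change: a restart no longer crashes, but a broken link from a previous run stays broken. Passing replace_broken=True refreshes such links instead.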

greptile-apps · 2026-05-13T20:20:27Z

+        if not os.path.exists(dst):
+            os.symlink(os.path.abspath(src), dst)


The same broken-symlink problem is present in main(): if dst is a broken symlink from a prior run, os.path.exists(dst) returns False and os.symlink() raises FileExistsError. Use os.path.lexists(dst) here too.

Suggested change

-        if not os.path.exists(dst):
+        if not os.path.lexists(dst):
             os.symlink(os.path.abspath(src), dst)

greptile-apps · 2026-05-13T20:20:28Z

+        print(f"\nLoading cached summaries from: {load_summaries}")
+        with open(load_summaries) as f:
+            summary_cache = json.load(f)
+        summaries = [summary_cache.get(uid, {}).get(label, "") for uid, label, _ in tasks]


When --load-summaries is used and a (uid, label) pair is absent from the cache (e.g. when scoring a new model against an old cache that only contains other models), the summary silently becomes "". That empty string is later passed to model.get_text_embedding(""), which produces an arbitrary or near-zero embedding, and the resulting cosine-similarity score is silently wrong. A warning when the summary is missing will surface this immediately rather than producing a quietly incorrect result.

Suggested change

-        summaries = [summary_cache.get(uid, {}).get(label, "") for uid, label, _ in tasks]
+        summaries = []
+        missing = 0
+        for uid, label, _ in tasks:
+            s = summary_cache.get(uid, {}).get(label, "")
+            if not s:
+                missing += 1
+            summaries.append(s)
+        if missing:
+            print(f"  Warning: {missing} (uid, label) pairs have no cached summary and will produce invalid scores.")
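A slightly stricter variant is to return the missing pairs to the caller, so that scoring can skip or reject them instead of only printing a count. This is a sketch under assumptions: load_cached_summaries is a hypothetical name, and the cache is assumed to be the {uid: {label: summary}} JSON mapping implied by the code above.

```python
import json


def load_cached_summaries(cache_path, tasks):
    """Return (summaries, missing) for a list of (uid, label, caption) tasks.

    summaries is aligned with tasks; missing lists every (uid, label)
    pair whose summary is absent or empty in the cache.
    """
    with open(cache_path) as f:
        cache = json.load(f)
    summaries, missing = [], []
    for uid, label, _caption in tasks:
        summary = cache.get(uid, {}).get(label, "")
        if not summary:
            missing.append((uid, label))
        summaries.append(summary)
    return summaries, missing
```

A caller could then raise when missing is non-empty, or drop those tasks before scoring, depending on whether partial results are acceptable.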

greptile-apps · 2026-05-13T20:20:29Z

+    for i, (uid, label, _caption) in enumerate(tqdm(tasks, unit="cap")):
+        with open(f"{embedding_dir}/{uid}.pickle", "rb") as f:
+            arr = pickle.load(f)  # noqa: S301
+        vid_emb = torch.from_numpy(arr).squeeze(0)
+        text_emb = model.get_text_embedding(summaries[i]).squeeze(0)
+        clip_scores.setdefault(uid, {})[label] = _cosine_sim(vid_emb, text_emb)


Redundant embedding loads per model label

The loop opens and unpickles {uid}.pickle once per task, meaning the same file is read n_models times for every clip (e.g. 3x for 3 models). With 200 clips and 3 models that is 600 reads instead of 200. Consider caching the video embedding per uid inside the loop to avoid the repeated I/O.
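One way to avoid the repeated reads is a per-uid cache around the load. The sketch below is an assumption-laden illustration: iter_video_embeddings is a hypothetical name, and it yields whatever object was pickled, leaving the torch.from_numpy conversion from the original loop to the caller.

```python
import pickle
from pathlib import Path


def iter_video_embeddings(embedding_dir, tasks):
    """Yield (uid, label, embedding), reading each {uid}.pickle at most once.

    tasks is a list of (uid, label, caption) tuples, as in the loop above.
    """
    cache = {}
    for uid, label, _caption in tasks:
        if uid not in cache:
            path = Path(embedding_dir) / f"{uid}.pickle"
            with open(path, "rb") as f:
                # Only safe for trusted, locally produced pickle files.
                cache[uid] = pickle.load(f)
        yield uid, label, cache[uid]
```

The cache holds one embedding per clip for the duration of the loop; for a benchmark of a few hundred clips that is small, but it would be worth bounding for much larger runs.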

suiyoubi · 2026-05-27T13:07:24Z

Is it possible to add caption_clipscore.py to the benchmark suites? We can pre-cache the dataset; we just need to make sure we get a consistent clipscore for the same model.

suiyoubi · 2026-05-27T13:30:27Z

+    emb_dir = f"{pipeline_output_dir}/ce1_embd"
+    clip_dir = f"{pipeline_output_dir}/clips"
+
+    os.makedirs(f"{output_dir}/ce1_embd", exist_ok=True)


Do we need to create dirs for these? They should already be created by the pipeline. I would lean towards asserting that they exist (so we make sure the embedding pipeline works).

This script is run once to build the cached benchmark dataset; it is not part of nightly.
Pipeline output dirs are already asserted (lines 181-186). The os.makedirs calls create the benchmark output directory structure. These directories don't exist yet, so they need to be created.

greptile-apps · 2026-05-27T18:17:21Z

+    parser.add_argument(
+        "--split-duration",
+        type=float,
+        default=10.0,
+        help="Fixed-stride split duration in seconds (default: 10.0).",
+    )
+    parser.add_argument(
+        "--aesthetic-threshold",
+        type=float,
+        default=3.5,
+        help="Minimum aesthetic score for clip filtering (default: 3.5).",
+    )
+    parser.add_argument(
+        "--seed",
+        type=int,
+        default=42,
+        help="Random seed for sampling and K-means (default: 42).",
+    )
+    return parser.parse_args()
+


KMeans crashes when pipeline produces fewer clips than num_clusters

np.stack(embeddings) raises ValueError: need at least one array to stack when the embedding directory is empty (e.g., every clip was filtered out by the aesthetic threshold), and KMeans(n_clusters=200).fit_predict(...) raises ValueError: n_samples=N should be >= n_clusters=200 when the aesthetic filter or small sample size yields fewer clips than the requested number of clusters. Both failures produce cryptic scikit-learn errors with no guidance on why they occurred or how to fix them. An explicit guard here would surface these clearly before the expensive K-means step.
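A guard of this kind might look like the following sketch. The function name and the error wording are assumptions; the flag names in the messages are the ones shown in the diff and sequence diagram above.

```python
def check_cluster_inputs(num_embeddings: int, num_clusters: int) -> None:
    """Fail early, with actionable messages, before stacking and clustering."""
    if num_clusters < 1:
        raise ValueError(f"--num-clusters must be >= 1, got {num_clusters}")
    if num_embeddings == 0:
        raise ValueError(
            "No clip embeddings found. Every clip may have been filtered out; "
            "consider lowering --aesthetic-threshold or sampling more videos."
        )
    if num_embeddings < num_clusters:
        raise ValueError(
            f"Only {num_embeddings} clips available for {num_clusters} clusters. "
            "Lower --num-clusters, lower --aesthetic-threshold, or sample more videos."
        )
```

Calling this with the number of loaded embeddings right before np.stack would replace both cryptic errors with messages that name the relevant flags.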

VibhuJawa

Thanks again @weijiac0619 for working on this. I have some suggestions around code organization.

I think we should move this out of tutorials into a new eval subfolder. In that eval folder we should have an eval.video.caption module and place this there.

The other main ask is whether we can use a Hugging Face dataset as input here, so that our customers and we have an open-source dataset we can use.

The other unknown is whether we can have some expected baseline results, or some guidance on what a good vs. bad result looks like, and the ability to check that in an automated way.

greptile-apps · 2026-05-27T19:58:40Z

+
+    # Step 2: Run embedding pipeline
+    logger.info("[Step 2/4] Running embedding pipeline ...")
+    pipeline_output = f"{args.output_dir}/_pipeline_output"
+    _run_embedding_pipeline(
+        input_dir=sample_input_dir,
+        output_dir=pipeline_output,
+        model_dir=args.model_dir,
+        split_duration=args.split_duration,
+        aesthetic_threshold=args.aesthetic_threshold,
+    )
+
+    # Step 3: K-means clustering and selection
+    logger.info("[Step 3/4] K-means clustering ...")
+    selected = _kmeans_select(
+        emb_dir=f"{pipeline_output}/ce1_embd",
+        meta_dir=f"{pipeline_output}/metas/v0",


Empty source_video string is treated as a shared source across all clusters

When meta.get("source_video", "") returns "" (e.g., for a clip whose metadata file was not found or whose JSON has no source_video key), src is "". The first cluster that picks such a clip adds "" to used_sources. Every subsequent cluster then finds "" in used_sources, so all other clips without a source path are treated as "already used" and skipped, as if they all came from one shared source. This can allow multiple clips from the same unlabeled source video to appear in the final benchmark, defeating the diversity constraint.
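A selection rule that avoids this treats an empty source as "unknown" rather than as a shared key. The sketch below is a simplified stand-in for the per-cluster step of _kmeans_select, not the PR's code: pick_representative is a hypothetical name, and candidates is assumed to be a list of (uid, source_video) pairs already ordered nearest-to-centroid first.

```python
def pick_representative(candidates, used_sources):
    """Pick one clip for a cluster. Returns (uid, used_fallback).

    An empty source_video never enters used_sources, so clips with
    unknown sources do not block each other.
    """
    for uid, source in candidates:
        if source and source in used_sources:
            continue  # this known source is already represented
        if source:
            used_sources.add(source)  # record only real source paths
        return uid, False
    if not candidates:
        return None, False
    # Fallback: every candidate has a known source that is already used.
    # Take the nearest clip and report the duplicate so the caller can warn.
    return candidates[0][0], True
```

The returned flag lets the caller log how many clusters had to reuse a source video, which makes a violated one-per-source invariant visible instead of silent.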

weijiac0619 · 2026-06-02T18:16:07Z

> is it possible to add caption_clipscore.py to the benchmark suites ? We can pre-cache the dataset, just need to make sure we have consistent clipscore for the same model

Good idea. I will add caption quality scoring as a standalone benchmark entry in a follow-up. The plan is to wire up the 200-clip dataset, captioning, and scoring as a single entry, and to validate the pass/fail thresholds on nightly.

weijiac0619 · 2026-06-02T18:22:35Z

@VibhuJawa thanks for the comments. The code has moved to eval/video/. The HF dataset upload and automated baseline validation will be follow-ups.

VibhuJawa · 2026-06-02T19:40:31Z

+determine whether it is a real quality regression.
+
+Scores are deterministic across runs on the same machine when using
+`enforce_eager=True` and `CUBLAS_WORKSPACE_CONFIG=:4096:8`.


QQ: Is the understanding that these settings produce exactly the same scores, and that without them we get some variation, but that the variation stays within 5%?

@claude , Can you suggest better phrasing for Baseline Scores ?
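If a tolerance band does turn out to be the right framing, an automated check could be as small as the sketch below. Everything here is an assumption for illustration: find_regressions is a hypothetical name, the inputs are per-model mean scores keyed by model label, and the 5% default is taken from the question above rather than from a confirmed threshold.

```python
def find_regressions(scores, baseline, rel_tol=0.05):
    """Return {label: (baseline_mean, observed_mean)} for regressed labels.

    A label regresses when its observed mean falls more than rel_tol
    (relative) below its baseline mean, or when it is missing from
    scores, in which case observed_mean is None.
    """
    regressions = {}
    for label, expected in baseline.items():
        observed = scores.get(label)
        floor = expected - abs(expected) * rel_tol
        if observed is None or observed < floor:
            regressions[label] = (expected, observed)
    return regressions
```

A nightly job could fail when the returned dict is non-empty and print its contents, which gives reviewers the baseline and observed means side by side.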

sarahyurick

Hi, just familiarizing myself with the PR and left minor comments if they are useful. This is awesome work, thank you!

sarahyurick · 2026-06-02T19:35:08Z

+                uid_to_meta[uid] = json.load(fh)
+
+    logger.info(f"K-means K={num_clusters} ...")
+    kmeans = KMeans(n_clusters=num_clusters, random_state=seed, n_init=10, max_iter=300)


Should these be configurable?

Suggested change

-    kmeans = KMeans(n_clusters=num_clusters, random_state=seed, n_init=10, max_iter=300)
+    kmeans = KMeans(n_clusters=num_clusters, random_state=seed, n_init=n_init, max_iter=max_iter)

Hi @sarahyurick, these follow scikit-learn's defaults: max_iter=300 is the default, and n_init=10 was the default before scikit-learn 1.4 (newer releases default to n_init="auto"). They're explicit in the code so readers know the values at a glance, and so the result does not depend on the installed scikit-learn version.

sarahyurick · 2026-06-02T19:39:34Z

+    """Encode summaries with CosmosEmbed1 and compute per-clip scores."""
+    logger.info(f"Loading CosmosEmbed1-{variant} from {cosmos_model_dir} ...")
+    model = CosmosEmbed1(variant=variant, utils_only=False, model_dir=cosmos_model_dir)
+    model.setup()


Could it be worth doing any of this in a Curator pipeline or no?

This is a standalone evaluation tool rather than a data processing step. I'm worried that wrapping it in a Curator pipeline would add overhead.

VibhuJawa · 2026-06-02T19:47:01Z

+_REPO_ROOT = Path(__file__).parent.parent.parent
+sys.path.insert(0, str(_REPO_ROOT / "tutorials" / "video" / "getting-started"))
+
+from video_split_clip_example import (  # noqa: E402
+    create_video_splitting_argparser,
+    create_video_splitting_pipeline,
+)
+


@suiyoubi / @weijiac0619, should we move or copy this code to eval or some other place?

It feels generally useful: other evals will use it, and the benchmarks use it too.

The getting-started tutorial feels like an awkward place to pull these functions from.

Agreed, but video_split_clip_example.py is the existing pipeline entry point. What do you think about the folder structure? @suiyoubi

weijiac0619 requested a review from a team as a code owner May 13, 2026 20:10

weijiac0619 requested review from VibhuJawa and removed request for a team May 13, 2026 20:10

weijiac0619 requested review from abhinavg4 and suiyoubi May 13, 2026 20:12

greptile-apps Bot reviewed May 13, 2026

weijiac0619 changed the title from "eval" to "caption quality evaluation" May 13, 2026

suiyoubi requested changes May 27, 2026

Weijia Chen added 2 commits May 27, 2026 09:31
    daf1a9c eval (Signed-off-by: Weijia Chen <weijiac@dgx-a100-01.aselab.nvidia.com>)
    7671871 refactpr (Signed-off-by: Weijia Chen <weijiac@dgx-a100-01.aselab.nvidia.com>)

weijiac0619 force-pushed the weijia/video_eval branch from 763910d to 7671871 May 27, 2026 18:13

greptile-apps Bot reviewed May 27, 2026

VibhuJawa requested changes May 27, 2026 (comment thread on eval/video/build_benchmark_dataset.py)

    20d8347 move eval (Signed-off-by: Weijia Chen <weijiac@dgx-a100-01.aselab.nvidia.com>)

greptile-apps Bot reviewed May 27, 2026

    bfdb9b7 fix (Signed-off-by: Weijia Chen <weijiac@dgx-a100-01.aselab.nvidia.com>)

VibhuJawa reviewed Jun 2, 2026 (three comment threads on eval/video/README.md, one outdated)

weijiac0619 added 2 commits June 2, 2026 18:51
    d486bde update links (Signed-off-by: Weijia Chen <weijiac@smc-522ga-0059.aselab.nvidia.com>)
    5aff0b3 add CosmosEmbed1 variant descriptions (Signed-off-by: Weijia Chen <weijiac@nvidia.com>)

VibhuJawa reviewed Jun 2, 2026

sarahyurick reviewed Jun 2, 2026

VibhuJawa reviewed Jun 2, 2026
