Skip to content

caption quality evaluation#1980

Open
weijiac0619 wants to merge 6 commits into
mainfrom
weijia/video_eval
Open

caption quality evaluation#1980
weijiac0619 wants to merge 6 commits into
mainfrom
weijia/video_eval

Conversation

@weijiac0619
Copy link
Copy Markdown
Contributor

@weijiac0619 weijiac0619 commented May 13, 2026

Description

doc link

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@weijiac0619 weijiac0619 requested a review from a team as a code owner May 13, 2026 20:10
@weijiac0619 weijiac0619 requested review from VibhuJawa and removed request for a team May 13, 2026 20:10
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@weijiac0619 weijiac0619 requested review from abhinavg4 and suiyoubi May 13, 2026 20:12
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 13, 2026

Greptile Summary

This PR introduces a new eval/video/ module for evaluating video caption quality using a "Summarize-then-Align" approach: clips are compressed by an LLM summarizer and then scored against CosmosEmbed1 video embeddings via cosine similarity.

  • build_benchmark_dataset.py: Samples ~3 000 source videos, runs the NeMo Curator embedding pipeline, applies K-means (K=200) on CosmosEmbed1-224p embeddings, and symlinks one representative clip per cluster into a reusable benchmark directory.
  • caption_clipscore.py: Loads pre-computed video embeddings, batch-summarizes captions with a vLLM-hosted LLM, encodes summaries with CosmosEmbed1, and writes per-clip cosine-similarity scores to a CSV alongside per-model mean statistics.
  • README.md: Documents the full four-step workflow (build dataset → generate captions → score → evaluate new model) with CLI examples and expected baseline scores.

Confidence Score: 3/5

The two new scripts work end-to-end for the happy path but have correctness gaps that can silently produce wrong benchmark composition and wrong scores.

The fallback branch in _kmeans_select never records the selected clip's source video in used_sources, so a later cluster can legitimately pick a clip from the same source, violating the one-per-source-video invariant without any error or warning. Combined with the previously noted zero-norm division in _cosine_sim and the basename-collision bug in _build_output, there are multiple independent paths that produce a quietly incorrect benchmark or incorrect scores.

eval/video/build_benchmark_dataset.py (_kmeans_select fallback logic) and eval/video/caption_clipscore.py (_cosine_sim zero-norm guard)

Important Files Changed

Filename Overview
eval/init.py Empty init file; marks the eval/ directory as a Python package.
eval/video/init.py Empty init file; marks eval/video/ as a Python package.
eval/video/README.md Documentation for the caption quality evaluation workflow; covers dataset construction, regression workflow, and CLI usage examples.
eval/video/build_benchmark_dataset.py New script that builds a 200-clip benchmark via K-means on CosmosEmbed1 embeddings; the fallback path in _kmeans_select omits the selected clip's source from used_sources, allowing duplicate-source clips to appear in the benchmark.
eval/video/caption_clipscore.py New script that evaluates caption quality via LLM summarization + CosmosEmbed1 cosine similarity; empty captions produce meaningless scores silently, and _cosine_sim lacks a zero-norm guard (already flagged in a prior review outside the diff).

Sequence Diagram

sequenceDiagram
    participant U as User
    participant B as build_benchmark_dataset.py
    participant P as NeMo Curator Pipeline
    participant KM as KMeans (_kmeans_select)
    participant C as caption_clipscore.py
    participant LLM as vLLM Summarizer
    participant CE as CosmosEmbed1

    U->>B: --video-dir, --output-dir, --num-clusters
    B->>B: _sample_videos() → 3000 mp4 symlinks
    B->>P: _run_embedding_pipeline()
    P-->>B: "ce1_embd/*.pickle, clips/*.mp4, metas/v0/*.json"
    B->>KM: "_kmeans_select(emb_dir, meta_dir, K=200)"
    KM-->>B: selected[(uid, meta)] x 200
    B->>B: _build_output() → symlinks + selected_uids.txt
    U->>C: --embedding-dir, --caption-dirs, --uid-list
    C->>C: _load_uid_list() → common_uuids
    C->>C: _collect_tasks() → (uid, label, caption) tuples
    alt No cached summaries
        C->>LLM: _summarize_captions(tasks)
        LLM-->>C: summaries[]
    else --load-summaries
        C->>C: Load from JSON cache
    end
    C->>CE: _score_summaries() → get_text_embedding per summary
    CE-->>C: text embeddings
    C->>C: _cosine_sim(vid_emb, text_emb) per clip
    C-->>U: results.csv + per-model mean scores
Loading

Reviews (5): Last reviewed commit: "add CosmosEmbed1 variant descriptions" | Re-trigger Greptile

Comment on lines +208 to +227
# Symlink embedding
src = f"{emb_dir}/{uid}.pickle"
dst = f"{output_dir}/ce1_embd/{uid}.pickle"
if os.path.exists(src) and not os.path.exists(dst):
os.symlink(os.path.abspath(src), dst)

# Symlink clip
src = f"{clip_dir}/{uid}.mp4"
dst = f"{output_dir}/clips/{uid}.mp4"
if os.path.exists(src) and not os.path.exists(dst):
os.symlink(os.path.abspath(src), dst)

# Symlink source video
src_video = meta.get("source_video", "")
if src_video and os.path.exists(src_video):
vid_name = os.path.basename(src_video)
dst = f"{output_dir}/input/{vid_name}"
if vid_name not in input_videos_linked and not os.path.exists(dst):
os.symlink(os.path.abspath(src_video), dst)
input_videos_linked.add(vid_name)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 os.path.exists() returns False for broken symlinks, so if a prior run left a broken symlink at dst (e.g. the pipeline output was later cleaned up), the guard passes and os.symlink() immediately raises FileExistsError: [Errno 17] File exists. This affects all three symlink creation blocks (lines 211, 217, 225) and would crash a restart of the script. Replace os.path.exists(dst) with os.path.lexists(dst), which returns True for both valid and broken symlinks.

Suggested change
# Symlink embedding
src = f"{emb_dir}/{uid}.pickle"
dst = f"{output_dir}/ce1_embd/{uid}.pickle"
if os.path.exists(src) and not os.path.exists(dst):
os.symlink(os.path.abspath(src), dst)
# Symlink clip
src = f"{clip_dir}/{uid}.mp4"
dst = f"{output_dir}/clips/{uid}.mp4"
if os.path.exists(src) and not os.path.exists(dst):
os.symlink(os.path.abspath(src), dst)
# Symlink source video
src_video = meta.get("source_video", "")
if src_video and os.path.exists(src_video):
vid_name = os.path.basename(src_video)
dst = f"{output_dir}/input/{vid_name}"
if vid_name not in input_videos_linked and not os.path.exists(dst):
os.symlink(os.path.abspath(src_video), dst)
input_videos_linked.add(vid_name)
# Symlink embedding
src = f"{emb_dir}/{uid}.pickle"
dst = f"{output_dir}/ce1_embd/{uid}.pickle"
if os.path.exists(src) and not os.path.lexists(dst):
os.symlink(os.path.abspath(src), dst)
# Symlink clip
src = f"{clip_dir}/{uid}.mp4"
dst = f"{output_dir}/clips/{uid}.mp4"
if os.path.exists(src) and not os.path.lexists(dst):
os.symlink(os.path.abspath(src), dst)
# Symlink source video
src_video = meta.get("source_video", "")
if src_video and os.path.exists(src_video):
vid_name = os.path.basename(src_video)
dst = f"{output_dir}/input/{vid_name}"
if vid_name not in input_videos_linked and not os.path.lexists(dst):
os.symlink(os.path.abspath(src_video), dst)
input_videos_linked.add(vid_name)

Comment on lines +346 to +347
if not os.path.exists(dst):
os.symlink(os.path.abspath(src), dst)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 The same broken-symlink race is present in main(): if dst is a broken symlink from a prior run, os.path.exists(dst) returns False and os.symlink() throws FileExistsError. Use os.path.lexists(dst) here too.

Suggested change
if not os.path.exists(dst):
os.symlink(os.path.abspath(src), dst)
if not os.path.lexists(dst):
os.symlink(os.path.abspath(src), dst)

print(f"\nLoading cached summaries from: {load_summaries}")
with open(load_summaries) as f:
summary_cache = json.load(f)
summaries = [summary_cache.get(uid, {}).get(label, "") for uid, label, _ in tasks]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 When --load-summaries is used and a (uid, label) pair is absent from the cache (e.g. when scoring a new model against an old cache that only contains other models), the summary silently becomes "". That empty string is later passed to model.get_text_embedding(""), which produces an arbitrary or near-zero embedding, and the resulting cosine-similarity score is silently wrong. A warning when the summary is missing will surface this immediately rather than producing a quietly incorrect result.

Suggested change
summaries = [summary_cache.get(uid, {}).get(label, "") for uid, label, _ in tasks]
summaries = []
missing = 0
for uid, label, _ in tasks:
s = summary_cache.get(uid, {}).get(label, "")
if not s:
missing += 1
summaries.append(s)
if missing:
print(f" Warning: {missing} (uid, label) pairs have no cached summary and will produce invalid scores.")

Comment on lines +243 to +248
for i, (uid, label, _caption) in enumerate(tqdm(tasks, unit="cap")):
with open(f"{embedding_dir}/{uid}.pickle", "rb") as f:
arr = pickle.load(f) # noqa: S301
vid_emb = torch.from_numpy(arr).squeeze(0)
text_emb = model.get_text_embedding(summaries[i]).squeeze(0)
clip_scores.setdefault(uid, {})[label] = _cosine_sim(vid_emb, text_emb)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Redundant embedding loads per model label

The loop opens and unpickles {uid}.pickle once per task, meaning the same file is read n_models times for every clip (e.g. 3x for 3 models). With 200 clips and 3 models that is 600 reads instead of 200. Consider caching the video embedding per uid inside the loop to avoid the repeated I/O.

@weijiac0619 weijiac0619 changed the title eval caption quality evaluation May 13, 2026
@suiyoubi
Copy link
Copy Markdown
Contributor

is it possible to add caption_clipscore.py to the benchmark suites ? We can pre-cache the dataset, just need to make sure we have consistent clipscore for the same model

Comment thread tutorials/video/eval/build_benchmark_dataset.py Outdated
Comment thread tutorials/video/eval/build_benchmark_dataset.py Outdated
Comment thread tutorials/video/eval/build_benchmark_dataset.py Outdated
emb_dir = f"{pipeline_output_dir}/ce1_embd"
clip_dir = f"{pipeline_output_dir}/clips"

os.makedirs(f"{output_dir}/ce1_embd", exist_ok=True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to create dirs for these ? these should be created by the pipeline already ? I would lean towards asserting those exists (so we make sure the embedding pipeline works )

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script is run once to build the cached benchmark dataset, not part of nightly.
Pipeline output dirs are already asserted (lines 181-186). The os.makedirs create the benchmark output directory structure. these don't exist yet, so they need to be created.

Comment thread tutorials/video/eval/build_benchmark_dataset.py Outdated
Comment thread tutorials/video/eval/build_benchmark_dataset.py Outdated
Comment thread tutorials/video/eval/caption_clipscore.py Outdated
Weijia Chen added 2 commits May 27, 2026 09:31
Signed-off-by: Weijia Chen <weijiac@dgx-a100-01.aselab.nvidia.com>
Signed-off-by: Weijia Chen <weijiac@dgx-a100-01.aselab.nvidia.com>
Comment on lines +269 to +288
parser.add_argument(
"--split-duration",
type=float,
default=10.0,
help="Fixed-stride split duration in seconds (default: 10.0).",
)
parser.add_argument(
"--aesthetic-threshold",
type=float,
default=3.5,
help="Minimum aesthetic score for clip filtering (default: 3.5).",
)
parser.add_argument(
"--seed",
type=int,
default=42,
help="Random seed for sampling and K-means (default: 42).",
)
return parser.parse_args()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 KMeans crashes when pipeline produces fewer clips than num_clusters

np.stack(embeddings) raises ValueError: need at least one array to stack when the embedding directory is empty (e.g., every clip was filtered out by the aesthetic threshold), and KMeans(n_clusters=200).fit_predict(...) raises ValueError: n_samples=N should be >= n_clusters=200 when the aesthetic filter or small sample size yields fewer clips than the requested number of clusters. Both failures produce cryptic scikit-learn errors with no guidance on why they occurred or how to fix them. An explicit guard here would surface these clearly before the expensive K-means step.

Copy link
Copy Markdown
Contributor

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @weijiac0619 for working on this. I had suggestions around code organization.

I think we should move this outside tutorials to a new eval subfolder ! In that eval folder we should have a eval.video.caption and place this here.

The other main ask is if we can use a huggingface dataset here as input so that our customers and we have a opensource dataset we can use .

The other unknown is if we can have some expected baseline results or some guidance on what a good vs bad resulted and an ability to check that in an automated way.

Comment thread eval/video/build_benchmark_dataset.py
Signed-off-by: Weijia Chen <weijiac@dgx-a100-01.aselab.nvidia.com>
Comment on lines +303 to +319

# Step 2: Run embedding pipeline
logger.info("[Step 2/4] Running embedding pipeline ...")
pipeline_output = f"{args.output_dir}/_pipeline_output"
_run_embedding_pipeline(
input_dir=sample_input_dir,
output_dir=pipeline_output,
model_dir=args.model_dir,
split_duration=args.split_duration,
aesthetic_threshold=args.aesthetic_threshold,
)

# Step 3: K-means clustering and selection
logger.info("[Step 3/4] K-means clustering ...")
selected = _kmeans_select(
emb_dir=f"{pipeline_output}/ce1_embd",
meta_dir=f"{pipeline_output}/metas/v0",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Empty source_video string is treated as a shared source across all clusters

When meta.get("source_video", "") returns "" (e.g., a clip whose metadata file was not found or whose JSON doesn't have a source_video key), src = "". The very first cluster that picks such a clip adds "" to used_sources. Every subsequent cluster then sees "" in used_sources as True, so all other clips without a source path are treated as "already used" and get skipped. This can allow multiple clips from the same unlabeled source video to appear in the final benchmark, defeating the diversity constraint.

fix
Signed-off-by: Weijia Chen <weijiac@dgx-a100-01.aselab.nvidia.com>
@weijiac0619
Copy link
Copy Markdown
Contributor Author

is it possible to add caption_clipscore.py to the benchmark suites ? We can pre-cache the dataset, just need to make sure we have consistent clipscore for the same model

Good idea. Will add caption quality scoring as a standalone benchmark entry in a follow-up. plan to wire up the 200-clip dataset + captioning + scoring as a single entry, and validate the pass/fail thresholds on nightly.

@weijiac0619
Copy link
Copy Markdown
Contributor Author

@VibhuJawa thanks for the comments. Code moved to eval/video/. HF dataset upload and automated baseline validation will be follow-up.

Comment thread eval/video/README.md Outdated
Comment thread eval/video/README.md
Comment thread eval/video/README.md
Signed-off-by: Weijia Chen <weijiac@smc-522ga-0059.aselab.nvidia.com>
Signed-off-by: Weijia Chen <weijiac@nvidia.com>
Comment thread eval/video/README.md
determine whether it is a real quality regression.

Scores are deterministic across runs on the same machine when using
`enforce_eager=True` and `CUBLAS_WORKSPACE_CONFIG=:4096:8`.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ: Is the understanding that these result in same exact scores, but without it we get some variation but that variation is within 5% ?

@claude , Can you suggest better phrasing for Baseline Scores ?

Copy link
Copy Markdown
Contributor

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, just familiarizing myself with the PR and left minor comments if they are useful. This is awesome work, thank you!

uid_to_meta[uid] = json.load(fh)

logger.info(f"K-means K={num_clusters} ...")
kmeans = KMeans(n_clusters=num_clusters, random_state=seed, n_init=10, max_iter=300)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be configurable?

Suggested change
kmeans = KMeans(n_clusters=num_clusters, random_state=seed, n_init=10, max_iter=300)
kmeans = KMeans(n_clusters=num_clusters, random_state=seed, n_init=n_init, max_iter=max_iter)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sarahyurick , These are scikit-learn's defaults (n_init=10, max_iter=300), which means this is equivalent to KMeans(n_clusters=num_clusters, random_state=seed) without specifying them. They're explicit in the code just so readers know the values at a glance.

"""Encode summaries with CosmosEmbed1 and compute per-clip scores."""
logger.info(f"Loading CosmosEmbed1-{variant} from {cosmos_model_dir} ...")
model = CosmosEmbed1(variant=variant, utils_only=False, model_dir=cosmos_model_dir)
model.setup()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be worth doing any of this in a Curator pipeline or no?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a standalone evaluation tool instead of a data processing step. i'm worried that wrapping it in a Curator pipeline would add overhead

Comment on lines +59 to +66
_REPO_ROOT = Path(__file__).parent.parent.parent
sys.path.insert(0, str(_REPO_ROOT / "tutorials" / "video" / "getting-started"))

from video_split_clip_example import ( # noqa: E402
create_video_splitting_argparser,
create_video_splitting_pipeline,
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@suiyoubi / @weijiac0619, Should we move/copy this code to eval or some other place ?

Like it feels generally useful for other evals will use this and benchmarks use this too.

Feels like an awkward place to pull getting started code for these functions ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree but video_split_clip_example.py is the existing pipeline entry point. what do you think about folder structure? @suiyoubi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants