caption quality evaluation#1980
Conversation
Greptile SummaryThis PR introduces a new
Confidence Score: 3/5The two new scripts work end-to-end for the happy path but have correctness gaps that can silently produce wrong benchmark composition and wrong scores. The fallback branch in _kmeans_select never records the selected clip's source video in used_sources, so a later cluster can legitimately pick a clip from the same source, violating the one-per-source-video invariant without any error or warning. Combined with the previously noted zero-norm division in _cosine_sim and the basename-collision bug in _build_output, there are multiple independent paths that produce a quietly incorrect benchmark or incorrect scores. eval/video/build_benchmark_dataset.py (_kmeans_select fallback logic) and eval/video/caption_clipscore.py (_cosine_sim zero-norm guard) Important Files Changed
Sequence DiagramsequenceDiagram
participant U as User
participant B as build_benchmark_dataset.py
participant P as NeMo Curator Pipeline
participant KM as KMeans (_kmeans_select)
participant C as caption_clipscore.py
participant LLM as vLLM Summarizer
participant CE as CosmosEmbed1
U->>B: --video-dir, --output-dir, --num-clusters
B->>B: _sample_videos() → 3000 mp4 symlinks
B->>P: _run_embedding_pipeline()
P-->>B: "ce1_embd/*.pickle, clips/*.mp4, metas/v0/*.json"
B->>KM: "_kmeans_select(emb_dir, meta_dir, K=200)"
KM-->>B: selected[(uid, meta)] x 200
B->>B: _build_output() → symlinks + selected_uids.txt
U->>C: --embedding-dir, --caption-dirs, --uid-list
C->>C: _load_uid_list() → common_uuids
C->>C: _collect_tasks() → (uid, label, caption) tuples
alt No cached summaries
C->>LLM: _summarize_captions(tasks)
LLM-->>C: summaries[]
else --load-summaries
C->>C: Load from JSON cache
end
C->>CE: _score_summaries() → get_text_embedding per summary
CE-->>C: text embeddings
C->>C: _cosine_sim(vid_emb, text_emb) per clip
C-->>U: results.csv + per-model mean scores
Reviews (5): Last reviewed commit: "add CosmosEmbed1 variant descriptions" | Re-trigger Greptile |
| # Symlink embedding | ||
| src = f"{emb_dir}/{uid}.pickle" | ||
| dst = f"{output_dir}/ce1_embd/{uid}.pickle" | ||
| if os.path.exists(src) and not os.path.exists(dst): | ||
| os.symlink(os.path.abspath(src), dst) | ||
|
|
||
| # Symlink clip | ||
| src = f"{clip_dir}/{uid}.mp4" | ||
| dst = f"{output_dir}/clips/{uid}.mp4" | ||
| if os.path.exists(src) and not os.path.exists(dst): | ||
| os.symlink(os.path.abspath(src), dst) | ||
|
|
||
| # Symlink source video | ||
| src_video = meta.get("source_video", "") | ||
| if src_video and os.path.exists(src_video): | ||
| vid_name = os.path.basename(src_video) | ||
| dst = f"{output_dir}/input/{vid_name}" | ||
| if vid_name not in input_videos_linked and not os.path.exists(dst): | ||
| os.symlink(os.path.abspath(src_video), dst) | ||
| input_videos_linked.add(vid_name) |
There was a problem hiding this comment.
os.path.exists() returns False for broken symlinks, so if a prior run left a broken symlink at dst (e.g. the pipeline output was later cleaned up), the guard passes and os.symlink() immediately raises FileExistsError: [Errno 17] File exists. This affects all three symlink creation blocks (lines 211, 217, 225) and would crash a restart of the script. Replace os.path.exists(dst) with os.path.lexists(dst), which returns True for both valid and broken symlinks.
| # Symlink embedding | |
| src = f"{emb_dir}/{uid}.pickle" | |
| dst = f"{output_dir}/ce1_embd/{uid}.pickle" | |
| if os.path.exists(src) and not os.path.exists(dst): | |
| os.symlink(os.path.abspath(src), dst) | |
| # Symlink clip | |
| src = f"{clip_dir}/{uid}.mp4" | |
| dst = f"{output_dir}/clips/{uid}.mp4" | |
| if os.path.exists(src) and not os.path.exists(dst): | |
| os.symlink(os.path.abspath(src), dst) | |
| # Symlink source video | |
| src_video = meta.get("source_video", "") | |
| if src_video and os.path.exists(src_video): | |
| vid_name = os.path.basename(src_video) | |
| dst = f"{output_dir}/input/{vid_name}" | |
| if vid_name not in input_videos_linked and not os.path.exists(dst): | |
| os.symlink(os.path.abspath(src_video), dst) | |
| input_videos_linked.add(vid_name) | |
| # Symlink embedding | |
| src = f"{emb_dir}/{uid}.pickle" | |
| dst = f"{output_dir}/ce1_embd/{uid}.pickle" | |
| if os.path.exists(src) and not os.path.lexists(dst): | |
| os.symlink(os.path.abspath(src), dst) | |
| # Symlink clip | |
| src = f"{clip_dir}/{uid}.mp4" | |
| dst = f"{output_dir}/clips/{uid}.mp4" | |
| if os.path.exists(src) and not os.path.lexists(dst): | |
| os.symlink(os.path.abspath(src), dst) | |
| # Symlink source video | |
| src_video = meta.get("source_video", "") | |
| if src_video and os.path.exists(src_video): | |
| vid_name = os.path.basename(src_video) | |
| dst = f"{output_dir}/input/{vid_name}" | |
| if vid_name not in input_videos_linked and not os.path.lexists(dst): | |
| os.symlink(os.path.abspath(src_video), dst) | |
| input_videos_linked.add(vid_name) |
| if not os.path.exists(dst): | ||
| os.symlink(os.path.abspath(src), dst) |
There was a problem hiding this comment.
The same broken-symlink race is present in
main(): if dst is a broken symlink from a prior run, os.path.exists(dst) returns False and os.symlink() throws FileExistsError. Use os.path.lexists(dst) here too.
| if not os.path.exists(dst): | |
| os.symlink(os.path.abspath(src), dst) | |
| if not os.path.lexists(dst): | |
| os.symlink(os.path.abspath(src), dst) |
| print(f"\nLoading cached summaries from: {load_summaries}") | ||
| with open(load_summaries) as f: | ||
| summary_cache = json.load(f) | ||
| summaries = [summary_cache.get(uid, {}).get(label, "") for uid, label, _ in tasks] |
There was a problem hiding this comment.
When
--load-summaries is used and a (uid, label) pair is absent from the cache (e.g. when scoring a new model against an old cache that only contains other models), the summary silently becomes "". That empty string is later passed to model.get_text_embedding(""), which produces an arbitrary or near-zero embedding, and the resulting cosine-similarity score is silently wrong. A warning when the summary is missing will surface this immediately rather than producing a quietly incorrect result.
| summaries = [summary_cache.get(uid, {}).get(label, "") for uid, label, _ in tasks] | |
| summaries = [] | |
| missing = 0 | |
| for uid, label, _ in tasks: | |
| s = summary_cache.get(uid, {}).get(label, "") | |
| if not s: | |
| missing += 1 | |
| summaries.append(s) | |
| if missing: | |
| print(f" Warning: {missing} (uid, label) pairs have no cached summary and will produce invalid scores.") |
| for i, (uid, label, _caption) in enumerate(tqdm(tasks, unit="cap")): | ||
| with open(f"{embedding_dir}/{uid}.pickle", "rb") as f: | ||
| arr = pickle.load(f) # noqa: S301 | ||
| vid_emb = torch.from_numpy(arr).squeeze(0) | ||
| text_emb = model.get_text_embedding(summaries[i]).squeeze(0) | ||
| clip_scores.setdefault(uid, {})[label] = _cosine_sim(vid_emb, text_emb) |
There was a problem hiding this comment.
Redundant embedding loads per model label
The loop opens and unpickles {uid}.pickle once per task, meaning the same file is read n_models times for every clip (e.g. 3x for 3 models). With 200 clips and 3 models that is 600 reads instead of 200. Consider caching the video embedding per uid inside the loop to avoid the repeated I/O.
|
is it possible to add caption_clipscore.py to the benchmark suites ? We can pre-cache the dataset, just need to make sure we have consistent clipscore for the same model |
| emb_dir = f"{pipeline_output_dir}/ce1_embd" | ||
| clip_dir = f"{pipeline_output_dir}/clips" | ||
|
|
||
| os.makedirs(f"{output_dir}/ce1_embd", exist_ok=True) |
There was a problem hiding this comment.
do we need to create dirs for these ? these should be created by the pipeline already ? I would lean towards asserting those exists (so we make sure the embedding pipeline works )
There was a problem hiding this comment.
This script is run once to build the cached benchmark dataset, not part of nightly.
Pipeline output dirs are already asserted (lines 181-186). The os.makedirs create the benchmark output directory structure. these don't exist yet, so they need to be created.
763910d to
7671871
Compare
| parser.add_argument( | ||
| "--split-duration", | ||
| type=float, | ||
| default=10.0, | ||
| help="Fixed-stride split duration in seconds (default: 10.0).", | ||
| ) | ||
| parser.add_argument( | ||
| "--aesthetic-threshold", | ||
| type=float, | ||
| default=3.5, | ||
| help="Minimum aesthetic score for clip filtering (default: 3.5).", | ||
| ) | ||
| parser.add_argument( | ||
| "--seed", | ||
| type=int, | ||
| default=42, | ||
| help="Random seed for sampling and K-means (default: 42).", | ||
| ) | ||
| return parser.parse_args() | ||
|
|
There was a problem hiding this comment.
KMeans crashes when pipeline produces fewer clips than
num_clusters
np.stack(embeddings) raises ValueError: need at least one array to stack when the embedding directory is empty (e.g., every clip was filtered out by the aesthetic threshold), and KMeans(n_clusters=200).fit_predict(...) raises ValueError: n_samples=N should be >= n_clusters=200 when the aesthetic filter or small sample size yields fewer clips than the requested number of clusters. Both failures produce cryptic scikit-learn errors with no guidance on why they occurred or how to fix them. An explicit guard here would surface these clearly before the expensive K-means step.
VibhuJawa
left a comment
There was a problem hiding this comment.
Thanks again @weijiac0619 for working on this. I had suggestions around code organization.
I think we should move this outside tutorials to a new eval subfolder ! In that eval folder we should have a eval.video.caption and place this here.
The other main ask is if we can use a huggingface dataset here as input so that our customers and we have a opensource dataset we can use .
The other unknown is if we can have some expected baseline results or some guidance on what a good vs bad resulted and an ability to check that in an automated way.
|
|
||
| # Step 2: Run embedding pipeline | ||
| logger.info("[Step 2/4] Running embedding pipeline ...") | ||
| pipeline_output = f"{args.output_dir}/_pipeline_output" | ||
| _run_embedding_pipeline( | ||
| input_dir=sample_input_dir, | ||
| output_dir=pipeline_output, | ||
| model_dir=args.model_dir, | ||
| split_duration=args.split_duration, | ||
| aesthetic_threshold=args.aesthetic_threshold, | ||
| ) | ||
|
|
||
| # Step 3: K-means clustering and selection | ||
| logger.info("[Step 3/4] K-means clustering ...") | ||
| selected = _kmeans_select( | ||
| emb_dir=f"{pipeline_output}/ce1_embd", | ||
| meta_dir=f"{pipeline_output}/metas/v0", |
There was a problem hiding this comment.
Empty
source_video string is treated as a shared source across all clusters
When meta.get("source_video", "") returns "" (e.g., a clip whose metadata file was not found or whose JSON doesn't have a source_video key), src = "". The very first cluster that picks such a clip adds "" to used_sources. Every subsequent cluster then sees "" in used_sources as True, so all other clips without a source path are treated as "already used" and get skipped. This can allow multiple clips from the same unlabeled source video to appear in the final benchmark, defeating the diversity constraint.
Good idea. Will add caption quality scoring as a standalone benchmark entry in a follow-up. plan to wire up the 200-clip dataset + captioning + scoring as a single entry, and validate the pass/fail thresholds on nightly. |
|
@VibhuJawa thanks for the comments. Code moved to |
Signed-off-by: Weijia Chen <weijiac@smc-522ga-0059.aselab.nvidia.com>
Signed-off-by: Weijia Chen <weijiac@nvidia.com>
| determine whether it is a real quality regression. | ||
|
|
||
| Scores are deterministic across runs on the same machine when using | ||
| `enforce_eager=True` and `CUBLAS_WORKSPACE_CONFIG=:4096:8`. |
There was a problem hiding this comment.
QQ: Is the understanding that these result in same exact scores, but without it we get some variation but that variation is within 5% ?
@claude , Can you suggest better phrasing for Baseline Scores ?
sarahyurick
left a comment
There was a problem hiding this comment.
Hi, just familiarizing myself with the PR and left minor comments if they are useful. This is awesome work, thank you!
| uid_to_meta[uid] = json.load(fh) | ||
|
|
||
| logger.info(f"K-means K={num_clusters} ...") | ||
| kmeans = KMeans(n_clusters=num_clusters, random_state=seed, n_init=10, max_iter=300) |
There was a problem hiding this comment.
Should these be configurable?
| kmeans = KMeans(n_clusters=num_clusters, random_state=seed, n_init=10, max_iter=300) | |
| kmeans = KMeans(n_clusters=num_clusters, random_state=seed, n_init=n_init, max_iter=max_iter) |
There was a problem hiding this comment.
Hi @sarahyurick , These are scikit-learn's defaults (n_init=10, max_iter=300), which means this is equivalent to KMeans(n_clusters=num_clusters, random_state=seed) without specifying them. They're explicit in the code just so readers know the values at a glance.
| """Encode summaries with CosmosEmbed1 and compute per-clip scores.""" | ||
| logger.info(f"Loading CosmosEmbed1-{variant} from {cosmos_model_dir} ...") | ||
| model = CosmosEmbed1(variant=variant, utils_only=False, model_dir=cosmos_model_dir) | ||
| model.setup() |
There was a problem hiding this comment.
Could it be worth doing any of this in a Curator pipeline or no?
There was a problem hiding this comment.
This is a standalone evaluation tool instead of a data processing step. i'm worried that wrapping it in a Curator pipeline would add overhead
| _REPO_ROOT = Path(__file__).parent.parent.parent | ||
| sys.path.insert(0, str(_REPO_ROOT / "tutorials" / "video" / "getting-started")) | ||
|
|
||
| from video_split_clip_example import ( # noqa: E402 | ||
| create_video_splitting_argparser, | ||
| create_video_splitting_pipeline, | ||
| ) | ||
|
|
There was a problem hiding this comment.
@suiyoubi / @weijiac0619, Should we move/copy this code to eval or some other place ?
Like it feels generally useful for other evals will use this and benchmarks use this too.
Feels like an awkward place to pull getting started code for these functions ?
There was a problem hiding this comment.
agree but video_split_clip_example.py is the existing pipeline entry point. what do you think about folder structure? @suiyoubi
Description
doc link
Usage
# Add snippet demonstrating usageChecklist