A CLIP-based toolkit for embedding image folders and generating compact pairwise distance matrices for retrieval and evaluation.
- 🔍 CLIP-based image embedding (any OpenCLIP model, default is Apple's DFN5B-CLIP-ViT-H-14-384)
- ⚡ GPU-accelerated batch inference
- 📦 Compact flattened pairwise distance arrays (upper-triangular matrix, float32/float16 storage, top-k neighbors)
- 🔒 Privacy-preserving series label anonymization helper
- 📊 Mean Average Precision (mAP) computation from either flattened distances or stored top-k neighbors
```bash
make run \
  INPUT_DIR=/path/to/images \
  OUTPUT_DIR=/path/to/output \
  MODEL=hf-hub:apple/DFN5B-CLIP-ViT-H-14-384 \
  BATCH_SIZE=16 \
  DEVICE=cuda \
  ANONYMIZE_LABELS=/path/to/labels.json \
  PAIRWISE_DTYPE=float16
```

Using the Makefile (auto-creates a venv and installs dependencies):
```bash
make install
source .venv/bin/activate
```

Running the CLI:
```bash
python -m clip_image_similarity.cli \
  --input-dir /path/to/images \
  --output-dir /path/to/output \
  --model hf-hub:apple/DFN5B-CLIP-ViT-H-14-384 \
  --batch-size 16 \
  --device cuda
```

Or use the Makefile wrapper (installs and activates the venv automatically):

```bash
make run INPUT_DIR=/path/to/images OUTPUT_DIR=/path/to/output
```

| Parameter | Required | Default | Description |
|---|---|---|---|
| `--input-dir`, `-i` | ✅ | – | Root directory containing images to process. |
| `--output-dir`, `-o` | ✅ | – | Directory where results will be written. |
| `--model`, `-m` | ❌ | `hf-hub:apple/DFN5B-CLIP-ViT-H-14-384` | Hugging Face Hub model ID for OpenCLIP. |
| `--batch-size`, `-b` | ❌ | 32 | Batch size for embedding computation. |
| `--device`, `-d` | ❌ | Auto (CUDA if available) | Device to run on (e.g., `cuda`, `cuda:0`, `cpu`). |
| `--pairwise-dtype` | ❌ | `float32` | Numeric precision for storing distances (`float32` or `float16`). |
| `--top-k` | ❌ | None | Save top-k neighbors per image instead of full flattened distances. |
| `--anonymize-labels` | ❌ | None | Path to labels JSON (series → image paths); converts to series → indices. |
| `--image-exts` | ❌ | Common formats | Comma-separated list of image extensions (e.g., `jpg,png,jpeg`). |
| `--overwrite` | ❌ | false | Allow overwriting existing output files. |
| File | Description |
|---|---|
| `evaluation_results/pairwise_distances.npz` | Flattened upper-triangular distances (`1 - cosine_similarity`); dtype `float32` (default) or `float16` via `--pairwise-dtype`, saved with dtype metadata. |
| `evaluation_results/pairwise_topk.npz` | Emitted when `--top-k` is set; contains per-image neighbor indices/distances plus stored `top_k`, dtype, and index-dtype metadata. |
| `image_paths.json` | Ordered list of image paths corresponding to indices in the flattened array. **Do not share if filenames are sensitive.** |
| `series_to_indices.json` | Optional; written only when `--anonymize-labels` is provided. Maps series → list of indices for downstream mAP while keeping paths private. |
| `config.json` | Snapshot of the run configuration. |
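To consume the flattened array downstream, you need the mapping from a pair `(i, j)` to its position in the condensed layout. The sketch below assumes row-major upper-triangular order (the `np.triu_indices(n, k=1)` / SciPy `squareform` convention) and uses a synthetic matrix in place of the real `.npz`; verify key names in your output with `np.load(path).files`:

```python
import numpy as np

def condensed_index(i: int, j: int, n: int) -> int:
    """Position of pair (i, j), i != j, in a flattened upper-triangular
    array of n items (diagonal excluded), in np.triu_indices(n, k=1) order."""
    if i > j:
        i, j = j, i
    return i * n - i * (i + 1) // 2 + (j - i - 1)

# Synthetic stand-in for evaluation_results/pairwise_distances.npz.
n = 5
rng = np.random.default_rng(0)
full = rng.random((n, n)).astype(np.float32)
full = (full + full.T) / 2                    # symmetric distances
np.fill_diagonal(full, 0.0)
flat = full[np.triu_indices(n, k=1)]          # the stored condensed form

# Recover the distance between images 1 and 3 from the flattened array.
assert np.isclose(flat[condensed_index(1, 3, n)], full[1, 3])
```

The condensed form stores `n*(n-1)//2` values instead of `n*n`, which is where the storage savings come from.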
If you ran the CLI without `--anonymize-labels` but later want to generate `series_to_indices.json`, you can use the standalone script:

```bash
make anonymize-labels OUTPUT_DIR=./results LABELS=./path/to/labels.json
```

Or run directly:

```bash
python -m clip_image_similarity.generate_anonymous_labels \
  --output-dir ./results \
  --labels ./path/to/labels.json \
  --overwrite  # optional: overwrite an existing series_to_indices.json
```

This reads `image_paths.json` from the output directory and generates `series_to_indices.json` using your provided labels file.
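Conceptually, the conversion is a path-to-index lookup against the ordered path list. A minimal sketch (not the script's actual code; the function name is illustrative):

```python
def series_to_indices(image_paths, labels):
    """Map series -> sorted positions of its images in the ordered path list."""
    position = {path: i for i, path in enumerate(image_paths)}
    return {
        series: sorted(position[p] for p in paths if p in position)
        for series, paths in labels.items()
    }

paths = ["a/1.jpg", "a/2.jpg", "b/1.jpg"]
labels = {"series_A": ["a/2.jpg", "a/1.jpg"], "series_B": ["b/1.jpg"]}
print(series_to_indices(paths, labels))
# {'series_A': [0, 1], 'series_B': [2]}
```

The resulting file contains only integer indices, so it can be shared without exposing filenames.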
After generating results, compute Mean Average Precision from the flattened distances and series indices:
```bash
python -m metrics.map \
  --distances ./results/evaluation_results/pairwise_distances.npz \
  --series-indices ./results/series_to_indices.json \
  --output_csv ./results/metrics/map.csv
```

If you saved top-k neighbors instead of the full flattened distances:
```bash
python -m metrics.map \
  --topk ./results/evaluation_results/pairwise_topk.npz \
  --series-indices ./results/series_to_indices.json \
  --output_csv ./results/metrics/map.csv
```

If you need to derive the indices locally from labels and paths instead, pass `--labels` and `--image-paths` (using the saved `image_paths.json`) to `metrics/map.py`, but be aware that sharing paths reveals filenames:
```bash
python -m metrics.map \
  --distances ./results/evaluation_results/pairwise_distances.npz \
  --labels ./resources/labels/images_series_labels.json \
  --image-paths ./results/image_paths.json \
  --output_csv ./results/metrics/map.csv
```

Start with a small batch size (16 or 32) and increase it gradually while monitoring GPU memory usage. For reference, a batch size of 256 reaches ~81% VRAM utilization on an RTX 5090 (32 GB) when processing 30K images.
Use `--pairwise-dtype float16` to reduce storage size by approximately 50% with negligible impact on retrieval accuracy. The default `float32` provides higher precision but produces larger output files.
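The flattened array holds n(n-1)/2 values, so its raw size is easy to estimate. A quick back-of-the-envelope helper (the function name is illustrative):

```python
def flattened_storage_bytes(n_images: int, bytes_per_value: int) -> int:
    """Raw size of the flattened upper-triangular distance array."""
    return n_images * (n_images - 1) // 2 * bytes_per_value

n = 30_000
print(f"float32: {flattened_storage_bytes(n, 4) / 1e9:.2f} GB")  # → 1.80 GB
print(f"float16: {flattened_storage_bytes(n, 2) / 1e9:.2f} GB")  # → 0.90 GB
```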
When working with large datasets, consider using `--top-k` to save only the k nearest neighbors per image instead of the full distance matrix; this significantly reduces storage requirements when k << the total number of images.
Important: If you plan to compute mAP later, ensure k is at least as large as the size of the largest series in your labels. Otherwise, some relevant images may be excluded from the evaluation.
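A small helper for choosing k from your labels file, following the rule above (the function name is illustrative, not part of the toolkit):

```python
def minimal_top_k(labels: dict) -> int:
    """Conservative lower bound for --top-k: the size of the largest series,
    per the advice above (k >= largest series keeps all relevant neighbors)."""
    return max(len(paths) for paths in labels.values())

labels = {"series_A": ["a/1.jpg", "a/2.jpg", "a/3.jpg"], "series_B": ["b/1.jpg"]}
print(minimal_top_k(labels))  # → 3
```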
Performance benchmarks are available in BENCHMARK.md, including detailed timing breakdowns, resource usage, and throughput metrics.